Using simulations to show ROI for predictive models in Python

Two resources I have been consuming lately I would highly recommend:

Keith’s perspective is nearly a 100% match to my experiences, e.g. you should aim for projects that have around $1 million in expected revenue to justify a data science person/team, up front estimates should be on the low end, and the easiest projects are ones you can formulate as micro-decisions, where you use a model to improve those binary decisions. How to Measure Anything fits right into this as well: Hubbard basically says to get a prior distribution on expected outcomes, and then do simulations to see possible outcomes.

Here I am going to show an example that is very close to several of the projects I have done to show the potential increase in revenue from taking a model based approach using simulations in python.

Background

The point in the data science project I am illustrating is where you have already decided to do an initial pilot model, and you have historical cases along with predicted probabilities from your model. Here I am thinking of the case of auditing some type of transaction (it can be whatever you want: tax returns, bank transactions, insurance claims, etc.). I am going to simulate some fake data to illustrate the later ROI estimates, but in real life you would use your own data for the business.

Here the variables I simulate are:

  • 5000 transactions, total_cases
  • a model based predicted probability, prob
  • a dollar value for the transaction, dollar
  • a historical marker whether a transaction was audited, audit
  • a historical marker whether the transaction was bad, hit

To be clear, this would be data you would normally already have for your business use case (e.g. historical transactions). To just illustrate my point I am making 100% fake data for everyone to follow along.

####################################
# Simulating data, probabilities
# and money values

from scipy.stats import norm
from scipy.stats import binom
from scipy.stats import beta
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

np.random.seed(10)
total_cases = 5000

# Beta(1,5), to generate the probs
prob = beta.rvs(1, 5, size=total_cases)

# Lognormal for the dollar values, clipped
dollar = np.exp(norm.rvs(7,2,size=total_cases)).clip(500,25000)

# Historical auditing process, all cases over 15000
audit = (dollar > 15000)*1

# Out of these, random 10% are hits
hit = binom.rvs(1, 0.10, size=total_cases)

# Putting into a dataframe
cases = pd.concat([pd.Series(dollar),pd.Series(audit),
                   pd.Series(prob), pd.Series(hit)], 
                   axis=1)
cases.columns = ['value','audit','prob', 'hit']
cases['revenue'] = cases['hit']*cases['value']*cases['audit']

cases['revenue'].sum() # about 1.1 million

cases.head()
####################################

These are all simulated from various probability distributions to look somewhat like real data. Probabilities and dollar values are right skewed. They are independent here, but it is ok if in your real data they are not.

Here I pretend the historical audit selection process is to automatically audit all large transactions, over $15k. And these historical audits have a 10% probability of finding a hit (think of it as fraud if you want). So the question is, given our model estimates prob, how much more money do we think we can make if we use these model based decisions as opposed to the simple threshold that is the current process?

Revenue Simulations

So here for my revenue simulations, what I am going to do is pretend I can audit the same number of cases (471), based on my model estimates, audit_total.

audit_total = audit.sum() #pretend we get to audit the same
                          #number of cases
cases['model_expected'] = cases['prob']*cases['value']
cases['model_rank'] = cases['model_expected'].rank(method='first', ascending=False)
# rank 1 is the highest expected value, so keep ranks up to audit_total
cases['model_audit'] = 1*(cases['model_rank'] <= audit_total)

# Expected revenue from our model based approach
(cases['model_audit']*cases['model_expected']).sum()
# About 1.3 million

So if our model is well calibrated, we can take those predicted probabilities and estimate what we think should happen if we used our model to audit 471 cases. Here we think we would make around $1.3 million, a lift of over $200k compared to the current process.
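Since that estimate leans on calibration, here is a rough sketch of what a binned calibration check could look like. This is not part of the analysis above. It assumes you have observed outcomes for cases the model scored, and to keep it self-contained I simulate the outcomes directly from the probabilities (so it is calibrated by construction). The names `outcome` and `bin_id` are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5000
prob = rng.beta(1, 5, size=n)

# Assumption: outcomes are drawn from the predicted probabilities,
# so this fake model is well calibrated by construction
outcome = rng.binomial(1, prob)

# Ten equal-width probability bins: [0,0.1), [0.1,0.2), ...
edges = np.linspace(0, 1, 11)
bin_id = np.digitize(prob, edges[1:-1])  # values 0 through 9

# Compare mean predicted probability to the observed hit rate per bin
for b in range(10):
    mask = bin_id == b
    if mask.sum() > 0:
        print(b, mask.sum(),
              round(prob[mask].mean(), 3),
              round(outcome[mask].mean(), 3))
```

With real data, large gaps between the predicted and observed columns would suggest the expected revenue figure should be treated with more caution.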

But, these models are probabilistic estimates. So I like to use simulations to hedge a bit when I am presenting to the business. Here I do 5000 simulations where I select my 471 cases, use a binomial random number generator to flip the coin whether the case results in a hit or not, and then calculate the total revenue.

# Simulating binomial process, seeing what the revenue is
cases_audit = cases[cases['model_audit'] == 1].copy()
rev_sim = [] #doing 5000 simulations
for i in range(5000):
    hit_sim = binom.rvs(1, cases_audit['prob'])
    sim_outs = hit_sim * cases_audit['value']
    rev_sim.append( (sim_outs.sum(), hit_sim.mean()) )

rev_sim = pd.DataFrame(rev_sim, columns=['RevSim','HitRateSim'])

We can then turn this into a nice graph of simulated potential outcomes. In our model approach, on average we would expect to make $1.3 million (versus the actual revenue of $1.1 million), but we have variance around that estimate:

# making a nice graph
actual_rev = cases['revenue'].sum()/1000000
ax = (rev_sim['RevSim']/1000000).hist(bins=100, alpha=0.8, color='grey')
ax.grid(False)
ax.axvline(actual_rev, color='r', linewidth=3)
ax.set_xlabel('Audit Revenue in $1,000,000')
plt.text(actual_rev + 0.008, 150, 'Actual Revenue', color='r')
plt.title('Simulated Revenue when using Model')
plt.show()

So you can see on a very few occasions we make less than the revenue under the current strategy of auditing all large cases. But in just as many circumstances we are making over $400k in additional profit.

You may ask why 5000 simulations instead of more or less? Well these are small enough I can easily do them quickly, so I could up the simulations to a higher value if I wanted. Long story short, if you look at the histogram of outcomes and it is still quite bumpy, you should probably do more simulations. Here 5000 is plenty, although 1000 was clearly more bumpy.
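One rough numeric complement to eyeballing the histogram is the Monte Carlo standard error of the simulated mean, which shrinks as you add simulations. The sketch below uses its own fake cases (not the data frame above), and `simulate_revenue` and `se_by_n` are names I made up for the example.

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical stand-in for the audited cases (not the data above)
n_cases = 471
prob = rng.beta(1, 5, size=n_cases)
value = np.exp(rng.normal(7, 2, size=n_cases)).clip(500, 25000)

def simulate_revenue(n_sims):
    """Total revenue for each of n_sims simulated rounds of audits."""
    # rows are simulations, columns are audited cases
    hits = rng.binomial(1, prob, size=(n_sims, n_cases))
    return (hits * value).sum(axis=1)

se_by_n = {}
for n_sims in [100, 1000, 5000]:
    rev = simulate_revenue(n_sims)
    # Monte Carlo standard error of the simulated mean revenue
    se_by_n[n_sims] = rev.std(ddof=1) / np.sqrt(n_sims)
    print(n_sims, round(rev.mean()), round(se_by_n[n_sims]))
```

The standard error only speaks to the mean. If you are reporting tail quantiles, you will likely want more simulations than the mean alone would suggest.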

If you don’t want to present the histogram, or have more complicated scenarios and prefer a table laying those scenarios out, you can pull out simulated confidence intervals of the additional revenue outcomes:

# If you want to put a confidence interval on it
# Per 1000 dollars
diff = (rev_sim['RevSim'] - cases['revenue'].sum())/1000
diff.describe()

# 95% confidence interval
diff.quantile([0.025,0.975])

One of the benefits of having a model, even if the revenue is not increased, is that you can generate estimates for other types of interventions. In the auditing case, you can potentially justify more auditors (e.g. we can hire more people to investigate 400 more cases and still expect to make a profit). (Here I have a related criminal justice example for bail decisions.) Or you can apply the models as a potential sales pitch to a new client, e.g. if you hire us to do these audits, given your data and our model, we think we can make $X.
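As a sketch of the "more auditors" argument, you can rank cases by expected value and look at the expected revenue from the next 400 cases beyond the current 471. The data here are freshly simulated stand-ins, and `cost_per_audit` is a purely hypothetical number you would replace with your own labor costs.

```python
import numpy as np

rng = np.random.default_rng(10)
total_cases = 5000
prob = rng.beta(1, 5, size=total_cases)
value = np.exp(rng.normal(7, 2, size=total_cases)).clip(500, 25000)

# Expected value per case, sorted so the best cases come first
expected = np.sort(prob * value)[::-1]
cum_expected = expected.cumsum()

current_audits = 471   # number of audits in the example above
extra_audits = 400     # additional cases new auditors could take on
cost_per_audit = 500   # hypothetical labor cost per audit

# Expected revenue from only the additional 400 cases
extra_revenue = (cum_expected[current_audits + extra_audits - 1]
                 - cum_expected[current_audits - 1])
extra_cost = extra_audits * cost_per_audit
print(round(extra_revenue), extra_cost, round(extra_revenue - extra_cost))
```

Because cases are sorted by expected value, each additional audit is worth less than the last, so there is a point where hiring more auditors stops paying for itself.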

Model based approaches also allow you to meet more constraints, such as increasing the hit rate, or meeting fairness constraints. Here in this simulation if we use a model based approach, the hit rate goes up to around 15% as opposed to 10%. Which may be worth it for your investigators or clients depending on the situation.

Recent Papers on Hot Spots of Crime in Dallas

So I have two different papers that were published recently. Both are on hot spots in Dallas, so might as well discuss them together.

For each I have posted the code to replicate the results (and that spreadsheet has links to preprints as well).

For each, as a bit of background on the motivation for the projects: Dallas has had official hot spots, named TAAG (Target Area Action Grid). These were clearly larger than what would be considered best practice in identifying hot spots (they were more like entire neighborhoods). I realize ‘best practices’ is a bit wishy-washy, but the TAAG areas ended up covering around 20% of the city (a smidge over 65 square miles). Here is a map of the 2017 areas. There were 54 TAAG areas, so on average each is a little over 1 square mile.

Additionally I knew the Dallas police department was interested in purchasing the RTM software to do hot spots. And a separate group, the Dallas Crime Task Force was interested in using the software as well for non-police related interventions.

So I did these projects on my own (with my colleagues Wouter and Sydney of course). It wasn’t paid work for any of these groups (I asked DPD if they were interested, and had shared my results with folks from CPAL before that task force report came out, but nothing much came of it unfortunately). But my results for Dallas data are very likely to generalize to other places, so hopefully they will be helpful to others.

Machine Learning to Predict and Understand Hot Spots

So I see the appeal for folks who want to use RTM. It is well validated in both theory and practice, and Joel has made a nice software as a service app. But I knew going in that I could likely improve upon the predictions compared to RTM.

RTM tries to find a middle ground between prediction and causality (which isn’t a critique, it is sort of what we are all doing). RTM in the end spits out predictions that are like “Within 800 feet of a Subway Entrance is Risk Factor 1” and “The Density of Bars within 500 Feet is Risk Factor 2”. So it prefers simple models that have prognostic value for PDs (or other agencies) to identify potential causal reasons for why that location is high crime. That subsequently helps not only to identify where hot spots are, but also to frame the potential interventions an agency may be interested in.

But this simplicity has a few drawbacks. One is that it is a global model, e.g. “within 800 feet of a subway entrance” applies to all subway entrances in the city. Most crime generators have a distribution such that, for example, most subway entrances are relatively safe and only a few end up being high crime. Another is that it forces the way that different crime generators predict crime to be a series of step functions, e.g. “within 600 ft” or “a high density within 1000 ft”. In reality, most geographic processes follow a distance decay function. E.g. if you are looking at the relationship between check-cashing stores and street robbery, there are likely to be more robberies very near the store, and the number tails off gradually the further away you get.
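To make the step function versus distance decay contrast concrete, here is a toy comparison. The exponential form and the `scale_ft` value are my own assumptions for illustration, not something estimated from the Dallas data or taken from the RTM software.

```python
import math

def step_risk(distance_ft, threshold_ft=800.0):
    """Step function: full risk inside the threshold, none outside."""
    return 1.0 if distance_ft <= threshold_ft else 0.0

def decay_risk(distance_ft, scale_ft=400.0):
    """Exponential distance decay; scale_ft is a hypothetical parameter."""
    return math.exp(-distance_ft / scale_ft)

# The step function jumps from 1 to 0 between 800 and 801 feet,
# while the decay curve declines smoothly the whole way
for d in [0, 200, 400, 800, 801, 1600]:
    print(d, step_risk(d), round(decay_risk(d), 3))
```

A flexible model such as a random forest does not need the functional form specified up front, which is part of why it can pick up this kind of gradual tail-off.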

So I fit a more complicated random forest model that has neither of those limitations and can learn much more complicated functions, both in terms of distance to crime generators as well as spatially varying over the city. But because of that you don’t get the simple model interpretation – they are fundamentally conflicting goals. In terms of predictions either my machine learning model or a simpler comparison of using prior crime = future crime greatly outperforms RTM for several different predictive metrics.

So this shows the predictions are better than RTM no matter how you slice the hot spot areas, but again you lose out on the prognostic value of RTM. To replace that, I show local interpretability scores for hot spots. I have an online map here for an example. If you click on one of the high crime predicted areas, it gives you a local breakdown of the different variables that contribute to the risk score.

So it is still more complicated than RTM, but gets you a local set of factors that potentially contribute to why places are hot spots. (It is still superficial in terms of causality, but PDs aren’t going to be able to get really well identified causal relationships for these types of predictions.)

Return on Investment for Hot Spots Policing

The second part of this is that Dallas is no doubt in a tight economic bind. And this was even before all the stuff about reforming police budgets. Policing academics have been saying for years that PDs should shift many more resources from reactive to proactive policing. But how do you make the argument that it is in police departments’ best interest to shift resources or invest in additional resources?

To do this I aimed to calculate a return on investment on investing in hot spots policing. Priscilla Hunt (from RAND) recently came up with labor cost estimates for crime specifically relevant for police departments. So if an aggravated assault happens PDs (in Texas) typically spend around $8k in labor costs to respond to the crime and investigate (it is $125k for a homicide). So based on this, you can say, if I can prevent 10 agg assaults, I then save $80k in labor costs. I use this logic to estimate a return on investment for PDs to do hot spots policing.
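The arithmetic behind that logic is simple enough to sketch. The labor cost figures are the approximate ones quoted above. The `intervention_cost` is a made-up number purely to show the subtraction, not an estimate from the paper.

```python
# Approximate police labor costs per crime quoted in the text
labor_cost = {"aggravated_assault": 8_000, "homicide": 125_000}

def labor_savings(prevented):
    """Labor dollars saved, given counts of prevented crimes by type."""
    return sum(labor_cost[crime] * count for crime, count in prevented.items())

# Preventing 10 aggravated assaults
savings = labor_savings({"aggravated_assault": 10})
print(savings)  # 80000

# Hypothetical cost of the intervention (not from the paper)
intervention_cost = 50_000
net_return = savings - intervention_cost
print(net_return)  # 30000
```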

So first I generate hot spots, weighting for the costs of those crimes. Here is an interactive map to check them out, and below is a screenshot of the map.

I have an example of then calculating a return on investment for the hot spot area that captured the most crime. I get this estimate by transforming meta-analysis estimates of hot spots policing, estimating an average crime reduction, and then backing out how much labor costs that would save a police department. So in this hot spot, an ROI for hot spots policing (for 1.5 years) is $350k.

That return would justify at least one (probably more like two) full time officers just to be assigned to that specific hot spot. So if you actually hire more officers, it will be around net-zero in terms of labor costs. If you shift around current officers it should be a net gain in labor resources for the PD.

If you do this ROI calculation, most of the hot spots I identify in the study likely aren’t hot enough to justify hot spots policing from this ROI perspective (these would probably never justify the intensive overtime that is typical of crackdown-like interventions). But a few clearly are, and definitely should be the targets of some type of hot spot intervention.