aggregate retention/churn models in python

Instead of having so much code just randomly floating around in blog posts, I need to start making packages (both in R and python) more often. I took it as a challenge to make a simple python package, here retenmod (pypi, github). I got the idea after answering a question on crossvalidated. (The resources I leveraged the most were these two sites/tutorials, packaging projects and minimal example.)

It is a simple port of the R package foretell that provides several different models to forecast churn based on aggregate survival probabilities. So it only has three functions, and I did not focus too much on extras (like building sphinx docs). Buit it has just the amount of complexity to make a nice intro get my feet wet example.

So you can now download/install the package via pip:

pip install retenmod

And it will automatically install scipy and numpy if you do not have them already installed. For a very simple example, I don’t have retention probabilities for any police department offhand, but this document has estimates for how many staff positions police tend to retain after increases.

Here is a simple example of using the library, in particular the BdW model.

import retenmod
import matplotlib.pyplot as plt

large = [100,66,56,52,49,47,44,42]
time = [0,1,2,3,4,5,10,15]

# Only fitting with the first 3 values
train, ext = 4, 15
lrg_bdw = retenmod.bdw(large[0:train],ext - train + 1)

# Showing predicted vs observed
pt = list(range(16))
fig, ax = plt.subplots()
ax.plot(pt[1:], lrg_bdw.proj[1:],label='Predicted', 
        c='k', linewidth=2, zorder=-1)
           edgecolor='k', c='r', s=50, zorder=1)
ax.axvline(train - 0.5, label='Train', color='grey', 
           linestyle='dashed', linewidth=2, zorder=-2)
ax.set_ylabel('% Retaining Position')
ax.legend(facecolor='white', framealpha=1)

So you can see even with only fitting the data to the first three years, years 4 and 5 were forecasted quite well. It underestimates retention further out at 10 and 15 years (the model has a hard time going down very fast from 100 to 66 and then flattening out in a reasonable way). But even so the super far out forecasts are not that crazy given only three data points.

I will have to work on an example later of showing how to translate this to cost-benefit analysis (although would prefer actual retention data from a PD). Essentially you can calculate the benefit of trying to save officers (retain them) vs hiring new officers and training them up based on just aggregate data. If you wanted to do something like estimate if retention is going down due to recent events, I would probably use micro-level data and estimate a survival model directly.

Next up I will try to turn my Exact distribution tests (R Code) for day of week/Benford’s analysis into a simple R package and see if I can get it on Cran. Posting to pypi is quite easy.

KDE plots for predicted probabilities in python

So I have previously written about two plots post binary prediction models – calibration plots and ROC curves. One addition to these I am going to show are kernel density estimate plots, broken down by the observed value vs predicted value. One thing in particular I wanted to make these for is to showcase the distribution of the predicted probabilities themselves, which can be read off of the calibration chart, but is not as easy.

I have written about this some before – transforming KDE estimates from logistic to probability scale in R. I will be showing some of these plots in python using the seaborn library. It will be easier instead of transforming the KDE to use edge weighting statistics to get unbiased estimates near the borders for the way the seaborn library is set up.

To follow along, you can download the data I will be using here. It is the predicted probabilities from the test set in the calibration plot blog post, predicting recidivism using several different models.

First to start, I load my python libraries and set my matplotlib theme (which is also inherited by seaborn charts).

Then I load in my data. To make it easier I am just working with the test set and several predicted probabilities from different models.

import pandas as pd
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# My theme

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}


And here I am reading in the data (just have the CSV file in my directory where I started python).

# Reading in the data with predicted probabilites
# Test from

pp_data = pd.read_csv(r'PredProbs_TestCompas.csv',index_col=0)


So you can see this data has the observed outcome Recid30 – recidivism after 30 days (although again this is the test dataset). And then it also has the predicted probability for three different models (XGBoost, RandomForest, and Logit), and then demographic breakdowns for sex and race.

The plot I am interested in seeing is a KDE estimate for the probabilities, broken down by the observed 0/1 for recidivism. Here is the default graph using seaborn:

# Original KDE plot by 0/1
sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
            common_norm=False, bw_method=0.15)

One problem you can see with this plot though is that the KDE estimates are smoothed beyond the data. You cannot have a predicted probability below 0 or above 1. Because we are using a gaussian kernel, we can just reweight observations that are close to the edge, and then clip the KDE estimate. So a predicted probability of 0 would get a weight of 1/0.5 – so it gets double the weight. Note to do this correctly, you need to set the bandwidth the same for the seaborn kdeplot as well as the weights calculation – here 0.15.

# Weighting and clipping
# Amount of density below 0 & above 1
below0 = norm.cdf(x=0,loc=pp_data['Logit'],scale=0.15)
above1 = 1- norm.cdf(x=1,loc=pp_data['Logit'],scale=0.15)
pp_data['edgeweight'] = 1/ (1 - below0 - above1)

sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
            common_norm=False, bw_method=0.15,
            clip=(0,1), weights='edgeweight')

This results in quite a dramatic difference, showing the model does a bit better than the original graph. The 0’s were well discriminated, so have many very low probabilities that were smoothed outside the legitimate range.

Another cool plot you can do that again shows calibration is to use seaborn’s fill option:

cum_plot = sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
                       common_norm=False, bw_method=0.15,
                       clip=(0,1), weights='edgeweight', 
                       multiple="fill", legend=True)
cum_plot.legend_._set_loc(4) #via

As expected this shows an approximate straight line in the graph, e.g. 0.2 on the X axis should be around 0.2 for the orange area in the chart.

Next seaborn has another good function here, violin plots. Unfortunately you cannot pass a weight function here. But another option is to simply resample your data a large number of times, using the weights you provided earlier.

n = 1000000 #larger n will result in more accurate KDE
resamp_pp = pp_data.sample(n=n,replace=True, weights='edgeweight',random_state=10)

viol_sex = sns.violinplot(x="Sex", y="XGB", hue="Recid30",
                          data=resamp_pp, split=True, cut=0, 
                          bw=0.15, inner=None,
                          scale='count', scale_hue=False)
viol_sex.legend_.set_bbox_to_anchor((0.65, 0.95))

So here you can see we have more males in the sample, and they have a larger high risk blob that was correctly identified. Females have a risk profile more spread out, although there is a small clump of basically 0 risk that the model identifies.

You can also generate the graph so the areas for the violin KDE’s are normalized, so in both the original and resampled data we have fewer females, and more black individuals.

# Values for Sex for orig/resampled

# Values for Race orig/resampled

But if we set scale='area' in the chart the violins are the same size:

viol_race = sns.violinplot(x="Race", y="XGB", hue="Recid30",
                           data=resamp_pp, split=True, cut=0, 
                           bw=0.15, inner=None,
                           scale='area', scale_hue=True)
viol_race.legend_.set_bbox_to_anchor((0.81, 0.95))

I will have to see if I can make some time to contribute to seaborn to make it so you can pass in weights to the violinplot function.

Using simulations to show ROI for predictive models in python

Two resources I have been consuming lately I would highly recommend:

Keith’s perspective is nearly a 100% match to my experiences, e.g. should aim for projects that have around $1 million in expected revenue to justify a data science person/team, up front estimates should be on the low end, the easiest projects you can formulate as micro-decisions and you use a model to improve those binary decisions, etc. How to measure anything fits right into this as well, where Hubbard basically says get a prior distribution on expected outcomes, and then do simulations to see possible outcomes.

Here I am going to show an example that is very close to several of the projects I have done to show the potential increase in revenue from taking a model based approach using simulations in python.


So the point in the data science project I am going to be illustrating is you have already decided to do an initial pilot model, and you have historical cases and then predicted probabilities from your model. Here I am thinking of the case of auditing some type transaction (it can be whatever you want, tax-returns, bank transactions, insurance claims, etc.). Here I am going to simulate some fake data to illustrate the later ROI estimates, but in real life you would use your own data for the business.

Here the variables I simulate are:

  • 5000 transactions, total_cases
  • a model based predicted probability, prob
  • a dollar value for the transaction, dollar
  • a historical marker whether a transaction was audited, audit
  • a historical marker whether the transaction was bad, hit

To be clear, this would be data you would normally already have for your business use case (e.g. historical transactions). To just illustrate my point I am making 100% fake data for everyone to follow along.

# Simulating data, probabilities
# and money values

from scipy.stats import norm
from scipy.stats import binom
from scipy.stats import beta
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

total_cases = 5000

# Beta(1,5), to generate the probs
prob = beta.rvs(1, 5, size=total_cases)

# Lognormal for the dollar values, clipped
dollar = np.exp(norm.rvs(7,2,size=total_cases)).clip(500,25000)

# Historical auditing process, all cases over 15000
audit = (dollar > 15000)*1

# Out of these, random 10% are hits
hit = binom.rvs(1, 0.10, size=total_cases)

# Putting into a dataframe
cases = pd.concat([pd.Series(dollar),pd.Series(audit),
                   pd.Series(prob), pd.Series(hit)], 
cases.columns = ['value','audit','prob', 'hit']
cases['revenue'] = cases['hit']*cases['value']*cases['audit']

cases['revenue'].sum() # about 1.1 million


These are all simulated from various probability distributions to look somewhat like real data. Probabilities and dollar values are right skewed. They are independent here, but it is ok if in your real data they are not.

Here I pretend the historical audit selection process is they automatically audit all large transactions, over $15k. And these historical audits have a 10% probability of finding a hit (think of it as fraud if you want). So the context is given our model estimates prob, how much more money do we think we can make if you use these model based decision as opposed to our simple threshold that is the current process?

Revenue Simulations

So here for my revenue simulations, what I am going to do is pretend I can audit the same number of cases (471), based on my model estimates, audit_total.

audit_total = audit.sum() #pretend we get to model the same
                          #number of cases
cases['model_expected'] = cases['prob']*cases['value']
cases['model_rank'] = cases['model_expected'].rank(method='first', ascending=False)
cases['model_audit'] = 1*(cases['model_rank'] >= audit_total)

# Expected revenue from our model based approach
# About 1.3 million

So if our model is well calibrated, we can take those predicted probabilities and estimate what we think should happen if we used our model to audit 471 cases. Here we think we would make around 1.3 million, so about a lift of over $200k.

But, these models are probabilistic estimates. So I like to use simulations to hedge a bit when I am presenting to the business. Here I do 5000 simulations where I select my 471 cases, use a binomial random number generator to flip the coin whether the case results in a hit or not, and then calculate the total revenue.

# Simulating binomial process, seeing what the revenue is
cases_audit = cases[cases['model_audit'] == 1].copy()
rev_sim = [] #doing 5000 simulations
for i in range(5000):
    hit_sim = binom.rvs(1, cases_audit['prob'])
    sim_outs = hit_sim * cases_audit['value']
    rev_sim.append( (sim_outs.sum(), hit_sim.mean()) )

rev_sim = pd.DataFrame(rev_sim, columns=['RevSim','HitRateSim'])

We can then turn this into a nice graph of simulated potential outcomes. In our model approach, on average we would expect to make $1.3 million (versus the actual revenue of $1.1 million), but we have variance around that estimate:

# making a nice graph
actual_rev = cases['revenue'].sum()/1000000
ax = (rev_sim['RevSim']/1000000).hist(bins=100, alpha=0.8, color='grey')
ax.axvline(actual_rev, color='r', linewidth=3)
ax.set_xlabel('Audit Revenue in $1,000,000')
plt.text(actual_rev + 0.008, 150, 'Actual Revenue', color='r')
plt.title('Simulated Revenue when using Model')

So you can see on a very few occasions we make less than the revenue under the current strategy of audit all large cases. But in just as many circumstances we are making over $400k in additional profit.

You may ask why 5000 simulations instead of more or less? Well these are small enough I can easily do them quickly, so I could up the simulations to a higher value if I wanted. Long story short, if you look at the histogram of outcomes and it is still quite bumpy, you should probably do more simulations. Here 5000 is plenty, although 1000 was clearly more bumpy.

If you don’t want to present the histogram, or have more complicated scenarios and prefer a table laying those scenarios out, you can pull out simulated confidence intervals of the additional revenue outcomes:

# If you want to put a confidence interval on it
# Per 1000 dollars
diff = (rev_sim['RevSim'] - cases['revenue'].sum())/1000

# 95% confidence interval

One of the benefits of having a model, even if the revenue is not increased, is that you can generate estimates for other types of interventions. In the auditing case, you can potentially justify more auditors (e.g. we can hire more people to investigate 400 more cases and still expect to make a profit). (Here I have a related criminal justice example for bail decisions.) Or you can apply the models as a potential sales pitch to a new client. E.g. if you hire us to do these audits, given your data and our model, we think we can make the $X dollars.

Model based approaches also allow you to meet more constraints, such as increasing the hit rate, or meeting fairness constraints. Here in this simulation if we use a model based approach, the hit rate goes up to around 15% as opposed to 10%. Which may be worth it for your investigators or clients depending on the situation.

Fitting a pytorch model

Out of the box when fitting pytorch models we typically run through a manual loop. So typically something like this:

# Example fitting a pytorch model
# mod is the pytorch model object
opt = torch.optim.Adam(mod.parameters(), lr=1e-4)
crit = torch.nn.MSELoss(reduction='mean')
for t in range(20000):
    y_pred = mod(x)   #x is tensor of independent vars
    loss = crit(y_pred,y) #y is tensor of outcomes

And this would use backpropogation to adjust our model parameters to minimize the loss function, here just the mean square error, over 20,000 iterations. Best practices are to both evaluate the loss in-sample and wait for it to flatten out, as well as evaluate out of sample.

I recently wrote some example code to make this process somewhat more like the sklearn approach, where you instantiate an initial model object, and then use a, y) function call to fit the pytorch model. For an example use case I will just use a prior Compas recidivism data I have used for past examples on the blog (see ROC/Calibration plots, and Balancing False Positives). Here is the prepped CSV file to download to follow along.

So first, I load the libraries and then prep the recidivism data before I fit my predictive models.

# Front end libraries/data prep

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch

# Setting seeds

# Prepping the Compas data and making train/test
recid = pd.read_csv('PreppedCompas.csv')

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
dum_race = pd.get_dummies(recid['race'])

# White for reference category
for d in list(dum_race):
    if d != 'Caucasion':
        recid_prep[d] = dum_race[d]

# reference category is separated/unknown/widowed
dum_mar = pd.get_dummies(recid['marital_status'])
recid_prep['Single'] = dum_mar['Single']
recid_prep['Married'] = dum_mar['Married'] + dum_mar['Significant Other']

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Independant variables
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'African-American','Asian','Hispanic','Native American','Other',

# Dependent variable
y_var = 'Recid30'

Now next part is more detailed, but it is the main point of the post. Typically we will make a pytorch model object something like this. Here I have various switches, such as the activation function (tanh or relu or pass in your own function), or the final function to limit predictions to 0/1 (either sigmoid or clamp or again pass in your own function).

# Initial pytorch model class
class logit_pytorch(torch.nn.Module):
    def __init__(self, nvars, device, activate='relu', bias=True,
        Construct parameters for the coefficients 
        activate - either string ('relu' or 'tanh', 
                   or pass in your own torch function
        bias - whether to include bias (intercept) in model
        final - use either 'sigmoid' to squash to probs, or 'clamp'
                or pass in your own torch function
        device - torch device to construct the tensors
                 default cuda:0 if available
        super(logit_pytorch, self).__init__()
        # Creating the coefficient parameters
        self.coef = torch.nn.Parameter(torch.rand((nvars,1),
        # If no bias it is 0
        if bias:
            self.bias = torch.nn.Parameter(torch.zeros(1,
            self.bias = torch.zeros(1, device=device)
        # Various activation functions
        if activate == 'relu':
            self.trans = torch.nn.ReLU()
        elif activate == 'tanh':
            self.trans = torch.nn.Tanh()
            self.trans = activate
        if final == 'sigmoid':
   = torch.nn.Sigmoid()
        elif final == 'clamp':
            # Defining my own clamp function
            def tclamp(input):
                return torch.clamp(input,min=0,max=1)
   = tclamp
            # Can pass in your own function
   = final
    def forward(self, x):
        predicted probability
        output = self.bias +, self.trans(self.coef))

To use this though again we need to specify the number of coefficients to create, and then do a bunch of extras like the optimizer, and stepping through the function (like described at the beginning of the post). So here I have created a second class that behaves more like sklearn objects. I create the empty object, and only when I pass in data to the .fit() method it spins up the actual pytorch model with all its tensors of the correct dimensions.

# Creating a class to instantiate model to data and then fit
class pytorchLogit():
    def __init__(self, loss='logit', iters=25001, 
                 activate='relu', bias=True, 
                 final='sigmoid', device='gpu',
        loss - either string 'logit' or 'brier' or own pytorch function
        iters - number of iterations to fit (default 25000)
        activate - either string ('relu' or 'tanh', 
                   or pass in your own torch function
        bias - whether to include bias (intercept) in model
        final - use either 'sigmoid' to squash to probs, or 'clamp'
                or pass in your own torch function. Should not use clamp
                with default logit loss
        opt - ?optimizer? should add an option for this
        device - torch device to construct the tensors
                 default cuda:0 if available
        printn - how often to check the fit (default 1000 iters)
        super(pytorchLogit, self).__init__()
        if loss == 'logit':
            self.loss = torch.nn.BCELoss()
            self.loss_name = 'logit'
        elif loss == 'brier':
            self.loss = torch.nn.MSELoss(reduction='mean')
            self.loss_name = 'brier'
            self.loss = loss
            self.loss_name = 'user defined function'
        # Setting the torch device
        if device == 'gpu':
                self.device = torch.device("cuda:0")
                print(f'Torch device GPU defaults to cuda:0')
                print('Unsuccessful setting to GPU, defaulting to CPU')
                self.device = torch.device("cpu")
        elif device == 'cpu':
            self.device = torch.device("cpu")
            self.device = device #can pass in whatever
        self.iters = iters
        self.mod = None
        self.activate = activate
        self.bias = bias = final
        self.printn = printn
        # Other stats to carry forward
        self.loss_metrics = []
        self.epoch = 0
    def fit(self, X, y, outX=None, outY=None):
        x_ten = torch.tensor(X.to_numpy(), dtype=torch.float,
        y_ten = torch.tensor(pd.DataFrame(y).to_numpy(), dtype=torch.float,
        # Only needed if you pass in an out of sample to check as well
        if outX is not None:
            x_out_ten = torch.tensor(outX.to_numpy(), dtype=torch.float,
            y_out_ten = torch.tensor(pd.DataFrame(outY).to_numpy(), dtype=torch.float,
        self.epoch += 1
        # If mod is not already created, create a new one, else update prior
        if self.mod is None:
            loc_mod = logit_pytorch(nvars=X.shape[1], activate=self.activate, 
            self.mod = loc_mod
            loc_mod = self.mod
        opt = torch.optim.Adam(loc_mod.parameters(), lr=1e-4)
        crit = self.loss
        for t in range(self.iters):
            y_pred = loc_mod(x_ten)
            loss = crit(y_pred,y_ten)
            if t % self.printn == 0:
                if outX is not None:
                    pred_os = loc_mod(x_out_ten)
                    loss_os = crit(pred_os,y_out_ten)
                    res_tup = (self.epoch, t, loss.item(), loss_os.item())
                    print(f'{t}: insample {res_tup[2]:.4f}, outsample {res_tup[3]:.4f}')
                    res_tup = (self.epoch, t, loss.item(), None)
                    print(f'{t}: insample {res_tup[2]:.5f}')
    def predict_proba(self, X):
        x_ten = torch.tensor(X.to_numpy(), dtype=torch.float,
        res = self.mod(x_ten)
        pp = res.cpu().detach().numpy()
        return np.concatenate((1-pp,pp), axis=1)
    def loss_stats(self, plot=True, select=0):
        pd_stats = pd.DataFrame(self.loss_metrics, columns=['epoch','iteration',
        if plot:
            pd_stats2 = pd_stats.rename(columns={'insamploss':'In Sample Loss', 'outsamploss':'Out of Sample Loss'})
            pd_stats2 = pd_stats2[pd_stats2['iteration'] > select].copy()
            ax = pd_stats2[['iteration','In Sample Loss','Out of Sample Loss']].plot.line(x='iteration', 
                            ylabel=f'{self.loss_name} loss')
        return pd_stats

Again it allows you to pass in various extras, which here are just illustrations for binary predictions (like the loss function as the Brier score or the more typical log-loss). It also allows you to evaluate the fit for just in-sample, or for out of sample data as well. It also allows you to specify the number of iterations to fit.

So now that we have all that work done, here as some simple examples of its use.

# Creating a model and fitting
mod = pytorchLogit()[ind_vars], recid_train[y_var])

So you can see that this is very similar now to sklearn functions. It will print at the console fit statistics over the iterations:

So it defaults to 25k iterations, and you can see that it settles down much before that. I created a predict_proba function, same as most sklearn model objects for binary predictions:

# Predictions out of sample
predprobs = mod.predict_proba(recid_test[ind_vars])
predprobs # 1st column is probability 0, 2nd prob 1

And this returns a numpy array (not a pytorch tensor). Although you could modify to return a pytorch tensor if you wanted it to (or give an option to specify which).

Here is an example of evaluating out of sample fit as well, in addition to specifying a few more of the options.

# Evaluating predictions out of sample, more iterations
mod2 = pytorchLogit(activate='tanh', iters=40001, printn=100)[ind_vars], recid_train[y_var], recid_test[ind_vars], recid_test[y_var])

I also have an object function, .loss_stats(), which gives a nice graph of in-sample vs out-of-sample loss metrics.

# Making a nice graph
dp = mod2.loss_stats()

We can also select the loss function to only show later iterations, so it is easier to zoom into the behavior.

# Checking out further along

And finally like I said you could modify some of your own functions here. So instead of any activation function I pass in the identity function – so this turns the model into something very similar to a vanilla logistic regression.

# Inserting in your own activation (here identity function)
def ident(input):
    return input

mod3 = pytorchLogit(activate=ident, iters=40001, printn=2000)[ind_vars], recid_train[y_var], recid_test[ind_vars], recid_test[y_var])

And then if you want to access the coefficients weights, it is just going down the rabbit hole to the pytorch object:

# Can get the coefficients/intercept
print( mod3.mod.coef )
print( mod3.mod.bias )

This type of model can of course be extended however you want, but modifying the pytorchLogit() and logit_pytorch class objects to specify however detailed switches you want. E.g. you could specify adding in hidden layers.

One thing I am not 100% sure the best way to accomplish is loss functions that take more parameters, as well as the best way to set up the optimizer. Maybe use *kwargs for the loss function. So for my use cases I have stuffed extra objects into the initial class, so they are there later if I need them.

Also here I would need to think more about how to save the model to disk. The model is simple enough I could dump the tensors to numpy, and on loading re-do them as pytorch tensors.

ROC and calibration plots for binary predictions in python

When doing binary prediction models, there are really two plots I want to see. One is the ROC curve (and associated area under the curve stat), and the other is a calibration plot. I have written a few helper functions to make these plots for multiple models and multiple subgroups, so figured I would share, binary plots python code. To illustrate their use, I will use the same Compas recidivism data I have used in the past, (CSV file here). So once you have downloaded those two files you can follow along with my subsequent code.

Front Prep

First, I have downloaded the file and the PreppedCompas.csv file to a particular folder on my machine, D:\Dropbox\Dropbox\Documents\BLOG\binary_plots. To import these functions, I append that path using sys, and change the working directory using os. The other packages are what I will be using the fit the models.

# Front end prep

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import os
import sys

my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\binary_plots'

# Can append to path
import binary_plots

np.random.seed(10) #setting the seed for the random
# split for train/test

Next up I prepare the data, this is just boring stuff turning categorical variables into various dummies and making a train/test split for the data (which can be done in a variety of ways).

# Prepping Compas Data

#For notes on data source, check out 
recid = pd.read_csv('PreppedCompas.csv')

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")

print( recid['race'].value_counts() )
dum_race = pd.get_dummies(recid['race'])
# White for reference category
for d in list(dum_race):
    if d != 'Caucasion':
        recid_prep[d] = dum_race[d]

print( recid['marital_status'].value_counts() )
dum_mar = pd.get_dummies(recid['marital_status'])
recid_prep['Single'] = dum_mar['Single']
recid_prep['Married'] = dum_mar['Married'] + dum_mar['Significant Other']
# reference category is separated/unknown/widowed

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Independant variables
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'African-American','Asian','Hispanic','Native American','Other',

# Dependent variable
y_var = 'Recid30'

Next, the sklearn library makes it quite easy to fit a set of multiple models. Most of the time I start with XGBoost, random forests, and a normal logistic model with no coefficient penalty. I just stuff the base model object in a dictionary, pipe in the same training data, and fit the models. Then I can add in the predicted probabilities from each model into the test dataset. (These plots I show you should only show on the test dataset, of course the data will be calibrated on the training dataset.)

# Training three different models, Logit,
# Random Forest, and XGBoost

final_models = {}
final_models['XGB'] = XGBClassifier(n_estimators=100, max_depth=5)
final_models['RF'] = RandomForestClassifier(n_estimators=1000, max_depth=10, min_samples_split=50)
final_models['Logit'] = LogisticRegression(penalty='none', solver='newton-cg')

# Iterating over each model and fitting on train
for nm, mod in final_models.items():[ind_vars], recid_train[y_var])

# Adding predicted probabilities back into test dataset
for nm, mod in final_models.items():
    # Predicted probs out of sample
    recid_test[nm] =  mod.predict_proba(recid_test[ind_vars])[:,1]

This is fairly tiny data, so I don’t need to worry about how long this takes or run out of memory. I’d note you can do the same model, but different hyperparameters in this approach. Such as tinkering with the depth for tree based models is one I by default limit quite a bit.

AUC Plots

First, my goto metric to see the utility of a particular binary prediction model is the AUC stat. This has one interpretation in terms of the concordance stat, an AUC of 0.7 means if you randomly picked a 0 case and a 1 case, the 1 case would have a higher value 70% of the time. So AUC is all about how well your prediction discriminates between the two classes.

So with my binary_plots function, you can generate an ROC curve for the test data for a single column of predictions as so:

# A single column
binary_plots.auc_plot(recid_test, y_var, ['Logit'], save_plot='AUC1.png')

As I have generated predictions for multiple models, I have also generated a similar graph, but stuff the AUC stats in the matplotlib legend:

# Multiple columns to show different models
pred_prob_cols = list(final_models.keys()) #variable names
binary_plots.auc_plot(recid_test, y_var, pred_prob_cols, save_plot='AUC2.png')

It is also the case you want to do these plots for different subgroups of data. In recidivism research, we are often interested in sex and racial breakdowns. Here is the Logit model AUC broken down by Males (1) and Females (0).

# By subgroups in the data
binary_plots.auc_plot_long(recid_test, y_var, 'Logit', group='Male', save_plot='AUC3.png')

So this pulls the labels from the data, but you can pass in strings to get nicer labels. And finally, I show how to put both of these together, both by models and by subgroups in the data. Subgroups are different panels, and you can pass in a fontsize moniker to make the legends smaller for each subplot, and a size for each subplot (they are squares).

# Lets make nicer variable names for Male/Females and Racial Groups
recid_test['Sex'] = recid_test['Male'].replace({0: 'Female', 1:'Male'})
recid_test['Race'] = recid[recid_prep['Train'] == 0]['race']
recid_test['Race'] = recid_test['Race'].replace({'Hispanic': 'Other', 'Asian':'Other', 'Native American':'Other', 'African-American':'Black', 'Caucasian':'White'})

# Now can do AUC plot by gender and model type
binary_plots.auc_plot_wide_group(recid_test, y_var, pred_prob_cols, 'Sex', size=4, leg_size='x-small', save_plot='AUC4.png')

The plots have a wrap function (default wrap at 3 columns), so you can plot as many subgroups as you want. Here is an example combing the sex and race categories:

One limitation to note in these plots, ROC plots are normalized in a way that the thresholds for each subgroup may not be at the same area of the plot (e.g. a FPR of 0.1 for one subgroup implies a predicted probability of 30%, whereas for another subgroup it implies a predicted probability of 40%).

ROC/AUC is definitely not a perfect stat, most of the time we are only interested in the far left hand side of the ROC curve (how well we can identify high risk cases without a ton of false positives). That is why I think drawing the curves are important – one model may have a higher AUC, but it is in an area of the curve not relevant for how you will use the predictions in practice. (For tree based models with limited depth and limited variables, it can produce flat areas in the ROC curve for example.)

But I find the ROC curve/AUC metric the most useful default for both absolute comparisons (how well is this model doing overall), as well as relative model comparisons (is Model A better than Model B).

Most models I work with I can get an AUC of 0.7 without much work, and once I get an AUC of 0.9 I am in the clearly diminishing returns category to tinkering with my model (this is true for both criminology related models I work with, as well as healthcare related models in my new job).

This is of course data dependent, and even an AUC of 0.9 is not necessarily good enough to use in practice (you need to do a cost-benefit type analysis given how you will use the predictions to figure that out).

Calibration Charts

For those with a stat background, these calibration charts I show are a graphical equivalent of the Hosmer-Lemeshow test. I don’t bother conducting the Chi-square test, but visually I find them informative to not only see if an individual model is calibrated, but also to see the range of the predictions (my experience XGBoost will be more aggressive in the range of predicted probabilities, but is not always well calibrated).

So we have the same three types of set ups as with the ROC plots, a single predicted model:

# For single model
binary_plots.cal_data('XGB', y_var, recid_test, bins=60, plot=True, save_plot='Cal1.png')

For multiple models, I always do these on separate subplots, they would be too busy to superimpose. And because it is a single legend, I just build the data and use seaborn to do a nice small multiple. (All of these functions return the dataframe I use to build the final plot in long format.) The original plot was slightly noisy with 60 bins, so I reduce it to 30 bins here, but it is still slightly noisy (but each model is pretty well calibrated). XGBoost has a wider range of probabilities, random forests lowest bin is around 0.1 and max is below 0.8. Logit has lower probabilities but none above 0.8.

# For multiple models
binary_plots.cal_data_wide(pred_prob_cols, y_var, recid_test, bins=30, plot=True, save_plot='Cal2.png')

For a single model, but by subgroups in the data. The smaller other race group is more noisy, but again each model appears to be approximately calibrated.

# For subgroups and one model
binary_plots.cal_data_group('XGB', y_var, 'Race', recid_test, bins=20, plot=True, save_plot='Cal3.png')

And a combo of subgroup data and multiple models. Again the smaller subgroup Females appear more noisy, but all three models appear to be doing OK in this quick example.

# For subgroups and multiple models
binary_plots.cal_data_wide_group(pred_prob_cols, y_var, 'Sex', recid_test, bins=20, plot=True, save_plot='Cal4.png')

Sometimes people don’t bin the data (Peter Austin likes to do a smoothed plot), but I find the binned data easier to follow and identify deviations above/below predicted probabilities. In real life you often have some fallout/dropoff if there is feedback between the model and how other people respond to the model (e.g. the observed is always 5% below the predicted).

Python f string number formatting and SPSS break long labels

Another quick blog post, as moving is not 100% crazy all the time now, but I need a vacation after all that work. So two things in this blog post: formatting numeric f strings in python, and breaking long labels in SPSS meta-data.

Python f-string numeric formatting

This is super simple, but I can never remember it (so making a quick blog post for my own reference). As of python 3.6, you can use f-strings to do simple text substitution. So if you do:

x = 2/3
sub_str = f'This proportion is {x}'

Then we will get printed out This proportion is 0.6666666666666666. So packing global items inside of {} expands within the f string. While for more serious string subsitution (like creating parameterized SQL queries), I like to use string templates, these f-strings are very nice to print short messages to the console or make annotations in graphs.

Part of this note is that I never remember how to format these strings. If you are working with integers it is not a big deal, but as you can see above I often do not want to print out all those decimals inside my particular message. A simple way to format the strings are:

f'This proportion is {x:.2f}'

And this prints out to two decimal places 'This proportion is 0.67'. If you have very big numbers (say revenue), you can do something like:

f'This value is ${x*10000:,.0f}'

Which prints out 'This value is $6,667' (so you can modify objects in place, to say change a proportion to a percentage).

Note also to folks that you can have multi-line f-strings by using triple quotes, e.g.:

f'''This is a super
long f-string for {x:.2f}
on multiple lines!'''

But one annoying this is that you need to keep the whitespace correct inside of functions even inside the triple string. So those are cases I like using string templates. But another option is to break up the string and use line breaks via \n.

long_str = (f'This is line 1\n'
            f'Proportion is {x:.1f}\n'
            f'This is line 3')

Which prints out:

This is line 1
Proportion is 0.7
This is line 3

You could do the line breaks however, either at the beginning of each line or at the end of each line.

SPSS break long labels

This was in reference to a project where I was working with survey data, and for various graphs I needed to break up long labels. So here is an example to illustrate the problem.

* Creating simple data to illustrate.
1 1
2 2
3 3
4 4
  1 'This is a reallllllllly long label'
  2 'short label'
  3 'Super long unnecessary label that is long'
  4 'Again another long label what is up with this'
  X 'Short variable label'
  Y 'This is also a super long variable label that is excessive!'

  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))

So you can see, SPSS shrinks the data to accommodate the long labels. (I don’t know how to control the behavior in the graph or the chart template itself, so not sure why only this gets wrapped for the first label.) So we can use the \n line break trick again in SPSS to get these to split where we prefer. Here are some python functions to do the splitting (which I am sure can be improved upon), as well as to apply the splits to the current SPSS dataset. You can decide the split where you want the line to be broken, and so if a word goes above that split level it wraps to the next line.

* Now some python to wrap long labels.
import spss, spssaux

# Splits a long string with line breaks
def long_str(x,split):
    split_str = x.split(" ")
    cum = len(split_str[0])
    cum_str = split_str[0]
    for s in split_str[1:]:
        cum += len(s) + 1
        if cum <= split:
            cum_str += " " + s
            cum_str += r"\n" + s
            cum = len(s)
    return cum_str

# This grabs all of the variables in the current SPSS dataset
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]

# This looks at the VALUE LABELS and splits them up on multiple lines
def split_vallab(vList, lsplit):
    vardict = spssaux.VariableDict()
    for v in vardict:
        if v in vList:
            vls= v.ValueLabels.keys()
            if vls:
                for k in vls:
                    ss = long_str(v.ValueLabels[k], lsplit)
                    if ss != v.ValueLabels[k]:
                        vn = v.VariableName
                        cmd = '''ADD VALUE LABELS %(vn)s %(k)s \'%(ss)s\'.''' % ( locals() )

# I run this to split up the value labels
split_vallab(varList, 20)

# This function is for VARIABLE LABELS
def split_varlab(vList,lsplit):
    for i,v in enumerate(vList):
        vlab = spss.GetVariableLabel(i)
        if len(vlab) > 0:
            slab = long_str(vlab, lsplit)
            if slab != vlab:
                cmd = '''VARIABLE LABELS %(v)s \'%(slab)s\'.''' % ( locals() )

# I don't run this right now, as I don't need it
split_varlab(varList, 30)

And now we can re-run our same graph command, and it is alittle nicer:

  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))

And you can also go to the variable view to see my inserted line breaks:

SPSS still does some auto-intelligence when to wrap lines in tables/graphs (so if you do DISPLAY DICTIONARY. it will still wrap the X variable label in my default tables, even though I have no line break). But this gives you at least a slight bit of more control over charts/tables.

Some ACS download helpers and Research Software Papers

The blog has been a bit sparse recently, as moving has been kicking my butt (hanging up curtains and recycling 100 boxes today!). So just a few quick notes.

Downloading ACS Data

First, I have posted some helper functions to work with American Community Survey data (ACS) in python. For a quick overview, if you import/define those functions, here is a quick example of downloading the 2019 Texas micro level files (for census tracts and block groups) from the census FTP site. Can pipe in another year (if available) and and whatever state into the function.

# Python code to download American Community Survey data
base = r'??????' #put your path here where you want to download data
temp = os.path.join(base,'2019_5yr_Summary_FileTemplates')
data = os.path.join(base,'tables')


Some locations have census tract data to download, I think the FTP site is the only place to download block group data though. And then based on those files you downloaded, you can then grab the variables you want, and here I show selecting out the block groups from those fields:

interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
labs, comp_tabs = merge_tabs(interest,temp,data)
bg = comp_tabs['NAME'].str.find('Block Group') == 0

Then based on that data, I have an additional helper function to calculate proportions given two lists of the numerators and denominators that you want:

top = ['B17010_002',['B11003_016','B11003_013'],'B08141_002']
bot = ['B17010_001',        'B11002_001'       ,'B08141_001']
nam = ['PovertyFamily','SingleHeadwithKids','NoCarWorkers']
prep_sdh = prop_prep(bg, top, bot, nam)

So here to do Single Headed Households with kids, you need to add in two fields for the numerator ['B11003_016','B11003_013']. I actually initially did this example with census tract data, so not sure if all of these fields are available at the block group level.

I have been doing some work on demographics looking at the social determinants of health (see SVI data download, definitions), hence the work with census data. I have posted my prior example fields I use from the census, but criminologists may just use the social-vulnerability-index from the CDC – it is essentially the same as how people typically define social disorganization.

Peer Review for Criminology Software

Second, jumping the gun a bit on this, but in the works is an overlay journal for CrimRxiv. Part of the contributions we will accept are software contributions, e.g. if you write an R package to do some type of analysis function common in criminology.

It is still in the works, but we have some details up currently and a template for submission (I need to work on a markdown template, currently just a word doc). High level I wanted something like the Journal of Statistical Software or the Journal of Open Source Software (I do not think the level of detail of JSS is necessary, but wanted an example use case, which JoSS does not have).

Just get in touch if you have questions whether your work is on topic. Aim is to be more open to contributions at first. Really excited about this, as publicly sharing code is currently a thankless prospect. Having a peer reviewed venue for such code contributions for criminologists fills a very important role that traditional journals do not.

Future Posts?

Hopefully can steal some time to continue writing posts here and there, but will definitely be busy getting the house in order in the next month. Hoping to do some work on mapping grids and KDE in python/geopandas, and writing about the relationship between healthcare data and police incident report data are two topics I hope to get some time to work on in the near future for the blog.

If folks have requests for particular topics on the blog though feel free to let me know in the comments or via email!

Minimum detectable effect sizes for place based designs

So I was reading Blattman et al.’s (2018) work on a hot spot intervention in Bogotá the other day. It is an excellent piece, but in a supplement to the paper Blattman makes the point that while his study is very high powered to detect spillovers, most other studies are not. I am going to detail here why I disagree with his assessment on that front.

In appendix A he has two figures, one for the direct effect comparing the historical hot spot policing studies (technically he uses the older 2014 Braga study, but here is the cite for the update Braga et al., 2020).

The line signifies a Cohen’s D of 0.17, and here is the same graph for the spillover estimates:

So you can see Blattman’s study in total number of spatial units of analysis breaks the chart so to speak. You can see however there are plenty of hot spot studies in either chart that reported statistically significant differences, but do not meet the 0.17 threshold in Chris’s chart. How can this be? Well, Chris is goal switching a bit here, he is saying using his estimator the studies appear underpowered. The original studies on the graph though did not necessarily use his particular estimator.

The best but not quite perfect analogy I can think of is this. Imagine I build a car that gets better gas mileage compared to the current car in production. Then someone critiques this as saying the materials that go into production of the car have worse carbon footprints, so my car is actually worse for the environment. It would be fine to argue a different estimate of total carbon footprint is reasonable (here Chris could argue his estimator is better than the ones the originally papers used). It is wrong though to say you don’t actually improve gas mileage. So it is wrong for Chris to say the original articles are underpowered using his estimator, they may be well powered using a different estimator.

Indeed, using either my WDD estimator (Wheeler & Ratcliffe, 2018) or Wilson’s log IRR estimator (Wilson, 2021), I will show how power does not grow with more experimental units, but with a larger baseline number of crimes for those estimators. They both only have two spatial units of analysis, so in Chris’s chart will never gain more power.

One way I think about the issue for spatial designs is this – you could always split up a spatial lattice into ever finer and finer spatial units of analysis. For example Chris could change his original design to use addresses instead of street segments, and split up the spillover buffers into finer slices as well. Do you gain something for doing nothing though? I doubt it.

I describe in my dissertation how finer spatial units of analysis allow you to check for finer levels of spatial spillovers, e.g. can check if crime spills over from the back porch to the front stoop (Wheeler, 2015). But when you do finer spatial units, you get more cold floor effects as well due to the limited nature of crime counts – they cannot go below 0. So designs with lower baseline crime rates tend to show lower power (Hinkle et al., 2013).

MDE for the WDD and log IRR

For minimum detectable effect (MDE) sizes for OLS type estimators, you need to specify the variance you expect the underlying treated/control groups to have. With the count type estimators I will show here, the variance is fixed according to the count. So all I need to specify is the alpha level of the test. Here I will do a default of 0.05 alpha level (with different lines for one-tailed vs two-tailed). The other assumption is the distribution of crime counts between treated/control areas. Here I assume they are all equal, so 4 units (pre/post and treated/control). For my WDD estimator this actually does not matter, for the later IRR estimator though it does (so the lines won’t really be exact for his scenario).

So here is the MDE for mine and Jerry’s WDD estimator:

What this means is that if you have an average of 20 crimes in the treated/control areas for each time period separately, you would need to find a reduction of 15 crimes to meet this threshold MDE for a one-tailed. It is pretty hard when starting with low baselines! For an example close to this, if the treated area went from 24 to 9, and the control area was 24 to 24, this would meet the minimal treated reduction of 15 crimes in this example.

And here is the MDE for the log IRR estimator. The left hand Y axis has the logged effect, and the right hand side has the exponentiated IRR (incident rate ratio).

Since the IRR is commonly thought of as a percent reduction, this suggests even with baselines of 200 crimes, for Wilson’s IRR estimator you need percent reductions of over 20% relative to control areas.

So I have not gone through the more recent Braga et al. (2020) meta-analysis. I do not know if they have the data readily available to draw the points on this plot the same as in the Blattman article. To be clear, it may be Blattman is right and these studies are underpowered using either his or my estimator, I am not sure. (I think they probably are quite underpowered to detect spillover, since this presumably will be an even smaller amount than the direct effect. But that would not explain estimates of diffusion of benefits commonly found in these studies!)

I also do not know if one estimator is clearly better or not – for example Blattman could use my estimator if he simply pools all treated/control areas. This is not obviously better than his approach though, and foregoes any potential estimates of treatment effect variance (I will be damned if I can spell that word starting with het even close enough for autocorrect). But maybe the pooled estimate is OK, Blattman does note that he has cold floor effects in his linear estimator – places with higher baselines have larger effects. This suggests Wilson’s log IRR estimator with the pooled data may be just fine and dandy for example.

Python code

Here is the python code in its entirety to generate the above two graphs. You can see the two functions to calculate the MDE given an alpha level and average crime counts in each area if you are planning your own study, the wdd_mde and lirr_mde functions.

Estimating minimum detectable effect sizes
for place based crime interventions

Andy Wheeler

import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\min_det_effect'

#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}


# Functions for MDE for WDD and logIRR estimator

def wdd_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( avg_counts*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

def lirr_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( (1/avg_counts)*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

# Generating regular grid from 10 to 200
cnts = np.arange(10,201)
wmde1 = wdd_mde(cnts, tails='one')
wmde2 = wdd_mde(cnts)

imde1 = lirr_mde(cnts, tails='one')
imde2 = lirr_mde(cnts)

# Plot for WDD MDE
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(cnts, wmde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts, wmde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('Crime Count Reduction')
ax.legend(loc='upper left')
plt.title("WDD MDE alpha level 0.05")
plt.savefig('WDD_MDE.png', dpi=500, bbox_inches='tight')

# Plot for IRR MDE
fig, ax = plt.subplots(figsize=(8,6))
ax2 = ax.secondary_yaxis("right", functions=(np.exp, np.log))
ax.plot(cnts,-1*imde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts,-1*imde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('log IRR')
ax.set_ylim(-0.16, -1.34)
ax.legend(loc='upper right')
plt.title("IRR MDE alpha level 0.05")
plt.savefig('IRR_MDE.png', dpi=500, bbox_inches='tight')



Comparing the WDD vs the Wilson log IRR estimator

So this is maybe my final post on the WDD estimator for the time being (Wheeler & Ratcliffe, 2018). Recently David Wilson had an article in JQC that proposes a different estimator using the same basic information, just pre-post crime counts for treated and control areas (Wilson, 2021). So say we had the table:

         Pre   Post
Treated   50     30
Control   60     55

So in this scenario, my WDD estimate is -20 in the treated area, and -5 in the control area, so the overall estimate is -20 – -5 = -15.

30 - 50 - (55 - 60) = -15

So an estimated reduction of -15 crimes overall. David’s estimator is the logged incident rate ratio (IRR), and so is just like above, except logs all of the values:

log(30) - log(50) - ( log(55) - log(60) ) = -0.4238142

This is a logged incident rate adjustment, so most of the time people exponentiate this value, which is exp(-0.4238142) = 0.6545455. So this suggests crime is reduced by approximately 35% in the treated area relative to the control area in this hypothetical. Or another way to write it is (30/50)/(55/60) = 0.6545455.

So instead of a linear estimate of the total numbers of crimes reduced, this is an estimate of the overall rate reduction. So this begs the question when would you prefer my WDD vs the IRR? I will try to answer that below – in short I think David’s estimator makes sense for meta-analyses (as I have said before in reference to the work in Braga & Weisburd, 2020). But for an individual agency doing an experimental evaluation I much prefer my estimator. The skinny of this logic is that we only really care about the overall crime reduction estimate from a cost-benefit analysis perspective. Backing out this total crime reduction count estimate from David’s IRR estimate can result in some funny business for an individual study.

Identifying Assumptions

So there are really two different assumptions my WDD estimator and David’s IRR estimator make. To generate a standard error estimate around the point estimate for either estimator, both require the data are Poisson distributed. So that makes no difference between the two. The assumption that really distinguishes between the WDD and the IRR estimate is the parallel trends assumption. The WDD assumes parallel trends are on the linear scale, whereas the IRR assumes parallel trends are on the ratio scale.

What exactly does this mean? Imagine we have a treated and control area, but look at the crime trends per time period before the treatment occurred. This set of areas has a set of parallel trends on the linear scale:

Time Treated Control
 0     50      60
 1     40      50
 2     45      55
 3     50      60

When the treated area goes down by 10 crimes, the control area goes down by 10 crimes. That is a parallel on the linear scale. Whereas this scenario is parallel on the ratio scale:

Time Treated Control
 0     50      60
 1     40      48
 2     45      54
 3     50      60

When crime goes down by 20% in the treated area, it goes down by 20% in the control area.

So while this gives a potential way to say you should use the WDD (parallel on the linear scale), or the IRR (parallel on the ratio scale), in practice it is not so simple. For one thing, if you only has the pre/post counts of crime, you cannot distinguish between these two scenarios. You can only tell in the case you have historical data to examine.

For a second part of this, you typically can choose your own control area (see for example the synthetic control estimator). So in most scenarios you could choose a control area to obey the linear or the ratio parallel trends assumption if you wanted to. However it may be in many scenarios there is a natural/easy control area, and you may see what is a better fit in that case for linear/ratio.

A final wee bit of a perverse aspect about this I will mention – pretend we have a treated/control area have approximately the same baseline crime counts/rates:

Time Treated Control
 0      30     30
 1      25     25
 2      20     20
 3      25     25

You actually cannot tell in this scenario whether the parallel trends are on the linear scale for my WDD or the ratio scale for the IRR estimate. They are consistent with either! In practice I think in many cases it will be like this – with noisy data, if you choose a control area that has approximately the same baseline crime counts, it will be quite hard to tell whether the linear parallel trends makes more sense or the ratio parallel trends makes more sense.

There are situations where the linear changes do not make sense, but they tend to be scenarios such as the control area has very little crime (so cannot go below 0 to match larger ups/downs in the treated area). So in that case sure the IRR is plausible and the WDD is not, but those are cases where the control area itself is quite questionable. Also note the IRR is not defined for any cells with 0 crimes – but again it is not good for either of our estimators in that case (although mine won’t fail to spit out a number, the power is so low the number it spits out won’t be worth much).


So I have adapted the same simulation code I used in prior studies/blog posts to evaluate the null distribution and the coverage of David’s IRR estimator. I partly did not pursue it initially back when me and Jerry were discussing this idea, because I thought it would be biased. Generalized linear models are based on maximum likelihood estimators, which are only asymptotically valid. In short it appears I was wrong here and David’s IRR estimator is fine even with just four observations, at least for the handful of scenarios I have tried it (have not looked at very tiny counts of crime, it is undefined if any cell has 0 crimes, as you cannot take the log of 0).

Python code pasted at the very end of the blog post, but for example if we generate a set of null no changes pre/post with a baseline of 50 crimes, the logged irr estimate (converted into a z-score here) is just fine and dandy and has a very close to standard normal distribution based on 10k simulations.

So lets look at the scenario where the control area doesn’t change, but the treated area goes from 50 to 30. We can see again the point estimate in this scenario is spot on the money.

And then we can see the coverage of the logged irr estimator is spot on as well:

So if you are interested in slightly different baseline scenarios, you can use that same simulation code to check out the behavior of David’s estimator and conduct simulated power analysis the same way I have shown for the WDD estimator in prior blog posts.

So if both are unbiased and have good coverage again, why would you prefer the WDD estimator over the IRR estimator (or vice-versa)? Well, lets take the 35% reduction I talked about at the beginning of the post, and the department needs to spend $250k on extra officers to conduct whatever hot spot policing intervention. A 35% reduction may be worth it if we start with a baseline of 200 crimes (so would expect to go down to 130, for a reduction of 70 crimes). If the baseline is 20 crimes, it goes down to 13 crimes (so only a reduction of 7 crimes). The actual benefit of the IRR estimate is entirely dependent on the baseline count of crimes it is applied to.

Even if the IRR estimate is itself unbiased and has proper coverage, for even an individual study backing out the estimated reduction in total crimes from the IRR is biased. So here in this same simulated data (50 to 30 in treated, and 50 to 50 in control areas). The true count reduction is -20, and here is the point estimate on the X axis and the length of the confidence interval for each simulation on the Y axis for my WDD test. You can see they are nicely centered on -20, and the length of the confidence intervals has a very tiny variance – they are mostly just a smidge over 50 in total length. So that is probably tough to wrap your head around, but the variance of the variance estimates for the WDD are small.

Now lets do the same graph for the IRR estimate, but translated back out to a count crime reduction based on the simulated values:

We either have a ton of bias in this estimate (if the estimate of the count reduction is too large, the confidence interval is too small). Or the opposite, the estimate of the count reduction is too small, and the confidence interval is crazy wide. In Andrew Gelman’s terminology, it can result in pretty large type M (magnitude) errors in this simulated example (Gelman & Carlin, 2014). So the variance of the variance estimates in this scenario are quite large.

To be clear – if you are interested in estimating a percent reduction, by all means use David’s IRR estimator. If you however want to translate that percent reduction into an estimate of the total crimes reduced though you should use my WDD estimator in that case. You should not back out a total crimes reduced estimate from the IRR.

Final Thoughts

So I have said a few times I think the IRR estimator makes more sense for meta-analyses. Why do I think that? Well, imagine we have an underlying causal process through which a hot spots policing experiment can randomly deter/prevent a particular proportion of crimes. That underlying causal process suggests an IRR effect. And also the problem I mention with translating back to crime counts I believe should get smaller with tighter estimates.

For a causal process that is more akin to my WDD estimator, imagine some crimes will always be deterred/prevented from a hot spots policing experiment, and some will never be. And we don’t know up-front which is which, so the observed reduction is based on whatever mixture of the two we have at that particular location.

The proportion reduction seems to make more sense to me for active patrol type interventions (which are ephemeral) vs permanent CPTED like interventions which should prevent certain criminal acts in perpetuity. But of course any situation in the real world could have both occurring at the same time.

When you go and look at the meta-analysis of hot spots policing, those interventions are all over the place (Hinkle et al., 2020). I think my WDD estimate would not make sense to mash up into a final meta-analytic estimate. The IRR may not make sense either in the end, but it is plausibly more relevant to compare the IRRs from a study with a baseline of 200 crimes vs one with 40 crimes at baseline. I am not sure it makes sense to compare WDDs in that scenario. But that being said, a few of my blog posts have discussed the WDD normalized per unit area or per unit time. Those normalized estimates are probably more apples to apples in the 200 vs 40 scenario.

A final note I have not discussed here is that David discusses a correction for overdispersion, so that is a potential feather in the cap for his estimator vs the WDD. I’d be a bit hesitant though with that – only four observations to estimate the dispersion term is slicing it a bit thin IMO. But I was wrong about the original estimator, so I may be wrong about that as well. It will take simulation evidence to determine that though – David’s paper just provides the correction term, he doesn’t provide evidence for its utility with small sample data.

And to be fair I have not done simulations to see how my estimator behaves in the presence of overdispersion either. I believe it will simply just cause the standard errors to be too small, so like in Wheeler (2016), I imagine it will just require upping the interval (e.g. use a z-score of 3 instead of 2) to get proper coverage for real crime data.


Other Posts of Interest

Python simulation code

Here is a copy-pasted chunk of the entire python simulation code.

Comparing WDD to log(IRR) from Wilson's
recent paper,

Andy Wheeler

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\wdd_vs_irr'

#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}


#This works for the scipy functions as well

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    # WDD estimates
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    # Wilson log IRR estimates
    true_logirr = np.log( (treat1*cont0) / (cont1*treat0) )
    est_logirr = np.log( ((tr_cr_1/post)*(co_cr_0/pre)) / ( (co_cr_1/post)*(tr_cr_0/pre) ) )
    se_logirr = np.sqrt( 1/tr_cr_1 + 1/co_cr_0 + 1/co_cr_1 + 1/tr_cr_0 )
    z_logirr = est_logirr / se_logirr
    return (tr_cr_0, co_cr_0, tr_cr_1, co_cr_0, est, var_est, true_val, z_score, true_logirr, est_logirr, se_logirr, z_logirr)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
        base['cont1'] = base['cont0']
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['sim_t0','sim_c0','sim_t1','sim_c1','est','var_est','true_val','z_score',
    return pd.concat([base,sim_vals], axis=1)

# Coverage of the log irr estimate
# Lets look at the coverage rate for a decline from 40 to 20
def cover_logirr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    cover = ( data['true_logirr'] > low) & ( data['true_logirr'] < high )
    return cover

# Length of ci for WDD
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return low, high, high - low

# Length of ci for IRR estimate on count scale
# This depends on the baseline estimate to multiply
# The IRR by, using the baseline average of the 
# Treatment area

def len_irr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    baseline = data['sim_t0']/data['pre']
    # Even if you use hypothetical, the variance is quite high
    #baseline = data['treat0']
    est_count = baseline*np.exp(data['est_logirr']) - baseline
    c1 = baseline*np.exp(low) - baseline
    c2 = baseline*np.exp(high) - baseline
    return est_count, c1, c2, np.abs(c2 - c1)

# Example with no change, lets look at the null distribution
sim_n = 10000
no_diff = make_data(sim_n, 50, 50, 50, 50, 1, 1)

# Example with equal time periods, a reduction from 50 to 30 and 50 to 50 in control area
sim_dat = make_data(sim_n, 50, 30, 50, 50, 1, 1)

cl = cover_logirr(sim_dat)

# Compare length of CI for IRR vs WDD

# WDD length
lowdd, highwdd, lwdd = len_ci(sim_dat)

# IRR length on the count scale
est_cnt_irr, lo_irr, hi_irr, ln_irr = len_irr(sim_dat)

# Scatterplot of estimated count reduction vs
# Length of CI
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(est_cnt_irr, ln_irr, c='k', 
            alpha=0.1, s=4)
ax.set_xlabel('Estimated Count Reduction [IRR]')
ax.set_ylabel('Length of CI on count scale [IRR]')
plt.savefig('IRR_Len_Est.png', dpi=500, bbox_inches='tight')

# Lets compare to the WDD estimate
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(sim_dat['est'], lwdd, c='k', 
            alpha=0.1, s=4)
ax.set_xlabel('Estimated Count Reduction [WDD]')
ax.set_ylabel('Length of CI on count scale [WDD]')
plt.savefig('WDD_Len_Est.png', dpi=500, bbox_inches='tight')

Clumpy hotspots

Read an article by Tim Hart the other day (part of a special issue I will have an article in as well here soon). In it he evaluated hot spot methods not only by how well they forecast crime, but also by the clumpiness of the hot spot method. Some hot spot methods, such as risk terrain modeling (Caplan et al., 2020; Fox et al., 2021), machine learning models (Wheeler & Steenbeek, 2020), or self-exciting point process models (Mohler et al., 2018) can by their nature produce discontinuous hot spots. Here is an example of a RTM map I made in Yoo & Wheeler (2019) for homeless related crime in Los Angeles, and you can see this is quite spotty in the ups/downs in the high risk areas:

Other hot spot methods, like hierarchical clustering (Wheeler & Reuter, 2020) or kernel density maps however this is not as big an issue. Here is an example kernel density map also from Yoo & Wheeler (2019) based on the same data:

So you can see how the hot spots in the kernel density map are spatially contiguous, whereas the RTM example can be little hot spots all over the jurisdiction. So it is obviously easier to patrol a single contiguous area than many islands over the entire jurisdiction. So it may make sense to trade off a contiguous area that captures somewhat fewer crimes than speckled areas that are all over the map.

Adepeju et al. (2016) was the first to use a particular statistic, the clumpiness index, to evaluate different hot spot methods. Their figure below is a pretty good depiction of the idea – count up the number of internal edges to a hot spot (when a hot spot grid-cell neighbors another hot spot), and the number of external edges. Then it is just a particular formula to make the index range from -1 to 1 given different sized hot spots.

So here I flip this idea on its head abit – instead of using a particular hot spot technique and see its clumpiness, I formulate a linear program given a prediction to trade off a smaller number of predicted crimes in the hot spot vs making the hot spot areas more clumpy. I illustrate my clumpy hot spots using just prior data to predict future data, in particular thefts from motor vehicles in Raleigh North Carolina.

I have posted the data/code on github here. It is a bit too long to embed the code directly in the blog post, but just see the file. The crime data and Raleigh border I downloaded from the Raleigh open data website.

A Linear Program to Clump Hot Spots

So for some quick and dirty math in text, the linear program I formulate is:

Maximize { Sum[ theta*S_i*Crime_i + (1 - theta)*E_i ] }
Subject To:
    1) Sum( S_i ) = k
    2) E_i <= Sum(S_n for n in neighbors(i) ) for each i
    3) E_i <= S_i for each i
    4) S_i element of {0,1}, E_i >= 0 (and can be continuous)

The idea behind this is that if theta=1, this is the same as just taking whatever your input areas are and ranking them to pick the top k areas. So if you have 10000 500 by 500 foot grid cells as your spatial units of analysis, and you wanted the top 1% of the city, that is 100 grid cells. So you would choose k=100 in that scenario. Crime_i here I use as prior counts of crime in the grid cell, but it could be the predicted value from whatever model as well. That is the first constraint in this model approach – you need to choose the total area you want. S_i are the decision variables for the final selected hot spot areas.

The second and third constraints determine the values for the second set of decision variables, E_i. These are the decision variables that encode the interconnected links when a selected grid cell touches another grid cell. Constraint 2 sets E_i to the total number of neighbors of i that are selected, except constraint 3 says if S_i is 0 E_i needs to be 0 as well.

In this formulation, S_i need to be integer variables, but the E_i are defined by the sum of S_i, so they can be continuous. In this formulation if you have N grid cells (or whatever spatial units of analysis), this results in 2*N decision variables, and 2*N + 1 constraints. You could maybe save a few constraints here by working with an undirected graph instead of a directed one (in essence this double counts, a-b and b-a would count as two links). But this will just make it 1.5*N constraints instead of 2*N. So not a big deal probably. I did have some issues solving this using pulps default coin/GLPK solver. But CPLEX solved it no problem. (My example is a total of 20,986 500 by 500 foot grid cells, and I use rook contiguity like the Adepeju article as well. And using CPLEX it solves the models in just a few seconds.)

In this formulation you can think of theta as trading off crimes in the hot spot vs interior edges. So imagine you had theta=0.9, and you had a solution with 200 crimes and 100 interior edges. The objective function in that scenario would be 0.9*200 + 0.1*100 = 190. Now imagine you had an alternative scenario with 190 crimes, but 200 internal edges, the objective function would be 0.9*190 + 0.1*200 = 191. So you are saying, it is ok to have hot spots capture a smaller number of crimes, if they are more connected.

Normal Hotspots vs Clumpy Ones in Raleigh

The open data I use for Raleigh, North Carolina for the NIBRS dataset goes back to June 2014, and has data updated in the beginning of March 2021. I pull out larcenies from motor vehicles, and for the historical train dataset use car larcenies from 2014 through 2019 (n = 17,681). For the test dataset I use car larcenies in 2020 and what is available so far in 2021 (n = 3,376). Again these are grid cells generated over the city boundaries at 500 by 500 foot intervals. For illustration I grab out the top 1% of the city (209 grid cells). I use a train/test dataset as out of sample test data will typically result in reduced predictions. Here are the PAI stats for train vs test when selecting the top 1%.

For all subsequent selections I always use the historical training data to select the hot spots, and the test dataset to evaluate the PAI.

If we do the typical approach of just taking the highest crime grid cells based on the historical data, here are the results both for the PAI and the CI (clumpy index). For those not familiar, PAI is % Crime Capture/% Area, so if the denominator is 1%, and the PAI (for the test data) is 17, that means the hot spots capture 17% of the total thefts from vehicles. The CI ranges from -1 (spread apart) to 1 (entirely clustered). Here it is just over 0, suggesting these are basically randomly distributed in terms of clustering.

You may think that almost spatial randomness in terms of clumping seems at odds with that crime clusters! But it is not really – a consistent relationship with crime hot spots is that they are intensely localized, and often you can go down the street and be in a low crime area (Harries, 2006). The same idea when people say high crime neighborhoods often are spotty interior – they tend to have mostly low crime areas and just a few specific hot spots.

OK, so now to show off my linear program. So what happens if we use theta=0.9?

The total crime numbers are here for the historical data, and it ends up capturing the exact same number of crimes as the select top 1% does (3,664). But, it switches the selection of one of the areas. So what happens here is that we have ties – even with basically little weight assigned to the interior connections, it will prioritize tied crime areas to be connected to other chosen hot spots (whereas before the ties are just random in the way I chose the top 1%). So if you have many ties at the threshold for your hot spot, this is a great way to prioritize particular tied areas.

What happens if we turn down theta to 0.5? So this is saying you would trade off one for one – one interior edge is equal to one crime.

You can see that it changed the selections slightly more here, traded off 24 areas compared to the original just rank solution. Lets check out the map and the CI:

The CI value is now 0.17 (up from 0.08). You can see some larger blobs, but it is still pretty spread apart. But the reduction in the total number of crimes captured is pretty small, going from a PAI of 17 to now a PAI of 16. How about if we crank down theta even more to 0.2?

This trades off a much larger number of areas and total amount of crime – over half of the chosen grid cells are flipped in this scenario. In the subsequent map you can see the hot spots are much more clumpy now, and have a CI of 0.64.

The PAI of 12.6 is a bit of a hit as well, but is not too shabby still. I typically take a PAI of 10 to be the ballpark of what is reasonable based on Weisburd’s Law of Crime Concentration – 5% of the areas contain 50% of the crime (which is a PAI of 10).

So this shows one linear programming approach to trade off clumpy chosen areas vs disconnected speckles over the map. It may be the case though that other approaches are more reasonable, such as using some type of clustering to begin with. E.g. I could use DBSCAN on the gridded predicted values (Wheeler & Reuter, 2020) as see how clumpy those hot spots are. This approach is nice though if you have a fixed area you want to cover though.

Why Raleigh?

For a bit of personal news, I will be moving to the Raleigh area here shortly. I recently negotiated to be 100% remote at my job – so I will still be at HMS (or since we were recently purchased I might be employed by Gainwell I guess by the time I move). So looking forward to the new adventure back on the east coast but still in more temperate climates than PA or NY!