Resources of interest for criminologists and crime analysts

I tend to get about one email per week asking for help. The majority of folks are either students asking for general research advice, or individuals who came across my webpage asking for advice about code.

This is great, and everyone should feel free to send me an email. The utility of answering these questions (for everyone) is likely greater than spending that time working on a paper, so I do not mind at all. I can currently keep up with the questions given the volume (but not by much, and it depends on how busy I am with other work/family things). Worst case I will send an email response that says sorry, I cannot respond to this anytime soon.

Often, though, there are other forums for posting questions that are ultimately better. One, I participate in many of these, so it is not like sending an email just to me, it is like sending an email to me plus 40 other people who can answer your question. Also, from my perspective it is better to answer a question once in one of these forums than to repeat the same answer a dozen different times. (Many times I write a blog post if I get the same question multiple times.)

While the two groups overlap a bit, I separate out resources aimed at criminologists (who are typically more interested in research and are current master's/PhD students) from those aimed at crime analysts, who are embedded in a criminal justice agency.

For Criminologists

For resources on where to ask questions, Jacob Kaplan recently created a Slack channel. It has been joined by a variety of criminologists, folks in think tanks/research institutes, current graduate students, and some working crime analysts. It is new, but you can go and peruse the topics so far; they are pretty wide in scope.

In that forum you can really ask about anything crim related; the remaining resources are more devoted to programming/statistical analysis.

If you are interested in statistical or programming questions, I used to participate on StackOverflow, Cross Validated (the stats site), and the GIS site. They are good places to check out prior answers, and worth asking a question on occasion. For tricky python or R coding questions that are small in scope, StackOverflow is excellent. Anything more complicated is more hit or miss.

Many programming languages have their own question boards. Stata and SPSS are the ones I am familiar with, and questions there tend to receive good responses (I still actively participate in the SPSS board). If I’m interested in learning some new command/library in Stata, I often just search the forum for posts related to it to see it in the wild.

For programming questions, it is often useful to create a minimal reproducible example to describe the problem, show what the input data looks like and how you want the output data to look like. (In fact on the forums I link to you will almost always be asked explicitly to do that.)
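For instance, a minimal reproducible example for a simple data wrangling question might look like the sketch below (the data and desired output are made up just for illustration):

import pandas as pd

#a few rows of fake input data that mimic the structure of the real data
dat = pd.DataFrame({'offense': ['burglary','robbery','burglary'],
                    'year':    [2019, 2019, 2020]})

#desired output: a count of each offense per year
print(dat.groupby(['year','offense']).size())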

For Crime Analysts

In similar spirit to the crim slack channel, Police Rewired has a Discord group for crime analysts (not 100% sure who started it, Andreas Varotsis is one of the people involved though). So that was founded by some UK analysts, but there are US analysts participating as well (and the problems folks deal with are very similar, so no real point in making a distinction between US/UK).

For crime analysts in the US, you should likely join either the IACA or a local crime analyst network. Many of the local ones come bundled, so if you join the Texas analyst network TXLEAN you also automatically get an IACA membership. To join is cheap (especially for current students). IACA has also started a user question forum as well.

For folks looking to get an entry level gig, the IACA has a job board that is really good. So it is worth the $10 just for that. They have various other intro resources as well. For current BA/Masters students who are looking to get a job, I also suggest applying to private sector analyst jobs. They are mostly interchangeable with a crime analyst role. (Think more Excel jockey than writing detailed statistical programming.)

How I learn to code

What prompted this blog post is that I’ve been asked by maybe 5 different people in the past month or so for resources to learn statistical programming. And honestly I do not have a good answer; I’ve never really sat down with a book and learned a statistical software (I tried on a few occasions and failed). I’m always just project focused.

So I wanted to do an example conjunctive analysis, or deep learning with pytorch, or use conformal prediction intervals to generate synthetic control estimates, etc. So I just sat down and figured out how to do those specific projects using various resources around the internet. One of my next personal projects is to estimate prediction intervals for logistic multilevel models using Julia (based on this very nice set of intros to the language). I also need to get a working familiarity with Tableau. (Both are related to projects I am doing at work.) So expect to see a Tableau dashboard on the blog sometime in the near future!

Also, many statistical programming languages are pretty much interchangeable for the vast majority of tasks people do. You can see that I have example blog posts for Excel, Access/SQL, R, SPSS, Stata, python, and ArcGIS. Just pick one and figure it out for a particular project.

For criminologists, I have posted my PhD research course materials, and for crime analysts I have posted my GIS class and my Crime Analysis course materials (although the GIS course is already out of date; it uses Arc Desktop instead of ArcPro). I don’t suggest you sit down and go through the courses page-by-page though. What I suggest instead is to look at the table of contents, see if anything strikes your fancy, read that particular lecture/code, and if you want to apply it to your own projects try to work it out. (At least that is how I go about learning coding.)

If you want more traditional learning materials for learning how to do code (e.g. textbooks or online courses), I suggest you ask folks on those forums I mentioned. They will likely be able to provide much better advice than I would.

To end, it is totally normal to want to ask questions, get advice, or get feedback. In both my experience in academia and in crime analysis it can be very lonely (I was in a small department, so I was the only analyst). Folks on these forums are happy to help and connect.

Using Association Rules to Conduct Conjunctive Analysis

I’ve suggested to folks a few times in the past that a popular analysis in CJ, called conjunctive analysis (Drawve et al., 2019; Miethe et al., 2008; Hart & Miethe, 2015), could be automated in a fashion using a popular machine learning technique called association rules. So I figured a blog post illustrating it would be good.

I was motivated by some recent work by Nix et al. (2019) examining officer involved injuries in NIBRS data. So I will be doing a relevant analysis (although not as detailed as Justin’s) to illustrate the technique.

This ended up being quite a bit of work. NIBRS is complicated, and I had to do some rewrites of finding frequent itemsets to not run out of memory. I’ve posted the python code on GitHub here. So this blog post will be just a bit of a nicer walkthrough. I also have a book chapter illustrating geospatial association rules in SPSS (Wheeler, 2017).

A Brief Description of Conjunctive Analysis

Conjunctive analysis is more of an exploratory technique for examining high cardinality categorical sets. Or in other words, you search through a database of cases that have many categories to find “interesting” patterns. It is probably easier to see an example than for me to describe it. Here is an example from Miethe et al. (2008):

You can see that here they are looking at characteristics of drug offenders, and then trying to identify particular sets of characteristics that influence the probability of a prison sentence. This is easy to do in one dimension, but it gets very difficult in multiple dimensions.

Association rules were created for a very different type of problem – identifying common sets of items that shoppers buy together at the same time. But you can borrow that work to aid in conducting conjunctive analysis.

Data Prep for NIBRS

So here I am using 2012 NIBRS data to conduct analysis. Like I mentioned, I was motivated by the Nix and company paper examining officer injuries. They were interested in specifically examining officer involved injuries, and whether the perception that domestic violence cases were more dangerous for officers was justified.

For brevity I only ended up examining five different variable sets in NIBRS (Justin has quite a few more in his paper):

  • assault (or injury) type V4023
  • victim/off relationship V4032
  • ucr type V2006
  • drug use V2009 (also includes computer use!)
  • weapon V2017

All of these variables have three different item sets in the NIBRS codes, and many categories. You will have to dig into the python code on the GitHub page to see how I recoded these variables.

Also maybe of interest, I have some functions to do one-hot encoding of wide data. A benefit of NIBRS is that you can have multiple crimes in one incident, e.g. one incident in which both an assault and a burglary occur. I do the analysis in a way that common co-occurring crimes would pop out.
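For a rough idea of what that looks like, here is a minimal sketch of one-hot encoding wide incident data with multiple offenses per incident (the column and code names are made up for illustration; they are not the actual NIBRS variables or my GitHub functions):

import pandas as pd

#toy wide incidents, up to two UCR offenses per incident
dat = pd.DataFrame({'incident': [1, 2, 3],
                    'ucr1': ['Assault', 'Burglary', 'Assault'],
                    'ucr2': ['Burglary', None, None]})

#stack the offense columns, dummy code them, then take the max per incident,
#so an incident with both an assault and a burglary gets a 1 in each column
dummies = (pd.get_dummies(dat[['ucr1','ucr2']].stack(), prefix='ucr')
             .groupby(level=0).max())
print(dat[['incident']].join(dummies))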

Don’t take this as very formal though. Justin’s paper, which used 2016 NIBRS data, only had 1 million observations, whereas here I have over 5 million (so somewhere along the way Justin and I are using different units of analysis). Also Justin incorporates dozens of other variables into the analysis that I don’t here.

It ends up being that with just these variables (and the reduced sets of codes I created), there still end up being 34 different categories in the data.

Analysis of Frequent Item Sets

The first part of conjunctive analysis (or association rules) is to identify common item sets. The work of Hart/Miethe is always pretty vague about how to do this. Association rules takes the simple approach of finding any item set (combination of categories) that meets an arbitrary frequency threshold.

The way you represent the data is exactly how the prior Miethe et al. (2008) example showed: you create a series of dummy 0/1 variables. Then in association rules you look for sets of those dummy variables that all take the value of 1 for the same cases.

The code on GitHub shows this going from the already created dummy variable data. I ended up writing my own function to do this, as I kept getting out of memory errors using the mlxtend library. (I don’t know if this is due to my data being large N with a smaller number of columns.) You can see my freq_sets function to do this.

Typically in association rules you identify item sets that meet a particular support threshold, where support just means the proportion of cases in which those items co-occur (e.g. 20% of assault cases also have a weapon of fists listed). Instead I wrote the code to use a minimum N, which I choose here to be 1,000 cases. (So out of 5 million cases, this is a support of 1/5000.)
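To make the idea concrete, here is a brute force sketch of counting itemsets that meet a minimum N over dummy coded data. (This is just to illustrate the logic; it is not the memory-friendlier freq_sets function in the GitHub code, and the data and column names are made up.)

import pandas as pd
from itertools import combinations

#toy dummy coded cases
dat = pd.DataFrame({'ucr_Assault': [1,1,1,0,1],
                    'weap_Fists':  [1,1,0,0,1],
                    'rel_Known':   [1,0,1,0,1]})

def freq_sets_bf(df, min_n=2, max_k=3):
    #count cases in which every item in the candidate set equals 1,
    #keeping only sets that meet the minimum N threshold
    res = {}
    for k in range(1, max_k+1):
        for combo in combinations(df.columns, k):
            n = int((df[list(combo)].sum(axis=1) == k).sum())
            if n >= min_n:
                res[combo] = n
    return res

print(freq_sets_bf(dat, min_n=2))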

I end up finding a total of 411 frequent item sets in the data that have at least 1,000 cases (out of the over 5 million). Here are a few examples, with the frequencies to the left. So there are over 2,000 cases in the 2012 NIBRS data that had a known relationship between victim/offender, resulted in assault, the weapon used was fists (or kicking), and involved computer use in some way. I only end up finding two itemsets that have 5 categories; there are no larger sets of categories that have at least 1,000 cases in this dataset.

3509    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'ucr_Drug'}
2660    {'rel_Known', 'ucr_Assault', 'weap_Firearm', 'ucr_WeaponViol'}
2321    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'drug_ComputerUse'}
1132    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'weap_Knife'}
1127    {'ucr_Assault', 'weap_Firearm', 'weap_Fists', 'ucr_WeaponViol'}
1332    {'rel_Known', 'ass_Argument', 'rel_Family', 'ucr_Assault', 'weap_Fists'}
1416    {'rel_Known', 'rel_Family', 'ucr_Assault', 'weap_Fists', 'ucr_Vandalism'}

Like I said, I was interested in using NIBRS because of the Nix example. One way we can then examine what variables are potentially related to officer involved injuries during the commission of a crime is to just pull out any itemsets which include the variable of interest, here ass_LEO_Assault.

4039    {'ass_LEO_Assault'}
1232    {'rel_Known', 'ass_LEO_Assault'}
4029    {'ucr_Assault', 'ass_LEO_Assault'}
1856    {'ass_LEO_Assault', 'weap_Fists'}
1231    {'rel_Known', 'ucr_Assault', 'ass_LEO_Assault'}
1856    {'ucr_Assault', 'ass_LEO_Assault', 'weap_Fists'}

So we see there are a total of just over 4000 officer assaults in the dataset. Unsurprisingly almost all of these also had an UCR offense of assault listed (4029 out of 4039).

Analysis of Association Rules

Sometimes just identifying the common item sets is what is of main interest in conjunctive analysis (see Hart & Miethe, 2015 for an example of examining the geographic characteristics of crime events).

But the apriori algorithm is one way to find particular rules of the form “if A occurs then B occurs quite often”, where you can swap more complicated itemsets into the antecedent (A) and consequent (B), and use different ways of quantifying ‘quite often’.

I prefer conditional probability notation to the more typical association rule one, but for typical rules we have (here I use A for antecedent and B for consequent):

  • confidence: P(A & B) / P(A). So if the itemset of just A occurs 20% of the time, and the itemset of A and B together occurs 10% of the time, the confidence would be 50%. (Or more simply, this is the probability of B conditional on A, P(B | A).)
  • lift: confidence(A,B) / P(B). This is a ratio of the confidence from the prior rule (the numerator) to the baseline rate at which B occurs (the denominator). So if B occurs 25% of the time as a baseline, and the confidence of A -> B is 50%, you would then have a lift of 2.

There are other rules as well that folks use, but those are the most common two I am interested in.
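As a quick numerical check of those two definitions, here is a tiny sketch computing confidence and lift from dummy coded data (the numbers are made up, just to mirror the examples above):

import pandas as pd

#toy dummy coded incidents, A the antecedent and B the consequent
dat = pd.DataFrame({'A': [1,1,1,1,0,0,0,0,0,0],
                    'B': [1,1,0,0,1,0,0,0,0,0]})

p_a  = dat['A'].mean()               #P(A) = 0.4
p_b  = dat['B'].mean()               #P(B) = 0.3
p_ab = (dat['A'] & dat['B']).mean()  #P(A & B) = 0.2

confidence = p_ab / p_a              #P(B | A) = 0.5
lift = confidence / p_b              #0.5/0.3 ~= 1.67
print(confidence, lift)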

So for example in this data, if I draw out rules that have a lift of over 2, I find rules like {'ucr_Vandalism', 'rel_Family'} -> {'ass_Argument'} produces a lift of over 6. (I can use the mlxtend implementation here in this code; it was only the frequent itemsets code that was giving me problems.) It ends up being that arguments are listed in the injury codes around 1.6% of the time, but when you have a UCR crime of vandalism, and the relationship between victim/offender is family members, an injury type of argument happens around 10.5% of the time (so 10.5/1.6 ~= 6).

The original use case for this is recommender systems/market basket analysis (so say if you see someone buy A, give them a coupon for B). So this ends up being not so interesting in this NIBRS example, where you have more clear cause-effect type relationships criminologists would be interested in. But I describe in the next section some further potential machine learning models that may be more relevant, and how I might in the future amend the apriori algorithm to examine specific outcomes.

Further Notes

If you are interested in a specific outcome from the get go (so not totally exploratory analysis as here), there are a few different options that may make more sense than association rules.

One is the RuleFit algorithm, which basically just uses a regularized regression to find simple models and low order interactions. An example of this idea using police stop data is in Goel et al. (2016). These end up being very similar to simple decision trees; you can also have continuous covariates in the analysis, and it splits them into binary above/below rules. So you could, say, do an RTM distance analysis and still have it output a rule like “if < 1000 ft, predict high risk”. But they are fit in a way that tends to behave better out of sample than simple decision trees.

Another is fitting a more complicated model, say random forests, and then having reduced form summaries to describe those models. I have some examples of using Shapley values for spatial crime prediction in Wheeler & Steenbeek (2020), but for more if-then type sets of rules you could look at Scoped Rules.

I may need to dig into the association rules code some more though, and try to update the code to take the sample sizes and statistical significance into account for a particular outcome variable. So if you find higher lift in a four set predicting a particular outcome, you search the tree for more sets with a smaller support in the distribution. (I should probably also work on some cool network viz. to look at all the different rules.)



An intro to linear programming for criminologists

Erik Alda made the point the other day on Twitter that we are some of the few crim folks who do anything related to linear programming. I think it is crazy useful – much more so than, say, teaching myself some new regression technique or a programming language.

I don’t quite remember the motivation to learn it. I think I kept seeing repeated applications in papers I read, but was also totally baffled by it; I did not understand people’s notation for it at all. In retrospect that was because it is not statistics. You are optimizing a function by choosing some parameters (there is nothing stochastic about it, so there is no statistical inference). So it is more like finding the min/max of a function in calculus.

I think the best way to think about linear programming is in terms of decision analysis. We have a set of options among which we need to choose some action. So we make the choices that either maximize or minimize some objective, but also take into account constraints on the decisions we can make.

For social scientists, here is an example that hopefully illustrates the difference between statistics and linear programming. Say we are interested in conducting a hot spots policing randomized experiment. So we define our top 20 crime hot spots in the city, and randomly assign 10 of them to receive the hot spots treatment. Linear programming is basically the obverse of this: given our 20 hot spot areas, which are the best 10 locations to choose for our intervention?

You might be thinking this problem as stated is trivial – just rank each of the 20 hot spots by the total number of crimes, and then choose the top 10. Where linear programming really helps though is if you have constraints on the final choices you make. Say you did not want to choose hot spots that are within 1 mile of each other (to spread out the hot spot interventions throughout the city). There is no simple way to sort your hot spots to obey that constraint, but you can encode it in the linear program and have the computer solve it quite easily.
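As a preview of the code later in the post, here is a minimal sketch of that spacing constraint using the pulp library (the crime counts and distances are made up for illustration):

import pulp

#toy example: 5 hot spots with crime counts and pairwise distances in miles
crime = [120, 100, 90, 80, 70]
dist = {(0,1): 0.4, (0,2): 2.0, (0,3): 3.0, (0,4): 2.5,
        (1,2): 0.8, (1,3): 2.2, (1,4): 1.9,
        (2,3): 0.6, (2,4): 1.5, (3,4): 2.1}

P = pulp.LpProblem("HotSpotSelection", pulp.LpMaximize)
D = pulp.LpVariable.dicts("Select", list(range(5)), cat=pulp.LpBinary)

P += pulp.lpSum(D[i]*crime[i] for i in range(5)) #maximize crime covered
P += pulp.lpSum(D[i] for i in range(5)) == 2     #choose 2 hot spots
#spacing constraint: cannot pick both members of any pair under 1 mile apart
for (i, j), d in dist.items():
    if d < 1:
        P += D[i] + D[j] <= 1

P.solve()
print([i for i in range(5) if D[i].varValue == 1])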

There is no shortage of ways you could expand the complexity of this example hot spot decision analysis. Say you had two different types of hot spot treatments, and that they had different efficacy in different areas (one was good for property crime, and the other was better for violent crime). You might think of this as doing two separate decision analyses, where a constraint is that an area can only be assigned one of the two interventions.

Here I will provide some code examples in python using the pulp library to illustrate some more examples using data you can see in action, as well as different ways to think about linear programming problems in practice. (Technically the examples I give are all mixed integer linear programs, as the decision variables are binary 0/1.)

Formulating Objectives and Constraints

For this example I will be simulating data, but imagine a case you are an analyst for the IRS, and you want to determine which business tax returns to audit. We want to audit cases that both have a high probability of being fraudulent, as well as cases in which the total amount of the underpayment is large. (If you want a more typical criminology example, imagine assigning criminal cases to detectives, some cases have more costs, e.g. homicide vs burglary, and some cases have different probabilities of being solvable. This type of decision problem is very common in my experience – pretty much any time you have to make a binary choice, and those choices have variable costs/benefits.)

First I start off by simulating some data (the only libraries we need are numpy and pulp). I simulate 1,000 business tax returns, which have an estimate of the probability they are fraudulent, prob_fraud, and an estimate of the amount they underpaid, underpay_est.

import numpy as np
import pulp

#Simulate data for costs and probabilities

total_cases = 1000
underpay_est = np.random.uniform(1000,100000,total_cases)
prob_fraud = np.random.uniform(0,1,total_cases)
exp_return = prob_fraud*underpay_est


The objective we will be maximizing is the expected return of auditing a tax return, exp_return, which is simply the probability of fraud multiplied by the amount of the underpayment. For a simple example, say we have a case where fraud is estimated to be 50%, and the estimate of the underpayment amount is $10,000. Our expected return for auditing that case is then $5,000.

We need these two estimates external to our linear programming problem, and they themselves can be informed by predictive models (or simpler estimates, e.g. underpayment is ~30% of deductions or something like that).

Now we have all we need to set up our linear programming problem. I am going to choose 100 cases out of these 1000 to audit. Hopefully that code is documented enough to see creating the decision variables (each tax return either gets a 1 if it is chosen, or a 0 if it is not), the original objective function that we are maximizing, and the results.

#Setting up the problem
case_index = list(range(total_cases))
tot_audit = 100

#Basic Problem
P = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
D = pulp.LpVariable.dicts("Decision Variable", [i for i in case_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Objective Function
P += pulp.lpSum( D[i]*exp_return[i] for i in case_index)
#Constraint on total number of cases audited
P += pulp.lpSum( D[i] for i in case_index ) == tot_audit
#Solve the problem
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return
print( (dec_np * exp_return).sum() )
#Should be the same
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )

If you are following along in python, you will see that the total expected return is 7,287,915, and the estimated hit rate (or clearance rate) of the audits is around 0.88.

This example would be no different if we just chose the top 100 cases based on the expected return. But say you thought the hit rate of 88% was too low – you are choosing cases that are big dollar amounts, but not necessarily a very high probability of fraud. So you may say you want your clearance rate to be over 90% overall. That is a simple constraint to add to the above model.

#Updating the problem to constrain on the hit rate
#Above a particular threshold
hit_rate = 0.9
P += pulp.lpSum( D[i]*prob_fraud[i] for i in case_index ) >= hit_rate*tot_audit
#Re-solve the problem with the new constraint
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return is slightly lower than before
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )

So now the total expected return is lower than without the constraint, 7,229,140 (so a reduction of about $60k), but our expected hit rate is just above 90%.

You may be thinking, “why not just eliminate cases with a probability of lower than 90%”, and then amongst those left over select the highest expected return. That meets your constraints, but has a lower expected return than this program! Think of this program as more tit-for-tat. High expected return / lower probability audits can still be selected with this model, but you need to balance them out with some high probability cases in response to tip the scales to meet the overall hit rate objective.

Trade-Offs and the Frontier Curve

So you may be thinking, ok, the trade-off to get a 90% clearance rate was not too bad in terms of total extra taxes collected, so why not set the constraint to 95%? When you add constraints, they can only lower the objective function (or keep it equal, to be more precise). The question then becomes quantifying that trade-off.

You can subsequently vary the hit rate constraint, and see how much it changes the total expected return. Here is an example of doing that, each model only takes around a second to complete.

#Drawing the trade-off in hit rate vs expected return

hit_rate = np.linspace(0.85, 0.95, 30)
total_return = []

#Function to estimate the model
def const_hit_rate(er, prob, tot_n, hr):
    c_index = range(len(er))
    Prob = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
    Dec = pulp.LpVariable.dicts("Decision Variable", [i for i in c_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Prob += pulp.lpSum( Dec[i]*er[i] for i in c_index)
    Prob += pulp.lpSum( Dec[i] for i in c_index ) == tot_n
    Prob += pulp.lpSum( Dec[i]*prob[i] for i in c_index ) >= hr*tot_n
    #Solve the problem, then return the objective and the decision variables
    Prob.solve()
    dec_li = [Dec[i].varValue for i in c_index]
    dec_np = np.asarray(dec_li)
    return pulp.value(Prob.objective), dec_np

for h in hit_rate:
    print(f'Estimating hit rate {h}')
    obj, dec_res = const_hit_rate(exp_return, prob_fraud, 100, h)
    total_return.append(obj)


For this simulated data example, there end up being pretty severe trade-offs in the total return after you get above 91% hit rates, so it may not be worth the trade-off to get a much higher hit rate in practice. It just depends on how much you are willing to trade off one for the other.

There are other ways to formulate this trade off (via bi-objective functions/soft-constraints, or weighted ranking schemes), but the blog post is long enough as is!

Other Potential Applications

So in terms of my work, I have examples of using linear programs to make spatial location decisions, encode fairness constraints into predictive policing, and allocate treatment assignment with network spillovers.

Erik Alda and Joseph Ferrandino have conducted frontier analysis of different criminal justice organizations, which is based on estimating the frontier curve above from data (instead of a pre-specified objective function).

That is about it for criminologists that I know of, but there are plenty of applications of linear programming (or related concepts) to criminal justice topics. It is most popular among operations researchers, of which Laura Albert is one of my favorites. (Criminal justice as a field might not exist if not for Alfred Blumstein, who was also a very influential operations researcher.)

One of the things that makes this different from more traditional quantitative work in the social sciences is that again it is not statistics – we are not testing hypotheses. The contribution is simply formulating the decision problem in a tractable way that can be solved, and the drawing of the trade-offs I showed above.

It is one of the reasons I really like it though – unlike saying how your regression model could be used to inform decisions, this much more explicitly shows the utility of the results of those models in practice.

Conjoint Analysis of Crime Rankings

So part of my recent research mapping crime harm spots uses cost of crime estimates relevant to police departments (Wheeler & Reuter, 2020). But a limitation of this is that cost of crime estimates are always somewhat arbitrary.

For a simple example, those cost estimates are based mostly on people time by the PD to respond to crimes and devote investigative resources. Many big city PDs entirely triage crimes like breaking into vehicles though. So based on PD response the cost of those crimes are basically $0 (especially if PDs have an online reporting system).

But I don’t think the public would agree with that sentiment! So in an act of cognitive dissonance with my prior post, I think asking the public is likely necessary for police to be able to ultimately serve the public’s interest when doing valuations. For some ethical trade-offs (like targeting hot spots vs increasing disproportionate minority contact, Wheeler, 2019) I am not sure there is any other reasonable approach than simply getting a bunch of people’s opinions.

But that being said, I suspected that these different metrics would provide pretty similar rankings for crime severity overall. So while it is criminology 101 that official crime and normative perceptions of deviance are not a perfect 1 to 1 mapping, most folks (across time and space) have largely similar agreement on the severity of different crimes, e.g. that assault is worse than theft.

So what I did was grab some survey ranking of crime data from the original source of crime ranking that I know of, Marvin Wolfgang’s supplement to the national crime victimization survey (Wolfgang et al., 2006). I have placed all the code in this github folder to replicate. And in particular check out this Jupyter notebook with the main analysis.

Conjoint Analysis of Crime Ranks

This analysis is often referred to as conjoint analysis. There are a bunch of different ways to conduct conjoint analysis – some ask folks to create a ranked list of items, others ask folks to choose between a list of a few items, and others ask folks to rate problems on a 1-5 Likert scale. I would guess Likert items are the most common in our field; see for example Spelman (2004) using surveys asking people about disorder problems (and that data is available too, Taylor, 2008).

The Wolfgang survey I use here is crazy complicated, see the codebook, but in a nutshell they had an anchoring question where they assigned stealing a bike to a value of 10, and then asked folks to give a numeric score relative to that theft for a series of 24 other crime questions. Here I only analyze one version of the questionnaire, and after eliminating missing data there are still over 4,000 responses (in 1977!).

So you could analyze those metric scores directly, but I am taking the lazy route and just doing a rank ordering (where ties get the average rank) within person. Then conjoint analysis is simply a regression predicting the rank. See the notebook for a more detailed walkthrough; this just produces the same analysis as looking at the means of the ranks.
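Here is a minimal sketch of that rank-then-regress step (toy data and item names, not the actual Wolfgang items; the notebook linked above has the real analysis):

import pandas as pd
import statsmodels.formula.api as smf

#toy long format data, one row per respondent*item magnitude score
dat = pd.DataFrame({'person': [1,1,1, 2,2,2, 3,3,3],
                    'item':   ['theft','assault','robbery']*3,
                    'score':  [10, 80, 40, 10, 60, 55, 12, 90, 30]})

#within person rank of the scores, ties get the average rank
dat['rank'] = dat.groupby('person')['score'].rank(method='average')

#conjoint analysis as a regression predicting the rank from item dummies
mod = smf.ols('rank ~ C(item)', data=dat).fit()
print(mod.params) #reproduces the differences in the mean ranks across items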

About the only thing I do different here than typical conjoint analysis is that I rescale the frequency weights (just changes the degrees of freedom for standard error estimates) to account for the repeated nature of the observations (e.g. I treat it like a sample of 4000 some observations, not 4000*25 observations). (I don’t worry about the survey weights here.)

To test my assertion that these different ranking systems will be largely in agreement, I take Jerry’s crime harm paper (Ratcliffe, 2015), which is based on sentencing guidelines, and map them as best I could to the Wolfgang questions (you could argue with me some on those assessments – and some questions don’t have any analog, like a company dumping waste). I rescaled the Wolfgang rankings to be in a range of 1-14, same as Jerry’s, instead of 1-25.

Doing a deeper dive into the Wolfgang questions, there are definitely different levels in the nature of the questions you can tease out. Folks clearly take into account both harm to the victim and total damages/theft amounts. But overall the two systems are fairly correlated. So if an analyst wants to make crime harm spots now, I think it is reasonable to use one of these ranking systems, and then worry about getting the public perspective later on down the line.

The Wolfgang survey is really incredible. In this regression framework you can either adjust for other characteristics (e.g. it asks about all the usual demographics) or look at interactions (do folks who were recently victimized up their scores). So this is really just scratching the surface. I imagine if someone redid it with current data many of the metrics would be similar as well, although if I needed to do this I don’t think I would devise something as complicated as this, and would ask people to rank a smaller set of items directly.


  • Ratcliffe, J.H. (2015). Towards an index for harm-focused policing. Policing: A Journal of Policy and Practice, 9(2), 164-182.
  • Spelman, W. (2004). Optimal targeting of incivility-reduction strategies. Journal of Quantitative Criminology, 20(1), 63-88.
  • Taylor, R.B. (2008). Impacts of Specific Incivilities on Responses to Crime and Local Commitment, 1979-1994: [Atlanta, Baltimore, Chicago, Minneapolis-St. Paul, and Seattle].
  • Wheeler, A.P., & Reuter, S. (2020). Redrawing hot spots of crime in Dallas, Texas.
  • Wheeler, A.P. (2019). Allocating police resources while limiting racial inequality. Justice Quarterly, Online First.
  • Wolfgang, M.E., Figlio, R.M., Tracy, P.E., and Singer, S.I. (2006). National Crime Surveys: Index of Crime Severity, 1977.

Nearby Analysis Example (Excel)

The other day on Twitter I made a comment to Joel Caplan about how I would approach analysis with multiple buffers without double counting overlaps. A typical GIS workflow would go:

  • take your points of interest and create buffers
  • join the points to the buffer polygons, and get a count of the crimes of interest

I often do the analysis in a different way though – I do a spatial join of the location of interest to the point features, which gives you a field with the distance to the nearest feature, and then I subsequently do analysis on that distance field. In that workflow, it is much easier to change the size of the buffer for sensitivity analysis, or to conduct analysis on different subsets of data.

To start, I am going to be working with a set of robberies in Dallas (from the open data, not quite 16k) and DART stations (n = 74). (DART is the Dallas above ground train.) You can access the Excel file I am doing the analysis with here. I am using Excel as I often suggest it for undergrad/masters student projects when folks aren’t up to speed with programming – so this is a good illustration of that buffer analysis workflow.

Distance to Nearest

To start, I would typically use a GIS system (or R/python/SQL) to calculate the distance to the nearest object. But I don’t have access to Arc anymore, so I am going to show you a way to do this right in Excel. This only works for projected data (not latitude/longitude), and for calculating point-to-point distances.

So first, to figure out the distance between two points in Euclidean space, we can just use the Pythagorean theorem that you learned in grade school, Distance = sqrt( (x1 - x2)^2 + (y1 - y2)^2 ). Because we are doing this in an Excel spreadsheet and want to find the nearest Dart station to the robbery, we will use a little array formula magic. I named my table of Dart locations Dart, and so the array formula to find the nearest distance in Excel is:

=MIN( SQRT( (B2 - Dart[X])^2  + (C2 - Dart[Y])^2))

When you enter this formula, hit Ctrl + Shift + Enter, else it just returns the distance to the first Dart station. If you did this right, you will see the formula have {} brackets around it in the formula bar.

Distance will be defined in whatever the projected units are in – here they are in feet. But by using MIN with the array, it returns the distance to the nearest station. To get the ID of the associated station, we need to do a similar formula (and this only works with numeric ID fields). You can basically do an array IF formula, and the only station this is true for will be the MAX of that array. (Again hit Ctrl + Shift + Enter when finishing off this cell calculation instead of just Enter.)

=MAX(IF(F2=SQRT((B2 - Dart[X])^2  + (C2 - Dart[Y])^2), Dart[DartID],0))

User beware – this runs super fast on my machine (surprisingly) but it is quite a few computations under the hood. For much larger data again use a GIS/database/Stat program to do these calculations.

Using Pivot Tables to do Buffer Analysis

So now that we have those distance fields, it is easy to write a formula to, say, count up the robberies within 1000 feet. You can do another IF formula that is something like IF([@Distance] < 1000, 1, 0).

And then go ahead and make a pivot table, and put the DartID as the rows, and the Within distance field you just made as the values (to sum in the pivot table).

Then bam, you have your buffer analysis. Here I sorted the pivot table so you can see the highest crime Dart is 12. (I haven’t looked up which one this is, you can use Excel though to map them out).

So say you wanted to change the buffer size? It is as simple as changing the 1000 in the prior formula to a different value. One thing I like to do though is to make a lookup table to define different bins. You can see I named that table BuffTable (naming the tables makes it easier to refer to them later in array formulas; also I shifted down the pivot table so I don’t accidentally overwrite it later).

And now I use a combination of MATCH to find what row it falls into for this table, and INDEX to return the row label I want. So first I have =MATCH([@Distance],BuffTable[Within Bounds],1). This is very similar to VLOOKUP, and will match to the row that the distance is less than.

This just returns the row number of the match though – I want to pipe in those nicer labels I made. To do that, I nest the match results within index, =INDEX(BuffTable, MATCH([@Distance],BuffTable[Within Bounds],1)+1, 2). And voila, I get my binned data.

Now we can do our pivot table so the columns are the new field we just made (make sure to refresh the pivot table).

And we can do our buffer analysis with varying buffers. Just update the table to however you want the buffers defined, hit refresh, and everything will be updated. (I should have made the labels so they are ordered a bit more nicely in the pivot table.)

I like this approach for students, as it is easy to pivot/filter on other characteristics as well. Want to get arrest rates around areas? Want to see changes in crimes nearby different DART stations over time? It is just a few formulas/filters and a pivot table away in this format.

Distance to Nearest Analysis for DART stations

Another analysis I think is useful is to look at the cumulative distance analysis. I got this idea from a paper of Jerry Ratcliffe’s.

So what you can do is round the distance data, e.g. using a formula that rounds the data to every 300 feet.
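One way to do that rounding, assuming the distance field is named Distance, would be something like =MROUND([@Distance],300), although any similar rounding formula works.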


And then you can make a pivot table of the rounded counts. Here I also did additional stats to calculate the spatial density of the points, and show the distance decay curve.

Jerry’s paper I linked to looks for change points – I think that idea is somewhat misleading though. It does look like a change point in Jerry’s graphs, but that is a function of the binning I think (see this Xu/Griffiths paper, same method, finer bins, and shows a more smooth decay).

So here I tied the round function to a cell, and again I can just update the value to a different bin size and everything gets auto-updated in the spreadsheet. Here is a bin size of 100 feet, which introduces some volatility in the very nearby locations, but you can see it still pretty much follows that smooth distance decay effect.

Actually the Xu/Griffiths paper looks at the street network distance, which I think makes more sense. (And again need a GIS to do that analysis.) The buffer areas can behave funny, and won’t have a direct relationship to the street length exposure, so I think the typical Euclidean analysis can be misleading in some cases. I will need to save that for another blog post though!

The Failed Idea Bin: Temporal Aggregation and the Crime/Stop Relationship

A recent paper by the Hipp/Kim/Wo trio analyzing robbery at very fine temporal scales in NYC reminded me of a failed project I never quite worked out to completion. This project was about temporal aggregation bias. We talk about spatial aggregation bias quite a bit, which I actually don’t think is that big of a deal for many projects (for reasons discussed in my dissertation).

I think it is actually a bigger deal though when dealing with temporal relationships, especially when we are considering endogenous relationships between crime and police action in response to crime. This is because it is a countervailing endogenous relationship – most endogenous relationships are positively correlated, but here we think police do more stuff (like arrests and stops) in areas with more crime, and that crime falls in response.

I remember the first time I thought about the topic was when I was working with the now late Dennis Smith and Robert Purtell as a consultant for the SQF litigation in NYC. Jeff Fagan had some models predicting the number of stops in an area, conditional on crime and demographic factors at the quarterly level. Dennis and Bob critiqued this as not being the right temporal aggregation – police respond to crime patterns much faster than at the quarterly level. So Jeff redid his models at the monthly level and found the exact same thing as he did at the quarterly level. That however just raises the question of whether monthly is the appropriate temporal resolution.

So to try to tackle the problem I took the same approach as I did for my dissertation – I pretend I know what the micro level equation looks like, and then aggregate it up, and see what happens. So I start with two endogenous equations:

crime_t1 = -0.5*(stops_t0) + e_c
stops_t1 =  0.5*(crime_t0) + e_s

And then aggregation is just a sum of the micro level units:

Crime_T = (crime_t1 + crime_t0)
Stops_T = (stops_t1 + stops_t0)

And then what happens when we look at the aggregate relationship?

Crime_T = Beta*(Stops_T)

Intuitively you may see where this is going. Since crime and stops have the exact same countervailing effects on each other, they cancel out if you aggregate up one step. I show in the paper, however, that if you aggregate up more than two temporal units in this situation, the positive effect wins. The reason is that back substitution for prior negative time series relationships oscillates (so a negative covariance at t-1 is a positive covariance at t-2), and in the aggregate the positive swamps the negative relationship. Even estimating Crime_T = Beta*(Stops_T-1) does not solve the problem. These endogenous auto-regressive relationships actually turn into an integrated series quite quickly (a point that cannot be credited to me; Clive Granger did a bunch of related work).
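If you want to play with this yourself, here is a small simulation sketch of that aggregation argument (this is toy numpy code written for the blog, not the code from the linked draft):

import numpy as np

rng = np.random.default_rng(10)

def sim_micro(t_periods, n_areas, effect=0.5):
    #micro level system: stops respond positively to last period's crime,
    #crime responds negatively to last period's stops
    crime = np.zeros((t_periods, n_areas))
    stops = np.zeros((t_periods, n_areas))
    for t in range(1, t_periods):
        crime[t] = -effect*stops[t-1] + rng.normal(size=n_areas)
        stops[t] =  effect*crime[t-1] + rng.normal(size=n_areas)
    return crime, stops

def agg_corr(crime, stops, agg):
    #sum micro periods into blocks of size agg, then correlate crime/stops
    t = (crime.shape[0] // agg) * agg
    c = crime[:t].reshape(-1, agg, crime.shape[1]).sum(axis=1)
    s = stops[:t].reshape(-1, agg, stops.shape[1]).sum(axis=1)
    return np.corrcoef(c.ravel(), s.ravel())[0, 1]

crime, stops = sim_micro(1200, 50)
for agg in [1, 2, 4, 12, 52]:
    print(agg, round(agg_corr(crime, stops, agg), 3))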

So this presented a few hypotheses. One, since I think short run effects for stops and crime are more realistic (think the crackdown literature), the covariance between them at higher resolutions (say monthly) should be positive. You should only be able to recover the deterrent effect of stops at very short temporal aggregations I think. Also crime and stops should be co-integrated at large temporal aggregations of a month or more.

Real life was not so convenient for me though. Here I have the project data and code saved. I have the rough draft of the theoretical aggregation junk here for those interested. Part of the reason this is in the failed idea bin is that neither of my hypotheses appears to be true with the actual crime and stop data. For the NYC citywide data I broke up stops into radio-runs and not-radio-runs (less discretion for radio runs, but still should have similar deterrent effects), and crimes as Part 1 Violent, Part 1 Non-Violent, and Part 2. More recently I handed it off to Zach Powell, and he ran various vector auto-regression models at the monthly/weekly/daily/hourly levels. IIRC it was pretty weak sauce evidence that stops at the lower temporal aggregations showed greater evidence of reducing crime.

There of course is a lot going on that could explain the results. Others have found deterrent effects using instrumental variable approaches (such as David Greenberg’s work using Arellano-Bond, or Wooditch/Weisburd using Bartik instruments). So maybe my idea that spatial aggregation does not matter is wrong.

Also there is plenty of stuff going on specifically in NYC. We had the dramatic drop in stops due to the same litigation. Further work by MacDonald/Fagan/Geller have shown stops that met a higher reasonable suspicion standard based on the reported data have greater effects than others (essentially using Impact zones as an instrument there).

So it was a question I was never able to figure out – how to correctly identify the right temporal unit to examine crime and deterrence from police action.

Some additional plots to go with Crime Increase Dispersion

So Jerry nerdsniped me again with his Crime Increase Dispersion statistic (Ratcliffe, 2010). The main motivation for this post is that, to be frank, I don’t find that stat very intuitive. So here are some alternate plots, based on how counts of crime approximately follow a Poisson distribution. These get at the same question as Jerry’s work: is a crime increase (or decrease) uniform across the city or specific to a few particular sub-areas?

First, in R I am going to simulate some data. This creates a set of data that has a constant increase over 50 areas of 20%, but does the post crime counts as Poisson distributed (so it isn’t always exactly a 20% increase). I then create 3 outliers (two low places and one high place).

#Setting up the simulation
n <- 50
low <- 10
hig <- 400
inc <- 0.2
c1 <- trunc(runif(n,low,hig))
c2 <- rpois(n,(1+inc)*c1)
#Putting in 2 low outliers and 1 high outlier
c2[5] <- c1[5]*0.5
c2[10] <- c1[10]*0.5
c2[40] <- c1[40]*2
#data frame for ggplot
my_dat <- data.frame(pre=c1,post=c2)

The first plot I suggest is a simple scatterplot of the pre-crime counts on the X axis vs the post-crime counts on the Y axis. My make_cont function takes those pre and post crime counts as arguments and creates a set of contour lines to put as a backdrop to the plot. Points within those lines support the hypothesis that the area increased in crime at the same rate as the overall crime increase, taking into account the usual ups and downs you would expect with Poisson data. This is very similar to mine and Jerry’s weighted displacement difference test (Wheeler & Ratcliffe, 2018), and uses a normal based approximation to examine the differences in Poisson data. I default to plus/minus three because crime data tends to be slightly over-dispersed (Wheeler, 2016), so coverage with real data should be a little better (although that is not necessary here).

#Scatterplot of pre vs post with uniform 
#increase contours
library(ggplot2)

make_cont <- function(pre_crime,post_crime,levels=c(-3,0,3),lr=10,hr=max(pre_crime)*1.05,steps=1000){
    #calculating the overall crime increase
    ov_inc <- sum(post_crime)/sum(pre_crime)
    #Making the sequence on the square root scale
    gr <- seq(sqrt(lr),sqrt(hr),length.out=steps)^2
    cont_data <- expand.grid(gr,levels)
    names(cont_data) <- c('x','levels')
    cont_data$inc <- cont_data$x*ov_inc
    cont_data$lines <- cont_data$inc + cont_data$levels*sqrt(cont_data$inc)
    return(cont_data)
}

contours <- make_cont(c1,c2)

eq_plot <- ggplot() + 
           geom_line(data=contours, color="darkgrey", linetype=2, 
                     aes(x=x,y=lines,group=levels)) +
           geom_point(data=my_dat, shape = 21, colour = "black", fill = "grey", size=2.5, 
                      alpha=0.8, aes(x=pre,y=post)) +
           scale_y_continuous(breaks=seq(0,500,by=100)) +
           coord_fixed() +
           xlab("Pre Crime Counts") + ylab("Post Crime Counts")
           #scale_y_sqrt() + scale_x_sqrt() #not crazy to want square root scale here

#weighted correlation to view the overall change
cov.wt(my_dat[,c('pre','post')], wt = 1/sqrt(my_dat$pre), cor = TRUE)$cor[1,2]

So places that are way outside the norm here should pop out, either for increases or decreases. This will be better than Jerry’s stats for identifying outliers in lower baseline crime places.

I also show how to get an overall index based on a weighted correlation coefficient on the last line (as it can technically return a value within (-1,1), you might square it for a value within (0,1)). But I don’t think the overall metric is very useful – it has no operational utility for a crime department deciding on a strategy. You always need to look at the individual locations, no matter what the overall index metric says. So I think you should just cut out the middle man and go straight to these plots. I’ve had functionally similar discussions with folks about Martin Andresen’s S index metric (Wheeler, Steenbeek, & Andresen, 2018) – just make your graphs and maps!

An additional plot that basically takes the above scatterplot and turns it on its side is a Poisson version of a Bland-Altman plot. Traditionally this plot shows the difference of two measures on the Y axis, and the average of the two measures on the X axis. Here, to make the measures have the same variance, I divide the post-pre crime count differences by sqrt(post+pre). This is then like a Poisson Z-score, taking into account the null of an equal increase (or decrease) in crime stats among all of the sub-areas. (Here you might also use the Poisson e-test to calculate p-values of the differences, but the normal based approximation works really well for, say, crime counts of 5+.)

#A take on the Bland-Altman plot for Poisson data

ov_total <- sum(my_dat$post)/sum(my_dat$pre)
my_dat$dif <- (my_dat$post - ov_total*my_dat$pre)/sqrt(my_dat$post + my_dat$pre)
my_dat$ave <- (my_dat$post + my_dat$pre)/2

ba_plot <- ggplot(data=my_dat, aes(x=ave, y=dif)) + 
           geom_point(shape = 21, colour = "black", fill = "grey", size=2.5, alpha=0.8) +
           scale_y_continuous(breaks=seq(-8,6,by=2)) +
           xlab("Average Crime") + ylab("Z-score (Equal Increase)")


#false discovery rate correction
my_dat$p_val <- pnorm(-abs(my_dat$dif))*2 #two-tailed p-value
my_dat$p_adj <- p.adjust(my_dat$p_val,method="BY") #BY correction since can be correlated
my_dat <- my_dat[order(my_dat$p_adj),]
my_dat #picks out the 3 cases I adjusted

So again places with large changes that do not follow the overall trend will pop out here, both for small and large crime count places. I also show here how to do a false-discovery rate correction (same as in Wheeler, Steenbeek, & Andresen, 2018) if you want to actually flag specific locations for further investigation. And if you run this code you will see it picks out my three outliers in the simulation, and all other adjusted p-values are 1.

One thing to note about these tests is that they are conditional on the observed overall citywide crime increase. If it happens that only one area increased by a lot, it may make more sense to set these hypothesis tests to a null of being equal over time. If you see that one area is way above the line and a ton are below the line, that would indicate this scenario. To set the null to no change in these graphs, for the first one just pass in the same pre estimates for both the pre and post arguments in the make_cont function. For the second graph, setting ov_total <- 1 would do it.


  • Ratcliffe, J. H. (2010). The spatial dependency of crime increase dispersion. Security Journal, 23(1), 18-36.
  • Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
  • Wheeler, A. P., & Ratcliffe, J. H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 11.
  • Wheeler, A. P., Steenbeek, W., & Andresen, M. A. (2018). Testing for similarity in area‐based spatial patterns: Alternative methods to Andresen’s spatial point pattern test. Transactions in GIS, 22(3), 760-774.

Setting the threshold for bail decisions

I am at it again on the bail reform stuff. So we have critics of algorithms in place of bail saying that these bail reforms are letting too many people out (see Chicago and NYC). We also have folks on the other side saying such systems are punishing too many people (the Philly piece is more about probation, but the critique applies the same to pre-trial algorithms). So how can risk assessment algorithms be both too harsh and too lenient at the same time? The answer is that how harsh or lenient they are is not intrinsic to the algorithm itself; people can choose the threshold to be either.

At its most basic, a risk assessment algorithm provides an estimate of future risk over some specific time horizon. For example, after being scored for a pre-trial risk assessment, the algorithm may say an individual has a 30% probability of committing a crime within the next 3 months. Is 30% low risk, and so they should be released? Or is 30% high risk? The majority of folks do not go on to commit a new offense while awaiting trial, so 30% may actually be really high relative to most folks being assessed. So how exactly do you decide the threshold – at what % is it too high, so they should be detained pre-trial?

For an individual cost-benefit analysis, you consider the costs of pre-trial detainment (the physical costs to detain a person for a specific set of time, as well as the negative externalities of detention for individuals) vs the cost to society if a future crime occurs. So for the 30% mark, say that the average crime cost to society is $10,000 (this is ignoring that violent crimes cost more than property crime; in practice you can get individual probability estimates for each and combine them together). The expected cost if we let this person go would then be $10,000*0.3 = $3,000. Whether that $3,000 risk is worth the pre-trial detention depends on how exactly we value detention. Say we have on average 90 days of pre-trial detention, and the cost is something like $200 up front fixed costs plus $50 per day. We would then have a cost to detain this person of $200 + $50*90 = $4,700. From that individual perspective, it is not worth it to detain that person pre-trial. This is a simplified example, e.g. it ignores any negative externality cost of detaining (e.g. lost wages for individuals detained), but setting the threshold risk for detaining or releasing on recognizance (ROR) should look something like this.
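That back of the envelope calculation is simple enough to sketch in a few lines (using the numbers from the example above):

#individual cost-benefit sketch for the example above
crime_cost = 10000                        #average cost of a new crime to society
p_crime = 0.3                             #predicted probability of a new offense if released
detain_cost = 200 + 50*90                 #fixed cost plus $50 per day for 90 days, $4,700

expected_crime_cost = p_crime*crime_cost  #$3,000 expected cost if released
print(expected_crime_cost, detain_cost)   #3000 < 4700, so release in this example

#break-even probability: only detain when the predicted risk exceeds this
print(detain_cost / crime_cost)           #0.47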

One of the things I like about this metric is that it places at the forefront how another aspect of bail reform – the time to trial – impacts costs. If you reduce the time to trial, for those ROR’d you reduce the time horizon of risk. (Those only out 1 month have less time to recidivate than those that have to wait 6 months.) It also reduces the cost of detaining individuals pre-trial (it costs less for both the state and the individual to only be in jail 1 month vs 6 months). It may be that, to make everyone happy, we should reduce the queue so we can get people to trial faster. (But in the short term I think risk assessment tools have a great upside to both increase public safety as well as reduce imprisonment costs.)

Evaluating Overall Crime Effects

Part of the reason folks are not happy with the current bail reforms is that they think it is increasing overall crime (see this example in Dallas). Here is an example graph though that folks doing these assessments should be providing, both in terms of up-front establishing the threshold for risk, as well as evaluating the efficacy of the risk assessment tool in practice.

I will use the data from my prior posts on false positives to illustrate. For the graph, you place the cumulative number of people detained on the X axis, and the expected number of crimes that your model thinks will be committed by those folks who are ROR’d on the Y axis. So a simplified table may be:

Person Pr(Crime)
   A     0.5
   B     0.4
   C     0.3
   D     0.2
   E     0.1

If we let all of these folks go, we would expect they commit a total of 1.5 crimes (the sum of the percent predicted crime column) forecasted per our risk assessment algorithm. If we detained just person A, we have 1 in the detain column, and then a cumulative risk for the remaining folks of 1 (the sum of the predicted crime column for those that are remaining and are not detained). So then we go from the above table to this table by increasing the number of folks detained one-by-one.

Detained ExpectedCrimes
   0          1.5
   1          1.0
   2          0.6
   3          0.3
   4          0.1
   5          0
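
Here is a minimal sketch of how you could build that detained vs expected crimes table from a column of predicted probabilities (just the five hypothetical people above, but the same logic scales to a full sample), plus picking the smallest number detained that keeps expected crimes under some cap:

import numpy as np
import pandas as pd

#predicted probability of a new offense for each person (hypothetical values from the table above)
p_crime = np.array([0.5, 0.4, 0.3, 0.2, 0.1])

#detain from highest risk downward; expected crimes = sum of risk among those still released
p_sorted = np.sort(p_crime)[::-1]
expected = p_sorted.sum() - np.concatenate(([0], np.cumsum(p_sorted)))
curve = pd.DataFrame({'Detained': np.arange(len(p_sorted) + 1),
                      'ExpectedCrimes': expected})
print( curve )

#e.g. if we tolerate at most 0.5 expected crimes in this toy example
cap = 0.5
print( curve[curve['ExpectedCrimes'] <= cap]['Detained'].min() ) #detain 3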

Here is what that graph looks like when applying this strategy to the ProPublica data, just under 3,000 cases (in my test set from the prior blog post). You can see that if we decided to detain no one, we would expect a total of 1,200 extra crimes, and the curve decreases toward zero as we detain more people. So you may say I don't want more than 200 crimes, which would require detaining 1,500 people in this example (and happens to correspond to a risk threshold of 36% in this sample).

Using historical data, this is a good way to establish the overall amount of crime risk you expect from a particular set of bail reform decisions. To apply it to future threshold decisions, you need to assume that the total number of people arrested and the risk distribution stay the same as in the past (the latter I don't think is a big issue, and for the former you should be able to make reasonable projections if it is not constant). But this sets up the hypothetical: OK, if we release this many more people ROR, we expect this many more crimes to occur as an up-front expectation of bail reform. It may be that even if the individual cost-benefit calculation above says release, applying that decision to everyone would result in a total number of extra crimes folks deem unacceptable. So we can set the threshold to say we only want 10 extra crimes to happen because of bail reform, or 50 extra, or 100 extra, etc. This example aggregates all crime together, but you can do the same thing for every individual crime outcome you are interested in.

After the assessment is in place, this is actually monitoring folks should be doing anyway to ensure the model is working as expected. That is, per your risk assessment model you have an estimate of the number of crimes that folks who are released will go on to commit. If you predict more or fewer than what you actually observe among those released, your model is not well calibrated and needs to be updated. In practice you can't just estimate a predictive model once and then use it forever, you need to constantly monitor whether it is still working well in real life. (I actually showed in my prior blog post that the model was not very good, so this is a pretty big over-estimate of the number of crimes in this sample.)
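
A sketch of what that monitoring could look like is below. This assumes you have a table of released individuals with the model's predicted probability and an observed recidivism flag – the file and column names here are made up for the illustration.

import pandas as pd

#hypothetical monitoring data for those released pre-trial
#'prob' = predicted probability of a new offense, 'recid' = 1 if a new offense occurred
released = pd.read_csv('released_cases.csv') #assumed file/columns, purely illustrative

print( released['prob'].sum() )  #expected number of crimes per the model
print( released['recid'].sum() ) #observed number of crimes

#calibration check within risk bins, e.g. deciles of predicted risk
released['bin'] = pd.qcut(released['prob'], 10, duplicates='drop')
print( released.groupby('bin')[['prob','recid']].mean() )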

This should simultaneously quell complaints that bail reform is causing too many crimes. The lack of this information is causing folks to backlash against these predictive algorithms (although I suspect they do better than human judges, so I suspect they can reduce crime overall if used wisely). Offhand, I'm skeptical that the recent crime increases in Philly, NYC, and Dallas are tied to these bail reform efforts (they seem like too big an increase, or too noisy up-and-down, to reliably pin on just this), but maybe I am underestimating how many people they are letting out and the cumulative overall risk expected from the current models in those places. On the flip side, folks are right to question those Chicago stats; I suspect the risk algorithm should be saying that more crimes are occurring than they observed (ignoring what they should or should not be counting as recidivism).

I’d note that the metrics I am suggesting here should be pretty mundane to produce in practice. They rely on administrative data that is already collected and should be available in various electronic databases. So I don’t know why this information is not readily available in various jurisdictions.

What about False Positives?

One thing you may notice is that in my prior cost-benefit analysis I didn’t take false positives into consideration. Although my prior post details how you would set this, there is a fundamental problem with monitoring false positives (those detained who would not have gone on to recidivate): in practice, you can’t observe this value, you can only estimate it from historical data. Once you detain an individual, by construction they aren’t given the chance to recidivate. So you don’t get any on-policy feedback about false positives, only false negatives (folks who were released and went on to commit a crime pre-trial).

I think this puts a pretty big nail in the coffin of using false positive rates as a policy goal for bail reform in practice. Like I said earlier, you can’t just set a model once and expect it to work forever. But I actually don’t think false positives should be considered in the cost-benefit calculus anyway. Traditionally, people tend to think of setting the threshold for predictive models in terms of a confusion table like this, where the different outcomes in the table have different costs to individuals and to society:

In this table, those on the bottom row are those detained pre-trial. So in the hypothetical, you may say that if we could somehow know the false positives, we should count the extra harm that pre-trial detention causes those individuals (lost wages, losing a job, health harms, etc.). But for the folks who would have gone on to recidivate, we should just count the bare bones cost of detention.

I think this is the wrong way to think about it though. Those harms apply across the board to everyone – even if the person was likely to recidivate, they still bear those subsequent harms of being incarcerated. Whether you think people deserve the harm does not make it go away.

The main reason I am harping on bail reform so much (folks who know my work will realize it is outside my specific research area) is that the current bail system is grossly inefficient and inequitable. Piling monetary bail costs onto these folks is exactly the wrong way to ensure safety and to promote better outcomes for them.

Deciding who to detain vs who to let go is a hard decision. But pretending that the current state of affairs – judges making these decisions on whatever personal whims they have – is better than a cost-benefit approach with algorithmic assessments is just sticking your head in the sand.

Balancing False Positives

One area of prediction in criminal justice that I think has a lot of promise is using predictive algorithms in place of bail decisions – using a predictive instrument to determine whether someone is detained pre-trial based on risk, or released on recognizance if they are low risk. Risk can be defined based on either future dangerousness or flight risk. This cuts out the middle man of monetary bail, which doesn’t have much evidence of effectiveness and has the negative externality of placing economic burdens on folks we really don’t want to pile them onto. It is also the case that algorithms can likely do quite a bit better than judges at figuring out future risk. So this is an area where I think they can do real good compared to the current status quo in the CJ system.

A reasonable critique of such systems though is that they can have disparate racial impact. For example, ProPublica had an article on how the Compas risk assessment instrument resulted in more false positives for black than white individuals. Chris Stucchio has a nice breakdown of why this occurs, which is not due to Compas being an intrinsically racist algorithm, but due to the nature of the baseline risks for the two groups.

Consider a very simple example to illustrate. Imagine that based on our cost-benefit analysis, we determine the probability threshold to flag an individual as high risk is 60%. Now say that once we apply our predictions, everyone flagged in the white group has a predicted probability of 90%, and everyone flagged in the black group has a predicted probability of 70%. If our model is well calibrated (which is typically the case), the false positive rate will be 10% for whites and 30% for blacks.
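
To see why calibration pins down those false positive rates, here is a tiny simulated example (the 90% and 70% figures are just the hypothetical ones above):

import numpy as np

rng = np.random.default_rng(10)
n = 100000

#everyone flagged in the white group has a calibrated predicted probability of 0.9,
#everyone flagged in the black group has a calibrated predicted probability of 0.7
recid_white = rng.binomial(1, 0.9, n)
recid_black = rng.binomial(1, 0.7, n)

#false positive rate among the flagged = share who would not have recidivated
print( 1 - recid_white.mean() ) #~0.10
print( 1 - recid_black.mean() ) #~0.30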

It is actually a pretty trivial problem to balance false positive rates between different groups, if that is what you want to do, so I figured I would illustrate it here using the same ProPublica data. There are trade-offs though – balancing false positives means you lose out on other metrics of fairness. In particular, it means you don’t have equality of treatment: different racial groups will have different thresholds. The full data and code I use to illustrate this can be downloaded here.

An Example in Python

To illustrate how we would balance the false positive rates between groups, I use the same ProPublica risk assessment data. So this isn’t per se for bail decisions, but it works fine as an illustration. First, in python I load my libraries and then read in the data – it is a bit over 11,000 cases.

import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\BalanceFalsePos'
os.chdir(my_dir) #set the working directory where the data file lives

#For notes on data source, check out 
recid = pd.read_csv('PreppedCompas.csv')
print( recid.head() )

Next I prepare the dataset for modelling. I am not using all of the variables in the dataset. What I predict here is recidivism past 30 days (there are a bunch of recidivism events right away in the dataset, so I am not 100% sure those occurred prior to screening). I use the three different aggregate Compas scores, juvenile felony count, whether the person was male, how old they were, and whether the current charge precipitating the screening is a felony or misdemeanor. I include the race variable in the dataset, but I won’t be using it in the predictive model. (That point deserves another blog post – contrary to what you might expect, leaving race flags in will often result in better outcomes for that protected class.)

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy() #outcome plus the numeric predictors used below
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']
print( recid['race'].value_counts() ) #pretty good sample size for both whites/blacks

Next I make my training and testing sets of data. Retrospectively, I can perfectly balance false positives in-sample. But having a test set is a better representation of reality, where you need to make decisions based on historical data and apply them going forward.

#Now generating train and test set
np.random.seed(10) #set seed so the train/test split is reproducible
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

The procedure I suggest to balance false positives does not depend on how you generate the predictions, just that we have a predicted probability. Here I use a random forest, but you could use whatever machine learning or logistic regression model you want. The second part just generates the predicted probabilities for the training dataset.

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now getting the predicted probabilities in the training set
pred_prob = rf_mod.predict_proba(recid_train[ind_vars] )
recid_train['prob'] = pred_prob[:,1]     #predicted probability of recidivating
recid_train['prob_min'] = pred_prob[:,0] #predicted probability of not recidivating

Now to balance false positives, I will show a graph. Basically this just sorts the predicted probabilities in descending order within each racial group. Then you can calculate a cumulative false positive rate at different thresholds for each group.

#Making a cusum plot within each racial group for the false positives
recid_train.sort_values(by=['race','prob'], ascending=False, inplace=True)
recid_train['const'] = 1
recid_train['cum_fp'] = recid_train.groupby(['race'])['prob_min'].cumsum() #expected false positives if flagging everyone down to this row
recid_train['cum_n'] = recid_train.groupby(['race'])['const'].cumsum()     #number flagged so far
recid_train['cum_fpm'] = recid_train['cum_fp'] / recid_train['cum_n']      #running mean false positive rate
white_rt = recid_train[recid_train['race'] == 'Caucasian']
black_rt = recid_train[recid_train['race'] == 'African-American' ] 

And now the fun part (at least in terms of the output, not so much in writing the matplotlib code).

#now make the chart for white and black
fig, ax = plt.subplots()
ax.plot(black_rt['prob'], black_rt['cum_fpm'], drawstyle='steps', color='b', label='Black')
ax.plot(white_rt['prob'], white_rt['cum_fpm'], drawstyle='steps', color='r', label='White')
ax.set_xlim(1, 0)  # decreasing probs
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Mean False Positive Rate')
ax.legend(facecolor='white', framealpha=1)
plt.savefig('FP_Rate.png', dpi=2000, bbox_inches='tight')

So what this chart shows is that if we set our threshold at a particular predicted probability (X axis), based on the data we would expect a particular false positive rate (Y axis). Hence if we want to balance false positives, we just figure out the race specific threshold for each group at a particular Y axis value. Here we can see the white line is actually higher than the black line, which is the reverse of the ProPublica findings – given the same predicted probability threshold for flagging high risk, we would expect whites to have a higher false positive rate than blacks. So say we set the threshold at a 10% predicted probability to flag someone as high risk: we would guess the false positive rate among blacks in this sample to be around 40%, but closer to 45% in the white sample.

Technically the lines can cross at one or multiple places, and those are places where you get both equality of treatment and equality of outcome. It doesn’t make sense to use that though from a safety standpoint – those crossings can happen at a predicted probability of 99% (so too many false negatives) or 0.1% (too many false positives). So say we wanted to equalize false positive rates at 30% for each group. Here this results in a high risk threshold of 0.256 for whites and 0.22 for blacks.

#Figuring out where the threshold is to limit the mean FP rate to 0.3
#For each racial group
white_thresh = white_rt[white_rt['cum_fpm'] > 0.3]['prob'].max()
black_thresh = black_rt[black_rt['cum_fpm'] > 0.3]['prob'].max()
print( white_thresh, black_thresh )

Now for the real test, let’s see if this advice actually works to balance the false positive rate in a new sample of data.

#Now applying out of sample, let's see if this works
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
recid_test['prob_min'] = pred_prob[:,0]

white_test = recid_test[recid_test['race'] == 'Caucasian'].copy()
black_test = recid_test[recid_test['race'] == 'African-American'].copy()

white_test['Flag'] = 1*(white_test['prob'] > white_thresh)
black_test['Flag'] = 1*(black_test['prob'] > black_thresh)

#false positive rate = share of those flagged who did not recidivate
white_fp = 1 - white_test[white_test['Flag'] == 1][dep_var].mean()
black_fp = 1 - black_test[black_test['Flag'] == 1][dep_var].mean()
print( white_fp, black_fp )

And we get a false positive rate of 54% for whites (294/547 false positives) and 42% for blacks (411/986) – yikes, since I wanted a 30% FPR. As is typical when applying a model to out-of-sample data, the predictions are too optimistic. I think a better way to get error bars on such thresholds is to do some k-fold metrics and take the worst case scenario, but I need to investigate that some more. The sample sizes here are decent, but there will ultimately be some noise when deploying this in practice. So if the false positive rates you see in practice are within a few percentage points of the target, that is about as good as you can expect I imagine. (And for smaller sample sizes it will be more volatile.)
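
Here is a rough sketch of that k-fold idea, under my own assumptions about how it would work: refit the model on each training fold, recompute the race specific thresholds on the held-out fold, and look at the spread (or worst case) across folds. This reuses recid_prep, ind_vars, dep_var, and the pandas/numpy/sklearn imports from above, and the fp_threshold helper is just my shorthand for the sort/cumsum steps shown earlier.

from sklearn.model_selection import KFold

def fp_threshold(df, fp_target=0.3):
    #same sort/cumulative false positive logic as above, for a single group
    df = df.sort_values('prob', ascending=False)
    cum_fpm = df['prob_min'].cumsum() / np.arange(1, len(df) + 1)
    return df['prob'][cum_fpm > fp_target].max()

kf = KFold(n_splits=5, shuffle=True, random_state=10)
res = []
for train_idx, test_idx in kf.split(recid_prep):
    tr = recid_prep.iloc[train_idx]
    te = recid_prep.iloc[test_idx].copy()
    mod = RandomForestClassifier(n_estimators=500, random_state=10)
    mod.fit(tr[ind_vars], tr[dep_var])
    pp = mod.predict_proba(te[ind_vars])
    te['prob'], te['prob_min'] = pp[:,1], pp[:,0]
    res.append( [fp_threshold(te[te['race'] == r]) for r in ('Caucasian','African-American')] )

#spread of the race specific thresholds across folds
print( pd.DataFrame(res, columns=['White','Black']) )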

Knowing when to fold them: A quantitative approach to ending investigations

The recent work on investigations in the criminal justice field has my head turning about potential quantitative applications in this area (check out the John Eck & Kim Rossmo podcasts on Jerry’s site first, then check out the recent papers in Criminology and Public Policy on the topic for a start). One particular problem that was presented to me was detective case loads — detectives are human, so they can only handle so many cases at once. Triage typically happens at the initial crime reporting stage, with inputs such as the seriousness of the offense, the overall probability of the case being solved, and the future dangerousness of the folks involved going into the calculus of whether to assign a case.

Here I wanted to focus on a different problem though — how long to keep cases open? There are diminishing returns to keeping cases open indefinitely, so PDs should be able to right-size the back end of open detective cases as well as the front-end triage. My suggested solution is to estimate a survival model of the probability of a case being solved, from which you can estimate an expected return on investment given the time you put in.

Here is a simplified example. Say the table below shows the (instantaneous) probability of a case being solved per week put into the investigation.

Week 1  20%
Week 2  10%
Week 3   5%
Week 4   3%
Week 5   1%

In survival model parlance, this is the hazard function in discrete time increments. The probabilities diminish over time, which should generally be true in practice (a higher probability of being solved right away, getting lower over time). The expected return of investigating this crime at time t is the cumulative probability of the crime being solved by time t, multiplied by whatever value you assign to the case being solved. The cost of investigating is fixed per week (based on the detective’s salary), so the cumulative cost is just t*invest_cost.

So just to fill in some numbers, let’s say that it costs the police department $1,000 a week to keep an investigation going. Also say a crime has a return of $10,000 if it is solved (the latter number will be harder to figure out in practice, as cost of crime estimates are not a perfect fit). Filling in our table, we have below our detective return on investment estimates (note that the cumulative probability of being solved is not simply the sum of the instantaneous probabilities, else it would eventually go over 100% – you apply each week’s hazard only to cases not yet solved). So the return on investment (ROI) at week 1 is 10,000*0.2 = 2,000, at week 2 it is 10,000*0.28 = 2,800, etc.

        h(t) solved%  cum-costs   ROI   
Week 1  20%    20%     1,000     2,000
Week 2  10%    28%     2,000     2,800
Week 3   5%    32%     3,000     3,200
Week 4   3%    33%     4,000     3,300
Week 5   1%    34%     5,000     3,400
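
Here is a minimal sketch that reproduces the table above from the weekly hazards, so you can see where the cumulative solved percent and ROI columns come from (the $10,000 and $1,000 figures are just the hypothetical numbers above):

import numpy as np
import pandas as pd

#weekly (discrete time) hazard of the case being solved, from the table above
hazard = np.array([0.20, 0.10, 0.05, 0.03, 0.01])
solve_value = 10000 #hypothetical value of a solved case
weekly_cost = 1000  #hypothetical weekly investigation cost

surv = np.cumprod(1 - hazard) #probability the case is still unsolved after each week
weeks = np.arange(1, len(hazard) + 1)

roi = pd.DataFrame({'Week': weeks,
                    'Hazard': hazard,
                    'CumSolved': 1 - surv,
                    'CumCost': weekly_cost * weeks,
                    'ROI': solve_value * (1 - surv)})
print( roi ) #compare ROI to CumCost to decide how long to keep the case open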

Here the cumulative costs overtake the expected return by Week 4. So in practice (in this hypothetical example) you may say to a detective: you get 4 weeks to figure it out, and if it is not solved by then the case should be closed (but not cleared) and you should move on to other things. In the long run (I think) this strategy will make sure detective resources are balanced against actual cases solved.

This right-sizes investigation lengths from a global perspective, but you might also consider whether to close a case on an individual, case-by-case basis. In that case you wouldn’t count the sunk cost of the investigation so far; it is just the probability of the case being solved going forward relative to the future resources needed. (You build the same table, just restart the cum-costs and solved-percent columns from scratch at the point you are making the decision.)

In an actual applied setting, you can estimate the survival function however you want (e.g. you may want a cure/mixture model, since not all cases would be solved even given infinite time). It is also the case that different crimes will not only have different survival curves, but will also have different costs of crime (e.g. a murder has a greater cost to society than a theft) and probably different investigative resources needed (detective costs may also get lower over time, so are not constant). You can bake all of that right into this estimate. So you may say the cost of a murder is infinite, and you should keep that case open forever. A burglary though may warrant only a very short interval before it should be dropped (but still get some initial investment).
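
If you wanted to estimate that survival curve from historical case data, a simple discrete time approach is a life table: for each week, count how many cases are still open and what fraction get cleared that week. Below is a minimal sketch assuming a table with a weeks-to-clearance column and a solved flag (the file and column names are made up for the illustration; a real application would also want to handle different crime types and censoring more carefully).

import numpy as np
import pandas as pd

#hypothetical case history: 'weeks_open' = weeks until solved (or until closed unsolved),
#'solved' = 1 if the case was cleared, 0 if it was closed without being solved
cases = pd.read_csv('case_history.csv') #assumed file/columns, purely illustrative

rows = []
for w in range(1, int(cases['weeks_open'].max()) + 1):
    at_risk = (cases['weeks_open'] >= w).sum() #cases still open at the start of week w
    solved_w = ((cases['weeks_open'] == w) & (cases['solved'] == 1)).sum()
    rows.append({'Week': w, 'AtRisk': at_risk, 'Hazard': solved_w / at_risk})

haz = pd.DataFrame(rows)
haz['CumSolved'] = 1 - np.cumprod(1 - haz['Hazard'])
print( haz )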

Another neat application of this is that if you can generate reasonable estimates of the returns to solving crimes, you can right-size your overall detective bureau. That is, you can make a quantitative argument: I need X more detectives, and they will help solve Y more crimes, resulting in Z return on investment. It may be that we should greatly expand detective bureaus, but have them keep most cases open for only a short time period. I’m thinking of the recent officer shortages in Dallas, where very few cases are assigned at all. (Some PDs have patrol officers take on initial detective duties at the crime scene as well.)

There are definitely difficulties with applying this approach. One is that estimating the value of solving a crime is going to be tough: it bridges quantitative cost of crime estimates (although many of those are sunk costs once the crime has been perpetrated – arresting someone does not undo the bullet wound), likelihood of future reoffending, and ethical boundaries as well. If we are thinking about a detective bureau that is over-booked to begin with, we aren’t deciding on assigning individual cases at that point, but will need to consider pre-empting current investigations for new ones (e.g. if we drop case A and pick up case B, we get a better ROI). And that is ignoring the survival estimation part for different cases, which is tricky with observational data (selection biases in which cases are currently assigned could certainly make our survival curve estimates too low or too high).

This problem has to have been tackled in different contexts before (either by actuaries or in other business/medical contexts). I don’t know the best terms to google though to figure it out — so let me know in the comments if there is related work I should look into on solving this problem.