Using Association Rules to Conduct Conjunctive Analysis

I’ve suggested to folks a few times in the past that a popular analysis in CJ, called conjunctive analysis (Drawve et al., 2019; Miethe et al., 2008; Hart & Miethe, 2015), could be automated in a fashion using a popular machine learning technique called association rules. So I figured a blog post illustrating it would be good.

I was motivated by some recent work by Nix et al. (2019) examining officer involved injuries in NIBRS data. So I will be doing a relevant analysis (although not as detailed as Justin’s) to illustrate the technique.

This ended up being quite a bit of work. NIBRS is complicated, and I had to do some rewrites of finding frequent itemsets to not run out of memory. I’ve posted the python code on GitHub here. So this blog post will be just a bit of a nicer walkthrough. I also have a book chapter illustrating geospatial association rules in SPSS (Wheeler, 2017).

A Brief Description of Conjunctive Analysis

Conjunctive analysis is more of an exploratory technique for examining high cardinality categorical sets. Or in other words, you search through a database of cases that have many categories to find “interesting” patterns. It is probably easier to see an example than for me to describe it. Here is an example from Miethe et al. (2008):

You can see that here they are looking at characteristics of drug offenders, and then trying to identify particular sets of characteristics that influence the probability of a prison sentence. This is easy to do in one dimension, but it gets very difficult in multiple dimensions.

Association rules were created for a very different type of problem – identifying common sets of items that shoppers buy together at the same time. But you can borrow that work to aid in conducting conjunctive analysis.

Data Prep for NIBRS

So here I am using 2012 NIBRS data to conduct analysis. Like I mentioned, I was motivated by the Nix and company paper examining officer injuries. They were interested in specifically examining officer involved injuries, and whether the perception that domestic violence cases were more dangerous for officers was justified.

For brevity I only ended up examining five different variable sets in NIBRS (Justin has quite a few more in his paper):

  • assault (or injury) type V4023
  • victim/off relationship V4032
  • ucr type V2006
  • drug use V2009 (also includes computer use!)
  • weapon V2017

All of these variables have three different item sets in the NIBRS codes, and many categories. You will have to dig into the python code, in the GitHub page to see how I recoded these variables.

Also maybe of interest I have some functions to do one-hot encoding of wide data. So a benefit of NIBRS is that you can have multiple crimes in one incident. So e.g. you can have one incident in which an assault and a burglary occurs. I do the analysis in a way that if you have common co-crimes they would pop out.
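As a rough sketch of what one-hot encoding wide data means here (this is not the code from the GitHub repo, and the column names and category labels below are hypothetical): each incident has up to three slots per variable, and any category that shows up in any slot gets a 1.

```python
# Hypothetical wide records: up to three UCR slots and three weapon slots per incident.
incidents = [
    {"ucr1": "Assault", "ucr2": "Burglary", "ucr3": None,
     "weap1": "Fists", "weap2": None, "weap3": None},
    {"ucr1": "Drug", "ucr2": None, "ucr3": None,
     "weap1": None, "weap2": None, "weap3": None},
]

def one_hot_wide(rows, prefixes, n_slots=3):
    """Collapse the slot columns into one set of 'prefix_Category' items per row,
    then expand those sets into 0/1 dummy columns."""
    item_sets = []
    for row in rows:
        items = set()
        for p in prefixes:
            for i in range(1, n_slots + 1):
                val = row.get(f"{p}{i}")
                if val is not None:
                    items.add(f"{p}_{val}")
        item_sets.append(items)
    # One dummy column per category seen anywhere in the data.
    columns = sorted(set().union(*item_sets))
    dummies = [[int(c in items) for c in columns] for items in item_sets]
    return columns, dummies

columns, dummies = one_hot_wide(incidents, ["ucr", "weap"])
for name, col in zip(columns, zip(*dummies)):
    print(name, col)
```

The first incident ends up with a 1 for both ucr_Assault and ucr_Burglary, which is how co-occurring crimes in one incident can show up in the same itemset later on.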

Don’t take this as very formal though. Justin’s paper, which used 2016 NIBRS data, only had 1 million observations, whereas here I have over 5 million (so somewhere along the way Justin and I are using different units of analysis). Justin’s paper also incorporates dozens of other variables into the analysis that I don’t include here.

It ends up being that with just these five variables (and the reduced sets of codes I created), there still end up being 34 different categories in the data.

Analysis of Frequent Item Sets

The first part of conjunctive analysis (or association rules) is to identify common item sets. The work of Hart/Miethe is always pretty vague about how you do this. Association rules take a simple approach: you keep any item set (a combination of categories) whose frequency meets an arbitrary threshold.

So the way you represent the data is exactly how the prior Miethe et al. (2008) data showed, you create a series of dummy 0/1 variables. Then in association rules you look for sets in which for different cases all of the dummy variables take the value of 1.

The code on GitHub shows this going from the already created dummy variable data. I ended up writing my own function to do this, as I kept getting out of memory errors using the mlxtend library. (I don’t know if this is because my data has a large N but a smaller number of columns.) You can see my freq_sets function to do this.

Typically in association rules you identify item sets that meet a particular support threshold. Support here just means the proportion of all cases in which those items co-occur, e.g. if 20% of cases have both an assault and a weapon of fists listed, that itemset has a support of 20%. Instead though I wrote the code to have a minimum N, which I choose here to be 1000 cases. (So out of 5 million cases, this is a support of 1/5000.)
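To make the minimum N idea concrete, here is a minimal level-wise (apriori style) sketch in plain Python. It is not my freq_sets function from the repo; the function name frequent_itemsets and the toy transactions are made up for illustration, and it makes no attempt to be memory efficient.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_n):
    """Level-wise search: count itemsets of size k, keep those with at least
    min_n cases, then only consider size k+1 sets built from the survivors."""
    results = {}
    # Size 1: count each single item.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    current = {s: n for s, n in counts.items() if n >= min_n}
    k = 1
    while current:
        results.update(current)
        k += 1
        keep = set().union(*current)  # items still in some frequent set
        counts = Counter()
        for t in transactions:
            usable = sorted(t & keep)
            for combo in combinations(usable, k):
                # Pruning: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        current = {s: n for s, n in counts.items() if n >= min_n}
    return results

# Toy data, one set of items per incident.
transactions = [
    {"ucr_Assault", "weap_Fists", "rel_Known"},
    {"ucr_Assault", "weap_Fists"},
    {"ucr_Assault", "rel_Known"},
    {"ucr_Drug"},
]
freq = frequent_itemsets(transactions, min_n=2)
for itemset, n in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(n, set(itemset))
```

Dividing each count by the number of transactions gives the usual support, so a minimum N and a minimum support are interchangeable thresholds.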

I end up finding a total of 411 frequent item sets in the data that have at least 1000 cases (out of the over 5 million). Here are a few examples, with the frequencies to the left. So there are over 2000 cases in the 2012 NIBRS data that had a known relationship between victim/offender, resulted in assault, the weapon used was fists (or kicking), and involved computer use in some way. I only end up finding two itemsets that have 5 categories and that is it, there are no higher sets of categories that have at least 1000 cases in this dataset.

3509    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'ucr_Drug'}
2660    {'rel_Known', 'ucr_Assault', 'weap_Firearm', 'ucr_WeaponViol'}
2321    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'drug_ComputerUse'}
1132    {'rel_Known', 'ucr_Assault', 'weap_Fists', 'weap_Knife'}
1127    {'ucr_Assault', 'weap_Firearm', 'weap_Fists', 'ucr_WeaponViol'}
1332    {'rel_Known', 'ass_Argument', 'rel_Family', 'ucr_Assault', 'weap_Fists'}
1416    {'rel_Known', 'rel_Family', 'ucr_Assault', 'weap_Fists', 'ucr_Vandalism'}

Like I said I was interested in using NIBRS because of the Nix example. One way we can then examine what variables are potentially related to officer involved injuries during the commission of a crime would be to just pull out any itemsets which include the variable of interest, here ass_LEO_Assault.

4039    {'ass_LEO_Assault'}
1232    {'rel_Known', 'ass_LEO_Assault'}
4029    {'ucr_Assault', 'ass_LEO_Assault'}
1856    {'ass_LEO_Assault', 'weap_Fists'}
1231    {'rel_Known', 'ucr_Assault', 'ass_LEO_Assault'}
1856    {'ucr_Assault', 'ass_LEO_Assault', 'weap_Fists'}

So we see there are a total of just over 4000 officer assaults in the dataset. Unsurprisingly almost all of these also had a UCR offense of assault listed (4029 out of 4039).
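Pulling out the itemsets for one item of interest is just a filter on the frequent itemsets. A small sketch, using a handful of the counts listed above (the helper name with_item is hypothetical):

```python
# A few counts copied from the itemsets listed above.
freq = {
    frozenset({"ass_LEO_Assault"}): 4039,
    frozenset({"rel_Known", "ass_LEO_Assault"}): 1232,
    frozenset({"ucr_Assault", "ass_LEO_Assault"}): 4029,
    frozenset({"ass_LEO_Assault", "weap_Fists"}): 1856,
    frozenset({"rel_Known", "ucr_Assault", "weap_Fists", "ucr_Drug"}): 3509,
}

def with_item(freq, item):
    """Keep only the itemsets that contain the item of interest."""
    return {s: n for s, n in freq.items() if item in s}

leo = with_item(freq, "ass_LEO_Assault")
base = leo[frozenset({"ass_LEO_Assault"})]

# Share of officer assault cases that also have each other item listed.
share_assault = leo[frozenset({"ucr_Assault", "ass_LEO_Assault"})] / base
share_fists = leo[frozenset({"ass_LEO_Assault", "weap_Fists"})] / base
print(f"UCR assault: {share_assault:.3f}, fists: {share_fists:.3f}")
```

Dividing by the count for the single item is the same thing as the confidence of a rule with ass_LEO_Assault as the antecedent.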

Analysis of Association Rules

Sometimes just identifying the common item sets is what is of main interest in conjunctive analysis (see Hart & Miethe, 2015 for an example of examining the geographic characteristics of crime events).

But the apriori algorithm is one way to find particular rules of the form ‘if A occurs, then B occurs quite often’. You can swap in more complicated itemsets for the antecedent (A) and consequent (B), and there are different ways of quantifying ‘quite often’.

I prefer conditional probability notation to the more typical association rule one, but for typical rules we have (here I use A for antecedent and B for consequent):

  • confidence: P(A & B) / P(A). So if the itemset of just A occurs 20% of the time, and the itemset of A and B together occurs 10% of the time, the confidence would be 50%. (Or more simply the probability of B conditional on A, P(B | A)).
  • lift: confidence(A,B) / P(B). The numerator is the confidence from the prior bullet, and the denominator is the baseline rate at which B occurs. So if you have a baseline B occurring 25% of the time, and the confidence of A & B is 50%, you would then have a lift of 2.

There are other rules as well that folks use, but those are the most common two I am interested in.

So for example in this data if I draw out rules that have a lift of over 2, I find rules like {'ucr_Vandalism', 'rel_Family'} -> {'ass_Argument'} produces a lift of over 6. (I can use the mlxtend implementation here in this code, it was only the frequent itemsets code that was giving me problems.) So it ends up being arguments are listed in the injury codes around 1.6% of the time, but when you have a UCR crime of vandalism, and the victim and offender are family members, an injury type of argument happens around 10.5% of the time (so 10.5/1.6 ~= 6).
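The two metrics are simple enough to write out by hand. A minimal sketch using the toy percentages from the definitions and the rounded figures from the vandalism/family example (the function names are mine, not from any library):

```python
def confidence(p_ab, p_a):
    """P(B | A) = P(A & B) / P(A)."""
    return p_ab / p_a

def lift(conf_ab, p_b):
    """Confidence divided by the baseline rate of the consequent B."""
    return conf_ab / p_b

# Toy numbers: A occurs 20% of the time, A and B together 10%, B overall 25%.
conf = confidence(p_ab=0.10, p_a=0.20)
print(round(conf, 2), round(lift(conf, p_b=0.25), 2))

# Vandalism + family -> argument, using the rounded percentages in the text:
# P(argument | vandalism, family) ~ 10.5% and P(argument) ~ 1.6%.
lift_argument = lift(0.105, 0.016)
print(round(lift_argument, 1))
```

A lift of 1 means the antecedent tells you nothing about the consequent, so the further above 1, the more the rule stands out from the baseline.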

The original use case for this is recommender systems/market analysis (so say if you see someone buy A, give them a coupon for B). So this ends up being not so interesting in this NIBRS example when you have more clear cause-effect type relationships criminologists would be interested in. But I describe in the next section some further potential machine learning models that may be more relevant, or how I might in the future amend the apriori algorithm for examining specific outcomes.

Further Notes

If you have a specific outcome you are interested in from the get go (so not so much totally exploratory analysis as here), there are a few different options that may make more sense than association rules.

One is the RuleFit algorithm, which basically just uses a regularized regression to find simple models and low order interactions. An example of this idea using police stop data is in Goel et al. (2016). These are very similar in the end to simple decision trees, and you can also have continuous covariates in the analysis and it splits them into binary above/below rules. So you could say do RTM distance analysis, and still have it output a rule if < 1000 ft predict high risk. But they are fit in a way that tends to behave better out of sample than doing simple decision trees.

Another is fitting a more complicated model, say random forests, and then having reduced form summaries to describe those models. I have some examples of using Shapley values for spatial crime prediction in Wheeler & Steenbeek (2020), but for a more if-then type sets of rules you could look at Scoped Rules.

I may need to dig into the association rules code some more though, and try to update the code to take the sample sizes and statistical significance into account for a particular outcome variable. So if you find higher lift in a four set predicting a particular outcome, you search the tree for more sets with a smaller support in the distribution. (I should probably also work on some cool network viz. to look at all the different rules.)



300 blog posts and public good criminology

This isn’t technically my 300th blog post, but the 300th page I’ve constructed on my blog (so e.g. it includes when I’ve made a page for a class). I’ve posted a spreadsheet of the titles and dates of the posts over time (and updating it I noticed I was at 300).

I typically get around 200~300 views per day. Most of these are probably bots, but unless say over 90% are bots this website gets way more views than the cumulative views of all my academic papers combined. Here is a screen shot of the stats wordpress gives to me. My downtick in 2019 I thought was going to spiral into very few views, but it is still holding on.

I kind of have three different types of blog posts. One are example code snippets/data analysis. Often these are things I have done multiple times, so I want to create a record for me to more easily search up later. For example making a hexbin map in ggplot, or a margins plot in Stata. I wrote a recent post because I was talking with a friend about crime weights, and I wanted an example of using regression in python and an error bar plot for my library. (Quite a few birds with that stone.)

Two are questions I repeatedly encounter from students. For example, I made a list of demographic variables I use in the census, and where to find or scrape crime generator variables. Consistently my most popular post is testing the equality of two regression coefficients.

The third are just more generic opinion pieces. For example my notes on (the now late) David Bayley’s writing on the police potential to reduce crime, or Jane Jacobs’s take on neighborhoods, or that I don’t think latent trajectories are real things.

Some are multiple of these categories put together, particularly opinion pieces with example code snippets to illustrate the points I am making. Like a simulation of why I like to model individual delinquency items, or how to balance false positives in bail decisions.

On Public Good Criminology

None of these per se fit in the framework of typical peer review output. But despite no peer review, I think things like deriving optimal treatment allocation with network spillovers, or showing that conformal prediction intervals for synthetic control estimates are much smaller than permutation tests, are a substantive contribution to share!

So that brings me to the public good point. Most criminologists have a default of only valuing a closed peer review system. Despite my blog posts not being peer reviewed (ditto for the pre-prints I post at first), I hope folks can take the time to judge for themselves whether they are valuable or not. We would be much better off as a group if we did things like share code, share class preps, or failed projects by default.

Some of these posts I might write up if we had a short journal for our field akin to Economics Letters, but even that is a lot of work for very little value added to be frank. (If I had infinite time I also might turn my notes on Poisson/Negative Binomial regression into a little Sage green book.) Being a private sector data scientist now without the tenure boot on my neck, I don’t really have any need or desire to go through that process.

If all you value is getting the opinions of a handful of other academics, then by all means keep your work close to the chest and only publish in peer reviewed journals. If you want to provide a public good though, your work actually needs to be public.

Conjoint Analysis of Crime Rankings

So part of my recent research mapping crime harm spots uses cost of crime estimates relevant to police departments (Wheeler & Reuter, 2020). But a limitation of this is that cost of crime estimates are always somewhat arbitrary.

For a simple example, those cost estimates are based mostly on people time by the PD to respond to crimes and devote investigative resources. Many big city PDs entirely triage crimes like breaking into vehicles though. So based on PD response the cost of those crimes are basically $0 (especially if PDs have an online reporting system).

But I don’t think the public would agree with that sentiment! So in an act of cognitive dissonance with my prior post, I think asking the public is likely necessary for police to be able to ultimately serve the public’s interest when doing valuations. For some ethical trade-offs (like targeting hot spots vs increasing disproportionate minority contact, Wheeler, 2019) I am not sure there is any other reasonable approach than simply getting a bunch of people’s opinions.

But that being said, I suspected that these different metrics would provide pretty similar rankings for crime severity overall. So while it is criminology 101 that official crime and normative perceptions of deviance are not a perfect 1 to 1 mapping, most folks (across time and space) have largely similar agreement on the severity of different crimes, e.g. that assault is worse than theft.

So what I did was grab some survey ranking of crime data from the original source of crime ranking that I know of, Marvin Wolfgang’s supplement to the national crime victimization survey (Wolfgang et al., 2006). I have placed all the code in this github folder to replicate. And in particular check out this Jupyter notebook with the main analysis.

Conjoint Analysis of Crime Ranks

This analysis is often referred to as conjoint analysis. There are a bunch of different ways to conduct conjoint analysis – some ask folks to create a ranked list of items, others ask folks to choose between a list of a few items, and others ask folks to rank problems on a Likert item 1-5 scale. I would maybe guess Likert items are the most common in our field, see for example Spelman (2004) using surveys asking people about disorder problems (and that data is available too, Taylor, 2008).

The Wolfgang survey I use here is crazy complicated, see the codebook, but in a nutshell they had an anchoring question where they assigned stealing a bike to a value of 10, and then asked folks to give a numeric score relative to that theft for a series of 24 other crime questions. Here I only analyze one version of the questionnaire, and after eliminating missing data there are still over 4,000 responses (in 1977!).

So you could analyze those metric scores directly, but I am taking the lazy route and just doing a rank ordering (where ties are the average rank) within person. Then conjoint analysis is simply a regression predicting the rank. See the notebook for a more detailed walkthrough; this just produces the same analysis as looking at the means of the ranks.
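Here is a small sketch of the within-person ranking step with made-up scores (the respondents and crime labels are hypothetical, and this is plain Python rather than the notebook code). A regression of rank on a dummy variable per crime, with no other covariates, recovers these same mean ranks.

```python
from collections import defaultdict

def average_ranks(scores):
    """Rank values from low to high (1-based); ties get the average of the
    ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

# Hypothetical severity scores, one dict per respondent (bike theft anchored at 10).
respondents = [
    {"bike theft": 10, "assault": 50, "burglary": 50},
    {"bike theft": 10, "assault": 80, "burglary": 30},
]

rank_sums = defaultdict(float)
for resp in respondents:
    crimes = list(resp)
    for crime, r in zip(crimes, average_ranks([resp[c] for c in crimes])):
        rank_sums[crime] += r

mean_rank = {c: s / len(respondents) for c, s in rank_sums.items()}
print(mean_rank)
```

Ranking within person throws away how far apart the scores are, which is the lazy part, but it also means one respondent giving huge numbers cannot dominate the averages.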

About the only thing I do different here than typical conjoint analysis is that I rescale the frequency weights (just changes the degrees of freedom for standard error estimates) to account for the repeated nature of the observations (e.g. I treat it like a sample of 4000 some observations, not 4000*25 observations). (I don’t worry about the survey weights here.)

To test my assertion of whether these different ranking systems will be largely in agreement, I take Jerry’s crime harm paper (Ratcliffe, 2015), which is based on sentencing guidelines, and map them as best I could to the Wolfgang questions (you could argue with me some though on those assessments – and some questions don’t have any analog, like a company dumping waste). I rescaled the Wolfgang rankings to be in a range of 1-14, same as Jerry’s, instead of 1-25.
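For the rescaling, one simple option is a linear map from the 1-25 range onto 1-14. This is a sketch under that assumption (see the notebook for what I actually did); the function name rescale is just for illustration.

```python
def rescale(x, old_min=1.0, old_max=25.0, new_min=1.0, new_max=14.0):
    """Linearly map x from [old_min, old_max] onto [new_min, new_max]."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

# The endpoints map to the endpoints, and the midpoint maps to the midpoint.
for rank in (1, 13, 25):
    print(rank, rescale(rank))
```

A linear map keeps the ordering and the relative spacing of the mean ranks, so it does not change the correlation with the other ranking.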

Doing a more deep dive into the Wolfgang questions, there are definitely different levels in the nature of the questions you can tease out. Folks clearly take into account both harm to the victim and total damages/theft amounts. But overall the two systems are fairly correlated. So if an analyst wants to make crime harm spots now, I think it is reasonable to use one of these ranking systems, and then worry about getting the public perspective later on down the line.

The Wolfgang survey is really incredible. In this regression framework you can either adjust for other characteristics (e.g. it asks about all the usual demographics) or look at interactions (do folks who were recently victimized up their scores). So this is really just scratching the surface. I imagine if someone redid it with current data many of the metrics would be similar as well, although if I needed to do this I don’t think I would devise something as complicated as this, and would ask people to rank a smaller set of items directly.


  • Ratcliffe, J.H. (2015). Towards an index for harm-focused policing. Policing: A Journal of Policy and Practice, 9(2), 164-182.
  • Spelman, W. (2004). Optimal targeting of incivility-reduction strategies. Journal of Quantitative Criminology, 20(1), 63-88.
  • Taylor, R.B. (2008). Impacts of Specific Incivilities on Responses to Crime and Local Commitment, 1979-1994: [Atlanta, Baltimore, Chicago, Minneapolis-St. Paul, and Seattle].
  • Wheeler, A.P., & Reuter, S. (2020). Redrawing hot spots of crime in Dallas, Texas.
  • Wheeler, A.P. (2019). Allocating police resources while limiting racial inequality. Justice Quarterly, Online First.
  • Wolfgang, M.E., Figlio, R.M., Tracy, P.E., and Singer, S.I. (2006). National Crime Surveys: Index of Crime Severity, 1977.

Admin data should be used more often in policing research

I sometimes wonder if many researchers do not know actually what data police departments regularly collect. I commonly see articles on topics and think to myself “Hey, that is nice you did a survey on XYZ, why did you not confirm the responses with actual admin data on the same topic?”. Or I see topics that can be reasonably addressed using admin data not tackled at all by researchers.

So I decided to write this blog post.

I’ve mostly to date made a career out of analyzing administrative police data (only 2 out of my 30 some peer reviewed papers at this point are using non-regularly collected data as part of the analysis – and both of those link surveys to official crime records). To be honest I’m also motivated to write this as it is common for senior academics (in general in criminology, not just specific to policing researchers) to critique secondary data analysis (some of those folks are curmudgeons though, so maybe not worth stating). Of course you can do bad analysis with whatever data – primary or secondary makes no difference.

I think the default though should be to leverage admin data, so this sentiment I believe is in general misguided, and results in a lot of waste (time and money spent on primary data collection). I have never received research funding directly in my career (only as an RA for Rob Worden), so my work has essentially been for “free” on these projects (just my time). (I was basically subsidized by the university to do research!)

My opinion is based on two key points:

  1. Administrative data has already been collected by police agencies, so it has no additional costs for use by researchers.
  2. Administrative data defines core outcomes that police agencies strive to reduce.

For 2 in particular this is reducing reported crime and reducing use of force. (Use of force can be conceived of as an “output” instead of an “outcome”, but I tend to think of it as a negative externality that should be minimized to the extent possible.) I’m sure a few folks are thinking here “these don’t define the potential universe of outcomes police departments are interested in” and I agree – permit me to discuss this in more detail in a few paragraphs. The argument I am making is ultimately fuzzy – not that we shouldn’t collect other data, but it should meet a higher threshold than using zero-cost data already collected by PDs.

What is Admin Policing Data?

For folks not familiar, police departments keep electronic records of various things, mostly related to crime and interactions with the public. All police departments I have worked with have these types of records in various tables/databases:

  • calls to 911 (Computer Aided Dispatch)
  • reported crimes and incidents
  • charges & arrests
  • discretionary stops (traffic and pedestrian)
  • use of force

All of these tables you can link to individual officers and/or individual citizens, as well as have a date-time and location stamp of where it happened. So you can do things like see all the cases detective X has been assigned and his specific clearance rate, or all cases in which Y was listed as a victim, or see the stop/use-of-force patterns of officer Z over time, etc.

Other types of admin data that are pretty regular are psych screenings (especially for newer officers), civilian complaints, plain text detective/case notes, gang related databases (people/tags/incidents), databases of reported/recovered stolen goods, etc. Police collect a lot of data! At this point PDs often have this data going back over a decade.

How often is Admin Policing Data Used in Policing Journal Articles?

To illustrate my point that admin data should be used more in policing research, I took the most recent issues of several policing journals and counted up the articles that used admin data. (There are probably more policing journals I missed, sorry, these are the ones I know of/have submitted articles to in the past.)

So this is a total of 14/50 ~ 28% in this sample. This is actually higher than I expected (I guessed 10%). Looking at the first issue of Police Quarterly for 2020 it is 0/5. The Policing Policy and Practice issue also contained a special sub-issue on recruit training, among them 0/6 likely contained administrative data. The Policing: An International Journal first issue of 2020 had a special issue on cyber crime, which appears to me to have 2/14 papers using admin data. So if I add those stats, it is 16/75 ~ 21%.

I may be undercounting admin data here; for example I assume a survey of recruits is not a regular data collection (it hasn’t been in any police agency I’ve been involved with), but I of course may be wrong.

I’ve included as admin data looking at detective case notes (it is sort of like secondary analysis of a qualitative dataset!). Also counted as admin data one article that used the NCVS – which is regularly collected data (but by the federal govt, not local PD).

So you may squabble with my definitions here, but in broad strokes I don’t think any reasonable definition is likely to push this above ~1/3 papers in policing research use regularly collected admin data (in this sample of policing journals).

For reference I did a Twitter poll asking what proportion of policing research folks thought used admin data, and the distribution of the 86 responses was a slight favor for the right category (under 1/3rd, but almost the same amount guessed over 2/3’s).

So you can see a significant number of folks think that the distribution is opposite what it is in practice – the majority, not the minority, of policing research uses specially collected data and ignores admin data.

Restricting the subset to policing journals is likely to bias the estimate downward somewhat. I bet if I pulled policing articles from say Journal of Experimental Crim or Crime Science they are closer to 100% using admin policing data. But I think that also illustrates a pretty big discord in the current field of policing as well.

Some may think this cuts the research in terms of criminology/criminal justice – policing journals publish work on examining police behavior, whereas other journals tend to more frequently look at crime outcomes more associated with “criminological” research. This may be true, but admin data collected by police departments are pretty relevant for examining police behavior (e.g. proactive stops, use of force). These admin measures are almost always more relevant to police behavior than surveys of opinions! If you do surveys you should often tie it to these other admin measures to provide secondary evidence of different relevant measures.

What’s Wrong with Collecting New Data?

My argument is explicitly value-laden – I don’t know the correct percent of policing research that should use admin police data. But I do think the current swing in which the clear majority of research is oriented to collect primary data is wrong. Those primary data collections both have more costs (above data already collected by police agencies) and, for the most part, ignore the core outcomes PDs strive for.

For example, the National Institute of Justice has stated they want researchers to move away from admin data. One reason for this is that past researchers have been unsuccessful lowering crime, and so you should collect alternative measures to validate your intervention.

This I believe is an actively harmful perspective called “goal switching,” and in general makes little sense. If crime is so rare a study is ultimately poorly powered, there isn’t much potential benefit to reducing crime in that area even if the intervention does work in practice. Best case you need to do longer interventions. I mean if you want to reduce violent crime you can look at community sentiment if you want; it doesn’t make sense though to entirely drop the ultimate goal of violence reduction in its place though!

And this gets to the crux of core outcomes police should strive for. It is a normative question, but I believe reduced crime and reduced use of force are relatively well agreed upon general goals of police. I think it is OK to have secondary measures – such as say attitudes towards police or fear of crime or measures of police stress. But these measures have several things working against them.

One, they are not regularly collected as administrative datasets. I imagine you can troll up a few examples of PDs who have started to do regular surveys of attitudes towards police (either general public or specific post-PD contact), but the vast majority have not. So say you have an intervention intended to improve attitudes towards police. Great! For a police department interested in implementing that program, they not only have to allocate resources to that project, but also put an item in the budget to do the surveys forever. (This isn’t always true though, I think for example Rylan Simpson’s work is strong enough to justify making those low cost appearance changes and you don’t need to forever do surveys to see if it is working.) But for most interventions you can’t just do it once and hope it has improved indefinitely! (Same as you can’t stop measuring crime just because something you did made crime go down one time.)

Two, they are pretty fuzzy as to whether they should be reasonably swapped out for goals of crime reduction and reduced use of force in-and-of themselves. For sake of argument say hot spots policing causes back fire effects that cause increased fear of crime. How exactly do you trade off fear of crime vs actual crime reduction? Personally I think actual crime reductions should take precedence in that scenario. If you want to justify actually measuring fear of crime, you need to make some value based arguments to justify at minimum the cost of doing surveys. You should also probably justify altering police behavior in a particular way to improve that particular metric as well.

So any time you do a secondary data collection, you need to actually valuate the costs of the measures somehow (which I know is very difficult, hence it makes more sense to default to using admin data that is costless in terms of research!) Costless is probably a bit of a misnomer though – police departments have already sunk a lot of resources into collecting that admin data (patrol officers likely spend about equal time on dealing with people as they do with paperwork). But it is costless in terms of capital for me to query a database and say “use of force went down 10% after you instituted this policy”.

I think plenty of research collecting unique measures has potential to meet this threshold. One of the motivations to write this was Lois James’s articles on EIS – I think her general idea of doing a more deep dive to tease out more detailed interaction measures could be really important work (especially if it can be automated in a particular way, say through BWC footage). Lois’s work is just one example though. I also think measures of say police stressors could be very important in measuring churn of police officers over time. I already stated I think Rylan Simpson’s work on perceptions of police is well justified based on his simple experiments (since they are very low cost interventions, like wear purple gloves instead of black, or no cost e.g. take off your sunglasses when interviewing folks).

So these have potential to be worth the cost for police departments to open up their pocket books and collect those measures, but that is a bridge further than the majority of research currently being published in policing journals.

Some Caveats

So this is like I said a value-laden and fuzzy argument. No doubt some folks doing qualitative research or surveys will think this is loathsome, and think “I can’t answer my research question using administrative data”.

I intend the argument to go the other way though – we can be doing so much more quality research for much less cost. It is also the case that folks I believe need in general to do a much better job tying contemporary policing research to actual real life outcomes such as crime and use of force. Like I said I think the default should be basically the opposite proportion of what policing research looks like at the moment.

I’m not saying folks can’t do more basic data measures and collection – but as is the vast majority of this research lacks any semblance of a cost-benefit analysis that would justify the cost to collect those measures. As is, even if folks’ hypotheses are validated in a one time data collection, they lack the necessary valuation to justify police departments implementing those measures going forward in practice. (Many of these same valuation critiques apply to the use of technology in policing, although it is the obverse, not much academic work but plenty of sinking $$ into tech with little return in terms of measurable outcomes.)

One thing I have not touched on is access. Folks may be thinking “I can’t get access to that info!”. You actually probably can though – I don’t know a PD that would let you do a survey or interviews that also wouldn’t share much of this admin data.

Another thing I have not touched on is bias in admin data. That deserves a whole additional blog post. It is a fair critique in part (bias no doubt exists, the question is quantifying how large it is and its impact on the analysis). The majority of the work in these policing journals is not using alternative measures to get around bias in admin data though, they are measuring totally different things (as I said goal switching to totally different outcomes).

Statement on recent officer involved shooting research

Several recent studies (Johnson et al., 2019; Jetelina et al., 2020) use a similar study design to assess racial bias in officer involved shootings (OIS). In short, critiques of this work by Jon Mummolo (JM) are correct – they make a fundamental error in the analysis that renders the results mostly meaningless (Knox and Mummolo, 2020). JM critiques the work as switching conditional probabilities. This recent OIS work estimates the probability of the race of someone shot by police conditional on other characteristics, e.g. tests the hypothesis P(White | Other Stuff, Being Shot) = P(Minority | Other Stuff, Being Shot). Whereas we want Being Shot on the left hand side, e.g. P(Being Shot | Race), and switching these probabilities results in a mostly meaningless estimate in terms of inferring police behavior. You ultimately need to look at some cases in which folks were not shot to have a meaningful research design.
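A toy calculation may make the distinction concrete. The counts below are entirely made up for illustration (they are not from any of the cited studies), and the group labels `A` and `B` are placeholders. The point is only that the two conditional probabilities can rank the groups in opposite directions, and that the second one needs the encounters in which nobody was shot.

```python
# Hypothetical counts, invented purely for illustration
encounters = {"A": 10_000, "B": 2_000}  # police encounters by group
shot = {"A": 20, "B": 10}               # encounters that ended in a shooting

total_shot = sum(shot.values())

# P(Group | Shot): computable from a shootings-only dataset
p_group_given_shot = {g: shot[g] / total_shot for g in shot}

# P(Shot | Group): requires the encounters where no one was shot
p_shot_given_group = {g: shot[g] / encounters[g] for g in shot}

for g in shot:
    print(g, round(p_group_given_shot[g], 3), round(p_shot_given_group[g], 4))
```

In this made-up example group A is the majority of people shot, yet group B has the higher risk of being shot per encounter. A dataset of only shootings can recover the first set of numbers but not the second.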

I’ve been having similar conversations with folks since publishing my work on officer involved shootings (Wheeler et al., 2017). Most folks don’t understand the critique, and unfortunately most folks also don’t take critiques very well. So this post is probably a waste of time, but here it is anyway.

The Road

I’m likely to get some of the timing wrong in how I came to be interested in this area – but here is what I remember. David Klinger and Richard Rosenfeld published a piece in Criminology & Public Policy (CPP) examining the count of OIS’s in neighborhoods in St. Louis, conditional on demographic and violent crime counts in those neighborhoods (Klinger et al., 2016). So in quantoid speak they estimated the expected number of OIS in neighborhoods, E[OIS_n | Demographic_n, Crime_n].

I thought this work was mostly meaningless, mainly because it really only makes sense to look at rates of behavior. You could stick a count of anything police do on the left hand side of this regression and the violent crime coefficient will be the largest positive effect. So you could say estimate the counts of officers helping old ladies cross the street, and you would make the same inferences as you would about OIS. It is basically just saying where officers spend more of their time at (in violent crime areas), and subsequently have more interactions with individuals. It doesn’t say anything fundamentally about police behavior in regards to racial bias.

So sometime in 2016 me and Scott Phillips came up with the study design using when officers draw their firearm as the denominator. (Before I moved to Dallas I knew about their open data.) It was the observational analogue to the shoot/don’t shoot lab experiments Lois James did (James et al., 2014). Also sometime during the time period Roland Fryer came out with his pre-print, in which he used Taser uses as the counter-factual don’t shoot cases (Fryer, 2019). I thought drawing the firearm made more sense as a counterfactual, but both are subject to the same potential selection effect. (Police may be quicker to draw their firearms with minorities, which I readily admit in my paper.)

Also in that span Justin Nix came out with the birds-eye view CPP paper using the national level crowd sourced data (Nix et al., 2017) to estimate racial bias. They make what to me is a similar conditional probability mistake as the papers that motivated this post. Using the crowdsourced national level data, they estimate the probability of being unarmed, conditional on race (in the sample of just folks who were killed by the police). So they test whether P(Unarmed | White, Shot) = P(Unarmed | Minority, Shot).

Since like I said folks don’t really understand the conditional probability argument, basically at this point I just say folks get causality backwards. The police shooting at someone does not make them armed or unarmed, the same way police shooting at someone does not change their race. You cannot estimate a regression of X ~ beta*Y, then interpret beta as how much X causes Y. The stuff on the right hand side of the conditional probability statement works mostly the same way, we want to say stuff on the right hand side of the condition causes some change in the outcome.

I have this table I made in Wheeler et al. (2017) to illustrate various research designs – you can see that Ross (2015) made the same estimate of P(Unarmed | Race, Shot) as Justin did.

At this point you typically get a series of different retorts to the “you estimated the wrong conditional probability” complaint. The ones I’ve repeatedly seen are:

  1. No data is perfect. We should work with what we have.
  2. We ask a different research question.
  3. Our analyses are just descriptive, not causal.
  4. Our findings are consistent with a bunch of other work.

For (3) I would be OK if the results are described correctly, pretty much all of these articles are clearly interested in making inferences about police behavior though (which you cannot do with just looking at these negative encounters). It isn’t just a slip of mistaking conditional probabilities (like a common p-value mishap that doesn’t really impact the overall conclusions), the articles are directly motivated to make inferences about police behavior they cannot with this study design.

For (2) it is useful to consider how the descriptive conditional probabilities might actually be interpreted in a reasonable manner. So if we estimate P(Offender Race | Shot), you can think of a game where you see a news headline about an OIS and want to guess the race of the person shot by police: what would be your best guess? Ditto for P(Unarmed | Shot), what is the probability of someone being unarmed conditional on them being shot. This game is clearly a superficial type of thing to estimate. Those probabilities don’t say anything about behavior in terms of things police officers can control; they are all just a function of how often police get in interactions with individuals of those different races (or armed status).

Consider a different hypothetical, the probability a human is shot by police versus an animal. P(Human | Shot) is waay larger than P(Animal | Shot), are police biased against humans? No, the police just don’t deal with animals they need to shoot on a regular basis.

For (1) I will follow up below with some examples of how I think using this OIS data could actually be effective for shaping police behavior in practice, but suffice to say just collecting OIS you can’t really say anything about racial bias in terms of officer decision making.

I will say that a bunch of the individuals I am critiquing here I consider friends. Steve Bishopp was one of the co-authors on my OIS work with Dallas data. If I go to a conference Justin is one of the people I would prefer to sit down and have a drink with. I’ve been schmoozing up folks with good R programming skills to come to Dallas to work for Jenn Reingle-Gonzalez. They have all done other work I think is good. See Tregle et al. (2019) or Jetelina et al. (2017) or Cesario et al. (2019) for other examples I think are more legitimate research articles amongst the same people who I am critiquing here.

So in response to (4) I think you all made the same mistake – the conditional probability mistake is an easy one to make. So sorry to my friends whom I think are wrong about this. That being said, most of the vitriol in public forums, often accusing people of ad-hominem attacks on their motivations, is pretty much always out of line. I think basically everyone on Twitter is being a jerk to be frank. I’ve seen it all around on both sides in the most recent Twitter back and forth (both folks calling Jenn racist and JM biased against the police). None of them are racist or biased for/against the police. I suppose to expect any different though is setting myself up for disappointment. I was called racist by academic reviewers for Wheeler et al. (2017) (it took 4 rejects for my OIS paper before it was published). I’ve seen Justin get critiques on Twitter for being white in the past when doing work in this area.

I think CJ folks questioning JM’s motivation miss the point of his critique though. He isn’t saying police are biased and these papers are wrong, he is just saying these research papers are wrong because they can’t tell whether police are biased one way or another.

Who gives a shit

So I think better research could be conducted in this area – JM has his work on bounding estimates (Knox et al., 2019), and I imagine someone can come up with a reasonable instrumental variable strategy to address the selection bias in the same vein as my shoot/don’t shoot design (say officer instruments, or exogenous incidents that make officers more on edge and more likely to draw their firearm). But I think the question of whether “the police” are racially biased is a facile question. Globally labelling all police (or a single department) as racist is mostly a waste of time. Good for academic papers and to get people riled up on Twitter, not so much for anything else.

The police are simply a cross section of the general public. So in terms of whether some officers are racist this is true (as it is for the general public). Or maybe even we are all a little racist (ala the implicit bias hypothesis). We can only observe behavior, we cannot peer into the hearts and minds of men. But I believe it is a pretty tame statement to say racism is still a part of our society in some capacity.

Subsequently if you gather enough data you will be able to get some estimate of the police being racist (the null is for sure wrong). But if people can’t reasonably understand conditional probabilities, imagine trying to have a conversation about what is a reasonable amount of racial bias for monitoring purposes (inferiority bounds). Or that this racial bias estimate is not for all police, but some mixture of police officers and actions. Hard pass on either of those from me.

Subsequently this work has no bearing on actual police practice (including my own). They are of very limited utility – at best a stick or shield in civil litigation. They don’t help police departments change behavior in response to discovering (or not discovering) racial bias. And OIS are basically so rare they are worthless for all but the biggest police departments in terms of a useful monitoring metric (it won’t be sensitive enough to say whether a police department as a whole is doing good or doing bad).

So what do I think is a potentially useful way to use this data? I’ve used the term “monitoring metric” a few times – what I mean by that is using the information to actually inform some response. Internally for police departments, shootings should be part of an early intervention system used to monitor individual officers for problematic behavior. From a state or federal government perspective, they could actively monitor overall levels of force used to identify outlier agencies (see this blog post example of mine). For the latter think proactively identifying problematic departments, instead of the typical current approach of waiting for some major incident and then the Department of Justice assigning a federal monitor.

In either of those strategies just looking at shootings won’t be enough, they would need to use all levels of use of force to effectively identify either bad individual cops or problematic departments as a whole. Hence why I suggested adding all levels of force to say NIBRS, rather than having a stand alone national level OIS database. And individual agencies already have all the data they need to do an effective early intervention system.

I’m not totally opposed to having a national level OIS database just based on normative arguments – e.g. you think it is a travesty we can’t say how many folks were killed by police in the prior year. It is not a totally hollow gesture, as making people record the information does provide a level of oversight, so may make a small difference. But that data won’t be able to say anything about the racial bias in individual police officer decision making.


Cesario, J., Johnson, D. J., & Terrill, W. (2019). Is there evidence of racial disparity in police use of deadly force? Analyses of officer-involved fatal shootings in 2015–2016. Social psychological and personality science, 10(5), 586-595.

Fryer Jr, R. G. (2019). An empirical analysis of racial differences in police use of force. Journal of Political Economy, 127(3), 1210-1261.

James, L., Klinger, D., & Vila, B. (2014). Racial and ethnic bias in decisions to shoot seen through a stronger lens: Experimental results from high-fidelity laboratory simulations. Journal of Experimental Criminology, 10(3), 323-340.

Jetelina, K. K., Bishopp, S. A., Wiegand, J. G., & Gonzalez, J. M. R. (2020). Race/ethnicity composition of police officers in officer-involved shootings. Policing: An International Journal.

Jetelina, K. K., Jennings, W. G., Bishopp, S. A., Piquero, A. R., & Reingle Gonzalez, J. M. (2017). Dissecting the complexities of the relationship between police officer–civilian race/ethnicity dyads and less-than-lethal use of force. American journal of public health, 107(7), 1164-1170.

Johnson, D. J., Tress, T., Burkel, N., Taylor, C., & Cesario, J. (2019). Officer characteristics and racial disparities in fatal officer-involved shootings. Proceedings of the National Academy of Sciences, 116(32), 15877-15882.

Klinger, D., Rosenfeld, R., Isom, D., & Deckard, M. (2016). Race, crime, and the micro-ecology of deadly force. Criminology & Public Policy, 15(1), 193-222.

Knox, D., Lowe, W., & Mummolo, J. (2019). The bias is built in: How administrative records mask racially biased policing. Available at SSRN.

Knox, D., & Mummolo, J. (2020). Making inferences about racial disparities in police violence. Proceedings of the National Academy of Sciences, 117(3), 1261-1262.

Nix, J., Campbell, B. A., Byers, E. H., & Alpert, G. P. (2017). A bird’s eye view of civilians killed by police in 2015: Further evidence of implicit bias. Criminology & Public Policy, 16(1), 309-340.

Ross, C. T. (2015). A multi-level Bayesian analysis of racial bias in police shootings at the county-level in the United States, 2011–2014. PloS one, 10(11).

Tregle, B., Nix, J., & Alpert, G. P. (2019). Disparity does not mean bias: Making sense of observed racial disparities in fatal officer-involved shootings with multiple benchmarks. Journal of crime and justice, 42(1), 18-31.

Wheeler, A. P., Phillips, S. W., Worrall, J. L., & Bishopp, S. A. (2017). What factors influence an officer’s decision to shoot? The promise and limitations of using public data. Justice Research and Policy, 18(1), 48-76.

Balancing False Positives

One area of prediction in criminal justice I think has a lot of promise is using predictive algorithms in place of bail decisions. So using a predictive instrument to determine whether someone is detained pre-trial based on risk, or released on recognizance if they are low risk. Risk can be defined based on either future dangerousness or flight risk. This cuts out the middle man of bail, which doesn’t have much evidence of effectiveness, and has negative externalities of placing economic burdens on folks we really don’t want to pile that onto. It is also the case algorithms can likely do quite a bit better than judges in figuring out future risk. So it is an area where I think they can really do good compared to the current status quo in the CJ system.

A reasonable critique of such systems though is they can have disparate racial impact. For example, ProPublica had an article on how the Compas risk assessment instrument resulted in more false positives for black than white individuals. Chris Stucchio has a nice breakdown for why this occurs, which is not due to Compas being an intrinsically racist algorithm, but due to the nature of the baseline risks for the two groups.

Consider a very simple example to illustrate. Imagine based on our cost-benefit analysis, we determine the probability threshold to flag an individual as high risk is 60%. Now say once we apply our predictions, for those above the threshold, whites are all predicted to be 90%, and blacks are all 70%. If our model is well calibrated (which is typically the case), the false positive rate for whites will be 10%, and will be 30% for blacks.
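Here is a minimal sketch of that arithmetic. The function name and the made-up lists of predicted probabilities are just for illustration, and "false positive rate" is used in the same sense as in this post: the share of flagged cases expected not to recidivate. It assumes the predictions are well calibrated, so a case predicted at p contributes an expected 1 - p false positives.

```python
def expected_false_positive_share(pred_probs, threshold):
    """Among cases flagged as high risk (prob > threshold), the expected share
    that do not recidivate, assuming calibrated predicted probabilities."""
    flagged = [p for p in pred_probs if p > threshold]
    if not flagged:
        return None  # nobody flagged at this threshold
    return sum(1 - p for p in flagged) / len(flagged)

threshold = 0.6
# Hypothetical predictions: everyone above the threshold is 90% in one group
# and 70% in the other; the 30% cases fall below the threshold in both
group_1 = [0.9] * 100 + [0.3] * 100
group_2 = [0.7] * 100 + [0.3] * 100

fp_1 = expected_false_positive_share(group_1, threshold)
fp_2 = expected_false_positive_share(group_2, threshold)
print(round(fp_1, 2), round(fp_2, 2))
```

The same threshold is applied to both groups, but the group whose flagged cases sit closer to the threshold ends up with the larger expected share of false positives.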

It is actually a pretty trivial problem though to balance false positive rates between different groups, if that is what you want to do. So I figured I would illustrate here using the same ProPublica data. There are trade-offs though with this, balancing false positives means you lose out on other metrics of fairness. In particular, it means you don’t have equality of treatment – different racial groups will have different thresholds. The full data and code I use to illustrate this can be downloaded here.

An Example in Python

To illustrate how we would balance the false positive rates between groups, I use the same ProPublica risk assessment data. So this isn’t per se for bail decisions, but works fine as an illustration. First in python I load my libraries, and then read in the data – it is a few over 11,000 cases.

import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\BalanceFalsePos'
os.chdir(my_dir) #so the relative path to the csv file below resolves

#For notes on data source, check out 
recid = pd.read_csv('PreppedCompas.csv')
print( recid.head() )

Next I prepare the dataset for modelling. I am not using all of the variables in the dataset. What I predict here is recidivism post 30 days (there are a bunch of recidivism right away in the dataset, so I am not 100% sure those are prior to screening). I use the three different aggregate compas scores, juvenile felony count, whether they were male, how old they were, and whether the current charge to precipitate screening is a felony or misdemeanor. I include the race variable in the dataset, but I won’t be using it in the predictive model. (That point deserves another blog post, contra to what you might expect, leaving race flags in will often result in better outcomes for that protected class.)

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy()
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']
print( recid['race'].value_counts() ) #pretty good sample size for both whites/blacks

Next I make my testing and training sets of data. In practice I can perfectly balance false positives retrospectively. But having a test set is a better representation of reality, where you need to make some decisions on the historical data and apply it forward.

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy() #copy so new columns can be added later
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

For the procedure I suggest to balance false positives, it doesn’t matter how you generate the predictions, just that you have a predicted probability. Here I use random forests, but you could use whatever machine learning or logistic regression model you want. The second part just generates the predicted probabilities for the training dataset.

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now getting the predicted probabilities in the training set
pred_prob = rf_mod.predict_proba(recid_train[ind_vars] )
recid_train['prob'] = pred_prob[:,1]
recid_train['prob_min'] = pred_prob[:,0]

Now to balance false positives, I will show a graph. Basically this just sorts the predicted probabilities in descending order for each racial group. Then you can calculate a cumulative false positive rate for different thresholds for each group.

#Making a cusum plot within each racial group for the false positives
recid_train.sort_values(by=['race','prob'], ascending=False, inplace=True)
recid_train['const'] = 1
recid_train['cum_fp'] = recid_train.groupby(['race'])['prob_min'].cumsum()
recid_train['cum_n'] = recid_train.groupby(['race'])['const'].cumsum()
recid_train['cum_fpm'] = recid_train['cum_fp'] / recid_train['cum_n']
white_rt = recid_train[recid_train['race'] == 'Caucasian']
black_rt = recid_train[recid_train['race'] == 'African-American' ] 

And now the fun part (at least in output, not really in writing matplotlib code).

#now make the chart for white and black
fig, ax = plt.subplots()
ax.plot(black_rt['prob'], black_rt['cum_fpm'], drawstyle='steps', color='b', label='Black')
ax.plot(white_rt['prob'], white_rt['cum_fpm'], drawstyle='steps', color='r', label='White')
ax.set_xlim(1, 0)  # decreasing probs
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Mean False Positive Rate')
ax.legend(facecolor='white', framealpha=1)
plt.savefig('FP_Rate.png', dpi=2000, bbox_inches='tight')

So what this chart shows is that if we set our threshold to a particular predicted probability (X axis), based on the data we would expect a false positive rate (Y axis). Hence if we want to balance false positives, we just figure out the race specific thresholds for each group at a particular Y axis value. Here we can see the white line is actually higher than the black line, so this is the reverse of the ProPublica findings: we would expect whites to have a higher false positive rate than blacks given a consistent predicted probability of high risk threshold. So say we set the threshold at 10% to flag as high risk, we would guess the false positive rate among blacks in this sample should be around 40%, but will be closer to 45% in the white sample.

Technically the lines can cross at one or multiple places, and those are places where you get equality of treatment and equality of outcome. It doesn’t make sense to use that though from a safety standpoint – those crossings can happen at a predicted probability of 99% (so too many false negatives) or 0.1% (too many false positives). So say we wanted to equalize false positive rates at 30% for each group. Here this results in a threshold for whites as high risk of 0.256, and for blacks a threshold of 0.22.

#Figuring out where the threshold is to limit the mean FP rate to 0.3
#For each racial group
white_thresh = white_rt[white_rt['cum_fpm'] > 0.3]['prob'].max()
black_thresh = black_rt[black_rt['cum_fpm'] > 0.3]['prob'].max()
print( white_thresh, black_thresh )

Now for the real test, let’s see if my advice actually worked in a new sample of data to balance the false positive rate.

#Now applying out of sample, lets see if this works
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
recid_test['prob_min'] = pred_prob[:,0]

white_test = recid_test[recid_test['race'] == 'Caucasian'].copy() #copy so the Flag column can be added
black_test = recid_test[recid_test['race'] == 'African-American' ].copy()

white_test['Flag'] = 1*(white_test['prob'] > white_thresh)
black_test['Flag'] = 1*(black_test['prob'] > black_thresh)

white_fp= 1 - white_test[white_test['Flag'] == 1][dep_var].mean()
black_fp = 1 - black_test[black_test['Flag'] == 1][dep_var].mean()
print( white_fp, black_fp )

And we get a false positive rate of 54% for whites (294/547 false positives), and 42% for blacks (411/986) – yikes (since I wanted a 30% FPR). As is typical, when applying your model to out of sample data, your in-sample estimates turn out to be too optimistic. I think a better way to get error bars on such thresholds is to do some k-fold metrics and take the worst case scenario, but I need to investigate that some more. The sample sizes here are decent, but there will ultimately be some noise when deploying this in practice. So basically if you see in practice the false positive rates are within a few percentage points, that is about as good as you can get I imagine. (And for smaller sample sizes it will be more volatile.)
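As a rough sketch of what a k-fold check on the thresholds might look like, here is a version on synthetic data using only the standard library. Everything in it is an assumption for illustration: the helper names `fp_threshold` and `realized_fp`, the simulated calibrated cases, and the choice of 5 folds. It also treats the predicted probabilities as given rather than refitting a model within each fold, so it only captures the noise in picking a threshold, not any overfitting in the model's in-sample probabilities (which a real application would presumably need to handle by refitting per fold).

```python
import random

def fp_threshold(probs, target):
    """Largest predicted probability at which the running mean of (1 - p),
    over cases sorted from highest to lowest risk, exceeds the target."""
    running = 0.0
    for i, p in enumerate(sorted(probs, reverse=True), start=1):
        running += 1 - p
        if running / i > target:
            return p
    return None  # the target is never exceeded

def realized_fp(cases, thresh):
    """Share of flagged cases (prob > thresh) whose observed outcome is 0."""
    flagged = [y for p, y in cases if p > thresh]
    return 1 - sum(flagged) / len(flagged) if flagged else None

rng = random.Random(10)
# Synthetic calibrated cases: the outcome is 1 with probability p
cases = []
for _ in range(5000):
    p = rng.random()
    cases.append((p, 1 if rng.random() < p else 0))

k, target = 5, 0.3
rng.shuffle(cases)
folds = [cases[i::k] for i in range(k)]

fold_fp = []
for i in range(k):
    # pick the threshold on the other k-1 folds, evaluate on the held-out fold
    rest = [c for j, f in enumerate(folds) if j != i for c in f]
    thresh = fp_threshold([p for p, _ in rest], target)
    fold_fp.append(realized_fp(folds[i], thresh))

print([round(f, 3) for f in fold_fp], "worst case:", round(max(fold_fp), 3))
```

The spread across folds gives a sense of how far the realized false positive share can drift from the target, and the maximum is one candidate for the worst case scenario. In practice you would run this separately within each racial group.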

Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one has previously done before (including myself).

So here is a more general way to think about (from a police department’s or criminal justice agency’s perspective) whether it would be beneficial to reach out to me.

Should I do X?

So no one is going to be against different evidence based policing practices, but not all strategies make sense for all jurisdictions. For example, while focussed deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration the cost as well as the potential benefits of the program.

Should I do X may involve more open ended questions. I’ve previously conducted in person training for crime analysts that goes over various evidence based practices. It also may involve something more specific, such as should I redistrict my police beats? Or I have a theft-from-vehicle problem, what strategies should I implement to reduce them?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono, so you may ask what is in it for me. As an academic I get evaluated mostly by my research productivity, which involves writing peer reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If it involves applying for a grant (on a project I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments. But my work with police departments is more meaningful. It is not zero-sum, I tend to get better ideas when understanding specific agencies’ problems.

So get in touch if you think I can help your agency!

Monitoring Use of Force in New Jersey

Recently ProPublica published a map of uses-of-force across different jurisdictions in New Jersey. Such information can be used to monitor whether agencies are overall doing a good or bad job.

I’ve previously discussed the idea of using funnel charts to spot outliers, mostly around homicide rates but the idea is the same when examining any type of rate. For example in another post I illustrated its use for examining rates of officer involved shootings.

Here is another example applying it to lesser uses of force in New Jersey. Below is the rate of use of force reports per the total number of arrests. (Code to replicate at the end of the post.)

The average use of force per arrest in the state is around 3%. So the error bars show the expected limits relative to the state average. Here is an interactive chart in which you can use tool tips to see the individual jurisdictions.
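For a sense of how funnel limits like these behave, here is a minimal sketch. It uses a normal approximation to the binomial with a multiplier of `z=3`, both of which are my assumptions for illustration and not necessarily how the charts in this post were constructed (exact binomial limits would be preferable for small counts). The function name `funnel_bounds` is hypothetical.

```python
import math

def funnel_bounds(n, p, z=3.0):
    """Approximate lower/upper limits for an observed rate out of n events
    when the underlying rate is p (normal approximation to the binomial)."""
    se = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)

state_rate = 0.03  # roughly the statewide use of force per arrest
for arrests in (100, 1_000, 10_000):
    low, high = funnel_bounds(arrests, state_rate)
    print(arrests, round(low, 4), round(high, 4))
```

The limits narrow as the number of arrests grows, which is what gives the chart its funnel shape: a small agency needs a much more extreme rate than a large one before it counts as an outlier.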

Now the original press release, noted by Seth Stoughton on Twitter, stated that several towns have ratios of black to white use of force that are very high. Scott Wolfe suspected that was partly a function of smaller towns having more variable rates. Basically as one is comparing the ratio between two rates with error, the error bars around the rate ratio will also be quite large.

Here is the chart showing the same type of funnel around the rate ratio of black to white use-of-force relative to the average over the whole sample (the black percent use of force is 3.2 percent of arrests, and the white percent use of force is 2.4, and the rate ratio between the two is 1.35). I show in the code how I constructed this, which I should write a blog post about itself, but in short there are decisions I could make to make the intervals wider. So the points that are just slightly above a ratio of 2 at around 10,000 arrests are arguably not outliers, those more to the top-right of the plot though are much better evidence. (I’d note that if one group is very small, you could always make these error bars really large, so to construct them you need to make reasonable assumptions about the size of the two groups you are comparing.)
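To give a feel for why the assumed group sizes matter, here is one possible way to build limits around a rate ratio. This is not necessarily the construction used for the chart above: it is a sketch using a delta-method standard error for the log of the ratio of two binomial proportions, with `z=3` and the arrest share `share_black` as assumptions you would have to pick. The function name `ratio_funnel` is hypothetical, and the default rates are the rounded statewide figures quoted above (so the centre is about 1.33 rather than 1.35).

```python
import math

def ratio_funnel(total_arrests, share_black, p_black=0.032, p_white=0.024, z=3.0):
    """Approximate limits for the black/white use-of-force rate ratio in a
    jurisdiction with total_arrests arrests, if its true rates matched the
    overall sample rates. share_black is the assumed share of arrests."""
    n_b = total_arrests * share_black
    n_w = total_arrests * (1 - share_black)
    # delta method: var(log p_hat) is approximately (1 - p) / (n * p)
    se_log = math.sqrt((1 - p_black) / (n_b * p_black) +
                       (1 - p_white) / (n_w * p_white))
    centre = math.log(p_black / p_white)
    return math.exp(centre - z * se_log), math.exp(centre + z * se_log)

for share in (0.5, 0.1):
    low, high = ratio_funnel(10_000, share)
    print(share, round(low, 2), round(high, 2))
```

The limits get wider as total arrests shrink, and also as the split between the two groups becomes more lopsided, which is why some assumption about relative group sizes is needed to draw a single funnel.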

And here is another interactive chart in which you can view the outliers again. In the original press release, Millville, Lakewood, and South Orange are noted as outliers. Using arrests as the denominator instead of population, they each have a rate ratio of around 2. In this chart Millville and Lakewood are outside the bounds, but just barely. South Orange is within the bounds. So those aren’t the places I would have called out according to this chart.

In that same twitter thread other folks questioned the reliability/validity of such data (Pete Moskos and Kyle McLean). These charts cannot say why individual agencies are outliers, either high or low. It could be their officers are really using force at different rates, but it could also be that they are using different definitions for reporting force. There are also potentially other explanations for the use of force distribution, as well as for the ratio differences in black vs white; no doubt policing in Princeton vs Camden is substantively different. Also, even if all individual agencies are doing well, it does not mean there are no potential problem officers (as noted by David Pyrooz, often a few officers contribute to most UoF).

Despite these limitations, I still think there is utility in this type of monitoring. It is basically a flag to dig deeper when anomalous patterns are spotted. Those unaccounted for factors contribute to more points being pushed outside of my constructed limits (overdispersion), but the limits still clearly indicate when a pattern is so far outside the norm of what is expected that the public deserves some explanation. It also highlights when agencies are potentially doing good, so their current practices can be promoted.

This is a terrific start to effectively monitoring police agencies by ProPublica — state criminal justice agencies should be doing this themselves though.

Here is the code to replicate the analysis.

New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). Y axis is the number of violent crimes over a long time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though. You could use totally objective crime data (say swap out reported crime for acoustic gunshot sensors) and you would still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.
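To make the trade-off concrete, here is a toy sketch of the same selection problem. It is not the linear program from the paper: it simply tries every combination of areas, which only works for tiny examples. The area data, the function name, and the way the constraint is written (share of expected minority stops across the chosen areas at most some threshold) are assumptions for illustration.

```python
from itertools import combinations

def choose_hot_spots(areas, k, max_minority_share=1.0):
    """Pick k areas maximizing predicted crime, subject to the share of
    expected minority stops across the chosen areas being at most
    max_minority_share. Exhaustive search, only sensible for toy data."""
    best_names, best_crime = None, -1
    for combo in combinations(areas, k):
        stops = sum(a["stops"] for a in combo)
        minority = sum(a["minority_stops"] for a in combo)
        # linear form of the constraint: minority - share * stops <= 0
        if minority - max_minority_share * stops > 1e-9:
            continue
        crime = sum(a["crime"] for a in combo)
        if crime > best_crime:
            best_names = [a["name"] for a in combo]
            best_crime = crime
    return best_names, best_crime

# hypothetical areas echoing the 30 vs 28 crime example
areas = [
    {"name": "A", "crime": 30, "stops": 100, "minority_stops": 90},
    {"name": "B", "crime": 28, "stops": 100, "minority_stops": 50},
    {"name": "C", "crime": 29, "stops": 100, "minority_stops": 95},
    {"name": "D", "crime": 10, "stops": 100, "minority_stops": 40},
]

print(choose_hot_spots(areas, k=2))                          # no constraint
print(choose_hot_spots(areas, k=2, max_minority_share=0.8))  # equity constraint
```

Because the constraint is linear in the 0/1 choice variables, the same problem can be handed to an integer linear programming solver when there are thousands of areas instead of four.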

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade off was 3 crimes vs 30 that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the no constraint hot spots. So in the hypothetical it would go from 30 crimes to 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential pop. will typically result in similar large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That uses a totally different method, but I think will be pretty similar outputs in practice — can tune the penalty factor in a similar way to changing the linear program constraint I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.

American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky. For example, to figure out the proportion of black residents for block groups you need to add the non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of an area.
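As a small illustration of building a percentage variable from several ACS estimates, here is a sketch of the proportion black calculation. The variable IDs are my reading of table B03002 (004 for non-Hispanic black alone, 014 for Hispanic black alone, 001 for the total) and should be treated as assumptions to check against the template files. The numbers are made up.

```python
def pct_black(row):
    """Percent black = (non-Hispanic black + Hispanic black) / total * 100."""
    total = row["B03002_001"]            # total population (assumed ID)
    if total == 0:
        return None                      # avoid dividing by zero in empty block groups
    black = row["B03002_004"] + row["B03002_014"]  # assumed IDs
    return 100.0 * black / total

# made-up estimates for one block group
example = {"B03002_001": 1000, "B03002_004": 200, "B03002_014": 50}
print(pct_black(example))
```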

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.

The ACS data has consistent variable names over the years, such as B03001_001 is the total population, B03002_003 is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that given a directory of the excel template files will produce a nice set of dictionaries to help find what table particular variables are in.

#This python code grabs ACS meta-data templates
#To more easily search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #getting the variables into two dictionaries (labels and file locations)
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        vars = [i.value for i in sh.row(0)] #names on the first row
        labs = [i.value for i in sh.row(1)] #labels on the second
        #now add to the overall dictionary
        for v,l in zip(vars,labs):
            lab_dict[v] = l
            loc_dict[v] = file
    #returning the two dictionaries
    return lab_dict,loc_dict
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007'] #add any other variables of interest to this list
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)
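If you do not know the variable name ahead of time, the same two dictionaries can be searched by label text. Below is a minimal sketch. The `find_vars` helper is hypothetical, and the toy labels and file names are made up stand-ins for what the template files would actually return.

```python
import os

def find_vars(lab_dict, loc_dict, keyword):
    """Return (variable, label, file name) for labels containing keyword."""
    key = keyword.lower()
    hits = []
    for var, lab in sorted(lab_dict.items()):
        if key in str(lab).lower():
            hits.append((var, lab, os.path.basename(loc_dict[var])))
    return hits

# made-up stand-ins for the labels/tables dictionaries
toy_labels = {'B03002_001': 'Total population',
              'B03002_003': 'Not Hispanic: White alone',
              'B07001_017': 'Same house 1 year ago'}
toy_tables = {'B03002_001': os.path.join('templates', 'Seq5.xls'),
              'B03002_003': os.path.join('templates', 'Seq5.xls'),
              'B07001_017': os.path.join('templates', 'Seq30.xls')}

for hit in find_vars(toy_labels, toy_tables, 'hispanic'):
    print(hit)
```

With the real output you would call it as `find_vars(labels, tables, 'hispanic')`.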