Statement on recent officer involved shooting research

Several recent studies (Johnson et al., 2019; Jetelina et al., 2020) use a similar study design to assess racial bias in officer involved shootings (OIS). In short, critiques of this work by Jon Mummolo (JM) are correct – they make a fundamental error in the analysis that renders the results mostly meaningless (Knox and Mummalo, 2020). JM critiques the work as switching conditional probabilities, this recent OIS work estimates the probability of the race of someone shot by police conditional on other characteristics, e.g. tests the hypothesis P(White | Other Stuff, Being Shot) = P(Minority | Other Stuff, Being Shot). Whereas we want Being Shot on the left hand side, e.g. P(Being Shot | Race), and switching these probabilities results in mostly a meaningless estimate in terms of inferring police behavior. You ultimately need to look at some cases in which folks were not shot to have a meaningful research design.

I’ve been having similar conversations with folks since publishing my work on officer involved shootings (Wheeler et al., 2017). Most folks don’t understand the critique, and unfortunately most folks also don’t take critiques very well. So this post is probably a waste of time, but here it is anyway.

The Road

I’m likely to get some of the timing wrong in how I came to be interested in this area – but here is what I remember. David Klinger and Richard Rosenfeld published a piece in Criminology & Public Policy (CPP) examining the count of OIS’s in neighborhoods in St. Louis, conditional on demographic and violent crime counts in those neighborhoods (Klinger et al., 2016). So in quantoid speak they estimated the expected number of OIS in neighborhoods, E[OIS_n | Demographic_n, Crime_n].

I thought this work was mostly meaningless, mainly because it really only makes sense to look at rates of behavior. You could stick a count of anything police do on the left hand side of this regression and the violent crime coefficient will be the largest positive effect. So you could say estimate the counts of officers helping old ladies cross the street, and you would make the same inferences as you would about OIS. It is basically just saying where officers spend more of their time at (in violent crime areas), and subsequently have more interactions with individuals. It doesn’t say anything fundamentally about police behavior in regards to racial bias.

So sometime in 2016 me and Scott Phillips came up with the study design using when officers draw their firearm as the denominator. (Before I moved to Dallas I knew about their open data.) It was the observational analogue to the shoot/don’t shoot lab experiments Lois James did (James et al., 2014). Also sometime during the time period Roland Fryer came out with his pre-print, in which he used Taser uses as the counter-factual don’t shoot cases (Fryer, 2019). I thought drawing the firearm made more sense as a counterfactual, but both are subject to the same potential selection effect. (Police may be quicker to the draw their firearms with minorities, which I readily admit in my paper.)

Also in that span Justin Nix came out with the birds-eye view CPP paper using the national level crowd sourced data (Nix et al., 2017) to estimate racial bias. They make what to me is a similar conditional probability mistake as the papers that motivated this post. Using the crowdsourced national level data, they estimate the probability of being unarmed, conditional on race (in the sample of just folks who were killed by the police). So they test whether P(Unarmed | White, Shot) = P(Unarmed | Minority, Shot).

Since like I said folks don’t really understand the conditional probability argument, basically at this point I just say folks get causality backwards. The police shooting at someone does not make them armed or unarmed, the same way police shooting at someone does not change their race. You cannot estimate a regression of X ~ beta*Y, then interpret beta as how much X causes Y. The stuff on the right hand side of the conditional probability statement works mostly the same way, we want to say stuff on the right hand side of the condition causes some change in the outcome.

I have this table I made in Wheeler et al. (2017) to illustrate various research designs – you can see the Ross (2015) made the same estimate of P(Unarmed | Race, Shot) as Justin did.

At this point you typically get a series of different retorts to the “you estimated the wrong conditional probability complaint”. The ones I’ve repeatedly seen are:

  1. No data is perfect. We should work with what we have.
  2. We ask a different research question.
  3. Our analysis are just descriptive, not causal.
  4. Our findings are consistent with a bunch of other work.

For (3) I would be OK if the results are described correctly, pretty much all of these articles are clearly interested in making inferences about police behavior though (which you cannot do with just looking at these negative encounters). It isn’t just a slip of mistaking conditional probabilities (like a common p-value mishap that doesn’t really impact the overall conclusions), the articles are directly motivated to make inferences about police behavior they cannot with this study design.

For (2) it is useful to consider how might the descriptive conditional probabilities be actually interpreted in a reasonable manner. So if we estimate P(Offender Race | Shot), you can think of a game where if you see a news headline about an OIS, and you want to guess the race of the person shot by police, what would be your best guess. Ditto for P(Unarmed | Shot), what is the probability of someone being unarmed conditional on them being shot. This game is clearly a superficial type of thing to estimate, those probabilities don’t say anything though about behavior in terms of things police officers can control, they are all just a function of how often police get in interactions with those different races (or armed status) of individuals.

Consider a different hypothetical, the probability a human is shot by police versus an animal. P(Human | Shot) is waay larger than P(Animal | Shot), are police biased against humans? No, the police just don’t deal with animals they need to shoot on a regular basis.

For (1) I will follow up below with some examples of how I think using this OIS data could actually be effective for shaping police behavior in practice, but suffice to say just collecting OIS you can’t really say anything about racial bias in terms of officer decision making.

I will say that a bunch of the individuals I am critiquing here I consider friends. Steve Bishopp was one of the co-authors on my OIS work with Dallas data. If I go to a conference Justin is one of the people I would prefer to sit down and have a drink with. I’ve been schmoozing up folks with good R programming skills to come to Dallas to work for Jenn Reingle-Gonzalez. They have all done other work I think is good. See Tregel et al. (2019) or Jetelina et al. (2017) or Cesario et al. (2019) for other examples I think are more legitimate research articles amongst the same people who I am critiquing here.

So in response to (4) I think you all made the wrong mistake – the conditional probability mistake is an easy one to make. So sorry to my friends whom I think are wrong about this. That being said, most of the vitriol in public forums, often accusing people of ad-hominem attacks on their motivations, is pretty much always out of line. I think basically everyone on Twitter is being a jerk to be frank. I’ve seen it all around on both sides in the most recent Twitter back and forth (both folks calling Jenn racist and JM biased against the police). None of them are racist or biased for/against the police. I suppose to expect any different though is setting myself up for dissapointment. I was called racist by academic reviewers for Wheeler et al. (2017) (it took 4 rejects for my OIS paper before it was published). I’ve seen Justin get critiques on Twitter for being white in the past when doing work in this area.

I think CJ folks questioning JM’s motivation miss the point of his critique though. He isn’t saying police are biased and these papers are wrong, he is just saying these research papers are wrong because they can’t tell whether police are biased one way or another.

Who gives a shit

So while I think better research could be conducted in this area – JM has his work on bounding estimates (Knox et al., 2019), and I imagine someone can come up with a reasonable instrumental variable strategy to address the selection bias in the same vein as my shoot/don’t shoot (say officer instruments, or exogenous incidents that make officers more on edge and more likely to draw their firearm). But I think the question of whether “the police” are racially biased is a facile question. Globally labelling all police (or a single department) as racist is mostly a waste of time. Good for academic papers and to get people riled up in Twitter, not so much for anything else.

The police are simply a cross section of the general public. So in terms of whether some officers are racist this is true (as it is for the general public). Or maybe even we are all a little racist (ala the implicit bias hypothesis). We can only observe behavior, we cannot peer into the hearts and minds of men. But suffice to say racism is still a part of our society in some capacity I believe is a pretty tame statement.

Subsequently if you gather enough data you will be able to get some estimate of the police being racist (the null is for sure wrong). But if people can’t reasonably understand conditional probabilities, imagine trying to have a conversation about what is a reasonable amount of racial bias for monitoring purposes (inferiority bounds). Or that this racial bias estimate is not for all police, but some mixture of police officers and actions. Hard pass on either of those from me.

Subsequently this work has no bearing on actual police practice (including my own). They are of very limited utility – at best a stick or shield in civil litigation. They don’t help police departments change behavior in response to discovering (or not discovering) racial bias. And OIS are basically so rare they are worthless for all but the biggest police departments in terms of a useful monitoring metric (it won’t be sensitive enough to say whether a police department as a whole is doing good or doing bad).

So what do I think is potentially useful way to use this data? I’ve used the term “monitoring metric” a few times – what I mean by that is using the information to actually inform some response. Internally for police departments, shootings should be part of an early intervention system used to monitor individual officers for problematic behavior. From a state or federal government perspective, they could actively monitor overall levels of force used to identify outlier agencies (see this blog post example of mine). For the latter think proactively identifying problematic departments, instead of the typical current approach of wait for some major incident and then the Department of Justice assigns a federal monitor.

In either of those strategies just looking at shootings won’t be enough, they would need to use all levels of use of force to effectively identify either bad individual cops or problematic departments as a whole. Hence why I suggested adding all levels of force to say NIBRS, rather than having a stand alone national level OIS database. And individual agencies already have all the data they need to do an effective early intervention system.

I’m not totally oppossed to having a national level OIS database just based on normative arguments – e.g. you think it is a travesty we can’t say how many folks were killed by police in the prior year. It is not a totally hollow gesture, as making people record the information does provide a level of oversight, so may make a small difference. But that data won’t be able to say anything about the racial bias in individual police officer decision making.

References

Cesario, J., Johnson, D. J., & Terrill, W. (2019). Is there evidence of racial disparity in police use of deadly force? Analyses of officer-involved fatal shootings in 2015–2016. Social psychological and personality science, 10(5), 586-595.

Fryer Jr, R. G. (2019). An empirical analysis of racial differences in police use of force. Journal of Political Economy, 127(3), 1210-1261.

Klinger, D., Rosenfeld, R., Isom, D., & Deckard, M. (2016). Race, crime, and the micro-ecology of deadly force. Criminology & Public Policy, 15(1), 193-222.

Knox, D., Lowe, W., & Mummolo, J. (2019). The bias is built in: How administrative records mask racially biased policing. Available at SSRN.

Knox, D., & Mummolo, J. (2020). Making inferences about racial disparities in police violence. Proceedings of the National Academy of Sciences, 117(3), 1261-1262.

James, L., Klinger, D., & Vila, B. (2014). Racial and ethnic bias in decisions to shoot seen through a stronger lens: Experimental results from high-fidelity laboratory simulations. Journal of Experimental Criminology, 10(3), 323-340.

Jetelina, K. K., Bishopp, S. A., Wiegand, J. G., & Gonzalez, J. M. R. (2020). Race/ethnicity composition of police officers in officer-involved shootings. Policing: An International Journal.

Jetelina, K. K., Jennings, W. G., Bishopp, S. A., Piquero, A. R., & Reingle Gonzalez, J. M. (2017). Dissecting the complexities of the relationship between police officer–civilian race/ethnicity dyads and less-than-lethal use of force. American journal of public health, 107(7), 1164-1170.

Johnson, D. J., Tress, T., Burkel, N., Taylor, C., & Cesario, J. (2019). Officer characteristics and racial disparities in fatal officer-involved shootings. Proceedings of the National Academy of Sciences, 116(32), 15877-15882.

Nix, J., Campbell, B. A., Byers, E. H., & Alpert, G. P. (2017). A bird’s eye view of civilians killed by police in 2015: Further evidence of implicit bias. Criminology & Public Policy, 16(1), 309-340.

Ross, C. T. (2015). A multi-level Bayesian analysis of racial bias in police shootings at the county-level in the United States, 2011–2014. PloS one, 10(11).

Tregle, B., Nix, J., & Alpert, G. P. (2019). Disparity does not mean bias: Making sense of observed racial disparities in fatal officer-involved shootings with multiple benchmarks. Journal of crime and justice, 42(1), 18-31.

Wheeler, A. P., Phillips, S. W., Worrall, J. L., & Bishopp, S. A. (2017). What factors influence an officer’s decision to shoot? The promise and limitations of using public data. Justice Research and Policy, 18(1), 48-76.

Balancing False Positives

One area of prediction in criminal justice I think has alot of promise is using predictive algorithms in place of bail decisions. So using a predictive instrument to determine whether someone is detained pre-trial based on risk, or released on recognizance if you are low risk. Risk can be either defined as based on future dangerousness or flight risk. This cuts out the middle man of bail, which doesn’t have much evidence of effectiveness, and has negative externalities of placing economic burdens on folks we really don’t want to pile that onto. It is also the case algorithms can likely do quite a bit better than judges in figuring out future risk. So an area I think they can really do good compared to current status quo in the CJ system.

A reasonable critique of such systems though is they can have disparate racial impact. For example, ProPublica had an article on how the Compas risk assessment instrument resulted in more false positives for black than white individuals. Chris Stucchio has a nice breakdown for why this occurs, which is not due to the Compas being intrinsically racist algorithm, but due to the nature of the baseline risks for the two groups.

Consider a very simple example to illustrate. Imagine based on our cost-benefit analysis, we determine the probability threshold to flag a individual as high risk is 60%. Now say our once we apply our predictions, for those above the threshold, whites are all predicted to be 90%, and blacks are all 70%. If our model is well calibrated (which is typically the case), the false positive rate for whites will be 10%, and will be 30% for blacks.

It is actually a pretty trivial problem though to balance false positive rates between different groups, if that is what you want to do. So I figured I would illustrate here using the same ProPublica data. There are trade-offs though with this, balancing false positives means you lose out on other metrics of fairness. In particular, it means you don’t have equality of treatment – different racial groups will have different thresholds. The full data and code I use to illustrate this can be downloaded here.

An Example in Python

To illustrate how we would balance the false positive rates between groups, I use the same ProPublica risk assessment data. So this isn’t per se for bail decisions, but works fine as an illustration. First in python I load my libraries, and then read in the data – it is a few over 11,000 cases.

import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\BalanceFalsePos'
os.chdir(my_dir)

#For notes on data source, check out 
#https://github.com/apwheele/ResearchDesign/tree/master/Week11_MachineLearning
recid = pd.read_csv('PreppedCompas.csv')
print( recid.head() )

Next I prepare the dataset for modelling. I am not using all of the variables in the dataset. What I predict here is recidivism post 30 days (there are a bunch of recidivism right away in the dataset, so I am not 100% sure those are prior to screening). I use the three different aggregate compas scores, juvenile felony count, whether they were male, how old they were, and whether the current charge to precipitate screening is a felony or misdemeanor. I include the race variable in the dataset, but I won’t be using it in the predictive model. (That point deserves another blog post, contra to what you might expect, leaving race flags in will often result in better outcomes for that protected class.)

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']]
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']
print( recid['race'].value_counts() ) #pretty good sample size for both whites/blacks

Next I make my testing and training sets of data. In practice I can perfectly balance false positives retrospectively. But having a test set is a better representation of reality, where you need to make some decisions on the historical data and apply it forward.

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1]
recid_test = recid_prep[recid_prep['Train'] == 0]

Now the procedure I suggest to balance false-positives doesn’t matter how you generate the predictions, just that we need a predicted probability. Here I use random forests, but you could use whatever machine learning or logistic regression model you want. Second part just generates the predicted probabilities for the training dataset.

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now getting the predicted probabilities in the training set
pred_prob = rf_mod.predict_proba(recid_train[ind_vars] )
recid_train['prob'] = pred_prob[:,1]
recid_train['prob_min'] = pred_prob[:,0]

Now to balance false positives, I will show a graph. Basically this just sorts the predicted probabilities in descending order for each racial group. Then you can calculate a cumulate false positive rate for different thresholds for each group.

#Making a cusum plot within each racial group for the false positives
recid_train.sort_values(by=['race','prob'], ascending=False, inplace=True)
recid_train['const'] = 1
recid_train['cum_fp'] = recid_train.groupby(['race'])['prob_min'].cumsum()
recid_train['cum_n'] = recid_train.groupby(['race'])['const'].cumsum()
recid_train['cum_fpm'] = recid_train['cum_fp'] / recid_train['cum_n']
white_rt = recid_train[recid_train['race'] == 'Caucasian']
black_rt = recid_train[recid_train['race'] == 'African-American' ] 

And now the fun part (and least in output, not really in writing matplotlib code).

#now make the chart for white and black
fig, ax = plt.subplots()
ax.plot(black_rt['prob'], black_rt['cum_fpm'], drawstyle='steps', color='b', label='Black')
ax.plot(white_rt['prob'], white_rt['cum_fpm'], drawstyle='steps', color='r', label='White')
ax.set_xlim(1, 0)  # decreasing probs
plt.xticks(np.arange(1.0,-0.1,-0.1))
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Mean False Positive Rate')
ax.grid(True,linestyle='--')
ax.legend(facecolor='white', framealpha=1)
plt.savefig('FP_Rate.png', dpi=2000, bbox_inches='tight')
plt.show()

So what this chart shows is that if we set our threshold to a particular predicted probability (X axis), based on the data we would expect a false positive rate (Y axis). Hence if we want to balance false positives, we just figure out the race specific thresholds for each group at a particular Y axis value. Here we can see the white line is actually higher than the black line, so this is reverse ProPublica findings, we would expect whites to have a higher false positive rate than blacks given a consistent predicted probability of high risk threshold. So say we set the threshold at 10% to flag as high risk, we would guess the false positive rate among blacks in this sample should be around 40%, but will be closer to 45% in the white sample.

Technically the lines can cross at one or multiple places, and those are places where you get equality of treatment and equality of outcome. It doesn’t make sense to use that though from a safety standpoint – those crossings can happen at a predicted probability of 99% (so too many false negatives) or 0.1% (too many false positives). So say we wanted to equalize false positive rates at 30% for each group. Here this results in a threshold for whites as high risk of 0.256, and for blacks a threshold of 0.22.

#Figuring out where the threshold is to limit the mean FP rate to 0.3
#For each racial group
white_thresh = white_rt[white_rt['cum_fpm'] > 0.3]['prob'].max()
black_thresh = black_rt[black_rt['cum_fpm'] > 0.3]['prob'].max()
print( white_thresh, black_thresh )

Now for the real test, lets see if my advice actually worked in a new sample of data to balance the false positive rate.

#Now applying out of sample, lets see if this works
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
recid_test['prob_min'] = pred_prob[:,0]

white_test = recid_test[recid_test['race'] == 'Caucasian']
black_test = recid_test[recid_test['race'] == 'African-American' ]

white_test['Flag'] = 1*(white_test['prob'] > white_thresh)
black_test['Flag'] = 1*(black_test['prob'] > black_thresh)

white_fp= 1 - white_test[white_test['Flag'] == 1][dep_var].mean()
black_fp = 1 - black_test[black_test['Flag'] == 1][dep_var].mean()
print( white_fp, black_fp )

And we get a false positive rate of 54% for whites (294/547 false positives), and 42% for blacks (411/986) – yikes (since I wanted a 30% FPR). As typical, when applying your model to out of sample data, your predictions are too optimistic. I need to do some more investigation, but I think a better way to get error bars on such thresholds is to do some k-fold metrics and take the worst case scenario, but I need to investigate that some more. The sample sizes here are decent, but there will ultimately be some noise when deploying this in practice. So basically if you see in practice the false positive rates are within a few percentage points that is about as good as you can get in practice I imagine. (And for smaller sample sizes will be more volatile.)

Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one has previously done before (including myself).

So here is a more general way to think about (from a police departments or criminal justice agencies perspective) whether it would be beneficial to reach out to me.

Should I do X?

So no one is going to be against different evidence based policing practices, but not all strategies make sense for all jurisdictions. For example, while focussed deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration the cost as well as the potential benefits of the program.

Should I do X may involve more open ended questions. I’ve previously conducted in person training for crime analysts that goes over various evidence based practices. It also may involve something more specific, such as should I redistrict my police beats? Or I have a theft-from-vehicle problem, what strategies should I implement to reduce them?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono — which you may ask what is in it for me then? As an academic I get evaluated mostly by my research productivity, which involves writing peer reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If it involves applying for a grant (on a project I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments. But my work with police departments is more meaningful. It is not zero-sum, I tend to get better ideas when understanding specific agencies problems.

So get in touch if you think I can help your agency!

Monitoring Use of Force in New Jersey

Recently ProPublica published a map of uses-of-force across different jurisdictions in New Jersey. Such information can be used to monitor whether agencies are overall doing a good or bad job.

I’ve previously discussed the idea of using funnel charts to spot outliers, mostly around homicide rates but the idea is the same when examining any type of rate. For example in another post I illustrated its use for examining rates of officer involved shootings.

Here is another example applying it to lesser uses of force in New Jersey. Below is the rate of use of force reports per the total number of arrests. (Code to replicate at the end of the post.)

The average use of force per arrests in the state is around 3%. So the error bars show relative to the state average. Here is an interactive chart in which you can use tool tips to see the individual jurisdictions.

Now the original press release noted by Seth Stoughton on twitter noted that several towns have ratio’s of black to white use of force that are very high. Scott Wolfe suspected that was partly a function of smaller towns will have more variable rates. Basically as one is comparing the ratio between two rates with error, the error bars around the rate ratio will also be quite large.

Here is the chart showing the same type of funnel around the rate ratio of black to white use-of-force relative to the average over the whole sample (the black percent use of force is 3.2 percent of arrests, and the white percent use of force is 2.4, and the rate ratio between the two is 1.35). I show in the code how I constructed this, which I should write a blog post about itself, but in short there are decisions I could make to make the intervals wider. So the points that are just slightly above a ratio of 2 at around 10,000 arrests are arguably not outliers, those more to the top-right of the plot though are much better evidence. (I’d note that if one group is very small, you could always make these error bars really large, so to construct them you need to make reasonable assumptions about the size of the two groups you are comparing.)

And here is another interactive chart in which you can view the outliers again. The original press release, Millville, Lakewood, and South Orange are noted as outliers. Using arrests as the denominator instead of population, they each have a rate ratio of around 2. In this chart Millville and Lakewood are outside the bounds, but just barely. South Orange is within the bounds. So those aren’t the places I would have called out according to this chart.

That same twitter thread other folks noted the potential reliability/validity of such data (Pete Moskos and Kyle McLean). These charts cannot say why individual agencies are outliers — either high or low. It could be their officers are really using force at different rates, it could also be though they are using different definitions to reporting force. There are also potential other individual explanations that explain the use of force distribution as well as the ratio differences in black vs white — no doubt policing in Princeton vs Camden are substantively different. Also even if all individual agencies are doing well, it does not mean there are no potential problem officers (as noted by David Pyrooz, often a few officers contribute to most UoF).

Despite these limitations, I still think there is utility in this type of monitoring though. It is basically a flag to dig deeper when anomalous patterns are spotted. Those unaccounted for factors contribute to more points being pushed outside of my constructed limits (overdispersion), but more clearly indicate when a pattern is so far outside the norm of what is expected the public deserves some explanation of the pattern. Also it highlights when agencies are potentially doing good, and so can be promoted according to their current practices.

This is a terrific start to effectively monitoring police agencies by ProPublica — state criminal justice agencies should be doing this themselves though.

Here is the code to replicate the analysis.

New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). Y axis is the number of violent crimes over along time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though, you could use totally objective crime data (say swap out acoustic gun shot sensors with reported crime) and you still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade off was 3 crimes vs 30 that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the no constraint hot spots. So in the hypothetical it would go from 30 crimes to 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential pop. will typically result in similar large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That uses a totally different method, but I think will be pretty similar outputs in practice — can tune the penalty factor in a similar way to changing the linear program constraint I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.

American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky — such as figuring out the proportion of black residents for block groups you need to add non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of area.

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.


The ACS data has consistent variable names over the years, such as B03001_001 is the total population, B03002_003 is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that given a directory of the excel template files will produce a nice set of dictionaries to help find what table particular variables are in.

#This python code grabs ACS meta-data templates
#To easier search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #getting the variables in a nice dictionaries
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        vars = [i.value for i in sh.row(0)] #names on the first row
        labs = [i.value for i in sh.row(1)] #labels on the second
        #now add to the overall dictionary
        for v,l in zip(vars,labs):
            lab_dict[v] = l
            loc_dict[v] = file
    #returning the two dictionaries
    return lab_dict,loc_dict
    
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
            'B16002_004']
            
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)

Paper published: Evaluating Community Prosecution Code Enforcement in Dallas, Texas

Some work John Worrall and I collaborated on was just published in Justice Quarterly, Evaluating Community Prosecution Code Enforcement in Dallas, Texas. I have two links to share:

If you need access to the article always feel free to email.

Below is the abstract:

We evaluated a community prosecution program in Dallas, Texas. City attorneys, who in Dallas are the chief prosecutors for specified misdemeanors, were paired with code enforcement officers to improve property conditions in a number of proactive focus areas, or PFAs, throughout the city. We conducted a panel data analysis, focusing on the effects of PFA activity on crime in 19 PFAs over a six-year period (monthly observations from 2010 to 2015). Control areas with similar levels of pre-intervention crime were also included. Statistical analyses controlled for pre-existing crime trends, seasonality effects, and other law enforcement activities. With and without dosage data, the total crime rate decreased in PFA areas relative to control areas. City attorney/code enforcement teams, by seeking the voluntary or court-ordered abatement of code violations and criminal activity at residential and commercial properties, apparently improved public safety in targeted areas.

This was a neat program, as PFAs are near equivalents of hot spots that police focus on. So for the evaluation we drew control areas from Dallas PD’s Target Area Action Grid (TAAG) Areas:

New course in the spring – Crime Science

This spring I will be teaching a new graduate level course, Crime Science. A better name for the course would be evidence based policing tactics to reduce crime — but that name is too long!

Here you can see the current syllabus. I also have a page for the course, which I will update with more material over the winter break.

Given my background it has a heavy focus on hot spots policing (different tactics at hot spots, time spent at hot spots, crackdowns vs long term). But the class covers other policing strategies; such as chronic offenders, the focused deterrence gang model, and CPTED. We also discuss the use of technology in policing (e.g. CCTV, license plate readers, body-worn-cameras).

I will weave in ethical discussions throughout the course, but I reserved the last class to specifically talk about predictive policing strategies. In particular the two main concerns are increasing disproportionate minority contact through prediction, and privacy concerns with police collecting various pieces of information.

So take my course!

Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definately be improved both in coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

Notes on using UCR data for class projects

Students in my classes often want to use UCR reported data for projects. One thing many don’t realize though is that the UCR data reported to the FBI is only aggregate statistics at regular intervals for the entire jurisdiction. So for example one can’t look at hot spots using reported UCR data.

If you do have a hypothesis that can be reasonably examined using monthly or yearly data at the jurisdiction level, here are a few notes on using UCR data. First is that you can get the most detailed downloads of data from ICPSR. That link has data series going back to 1960, and ends up being about two years behind (e.g. it is close to the end of 2017, and only 2015 data is available).

The datasets on ICPSR have monthly data for Part 1 crime types, as well as some information on arrests and clearances. Also they have all of the individual agencies, along with their ORI code. The ORI code allows you to link agencies over time.

While the FBI does have a page for more up to date UCR data (they just released the 2016 stats, so they are about a year behind), they are much more limited in the types of tables they disseminate. There typically is one table for Part 1 crime rates for individual large cities for each year, but otherwise it is aggregated to different city sizes. So most data analyses need to use the ICPSR data — the data directly from the FBI is not detailed enough.

For those wishing to map the data, it ends up being a bit tricky. Most people in the US are probably under the jurisdiction of at least two police departments — the local PD and the state police. Many people are also under the jurisdiction of a local sheriff. So many of these police agencies have overlapping boundaries. There is no easy source of the geographic boundaries for the police departments, but the ICPSR data does contain the zipcode for the headquarters for the police department. This won’t be accurate for state police — but should be suitable for mapping purposes for local agencies and sheriffs (sheriffs are sometimes organized at the county level). If you want polygon data for jurisdictional boundaries you will need to search for individual agencies and political boundaries — there is no easy source to download them all at once. Many rural areas will have police departments cover multiple towns, but if you stick to more urban areas you might be able to use city boundaries.

The ICPSR data has crime reports aggregated to the county level, so if that level of aggregation is not problematic you may use that data directly. You should be aware of many of the complaints about UCR data quality though. Mike Maltz has written a bit about it, but there are quite a few other folks who have noticed problems with reporting in the UCR data. The main problem to watch out for is missing data being accidentally reported as zero crimes occurring.

To stack datasets from different years from ICPSR is not too difficult if you are not going too far back in time. But if you go back to the older data, ICPSR changed the variable order. The variables are simply listed as V1 TO V100 something, so for example V15 in 1979 is not the same variable as V15 in 2005. My notes say they used the same variable order from 1998-2015, but you will want to check that yourself (I downloaded the SPSS files, it would not surprise me if the datasets differed for some of the years.)

Some additional resources students may want to familiarize themselves with to gather UCR data more quickly are the FBI UCR data tool and Mike Maltz’s cleaned up dataset and notes on how he made it. You should probably just use Mike Maltz’s dataset if you are using data over time.

If you are just interested in yearly homicides, I have provided a dataset of cleaned up homicides that goes back to 1960, see my paper that goes along with that dataset on graphing temporal homicide trends (mapping those trends could be an interesting project as well!)