An intro to linear programming for criminologists

Erik Alda made the point the other day on twitter that we are among the few crim folks who do anything related to linear programming. I think it is crazy useful – much more so than, say, teaching myself some new regression technique or another programming language.

I don’t quite remember the motivation to learn it. I think I kept seeing repeated applications in papers I read, but was also totally baffled by it; I did not understand people’s notation for it at all. In retrospect that was because it is not statistics. You are optimizing a function by choosing values for some decision variables (there is nothing stochastic about it, so there is no statistical inference). So it is more like finding the min/max of a function in calculus.

I think the best way to think about linear programming is in terms of decision analysis. We have a set of options among which we need to choose some action. So we make the choices that either maximize or minimize some objective, but also take into account constraints on the decisions we can make.

For social scientists, here is an example that hopefully illustrates the difference between statistics and linear programming. Say we are interested in conducting a hot spots policing randomized experiment. So we define our top 20 crime hot spots in the city, and randomly assign 10 of them to receive the hot spots treatment. Linear programming is basically the obverse of this: given our 20 hot spot areas, which 10 are the best locations to choose for our intervention?

You might be thinking this problem as stated is trivial – just rank the 20 hot spots by their total number of crimes and choose the top 10. Where linear programming really helps, though, is when you have constraints on the final choices you make. Say you did not want to choose hot spots that are within 1 mile of each other (to spread the interventions throughout the city). There is no simple way to sort your hot spots to obey that constraint, but you can encode it in the linear program and have the computer solve it quite easily, as the sketch below shows.
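For instance, here is a minimal sketch of that constrained selection using the pulp library (which I introduce more formally below). The crime counts and coordinates are simulated, so this is just to show the mechanics of encoding the pairwise distance constraint.

import numpy as np
import pulp

np.random.seed(1)
n_spots = 20
crimes = np.random.poisson(50, n_spots)        #simulated crime counts per hot spot
xy = np.random.uniform(0, 10, (n_spots, 2))    #simulated coordinates (miles)
dist = np.sqrt(((xy[:, None, :] - xy[None, :, :])**2).sum(axis=2))

M = pulp.LpProblem("Hot_Spot_Selection", pulp.LpMaximize)
S = pulp.LpVariable.dicts("Select", list(range(n_spots)), lowBound=0, upBound=1, cat=pulp.LpInteger)
#Objective: total crimes covered by the chosen hot spots
M += pulp.lpSum( S[i]*crimes[i] for i in range(n_spots) )
#Choose exactly 10 hot spots
M += pulp.lpSum( S[i] for i in range(n_spots) ) == 10
#No two chosen hot spots within 1 mile of each other
for i in range(n_spots):
    for j in range(i+1, n_spots):
        if dist[i, j] < 1:
            M += S[i] + S[j] <= 1
M.solve()
print( [i for i in range(n_spots) if S[i].varValue == 1] )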

There is no shortage of ways you could expand the complexity of this example hot spot decision analysis. Say you had two different types of hot spot treatments, and that they had different efficacy in different areas (one was good for property crime, and the other was better for violent crime). You might think of this as doing two separate decision analyses, where a constraint is that an area can only be assigned one of the two interventions.

Here I will provide some code examples in python using the pulp library, with data you can run yourself, to illustrate different ways to think about linear programming problems in practice. (Technically the examples I give are all mixed integer linear programs, as the decision variables are binary 0/1.)

Formulating Objectives and Constraints

For this example I will be simulating data, but imagine a case where you are an analyst for the IRS, and you want to determine which business tax returns to audit. We want to audit cases that both have a high probability of being fraudulent and have a large total underpayment. (If you want a more typical criminology example, imagine assigning criminal cases to detectives; some cases have higher costs, e.g. homicide vs burglary, and some cases have different probabilities of being solvable. This type of decision problem is very common in my experience – pretty much any time you have to make a set of binary choices, and those choices have variable costs/benefits.)

First I start off by simulating some data (the only libraries we need are numpy and pulp). I simulate 1000 business tax returns, each with an estimate of the probability it is fraudulent, prob_fraud, and an estimate of the amount it underpaid, underpay_est.

import numpy as np
import pulp

###########################################################
#Simulate data for costs and probabilities

np.random.seed(10)
total_cases = 1000
underpay_est = np.random.uniform(1000,100000,total_cases)
prob_fraud = np.random.uniform(0,1,total_cases)
exp_return = prob_fraud*underpay_est

###########################################################

The objective we will be maximizing is the expected return of auditing a tax return, exp_return, which is simply the probability of fraud multiplied by the amount of the underpayment. For a simple example, say we have a case where the probability of fraud is estimated to be 50%, and the estimated underpayment amount is $10,000. Our expected return for auditing that case is then $5,000.

We need these two estimates external to our linear programming problem, and they themselves can be informed by predictive models (or simpler rules of thumb, e.g. underpayment is roughly 30% of deductions, or something like that).

Now we have all we need to set up our linear programming problem. I am going to choose 100 cases out of these 1000 to audit. Hopefully the code is documented enough to see the creation of the decision variables (each tax return gets a 1 if it is chosen, or a 0 if it is not), the objective function we are maximizing, the constraint on the total number of audits, and the results.

#Setting up the problem
case_index = list(range(total_cases))
tot_audit = 100

####################################
#Basic Problem
P = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
D = pulp.LpVariable.dicts("Decision Variable", [i for i in case_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Objective Function
P += pulp.lpSum( D[i]*exp_return[i] for i in case_index)
#Constraint on total number of cases audited
P += pulp.lpSum( D[i] for i in case_index ) == tot_audit
#Solve the problem
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return
print( (dec_np * exp_return).sum() )
#Should be the same
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )
####################################

If you are following along in python, you will see that the total expected return is 7,287,915, and the estimated hit rate (or clearance rate) of the audits is around 0.88.

This example would be no different if we just chose the top 100 cases based on the expected return. Say though that you thought the 88% hit rate was too low. Choosing on expected return alone will pick some cases with big dollar amounts but not a very high probability of fraud. So you may say, I want my clearance rate to be over 90% overall. That is a simple constraint to add into the above model.

####################################
#Updating the problem to constrain on the hit rate
#Above a particular threshold
hit_rate = 0.9
P += pulp.lpSum( D[i]*prob_fraud[i] for i in case_index ) >= hit_rate*tot_audit
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return is slightly lower than before
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )
####################################

So now the total expected return is lower than without the constraint, 7,229,140 (so a reduction of about $60k), but our expected hit rate is just above 90%.

You may be thinking, “why not just eliminate cases with a probability lower than 90%, and then amongst those left over select the highest expected return?” That meets the constraint, but has a lower expected return than this program! Think of this program as more give-and-take. High expected return / lower probability audits can still be selected with this model, but you need to balance them out with some high probability cases to tip the scales and meet the overall hit rate constraint.

Trade-Offs and the Frontier Curve

So you may be thinking, ok, the trade-off to get a 90% clearance rate was not too bad in terms of total extra taxes collected, so why not set the constraint to 95%? When you add constraints, the optimal value of the objective function can only get worse (lower, or at best equal, for a maximization problem). The question then becomes quantifying that trade-off.

You can subsequently vary the hit rate constraint and see how much it changes the total expected return. Here is an example of doing that; each model only takes around a second to solve.

###########################################################
#Drawing the trade-off in hit rate vs expected return

hit_rate = np.linspace(0.85, 0.95, 30)
total_return = []

#Function to estimate the model
def const_hit_rate(er, prob, tot_n, hr):
    c_index = range(len(er))
    Prob = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
    Dec = pulp.LpVariable.dicts("Decision Variable", [i for i in c_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Prob += pulp.lpSum( Dec[i]*er[i] for i in c_index)
    Prob += pulp.lpSum( Dec[i] for i in c_index ) == tot_n
    Prob += pulp.lpSum( Dec[i]*prob[i] for i in c_index ) >= hr*tot_n
    Prob.solve()
    dec_li = [Dec[i].varValue for i in c_index]
    dec_np = np.asarray(dec_li)
    return pulp.value(Prob.objective), dec_np

for h in hit_rate:
    print(f'Estimating hit rate {h}')
    obj, dec_res = const_hit_rate(exp_return, prob_fraud, 100, h)
    total_return.append(obj)

###########################################################
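To visualize the frontier, a minimal matplotlib sketch using the hit_rate and total_return arrays generated above might look like:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6,4))
ax.plot(hit_rate, total_return, marker='o')
ax.set_xlabel('Hit Rate Constraint')
ax.set_ylabel('Total Expected Return')
plt.show()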

For this simulated data example, there end up being pretty severe trade-offs in the total return once you get above a 91% hit rate, so it may not be worth it to push for a much higher hit rate in practice. It just depends on how much you are willing to trade off one for the other.

There are other ways to formulate this trade-off (via bi-objective functions/soft constraints, or weighted ranking schemes), but the blog post is long enough as is!

Other Potential Applications

So in terms of my work, I have examples of using linear programs to make spatial location decisions, encode fairness constraints into predictive policing, and allocate treatment assignment with network spillovers.

Erik Alda and Joseph Ferrandino have conducted frontier analysis of different criminal justice organizations, which is based on estimating the frontier curve above from data (instead of a pre-specified objective function).

That is about it for criminologists that I know of, but there are plenty of applications of linear programming (or related concepts) to criminal justice topics. It is most popular among operations researchers, of which Laura Albert is one of my favorites. (Criminal justice as a field might not even exist were it not for Alfred Blumstein, who was also a very influential operations researcher.)

One of the things that makes this different from more traditional quantitative work in the social sciences is that, again, it is not statistics – we are not testing hypotheses. The contribution is formulating the decision problem in a tractable way that can be solved, and drawing out the trade-offs as I showed above.

That is one of the reasons I really like it though – unlike vaguely saying your regression model could be used to inform decisions, this much more explicitly shows the utility of the results of those models in practice.

300 blog posts and public good criminology

This isn’t technically my 300th blog post, but the 300th page I’ve constructed on my blog (so e.g. it includes when I’ve made a page for a class). I’ve posted a spreadsheet of the titles and dates of the posts over time (and updating it I noticed I was at 300).

I typically get around 200~300 views per day. Most of these are probably bots, but unless say over 90% are bots this website gets way more views than the cumulative views of all my academic papers combined. Here is a screen shot of the stats wordpress gives to me. My downtick in 2019 I thought was going to spiral into very few views, but it is still holding on.

I kind of have three different types of blog posts. The first is example code snippets/data analysis. Often these are things I have done multiple times, so I want a record I can easily search up later. For example making a hexbin map in ggplot, or a margins plot in Stata. I wrote a recent post because I was talking with a friend about crime weights, and I wanted an example of using regression in python and an error bar plot for my library. (Quite a few birds with that stone.)

The second is questions I repeatedly get from students. For example, I made a list of demographic variables I use from the census, and where to find or scrape crime generator variables. Consistently my most popular post is testing the equality of two regression coefficients.

The third is more generic opinion pieces. For example my notes on (the now late) David Bayley’s writing on the police potential to reduce crime, or Jane Jacobs’s take on neighborhoods, or that I don’t think latent trajectories are real things.

Some are multiple of these categories put together, particularly opinion pieces with example code snippets to illustrate the points I am making. Like a simulation of why I like to model individual delinquency items, or how to balance false positives in bail decisions.

On Public Good Criminology

None of these per se fit the framework of typical peer reviewed output. But despite having no peer review, I think things like deriving optimal treatment allocation with network spillovers, or showing that conformal prediction intervals for synthetic control estimates are much smaller than permutation tests, are substantive contributions worth sharing!

So that brings me to the public good point. Most criminologists default to only valuing the closed peer review system. Despite my blog posts not being peer reviewed (ditto for the pre-prints I post at first), I hope folks take the time to judge for themselves whether they are valuable or not. We would be much better off as a group if we did things like share code, share class preps, or share failed projects by default.

Some of these posts I might write up if we had a short journal for our field akin to Economics Letters, but even that is a lot of work for very little value added to be frank. (If I had infinite time I also might turn my notes on Poisson/Negative Binomial regression into a little Sage green book.) Being a private sector data scientist now without the tenure boot on my neck, I don’t really have any need or desire to go through that process.

If all you value is getting the opinions of a handful of other academics, then by all means keep your work close to the chest and only publish in peer reviewed journals. If you want to provide a public good though, your work actually needs to be public.

Conjoint Analysis of Crime Rankings

So part of my recent research mapping crime harm spots uses cost of crime estimates relevant to police departments (Wheeler & Reuter, 2020). But a limitation of this is that cost of crime estimates are always somewhat arbitrary.

For a simple example, those cost estimates are based mostly on the people-time the PD spends responding to crimes and devoting investigative resources. Many big city PDs entirely triage crimes like breaking into vehicles though. So based on PD response, the cost of those crimes is basically $0 (especially if the PD has an online reporting system).

But I don’t think the public would agree with that sentiment! So, in an act of cognitive dissonance with my prior post, I think asking the public is likely necessary for police to ultimately serve the public’s interest when doing these valuations. For some ethical trade-offs (like targeting hot spots vs increasing disproportionate minority contact, Wheeler, 2019) I am not sure there is any other reasonable approach than simply getting a bunch of people’s opinions.

But that being said, I suspected that these different metrics would provide pretty similar rankings for crime severity overall. So while it is criminology 101 that official crime and normative perceptions of deviance are not a perfect 1 to 1 mapping, most folks (across time and space) have largely similar agreement on the severity of different crimes, e.g. that assault is worse than theft.

So what I did was grab some survey ranking of crime data from the original source of crime ranking that I know of, Marvin Wolfgang’s supplement to the national crime victimization survey (Wolfgang et al., 2006). I have placed all the code in this github folder to replicate. And in particular check out this Jupyter notebook with the main analysis.

Conjoint Analysis of Crime Ranks

This analysis is often referred to as conjoint analysis. There are a bunch of different ways to conduct conjoint analysis – some ask folks to create a ranked list of items, others ask folks to choose among a short list of items, and others ask folks to rate problems on a 1-5 Likert scale. I would guess Likert items are the most common in our field, see for example Spelman (2004) using surveys asking people about disorder problems (and that data is available too, Taylor, 2008).

The Wolfgang survey I use here is crazy complicated, see the codebook, but in a nutshell they had an anchoring question where they assigned stealing a bike to a value of 10, and then asked folks to give a numeric score relative to that theft for a series of 24 other crime questions. Here I only analyze one version of the questionnaire, and after eliminating missing data there are still over 4,000 responses (in 1977!).

So you could analyze those metric scores directly, but I am taking the lazy route and just doing a rank ordering (where ties get the average rank) within person. Then conjoint analysis is simply a regression predicting the rank. See the notebook for a more detailed walkthrough; this just produces the same analysis as looking at the means of the ranks.

About the only thing I do differently here than typical conjoint analysis is that I rescale the frequency weights (which just changes the degrees of freedom for the standard error estimates) to account for the repeated nature of the observations (i.e. I treat it like a sample of 4,000-some observations, not 4,000*25 observations). (I don’t worry about the survey weights here.)
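To make the set-up concrete, here is a minimal sketch of the rank-then-regress approach (this is not the notebook code – the data and column names are made up, and I omit the frequency-weight rescaling to keep it simple):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#Fake long format survey data, one row per person per crime item
np.random.seed(0)
items = ['theft_bike','burglary','assault','robbery']
n_people = 200
dat = pd.DataFrame([(p, i) for p in range(n_people) for i in items],
                   columns=['person','crime'])
dat['score'] = np.random.uniform(1, 100, len(dat)) #stand-in for the numeric severity scores

#Rank within person, ties get the average rank
dat['rank'] = dat.groupby('person')['score'].rank(method='average')

#Conjoint analysis is then just a regression of the rank on the crime items
mod = smf.ols('rank ~ C(crime)', data=dat).fit()
print(mod.params) #coefficients are differences in mean ranks relative to the reference item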

To test my assertion that these different ranking systems will be largely in agreement, I take Jerry’s crime harm paper (Ratcliffe, 2015), which is based on sentencing guidelines, and map its items as best I could to the Wolfgang questions (you could argue with me some on those assessments – and some questions don’t have any analog, like a company dumping waste). I rescaled the Wolfgang rankings to a range of 1-14, same as Jerry’s, instead of 1-25.

Doing a deeper dive into the Wolfgang questions, there are definitely different dimensions in the nature of the questions you can tease out. Folks clearly take into account both harm to the victim and total damage/theft amounts. But overall the two systems are fairly correlated. So if an analyst wants to make crime harm spots now, I think it is reasonable to use one of these ranking systems, and then worry about getting the public perspective later on down the line.

The Wolfgang survey is really incredible. In this regression framework you can either adjust for other characteristics (e.g. it asks about all the usual demographics) or look at interactions (do folks who were recently victimized up their scores?). So this is really just scratching the surface. I imagine if someone redid it with current data many of the metrics would be similar, although if I needed to do this I don’t think I would devise something as complicated – I would just ask people to rank a smaller set of items directly.

References

  • Ratcliffe, J.H. (2015). Towards an index for harm-focused policing. Policing: A Journal of Policy and Practice, 9(2), 164-182.
  • Spelman, W. (2004). Optimal targeting of incivility-reduction strategies. Journal of Quantitative Criminology, 20(1), 63-88.
  • Taylor, R.B. (2008). Impacts of Specific Incivilities on Responses to Crime and Local Commitment, 1979-1994: [Atlanta, Baltimore, Chicago, Minneapolis-St. Paul, and Seattle]. https://doi.org/10.3886/ICPSR02520.v1
  • Wheeler, A.P., & Reuter, S. (2020). Redrawing hot spots of crime in Dallas, Texas. https://doi.org/10.31235/osf.io/nmq8r
  • Wheeler, A.P. (2019). Allocating police resources while limiting racial inequality. Justice Quarterly, Online First.
  • Wolfgang, M.E., Figlio, R.M., Tracy, P.E., and Singer, S.I. (2006). National Crime Surveys: Index of Crime Severity, 1977. https://doi.org/10.3886/ICPSR08295.v1

Admin data should be used more often in policing research

I sometimes wonder if many researchers actually know what data police departments regularly collect. I commonly see articles on topics and think to myself “Hey, it is nice you did a survey on XYZ, but why did you not confirm the responses with the admin data on the same topic?”. Or I see topics that could reasonably be addressed using admin data not tackled at all by researchers.

So I decided to write this blog post.

I’ve mostly made a career to date out of analyzing administrative police data (only 2 of my 30-some peer reviewed papers at this point use non-regularly-collected data as part of the analysis – and both of those link surveys to official crime records). To be honest I’m also motivated to write this because it is common for senior academics (in criminology generally, not just policing researchers) to critique secondary data analysis (some of those folks are curmudgeons though, so maybe not worth stating). Of course you can do bad analysis with whatever data – primary or secondary makes no difference.

I think the default though should be to leverage admin data, so this sentiment I believe is in general misguided, and results in a lot of waste (time and money spent on primary data collection). I have never received research funding directly in my career (only as an RA for Rob Worden), so my work has essentially been for “free” on these projects (just my time). (I was basically subsidized by the university to do research!)

My opinion is based on two key points:

  1. Administrative data has already been collected by police agencies, so it has no additional costs for use by researchers.
  2. Administrative data defines core outcomes that police agencies strive to reduce.

For 2 in particular this is reducing reported crime and reducing use of force. (Use of force can be conceived of as an “output” instead of an “outcome”, but I tend to think of it as a negative externality that should be minimized to the extent possible.) I’m sure a few folks are thinking here “these don’t define the potential universe of outcomes police departments are interested in” and I agree – permit me to discuss this in more detail in a few paragraphs. The argument I am making is ultimately fuzzy – not that we shouldn’t collect other data, but it should meet a higher threshold than using zero-cost data already collected by PDs.

What is Admin Policing Data?

For folks not familiar, police departments keep electronic records of various things, mostly related to crime and interactions with the public. All police departments I have worked with have these types of records in various tables/databases:

  • calls to 911 (Computer Aided Dispatch)
  • reported crimes and incidents
  • charges & arrests
  • discretionary stops (traffic and pedestrian)
  • use of force

All of these tables can be linked to individual officers and/or individual citizens, and have a date-time and location stamp of where the event happened. So you can do things like see all the cases detective X has been assigned and his specific clearance rate, all cases in which Y was listed as a victim, or the stop/use-of-force patterns of officer Z over time, etc.
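As a trivial example of the kind of query this linking enables, a per-detective clearance rate is just a groupby away. Here is a sketch with a made-up incidents table (real tables obviously have many more fields):

import pandas as pd

#Hypothetical incident-level table with the assigned detective and a clearance flag
incidents = pd.DataFrame({'detective_id': [101, 101, 102, 102, 102],
                          'cleared': [1, 0, 1, 1, 0]})

clear_rate = incidents.groupby('detective_id')['cleared'].agg(['mean','size'])
clear_rate.columns = ['clearance_rate','n_cases']
print(clear_rate)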

Other types of admin data that are pretty regular are psych screenings (especially for newer officers), civilian complaints, plain text detective/case notes, gang related databases (people/tags/incidents), databases of reported/recovered stolen goods, etc. Police collect a lot of data! At this point PDs often have this data going back over a decade.

How often is Admin Policing Data Used in Policing Journal Articles?

To illustrate my point that admin data should be used more in policing research, I took the most recent issues of several policing journals and counted up the articles that used admin data. (There are probably more policing journals I missed, sorry, these are the ones I know of/have submitted articles to in the past.)

So this is a total of 14/50, ~28%, in this sample. This is actually higher than I expected (I guessed 10%). Looking at the first issue of Police Quarterly for 2020 it is 0/5. The Policing: A Journal of Policy and Practice issue also contained a special sub-issue on recruit training, among which 0/6 likely used administrative data. The first 2020 issue of Policing: An International Journal had a special issue on cyber crime, which appears to me to have 2/14 papers using admin data. So if I add those stats, it is 16/75, ~21%.

I may be undercounting admin data here; for example I assume a survey of recruits is not a regular data collection (it hasn’t been in any police agency I’ve been involved with), but I of course may be wrong.

I’ve counted looking at detective case notes as admin data (it is sort of like secondary analysis of a qualitative dataset!). I also counted as admin data one article that used the NCVS – which is regularly collected data (but by the federal government, not a local PD).

So you may squabble with my definitions here, but in broad strokes I don’t think any reasonable definition is likely to push this above ~1/3 of papers in policing research using regularly collected admin data (in this sample of policing journals).

For reference I did a Twitter poll asking what proportion of policing research folks thought used admin data, and the distribution of the 86 responses slightly favored the right category (under 1/3rd), although almost the same number guessed over 2/3rds.

So you can see a significant number of folks think that the distribution is opposite what it is in practice – the majority, not the minority, of policing research uses specially collected data and ignores admin data.

Restricting the subset to policing journals is likely to bias the estimate downward somewhat. I bet if I pulled policing articles from say Journal of Experimental Crim or Crime Science they are closer to 100% using admin policing data. But I think that also illustrates a pretty big discord in the current field of policing as well.

Some may think this cuts the research along criminology/criminal justice lines – policing journals publish work examining police behavior, whereas other journals more frequently look at crime outcomes associated with “criminological” research. This may be true, but the admin data collected by police departments is pretty relevant for examining police behavior (e.g. proactive stops, use of force). These admin measures are almost always more relevant to police behavior than surveys of opinions! If you do surveys you should often tie them to these other admin measures to provide secondary evidence on the different relevant measures.

What’s Wrong with Collecting New Data?

My argument is explicitly value-laden – I don’t know the correct percent of policing research that should use admin police data. But I do think the current swing, in which the clear majority of research collects primary data, is wrong. Those primary data collections have more costs (above data already collected by police agencies) and, for the most part, ignore the core outcomes PDs strive for.

For example, the National Institute of Justice has stated they want researchers to move away from admin data. One reason given is that past researchers have been unsuccessful at lowering crime, so you should collect alternative measures to validate your intervention.

This I believe is an actively harmful perspective called “goal switching,” and it in general makes little sense. If crime is so rare that a study is ultimately poorly powered, there isn’t much potential benefit to reducing crime in that area even if the intervention does work in practice. Best case, you need to do longer interventions. I mean, you can look at community sentiment if you want to when trying to reduce violent crime; it doesn’t make sense though to entirely drop the ultimate goal of violence reduction in its place!

And this gets to the crux of core outcomes police should strive for. It is a normative question, but I believe reduced crime and reduced use of force are relatively well agreed upon general goals of police. I think it is OK to have secondary measures – such as say attitudes towards police or fear of crime or measures of police stress. But these measures have several things working against them.

One, they are not regularly collected as administrative datasets. I imagine you can troll up a few examples of PDs who have started to do regular surveys of attitudes towards police (either general public or specific post-PD contact), but vast majority have not. So say you have an intervention intended to improve attitudes towards police. Great! For a police department interested in implementing that program, they not only have to allocate resources to that project, but also put an item in the budget to do the surveys forever. (This isn’t always true though, I think for example Rylan Simpson’s work is strong enough to justify making those low cost appearance changes and you don’t need to forever do surveys to see if it is working.) But for most interventions you can’t just do it once and hope it has improved indefinitely! (Same as you can’t stop measuring crime just because something you did made crime go down one time.)

Two, it is pretty fuzzy whether they should reasonably be swapped in for the goals of crime reduction and reduced use of force in-and-of themselves. For the sake of argument, say hot spots policing causes backfire effects that increase fear of crime. How exactly do you trade off fear of crime vs actual crime reduction? Personally I think actual crime reductions should take precedence in that scenario. If you want to justify actually measuring fear of crime, you need to make some value based arguments to justify at minimum the cost of doing the surveys. You should also probably justify altering police behavior in a particular way to improve that particular metric as well.

So any time you do a secondary data collection, you need to actually valuate the costs of the measures somehow (which I know is very difficult, hence it makes more sense to default to using admin data that is costless in terms of research!) Costless is probably a bit of a misnomer though – police departments have already sunk a lot of resources into collecting that admin data (patrol officers likely spend about equal time on dealing with people as they do with paperwork). But it is costless in terms of capital for me to query a database and say “use of force went down 10% after you instituted this policy”.

I think plenty of research collecting unique measures has the potential to meet this threshold. One of the motivations to write this was Lois James’s articles on EIS – I think her general idea of doing a deeper dive to tease out more detailed interaction measures could be really important work (especially if it can be automated in a particular way, say through BWC footage). Lois’s work is just one example though. I also think measures of police stressors could be very important in measuring churn of police officers over time. I already stated I think Rylan Simpson’s work on perceptions of police is well justified based on his simple experiments (since they are very low cost interventions, like wearing purple gloves instead of black, or no-cost ones, e.g. taking off your sunglasses when interviewing folks).

So these have the potential to be worth the cost for police departments to open up their pocket books and collect those measures, but that is a bridge further than the majority of research currently being published in policing journals.

Some Caveats

So this is like I said a value-laden and fuzzy argument. No doubt some folks doing qualitative research or surveys will think this is loathsome, and think “I can’t answer my research question using administrative data”.

I intend the argument to go the other way though – we can be doing so much more quality research for much less cost. It is also the case that, I believe, folks in general need to do a much better job tying contemporary policing research to actual real life outcomes such as crime and use of force. Like I said, I think the default should be basically the opposite of the proportions policing research shows at the moment.

I’m not saying folks can’t do more basic data measures and collection – but as is, the vast majority of this research lacks any semblance of a cost-benefit analysis that would justify the cost to collect those measures. As is, even if folks’ hypotheses are validated in a one-time data collection, they lack the necessary valuation to justify police departments implementing those measures going forward in practice. (Many of these same valuation critiques apply to the use of technology in policing, although it is the obverse – not much academic work, but plenty of sinking $$ into tech with little return in terms of measurable outcomes.)

One thing I have not touched on is access. Folks may be thinking “I can’t get access to that info!”. You actually probably can though – I don’t know a PD that would let you do a survey or interviews that also wouldn’t share much of this admin data.

Another thing I have not touched on is bias in admin data. That deserves a whole additional blog post. It is a fair critique in part (bias no doubt exists; the question is quantifying how large it is and its impact on the analysis). The majority of the work in these policing journals, though, is not using alternative measures to get around bias in admin data – they are measuring totally different things (as I said, goal switching to totally different outcomes).

Notes on matplotlib and seaborn charts (python)

My current workplace is a python shop. I actually didn’t use pandas/numpy for most of my prior academic projects, but I really like pandas for data manipulation now that I know it better. I’m using python objects (lists, dictionaries, sets) inside of data frames quite a bit to do some tricky data manipulations.

I do however really miss using ggplot to make graphs. So here are my notes on using python tools to make plots, specifically the matplotlib and seaborn libraries. Here is the data/code to follow along on your own.

some set up

First I am going to redo the data analysis for predicting recidivism I did in a prior blog post. One change is that I noticed the default random forest implementation in scikit-learn was prone to overfitting the data – so one simple regularization is to limit the depth of trees, the number of samples needed to split, or the total number of samples in a final leaf. (I noticed this when I developed a simulated example where xgboost did well with the defaults, but random forests did not. It happened to be because the xgboost defaults had a way smaller number of potential splits; when using similar defaults they were pretty much the same.)

Here I just up the minimum samples per leaf to 100.

#########################################################
#set up for libraries and data I need
import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\matplotlib_seaborn'
os.chdir(my_dir)

#Modelling recidivism using random forests, see below for background 
#https://andrewpwheeler.com/2020/01/05/balancing-false-positives/

recid = pd.read_csv('PreppedCompas.csv')
#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']]
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1]
recid_test = recid_prep[recid_prep['Train'] == 0]

#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10, min_samples_leaf=100)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])

#Now applying out of sample
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
#########################################################

matplotlib themes

One thing you can do is easily update the base template for matplotlib. Here are example settings I typically use, in particular making the default font sizes much larger. I also like using a drop shadow for legends – although many consider drop shadows chart-junk, they actually help distinguish the legend from the background plot (a trick I learned from cartographic maps).

#########################################################
#Settings for matplotlib base

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}
 
print( matplotlib.rcParams )
#matplotlib.rcParams.update(andy_theme)

#print(plt.style.available)
#plt.style.use('classic')
#########################################################

I have it commented out here, but once you define your dictionary of particular style changes, then you can just run matplotlib.rcParams.update(your_dictionary) to update the base plots. You can also see the ton of options by printing matplotlib.rcParams, and there are a few different styles already available to view as well.

creating a lift-calibration line plot

Now I am going to create a plot that I have seen several names used for – I am going to call it a calibration lift-plot. Calibration is basically asking “if my model predicts something will happen 5% of the time, does it actually happen 5% of the time?” I used to always do calibration charts where I binned the data and put the predicted on the X axis and the observed on the Y (see this example). But data-robot has an alternative plot, where you superimpose those two lines, that has been growing on me.

#########################################################
#Creating a calibration lift-plot for entire test set

bin_n = 30
recid_test['Bin'] = pd.qcut(recid_test['prob'], bin_n, range(bin_n) ).astype(int) + 1
recid_test['Count'] = 1

agg_bins = recid_test.groupby('Bin', as_index=False)['Recid30','prob','Count'].sum()
agg_bins['Predicted'] = agg_bins['prob']/agg_bins['Count']
agg_bins['Actual'] = agg_bins['Recid30']/agg_bins['Count']

#Now can make a nice matplotlib plot
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Default_mpl.png', dpi=500, bbox_inches='tight')
plt.show()
#########################################################

You can see that the model is fairly well calibrated in the test set, and that the predictions range from around 10% to 75%. It is noisy and snakes high and low, but that is expected as we don’t have a real giant test sample here (around a total of 100 observations per bin).

So this is the default matplotlib style. Here is the slight update using my above specific theme.

matplotlib.rcParams.update(andy_theme)
fig, ax = plt.subplots(figsize=(6,4))
ax.plot(agg_bins['Bin'], agg_bins['Predicted'], marker='+', label='Predicted')
ax.plot(agg_bins['Bin'], agg_bins['Actual'], marker='o', markeredgecolor='w', label='Actual')
ax.set_ylabel('Probability')
ax.legend(loc='upper left')
plt.savefig('Mytheme_mpl.png', dpi=500, bbox_inches='tight')
plt.show()

Not too different from the default, but I only have to call matplotlib.rcParams.update(andy_theme) one time and it will apply it to all my charts. So I don’t have to continually set the legend shadow, grid lines, etc.

making a lineplot in seaborn

matplotlib is basically like base graphics in R, where if you want to superimpose a bunch of stuff you make the base plot and then add in lines() or points() etc. on top of the base. This is ok for only a few items, but if you have your data in long format, where a certain category distinguishes groups in the data, it is not very convenient.

The seaborn library provides some functions to get closer to the ggplot idea of mapping aesthetics using long data, so here is the same lineplot example. seaborn builds stuff on top of matplotlib, so it inherits the style I defined earlier. In this code snippet, first I melt the agg_bins data to long format. Then it is a similarish plot call to draw the graph.

#########################################################
#Now making the same chart in seaborn
#Easier to melt the data to long format

agg_long = pd.melt(agg_bins, id_vars=['Bin'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

plt.figure(figsize=(6,4))
sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
plt.savefig('sns_lift.png', dpi=500, bbox_inches='tight')   
#########################################################

By default seaborn adds in a legend title – although it is not stuffed into the actual legend title slot. (This is because they will handle multiple sub-aesthetics more gracefully I think, e.g. map color to one attribute and dash types to another.) But here I just want to get rid of it. (Similar to maps, no need to give a legend the title “legend” – should be obvious.) Also the legend did not inherit the white edge colors, so I set that as well.

#Now lets edit the legend
plt.figure(figsize=(6,4))
ax = sns.lineplot(x='Bin', y='Probability', hue='Type', style='Type', data=agg_long, dashes=False,
             markers=True, markeredgecolor='w')
plt.xlabel(None)
handles, labels = ax.get_legend_handles_labels()
for i in handles:
    i.set_markeredgecolor('w')
legend = ax.legend(handles=handles[1:], labels=labels[1:])
plt.savefig('sns_lift_edited_leg.png', dpi=500, bbox_inches='tight')

making a small multiple plot

Another nicety of seaborn is that it can make small multiple plots for you. So here I conduct analysis of calibration among subsets of data for different racial categories. First I collapse the different racial subsets into an other category, then I do the same qcut, but within the different groupings. To figure that out, I do what all good programmers do, google it and adapt from a stackoverflow example.

#########################################################
#replace everyone not black/white as other
print( recid_test['race'].value_counts() )
other_group = ['Hispanic','Other','Asian','Native American']
recid_test['RaceComb'] = recid_test['race'].replace(other_group, 'Other')
print(recid_test['RaceComb'].value_counts() )

#qcut by group
bin_sub = 20
recid_test['BinRace'] = (recid_test.groupby('RaceComb',as_index=False)['prob']
                        ).transform( lambda x: pd.qcut(x, bin_sub, labels=range(bin_sub))
                        ).astype(int) + 1

#Now aggregate two categories, and then melt
race_bins = recid_test.groupby(['BinRace','RaceComb'], as_index=False)['Recid30','prob','Count'].sum()
race_bins['Predicted'] = race_bins['prob']/race_bins['Count']
race_bins['Actual'] = race_bins['Recid30']/race_bins['Count']
race_long = pd.melt(race_bins, id_vars=['BinRace','RaceComb'], value_vars=['Predicted','Actual'], var_name='Type', value_name='Probability')

#Now making the small multiple plot
d = {'marker': ['o','X']}
ax = sns.FacetGrid(data=race_long, col='RaceComb', hue='Type', hue_kws=d,
                   col_wrap=2, despine=False, height=4)
ax.map(plt.plot, 'BinRace', 'Probability', markeredgecolor="w")
ax.set_titles("{col_name}")
ax.set_xlabels("")
#plt.legend(loc="upper left")
plt.legend(bbox_to_anchor=(1.9,0.8))
plt.savefig('sns_smallmult_niceleg.png', dpi=500, bbox_inches='tight')
#########################################################

And you can see that the model is fairly well calibrated for each racial subset of the data. The other category is more volatile, but it also has a smaller number of observations. Overall it does not look too bad. (If you take out my 100-samples-per-leaf requirement though, all of these calibration plots look really bad!)

I am having a hellishly hard time mapping sns.lineplot to the sub-charts, but you can just use normal matplotlib plots. When you set the legend, it defaults to the last figure that was drawn, so one way to set it where you want is to use bbox_to_anchor and just test to see where it lands in a good spot (you can use negative arguments in this function to push it further to the left). Seaborn has nice functions to map the grid title text using formatted string substitution. And the post is long enough as is, so you can play around yourself to see how the other options change the look of the plot.

For a few notes on various gotchas I’ve encountered so far:

  • For sns.FacetGrid, you need to set the size of the plots in that call, not by instantiating plt.figure(figsize=(6,4)). (This is because it draws multiple figures.)
  • When drawing elements in the chart, even for named arguments the order often matters. You often need to do things like set color first before other aesthetic arguments. (I believe my problem mapping sns.lineplot to my small multiple is an inheritance issue where plt.plot does not have named arguments for x/y, but sns.lineplot does.)
  • To edit the legend in a FacetGrid generated set of charts, ax returns a grid, not just one element. Since each grid inherits the same legend though, you can do handles, labels = ax[0].get_legend_handles_labels() to get the legend handles to edit if you want.

Using pytorch to estimate group based traj models

Deep learning, tensors, pytorch. Now that I have that seo junk out of the way 🙂 – I’ve been trying to teach myself some “Deep Learning”, as it is what all of the cool kids are doing these days.

I was having a hard time though with many of the different examples. Many are for image data, and so it was hard for me to translate that to actual applications I am interested in. Many do talk about dimension reduction and reducing to hidden layers, so I thought that was similar in nature to latent class analysis, such as group-based-trajectory-modelling (GBTM).

If you aren’t familiar with GBTM, imagine a scenario in which you cluster data, and then you estimate a different regression model to predict some outcome for each subset of the clustered data. This is just a way to do that whole set-up in one go, instead of doing each part separately. It has quite a few different names – latent class analysis and mixture modelling are two common ones. The only thing really different about GBTM is that you have repeated observations – so if you follow the same person over time, they should always be assigned to the same cluster/mixture.

In short you totally can do GBTM models in deep learning libraries (as I will show), but actually most examples that I have walked through are more akin to dimension reduction of columns (so like PCA/Canonical Correlation). But the deep learning libraries are flexible enough to do the latent class analysis I want here. As far as I can tell they are basically just a nice way to estimate systems of equations (with a ton of potential parameters, and do it on the GPU if you want).

So I took it as a challenge to estimate GBTM models in a deep learning library – here pytorch. In terms of the different architectures/libraries (e.g. pytorch, tensorflow, Vowpal Wabbit) I just chose pytorch because one of my co-workers suggested pytorch was easier to learn!

I’ve posted a more detailed notebook of the code, but it worked out quite well. So first I simulated two groups of data (50 observations in each group and 11 time periods). I added a tiny bit of random noise, so this (I was hoping) should be a pretty tame problem for the machine to learn.

The code to generate a pytorch module and have the machine churn out the gradients is pretty slick (less than 30 lines total of non-comments). Many GBTM code bases make you do the analysis in wide format (so one row per observational unit), but here I was able to figure out how to set it up in long data format, which makes it really easy to generalize to unbalanced data.
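To give a flavor of what I mean, here is a simplified sketch of that type of set-up (this is not the exact code from the notebook – the simulated data and variable names are just for illustration, and it only does two linear trajectory groups):

import numpy as np
import torch

torch.manual_seed(10)
np.random.seed(10)

#Simulate long format data: 2 groups, 50 people each, 11 time periods
n_per, T, K = 50, 11, 2
n_person = n_per*K
t = np.tile(np.arange(T), n_person)        #time index per row
person = np.repeat(np.arange(n_person), T) #person id per row
true_group = np.repeat([0, 1], n_per*T)
y = np.where(true_group == 0, 1.0 + 0.5*t, 8.0 - 0.3*t)
y = y + np.random.normal(0, 0.1, y.shape)  #tiny bit of noise

t_t = torch.tensor(t, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
person_t = torch.tensor(person, dtype=torch.long)

#Parameters: per-group intercept/slope, mixture logits, and log sigma
coef = torch.randn(K, 2, requires_grad=True)
mix_logit = torch.zeros(K, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([coef, mix_logit, log_sigma], lr=0.05)

#As noted below, this can take many iterations and may land in a local optimum
for it in range(3000):
    opt.zero_grad()
    #predicted trajectory for every row under each group, shape (K, n_obs)
    mu = coef[:, 0:1] + coef[:, 1:2]*t_t.unsqueeze(0)
    ll_obs = torch.distributions.Normal(mu, torch.exp(log_sigma)).log_prob(y_t.unsqueeze(0))
    #sum the row log-likelihoods within person (a person stays in one group)
    ll_person = torch.zeros(K, n_person).index_add(1, person_t, ll_obs)
    #mixture log-likelihood: logsumexp over groups of log(pi_k) + person log-likelihood
    log_pi = torch.log_softmax(mix_logit, dim=0).unsqueeze(1)
    loss = -torch.logsumexp(log_pi + ll_person, dim=0).sum()
    loss.backward()
    opt.step()

print(coef.detach())                            #group intercepts/slopes
print(torch.softmax(mix_logit, dim=0).detach()) #mixing proportions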

It took quite a few iterations to converge though (the iterations were super fast, but it is a tiny problem, so I am not sure how the timing will generalize), and it only converged when using the Adam optimizer (stochastic gradient descent converged to an answer with a similar mean square error, but nowhere near the right answer). These models are notorious for converging to sub-optimal locations, so that may just be an intrinsic part of the problem, and a good library needs to do better with starting conditions.

I have a few notes about potential updates to the code at the end of my Jupyter notebook. For count or binomial 0/1 data, that should be a pretty easy update. I also need to write code to do out of sample predictions (which I think I can figure out as well). A harder problem I am not sure how to solve is including an equation for the latent groups inside of the function. And I don’t know how to get standard errors for the coefficient estimates. Hopefully I can figure those out while trying to teach myself some more deep learning. I have a few convolution ideas I want to try out for spatial-temporal crime forecasting that include proactive police feedback, but I won’t get around to them for quite awhile I imagine.

Nearby Analysis Example (Excel)

The other day on Twitter I made a comment to Joel Caplan about how I would solve analysis with multiple buffers and not counting overlaps. A typical GIS workflow would go:

  • take your points of interest and create buffers
  • join the points to the buffer polygons, and get a count of the crimes of interest

I often do the analysis a different way though – I do a spatial join of the location of interest to the point features, which gives a field with the distance to the nearest feature, and then subsequently do the analysis on that distance field. In that workflow, it is much easier to change the size of the buffer for sensitivity analysis, or to conduct the analysis on different subsets of the data.

To start I am going to be working with a set of robberies in Dallas (from the open data, not quite 16k) and DART stations (n = 74). (DART is the Dallas above ground train.) You can access the Excel file I am doing the analysis with here. I often suggest Excel for undergrad/masters projects when students aren’t up to speed with programming – so this is a good illustration of that buffer analysis workflow.

Distance to Nearest

To start, I would typically use a GIS system (or R/python/SQL) to calculate the distance to a nearest object. But I don’t have access to Arc anymore, so I am going to show you a way to do this right in Excel. This only works for projected data (not latitude/longitude), and calculating distance from point-to-point.

So first, to figure out the distance between two points in Euclidean space, we can just use the Pythagorean theorem that you learned in grade school, Distance = sqrt( (x1 - x2)^2 + (y1 - y2)^2 ). Because we are doing this in an Excel spreadsheet and want to find the nearest Dart station to the robbery, we will use a little array formula magic. I named my table of Dart locations Dart, and so the array formula to find the nearest distance in Excel is:

=MIN( SQRT( (B2 - Dart[X])^2  + (C2 - Dart[Y])^2))

When you enter this formula, hit Ctrl + Shift + Enter, else it just returns the distance to the first Dart station. If you did this right, you will see the formula have {} brackets around it in the formula bar.

Distance will be defined in whatever the projected units are in – here they are in feet. But by using MIN with the array, it returns the distance to the nearest station. To get the ID of the associated station, we need to do a similar formula (and this only works with numeric ID fields). You can basically do an array IF formula, and the only station this is true for will be the MAX of that array. (Again hit Ctrl + Shift + Enter when finishing off this cell calculation instead of just Enter.)

=MAX(IF(F2=SQRT((B2 - Dart[X])^2  + (C2 - Dart[Y])^2), Dart[DartID],0))

User beware – this runs surprisingly fast on my machine, but it is quite a few computations under the hood. For much larger data, again, use a GIS/database/stats program to do these calculations.
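For reference, the same nearest-station calculation is only a few lines in python/numpy. Here is a sketch with made-up coordinate arrays – in practice you would read the robbery and DART X/Y columns in with something like pandas.read_excel:

import numpy as np

np.random.seed(2)
rob_xy = np.random.uniform(0, 10000, (100, 2))  #robbery X/Y in projected feet
dart_xy = np.random.uniform(0, 10000, (74, 2))  #DART station X/Y

#Pairwise Euclidean distances, then the min (and argmin for the station) per robbery
d = np.sqrt(((rob_xy[:, None, :] - dart_xy[None, :, :])**2).sum(axis=2))
near_dist = d.min(axis=1)       #distance to the nearest station
near_station = d.argmin(axis=1) #index of the nearest station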

Using Pivot Tables to do Buffer Analysis

So now that we have those distance fields, it is easy to write a formula to count up, say, the robberies within 1000 feet. You can do another IF formula that is something like IF([@Distance] < 1000, 1, 0).

And then go ahead and make a pivot table, and put the DartID as the rows, and the Within distance field you just made as the values (to sum in the pivot table).

Then bam, you have your buffer analysis. Here I sorted the pivot table so you can see the highest crime Dart is 12. (I haven’t looked up which one this is, you can use Excel though to map them out).

So say you wanted to change the buffer size? It is as simple as changing the 1000 in the prior formula to a different value. One thing I like to do though is to make a lookup table to define different bins. You can see I named that table BuffTable (naming tables makes it easier to refer to them later in array formulas; I also shifted the pivot table down so I don’t accidentally overwrite it later).

And now I use a combination of MATCH, to find what row of this table the distance falls into, and INDEX, to return the row label I want. So first I have =MATCH([@Distance],BuffTable[Within Bounds],1). This is very similar to VLOOKUP, and will match to the largest bound that is less than or equal to the distance.

This just returns the row number of the match though – I want to pipe in those nicer labels I made. To do that, I nest the match results within index, =INDEX(BuffTable, MATCH([@Distance],BuffTable[Within Bounds],1)+1, 2). And voila, I get my binned data.

Now we can do our pivot table so the columns are the new field we just made (make sure to refresh the pivot table).

And we can do our buffer analysis with varying buffers. Just update the table to however you want the buffers defined, hit refresh, and everything will be updated. (I should have done the labels so they are ordered a bit more nicely in the pivot table.)

I like this approach for students, as it is easy to pivot/filter on other characteristics as well. Want to get arrest rates around areas? Want to see changes in crimes nearby different DART stations over time? It is just a few formulas/filters and a pivot table away in this format.

Distance to Nearest Analysis for DART stations

Another analysis I think is useful is to look at the cumulative distance analysis. I got this idea from a paper of Jerry Ratcliffe’s.

So what you can do is round the distance data; e.g. a formula like this will round the data to every 300 feet.

=ROUND([@Distance]/300,0)*300

And then you can make a pivot table of the rounded counts. Here I also did additional stats to calculate the spatial density of the points, and show the distance decay curve.
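The same rounding-and-counting step is only a couple of lines in pandas as well (a sketch, assuming a series of nearest distances like the one computed in the numpy snippet earlier):

import numpy as np
import pandas as pd

near_dist = pd.Series(np.random.uniform(0, 6000, 1000)) #stand-in distances in feet
bin_ft = 300
decay = ((near_dist/bin_ft).round()*bin_ft).value_counts().sort_index()
print(decay.head()) #counts of robberies per rounded distance band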

Jerry’s paper I linked to looks for change points – I think that idea is somewhat misleading though. It does look like a change point in Jerry’s graphs, but that is a function of the binning I think (see this Xu/Griffiths paper, same method, finer bins, and shows a more smooth decay).

So here I tied the ROUND function to a cell, and again I can just update that value to a different bin size and everything gets auto-updated in the spreadsheet. Here is a bin size of 100 feet, which introduces some volatility in the very nearby locations, but you can see it still pretty much follows that smooth distance decay effect.

Actually the Xu/Griffiths paper looks at the street network distance, which I think makes more sense. (And again you need a GIS to do that analysis.) The buffer areas can behave funny, and won’t have a direct relationship to the street length exposure, so I think the typical Euclidean analysis can be misleading in some cases. I will need to save that for another blog post though!

Co-author networks in Criminology

In my bin of things I will never finish at this point, I started a manuscript looking at co-author networks in criminology using Web of Science data. I recruited several folks over the years (Jen Laprade and Richard Hernendez, who were grad students at the time, and Marie Oullet), but I was never able to put in the last bit of time to finish it off. Exploratory work is hard, as there is no end goal to work towards, so I was never able to get it to a point I was happy with.

The shambles of the current paper is here, and it contains more details than this post. But basically I downloaded all of the Web of Science data that had the CJ/Crim label attached up to 2016, then turned that into a co-author network.

So the way it works is, if I co-authored an article with Rob Worden & Sarah McLean, and Rob Worden & Sarah McLean co-authored a paper with Chris Harris, then me and Chris are not directly connected, but are just 1 degree apart. After doing this, I wanted to see if we clustered into different groups. The answer to that is yes – I can get the computer to spit out clusters (colored below) – but we are still definitely small world (everyone is connected to everyone else within only a few hops).
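If you want to replicate that construction, here is a minimal sketch in R with igraph and a made-up author–paper edge list (the real project used the Web of Science extract, and Louvain is just a stand-in for whatever clustering routine you prefer):

library(igraph)

#hypothetical author-paper pairs
el <- data.frame(author = c("Wheeler","Worden","McLean","Worden","McLean","Harris"),
                 paper  = c("P1","P1","P1","P2","P2","P2"))

g <- graph_from_data_frame(el, directed = FALSE)
V(g)$type <- V(g)$name %in% el$paper      #TRUE for papers, FALSE for authors
coauth <- bipartite_projection(g)$proj1   #author-to-author co-author network

#Wheeler and Harris are not directly connected, only linked through Worden/McLean
distances(coauth, v = "Wheeler", to = "Harris", weights = NA)

#community detection to pull out clusters
membership(cluster_louvain(coauth))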

I had a really hard time getting the network to lay out nicely (a typical problem with big, interconnected networks). I’ve posted an interactive version here. You can zoom in, look at the clusters, and look yourself up.

Here is a GIF showing surfing the network. I look up Beth Huebner (I would say Beth is part of the Michigan State/CJ folks cluster), see she is attached to Scott Decker (who is in another blue cluster with a pretty big array of folks – many Arizona people, but also Alex Piquero, Dan Nagin, and Shawn Bushway), then go on to Scott Wolfe, etc.

I figured the clusters would be by topical area (which is true to a certain extent), but they were also organized around universities. Here was my attempt to give some meaning to the clusters, by pulling out the top 3 authors/journals for each. There are some 40 clusters in the Excel file in the paper folder shared earlier. (There are more clusters than that even, but these are the 40 biggest in terms of authors/articles.)

So that gives some face validity to the clusters, but like I said it is small world, so maybe that isn’t worth noting at all anyway. One of the things I noticed was that the clusters showed a big separation between USA folks and international folks.

So if someone wants to take this over, let me know. I didn’t share a link to the data directly (I imagine that violates the Web of Science terms of service), but I will share it offline, plus my code, if someone wants it. (It is already 3+ years old data – I don’t even want to think about updating the work. Jen and Richard did a bunch of grunt work to clean the names for me to make the network.)

Coauthorship over time

One thing I noted was the change in co-authorship over time. How to evaluate folks on solo versus co-authored work is a perpetual question. I can’t answer that question, but we can observe how co-authorship is changing over time. Here are graphs of the proportion of solo-authored papers over time, as well as the mean number of authors over time (with error intervals – there is much more data in recent years than in the past).

The same pattern holds for our top journals (the WOS data is quite a hodge podge, including forensic psych, some trade magazines, etc.).
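Those summaries are just simple aggregations by year. Here is a hedged sketch of the calculation with fake data (the papers data frame and its columns are made up for illustration, not the WOS extract):

set.seed(2)
yr <- sample(1990:2016, 2000, replace = TRUE)
papers <- data.frame(year = yr, n_authors = 1 + rpois(2000, 0.05*(yr - 1985)))

#proportion solo and mean number of authors per year, with a rough standard error
per_year <- aggregate(n_authors ~ year, data = papers, FUN = function(x)
              c(prop_solo = mean(x == 1),
                mean_auth = mean(x),
                se_mean   = sd(x)/sqrt(length(x))))
per_year <- do.call(data.frame, per_year)  #flatten the matrix column
head(per_year)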

Citations Over Time

Another example bit of data analysis you can do with this dataset is look at citations over time. Here is the mean number of citations in well known crim/CJ journals over time.

And here is a scatterplot of the individual papers. I’ve posted an interactive version of this as well.

So there is more stuff zipping around in this data than I can handle. (I tried to make some sense of the article keywords at one point, but that would take some more serious semantic reduction of like words.)

The Failed Idea Bin: Temporal Aggregation and the Crime/Stop Relationship

A recent paper by the Hipp/Kim/Wo trio analyzing robbery at very fine temporal scales in NYC reminded me of a failed project I never quite worked out to completion. This project was about temporal aggregation bias. We talk about spatial aggregation bias quite a bit, which I actually don’t think is that big of a deal for many projects (for reasons discussed in my dissertation).

I think it is actually a bigger deal though when dealing with temporal relationships, especially when we are considering endogenous relationships between crime and police action in response to crime. This is because they form a countervailing endogenous relationship – most endogenous relationships are positively correlated, but here we think police do more stuff (like arrests and stops) in areas with more crime, and that crime then falls in response.

I remember the first time I thought about the topic was when I was working with the now late Dennis Smith and Robert Purtell as a consultant for the SQF litigation in NYC. Jeff Fagan had some models predicting the number of stops in an area, conditional on crime and demographic factors at the quarterly level. Dennis and Bob critiqued this as not being the right temporal aggregation – police respond to crime patterns much faster than at the quarterly level. So Jeff redid his models at the monthly level and found exactly the same thing as he did at the quarterly level. This however just raises the question of whether monthly is the appropriate temporal resolution.

So to try to tackle the problem I took the same approach as I did for my dissertation – I pretend I know what the micro level equations look like, then aggregate them up and see what happens. So I start with two endogenous equations:

crime_t1 = -0.5*(stops_t0) + e_c
stops_t1 =  0.5*(crime_t0) + e_s

And then aggregation is just a sum of the micro level units:

Crime_T = (crime_t1 + crime_t0)
Stops_T = (stops_t1 + stops_t0)

And then what happens when we look at the aggregate relationship?

Crime_T = Beta*(Stops_T)

Intuitively here you may see where this is going. Since crime and stops have the exact same countervailing effects on each other, they cancel out if you aggregate up one step. I however show in the paper that if you aggregate up more than two temporal units in this situation, the positive effect wins. The reason is that back substitution of the prior negative time series relationships oscillates in sign (so a negative covariance at t-1 becomes a positive covariance at further lags back). And in the aggregate the positive swamps the negative relationship. Even estimating Crime_T = Beta*(Stops_T-1) does not solve the problem. These endogenous auto-regressive relationships actually turn into an integrated series quite quickly (a point that cannot be credited to me – Clive Granger did a bunch of related work).
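To see the oscillation part in action, here is a small simulation sketch (my own illustration using the ±0.5 coefficients above, not the code from the draft paper) that just looks at the correlation between crime and stops at increasingly distant lags:

set.seed(10)
periods <- 50000
crime <- numeric(periods)
stops <- numeric(periods)
for (t in 2:periods){
  crime[t] <- -0.5*stops[t-1] + rnorm(1)
  stops[t] <-  0.5*crime[t-1] + rnorm(1)
}

#correlation of crime_t with stops_{t-h} for lags h = 1 to 6
sapply(1:6, function(h) cor(crime[(h+1):periods], stops[1:(periods-h)]))
#the sign flips back and forth as you go further back in time
#(negative, near zero, positive, near zero, ...), so the sign of any
#aggregate relationship depends on how these lagged pieces add up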

So this presented a few hypotheses. One, since I think short run effects for stops and crime are more realistic (think the crackdown literature), the covariance between them at coarser temporal aggregations (say monthly) should be positive – you should only be able to recover the deterrent effect of stops at very short temporal aggregations, I think. Two, crime and stops should be co-integrated at large temporal aggregations of a month or more.

Real life was not so convenient for me though. Here I have the project data and code saved. I have the rough draft of the theoretical aggregation junk here for those interested. Part of the reason this is in the failed idea bin is that neither of my hypotheses appears to be true with the actual crime and stop data. For the NYC citywide data I broke up stops into radio-runs and not-radio-runs (less discretion for radio runs, but they should still have similar deterrent effects), and crimes into Part 1 Violent, Part 1 Non-Violent, and Part 2. More recently I handed it off to Zach Powell, and he ran various vector auto-regression models at the monthly/weekly/daily/hourly levels. IIRC the evidence that stops at the finer temporal aggregations showed greater deterrent effects was pretty weak sauce.

There of course is a lot going on that could explain the results. Others have found deterrent effects using instrumental variable approaches (such as David Greenberg’s work using Arellano-Bond, or Wooditch/Weisburd using Bartik instruments). So maybe my idea that spatial aggregation does not matter is wrong.

Also there is plenty of stuff going on specifically in NYC. We had the dramatic drop in stops due to the same litigation. Further work by MacDonald/Fagan/Geller have shown stops that met a higher reasonable suspicion standard based on the reported data have greater effects than others (essentially using Impact zones as an instrument there).

So it was a question I was never able to figure out – how to correctly identify the right temporal unit to examine crime and deterrence from police action.

Some additional plots to go with Crime Increase Dispersion

So Jerry nerdsniped me again with his Crime Increase Dispersion statistic (Ratcliffe, 2010). The main motivation for this post is that, to be frank, I don’t find that stat very intuitive. So here are some alternate plots, based on the fact that counts of crime approximately follow a Poisson distribution. These get at the same question as Jerry’s work though: is a crime increase (or decrease) uniform across the city, or specific to a few particular sub-areas?

First, in R I am going to simulate some data. This creates a set of data with a constant 20% increase over 50 areas, but draws the post-period crime counts from a Poisson distribution (so it isn’t always exactly a 20% increase). I then create 3 outliers (two low places and one high place).

###########################################
#Setting up the simulation
set.seed(10)
n <- 50
low <- 10
hig <- 400
inc <- 0.2
c1 <- trunc(runif(n,low,hig))
c2 <- rpois(n,(1+inc)*c1)
#Putting in 2 low outliers and 1 high outlier
c2[5] <- c1[5]*0.5
c2[10] <- c1[10]*0.5
c2[40] <- c1[40]*2
#data frame for ggplot
my_dat <- data.frame(pre=c1,post=c2)
###########################################

The first plot I suggest is a simple scatterplot of the pre-crime counts on the X axis vs the post-crime counts on the Y axis. My make_cont function takes those pre and post crime counts as arguments and creates a set of contour lines to put as a backdrop to the plot. Points within those lines support the hypothesis that the area increased in crime at the same rate as the overall crime increase, taking into account the usual ups and downs you would expect with Poisson data. This is very similar to Jerry’s and my weighted displacement difference test (Wheeler & Ratcliffe, 2018), and uses a normal based approximation to examine the differences in Poisson data. I default to plus/minus three because crime data tends to be slightly over-dispersed (Wheeler, 2016), so coverage with real data should be a little better (although that is not necessary here).

###########################################
#Scatterplot of pre vs post with uniform 
#increase contours

make_cont <- function(pre_crime,post_crime,levels=c(-3,0,3),lr=10,hr=max(pre_crime)*1.05,steps=1000){
    #calculating the overall crime increase
    ov_inc <- sum(post_crime)/sum(pre_crime)
    #Making the sequence on the square root scale
    gr <- seq(sqrt(lr),sqrt(hr),length.out=steps)^2
    cont_data <- expand.grid(gr,levels)
    names(cont_data) <- c('x','levels')
    cont_data$inc <- cont_data$x*ov_inc
    cont_data$lines <- cont_data$inc + cont_data$levels*sqrt(cont_data$inc)
    return(as.data.frame(cont_data))
}

contours <- make_cont(c1,c2)

library(ggplot2)
eq_plot <- ggplot() + 
           geom_line(data=contours, color="darkgrey", linetype=2, 
                     aes(x=x,y=lines,group=levels)) +
           geom_point(data=my_dat, shape = 21, colour = "black", fill = "grey", size=2.5, 
                      alpha=0.8, aes(x=pre,y=post)) +
           scale_y_continuous(breaks=seq(0,500,by=100)) +
           coord_fixed() +
           xlab("Pre Crime Counts") + ylab("Post Crime Counts")
           #scale_y_sqrt() + scale_x_sqrt() #not crazy to want square root scale here
eq_plot

#weighted correlation to view the overall change
cov.wt(my_dat[,c('pre','post')], wt = 1/sqrt(my_dat$pre), cor = TRUE)$cor[1,2]
########################################### 

So places that are way outside the norm here should pop out, either for increases or decreases. This will be better than Jerry’s stats for identifying outliers in lower baseline crime places.

I also show how to get an overall index based on a weighted correlation coefficient on the last line (as it can technically return a value within (-1,1), you might square it for a value within (0,1)). But I don’t think the overall metric is very useful – it has no operational utility for a police department deciding on a strategy. You always need to look at the individual locations, no matter what the overall index metric says. So I think you should just cut out the middle man and go straight to these plots. I’ve had functionally similar discussions with folks about Martin Andresen’s S index metric (Wheeler, Steenbeek, & Andresen, 2018) – just make your graphs and maps!

An additional plot that basically takes the above scatterplot and turns it on its side is a Poisson version of a Bland-Altman plot. Traditionally this plot shows the difference between two measures on the Y axis, and the average of the two measures on the X axis. Here, to make the measures have the same variance, I divide the difference between the post count and the (increase-adjusted) pre count by sqrt(post + pre). This is then like a Poisson Z-score, taking into account the null of an equal increase (or decrease) in crime among all of the sub-areas. (Here you might also use the Poisson e-test to calculate p-values of the differences, but the normal based approximation works really well for, say, crime counts of 5+.)

###########################################
#A take on the Bland-Altman plot for Poisson data

ov_total <- sum(my_dat$post)/sum(my_dat$pre)
my_dat$dif <- (my_dat$post - ov_total*my_dat$pre)/sqrt(my_dat$post + my_dat$pre)
my_dat$ave <- (my_dat$post + my_dat$pre)/2

ba_plot <- ggplot(data=my_dat, aes(x=ave, y=dif)) + 
           geom_point(shape = 21, colour = "black", fill = "grey", size=2.5, alpha=0.8) +
           scale_y_continuous(breaks=seq(-8,6,by=2)) +
           xlab("Average Crime") + ylab("Z-score (Equal Increase)")

ba_plot

#false discovery rate correction
my_dat$p_val <- pnorm(-abs(my_dat$dif))*2 #two-tailed p-value
my_dat$p_adj <- p.adjust(my_dat$p_val,method="BY") #BY correction since can be correlated
my_dat <- my_dat[order(my_dat$p_adj),]
my_dat #picks out the 3 cases I adjusted
###########################################

So again places with large changes that do not follow the overall trend will pop out here, both for small and large crime count places. I also show here how to do a false-discovery rate correction (same as in Wheeler, Steenbeek, & Andresen, 2018) if you want to actually flag specific locations for further investigation. And if you run this code you will see it picks out my three outliers in the simulation, and all other adjusted p-values are 1.

One thing to note about these tests is that they are conditional on the observed overall citywide crime increase. If it does happen that only one area increased by a lot, it may make more sense to set these hypothesis tests to a null of no change over time. If you see that one area is way above the line and a ton are below the line, that would indicate this scenario. To set the null to no change in these graphs, for the first one just pass in the same pre estimates for both the pre and post arguments in the make_cont function. For the second graph, setting ov_total <- 1 would do it.
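To make that concrete, with the objects already defined above those two tweaks look like:

#null of no change over time, rather than the observed citywide increase
contours_null <- make_cont(c1, c1)  #same pre counts passed as both pre and post
#then use contours_null in place of contours in the ggplot call

#and for the Bland-Altman style plot
ov_total <- 1
my_dat$dif <- (my_dat$post - ov_total*my_dat$pre)/sqrt(my_dat$post + my_dat$pre)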

References

  • Ratcliffe, J. H. (2010). The spatial dependency of crime increase dispersion. Security Journal, 23(1), 18-36.
  • Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
  • Wheeler, A. P., & Ratcliffe, J. H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 11.
  • Wheeler, A. P., Steenbeek, W., & Andresen, M. A. (2018). Testing for similarity in area‐based spatial patterns: Alternative methods to Andresen’s spatial point pattern test. Transactions in GIS, 22(3), 760-774.