At its most basic, risk assessment algorithms provide an estimate of future risk over some specific time horizon. For example, after scored for a pretrial risk assessment, the algorithm may say an individual has a 30% probability of committing a crime within the next 3 months. Is 30% low risk, and so they should be released? Or is 30% high risk? The majority of folks do not go on to commit a new offense awaiting for trial, so 30% overall may be really high actually relative to most folks being assessed. So how exactly do you decide the threshold – at what % is it too high and they should be detained pretrial?
For an individual costbenefit analysis, you consider the costs of pretrial detainment (the physical costs to detain a person for a specific set of time, as well as the negative externalities of detention for individuals) vs the cost to society of whether a future crime occurs. So for the 30% mark, say that average crime cost to society is $10,000 (this is ignoring that violent crimes cost more than property crime, in practice you can get individual probability estimates for each and combine them together). So the expected risk if we let this person go would then be $10,000*0.3 = $3,000
. Whether the 3k risk is worth the pretrial detention depends on how exactly we valuate detention. Say we have on average 90 days pretrial detention, and the cost is something like $200 up front fixed costs plus $50 per day. We would then have a cost to detain this person as $200 + $50*90 = $4,700
. From that individual perspective, it is not worth to detain that person pretrial. This is a simplified example, e.g. this ignores any negative externality cost of detaining (e.g. lost wages for individuals detained), but setting the threshold risk for detaining or releasing on recognizance (ROR) should look something like this.
One of the things I like about this metric is that it places at the forefront how another aspect of bail reform – the time to trial – impacts costs. So if you reduce the time to trial, for those ROR’d you reduce the time horizon of risk. (Those only out 1 month have less time to recidivate than those that have to wait 6 months.) It also reduces the cost of detaining individuals pretrial as well (costs less for both the state and the individual to only be in jail 1 month vs 6 months). It may be to make everyone happy, we should reduce the queue so we can get people to trial faster. (But in the short term risk assessment tools I think have a great upside to both increase public safety as well as reduce imprisonment costs.)
Part of the reason folks are not happy with the current bail reforms is that they think it is increasing overall crime (see this example in Dallas). Here is an example graph though that folks doing these assessments should be providing, both in terms of upfront establishing the threshold for risk, as well as evaluating the efficacy of the risk assessment tool in practice.
I will use the data from my prior posts on false positives to illustrate. For the graph, you place on the X axis the cumulative number of people that are detained, and the Y axis you place the expected number of crimes that your model thinks will be committed by those folks who are ROR’d. So a simplified table may be
Person %crime
A 0.5
B 0.4
C 0.3
D 0.2
E 0.1
If we let all of these folks go, we would expect they commit a total of 1.5 crimes (the sum of the percent predicted crime column) forecasted per our risk assessment algorithm. If we detained just person A, we have 1 in the detain column, and then a cumulative risk for the remaining folks of 1 (the sum of the predicted crime column for those that are remaining and are not detained). So then we go from the above table to this table by increasing the number of folks detained onebyone.
Detained ExpectedCrimes
0 1.5
1 1.0
2 0.6
3 0.3
4 0.1
5 0
Here is what that graph looks like using the ProPublica data, so if we apply this strategy to the just under 3,000 cases (in my test set from the prior blog post). So you can see that if we decided to detain noone, we would expect a total of 1,200 extra crimes. And this curve decreases over detaining everyone. So you may say I don’t want more than 200 crimes, which you would need to have detained 1,500 people in the prior example (and happens to result in a risk threshold of 36% in this sample).
Using historical data, this is good to establish a overall amount of crime risk you expect to occur from a particular set of bail reform decisions. To apply it to the future threshold decision making, you need to assume the past in terms of the total number of people arrested as well as the risk distribution stays the same (the latter I don’t think is a big issue, and the former you should be able to make reasonable projections if it is not constant). But this sets up the hypothetical, OK if we release this many more people ROR, we expect this many more crimes to occur as an upfront expectation of bail reform. It may be even if the individual costbenefit calculation above says release, this would result in a total number of extra crimes folks deem unacceptable when applying that decision to everyone. So we can set the threshold to say we only want 10 extra crimes to happen because of bail reform, or 50 extra, or 100 extra, etc. This example again just aggregates all crime together, but you can do the same thing for every individual crime outcome you are interested in.
After the assessment is in place, this is actually necessary monitoring folks should do be doing anyway to ensure the model is working as expected. That is, you get an estimate of the number of crimes folks who are released you think would commit per your risk assessment model. If you predict more/less than what you see amongst those released, your model is not well calibrated and needs to be updated. In practice you can’t just estimate a predictive model once and then use it forever, you need to constantly monitor whether it is still working well in real life. (Actually I showed in my prior blog post that the model was not very good, so this is a pretty big overestimate of the number of crimes in this sample.)
This should simultaneously quell complaints about bail reform is causing too many crimes. The lack of this information is causing folks to backlash against these predictive algorithms (although I suspect they do better than human judges, so I suspect they can reduce crime overall if used wisely). Offhand the recent crime increases in Philly, NYC, and Dallas I’m skeptical are tied to these bail reform efforts (they seem too big of increases or too noisy up/downs to reliably pin to just this), but maybe I am underestimating how many people they are letting out and the cumulative overall risk expected from the current models in those places. On the flipside folks are right to question those Chicago stats, I suspect the risk algorithm should be saying that more crimes are occurring then they observed (ignoring what they should or should not be counting as recidivated).
I’d note these metrics I am suggesting here should be pretty banal to produce in practice. It is administrative data already collected and should be available in various electronic databases. So in practice I don’t know why this data is not readily available in various jurisdictions.
One thing you may notice is that in my prior costbenefit analysis I didn’t take into consideration false positives. Although my prior post details how you would set this, there is a fundamental problem with monitoring false positives (those detained but who would not go on to recidivate) in practice. In practice, you can’t observe this value (you can only estimate it from historical data). Once you detain an individual, by construction they aren’t given the chance to recidivate. So you don’t get any onpolicy feedback about falsepositives, only falsenegatives (folks who were released and went on to commit a crime pretrial).
This I think puts a pretty big nail in the coffin of using false positive rates as a policy goal for bail reform in practice. Like I said earlier, you can’t just set a model once and expect it to work forever in the future. But, I actually don’t think that should be considered in the costbenefit calculus anyway. So traditionally people tend to think of setting the threshold for predictive models like this confusion table, where different outcomes in the table have different costs to individuals and to society:
In this table those on the bottom row are those detained pretrial. So in the hypothetical, you may say if we could someone know the false positives, we should calculate extra harm that pretrial detainment causes to those individuals (lost wages, losing job, health harms, etc.). But for the folks who would have gone on and recidivated, we should just calculate the bare bones cost of detainment.
I think this is the wrong way to think about it though. Those harms are basically across the board for everyone – even if the person was likely to recidivate they still bear those subsequent harms of being incarcerated. Whether you think people deserve the harm does not make it go away.
The main reason I am harping on bail reform so much (folks who know my work will realize it is outside my specific research area) is that the current bail system is grossly inefficient and unequitable. These are folks that piling on monetary bail costs are the exact wrong way to ensure safety and to promote better outcomes for these folks.
It is a hard decision to make on who to detain vs who to let go. But pretending the current state of judges making these decisions on whatever personal whims they have and thinking we are better off than using a costbenefit approach and algorithmic assessments is just sticking your head in the sand.
]]>P(White  Other Stuff, Being Shot) = P(Minority  Other Stuff, Being Shot)
. Whereas we want Being Shot
on the left hand side, e.g. P(Being Shot  Race)
, and switching these probabilities results in mostly a meaningless estimate in terms of inferring police behavior. You ultimately need to look at some cases in which folks were not shot to have a meaningful research design.
I’ve been having similar conversations with folks since publishing my work on officer involved shootings (Wheeler et al., 2017). Most folks don’t understand the critique, and unfortunately most folks also don’t take critiques very well. So this post is probably a waste of time, but here it is anyway.
I’m likely to get some of the timing wrong in how I came to be interested in this area – but here is what I remember. David Klinger and Richard Rosenfeld published a piece in Criminology & Public Policy (CPP) examining the count of OIS’s in neighborhoods in St. Louis, conditional on demographic and violent crime counts in those neighborhoods (Klinger et al., 2016). So in quantoid speak they estimated the expected number of OIS in neighborhoods, E[OIS_n  Demographic_n, Crime_n]
.
I thought this work was mostly meaningless, mainly because it really only makes sense to look at rates of behavior. You could stick a count of anything police do on the left hand side of this regression and the violent crime coefficient will be the largest positive effect. So you could say estimate the counts of officers helping old ladies cross the street, and you would make the same inferences as you would about OIS. It is basically just saying where officers spend more of their time at (in violent crime areas), and subsequently have more interactions with individuals. It doesn’t say anything fundamentally about police behavior in regards to racial bias.
So sometime in 2016 me and Scott Phillips came up with the study design using when officers draw their firearm as the denominator. (Before I moved to Dallas I knew about their open data.) It was the observational analogue to the shoot/don’t shoot lab experiments Lois James did (James et al., 2014). Also sometime during the time period Roland Fryer came out with his preprint, in which he used Taser uses as the counterfactual don’t shoot cases (Fryer, 2019). I thought drawing the firearm made more sense as a counterfactual, but both are subject to the same potential selection effect. (Police may be quicker to the draw their firearms with minorities, which I readily admit in my paper.)
Also in that span Justin Nix came out with the birdseye view CPP paper using the national level crowd sourced data (Nix et al., 2017) to estimate racial bias. They make what to me is a similar conditional probability mistake as the papers that motivated this post. Using the crowdsourced national level data, they estimate the probability of being unarmed, conditional on race (in the sample of just folks who were killed by the police). So they test whether P(Unarmed  White, Shot) = P(Unarmed  Minority, Shot)
.
Since like I said folks don’t really understand the conditional probability argument, basically at this point I just say folks get causality backwards. The police shooting at someone does not make them armed or unarmed, the same way police shooting at someone does not change their race. You cannot estimate a regression of X ~ beta*Y
, then interpret beta
as how much X causes Y. The stuff on the right hand side of the conditional probability statement works mostly the same way, we want to say stuff on the right hand side of the condition causes some change in the outcome.
I have this table I made in Wheeler et al. (2017) to illustrate various research designs – you can see the Ross (2015) made the same estimate of P(Unarmed  Race, Shot)
as Justin did.
At this point you typically get a series of different retorts to the “you estimated the wrong conditional probability complaint”. The ones I’ve repeatedly seen are:
For (3) I would be OK if the results are described correctly, pretty much all of these articles are clearly interested in making inferences about police behavior though (which you cannot do with just looking at these negative encounters). It isn’t just a slip of mistaking conditional probabilities (like a common pvalue mishap that doesn’t really impact the overall conclusions), the articles are directly motivated to make inferences about police behavior they cannot with this study design.
For (2) it is useful to consider how might the descriptive conditional probabilities be actually interpreted in a reasonable manner. So if we estimate P(Offender Race  Shot)
, you can think of a game where if you see a news headline about an OIS, and you want to guess the race of the person shot by police, what would be your best guess. Ditto for P(Unarmed  Shot)
, what is the probability of someone being unarmed conditional on them being shot. This game is clearly a superficial type of thing to estimate, those probabilities don’t say anything though about behavior in terms of things police officers can control, they are all just a function of how often police get in interactions with those different races (or armed status) of individuals.
Consider a different hypothetical, the probability a human is shot by police versus an animal. P(Human  Shot)
is waay larger than P(Animal  Shot)
, are police biased against humans? No, the police just don’t deal with animals they need to shoot on a regular basis.
For (1) I will follow up below with some examples of how I think using this OIS data could actually be effective for shaping police behavior in practice, but suffice to say just collecting OIS you can’t really say anything about racial bias in terms of officer decision making.
I will say that a bunch of the individuals I am critiquing here I consider friends. Steve Bishopp was one of the coauthors on my OIS work with Dallas data. If I go to a conference Justin is one of the people I would prefer to sit down and have a drink with. I’ve been schmoozing up folks with good R programming skills to come to Dallas to work for Jenn ReingleGonzalez. They have all done other work I think is good. See Tregel et al. (2019) or Jetelina et al. (2017) or Cesario et al. (2019) for other examples I think are more legitimate research articles amongst the same people who I am critiquing here.
So in response to (4) I think you all made the wrong mistake – the conditional probability mistake is an easy one to make. So sorry to my friends whom I think are wrong about this. That being said, most of the vitriol in public forums, often accusing people of adhominem attacks on their motivations, is pretty much always out of line. I think basically everyone on Twitter is being a jerk to be frank. I’ve seen it all around on both sides in the most recent Twitter back and forth (both folks calling Jenn racist and JM biased against the police). None of them are racist or biased for/against the police. I suppose to expect any different though is setting myself up for dissapointment. I was called racist by academic reviewers for Wheeler et al. (2017) (it took 4 rejects for my OIS paper before it was published). I’ve seen Justin get critiques on Twitter for being white in the past when doing work in this area.
I think CJ folks questioning JM’s motivation miss the point of his critique though. He isn’t saying police are biased and these papers are wrong, he is just saying these research papers are wrong because they can’t tell whether police are biased one way or another.
So while I think better research could be conducted in this area – JM has his work on bounding estimates (Knox et al., 2019), and I imagine someone can come up with a reasonable instrumental variable strategy to address the selection bias in the same vein as my shoot/don’t shoot (say officer instruments, or exogenous incidents that make officers more on edge and more likely to draw their firearm). But I think the question of whether “the police” are racially biased is a facile question. Globally labelling all police (or a single department) as racist is mostly a waste of time. Good for academic papers and to get people riled up in Twitter, not so much for anything else.
The police are simply a cross section of the general public. So in terms of whether some officers are racist this is true (as it is for the general public). Or maybe even we are all a little racist (ala the implicit bias hypothesis). We can only observe behavior, we cannot peer into the hearts and minds of men. But suffice to say racism is still a part of our society in some capacity I believe is a pretty tame statement.
Subsequently if you gather enough data you will be able to get some estimate of the police being racist (the null is for sure wrong). But if people can’t reasonably understand conditional probabilities, imagine trying to have a conversation about what is a reasonable amount of racial bias for monitoring purposes (inferiority bounds). Or that this racial bias estimate is not for all police, but some mixture of police officers and actions. Hard pass on either of those from me.
Subsequently this work has no bearing on actual police practice (including my own). They are of very limited utility – at best a stick or shield in civil litigation. They don’t help police departments change behavior in response to discovering (or not discovering) racial bias. And OIS are basically so rare they are worthless for all but the biggest police departments in terms of a useful monitoring metric (it won’t be sensitive enough to say whether a police department as a whole is doing good or doing bad).
So what do I think is potentially useful way to use this data? I’ve used the term “monitoring metric” a few times – what I mean by that is using the information to actually inform some response. Internally for police departments, shootings should be part of an early intervention system used to monitor individual officers for problematic behavior. From a state or federal government perspective, they could actively monitor overall levels of force used to identify outlier agencies (see this blog post example of mine). For the latter think proactively identifying problematic departments, instead of the typical current approach of wait for some major incident and then the Department of Justice assigns a federal monitor.
In either of those strategies just looking at shootings won’t be enough, they would need to use all levels of use of force to effectively identify either bad individual cops or problematic departments as a whole. Hence why I suggested adding all levels of force to say NIBRS, rather than having a stand alone national level OIS database. And individual agencies already have all the data they need to do an effective early intervention system.
I’m not totally oppossed to having a national level OIS database just based on normative arguments – e.g. you think it is a travesty we can’t say how many folks were killed by police in the prior year. It is not a totally hollow gesture, as making people record the information does provide a level of oversight, so may make a small difference. But that data won’t be able to say anything about the racial bias in individual police officer decision making.
Cesario, J., Johnson, D. J., & Terrill, W. (2019). Is there evidence of racial disparity in police use of deadly force? Analyses of officerinvolved fatal shootings in 2015–2016. Social psychological and personality science, 10(5), 586595.
Fryer Jr, R. G. (2019). An empirical analysis of racial differences in police use of force. Journal of Political Economy, 127(3), 12101261.
Klinger, D., Rosenfeld, R., Isom, D., & Deckard, M. (2016). Race, crime, and the microecology of deadly force. Criminology & Public Policy, 15(1), 193222.
Knox, D., Lowe, W., & Mummolo, J. (2019). The bias is built in: How administrative records mask racially biased policing. Available at SSRN.
Knox, D., & Mummolo, J. (2020). Making inferences about racial disparities in police violence. Proceedings of the National Academy of Sciences, 117(3), 12611262.
James, L., Klinger, D., & Vila, B. (2014). Racial and ethnic bias in decisions to shoot seen through a stronger lens: Experimental results from highfidelity laboratory simulations. Journal of Experimental Criminology, 10(3), 323340.
Jetelina, K. K., Bishopp, S. A., Wiegand, J. G., & Gonzalez, J. M. R. (2020). Race/ethnicity composition of police officers in officerinvolved shootings. Policing: An International Journal.
Jetelina, K. K., Jennings, W. G., Bishopp, S. A., Piquero, A. R., & Reingle Gonzalez, J. M. (2017). Dissecting the complexities of the relationship between police officer–civilian race/ethnicity dyads and lessthanlethal use of force. American journal of public health, 107(7), 11641170.
Johnson, D. J., Tress, T., Burkel, N., Taylor, C., & Cesario, J. (2019). Officer characteristics and racial disparities in fatal officerinvolved shootings. Proceedings of the National Academy of Sciences, 116(32), 1587715882.
Nix, J., Campbell, B. A., Byers, E. H., & Alpert, G. P. (2017). A bird’s eye view of civilians killed by police in 2015: Further evidence of implicit bias. Criminology & Public Policy, 16(1), 309340.
Ross, C. T. (2015). A multilevel Bayesian analysis of racial bias in police shootings at the countylevel in the United States, 2011–2014. PloS one, 10(11).
Tregle, B., Nix, J., & Alpert, G. P. (2019). Disparity does not mean bias: Making sense of observed racial disparities in fatal officerinvolved shootings with multiple benchmarks. Journal of crime and justice, 42(1), 1831.
Wheeler, A. P., Phillips, S. W., Worrall, J. L., & Bishopp, S. A. (2017). What factors influence an officer’s decision to shoot? The promise and limitations of using public data. Justice Research and Policy, 18(1), 4876.
]]>A reasonable critique of such systems though is they can have disparate racial impact. For example, ProPublica had an article on how the Compas risk assessment instrument resulted in more false positives for black than white individuals. Chris Stucchio has a nice breakdown for why this occurs, which is not due to the Compas being intrinsically racist algorithm, but due to the nature of the baseline risks for the two groups.
Consider a very simple example to illustrate. Imagine based on our costbenefit analysis, we determine the probability threshold to flag a individual as high risk is 60%. Now say our once we apply our predictions, for those above the threshold, whites are all predicted to be 90%, and blacks are all 70%. If our model is well calibrated (which is typically the case), the false positive rate for whites will be 10%, and will be 30% for blacks.
It is actually a pretty trivial problem though to balance false positive rates between different groups, if that is what you want to do. So I figured I would illustrate here using the same ProPublica data. There are tradeoffs though with this, balancing false positives means you lose out on other metrics of fairness. In particular, it means you don’t have equality of treatment – different racial groups will have different thresholds. The full data and code I use to illustrate this can be downloaded here.
To illustrate how we would balance the false positive rates between groups, I use the same ProPublica risk assessment data. So this isn’t per se for bail decisions, but works fine as an illustration. First in python I load my libraries, and then read in the data – it is a few over 11,000 cases.
import pandas as pd
import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
my_dir = r'C:\Users\andre\Dropbox\Documents\BLOG\BalanceFalsePos'
os.chdir(my_dir)
#For notes on data source, check out
#https://github.com/apwheele/ResearchDesign/tree/master/Week11_MachineLearning
recid = pd.read_csv('PreppedCompas.csv')
print( recid.head() )
Next I prepare the dataset for modelling. I am not using all of the variables in the dataset. What I predict here is recidivism post 30 days (there are a bunch of recidivism right away in the dataset, so I am not 100% sure those are prior to screening). I use the three different aggregate compas scores, juvenile felony count, whether they were male, how old they were, and whether the current charge to precipitate screening is a felony or misdemeanor. I include the race variable in the dataset, but I won’t be using it in the predictive model. (That point deserves another blog post, contra to what you might expect, leaving race flags in will often result in better outcomes for that protected class.)
#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
'juv_fel_count','YearsScreening']]
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
recid_prep['race'] = recid['race']
print( recid['race'].value_counts() ) #pretty good sample size for both whites/blacks
Next I make my testing and training sets of data. In practice I can perfectly balance false positives retrospectively. But having a test set is a better representation of reality, where you need to make some decisions on the historical data and apply it forward.
#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1]
recid_test = recid_prep[recid_prep['Train'] == 0]
Now the procedure I suggest to balance falsepositives doesn’t matter how you generate the predictions, just that we need a predicted probability. Here I use random forests, but you could use whatever machine learning or logistic regression model you want. Second part just generates the predicted probabilities for the training dataset.
#Now estimating the model
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
'juv_fel_count','YearsScreening','Male','Fel','Mis'] #no race in model
dep_var = 'Recid30'
rf_mod = RandomForestClassifier(n_estimators=500, random_state=10)
rf_mod.fit(X = recid_train[ind_vars], y = recid_train[dep_var])
#Now getting the predicted probabilities in the training set
pred_prob = rf_mod.predict_proba(recid_train[ind_vars] )
recid_train['prob'] = pred_prob[:,1]
recid_train['prob_min'] = pred_prob[:,0]
Now to balance false positives, I will show a graph. Basically this just sorts the predicted probabilities in descending order for each racial group. Then you can calculate a cumulate false positive rate for different thresholds for each group.
#Making a cusum plot within each racial group for the false positives
recid_train.sort_values(by=['race','prob'], ascending=False, inplace=True)
recid_train['const'] = 1
recid_train['cum_fp'] = recid_train.groupby(['race'])['prob_min'].cumsum()
recid_train['cum_n'] = recid_train.groupby(['race'])['const'].cumsum()
recid_train['cum_fpm'] = recid_train['cum_fp'] / recid_train['cum_n']
white_rt = recid_train[recid_train['race'] == 'Caucasian']
black_rt = recid_train[recid_train['race'] == 'AfricanAmerican' ]
And now the fun part (and least in output, not really in writing matplotlib code).
#now make the chart for white and black
fig, ax = plt.subplots()
ax.plot(black_rt['prob'], black_rt['cum_fpm'], drawstyle='steps', color='b', label='Black')
ax.plot(white_rt['prob'], white_rt['cum_fpm'], drawstyle='steps', color='r', label='White')
ax.set_xlim(1, 0) # decreasing probs
plt.xticks(np.arange(1.0,0.1,0.1))
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Mean False Positive Rate')
ax.grid(True,linestyle='')
ax.legend(facecolor='white', framealpha=1)
plt.savefig('FP_Rate.png', dpi=2000, bbox_inches='tight')
plt.show()
So what this chart shows is that if we set our threshold to a particular predicted probability (X axis), based on the data we would expect a false positive rate (Y axis). Hence if we want to balance false positives, we just figure out the race specific thresholds for each group at a particular Y axis value. Here we can see the white line is actually higher than the black line, so this is reverse ProPublica findings, we would expect whites to have a higher false positive rate than blacks given a consistent predicted probability of high risk threshold. So say we set the threshold at 10% to flag as high risk, we would guess the false positive rate among blacks in this sample should be around 40%, but will be closer to 45% in the white sample.
Technically the lines can cross at one or multiple places, and those are places where you get equality of treatment and equality of outcome. It doesn’t make sense to use that though from a safety standpoint – those crossings can happen at a predicted probability of 99% (so too many false negatives) or 0.1% (too many false positives). So say we wanted to equalize false positive rates at 30% for each group. Here this results in a threshold for whites as high risk of 0.256, and for blacks a threshold of 0.22.
#Figuring out where the threshold is to limit the mean FP rate to 0.3
#For each racial group
white_thresh = white_rt[white_rt['cum_fpm'] > 0.3]['prob'].max()
black_thresh = black_rt[black_rt['cum_fpm'] > 0.3]['prob'].max()
print( white_thresh, black_thresh )
Now for the real test, lets see if my advice actually worked in a new sample of data to balance the false positive rate.
#Now applying out of sample, lets see if this works
pred_prob = rf_mod.predict_proba(recid_test[ind_vars] )
recid_test['prob'] = pred_prob[:,1]
recid_test['prob_min'] = pred_prob[:,0]
white_test = recid_test[recid_test['race'] == 'Caucasian']
black_test = recid_test[recid_test['race'] == 'AfricanAmerican' ]
white_test['Flag'] = 1*(white_test['prob'] > white_thresh)
black_test['Flag'] = 1*(black_test['prob'] > black_thresh)
white_fp= 1  white_test[white_test['Flag'] == 1][dep_var].mean()
black_fp = 1  black_test[black_test['Flag'] == 1][dep_var].mean()
print( white_fp, black_fp )
And we get a false positive rate of 54% for whites (294/547 false positives), and 42% for blacks (411/986) – yikes (since I wanted a 30% FPR). As typical, when applying your model to out of sample data, your predictions are too optimistic. I need to do some more investigation, but I think a better way to get error bars on such thresholds is to do some kfold metrics and take the worst case scenario, but I need to investigate that some more. The sample sizes here are decent, but there will ultimately be some noise when deploying this in practice. So basically if you see in practice the false positive rates are within a few percentage points that is about as good as you can get in practice I imagine. (And for smaller sample sizes will be more volatile.)
]]>So how do we solve that problem? One solution is to use regularized regression results (e.g. ridge regression, lasso), when the number of predictors is greater than the number of observations. So I can cast the matching procedure into a regression problem to generate the weights. Those regression procedures are typically used for forecasting, but don’t have well defined standard errors, and so subsequently are typically only used for point forecasts. One way to make inferences though is to generate the synthetic weights (here using lasso regression), and then use conformal prediction intervals to do our hypothesis testing of counterfactual trends.
Here I walk through an example using state panel crime data in R, full code and data can be downloaded here.
So first, these are the packages we need to replicate the results. conformalInference
is not on CRAN yet, so use devtools to install it.
#library(devtools)
#install_github(repo="ryantibs/conformal", subdir="conformalInference")
library(conformalInference)
library(glmnet)
library(Synth)
Then I have prepped a nice state panel dataset of crime rates and counts from 1960 through 2014. I set a hypothetical treatment start year in 2005 just so I have a nice 10 years post data for illustration. That is a pretty good length prepanel though, and a good number of potential donors.
MyDir < "C:\\Users\\axw161530\\Desktop\\SynthIdeas"
setwd(MyDir)
TreatYear < 2005
LongData < read.csv("CrimeStatebyState_Edited.csv")
summary(LongData)
Next I prep my data, currently it is in long panel format, but I need it in wide format to fit the regression equations I want. I am just matching on violent crime rates here. I take out NY, as it is missing a few years of data. (This dataset also includes DC.) Then I split it up into my pre intervention and post intervention set.
#Changing the data to wide for just the violent offenses
wide < LongData[,c('State','Year','Violent.Crime.rate')]
names(wide)[3] < 'VCR'
wide < reshape(wide, idvar="Year", timevar="State", direction="wide")
summary(wide)
#Take out NY because of NAs
wide < wide[,c(1:33,35:52)]
wide_pre < as.matrix(wide[wide$Year < TreatYear,])
wide_post < as.matrix(wide[wide$Year >= TreatYear,])
Now onto the good stuff, we can estimate our lasso regression using the predata to get our weights. This constrains the coefficients to be positive and below 1. But does not have the constraint they sum to 1. I just choose Alabama as an example treated unit — I intentionally chose a state and year that should not have any effects for illustration and to check the coverage of my technique vs more traditional analyses.
You can see in my notes this is different than traditional synth in that it has an intercept as well. I was surprised, but the predictions in sample were really bad without the intercept no matter how I sliced it.
res < glmnet(x=wide_pre[,3:51],y=wide_pre[,2],family="gaussian",
lower.limits=0,upper.limits=1,intercept=TRUE,standardize=FALSE,
alpha=1) #need the intercept, predictions suck otherwise
Even though this does not constrain the coefficients to sum to 1, it ends up with weights really close to that ideal anyway (sum of the nonintercept coefficients is just over 1.01). When I use crossvalidation it does not choose weights that sum to unity, but in sample the above code and the cv.glmnet
are really similar in terms of predictions.
co_ridge < as.matrix(coef(res))
fin < co_ridge[,"s99"]
active < fin[fin > 0] #Does not include intercept
If you print active
we then have for our state weights (and the intercept is pretty tiny, 22). So not quite sure why eliminating the intercept was causing such problems in this example. So North Carolina just sneaks in, but otherwise the synthetic control is a mix of Arkansas, California, Kentucky, and Texas. The intercept is just a level shift, so we are still matching curves otherwise, so that does not bother me very much.
VCR.AR 0.2078156362
VCR.CA 0.1201658279
VCR.IL 0.1543015666
VCR.KY 0.2483613907
VCR.NC 0.0002896238
VCR.TX 0.2818272850
If we look at our predictions for the pretime period, Alabama had the typical crime path, with a big raise going into the early 90’s and then a fall afterward (black line), and our insample predictions from the lasso regression are decent.
pre_pred < predict(res,newx=wide_pre[,3:51],s=min(res$lambda)) #for not cv results
plot(wide_pre[,1],wide_pre[,2],type='l',xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_pre[,1],pre_pred,bg='red',pch=21) #Not too shabby
legend(1960,800,legend=c("Observed Albama","Predicted"),col=c("black","black"), pt.bg=c("black","red"), lty=c(1,NA), pch=c(NA,21))
Now to evaluate post intervention, we are going to generate conformal prediction intervals using a jackknife approach. Basically doing all the jazz of above, but leaving one pre year out at a time, and trying to predict Alabama’s violent crime rate for that left out year. Repeat that same process for all prior years, and we can get a calculation of the standard error of our prediction. Then apply that standard error to future years, so we can tell if the observed trend is different than the counterfactual we estimated (given the counterfactual has errors). I generate both 90% prediction intervals, as well as 99% prediction intervals.
train_fun < function(x, y, out=NULL){
return( glmnet(x,y,alpha=1,standardize=FALSE,intercept=TRUE,nlambda=100,
lower.limits=0,upper.limits=1,family="gaussian")
)
}
pred_fun = function(out, newx) {
return(predict(out, newx, s=min(out$lambda)))
}
limits_10 < conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
train.fun=train_fun,predict.fun=pred_fun,alpha=0.10,
verbose=TRUE)
limits_01 < conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
train.fun=train_fun,predict.fun=pred_fun,alpha=0.01,
verbose=TRUE)
plot(wide_post[,1],wide_post[,2],type='l',ylim=c(150,650),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
legend("topright",legend=c("Observed Albama","Predicted","90% Pred. Int.","99% Pred. Int."),cex=0.7,
col=c("black","black","grey","grey"), pt.bg="red", lty=c(1,NA,1,1), pch=c(NA,21,NA,NA), lwd=c(1,1,1,3))
Then at the end of the above code snippet I made a plot. Black line is observed for Alabama from 0514. Red dots are the estimated counterfactual based on the preweights. The lighter grey lines are then the prediction intervals. So we can see it is just outside the 90% intervals 3 times in the later years (would only expect 1 time), but all easily within the 99% intervals.
Note these are prediction intervals, not confidence intervals. Thinking about it I honestly don’t know whether we want prediction or confidence intervals in this circumstance, but prediction will be wider.
So this approach just matches on the pretreated same outcome observations. To match on additional covariates, you can add them in as rows into the pretreatment dataset (although you would want to normalize the values to a similar mean and standard deviation as the pretreated outcome series).
You may also add in other covariates, like functions of time (although this changes the nature of the identification). So for example say you incorporate a linear and quadratic trend in time, and lasso only chooses those two time factors and no control areas. You are doing something more akin to interrupted time series analysis at that point (the counterfactual is simply based on your estimate of the pretrend). Which I think is OK sometimes, but is quite different than using control areas to hopefully capture random shocks.
To see whether my error intervals are similar to the placebo approach, I used the old school synth R package. It isn’t 100% comparable, as it makes you match on at least one covariate, so here I choose to also match on the average logged population over the pretreatment period.
#NY is missing years
LongData_MinNY < LongData[as.character(LongData$State) != "NY",c("State","Year","Violent.Crime.rate","Population")]
LongData_MinNY$StateNum < as.numeric(LongData_MinNY$State)
LongData_MinNY$State < as.character(LongData_MinNY$State)
LongData_MinNY$LogPop < log(LongData_MinNY$Population)
state_nums < unique(LongData_MinNY$StateNum)
dataprep.out < dataprep(foo = LongData_MinNY,
dependent = "Violent.Crime.rate",
predictors = c("LogPop"),
unit.variable = "StateNum",
unit.names.variable = "State",
time.variable = "Year",
treatment.identifier = 2,
controls.identifier = state_nums[!state_nums %in% 2],
time.optimize.ssr = 1960:(TreatYear1),
time.predictors.prior = 1960:(TreatYear1),
time.plot = 1960:2014
)
synth_res < synth(dataprep.out)
synth_tables < synth.tab(dataprep.res = dataprep.out, synth.res = synth_res)
synth_tables$tab.w #a bunch of little weights across the board
path.plot(synth.res = synth_res, dataprep.res = dataprep.out, tr.intake=TreatYear,Xlab='',Ylab='Violent Crime Rate per 100,000',
Legend=c("Alabama","Synthetic Control"), Legend.position=c("topleft"))
Looking at the weights, it is a bunch of little ones for many different states. Looking at the plot, it doesn’t appear to be any better fit than the lasso approach.
And then I just do the typical approach and use placebo checks to do inference. I loop over my 49 placebos (1 state for NY, but +1 state because this list includes DC).
#Dataframes to stuff the placebos check results into
Predicted < data.frame(dataprep.out$Y0plot %*% synth_res$solution.w)
names(Predicted) < "TreatPred"
Pred_MinTreat < data.frame(TreatPred = Predicted$TreatPred  LongData_MinNY[LongData_MinNY$StateNum == 2,"Violent.Crime.rate"])
#Now I just need to loop over the other states and collect their results for the placebo tests
placebos < state_nums[!state_nums %in% 2]
for (i in placebos){
dataprep.plac < dataprep(foo = LongData_MinNY,
dependent = "Violent.Crime.rate",
predictors = c("LogPop"),
unit.variable = "StateNum",
unit.names.variable = "State",
time.variable = "Year",
treatment.identifier = i,
controls.identifier = state_nums[!state_nums %in% i],
time.optimize.ssr = 1960:(TreatYear1),
time.predictors.prior = 1960:(TreatYear1),
time.plot = 1960:2014
)
synth_resP < synth(dataprep.plac)
synth_tablesP < synth.tab(dataprep.res = dataprep.plac, synth.res = synth_resP)
nm < paste0("S.",i)
Predicted[,nm] < dataprep.plac$Y0plot %*% synth_resP$solution.w
Pred_MinTreat[,nm] < Predicted[,nm]  LongData_MinNY[LongData_MinNY$StateNum == i,"Violent.Crime.rate"]
}
If you look at the synth estimates for Alabama (grey circles), they are almost exactly the same as the lasso predictions (red circles), even though the weights are very different.
PredRecent < Predicted[1960:2014 >= TreatYear,]
DiffRecent < Pred_MinTreat[1960:2014 >= TreatYear,]
plot(wide_post[,1],wide_post[,2],type='l',ylim=c(100,700),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
points(wide_post[,1],PredRecent$TreatPred,bg='grey',pch=21)
legend("topright",legend=c("Observed Albama","Lasso Pred.","90% Pred. Int.","99% Pred. Int.","Synth Pred."),cex=0.6,
col=c("black","black","grey","grey"), pt.bg=c(NA,"red",NA,NA,"grey"), lty=c(1,NA,1,1,NA), pch=c(NA,21,NA,NA,21), lwd=c(1,1,1,3,1))
But when we look at variation in our placebo results (thin, purple lines), they are much wider than our conformal prediction intervals.
plot(wide_post[,1],wide_post[,2]post_pred,type='l',ylim=c(500,500),xlab='',ylab='Observed  Predicted (Violent Crime Rates)')
points(wide_post[,1],post_predpost_pred,bg='red',pch=21)
lines(wide_post[,1],limits_01$lopost_pred,col='grey',lwd=3)
lines(wide_post[,1],limits_01$uppost_pred,col='grey',lwd=3)
for (i in 2:ncol(PredRecent)){
lines(wide_post[,1],DiffRecent[,i],col='#9400D340',lwd=0.5)
}
legend(x=2005.5,y=700,legend=c("Observed Albama","Lasso Pred.","99% Pred. Int.","Placebos"),
col=c("black","black","grey",'#9400D3'), pt.bg=c(NA,"red",NA,NA), lty=c(1,NA,1,1),
pch=c(NA,21,NA,NA), lwd=c(1,1,3,0.5), xpd=TRUE, horiz=TRUE, cex = 0.45)
So I was hoping they would be the same (conformal would cover the placebo at the expected rate), but alas they are not. So I’m not sure if my conformal intervals are too small, or the placebo checks are extra noisy. I can’t prove it, but I suspect the placebo checks are somewhat noisy, mainly because there will always be some intervention that is idiosyncratic to specific donors over long periods of time that makes them no longer good counterfactuals. This seems especially true if you consider predictions further out from the treatment year. Although I find the logic of the placebo checks pretty convincing, so I am somewhat torn.
Since we have in this example 49 donors, the twotailed pvalue for being outside the placebos would be 2/(49+1)=0.04. Here we would need an intervention that either increased violent crime rates by plus/minus 400 per 100,000, pretty much an impossible standard given a baseline of only 400 crimes per 100,000 as of 2004. The 99% conformal intervals are still pretty wide, with an increase/decrease of about 150 violent crimes per 100,000 needed to be a significant change. The two lines way outside 400 happen to be Alaska and Wyoming, not DC, so maybe a tiny population state results in higher volatility problem. But besides them there are a bunch of placebo states around plus/minus 300 as well.
So caveat emptor if you want to use this idea in your own work, I don’t know if my suggestion is good or bad. Here it suggests its more diagnostic (smaller intervals) than the placebo checks, and isn’t limited by the number of potential donors in setting the alpha level for your tests (e.g. if you only have 10 potential donors your placebo checks are only 90% intervals).
Since this is just one example, there are a few things I would need to know before recommending it more generally. One is that it may not work with smaller pre time series and/or a smaller donor pool. (Not sure of any better way of checking than via a ton of different simulations.)
Doing some more lit review while preparing this post, I appear to be like 15th in line to suggest this approach (so don’t take it as novel). In terms of using the lasso to estimate the synth weights, it seems Susan Athey and colleagues proposed something similar in addition to using other machine learning techniques. Also see Amjad et al. 2018 in the Journal of Machine Learning, and this workshop by Alex Hollingsworth and Coady Wing. I am not even the first one to think to use conformal prediction intervals apparently, see this working paper (Chernozhukov, Wuthrich, and Zhu, 2019) posted just a few weeks prior.
There is another R package, gsynth, that appears to solve the problem of p > n via a variable reduction technique (Xu, 2017). Xu also discusses how incorporating more information is really making different identification assumptions. So again just getting good predictions/minimizing the insample mean square error is not necessarily the right approach to get correct causal inferences.
Just a blog post, so again can’t say if this is an improvement over other work offhand. This is just illustrative that the bounds for the conformal prediction may be smaller than the typical permutation based approach. Casting it as a regression problem I intuitively grok more, and think opens up more possibilities. For example, you may want to use binomial logistic models instead of linear for the fitting process (so takes into account more volatility for smaller population states).
]]>
One of my favorite plots to go with random effect models is called a caterpillar plot. So typically folks just stop at reporting the variance of the random intercepts and slopes when they estimate these models. But you not only get the global variance estimates, but can also get an estimate (and standard error) for each higher level variable. So if I have 100 people, and I do a random intercept for those 100 people, I can say “Joe B’s random intercept is 0.5, and Jane Doe’s random intercept is 0.2” etc.
So this is halfway in between confirmatory data analysis (we used a model to get those estimates) but is often useful for further understanding the model and seeing if you should add anything else. E.g. if you see the random intercepts have a high correlation with some other piece of person information, that information should be incorporated into the model. It is also useful to spot outliers. And if you have spatial data mapping the random intercepts should be something you do.
SPSS recently made it easier to make these types of plot (as of V25), so I am going to give an example. In my class, I give code examples in R, Stata, and SPSS whenever I can, so this link contains code for all three programs. I will be using data from my dissertation, with crime on street segments in DC, nested within regular grid cells (used to approximate neighborhoods).
So first data prep, I define where my data is using FILE HANDLE, read in the csv file of the data, compute a new variable (the sum of both detritus and physical infrastructure 311 calls). Then finally I declare that the FishID variable (my grid cells neighborhoods) is a nominal level variable. SPSS needs that defined correctly for later models.
*************************************************************.
FILE HANDLE data /NAME = "??????Your Path Here!!!!!!!!!!!".
*Importing the CSV file into SPSS.
GET DATA /TYPE=TXT
/FILE="data\DC_Crime_withAreas.csv"
/ENCODING='UTF8'
/DELCASE=LINE
/DELIMITERS=","
/QUALIFIER='"'
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/DATATYPEMIN PERCENTAGE=95.0
/VARIABLES=
MarID AUTO
XMeters AUTO
YMeters AUTO
FishID AUTO
XMetFish AUTO
YMetFish AUTO
TotalArea AUTO
WaterArea AUTO
AreaMinWat AUTO
TotalLic AUTO
TotalCrime AUTO
CFS1 AUTO
CFS2 AUTO
CFS1Neigh AUTO
CFS2Neigh AUTO
/MAP.
CACHE.
EXECUTE.
DATASET NAME CrimeDC.
DATASET ACTIVATE CrimeDC.
*Compute a new variable, total number of 311 calls for service.
COMPUTE CFS = CFS1 + CFS2.
EXECUTE.
VARIABLE LEVEL FishID (NOMINAL).
*************************************************************.
Now onto the good stuff, estimating our model. Here we are looking at the fixed effects of bars and 311 calls on crime on street segments, but also estimating a random intercept for each grid cell. As of V25, SPSS lets you specify an option to print the solution for the random statements, which we can capture in a new SPSS dataset using the OMS command.
So first we declare our new dataset to dump the results in, Catter
. Then we specify an OMS command to capture the random effect estimates, and then estimate our negative binomial model. I swear SPSS did not use to be like this, but now you need to end the OMS command before you putz with that dataset.
*************************************************************.
DATASET DECLARE Catter.
OMS
/SELECT TABLES
/IF SUBTYPES='Empirical Best Linear Unbiased Predictions'
/DESTINATION FORMAT=SAV OUTFILE='Catter' VIEWER=YES
/TAG='RandTable'.
*SOLUTION option only as of V25.
GENLINMIXED
/FIELDS TARGET=TotalCrime
/TARGET_OPTIONS DISTRIBUTION=NEGATIVE_BINOMIAL
/FIXED EFFECTS=TotalLic CFS
/RANDOM USE_INTERCEPT=TRUE SUBJECTS=FishID SOLUTION=TRUE
/SAVE PREDICTED_VALUES(PredRanEff).
OMSEND TAG='RandTable'.
EXECUTE.
DATASET ACTIVATE Catter.
*************************************************************.
And now we can navigate over to the saved table and make our caterpillar plot. Because we have over 500 areas, I sort the results and don’t display the X axis. But this lets you see the overall distribution and spot any outliers.
*************************************************************.
*Lets make a caterpillar plot.
FORMATS Prediction Std.Error LowerBound UpperBound (F4.2).
SORT CASES BY Prediction (D).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Var1 Prediction LowerBound UpperBound
/GRAPHSPEC SOURCE=INLINE
/FRAME INNER=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Var1=col(source(s), name("Var1"), unit.category())
DATA: Prediction=col(source(s), name("Prediction"))
DATA: LowerBound=col(source(s), name("LowerBound"))
DATA: UpperBound=col(source(s), name("UpperBound"))
SCALE: cat(dim(1), sort.data())
GUIDE: axis(dim(1), null())
GUIDE: axis(dim(2), label("BLUP"))
SCALE: linear(dim(2), include(0))
ELEMENT: edge(position(region.spread.range(Var1*(LowerBound + UpperBound))), size(size."0.5")))
ELEMENT: point(position(Var1*Prediction), color.interior(color.black), size(size."1"))
END GPL.
*************************************************************.
And here is my resulting plot.
And I show in the linked code some examples for not only random intercepts, but you can do the same thing for random slopes. Here is an example doing a model where I let the TotalLic effect (the number of alcohol licenses on the street segment) vary by neighborhood grid cell. (The flat 0 estimates and consistent standard errors are grid cells with 0 licenses in the entire area.)
The way to interpret these estimates are as follows. The fixed effect part of the regression equation here is: 0.247 + 0.766*Licenses
. That alcohol license effect though varies across the study area, some places have a random slope of +2, so the equation could then be thought of as 0.247 + (0.766 + 2)*Licenses
(ignoring the random intercept part). So the effect of bars in that area is much larger. Also there are places with negative effects, so the effects of bars in those places are smaller. You can do the same type of thought experiments simply with the reported variance components, but I find the caterpillar plots to be a really good visual to show what those random effects actually mean.
For other really good multilevel modelling resources, check out the Centre for Multilevel Modelling, and Germán Rodríguez’s online notes. Eventually I will get around to uploading my seminar class notes and code snippets, but in the mean time if you see a week and would like my code examples, always feel free to email.
]]>Here I wanted to focus on a different problem though — how long to keep cases open? There are diminishing returns to keeping cases open indefinitely, and so PDs should be able to right size the backend of detective open cases as well as the front end triaging. Here my suggested solution is to estimate a survival model of the probability of a case being solved, and then you can estimate an expected return on investment given the time you put in.
Here is a simplified example. Say the table below shows the (instantaneous) probability of a case being solved per weeks put into the investigation.
Week 1 20%
Week 2 10%
Week 3 5%
Week 4 3%
Week 5 1%
In survival model parlance, this would be the hazard function in discrete time increments. And then we have diminishing probabilities over time, which should also be true (e.g. a higher probability of being solved right away, and gets lower over time). The expected return of investigating this crime at time t is the cumulative probability of the crime being solved at time t, multiplied by whatever value you assign to the case being solved. The costs of investigating will be fixed (based on the detective salary), so is just a multiple of t*invest_costs
.
So just to fill in some numbers, lets say that it costs the police department $1,000 a week to keep an investigation going. Also say a crime has a return of $10,000 if it is solved (the latter number will be harder to figure out in practice, as cost of crime estimates are not a perfect fit). So filling in our table, we have below our detective return on investment estimates (note that the cumulative probability of being solved is not simply the sum of the instantaneous probabilities, else it would eventually go over 100%). So return on investment (ROI), at week 1 is 10,000*0.2 = 2,000
, at week 2 is 10,000*0.28 = 2,800
, etc.
h(t) solved% cumcosts ROI
Week 1 20% 20% 1,000 2,000
Week 2 10% 28% 2,000 2,800
Week 3 5% 32% 3,000 3,200
Week 4 3% 33% 4,000 3,300
Week 5 1% 34% 5,000 3,400
So the cumulative costs outweigh the total detective resources devoted to the crime by Week 4 here. So in practice (in this hypothetical example) you may say to a detective you get 4 weeks to figure it out, if not solved by then it should be closed (but not cleared), and you should move onto other things. In the long run (I think) this strategy will make sure detective resources are balanced against actual cases solved.
This right sizes investigation lengths from a global perspective, but you also might consider whether to close a case on an individual casebycase basis. In that case you wouldn’t calculate the sunk cost of the investigation so far, it is just the probability of the case being solved going forward relative to future necessary resources. (You do the same table, just start the cumcosts and solved percent columns from scratch whenever you are making that decision.)
In an actual applied setting, you can estimate the survival function however you want (e.g. you may want a cure mixturemodel, so not all cases will result in 100% being solved given infinite time). It is also the case that different crimes will not only have different survival curves, but also will have different costs of crime (e.g. a murder has a greater cost to society than a theft) and probably different investigative resources needed (detective costs may also get lower over time, so are not constant). You can bake that all right into this estimate. So you may say the cost of a murder is infinite, and you should forever keep that case open investigating it. A burglary though may be a very short time interval before it should be dropped (but still have some initial investment).
Another neat application of this is that if you can generate reasonable returns to solving crimes, you can right size your overall detective bureau. That is you can make a quantitative argument I need X more detectives, and they will help solve Y more crimes resulting in Z return on investment. It may be we should greatly expand detective bureaus, but have them only keep many cases open a short time period. I’m thinking of the recent officer shortages in Dallas, where very few cases are assigned at all. (Some PDs have patrol officers take initial detective duties on the crime scene as well.)
There are definitely difficulties with applying this approach. One is that getting the cost of solving a crime estimate is going to be tough, and bridges both quantitative cost of crime estimates (although many of them are sunk costs after the crime has been perpetrated, arresting someone does not undo the bullet wound), likelihood of future reoffending, and ethical boundaries as well. If we are thinking about a detective bureau that is overbooked to begin with, we aren’t deciding on assigning individual cases at that point, but will need to consider preempting current investigations for new ones (e.g. if you drop case A and pick up case B, we have a better ROI). And that is ignoring the estimating survival part of different cases, which is tricky using observational data as well (selection biases in what cases are currently assigned could certainly make our survival curve estimates too low or too high).
This problem has to have been tackled in different contexts before (either by actuaries or in other business/medical contexts). I don’t know the best terms to google though to figure it out — so let me know in the comments if there is related work I should look into on solving this problem.
]]>There was one point of contention though between us — ASEBP folks advocate for the failing fast model of evaluation, whereas I advocated for planning more long term experiments. In particular, I suggest this chart to plan your experiments. So say if you have an area with only 10 crimes per month, I would suggest you should do the experiment for at least 4 months, so if what your are doing is 50% effective at reducing crime, you will conclude it has at least weak evidence of effectiveness using my WDD test. If you think 50% is too high of a bar, if you do it for 12 months it only needs to be alittle over 25% effective to tell if it is working.
The ideal behind failing fast and innovating I totally get, but whether or not we can actually see if something is effective in the short run with low baseline crime counts may be a road block to this idea in practice. Note that this is not me bagging on people doing experiments — what is most likely to happen if you do an experiment with low power is you will conclude it is not effective, even if it partially works. So I’m more concerned the BetaGov fail fast model is likely to throw out costeffective interventions that don’t appear on their face to be effective, as opposed to false positives.^{1}
Am I being too negative though? And also can we create a monitoring tool to give more immediate feedback — so instead of waiting a year and seeing the results, evaluating the efficacy of an intervention over time? To do this I am giving cusum charts a try, and did a little simulation to show how it might look in practice. SPSS Code to replicate the findings here.
So what I did was simulate a baseline control area with 10 crimes per time period, and a treated area that had a 20% reduction (so goes down to 8 crimes on average per time period). Here is the time series of those two areas, black line is the control area, and red line is the treated area. Time periods can be whatever you want (e.g. days, weeks, months), what matters is the overall average and the difference between the two series.
Based on this graph, you can’t really visually tell if the red treated area is doing any better than the black control area — they overlap too much. But we aren’t just interested in the effect for any one time period, but in the cumulative effect over time. To calculate that, you just subtract the black line from the red line, and take the cumulative sum of that difference. The next chart shows that statistic, along with 100 simulated lines showing what happens when you do the same cumulative statistic to data with no changes.
So you can see here that it takes about 13 time periods to show the cumulative effects are outside of the simulation boundaries, but you might conclude there is suggestive evidence of effectiveness after say 8+ time periods. Going further out it still shows the cumulative number of crimes prevented over the life of the intervention, so goes down to around 75 crimes prevented by 25 time periods.
The number of time periods necessary to show divergence is dependent on how effective the intervention is. So if we have the same baseline average of 10 crimes per time period, but the intervention is 50% effective (so reduces to an average of 5 crimes per time period), you can tell it diverges by period 6 in this second simulation example.
If we go back to my power chart I made for the WDD test, you can see that these effective time periods are close to my power chart suggestions for the weak evidence line. So this cusum approach is maybe slightly more diagnostic, and has the benefit you may be able to stop the experiment early if it is really effective.
You should still commit to running the experiment though for a set amount of time. The amount of time should be based on how effective you think the experiment should be, as well as costbenefit analyses. If something costs alot of money (e.g. overtime) the effectiveness threshold to make it worthwhile in practice is a much higher bar than something that is closer to zero cost (such as shifting around current assignments). Given that ASEBP is advocating for lower level officers to experiment, they are more likely to be the latter type of low cost interventions.
There are still a few issues with using cusum charts like this though. One, this is very dependent on having a control area that is the same level of crime counts. So my WDD test only needs the parallel trends assumption, but this needs both the parallel trends and equal levels. (There are ways to get rid of the equal levels assumption though, one is to take the differences in differences and calculate those cusums over time.)
Another is that you need to reset cusum charts after a particular time period — you can see the simulations are random walks, and so grow much wider over time. I’m not sure at that point though if you should choose a new control area, or just stick with the prior one though. In the first example you can see the red line overestimates the effectiveness — the first chart the true estimate should be 2*time period
(estimated 75 versus should be 50 after 25 time periods). For the second the true effect is 5*time period
, so the estimate is a slight underestimate (estimated 100 versus should be 125 after 25 time periods).
But this is about the best meet in the middle of actively monitoring place based crime interventions and my advocacy for planning long term interventions that I can drum up for now. It is short term feedback, but you should be committed to running the experiment for a longer period of time. The sequential monitoring allows you to stop early if the intervention is really effective, see this example for A/B tests. But otherwise you are often better off just planning a long term intervention and not peek at the short term results.
Besides the technical stats portion of being able to tell if it diverges from a control area, it may also be behavioral, in that you need a longer period of time to generate deterrence, or for officers to effectively implement the strategy. You notice in these examples if you only did 5 time periods, they meander about 0 and so don’t appear to be effective. It takes longer time periods, even with the 50% effective intervention, to know if the intervention was effective given these low of baseline crime counts.
For a simplified example consider the network below, with individuals baseline probabilities of future risk noted in the nodes. Lets say the local treatment effect reduces the probability to 0, and the spillover effect reduces the probability by half, and you can only treat 1 node. Who do you treat?
We could select the person with the highest baseline probability (B), and the reduced effect ends up being 0.5(B) + 0.1(E) = 0.6
(the 0.1 is for the spillover effect for E). We could choose node A, which is a higher baseline probability and has the most connections, and the reduced effect is 0.4(A) + 0.05(C) + 0.05(D) + 0.1(E) = 0.6
. But it ends up in this network the optimal node to choose is E, because the spillovers to A and B justify choosing a lower probability individual, 0.2(E) + 0.2(A) + 0.25(B) = 0.65
.
Using this idea of a local effect and a spillover effect, I formulated an integer linear program with the same idea of a local treatment effect and a spillover effect:
Where is the reduction in the probability due to the local effect, and is the reduction in the probability due to the spillover effect. These probabilities are fixed values you know at the onset, e.g. estimated from some model like in Wheeler, Worden, and Silver (2019) (and Papachristos has related work using the network itself to estimate risk). Each node, i, then gets two decision variables; will equal 1 if that node is treated, and will equal 1 if the node gets a spillover effect (depending on who is treated). Actually the findings in WP show that these effects are not additive (you don’t get extra effects if you are treated and your neighbors are treated, or if you have multiple neighbors treated), and this makes it easier to keep the problem on the probability scale. So we then have our constraints:
Constraint 1 is that these are binary 0/1 decision variables. Constraint 2 is we limit the number of people treated to K (a value that we choose). Constraint 3 ensures that if a local decision variable is set to 1, then the spillover variable has to be set to 0. If the local is 0, it can be either 0 or 1. Constraint 4 looks at the neighbor relations. For Node i, if any of its neighbors local treated decision variable is set to 1, the Spillover decision variable can be set to 1.
So in the end, if the number of nodes is n
, we have 2*n
decision variables and 2*n + 1
constraints, I find it easier just to look at code sometimes, so here is this simple network and problem formulated in python using networkx and pulp. (Here is a full file of the code and data used in this post.) (Update: I swear I’ve edited this inline code snippet multiple times, if it does not appear I have coded constraints 3 & 4, check out the above linked code file. Maybe it is causing problems being rendered.)
####################################################
import pulp
import networkx
Nodes = ['a','b','c','d','e']
Edges = [('a','c'),
('a','d'),
('a','e'),
('b','e')]
p_l = {'a': 0.4, 'b': 0.5, 'c': 0.1, 'd': 0.1,'e': 0.2}
p_s = {'a': 0.2, 'b': 0.25, 'c': 0.05, 'd': 0.05,'e': 0.1}
K = 1
G = networkx.Graph()
G.add_edges_from(Edges)
P = pulp.LpProblem("Choosing Network Intervention", pulp.LpMaximize)
L = pulp.LpVariable.dicts("Treated Units", [i for i in Nodes], lowBound=0, upBound=1, cat=pulp.LpInteger)
S = pulp.LpVariable.dicts("Spillover Units", [i for i in Nodes], lowBound=0, upBound=1, cat=pulp.LpInteger)
P += pulp.lpSum( p_l[i]*L[i] + p_s[i]*S[i] for i in Nodes)
P += pulp.lpSum( L[i] for i in Nodes ) == K
for i in Nodes:
P += pulp.lpSum( S[i] ) <= 1 + 1*L[i]
ne = G.neighbors(i)
P += pulp.lpSum( L[j] for j in ne ) >= S[i]
P.solve()
#Should select e for local, and a & b for spillover
print(pulp.value(P.objective))
print(pulp.LpStatus[P.status])
for n in Nodes:
print([n,L[n].varValue,S[n].varValue])
####################################################
And this returns the correct results, that node E is chosen in this example, and A and B have the spillover effects. In the linked code I provided a nicer function to just pipe in your network, your two probability reduction estimates, and the number of treated units, and it will pipe out the results for you.
For an example with a larger network for just proof of concept, I conducted the same analysis, choosing 20 people to treat in a network of 311 nodes I pulled from Rostami and Mondani (2015). I simulated some baseline probabilities to pipe in, and made it so the local treatment effect was a 50% reduction in the probability, and a spillover effect was a 20% reduction. Here red squares are treated, pink circles are the spillover, and nontreated are grey circles. It did not always choose the locally highest probability (largest nodes), but did tend to choose highly connected folks also with a high probability (but also chose some isolate nodes with a high probability as well).
This problem is solved in an instant. And I think out of the box this will work for even large networks of say over 100,000 nodes (I have let CPLEX churn on problems with near half a million decision variables on my desktop overnight). I need to check myself to make 100% sure though. A simple way to make the problem smaller if needed though is to conduct the analysis on subsets of connected components, and then shuffle the results back together.
Looking at the results, it is very similar to my choosing representatives work (Wheeler et al., 2019), and I think you could get similar results with just piping in 1’s for each of the local and spillover probabilities. One of the things I want to work on going forward though is treatment noncompliance. So if we are talking about giving some of these folks social services, they don’t always take up your offer (this is a problem in choose rep’s for call ins as well). WP actually relied on this to draw control nodes in their analysis. I thought for a bit the problem with treatment noncompliance in this setting was intractable, but another paper on a totally different topic (Bogle et al., 2019) has given me some recent hope that it can be solved.
This same idea is also is related to hot spots policing (think spatial diffusion of benefits). And I have some ideas about that to work on in the future as well (e.g. how wide of net to cast when doing hot spots interventions given geographical constraints).
]]>
Two tricky parts to this: 1) making the north arrow and scale bar, and 2) figuring out the dimensions to make regular hexagons. As an illustration I use the shooting victim data from Philly (see the working paper for all the details) full data and code to replicate here. I will walk through a bit of it though.
First to start out, I just use these three libraries, and set the working directory to where my data is.
library(ggplot2)
library(rgdal)
library(proj4)
setwd('C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\HexagonMap_ggplot\\Analysis')
Now I read in the Philly shooting data, and then an outline of the city that is projected. Note I read in the shapefile data using rgdal
, which imports the projection info. I need that to be able to convert the latitude/longitude spherical coordinates in the shooting data to a local projection. (Unless you are making a webmap, you pretty much always want to use some type of local projection, and not spherical coordinates.)
#Read in the shooting data
shoot < read.csv('shootings.csv')
#Get rid of missing
shoot < shoot[!is.na(shoot$lng),c('lng','lat')]
#Read in the Philly outline
PhilBound < readOGR(dsn="City_Limits_Proj.shp",layer="City_Limits_Proj")
#Project the Shooting data
phill_pj < proj4string(PhilBound)
XYMeters < proj4::project(as.matrix(shoot[,c('lng','lat')]), proj=phill_pj)
shoot$x < XYMeters[,1]
shoot$y < XYMeters[,2]
It is a bit of work to make a nice basemap in R and ggplot, but once that upfront work is done then it is really easy to make more maps. To start, the GISTools
package has a set of functions to get a north arrow and scale bar, but I have had trouble with them. The ggsn
package imports the north arrow as a bitmap instead of vector, and I also had a difficult time with its scale bar function. (I have not figured out the cartography
package either, I can’t keep up with all the mapping stuff in R!) So long story short, this is my solution to adding a north arrow and scale bar, but I admit better solutions probably exist.
So basically I just build my own polygons and labels to add into the map where I want. Code is motivated based on the functions in GISTools
.
#creating north arrow and scale bar, motivation from GISTools package
arrow_data < function(xb, yb, len) {
s < len
arrow.x = c(0,0.5,1,0.5,0)  0.5
arrow.y = c(0,1.7 ,0,0.5,0)
adata < data.frame(aX = xb + arrow.x * s, aY = yb + arrow.y * s)
return(adata)
}
scale_data < function(llx,lly,len,height){
box1 < data.frame(x = c(llx,llx+len,llx+len,llx,llx),
y = c(lly,lly,lly+height,lly+height,lly))
box2 < data.frame(x = c(llxlen,llx,llx,llxlen,llxlen),
y = c(lly,lly,lly+height,lly+height,lly))
return(list(box1,box2))
}
x_cent < 830000
len_bar < 3000
offset_scaleNum < 64300
arrow < arrow_data(xb=x_cent,yb=67300,len=2500)
scale_bxs < scale_data(llx=x_cent,lly=65000,len=len_bar,height=750)
lab_data < data.frame(x=c(x_cent, x_centlen_bar, x_cent, x_cent+len_bar, x_cent),
y=c( 72300, offset_scaleNum, offset_scaleNum, offset_scaleNum, 66500),
lab=c("N","0","3","6","Kilometers"))
This is about the best I have been able to automate the creation of the north arrow and scale bar polygons, while still having flexibility where to place the labels. But now we have all of the ingredients necessary to make our basemap. Make sure to use coord_fixed()
for maps! Also for background maps I typically like making the outline thicker, and then have borders for smaller polygons lighter and thinner to create a hierarchy. (If you don’t want the background map to have any color, use fill=NA
.)
base_map < ggplot() +
geom_polygon(data=PhilBound,size=1.5,color='black', fill='darkgrey', aes(x=long,y=lat)) +
geom_polygon(data=arrow, fill='black', aes(x=aX, y=aY)) +
geom_polygon(data=scale_bxs[[1]], fill='grey', color='black', aes(x=x, y = y)) +
geom_polygon(data=scale_bxs[[2]], fill='white', color='black', aes(x=x, y = y)) +
geom_text(data=lab_data, size=4, aes(x=x,y=y,label=lab)) +
coord_fixed() + theme_void()
#Check it out
base_map
This is what it looks like on my windows machine in RStudio — it ends up looking alittle different when I export the figure straight to PNG though. Will get to that in a minute.
Now you have your basemap you can superimpose whatever other data you want. Here I wanted to visualize the spatial distribution of shootings in Philly. One option is a kernel density map. I tend to like aggregated count maps though better for an overview, since I don’t care so much for drilling down and identifying very specific hot spots. And the counts are easier to understand than densities.
In geom_hex
you can supply a vertical and horizontal parameter to control the size of the hexagon — supplying the same for each does not create a regular hexagon though. The way the hexagon is oriented in geom_hex
the vertical parameter is vertex to vertex, whereas the horizontal parameter is side to side.
Here are three helper functions. First, wd_hex
gives you a horizontal width length given the vertical parameter. So if you wanted your hexagon to be vertex to vertex to be 1000 meters (so a side is 500 meters), wd_hex(1000)
returns just over 866. Second, if for your map you wanted to convert the numbers to densities per unit area, you can use hex_area
to figure out the size of your hexagon. Going again with our 1000 meters vertex to vertex hexagon, we have a total of hex_area(1000/2)
is just under 650,000 square meters (or about 0.65 square kilometers).
For maps though, I think it makes the most sense to set the hexagon to a particular area. So hex_dim
does that. If you want to set your hexagons to a square kilometer, given our projected data is in meters, we would then just do hex_dim(1000^2)
, which with rounding gives us vert/horz measures of about (1241,1075) to supply to geom_hex
.
#ggplot geom_hex you need to supply height and width
#if you want a regular hexagon though, these
#are not equal given the default way geom_hex draws them
#https://www.varsitytutors.com/high_school_mathhelp/howtofindtheareaofahexagon
#get width given height
wd_hex < function(height){
tri_side < height/2
sma_side < height/4
width < 2*sqrt(tri_side^2  sma_side^2)
return(width)
}
#now to figure out the area if you want
#side is simply height/2 in geom_hex
hex_area < function(side){
area < 6 * ( (sqrt(3)*side^2)/4 )
return(area)
}
#So if you want your hexagon to have a regular area need the inverse function
#Gives height and width if you want a specific area
hex_dim < function(area){
num < 4*area
den < 6*sqrt(3)
vert < 2*sqrt(num/den)
horz < wd_hex(height)
return(c(vert,horz))
}
my_dims < hex_dim(1000^2) #making it a square kilometer
sqrt(hex_area(my_dims[1]/2)) #check to make sure it is square km
#my_dims also checks out with https://hexagoncalculator.apphb.com/
Now onto the good stuff. I tend to think discrete bins make nicer looking maps than continuous fills. So through some trial/error you can figure out the best way to make those via cut
. Also I make the outlines for the hexagons thin and white, and make the hexagons semitransparent. So you can see the outline for the city. I like how by default areas with no shootings are not given any hexagon.
lev_cnt < seq(0,225,25)
shoot_count < base_map +
geom_hex(data=shoot, color='white', alpha=0.85, size=0.1, binwidth=my_dims,
aes(x=x,y=y,fill=cut(..count..,lev_cnt))) +
scale_fill_brewer(name="Count Shootings", palette="OrRd")
We have come so far, now to automate exporting the figure to a PNG file. I’ve had trouble getting journals recently to not bungle vector figures that I forward them, so I am just like going with high res PNG to avoid that hassle. If you render the figure and use the GUI to export to PNG, it won’t be as high resolution, so you can often easily see aliasing pixels (e.g. the pixels in the North Arrow for the earlier base map image).
png('Philly_ShootCount.png', height=5, width=5, units="in", res=1000, type="cairo")
shoot_count
dev.off()
Note the font size/location in the exported PNG are often not quite exactly as they are when rendered in the RGUI window or RStudio on my windows machine. So make sure to check the PNG file.
My notes on how to get this to follow. Data and code to follow along can be downloaded from here.
First in my do file, I have a typical start up that sets the working directory and logs the results to a text file. I use set more off
so I don’t have to do the annoying this and tell Stata to keep scrolling down. The next part is partly idiosyncratic to my Stata work set up — I call Stata from a centralized install location here at EPPS in UTD. I don’t have write access there, so to install commands I need to set my own place to install them on my local machine. So I add a location to adopath
that is on my machine, and I also do net set ado
to that same location.
Finally, for here I ssc installed grstyle
and palettes
. The code is currently commented out, as I only need to install it once. But it is good for others to know what extra packages they need to fully replicate your results.
**************************************************************************
*START UP STUFF
*Set the working directory and plain text log file
cd "C:\Users\axw161530\Dropbox\Documents\BLOG\Stata_NiceMargins\Analysis"
*log the results to a text file
log using "LogitModels.txt", text replace
*so the output just keeps going
set more off
*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns\V15"
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns\V15"
*In this script I use
*net install http://www.statajournal.com/software/sj183/gr0073/
*ssc install grstyle, replace
*ssc install palettes, replace
**************************************************************************
Here is what I did to change my default graph settings. Again check out Ben Jann’s awesome website he made an all the great examples. That will be more productive than me commenting on every individual line.
**************************************************************************
*Graph Settings
grstyle clear
set scheme s2color
grstyle init
grstyle set plain, box
grstyle color background white
grstyle set color Set1
grstyle yesno draw_major_hgrid yes
grstyle yesno draw_major_ygrid yes
grstyle color major_grid gs8
grstyle linepattern major_grid dot
grstyle set legend 4, box inside
grstyle color ci_area gs12%50
**************************************************************************
So here is pretty straight forward. I read in the data as a CSV file, generate a new variable that is the weekly average number of crimes within 1000 feet in the historical crime data (see the working paper for more details). One trick I like to use with regression models with many terms is to make a global that specifies those variables, so I don’t need to retype them a bunch. I named it $ContVars
here. Finally for simplicity in this script I am just examining the burglary incidents, so I get rid of the other crimes using the keep
command.
**************************************************************************
*DATA PREP
*Getting the data
import delimited CrimeStrings_withData.csv
*Making the previous densities per time period
generate buff_1000_1 = buff_1000 * (7/1611)
*control variables used in the regression
global ContVars "d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 whiteperc blackperc hispperc asianperc under17 propmove perpoverty perfemheadhouse perunemploy perassist i.month c.dateint"
*For here I am just examining burglary incidents
keep if crimetype == 3
**************************************************************************
So basically what I want to do in the end is to draw an interaction effect between a dummy variable (whether a crime resulted in an arrest) and a continuous variable (the historical crime density at a location). I am predicting whether a crime results in a nearrepeat follow up — hot spots with more crime on average will have more nearrepeats simply by chance.
When displaying that interaction effect though, I only want to limit it to the support of the historical crime density in the sample. Or stated another way, the historical crime density variable basically ranges from 0 to 2.5 in the sample — I don’t care what the interaction effect is then at a historical crime density of 3.
To do that in Stata, I use summarize
to get the min/max of that historical crime density and pipe them into a global. The Grid global will then tell Stata how often to calculate those effects. Too few and the plot may not look smooth, too many and it will take margins forever to calculate the results. Here 100 points is plenty.
*I will need this later to draw the margins over the support
*Of the prior crime density
summarize buff_1000_1
global MyMin = r(min)
global MyMax = r(max)
global Grid = ($MyMax$MyMin)/100
This may seem overkill, as I could just fill in those values by hand later. If you look at my replication code for my paper though, I ended up doing this same thing for four different crimes and two different estimates, so I wanted as automated approach that avoids as many magic numbers as possible.
Now I estimate my logistic regression model, with my interaction effect and the global $ContVars
control variables I specified earlier. Here I am predicting whether a burglary has a follow up nearrepeat crime (within 1000 feet and 7 days). I think an arrest will reduce that probability.
*Now estimate the logit model
logit future0_1000_7 i.arrest c.buff_1000_1 i.arrest#c.buff_1000_1 $ContVars
Note that the estimate of the interaction effect looks like this:

future0_1000_7  Coef. Std. Err. z P>z [95% Conf. Interval]
+
1.arrest  .0327821 .123502 0.27 0.791 .2748415 .2092774
buff_1000_1  1.457795 .0588038 24.79 0.000 1.342542 1.573048

arrest#c.buff_1000_1 
1  .5013652 .2742103 1.83 0.067 1.038807 .0360771
So how exactly do I interpret this? It is very difficult to interpret the coefficients directly — it is much easier to make graphs and visualize what those effects actually mean on the probability of a nearrepeat burglary occurring.
Now the good stuff. Basically I want to show the predicted probability of a nearrepeat follow up crime, conditional on whether an arrest occurred, as well as the historical crime density. The first line uses quietly
, so I don’t get the full margins table in the output. The second is just all the goodies to make my nice grey scale plot. Note I name the plot — this will be needed for later combining multiple plots.
*Create the two margin plots
quietly margins arrest, at(c.buff_1000_1=($MyMin ($Grid) $MyMax))
marginsplot, recast(line) noci title("Residential Burglary, Predictive Margins") xtitle("Historical Crime Density") ytitle("Pr(Future Crime = 1)") plot1opts(lcolor(black)) plot2opts(lcolor(gs6) lpattern("")) legend(on order(1 "no arrest" 2 "arrest")) name(main)
You could superimpose confidence intervals on the prior plot, but those are the pointwise intervals for the probability for each individual line, they don’t directly tell you about the difference between the two lines. Viz. the difference in lines often can lead to misinterpretation (e.g. remember the Cleveland example of viz. the differences in exports/imports originally drawn by Playfair). Also superimposing multiple error bands tend to get visually messy. So a solution is to directly graph the estimate of the difference between those two probabilities in a separate graph. (Another idea though I’ve seen is here, a CI of the difference set to the midpoint of the two lines.)
quietly margins, dydx(arrest) at(c.buff_1000_1=($MyMin ($Grid) $MyMax))
marginsplot, recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Residential Burglary, Average Marginal Effects of Arrest") xtitle("Historical Crime Density") ytitle("Effects on Pr(Future Crime)") name(diff)
Yay for the fact that Stata can now draw transparent areas. So here we can see that even though the marginal effect grows at higher prior crime densities — suggesting an arrest has a larger effect on reducing near repeats in hot spots, the confidence interval of the difference grows larger as well.
To end I combine the two plots together (same image at the beginning of the post), and then export them to a higher resolution PNG.
*Now combining the plots together
graph combine main diff, xsize(6.5) ysize(2.7) iscale(.8) name(comb)
graph close main diff
graph export "BurglaryMarginPlot.png", width(6000) replace
I am often doing things interactively in the Stata shell when I am writing up scripts. Including redoing charts. To be able to redo a chart with the same name, you need to not only use graph close
, but also graph drop
it from memory. Then just dropping all the data and using exit
will finish out your script and close down Stata entirely.
**************************************************************************
*FINISHING UP THE SCRIPT
*closing the combined graph
graph close comb
*This is necessary if you want to reuse the plot names
graph drop _all
*Finish the script.
drop _all
exit, clear
**************************************************************************
]]>