Buffalo shootings paper published

My article examining spatial shifts in shootings in Buffalo pre/post Covid, in collaboration with several of my Buffalo colleagues, is now published in the Journal of Experimental Criminology (Drake et al., 2022).

If you do not have access to that journal, you can always just email me, or check out the open access pre-print. About the only difference is a supplement we added in response to reviewers, including maps with different grid cell sizes; here is a hex grid version of the changes:

The idea behind this paper was to see whether, given the dramatic increase in shootings in Buffalo after Covid started (they about doubled, similar to NYC; Kim & Phillips, 2021), the spatial hot spots changed. The answer is basically no (and I did a similar analysis in NYC as well).

While other papers have pointed out that crime increases disproportionately impact minority communities (Schleimer et al., 2022), which is true, it pays to be very specific about what my work and that work are each saying. Imagine we have two neighborhoods:

Neighborhood A, Disadvantaged/Minority, Pre 100 crimes, Post 200 crimes
Neighborhood B,    Advantaged/Majority, Pre   1 crimes, Post   2 crimes

The work I have done points to these Covid increases being proportional: relative rates stayed about the same (shootings roughly doubled in both Buffalo and NYC), and that doubling was spread out pretty much everywhere. It is certainly reasonable to interpret this as an increased burden on minority communities, even if the proportional trends are the same everywhere.

This proportional change tends to occur when crime declines as well (e.g. Weisburd & Zastrow, 2022; Wheeler et al., 2016). And this just speaks to the stickiness of hot spots of crime. Even with large macro changes in temporal crime trends, crime hot spots are very durable over time. So I really think it makes the most sense for police departments to have long term strategies to deal with hot spots of crime, and they don’t need to change targeted areas very often.


Power and bias in logistic regression

Michael Sierra-Arévalo, Justin Nix, and Bradley O’Guinn have a recent article examining officer fatalities following gunshot assaults (Sierra-Arévalo, Nix, & O’Guinn, 2022). They do not find that distance to a Level 1/2 trauma center makes a difference in survival probabilities, which conflicts with prior work of mine with Gio Circo (Circo & Wheeler, 2021). Justin writes this as a potential explanation for the results:

The results of our multivariable analysis indicated that proximity to trauma care was not significantly associated with the odds of officers surviving a gunshot wound (see Table 2 on p. 9 of the post-print). On the one hand, this was somewhat surprising given that proximity to trauma care predicts survival of gunshot wounds among the general public. On the other hand, police have specialized equipment, such as ballistic vests and tourniquets, that reduce the severity of gunshot wounds or allow them to be treated immediately.

I think it is pretty common, when results do not pan out, for people to turn to theoretical (or sociological) reasons why their hypothesis may be invalid. While these alternatives are often plausible, simpler data-based reasons are often equally plausible. Here I was concerned about two factors: 1) power, and 2) omitted gunshot wound severity factors. I did a quick simulation in R to show power seems to be OK, but the omitted severity confounders may be more problematic in this design, although they only bias the effect towards 0 (they would not cause the negative effect estimate MJB find).

Power In Logistic Regression

First, MJB’s sample size is just under 1,800 cases. You would think offhand this is plenty of power for whatever analysis, right? Well, power depends on the relevant effect size: for a small effect you need a bigger sample. My work with Gio found a linear effect in the logistic equation of 0.02 (each additional minute of driving increases the logit by 0.02). We had 5,500 observations, and our effect had a p-value just below 0.05, hence why power was a first thought. Also, logistic regression relies on asymptotics, and small-sample biases are common even up to 1,000 observations (Bergtold et al., 2018). So let's see a simple example, ignoring the other covariates:

# Some upfront work
logistic <- function(x){1/(1+exp(-x))}

# Scenario 1, no covariates omitted
n <- 2000; 
de <- 0.02
dist <- runif(n,5,200)
p <- logistic(-2.5 + de*dist)
y <- rbinom(n,1,p)

# Variance is small enough, seems reasonably powered
summary(glm(y ~ dist, family = "binomial"))

Here with 2,000 cases, taking the intercept from MJB’s estimates and the 0.02 from my paper, we see 2,000 observations is well powered to detect the same 0.02 effect from mine and Gio’s paper. Note that when doing a post-hoc power analysis, you don’t take the observed effect (the -0.001 in Justin’s paper), but a hypothetical effect size you think is reasonable (Gelman, 2019), which I just take from mine and Gio’s paper. Essentially this asks “Is Justin’s analysis well powered to detect an effect of the same size I found in the Philly data?”.

One thing that helps MJB’s design here is the larger variance in the distance variable; looking intra-city, the drive-time distances are smaller, which increases the standard error of the estimate. If we limit the distances to 30 minutes, the study is more on the fence as to being well enough powered (though it meets the threshold in this single simulation run):

# Limited distance makes the effect have a higher variance
n <- 2000; 
de <- 0.02
dist <- runif(n,1,30)
p <- logistic(-2.5 + de*dist)
y <- rbinom(n,1,p)

# Not as much variation in distance, less power
summary(glm(y ~ dist, family = "binomial"))

For a more serious analysis you would want to run these simulations many times and look at the typical result (since they are stochastic), but this is good enough for me to say power is not an issue in this design. If people are planning replications though, an intra-city study with only 1,000 observations is really pushing it with this design.
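To show what I mean, here is a minimal sketch of that repeated-simulation check (the sim_power wrapper and its defaults are just mine for illustration), re-using the logistic() helper and effect size from above and recording how often the distance coefficient comes back significant:

# Wrap the data generating process and count how often we reject at 0.05
set.seed(10)
sim_power <- function(n, de, lo=1, hi=30, reps=500){
    pvals <- rep(NA, reps)
    for (i in 1:reps){
        dist <- runif(n, lo, hi)
        p <- logistic(-2.5 + de*dist)
        y <- rbinom(n, 1, p)
        pvals[i] <- summary(glm(y ~ dist, family="binomial"))$coefficients["dist", 4]
    }
    mean(pvals < 0.05)  # proportion of simulations detecting the effect
}

# Power for the intra-city scenario (1 to 30 minute drive times)
sim_power(2000, 0.02)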

Omitted Confounders

One thing that is special about logistic regression, unlike linear regression, is that even if an omitted confounder is uncorrelated with the variable of interest, it can still bias the estimates (Mood, 2010). So even if you run a randomized experiment, your effects can be biased if some large effect is omitted from the regression equation. Several people interpret this as meaning logistic regression is fucked, but like that linked Westfall article, I think that is a bit of an over-reaction. Odds ratios are very tricky, but logistic regression as a method to estimate conditional means is not so bad.

In my paper with Gio, the largest effect on whether someone would survive was the location of the bullet wound. Drive-time distances then only marginally push that probability up or down. Here are conditional mean estimates from our paper:

So you can see that for someone shot in the head, drive time can make an appreciable difference over these ranges, from roughly a 45% to a 55% probability of death. Even if the location of the wound is independent of drive time (which seems quite plausible; people don’t shoot at your legs because you are far away from a hospital), it can still be an issue with this research design. I take Justin’s comment about ballistic vests reducing death as essentially taking the people in the middle of my graph (torso and multiple injuries) and pushing them into the purple line at the bottom (extremities). But people shot in the head are not impacted by the vests.

So let’s see what happens to our effect estimate when we generate the data with the extremity and head effects (I pulled the estimates from my article; the baseline reference is being shot in the head, and the negative effect is the reduction in the baseline log-odds when shot in an extremity):

# Scenario 3, wound covariate omitted
dist <- runif(n,5,200)
ext_wound <- rbinom(n,1,0.8)
ef <- -4.8
pm <- logistic(0.2 + de*dist + ef*ext_wound)
ym <- rbinom(n,1,pm)

# Biased downward (but not negative)
summary(glm(ym ~ dist, family = "binomial"))

You can see here the effect estimate is biased downward by a decent margin (less than half the size of the true effect). If we estimate the correct equation, we are on the money in this simulation run.
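Here is a minimal snippet along those lines, simply adding the omitted wound indicator back into the model (re-using the simulated data above):

# Including the wound covariate recovers an estimate close to the true 0.02
summary(glm(ym ~ dist + ext_wound, family = "binomial"))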

What happens if we up the sample size? Does this bias go away? Unfortunately it does not, here is an example with 10,000 observations:

# Scenario 3, wound covariate omitted, larger sample
n2 <- 10000
dist <- runif(n2,5,200)
ext_wound <- rbinom(n2,1,0.8)
ef <- -4.8
pm <- logistic(0.2 + de*dist + ef*ext_wound)
ym <- rbinom(n2,1,pm)

# Still a problem
summary(glm(ym ~ dist, family = "binomial"))

So this omission is potentially a bigger deal – but not in the way Justin states in his conclusion. The quote earlier suggests the true effect is 0 due to vests; I am saying the effect in MJB’s sample is biased towards 0 due to this large omitted confounder for the severity of the wound. Both are plausible, and there is no way, based just on MJB’s data, to determine whether one interpretation is right and the other is wrong.

This would not explain the negative effect estimate MJB find in their paper though; it would only bias the estimate towards 0. To be fair, Jessica Beard critiqued mine and Gio’s paper in a similar vein (saying the police wound location data had errors), which would bias our drive-time estimates towards 0 as well, so that factor may be even larger than what Gio and I estimated.

Potential robustness checks here are to simply fit a linear regression instead of a logistic regression with the same data (my graph above shows a linear model would be fine for these data if I included interaction effects with wound location), and to look at the unconditional marginal relationship between distance and the probability of death. If that relationship is highly non-linear, it is likely due to omitted confounders in the data (I suspect it may plateau as well, e.g. the first 30 minutes make a big difference, but after that it flattens out: you’ve either stabilized someone or they are gone at that point).
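As a rough sketch of those two checks, using the simulated data from the scenarios above (with real data you would swap in the observed outcomes, drive times, and wound locations):

# Check 1: linear probability model with a wound location interaction
summary(lm(ym ~ dist*ext_wound))

# Check 2: unconditional (marginal) relationship between distance and death,
# smoothed to eyeball non-linearity or a plateau
plot(lowess(dist, ym), type = "l",
     xlab = "Drive time (minutes)", ylab = "P(death)")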


In the case of intra-city public violence, the policy implications of drive times on survival are relevant when people are determining whether to keep open or close trauma centers. I did not publish this in my paper with Gio (you can see the estimates in the replication code), but we actually estimated counterfactual increased deaths from taking away facilities. The marginal effect is around 10~20 homicides over the 4.5 years if you take away one of the facilities in Philadelphia. I don’t know if reducing 5 homicides per year is sufficient justification to keep a trauma facility open, but officer shootings are themselves much less frequent, so those marginal effects are very unlikely to justify keeping a trauma facility open or closed by themselves.

You could technically figure out the optimal location to site a new trauma facility from mine and Gio’s paper, but probably a more reasonable response would be to site resources to get people to the ER faster. Philly already does scoop and run (Winter et al., 2021), where officers don’t wait for an ambulance. Another possibility is to proactively locate ambulances to get to scenes faster (Hosler et al., 2019). Again though, that just isn’t as relevant or feasible outside of major urban areas.

Oftentimes social science authors do an analysis, and then in the policy section say things that are totally reasonable on their face but are not supported by the empirical analysis. Here MJB’s suggestion that officers should increase their use of vests is totally reasonable, but nothing in their analysis supports that conclusion (ditto with the tourniquet statement). You would need to measure which incidents had those factors, and see their effect on officer survival, to make that inference. MJB could have made the opposite statement (since drive time doesn’t matter, maybe those things don’t make a difference in survival either) and been equally supported by the analysis.

I suspect MJB’s interest in the analysis was simply to see if survival rates were a potential cause of differential officer deaths across states (Sierra-Arévalo & Nix, 2020). Which is fine to look at by itself, even if it has no obvious direct policy implications. Talking back and forth with Justin before posting this, he mentioned the policy implications were added partly at the prodding of a reviewer. Either way (whether it comes from reviewers or the original writers), I don’t think we should pad papers with policy recommendations (ditto for theoretical musings) that aren’t directly supported by the empirical analysis we conduct.


  • Bergtold, J. S., Yeager, E. A., & Featherstone, A. M. (2018). Inferences from logistic regression models in the presence of small samples, rare events, nonlinearity, and multicollinearity with observational data. Journal of Applied Statistics, 45(3), 528-546.
  • Circo, G. M., & Wheeler, A. P. (2021). Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims. Applied Spatial Analysis and Policy, 14(2), 379-393.
  • Gelman, A. (2019). Don’t calculate post-hoc power using observed estimate of effect size. Annals of Surgery, 269(1), e9-e10.
  • Hosler, R., Liu, X., Carter, J., & Saper, M. (2019). RaspBary: Hawkes Point Process Wasserstein Barycenters as a Service.
  • Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.
  • Sierra-Arévalo, M., & Nix, J. (2020). Gun victimization in the line of duty: Fatal and nonfatal firearm assaults on police officers in the United States, 2014–2019. Criminology & Public Policy, 19(3), 1041-1066.
  • Sierra-Arévalo, M., Nix, J., & O’Guinn, B. (2022). A national analysis of trauma care proximity and firearm assault survival among U.S. police. Forthcoming in Police Practice and Research (post-print available).
  • Winter, E., Hynes, A. M., Shultz, K., Holena, D. N., Malhotra, N. R., & Cannon, J. W. (2021). Association of police transport with survival among patients with penetrating trauma in Philadelphia, Pennsylvania. JAMA network open, 4(1), e2034868-e2034868.

Using Natural Frequencies to Reason about Tests

Hackernews shared an article about visualizing Bayes Theorem using Venn diagrams, with the common example of going from the sensitivity/specificity of a test to the flipped probability of ‘do I have cancer’. I actually do not find Bayes Theorem all that helpful for thinking through this problem. I much prefer Gerd Gigerenzer’s approach using natural frequencies (Gigerenzer, 2011).

So here is an example, say we have a Covid rapid test that has a sensitivity of 90%. This metric is “if you have Covid, how often does the test say you have Covid”. In conditional probability terms we would write this P( Test Says + | Actually Have Covid ).

Oftentimes we want to know the opposite conditional though: “if we test positive, what is the actual probability we have Covid?”. So we want to know P( Actually Have Covid | Test Says + ), which switches the conditional from the earlier metric. To figure that out, we need some more information. One thing we need is an estimate of how common “Actually Have Covid” is in the overall population. Pretend this is 5% for our scenario.

Another is the false positive rate of the test, P( Test Says + | Do Not Have Covid ). Let’s say for our example this is 10%. Now how do we figure out P( Actually Have Covid | Test Says + )? Let’s use this nice little tree diagram, where we consider a hypothetical scenario of 1000 people getting Covid rapid tests:

So first thing to note is that the first split in the tree is something that doesn’t have anything to do with the tests at all – it is the overall proportion of Covid cases in the population. Here 5% means out of our 1000 people that take rapid tests, we expect only 50 of them to actually have Covid.

The second level of splits brings in the two data-based pieces of information. The sensitivity is the split on the left side: the test captures 90% of the true positives, so 45 out of the 50 get a positive result on the rapid test. The split on the right side uses the false positive rate of 10%: of the 950 without Covid, 95 will get a false positive result.

So in the end, we have 45 true positives out of 45 + 95 total positive tests (the colored balls in the tree diagram). That is our personal estimate of “I tested positive for Covid, do I actually have Covid?”, P( Actually Have Covid | Test Says + ). This estimate is 32%, a metric we sometimes like to call the Positive Predictive Value (PPV) of the test. I have an Excel spreadsheet where you can insert different overall proportions in the underlying population (the prevalence) as well as different estimates of a test’s sensitivity/false-positive rate, and it spits out the nice diagram.
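If you want to plug in different numbers without the spreadsheet, the same natural frequency arithmetic is only a few lines of code (a small sketch; the ppv function name is just mine for illustration):

# Natural frequency calculation of the positive predictive value
ppv <- function(prev, sens, fpr, n=1000){
    covid <- n*prev               # people who actually have Covid
    true_pos <- covid*sens        # of those, how many test positive
    false_pos <- (n - covid)*fpr  # positives among those without Covid
    true_pos/(true_pos + false_pos)
}

ppv(prev=0.05, sens=0.90, fpr=0.10)  # ~0.32, same as the tree diagram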

In the end, even if you have a very sensitive test with a low false positive rate, if the prevalence is very low you will have a low positive predictive value. For example, in my work predicting chronic offenders involved with shootings, the PPV among the top 1000 predictions was only 3% (Wheeler et al., 2019), because shooting someone or being shot is very rare. Those top 1000 have around 200 times the probability of a random person, but overall that probability is still quite low.

Note for data science folks that these different metrics all relate to the ROC curve. In my head I often translate sensitivity to “capture rate” – the proportion of true cases the test captures in the wild – since I can never remember which one sensitivity is. In ROC curves I sometimes label this as the True Positive Rate, although software often plots 1 – Specificity vs Sensitivity.

Based on the above scenario you may be thinking to yourself ‘I can’t fill out all of the unknowns’, especially the underlying prevalence of the outcome in the population. For overall prevalences you can often make reasonable guesses. For test error rates, it often takes some digging or guesswork to find typical values. But even if we don’t have uber-precise estimates, we can still make reasonable decisions by filling in the info we need and looking at the range of likely scenarios.

Knowing the Right Conditional Probability?

While in general I like the tree diagram approach, it doesn’t help with the more general issue of knowing the correct probability you care about for making decisions. I totally agree with Gigerenzer’s overall approach (he is basically the glass-half-full counterpart to Kahneman/Tversky: if you present info the right way, we don’t make such biased decisions), but even scientific experts often make conditional probability mistakes.

Much of the public discussion about Covid in particular uses specious probability estimates to justify positions. Below I just copy-paste my Hackernews comment giving examples from the Rogan/Gupta interview, but I think it is fairly symbolic of the news coverage overall:

I think a more basic problem is people don’t even know how to formulate meaningful probability hypothetical/counterfactuals to begin with (let alone update them). For Covid stuff pretty much all an individual should care about is:

P(Bad Stuff | I take action A) > P(Bad Stuff | I take action B)

So you take action B in this scenario (simplifying to ignore costs, many of these decisions are costless). We get a bunch of meaningless drivel in the news though that doesn’t help anyone make any meaningful probability estimates to help them make decisions. I think the Rogan/Gupta interview is a good example. We get various non-sense comparisons such as:

P(Bad Stuff | Covid, Person 5, No Vaccine) < P(Bad Stuff | Covid, Person 50, Vaccine)

[Rogan to Gupta why is it OK for 50 year old to feel safe and not a young kid without a vaccine? Irrelevant counter-factual unless someone invents a reverse aging treatment.]

P(Heart Inflammation | Covid Vaccine, Young Male) > P(Death | Covid, Young Male)

[Rogan saying side effects for young people outweigh benefits. This is true but death is quite a bit worse than the side effects, and this does not consider other Bad Stuff from Covid like long haulers.]

Knowing Bayes Theorem doesn’t help someone figure out the right probability statement they should be interested in to begin with.

The news overall is very frustrating, especially scientific coverage. Oftentimes the news avoids any specific numeric estimate entirely, and even when it gives one it is often not clear what the estimate refers to. A common one from news coverage is P(Bad Thing | ?????), where I often can’t tell what composes the conditional on the right-hand side. Sometimes it is those who tested positive, sometimes it is the overall population in an area, sometimes it is those in the hospital, etc. Those different conditionals can greatly change how you should interpret that information.


  • Gigerenzer, G. (2011). What are natural frequencies?. BMJ, 343.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The accuracy of the violent offender identification directive tool to predict future gun violence. Criminal Justice and Behavior, 46(5), 770-788.

Blogging Year in Review 2021

In total blog views for 2021, I will get a trickle of a few more views today, but I will not crack the 100k mark. So blog viewership has not really grown over the past few years; it just varies around 90k views per year.

Most of my traffic is a trickle of search engine referrals to old blog posts, so a list of my top posts of 2021 would be quite a boring list of old material.

I have to go down over 20 posts before ones I posted this year come into the views ranking. Typically I get a one time bump of 100~200 views for a single post when I first post it (I have never topped 600 views in one day). But after that it is just competing for search traffic referrals. (Those posts in 2021 are highlighted by the blue bar on the left in this screengrab.)

In other news, I have not written a blog post about it, but the move to a private sector data science gig was a good one for me (much less stressful than being an academic). Two years in I can safely make that assessment.

But, I have continued to do some academic papers on the side. The Buffalo paper was accepted at Journal of Experimental Crim, and the NIJ paper is under review for the IJOTCC open science special issue. So I can still do some criminology work to scratch that itch on the side.

In Covid times everything is remote, but I do enjoy participating in various groups (even if over zoom). As I posted on the blog, always feel free to send me an email to ask me anything.

Learning a fair loss function in pytorch

Most of the time when we are talking about deep learning, we are discussing really complicated architectures – essentially complicated sets of (mostly linear) equations. A second innovation in the application of deep learning though is the use of back-propagation to fit under-determined systems. Basically you can feed these systems fairly complicated loss functions, and they will chug along with no problem. See a prior blog post of mine on creating simple regression weights, for example.

Recently, in the NIJ challenge, I used pytorch and the non-linear fairness function defined by NIJ. In the end Gio and I used an XGBoost model, because the data actually do not have very large differences in false positive rates between racial groups. I figured I would share my example here though, as an illustration of using pytorch to learn a complicated loss function that has fairness constraints. And here is a link to the data and code to follow along if you want.

First, in text math the NIJ fairness function was calculated as (1 - BS)*(1 - FP_diff), where BS is the Brier Score and FP_diff is the absolute difference in false positive rates between the two groups. My pytorch code to create this loss function looks like this (see the pytorch_mods.py file):

# brier score loss function with fairness constraint
# (excerpt from the pytorch_mods.py file; torch import added so it is self-contained)
import torch

def brier_fair(pred,obs,minority,thresh=0.5):
    bs = ((pred - obs)**2).mean()               # Brier score
    over = 1*(pred > thresh)                    # predicted positives at the threshold
    majority = 1*(minority == 0)
    fp = 1*over*(obs == 0)                      # false positives
    min_tot = (over*minority).sum().clamp(1)    # clamp avoids division by zero
    maj_tot = (over*majority).sum().clamp(1)
    min_fp = (fp*minority).sum()
    maj_fp = (fp*majority).sum()
    min_rate = min_fp/min_tot                   # false positive rate per group
    maj_rate = maj_fp/maj_tot
    diff_rate = torch.abs(min_rate - maj_rate)  # absolute difference in FP rates
    fin_score = (1 - bs)*(1 - diff_rate)        # the NIJ fairness-weighted score
    return -fin_score                           # negate, since the optimizer minimizes

I have my functions all stashed in different py files. But here is an example of loading up all my functions and fitting my pytorch model to the training recidivism data. Here I set the threshold to 25% instead of the 50% used in the NIJ competition. Overall the model is very similar to a linear regression model.

import data_prep as dp
import fairness_funcs as ff
import pytorch_mods as pt

# Get the train/test data
train, test = dp.get_y1_data()
y_var = 'Recidivism_Arrest_Year1'
min_var = 'Race' # 0 is White, 1 is Black
x_vars = list(train)

# Model learning fair loss function
m2 = pt.pytorchLogit(loss='brier_fair',activate='ident',

I have a burn-in start to get good initial parameter estimates with a more normal loss function before going into the more complicated one. Another approach would be to initialize the weights to the solution for a linear regression equation. After that burn-in, it switches to the NIJ-defined fairness loss function.

Now I have functions to check the different model metrics in whatever sample you want. Here you can see the model is quite balanced in terms of false positive rates in the training sample:

# Seeing training sample fairness
m2pp = m2.predict_proba(train[x_vars])[:,1]

But of course in the out of sample test data it is not perfectly balanced. In general you won’t be able to ensure perfect balance in whatever fairness metrics out of sample.

# Seeing test sample fairness
m2pp = m2.predict_proba(test[x_vars])[:,1]

It actually ends up that the false positive rates for the two racial groups are quite similar even in models that do not incorporate the fairness constraint in the loss function. Here is a model using the same architecture but just the usual Brier score loss. (Code not shown, see the m1 model in the 01_AnalysisFair.py file in the shared dropbox link earlier.)

You can read mine and Gio’s paper (or George Mohler and Mike Porter’s paper) about why this particular fairness function is not the greatest. In general it probably makes more sense to use an additive fairness penalty instead of a multiplicative one, but in this dataset it won’t matter very much no matter how you slice it in terms of false positive rates. (It appears in retrospect that the Compas data that started the whole false positive debate is somewhat idiosyncratic.)

There are other potential oddities that can occur with such fair loss functions. For example if you had the choice between false positive rates for (white,black) of (0.62,0.60) vs (0.62,0.62), the latter is more fair, but the minority class is worse off. It may make more sense to have an error metric that sets the max false positive rate you want, and then just has weights for different groups to push them down to that set threshold.

These aren’t damning critiques of fair loss functions (these can be amended/changed to behave better), but in the end defining fair loss functions will be very tricky. Both for empirical reasons as well as for ethical ones – they will ultimately involve quite a few trade-offs.

genrules: using a genetic algo to find high risk cases

So I recently participated in the Decoding Maternal Morbidity Data Challenge. I did not win, but will share my work anyway. I created a genetic algorithm in python, genrules, to identify sets of association rules that result in high relative risks. I intentionally used a genetic algorithm with the idea that I wanted not just one final model, but a host of different potential rules, with the goal of exploratory data analysis.

You can go see how it works by checking out the notebook using the nuMoM2b data (which you have to request; I cannot upload it to github). Here are rules I found to predict high relative risk for infections:

And I have other code to summarize these; it happens that government insurance appears very commonly in the sets of rules found.

I actually like it better just as a convenient tool (that can take weights) to go through every possible combination of a set of categorical variables. I uploaded a second notebook showing it off on NIBRS data, predicting officer assaults (see my prior blog post on this).

So here one of the rules is stolen property plus the assailant using their motor vehicle as a weapon. So perhaps people fleeing with stolen property in a vehicle? (While I built quite a bit of machinery to look through higher-order rules, there is little reason to bother with sets of more than 3 variables in the vast majority of circumstances.)

This shows the risk of competing in challenges: a few weekends lost with no reward. This was a fuzzy competition, and something I appear to have misread in the various notes was that the challenge asked for the creation of novel algorithms. Judging from the titles of the winning entries, people just used typical machine learning approaches and did a deep dive into the data. I did the opposite: I created a general tool that could be used by many people (who are better experts on the topical material than me).

Also note this approach is very much data mining. While I use p-values as a mechanism to penalize overly complicated rules in the genetic algorithm, I wouldn’t take them at face value. With large datasets you will always mine some rule sets that are ultimately confounded.

Splines in Stata traj models

On my prior trajectories using Stata post, Nandita Krishnan asks if we can estimate trajectory groups using linear splines (instead of polynomials). I actually prefer restricted cubic splines over polynomials, so it wouldn’t bother me if people did these as a default replacement for the polynomial functions. Also see this paper by Francis et al. using b-splines.

In the Stata traj library, you can just use time varying covariates to swap out the higher order polynomials with the spline effects (whether linear, non-linear, cyclic, etc.) you want. Below is a simple example using linear splines.

Example Linear Spline

First, as a note, I am not the author of the traj library; several people have commented on my prior blog post about issues installing it. Please email questions to Bobby Jones and/or Dan Nagin. The only update to note: when I updated Stata to V17 and re-installed traj, I needed to specify an “https” site instead of an “http” one:

net from https://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

That is about the only potential hint I can give if you are having troubles installing the add-on module. (You could download all the files yourself and install locally I imagine as well.)

Now here I make a simple simulated dataset, with a linear bend in one group and the other a simple line (with some error for both):

set more off
set seed 10
set obs 2000
generate caseid = _n
generate group = ceil(caseid/1200) - 1

forval i = 1/10 {
  generate y_`i' = 5 + 0.5*`i' + rnormal(0,0.2) if group==0
  replace y_`i' = 2 + 0.5*`i' + rnormal(0,0.4) if group==1 & `i'<6
  replace y_`i' = 4.5 + -0.2*(`i'-5) + rnormal(0,0.4) if group==1 & `i'>=6
}

* Check out the trajectories
reshape long y_ , i(caseid) j(t)
graph twoway scatter y t, c(L) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

So now we are going to make our spline variables. It is actually easier in long format than wide. Here I do a linear spline, although in Stata you can do restricted cubic splines as well (just google “mkspline Stata” to get the nice PDF docs). So I make the two spline variables, and then just reshape back into wide format:

mkspline s1_ 5 s2_=t, marginal
reshape wide y_ s1_ s2_, i(caseid) j(t)

And now we can estimate our trajectory model, including our additional s2_* linear spline as a time-varying covariate:

traj, var(y_*) indep(s1_*) model(cnorm) min(0) max(20) order(1 1) sigmabygroup tcov(s2_*)

And here is the output:

==== traj stata plugin ====  Jones BL  Nagin DS,  build: Mar 17 2021
2000 observations read.
2000 observations used in the trajectory model.
                        Maximum Likelihood Estimates
                        Model: Censored Normal (cnorm)

                                       Standard       T for H0:
 Group   Parameter        Estimate        Error     Parameter=0   Prob > |T|
 1       Intercept         1.99290      0.01432         139.183       0.0000
         Linear            0.50328      0.00395         127.479       0.0000
         s2_1             -0.70311      0.00629        -111.781       0.0000
 2       Intercept         4.98822      0.00581         859.016       0.0000
         Linear            0.50310      0.00160         314.224       0.0000
         s2_1             -0.00382      0.00255          -1.498       0.1341
 1       Sigma             0.40372      0.00319         126.466       0.0000
 2       Sigma             0.20053      0.00129         154.888       0.0000
  Group membership
 1       (%)              40.00001      1.09566          36.508       0.0000
 2       (%)              59.99999      1.09566          54.761       0.0000
 BIC= -3231.50 (N=20000)  BIC= -3221.14 (N=2000)  AIC= -3195.94  ll=  -3186.94

 Entropy = unknown function /log()

Not sure about the unknown function log error message, but otherwise this is what we expected. Surprisingly to me, trajplot works out of the box:


I figured I would need to write matrix code to get this plot, but no issues! If you are wondering why the estimated spline effect here is ~-0.7 instead of the -0.2 I simulated, it is because these are cumulative effects, not independent ones. To get the -0.2, you can do a post-hoc Wald test:

matrix list e(b)
lincom _b[s2_1G1] + _b[linear1]

Which gives as output:

. matrix list e(b)

       interc1     linear1      s2_1G1     interc2     linear2      s2_1G2      sigma1      sigma2      theta2
y1   1.9928963   .50327955  -.70311464   4.9882231   .50310322  -.00382206   .40372185   .20052795   .40546454

. lincom _b[s2_1G1] + _b[linear1]

 ( 1)  linear1 + s2_1G1 = 0

             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
         (1) |  -.1998351    .003097   -64.52   0.000    -.2059051    -.193765

While this is for a linear spline, more complicated cubic splines (or whatever you want, you could do cyclic splines if you have repeating data for example) will just result in more variables going into the tcov() parameters.

Note a major limitation of this: you need to specify the spline knot locations upfront. The method does not search for the knots – here I needed to tell mkspline to set a knot location at t=5. I think this is an OK limitation, as specifying knots at regular locations along the domain of the data can approximate many non-linear functions, and it is not much different from assuming a specific number of polynomial terms (but splines will tend to behave better in the tails).

Alternative models search for the changepoints themselves, but that would be tough to extend to mixture models (quite a few unknown parameters). It is better to either use flexible fixed splines, or just go for a continuous multi-level model formulation instead of discrete mixture classes.

Regression Discontinuity Designs

Regression Discontinuity Designs (RDD) are one way to evaluate predictive model systems with causal outcomes associated with what you do with that information. For a hypothetical example, imagine you have a predictive model assessing the probability that you will be diagnosed with diabetes in the next two years. Those that score above 30% probability get assigned a caseworker, to try to prevent that individual from contracting diabetes. How do you know how effective that caseworker is in reducing diabetes in those high risk individuals?

The RDD design works like this – you have your running variable (here the predicted probability) and the threshold (over 30%) that gets assigned the treatment. You model the actual outcome as a function of the running variable (it may be an outcome besides future diabetes; for example, the caseworker may simply reduce overall medical costs even if the person still ends up being diagnosed with diabetes). You then estimate the outcome curve just before and just after the threshold. The jump between those two curves at the threshold is the estimated treatment effect.

Here is an example graph illustrating this (with fake data I made; see the end of the post). The bump in the line (going from the blue to the red) is the average treatment effect of being assigned a caseworker, taking into account the underlying trend that higher-risk people here are more likely to have higher medical bills.

A few things to note about RDD. There is a tension between estimating the underlying curve and the counterfactual bump at the threshold. Theoretically, values closer to the threshold should be more relevant, so some (see the Wikipedia article linked earlier) try to estimate non-linear weighted curves, giving cases closer to the threshold higher weights. This often produces strange artifacts (that Andrew Gelman likes to point out on his blog) that can mis-estimate the RDD effect. This is clearly the case in noisy data if the line dips just before the threshold and curves up right after it.

So you can see in my code at the end that I prefer to estimate this using splines, as opposed to weighted estimators that have a bit of noise. (Maybe someday I will code up a solution to do out-of-sample predictive evaluations for this as an additional check.) With this approach it is easy to incorporate other covariates (and look at treatment heterogeneity if you want). Note that while Wikipedia says the weighted estimator is non-parametric, this is laughably wrong (it is two straight lines in their formulation, quite a restrictive functional form actually). While I can see some theoretical justification for the weighted estimators, in practice they are quite noisy and not very robust to minor changes in the weights (and you always need to make some parametric assumptions in this design, both for the underlying curve and for the standard error around the threshold bump).

There are additional design details you need to worry about, such as fuzzy treatments or self-selection. E.g. if a person can fudge the numbers a bit and choose which side of the line they are on, it can cause issues with this design. In those cases there may be a reasonable workaround (e.g. an instrumental variable design), but the gist of the research design will be the same.

Last but not least, to actually conduct this analysis you need to cache the original continuous prediction. In trying to find real data for this blog post, I found that in many criminal justice risk assessment examples people only end up saving the final categories (low, medium, high) and not the original continuous risk score.

If folks have a good public data example that I could show with real data, please let me know! Scouting around for a bit (either parole/probation risk assessment, or spatial predictive policing) has not turned up any very good datasets for me to construct examples with (health examples would be fine as well).

Below are the simulated data (in python), the RDD graph, and the statsmodels code for how I estimate the RDD bump effect. You could of course do fancier things (such as penalizing the derivatives of the splines), but I think this would be a good way to estimate the RDD effect in many designs where appropriate.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulating data
n_cases = 3000 # total number of cases
# pretend this is predicted prob
prob = np.random.beta(3,10,size=n_cases)
# pretend this is med costs over a year
med_cost = 3000 + 5000*prob + -500*(prob > 0.3) + np.random.normal(0,500,n_cases)
df = pd.DataFrame(zip(prob,med_cost), columns=['Prob','MedCost'])
# could do something fancier with non-linear effects for prob

# Fitting regression model

# Knots are small distance from threshold
# (Could also do a knot right on threshold)
mod = smf.ols(formula='MedCost ~ bs(Prob,knots=[0.2,0.25,0.35,0.4]) + I(Prob > 0.3)', data=df)
res = mod.fit()

# Plotting fit

# Getting standard errors
prob_se = res.get_prediction().summary_frame()
prob_se['Prob'] = prob
# split below/above the threshold (sorted so the fitted lines plot cleanly)
low = prob_se[prob_se['Prob'] <= 0.3].sort_values('Prob').copy()
high = prob_se[prob_se['Prob'] > 0.3].sort_values('Prob').copy()

# Getting effect for threshold bump
coef = res.summary2().tables[1]
ci = coef.iloc[1,4:6].astype(int).to_list()

fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(df['Prob'], df['MedCost'], c='grey',
           edgecolor='k', alpha=0.15, s=5, zorder=1)
ax.axvline(0.3, linestyle='solid', alpha=1.0, 
           color='k',linewidth=1, zorder=2)
ax.plot(low['Prob'], low['mean'], linewidth=2,
        zorder=3, color='darkblue')
ax.plot(high['Prob'], high['mean'], linewidth=2,
        zorder=3, color='red')
ax.set_xlabel('Predicted Prob Diabetes')
ax.set_ylabel('Medical Costs')
ax.set_title(f'RDD Reduced Cost Estimate {ci[0]} to {ci[1]} (95% CI)')
ax.text(0.3,6500,'Threshold',rotation=90, size=9,
         ha="center", va="center",
         bbox=dict(boxstyle="round", ec='k',fc='grey'))
plt.savefig('RDD.png', dpi=500, bbox_inches='tight')

How to interpret the Area Under the Curve (AUC) stat

One of the questions I often ask in data science interviews is ‘How would you explain the area under the curve statistic to a business person?’. Maybe it is too easy a question even for juniors, as I can’t remember anyone getting it wrong. While there is no single correct answer, the most logical response is to focus on discussing true positives and false positives, and how the predictive model can be tuned to capture more true positives at the expense of generating more false positives. Only after that do you even bother to show the ROC curve and say we calculate the area under the curve (AUC) as a measure of how well the model can discriminate between the two classes.

In the most recent situation where this came up in real life, I said to the business rep that the AUC does not directly translate to revenue, but it is a good indication that a model is good in an absolute sense (we know others have AUCs typically around 0.7 to 0.8 for this problem, and over 0.5 is better than random). And it is often good in a relative sense – a model with an AUC of 0.8 is typically better than a model with an AUC of 0.75 (although not always; you need to draw the ROC curves and make sure the larger-AUC curve dominates the other curve and that they do not cross). So while I try to do my best explaining technical statistical content, I often punt to simpler ‘here are the end outcomes we care about’ explanations (which don’t technically answer the question) as opposed to ‘here is how the sausage is made’ ones.

One alternative, simple explanation of the AUC for binary models is to use the Harrell’s C index interpretation, which for binary outcomes is equivalent to the AUC statistic. For this statistic you could say something like ‘If I randomly sample a negative case and a positive case, the positive case will have a higher predicted risk {AUC} percent of the time.’ I do like this interpretation, which illustrates that the AUC is all about rank ordering the predictions, and a more discriminating model will have a higher AUC (although it says nothing about calibration).
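As a quick sanity check of that interpretation, here is a small simulated sketch in R showing that the pairwise comparison version and the usual Mann-Whitney (rank) calculation of the AUC give the same number:

# Simulate an outcome and a predicted risk score
set.seed(10)
n <- 1000
y <- rbinom(n, 1, 0.3)
score <- 0.6*y + rnorm(n)
pos <- score[y == 1]; neg <- score[y == 0]

# Probability a random positive scores higher than a random negative (ties count half)
mean(outer(pos, neg, ">") + 0.5*outer(pos, neg, "=="))

# Same value via the Mann-Whitney/rank formula for the AUC
r <- rank(score)
(sum(r[y == 1]) - length(pos)*(length(pos) + 1)/2)/(length(pos)*length(neg))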

Text analysis, alt competition sites, and ASC

A bit of a potpourri blog post today. First, I am not much of a natural language processing wiz. But based on the work of Peter Baumgartner at RTI (assigning reduced-form codes based on text descriptions), I was pointed to the simpletransformers library. It makes it very easy to download complicated NLP architectures (like RoBERTa with 100 million+ parameters) and retrain them on your idiosyncratic data.

Much of the work with text data is the cleaning, and with these extensive architectures much of that is not necessary. See for example this blog post on classifying different toxic comments. Out of the box, the multi-label classification gets an AUC score pretty damn close to the winning entry in the Kaggle contest the data were developed for. No text munging necessary.

Playing around on my personal machine, I have been able to download and re-tune the pretrained RoBERTa model – fitting the same model as that blog post (with all the defaults), it takes around 7 hours on my GPU.

The simpletransformers library has a ton of different pre-set architectures for different problems. The ones I have played around with use labelled data (e.g. you have text data on the right-hand side and want to predict a binary or multinomial outcome), and I have had decent success with those.

Another text library I have played around with (although I have not had as much success with it in production) is dirty_cat. This is for unsupervised modeling, which unfortunately is a harder task to evaluate for success than supervised learning.

Alt Competition Sites

I recently spent two days trying to work on a recent Kaggle competition, a follow up to the toxic comments one above. My solution is nowhere close to the current leaderboard though, and given the prize total (and I expect something like 5,000 participants), this just isn’t worth my time to work on it more.

Two recent government competitions I did compete in though are the NIJ recidivism challenge and the NICHD maternal morbidity challenge. (I will release my code for the maternal morbidity one when the competition is fully scored; it is a fuzzy one, not a best-predictive-accuracy one.) Each of these competitions had under 50 teams participate, so there is much less competition than on Kaggle. The CDC has a new one as well, on using a network-based approach to violence and drug problems.

For some reason these competitions are not on the Challenge.gov website. Another site I wanted to share as well is DataDriven competitions. If I had found that sooner I might have given the floodwater competition a shot.

I have mixed feelings about the competitions, and they are risky. For the NIJ and NICHD challenges I probably spent something like $10,000 to $20,000 worth of my personal time on the code solutions (for each individually). I knew NIJ would not have many submissions (I did not participate in the geographic forecasting, and saw some people win with silly strategies). If you submitted anything in the student category you would have won close to the same amount as my team did (as not all the slots were filled up). And the NICHD one was quite onerous in terms of paperwork, so I figured it would also have low turnout (and the prizes are quite good). So whether I think it is worth giving one a shot comes down to guessing the total competitor pool, the level of effort to submit a good entry, and how the prizes are divvied up, as well as the total dollar amount.

The CDC violence one has strangely low prizes, so I wouldn’t bother to submit unless I already had a relevant project I was working on anyway. I think a better use of these federal challenges would be to have easier pilot work and, based on the pilot work, fund larger projects. So consider the initial challenge roughly equivalent to a grant proposal. This especially makes sense for developing fairness algorithms (not so much for deciding who has the best hyper-tuned XGBoost model on a particular train/test dataset).

Missing ASC

The American Society of Criminology conference is going on now in Chicago. A colleague emailed the other day asking if I was coming, and I do miss meeting up with friends. The majority of presentations are quite bad (both for content and presentation style), so it is more of an excuse to have a beer with friends than anything.

I debated with my wife about taking a family vacation to Chicago during this conference earlier in the year. We decided against it because of looming Covid – I correctly predicted it would still be quite prevalent (and I am guessing it will be indefinitely at this point, given vaccine hesitancy and new variants). I incorrectly predicted though that I wouldn’t be able to get a vaccine shot until October (so I am very impressed with the distribution on that front). Even my son has a shot (I didn’t even try to guess when that would happen). So I am not sure if I made the correct choice in retrospect – the risk of contraction is as high as I guessed, but the risk of adverse effects given we have the vaccines is very low.