Using quantile regression to evaluate police response times

Jeff Asher recently had a post analyzing response times across many agencies. One nitpick though (and ditto for prior analyses I have seen, such as Scott Mourtgos and company), is that you should not use linear models (or means in general) to describe response time distributions. They are very heavily right skewed, so the mean tends not to be representative of the bulk of the data.

When evaluating changes in response time, imagine two simplistic scenarios. One, every single call increases by 5 minutes, so what used to be 5 is now 10, 20 is now 25, 60 is now 65, etc. That is probably not realistic for response times; more likely it is calls in the tail (ones that wait a very long time for an opening in the queue) that change. E.g. 5 is still 5, 20 is still 20, but 60 is now 120. In the latter scenario, the left tail of the distribution does not change, only the right tail. In both scenarios the mean shifts.

I think a natural way to model the problem, instead of focusing on means, is to use quantile regression. It allows you to characterize the entire distribution (look at the left and right tails) and still control for individual level circumstances. Additionally, emergency agencies often set goals along the lines of “I want to respond to 90% of emergency events within X minutes”. Quantile regression is a great tool to describe that 90% mark. Here I am going to show an example using the New Orleans calls for service data and python.

First, we can download the data right inside of python without saving it to disk. I am going to be showing off estimating quantile regression with the statsmodels library. I do the analysis for 2019 through 2022, but NOLA has calls for service going back to the early 2010s if folks are interested.

import pandas as pd
import statsmodels.formula.api as smf

# Download data, combo 19/20/21/22
y19 = 'https://data.nola.gov/api/views/qf6q-pp4b/rows.csv?accessType=DOWNLOAD'
y20 = 'https://data.nola.gov/api/views/hp7u-i9hf/rows.csv?accessType=DOWNLOAD'
y21 = 'https://data.nola.gov/api/views/3pha-hum9/rows.csv?accessType=DOWNLOAD'
y22 = 'https://data.nola.gov/api/views/nci8-thrr/rows.csv?accessType=DOWNLOAD'
yr_url = [y19,y20,y21,y22]
res_pd = [pd.read_csv(url) for url in yr_url]
data = pd.concat(res_pd,axis=0) # a little over 1.7 million rows

Now we do some data munging. Here I eliminate self initiated events, as well as those with missing data. There are then just a handful of cases with 0 minute arrivals, which to be consistent with Jeff’s post I also eliminate. I create a variable, minutes, that is the minutes between the time created and the time arrived on scene (not cleared).

# Prepping data
data = data[data['SelfInitiated'] == 'N'].copy() # no self init
data = data[~data['TimeArrive'].isna()].copy()   # some missing arrive
data['begin'] = pd.to_datetime(data['TimeCreate'])
data['end'] = pd.to_datetime(data['TimeArrive'])
dif = data['end'] - data['begin']
data['minutes'] = dif.dt.seconds/60
data = data[data['minutes'] > 0].copy() # just a few left over 0s

# Lets look at the distribution
data['minutes'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

Looking at quantiles for the entire sample, the median time is around 20 minutes, the 10th percentile is under 3 minutes, and the 90th percentile is around 5 hours. Using the mean here (which in Jeff’s post varies from 50 to 146 minutes over the same 4 year period) can be somewhat misleading.

An important component of response times is differentiating between different priority calls. In NOLA’s data, higher numbers are higher priority. Zero priority are things NOLA says don’t necessarily need an officer at all. So it could be that those “0 priority” calls are really just dragging the overall average down over time, although they may have little to do with clearance rates or public safety overall. The priority field also has sub-categories, e.g. 1A is higher priority than 1B. To keep the post simple I just break down by the integer leading values, not the sub letter-categories.

# Priority just do 1/2/3
# 3 is highest priority
data['PriorCat'] = data['Priority'].str[0]
# Only 5 cases of 3s, will eliminate these as well
data.groupby('PriorCat')['minutes'].describe()

Here you can really see the right skewness – for priority 2 calls the mean is 25 minutes, but the median is under 10 minutes for the entire sample. A benefit of quantile regression I will use in a bit is that the few outlying cases (beyond the quantiles of interest) really don’t impact the analysis. So those cases that take almost 24 hours (I imagine they are just auto-filled in like that in the data) really don’t impact estimates of smaller quantiles. But they can have noticeable influence on mean estimates.

Some final data munging, to further simplify I drop the 16 cases of priority 3s and 4s, and add in a few more categorical covariates for hour of the day, and look at months over time as categorical. (These decisions are more so to make the results easier to parse in a blog post in simpler tables, it would take more work to model a non-linear continuous over time variable, say via a spline, and make a reasonable ordinal encoding for the sub-priority categories.)

# only worry about 0/1/2s
data = data[data['PriorCat'].isin(['0','1','2'])].copy()
# Total in the end almost 600k cases

# Some factor date variables
def dummy_stats(vdate,begin_date):
    bd = pd.to_datetime(begin_date)
    year = vdate.dt.year
    month = vdate.dt.month
    week_day = vdate.dt.dayofweek
    hour = vdate.dt.hour
    diff_days = (vdate - bd).dt.days
    # if binary, turn week/month into dummy variables
    return diff_days, week_day, hour, month, year

dn, wd, hr, mo, yr = dummy_stats(data['begin'],'1/1/2022')
data['Hour'] = hr
data['Month'] = mo
data['Year'] = yr

# Lets just look at months over time
data['MoYr'] = data['Year'] + data['Month']/100

Now finally onto the modeling stuff. For those familiar with regression, quantile regression, instead of predicting the mean, predicts a quantile of the distribution. Here I show predicting the 50th quantile (the median). For those not familiar with regression, this is not all that different than doing a pivot table/group by, but aggregating by quantiles instead of means. Regression is somewhat different than the simpler pivot table, since you “condition on” other factors (here I “control for” hour of day), but in broad strokes it is similar.

Here I use a patsy “R style” formula, and fit a categorical covariate for the 0/1/2 categories, hour of day, and the time varying months over time (to see the general trends). The subsequent regression table is big, so will show in parts:

# Quantile regression for median
mod = smf.quantreg("minutes ~ C(PriorCat, Treatment(reference='2')) + C(Hour) + C(MoYr)", data)
res50 = mod.fit(q=0.5)
res50.summary()

First, I use priority 2 events as the referent category, so you can see (in predicting the median of the distribution) that priority 1 events have a median 24 minutes longer than priority 2, and priority 0 events a median around two hours longer. You can see some interesting patterns in the hour of the day effects (which are overall effects, not broken down by priority). There are likely shift changes at 06:00, 14:00, and 22:00 that result in longer wait times.

But of most interest are patterns over time, here is the latter half of that table, showing median estimates over the months in this sample.

You could of course make the model more complicated, e.g. look at spatial effects or incorporate other direct measures of capacity/people on duty. But here it is complicated enough for an illustrative blog post. January-2019 is the referent category month, and we can see slight decreases of a few minutes around the start of the pandemic, but median times have clearly been increasing fairly noticeably starting later in 2021.

As opposed to interpreting regression coefficients, I think it is easier to see model predictions. We can just make sample data points, here at noon over the different months, and do predictions over each different priority category:

# Predictions for different categories
hour = 12
prior_cat = [0,1,2]
oos = data.groupby(['PriorCat','MoYr'],as_index=False)['Hour'].size()
oos['Hour'] = 12
oos['Q50'] = res50.predict(oos)

print(oos[oos['PriorCat'] == '0'])
print(oos[oos['PriorCat'] == '1'])
print(oos[oos['PriorCat'] == '2'])

So here for priority 0, the median has crept up from 130 to 143 minutes.

And for priority 1, median times went from 35 to 49 minutes.

Note that the way I estimated the regression equation, the increase/decrease per month is forced to be the same across the different priority calls. So, the increase among priority 2 calls is again around 13 minutes according to the model.

But this assumption is clearly wrong. Remember my earlier “fast” and “slow” example, with only the slow calls increasing. That would suggest the distributions for the different priority calls will likely have different changes over time. E.g. priority 0 may increase by a lot, but priority 2 will be almost the same. You could model this in the formula via an interaction effect, e.g. something like "minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", but to make the computer spit out a solution a bit faster, I will subset the data to just priority 2 calls.
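As a rough sketch (not something I ran for this post, since it is much slower on the full ~600k rows), that interaction model would look like the below, letting each priority category have its own month-to-month trend:

# Sketch of the fuller interaction model (slower on the full data)
mod_int = smf.quantreg("minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", data)
res_int = mod_int.fit(q=0.5)
oos['Q50_int'] = res_int.predict(oos)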

Here the power of quantile regression is we can look at different distributions. Estimating extreme quantiles is tough, but looking at the 10th/90th (as well as the median) is pretty typical. I do those three quantiles, and generate model predictions over the months (again assuming a call at 12).

# To save time, I am only going to analyze
# Priority 2 calls now
p2 = data[data['PriorCat'] == '2'].copy()
m2 = smf.quantreg("minutes ~ C(MoYr) + C(Hour)", p2)
oos2 = oos[oos['PriorCat'] == '2'].copy()

# loop over different quantiles
qlist = [0.1, 0.5, 0.9]
for q in qlist:
    res = m2.fit(q=q)
    oos2[f'Q_{q}'] = res.predict(oos2)

oos2

So you can see my story about fast and slow calls plays out, even when restricted to purportedly high risk calls. When looking at just priority 2 calls in New Orleans, the 10th percentile stays very similar over the period, although it does have a slight increase from under 4 to almost 5 minutes. The 50th percentile has slightly more growth, but only from 10 minutes to 13 minutes. The 90th percentile though has more volatility – it grew from 30 to 60 minutes in small increases, and in late 2022 has fairly dramatic further growth to 70/90 minutes. And you can see how the prior model, which did not break out priority 0/1 calls, changed the estimates for the priority 2 left tail as well. (So those groups likely also had large shifts across the entire distribution.)

So my earlier scenario was overly simplistic – we can see some increase in the left tail of the distribution as well in this analysis. But the majority of the increase is due to changes in the long right tail – calls that used to take less than 30 minutes are now taking 90 minutes to arrive. That still likely has implications for satisfaction with police and reporting behavior, though maybe not so much for clearances or direct public safety.

No easy answers here in terms of giving internet advice to New Orleans. If working with NOLA, I would like to get estimates of officer capacity per shift, so I could incorporate that factor directly into the quantile regression model. That would allow you to precisely quantify how officer capacity impacts the distribution of response times. So not just “response times are going up” but “the decrease in capacity from A to B resulted in X increase in the 90th percentile of response times”. If NOLA had set goals, they could then precisely state where officer capacity needed to be to have a shot at obtaining them.
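To make that concrete, here is a minimal sketch of what that model might look like, assuming a hypothetical staffing column (officers_on_duty is made up for illustration – it would need to be merged in from internal records and is not in the public calls for service data):

# Hypothetical sketch -- 'officers_on_duty' is an assumed staffing measure
# merged in from internal records, it is not in the public data
mod_cap = smf.quantreg("minutes ~ C(PriorCat) + C(Hour) + officers_on_duty", data)
res_cap90 = mod_cap.fit(q=0.9)
# change in the 90th percentile of response times per additional officer
res_cap90.params['officers_on_duty']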

Simulating Group Based Trajectories (in R)

The other day I pointed out on Erwin Kalvelagen’s blog how mixture models are a solution to fit regression models with multiple lines (where which particular function/line an observation belongs to is not known in advance).

I am a big fan of simulating data when testing out different algorithms, for the simple reason that it is often difficult to know how an estimator will behave with your particular data. So there are a bunch of circumstances with mixture models (in particular here I am focusing on repeated measures, group based trajectory type mixture models) where it is hard to know upfront how they will do. Do you want to estimate group based trajectories, but have big N and small T? Or the other way, small N and big T? (Larger sample sizes tend to result in identifying more mixtures, as you might imagine (Erosheva et al., 2014).) Do you have sparse Poisson data? Or high count Poisson data? Do you have 100,000 data points, and want to know how big of data and how long it may take? These are all good reasons to do a simulation to see how the estimator behaves when you know the correct answer.

These are relevant no matter what the particular algorithm – so the points are all the same for k-medoids for example (Adepeju et al., 2021; Curman et al., 2015). Or whatever clustering algorithm you want to use in this circumstance. So here I show a few different simulations showing:

  • GBTM can recover the correct underlying equations
  • AIC/BIC fit stats have a difficult time distinguishing the correct number of groups
  • If the underlying model is a random effects instead of latent clusters, AIC/BIC performs quite well

The last part is because GBTM models have a habit of spitting out solutions, even if the true underlying data process has no discrete groups. This is what Skardhamar (2010) did in his article. It was focused on life course, but it applies equally to the spatial analysis GBTM myself and others have done as well (Curman et al., 2015; Weisburd et al., 2004; Wheeler et al., 2016). I’ve pointed out in the past that even if the fit for GBTM looks good, the underlying data can suggest a random effects model will work quite well, and Greenberg (2016) makes pretty much the same point as well.

Example in R

In the past I have shown how to use the crimCV package to fit these group based traj models, specifically zero-inflated Poisson models (Nielsen et al., 2014). Here I will show a different package, the R flexmix package (Grün & Leisch, 2007). This will be Poisson mixtures, but they have an example of fitting ZIP models in their docs if you want.

So first, I load in the flexmix library, set the seed, and generate longitudinal data for three different Poisson models. One thing to note here, mixture models don’t assign an observation 100% to an underlying mixture, but the data I simulate here is 100% in a particular group.

################################################
library("flexmix")
set.seed(10)

# Generate simulated data
n <- 200 #number of individuals
t <- 10   #number of time periods
dat <- expand.grid(t=1:t,id=1:n)

# Setting up underlying 3 models
time <- dat$t
p1 <- 3.5 - time
p2 <- 1.3 + -1*time + 0.1*time^2
p3 <- 0.15*time
p_mods <- data.frame(p1,p2,p3)

# Selecting one of these by random
# But have different underlying probs
latent <- sample(1:3, n, replace=TRUE, prob=c(0.35,0.5,0.15))
dat$lat <- expand.grid(t=1:t,lat=latent)$lat
dat$sel_mu <- p_mods[cbind(1:(n*t), dat$lat)]
dat$obs_pois <- rpois(n=n*t,lambda=exp(dat$sel_mu))
################################################

Now that is the hard part really – figuring out exactly how you want to simulate your data. Here it would be relatively simple to increase the number of people/areas or time periods. It would be more difficult to figure out the underlying polynomial functions of time.

Next we fit a 3 component mixture model, assign the highest posterior probability cluster back into the original dataset, and then see how we do.

################################################
# Now fitting flexmix model
mod3 <- flexmix(obs_pois ~ time + I(time^2) | id, 
                model = FLXMRglm(family = "poisson"),
                data = dat, k = 3)
dat$mix3 <- clusters(mod3)

# Seeing if they overlap with true labels
table(dat$lat, dat$mix3)/t
################################################

So you can see that the identified groupings are quite good. Only 4 out of the 200 individuals are misplaced in this example.

Next we can see if the underlying equations were properly recovered (you can have good separation between groups, but the polynomial fit may be garbage).

# Seeing if the estimated functions are close
rm3 <- refit(mod3)
summary(rm3)

This shows the equations are really as good as you could expect. The standard errors are as wide as they are because this isn’t really all that large a data sample for generalized linear models.

So this shows the approach works when I feed in the correct underlying equation (almost correct – I could technically submit different equations, with/without quadratic terms for example). But what about the real world situation in which you do not know the correct number of groups? Here I fit models for 1 to 8 groups, and then use the typical AIC/BIC to see which it selects:

################################################
# If I look at different groups will AIC/BIC
# pick the right one?

group <- 1:8
left_over <- group[!(group %in% 3)]
aic <- rep(-1, 8)
bic <- rep(-1, 8)
aic[3] <- AIC(mod3)
bic[3] <- BIC(mod3)

for (i in left_over){
  mod <- flexmix(obs_pois ~ time + I(time^2) | id, 
                 model = FLXMRglm(family = "poisson"),
                 data = dat, k = i)
  aic[i] <- AIC(mod)
  bic[i] <- BIC(mod)
}

fit_stats <- data.frame(group,aic,bic)
fit_stats
################################################

Here it actually fit the same model for 3 and 5 groups (sometimes even if you tell flexmix to fit 5 groups, it will only return a smaller number). You can see that the fit stats for 4 groups are almost the same as for 3. So while AIC/BIC did technically pick the right number in this simulated example, it is cutting the margin pretty close to picking 4 groups in this data instead of 3.

So the simulation Skardhamar (2010) did was slightly different than this so far. What he did was simulate data with no underlying trajectory groups, and then showed GBTM tended to spit out solutions. Here I will show that is the case as well. I simulate random intercepts and a simple linear trend over time.

################################################
# Simulate random effects model
library(lme4)
rand_eff <- rnorm(n=n,0,1.5)
dat$re <- expand.grid(t=1:t,re=rand_eff)$re
dat$mu_re <- 3 + -0.2*time + dat$re
dat$re_pois <- rpois(n=n*t,lambda=exp(dat$mu_re))

re_mod <- glmer(re_pois ~ 1 + time + (1 | id), 
                data = dat, family = poisson(link = "log"))
summary(re_mod)
################################################

So you can see that the random effects model is all fine and dandy – recovers both the fixed coefficients, as well as estimates the correct variance for the random intercepts.

So here I go and see how the AIC/BIC compares for the random effects models vs GBTM models for 1 to 8 groups (I stuff the random effects model in the first row for group 0):

################################################
# Test AIC/BIC for random effects vs GBTM
group <- 0:8
left_over <- 1:8
aic <- rep(-1, 9)
bic <- rep(-1, 9)
aic[1] <- AIC(re_mod)
bic[1] <- BIC(re_mod)

for (i in left_over){
  mod <- flexmix(re_pois ~ time + I(time^2) | id, 
                 model = FLXMRglm(family = "poisson"),
                 data = dat, k = i)
  aic[i+1] <- AIC(mod)
  bic[i+1] <- BIC(mod)
}

fit_stats <- data.frame(group,aic,bic)
fit_stats
################################################

So it ends up that flexmix will not give us any solutions with more than 2 groups. But the random effects fit is so much better (smaller by either AIC or BIC) than the GBTM fits that you wouldn’t likely make that mistake here.

I am not 100% sure how well we can rely on AIC/BIC for these different models (R does not count the individual intercepts as a degree of freedom here, so k=3 instead of k=203). But no reasonable accounting of k would flip the AIC/BIC results for these particular simulations.
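A quick way to check that claim is to just recompute the penalties by hand under a harsher accounting of k (this snippet is my own addition, treating every individual intercept as a parameter):

# Recompute AIC/BIC for the random effects model counting each of the 200
# individual intercepts as a parameter (k = 203 instead of 3)
ll <- as.numeric(logLik(re_mod))
k_big <- 203   # 2 fixed effects + 1 RE variance + 200 intercepts
aic_big <- -2*ll + 2*k_big
bic_big <- -2*ll + log(n*t)*k_big
c(aic_big, bic_big)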

One of the things I will need to experiment with more – I really like the idea of using out of sample data to validate these models instead of AIC/BIC, no different than how Nielsen et al. (2014) use leave one out CV. I am not 100% sure if that is possible in this set up with flexmix, and will need to investigate more. (You can have different types of cross validation in that context – leave entire groups out, or forecast missing data within an observed group.)
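Here is a rough sketch of what I have in mind for the leave-groups-out version (my own idea, so treat it as a starting point rather than a vetted procedure) – fit flexmix on a training subset of individuals, then score the held-out individuals by their mixture log-likelihood under the estimated components:

# Hold out entire individuals, fit on the rest, then score the held-out
# individuals by their mixture log-likelihood
train_ids <- sample(unique(dat$id), 150)
train <- dat[dat$id %in% train_ids, ]
test  <- dat[!(dat$id %in% train_ids), ]

mod_tr <- flexmix(obs_pois ~ t + I(t^2) | id,
                  model = FLXMRglm(family = "poisson"),
                  data = train, k = 3)

beta <- parameters(mod_tr)   # coefficient matrix, one column per component
pri  <- prior(mod_tr)        # estimated mixture proportions

X <- cbind(1, test$t, test$t^2)
# Poisson log-likelihood of each held-out observation under each component
ll_comp <- sapply(1:ncol(beta), function(j)
  dpois(test$obs_pois, lambda = exp(X %*% beta[, j]), log = TRUE))
# sum within held-out individuals, then mix over components on the log scale
ll_id <- rowsum(ll_comp, group = test$id)
heldout_ll <- sum(apply(ll_id, 1, function(r){
  m <- max(r + log(pri)); m + log(sum(exp(r + log(pri) - m))) }))
heldout_ll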

References

Adepeju, M., Langton, S., & Bannister, J. (2021). Anchored k-medoids: a novel adaptation of k-medoids further refined to measure long-term instability in the exposure to crime. Journal of Computational Social Science, 1-26.

Curman, A. S., Andresen, M. A., & Brantingham, P. J. (2015). Crime and place: A longitudinal examination of street segment patterns in Vancouver, BC. Journal of Quantitative Criminology, 31(1), 127-147.

Erosheva, E. A., Matsueda, R. L., & Telesca, D. (2014). Breaking bad: Two decades of life-course data analysis in criminology, developmental psychology, and beyond. Annual Review of Statistics and Its Application, 1, 301-332.

Greenberg, D. F. (2016). Criminal careers: Discrete or continuous?. Journal of Developmental and Life-Course Criminology, 2(1), 5-44.

Grün, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51(11), 5247-5252.

Nielsen, J. D., Rosenthal, J. S., Sun, Y., Day, D. M., Bevc, I., & Duchesne, T. (2014). Group-based criminal trajectory analysis using cross-validation criteria. Communications in Statistics-Theory and Methods, 43(20), 4337-4356.

Skardhamar, T. (2010). Distinguishing facts and artifacts in group-based modeling. Criminology, 48(1), 295-320.

Weisburd, D., Bushway, S., Lum, C., & Yang, S. M. (2004). Trajectories of crime at places: A longitudinal study of street segments in the city of Seattle. Criminology, 42(2), 283-322.

Wheeler, A. P., Worden, R. E., & McLean, S. J. (2016). Replicating group-based trajectory models of crime at micro-places in Albany, NY. Journal of Quantitative Criminology, 32(4), 589-612.

Using regularization to generate synthetic controls and conformal prediction for significance tests

When viewing past synthetic control results, one of the things that has struck me is that the matching of the pre-trends is really good — almost too good in many cases (it appears to be fitting to noise, although you may argue that is a feature in terms of matching exogenous shocks). For example, if you have a pre-treatment series of 10 years and a potential donor pool of 30, you could technically pick 10 donors at random, fit a linear regression predicting the 10 observations in the treated unit based on those 10 donor outcome series over the same pre period, and get perfect predictions (ignoring the typical constraints one places on the coefficients).

So how do we solve that problem? One solution is to use regularized regression results (e.g. ridge regression, lasso), when the number of predictors is greater than the number of observations. So I can cast the matching procedure into a regression problem to generate the weights. Those regression procedures are typically used for forecasting, but don’t have well defined standard errors, and so subsequently are typically only used for point forecasts. One way to make inferences though is to generate the synthetic weights (here using lasso regression), and then use conformal prediction intervals to do our hypothesis testing of counterfactual trends.

Here I walk through an example using state panel crime data in R, full code and data can be downloaded here.

A Synthetic Control Example

So first, these are the packages we need to replicate the results. conformalInference is not on CRAN yet, so use devtools to install it.

#library(devtools)
#install_github(repo="ryantibs/conformal", subdir="conformalInference")
library(conformalInference)
library(glmnet)
library(Synth)

Then I have prepped a nice state panel dataset of crime rates and counts from 1960 through 2014. I set a hypothetical treatment start year in 2005 just so I have a nice 10 years post data for illustration. That is a pretty good length pre-panel though, and a good number of potential donors.

MyDir <- "C:\\Users\\axw161530\\Desktop\\SynthIdeas"
setwd(MyDir)

TreatYear <- 2005

LongData <- read.csv("CrimeStatebyState_Edited.csv")
summary(LongData)

Next I prep my data, currently it is in long panel format, but I need it in wide format to fit the regression equations I want. I am just matching on violent crime rates here. I take out NY, as it is missing a few years of data. (This dataset also includes DC.) Then I split it up into my pre intervention and post intervention set.

#Changing the data to wide for just the violent offenses
wide <- LongData[,c('State','Year','Violent.Crime.rate')]
names(wide)[3] <- 'VCR'
wide <- reshape(wide, idvar="Year", timevar="State", direction="wide")
summary(wide)
#Take out NY because of NAs
wide <- wide[,c(1:33,35:52)]

wide_pre <- as.matrix(wide[wide$Year < TreatYear,])
wide_post <- as.matrix(wide[wide$Year >= TreatYear,])

Now onto the good stuff, we can estimate our lasso regression using the pre-data to get our weights. This constrains the coefficients to be positive and below 1, but does not constrain them to sum to 1. I just choose Alabama as an example treated unit — I intentionally chose a state and year that should not have any effects, for illustration and to check the coverage of my technique vs more traditional analyses.

You can see in my notes this is different than traditional synth in that it has an intercept as well. I was surprised, but the predictions in sample were really bad without the intercept no matter how I sliced it.

res <- glmnet(x=wide_pre[,3:51],y=wide_pre[,2],family="gaussian",
       lower.limits=0,upper.limits=1,intercept=TRUE,standardize=FALSE,
       alpha=1) #need the intercept, predictions suck otherwise

Even though this does not constrain the coefficients to sum to 1, it ends up with weights really close to that ideal anyway (the sum of the non-intercept coefficients is just over 1.01). When I use cross-validation it does not choose weights that sum to unity, but in sample the above code and cv.glmnet are really similar in terms of predictions.

co_ridge <- as.matrix(coef(res))
fin <- co_ridge[,"s99"]
active <- fin[fin > 0] #Does not include intercept

If you print active we then have our state weights (and the intercept, which is pretty tiny at -22). So I am not quite sure why eliminating the intercept was causing such problems in this example. North Carolina just sneaks in, but otherwise the synthetic control is a mix of Arkansas, California, Illinois, Kentucky, and Texas. The intercept is just a level shift, so we are still matching curves otherwise, so that does not bother me very much.

VCR.AR 0.2078156362
VCR.CA 0.1201658279
VCR.IL 0.1543015666
VCR.KY 0.2483613907
VCR.NC 0.0002896238
VCR.TX 0.2818272850

If we look at our predictions for the pre-time period, Alabama had the typical crime path, with a big rise going into the early 90’s and then a fall afterward (black line), and our in-sample predictions from the lasso regression are decent.

pre_pred <- predict(res,newx=wide_pre[,3:51],s=min(res$lambda)) #for not cv results

plot(wide_pre[,1],wide_pre[,2],type='l',xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_pre[,1],pre_pred,bg='red',pch=21) #Not too shabby
legend(1960,800,legend=c("Observed Alabama","Predicted"),col=c("black","black"), pt.bg=c("black","red"), lty=c(1,NA), pch=c(NA,21))

Now to evaluate post intervention, we are going to generate conformal prediction intervals using a jackknife approach. Basically doing all the jazz of above, but leaving one pre year out at a time, and trying to predict Alabama’s violent crime rate for that left out year. Repeat that same process for all prior years, and we can get a calculation of the standard error of our prediction. Then apply that standard error to future years, so we can tell if the observed trend is different than the counterfactual we estimated (given the counterfactual has errors). I generate both 90% prediction intervals, as well as 99% prediction intervals.

train_fun <- function(x, y, out=NULL){
  return( glmnet(x,y,alpha=1,standardize=FALSE,intercept=TRUE,nlambda=100,
                lower.limits=0,upper.limits=1,family="gaussian")
  )
}

pred_fun = function(out, newx) {
    return(predict(out, newx, s=min(out$lambda)))
}

limits_10 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
                                 train.fun=train_fun,predict.fun=pred_fun,alpha=0.10,
                                 verbose=TRUE)

limits_01 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
               train.fun=train_fun,predict.fun=pred_fun,alpha=0.01,
               verbose=TRUE)

# Counterfactual predictions for the post period (needed for the plots below)
post_pred <- predict(res,newx=wide_post[,3:51],s=min(res$lambda))

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(150,650),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
legend("topright",legend=c("Observed Albama","Predicted","90% Pred. Int.","99% Pred. Int."),cex=0.7,
       col=c("black","black","grey","grey"), pt.bg="red", lty=c(1,NA,1,1), pch=c(NA,21,NA,NA), lwd=c(1,1,1,3))

Then at the end of the above code snippet I made a plot. The black line is observed Alabama from 05-14. Red dots are the estimated counterfactual based on the pre-weights. The grey lines are then the prediction intervals (thin for 90%, thick for 99%). So we can see it is just outside the 90% intervals 3 times in the later years (when we would only expect about 1 time), but all easily within the 99% intervals.

Note these are prediction intervals, not confidence intervals. Thinking about it I honestly don’t know whether we want prediction or confidence intervals in this circumstance, but prediction will be wider.

So this approach just matches on the same outcome observations in the pre-treatment period. To match on additional covariates, you can add them in as rows into the pre-treatment dataset (although you would want to normalize the values to a similar mean and standard deviation as the pre-treated outcome series).
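For a concrete sketch of that (hypothetical – avg_logpop below is an assumed vector of pre-period average logged population, with the treated unit’s value first and the donors’ values in the same order as the columns of wide_pre[,3:51]), you rescale the covariate to the outcome’s scale and stack it on as an extra row:

# Hypothetical sketch of adding one covariate as an extra 'observation' row
scale_to <- function(x, target) (x - mean(x))/sd(x) * sd(target) + mean(target)
avg_logpop_s <- scale_to(avg_logpop, wide_pre[,2])  # put on the outcome's scale

y_aug <- c(wide_pre[,2], avg_logpop_s[1])           # treated outcomes + covariate
X_aug <- rbind(wide_pre[,3:51], avg_logpop_s[-1])   # donor outcomes + donor covariate
res_aug <- glmnet(x=X_aug, y=y_aug, family="gaussian",
                  lower.limits=0, upper.limits=1, intercept=TRUE,
                  standardize=FALSE, alpha=1)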

You may also add in other covariates, like functions of time (although this changes the nature of the identification). So for example say you incorporate a linear and quadratic trend in time, and lasso only chooses those two time factors and no control areas. You are doing something more akin to interrupted time series analysis at that point (the counterfactual is simply based on your estimate of the pre-trend). Which I think is OK sometimes, but is quite different than using control areas to hopefully capture random shocks.

Comparing to Traditional Synth results

To see whether my error intervals are similar to the placebo approach, I used the old school synth R package. It isn’t 100% comparable, as it makes you match on at least one covariate, so here I choose to also match on the average logged population over the pre-treatment period.

#NY is missing years
LongData_MinNY <- LongData[as.character(LongData$State) != "NY",c("State","Year","Violent.Crime.rate","Population")]
LongData_MinNY$StateNum <- as.numeric(LongData_MinNY$State)
LongData_MinNY$State <- as.character(LongData_MinNY$State)
LongData_MinNY$LogPop <- log(LongData_MinNY$Population)    

state_nums <- unique(LongData_MinNY$StateNum)
    
dataprep.out <- dataprep(foo = LongData_MinNY,
                         dependent = "Violent.Crime.rate",
                         predictors = c("LogPop"),
                         unit.variable = "StateNum",
                         unit.names.variable = "State",
                         time.variable = "Year",
                         treatment.identifier = 2,
                         controls.identifier = state_nums[!state_nums %in% 2],
                         time.optimize.ssr = 1960:(TreatYear-1),
                         time.predictors.prior = 1960:(TreatYear-1),
                         time.plot = 1960:2014
                         )

synth_res <- synth(dataprep.out)
synth_tables <- synth.tab(dataprep.res = dataprep.out, synth.res = synth_res)
synth_tables$tab.w #a bunch of little weights across the board
path.plot(synth.res = synth_res, dataprep.res = dataprep.out, tr.intake=TreatYear,Xlab='',Ylab='Violent Crime Rate per 100,000',
      Legend=c("Alabama","Synthetic Control"), Legend.position=c("topleft"))

Looking at the weights, it is a bunch of little ones for many different states. Looking at the plot, it doesn’t appear to be any better fit than the lasso approach.

And then I just do the typical approach and use placebo checks to do inference. I loop over my 49 placebos (-1 state for NY, but +1 state because this list includes DC).

#Dataframes to stuff the placebos check results into
Predicted <- data.frame(dataprep.out$Y0plot %*% synth_res$solution.w)
names(Predicted) <- "TreatPred"

Pred_MinTreat <- data.frame(TreatPred = Predicted$TreatPred - LongData_MinNY[LongData_MinNY$StateNum == 2,"Violent.Crime.rate"])

#Now I just need to loop over the other states and collect their results for the placebo tests

placebos <- state_nums[!state_nums %in% 2]
for (i in placebos){
  dataprep.plac <- dataprep(foo = LongData_MinNY,
                           dependent = "Violent.Crime.rate",
                           predictors = c("LogPop"),
                           unit.variable = "StateNum",
                           unit.names.variable = "State",
                           time.variable = "Year",
                           treatment.identifier = i,
                           controls.identifier = state_nums[!state_nums %in% i],
                           time.optimize.ssr = 1960:(TreatYear-1),
                           time.predictors.prior = 1960:(TreatYear-1),
                           time.plot = 1960:2014
  )
  synth_resP <- synth(dataprep.plac)
  synth_tablesP <- synth.tab(dataprep.res = dataprep.plac, synth.res = synth_resP)
  nm <- paste0("S.",i)
  Predicted[,nm] <- dataprep.plac$Y0plot %*% synth_resP$solution.w
  Pred_MinTreat[,nm] <- Predicted[,nm] - LongData_MinNY[LongData_MinNY$StateNum == i,"Violent.Crime.rate"]
}

If you look at the synth estimates for Alabama (grey circles), they are almost exactly the same as the lasso predictions (red circles), even though the weights are very different.

PredRecent <- Predicted[1960:2014 >= TreatYear,]
DiffRecent <- Pred_MinTreat[1960:2014 >= TreatYear,]

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(100,700),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
points(wide_post[,1],PredRecent$TreatPred,bg='grey',pch=21)
legend("topright",legend=c("Observed Albama","Lasso Pred.","90% Pred. Int.","99% Pred. Int.","Synth Pred."),cex=0.6,
       col=c("black","black","grey","grey"), pt.bg=c(NA,"red",NA,NA,"grey"), lty=c(1,NA,1,1,NA), pch=c(NA,21,NA,NA,21), lwd=c(1,1,1,3,1))

But when we look at variation in our placebo results (thin, purple lines), they are much wider than our conformal prediction intervals.

plot(wide_post[,1],wide_post[,2]-post_pred,type='l',ylim=c(-500,500),xlab='',ylab='Observed - Predicted (Violent Crime Rates)')
points(wide_post[,1],post_pred-post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_01$lo-post_pred,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up-post_pred,col='grey',lwd=3)

for (i in 2:ncol(PredRecent)){
  lines(wide_post[,1],DiffRecent[,i],col='#9400D340',lwd=0.5)
}

legend(x=2005.5,y=-700,legend=c("Observed Alabama","Lasso Pred.","99% Pred. Int.","Placebos"),
       col=c("black","black","grey",'#9400D3'), pt.bg=c(NA,"red",NA,NA), lty=c(1,NA,1,1), 
       pch=c(NA,21,NA,NA), lwd=c(1,1,3,0.5), xpd=TRUE, horiz=TRUE, cex = 0.45)

So I was hoping they would be the same (conformal would cover the placebo at the expected rate), but alas they are not. So I’m not sure if my conformal intervals are too small, or the placebo checks are extra noisy. I can’t prove it, but I suspect the placebo checks are somewhat noisy, mainly because there will always be some intervention that is idiosyncratic to specific donors over long periods of time that makes them no longer good counterfactuals. This seems especially true if you consider predictions further out from the treatment year. Although I find the logic of the placebo checks pretty convincing, so I am somewhat torn.

Since we have 49 donors in this example, the two-tailed p-value for being outside all the placebos would be 2/(49+1)=0.04. Here we would need an intervention that changed violent crime rates by plus or minus 400 per 100,000 – pretty much an impossible standard given a baseline of only around 400 crimes per 100,000 as of 2004. The 99% conformal intervals are still pretty wide, with an increase/decrease of about 150 violent crimes per 100,000 needed to be a significant change. The two lines way outside 400 happen to be Alaska and Wyoming, not DC, so maybe tiny population states result in higher volatility. But besides them there are a bunch of placebo states around plus/minus 300 as well.

So caveat emptor if you want to use this idea in your own work – I don’t know if my suggestion is good or bad. Here it suggests it is more diagnostic (smaller intervals) than the placebo checks, and it isn’t limited by the number of potential donors in setting the alpha level for your tests (e.g. if you only have 10 potential donors your placebo checks are only 90% intervals).

Since this is just one example, there are a few things I would need to know before recommending it more generally. One is that it may not work with smaller pre time series and/or a smaller donor pool. (Not sure of any better way of checking than via a ton of different simulations.)

More general notes

Doing some more lit review while preparing this post, I appear to be like 15th in line to suggest this approach (so don’t take it as novel). In terms of using the lasso to estimate the synth weights, it seems Susan Athey and colleagues proposed something similar, in addition to using other machine learning techniques. Also see Amjad et al. (2018) in the Journal of Machine Learning Research, and this workshop by Alex Hollingsworth and Coady Wing. I am not even the first one to think to use conformal prediction intervals apparently, see this working paper (Chernozhukov, Wuthrich, and Zhu, 2019) posted just a few weeks prior.

There is another R package, gsynth, that appears to solve the problem of p > n via a variable reduction technique (Xu, 2017). Xu also discusses how incorporating more information is really making different identification assumptions. So again just getting good predictions/minimizing the in-sample mean square error is not necessarily the right approach to get correct causal inferences.

Just a blog post, so again I can’t say offhand if this is an improvement over other work. It is just illustrative that the bounds for the conformal prediction may be smaller than the typical permutation based approach. Casting it as a regression problem is something I intuitively grok more, and I think it opens up more possibilities. For example, you may want to use binomial logistic models instead of linear for the fitting process (to take into account the higher volatility of smaller population states).


Some more testing coefficient contrasts: Multinomial models and indirect effects

Testing the equality of two coefficients is one of my more popular posts. This is a good thing — often the more interesting hypotheses test two parameters against each other, as opposed to a strict null hypothesis of a coefficient against zero. Every now and then I get questions about applying this idea to new situations in which it is not always straightforward to figure out how. So here are a few examples using demonstration R code.

Multinomial Models

One question I received about applying the advice was to test coefficients across different contrasts in multinomial models. It may not seem obvious, but the general approach of extracting out the coefficients and the covariance between those estimates works the same way as most regression equations.

So in a quick example in R:

library(nnet)
data(mtcars)
library(car)

mtcars$cyl <- as.factor(mtcars$cyl)  
mtcars$am <- as.factor(mtcars$am)  
mod <- multinom(cyl ~ am + hp, data=mtcars, Hess=TRUE)
summary(mod)

And the estimates for mod are:

> summary(mod)
Call:
multinom(formula = cyl ~ am + hp, data = mtcars)

Coefficients:
  (Intercept)       am1        hp
6   -42.03847  -3.77398 0.4147498
8   -92.30944 -26.27554 0.7836576

Std. Errors:
  (Intercept)       am1        hp
6    27.77917  3.256003 0.2747842
8    31.93525 46.854100 0.2559052

Residual Deviance: 7.702737 
AIC: 19.70274 

So say we want to test whether the hp effect is the same for 6 cylinders vs 8 cylinders. To test that, we just grab the covariance and construct our test:

#Example constructing test by hand
v <- vcov(mod)
c <- coef(mod)
dif <- c[1,3] - c[2,3]
se <- sqrt( v[3,3] + v[6,6] - 2*v[3,6])
z <- dif/se
#test stat, standard error, and two-tailed p-value
dif;se;2*(1 - pnorm(abs(z)))

We end up with a p-value of 0.0002505233, so we would reject the null that these two effects are equal to one another. Note that to get the variance-covariance estimates for the parameters you need to set Hess=TRUE in the multinom call.

An easier way though is to use the car library’s linearHypothesis function to conduct the same test:

> linearHypothesis(mod,c("6:hp = 8:hp"),test="Chisq")
Linear hypothesis test

Hypothesis:
6:hp - 8:hp = 0

Model 1: restricted model
Model 2: cyl ~ am + hp

  Df  Chisq Pr(>Chisq)    
1                         
2  1 13.408  0.0002505 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see that although this is in terms of a Chi-square test, it results in the same p-value. The Wald test however can be extended to testing multiple coefficient equalities at once, and a popular one for multinomial models is to test whether any coefficients change across the different levels of the dependent categories. The idea behind that test is to see whether you can collapse a category with another that is equivalent.

To do that test, I created a function that does all of the contrasts at once:

#Creating function to return tests for all coefficient equalities at once
all_tests <- function(model){
  v <- colnames(coef(model))
  d <- rownames(coef(model))
  allpairs <- combn(d,2,simplify=FALSE)
  totn <- length(allpairs) + length(d)
  results <- data.frame(ord=1:totn)
  results$contrast <- ""
  results$test <- ""
  results$Df <- NA
  results$Chisq <- NA
  results$pvalue <- NA
  iter <- 0
  for (i in allpairs){
    iter <- iter + 1
    l <- paste0(i[1],":",v)
    r <- paste0(i[2],":",v)
    test <- paste0(l," = ",r)
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i[1]," vs ",i[2])
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]    
  }
  ref <- model$lab[!(model$lab %in% d)]
  for (i in d){
    iter <- iter + 1
    test <- paste0(i,":",v," = 0")
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i," vs ",ref)
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]  
  }
  return(results)
}

Not only does this construct the test of the observed categories, but also tests whether each set of coefficients is simultaneously zero, which is the appropriate contrast for the referent category.

> all_tests(mod)
  ord contrast                                                            test Df        Chisq       pvalue
1   1   6 vs 8 6:(Intercept) = 8:(Intercept) and 6:am1 = 8:am1 and 6:hp = 8:hp  3    17.533511 0.0005488491
2   2   6 vs 4                    6:(Intercept) = 0 and 6:am1 = 0 and 6:hp = 0  3     5.941417 0.1144954481
3   3   8 vs 4                    8:(Intercept) = 0 and 8:am1 = 0 and 8:hp = 0  3 44080.662112 0.0000000000

User beware of multiple testing with this, as I am not sure of the appropriate post-hoc correction when examining so many hypotheses. This example with just three is obviously not a big deal, but with more categories you get n choose 2, or (n*(n-1))/2, total contrasts.
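One simple option (a sketch of a conservative default, not something I have vetted for this specific setting) is to just adjust the p-values the function returns:

# Holm/Bonferroni adjustments on the p-values returned by all_tests()
res_tests <- all_tests(mod)
res_tests$p_holm <- p.adjust(res_tests$pvalue, method = "holm")
res_tests$p_bonf <- p.adjust(res_tests$pvalue, method = "bonferroni")
res_tests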

Testing the equality of multiple indirect effects

Another example I was asked about recently was whether you could use the same procedure to test the equality of indirect effects (popular in moderation and mediation analysis). Those end up being a bit more tricky: to define the variance and covariance between the indirect effects we are not just adding and subtracting the original parameters, but multiplying them.

Thus to estimate the standard errors and covariances of indirect effects folks often use the delta method. In R using the lavaan library, here is an example (taken from a code snippet Yves Rosseel posted himself) to estimate the variance-covariance matrix of the model-defined indirect parameters.

#function taken from post in
#https://groups.google.com/forum/#!topic/lavaan/skgZRyzqtYM
library(lavaan)
vcov.def <- function(model){
  m <- model
  orig <- vcov(m)
  free <- m@Fit@x
  jac <- lavaan:::lavJacobianD(func = m@Model@def.function, x = free)
  vcov_def <- jac %*% orig %*% t(jac)
  estNames <- subset(parameterEstimates(m),op==":=")
  row.names(vcov_def) <- estNames$lhs
  colnames(vcov_def) <- estNames$lhs
  #I want to print the covariance table estimates to make sure the
  #labels are in the correct order
  estNames$se2 <- sqrt(diag(vcov_def))
  estNames$difSE <- estNames$se - estNames$se2
  print(estNames[,c('lhs','se','se2','difSE')])
  print('If difSE is not zero, labels are not in right order')
  return(vcov_def)
}
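For a single product of two estimates, the Jacobian calculation above reduces to the familiar first-order delta method approximation (stated here just for reference):

Var(a*d) ≈ d^2*Var(a) + a^2*Var(d) + 2*a*d*Cov(a,d)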

Now here is an example of testing individual parameter estimates for indirect effects.

set.seed(10)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
M <- 0.5*X1 + 0.4*X2 + 0.3*X3 + rnorm(n)
Y <- 0.1*X1 + 0.2*X2 + 0.3*X3 + 0.7*M + rnorm(n)
Data <- data.frame(X1 = X1, X2 = X2, X3 = X3, Y = Y, M = M)
model <- ' # direct effect
             Y ~ X1 + X2 + X3 + d*M
           # mediator
             M ~ a*X1 + b*X2 + c*X3
           # indirect effects
             ad := a*d
             bd := b*d
             cd := c*d
         '
model_SP.fit <- sem(model, data = Data)
summary(model_SP.fit)

#now apply to your own sem model
defCov <- vcov.def(model_SP.fit)

Unfortunately, as far as I know, the linearHypothesis function does not work for lavaan objects, so if we want to test whether the indirect effects ad = bd we need to construct the test by hand. But with the vcov.def function we have the covariance terms we need.

#testing hypothesis that "ad = bd"
#so doing "ad - bd = 0"
model_SP.param <- parameterEstimates(model_SP.fit)
model_SP.defined <- subset(model_SP.param, op==":=")
dif <- model_SP.defined$est[1] - model_SP.defined$est[2]
var_dif <- defCov[1,1] + defCov[2,2] - 2*defCov[1,2]
#so the test standard error of the difference is 
se_dif <- sqrt(var_dif)
#and the test statistic is
tstat <- dif/se_dif 
#two tailed p-value
dif;se_dif;2*(1 - pnorm(abs(tstat)))

To test whether all three indirect parameters are equal to each other at once, one way is to estimate a restricted model, and then use a likelihood ratio test of the restricted vs the full model. It is pretty easy in lavaan to create coefficient restrictions, just set what was varying to only be one parameter:

restrict_model <- ' # direct effect
                      Y ~ X1 + X2 + X3 + d*M
                    # mediator
                      M ~ a*X1 + a*X2 + a*X3
                    # indirect effects
                      ad := a*d
                  '

model_SP.restrict <- sem(restrict_model, data = Data)
lavTestLRT(model_SP.fit, model_SP.restrict)

If folks know of an easier way to do the Wald tests via lavaan models let me know, I would be interested!


Pooling multiple outcomes into one regression equation

Something that came up for many of my students this last semester in my Seminar in Research class was an interest in multiple outcomes. The most common one is examining different types of delinquency for juveniles (often via surveys), but it comes up in quite a few other designs as well (e.g. different crime outcomes for spatial research, different measures of perceptions towards police, different measures of fear of crime, etc.).

Most of the time students default to estimating separate equations for each of these outcomes, but in most circumstances I was telling the students they should pool these outcomes into one model. I think that is the better default for the majority of situations. So say we have a situation with two outcomes, violent crimes and property crimes, and we have one independent variable we are interested in, say whether an individual was subjected to a particular treatment. We might then estimate two separate equations:

E[# Violent Crimes]  = B0v + B1v*(Treatment) 
    
E[# Property Crimes] = B0p + B1p*(Treatment)

Saying that we should think about pooling by default is basically saying that B1v is going to be close to equal to B1p in the two equations. Pooling the models together both lets us test that assertion, as well as get a better estimate of the overall treatment effect. So to pool the models we would stack the outcomes together, and then estimate something like:

E[# Crimes (by type)] = B0 + B1*(Treatment) + B2*(Outcome = Violent) + B3*(Treatment*Outcome = Violent)

Here the B3 coefficient tests whether the treatment effect is different for the violent crime outcome as opposed to the property crime, and the dummy variable B2 effect controls for any differences in the levels of the two overall (that is, you would expect violent incidents to be less common than property crime incidents).
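As a concrete sketch in Stata (the variable names here are made up for illustration – crimes is the stacked count outcome, treat a 0/1 treatment indicator, violent a dummy for the violent-crime rows, and pid the person identifier), that pooled equation maps onto something like:

*Hypothetical sketch of the pooled model on stacked data, two rows per person
*clustering on pid accounts for the repeated rows within person
regress crimes i.treat##i.violent, vce(cluster pid)
*the i.treat#i.violent coefficient is the B3 test of differential effects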

Because you will have multiple measures per individual, you can correct for that (by clustering the standard errors). But in the case you have many outcomes you might also want to consider a multi-level model, and actually estimate random effects for individuals and outcomes. So say instead of just violent and property crimes, you had a survey listing 20 different types of delinquency. In that case you might want to estimate a model that looks like:

Prob(Delinquency_ij) = f[B0 + B1*(Treatment_j) + d_j + g_i]

Where one is estimating a multi-level logistic regression equation for delinquency type i within individual j, and the g_i and d_j are the random effects for delinquency types and individuals respectively. In the case you do not have many outcomes (say only 10), the random effect distribution might be hard to estimate. In that case I would just use fixed effects for the outcome dummy variables. But I can imagine the random effects for persons are of interest in many different study designs. And this way you get one model — instead of having to interpret 20+ models.
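Here is a sketch of how that multi-level model is often set up in Stata (again hypothetical variable names – delinq is the stacked 0/1 item response, treat the treatment indicator, item the delinquency-type identifier, and pid the person; I did not run this, so treat the syntax as a starting point):

*Hypothetical sketch: crossed random intercepts for items and persons
melogit delinq i.treat || _all: R.item || pid:
*with only a handful of items, swap the item random effects for fixed effects:
*melogit delinq i.treat i.item || pid: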

Also you can still estimate differential treatment effects across the different items if you want to, such as by looking at the interaction of the outcome types and the treatment. But in most cases in criminology I have come across treatments are general. That is, we would expect them to decrease/increase all crime types, not just some specific violent or property crime types. So to default pooling the treatment effect estimate makes sense.


To go a bit farther — juvenile delinquency is not my bag, but offhand I don’t understand why those who examine surveys of delinquency items don’t use that multi-level model more often. Often times people aggregate the measures altogether into one overall scale, such as saying someone checked yes to 2 out of 10 violent crime outcomes, and checked yes to 5 out of 10 property crime outcomes. Analyzing those aggregated outcomes is another type of pooling, but one I don’t think is appropriate, mainly because it ignores the overall prevalence of the different items. For example, you might have an item such as "steal a car", and another that is "steal a candy bar". The former is much more serious and subsequently less likely to occur. Going with my prior examples, pooling items together like this forces the random effects for the individual delinquency types, g_i, to all equal zero. Just looking at the data one can obviously tell that is not a good assumption.

Here I will provide an example via simulation to demonstrate this in Stata. First I generate an example dataset that has 1,000 individuals and 20 yes/no outcomes. The way the data are simulated is that each individual has a specific amount of self_control that decreases the probability of an outcome (with a coefficient of -0.5), and they are nested within a particular group (imagine a different school) that affects whether the outcome occurs or not. In addition to this, each individual has a random intercept (drawn from a normal distribution), and each question has a different fixed prevalence.

*Stata simulation
clear
set more off
set seed 10
set obs 1000
generate caseid = _n
generate group = ceil(caseid/100) 
generate self_control = rnormal(0,1)
generate rand_int = rnormal(0,1)

*generating 20 outcomes that just have a varying intercept for each
forval i = 1/20 { 
  generate logit_`i' = -0.4 -0.5*self_control -0.1*group + 0.1*(`i'-10) + rand_int
  generate prob_`i' = 1/(1 + exp(-1*logit_`i'))
  generate outcome_`i' = rbinomial(1,prob_`i')
}
summarize prob_*
drop logit_* prob_* rand_int

And here is that final output:

. summarize prob_*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      prob_1 |      1,000    .1744795    .1516094   .0031003   .9194385
      prob_2 |      1,000    .1868849     .157952   .0034252   .9265418
      prob_3 |      1,000     .199886    .1642414   .0037841   .9330643
      prob_4 |      1,000      .21348    .1704442   .0041804   .9390459
      prob_5 |      1,000    .2276601    .1765258    .004618   .9445246
-------------+---------------------------------------------------------
      prob_6 |      1,000     .242416    .1824513   .0051012   .9495374
      prob_7 |      1,000    .2577337    .1881855   .0056347   .9541193
      prob_8 |      1,000    .2735951    .1936933   .0062236   .9583033
      prob_9 |      1,000     .289978    .1989401   .0068736    .962121
     prob_10 |      1,000    .3068564    .2038919    .007591   .9656016
-------------+---------------------------------------------------------
     prob_11 |      1,000    .3242004    .2085164   .0083827   .9687729
     prob_12 |      1,000    .3419763    .2127823   .0092562   .9716603
     prob_13 |      1,000    .3601469    .2166605   .0102197   .9742879
     prob_14 |      1,000    .3786715    .2201237   .0112824   .9766776
     prob_15 |      1,000    .3975066    .2231474   .0124542   .9788501
-------------+---------------------------------------------------------
     prob_16 |      1,000    .4166057    .2257093    .013746   .9808242
     prob_17 |      1,000    .4359203    .2277906   .0151697   .9826173
     prob_18 |      1,000       .4554    .2293751   .0167384   .9842454
     prob_19 |      1,000     .474993    .2304504   .0184663   .9857233
     prob_20 |      1,000    .4946465    .2310073   .0203689   .9870643

You can see from this list that each prob* variable then has a different overall prevalence, from around 17% for prob_1, climbing to around 50% for prob_20.

Now if you wanted to pool the items into one overall delinquency scale, you might estimate a binomial regression model (note this is not a negative binomial model!) like below (see Britt et al., 2017 for discussion).

*first I will show the binomial model in Britt
egen delin_total = rowtotal(outcome_*)
*Model 1
glm delin_total self_control i.group, family(binomial 20) link(logit)

Which shows the following results (note that the effect of self-control is too small – it should be around -0.5):

. glm delin_total self_control i.group, family(binomial 20) link(logit)

Iteration 0:   log likelihood =  -3536.491  
Iteration 1:   log likelihood = -3502.3107  
Iteration 2:   log likelihood = -3502.2502  
Iteration 3:   log likelihood = -3502.2502  

Generalized linear models                         No. of obs      =      1,000
Optimization     : ML                             Residual df     =        989
                                                  Scale parameter =          1
Deviance         =  4072.410767                   (1/df) Deviance =   4.117706
Pearson          =  3825.491931                   (1/df) Pearson  =    3.86804

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

                                                  AIC             =     7.0265
Log likelihood   = -3502.250161                   BIC             =  -2759.359

------------------------------------------------------------------------------
             |                 OIM
 delin_total |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

One of the things I wish the Britt paper mentioned was that the above binomial model is equivalent to a logistic regression model on the individual outcomes, just one that forces the predictions for each item to be the same within a person. So if you reshape the data from wide to long you can estimate that same binomial model as a logistic regression on the 0/1 outcomes.

*reshape wide to long
reshape long outcome_, i(caseid) j(question)
*see each person now has 20 questions each
*tab caseid

*regression model with the individual level data, should be equivalent to the aggregate binomial model
*Model 2
glm outcome_ self_control i.group, family(binomial) link(logit)

And here are the results:

. glm outcome_ self_control i.group, family(binomial) link(logit)

Iteration 0:   log likelihood = -12204.638  
Iteration 1:   log likelihood = -12188.762  
Iteration 2:   log likelihood = -12188.755  
Iteration 3:   log likelihood = -12188.755  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,989
                                                  Scale parameter =          1
Deviance         =  24377.50934                   (1/df) Deviance =   1.219546
Pearson          =  19949.19243                   (1/df) Pearson  =   .9980085

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.219975
Log likelihood   = -12188.75467                   BIC             =  -173583.3

------------------------------------------------------------------------------
             |                 OIM
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

So you can see that Model 1 and Model 2 are exactly the same (in terms of estimates for the regression coefficients).

Model 2 though should show the limitations of using the binomial model — it predicts the same probability for each delinquency item, even though prob_1 is less likely to occur than prob_20. So for example, if we generate the predictions of this model, we can see that each question has the same predicted value.

predict prob_mod2, mu
sort question
by question: summarize outcome_ prob_mod2

And here are the results for the first four questions:

.     by question: summarize outcome_ prob_mod2

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

By construction, the binomial model on the aggregated totals is a bad fit to the data. It predicts that each question should have a probability of around 32% of occurring. Although you can’t fit the zero-inflated model discussed by Britt via the individual level logit approach (that I am aware of), that approach has the same limitation as the generic binomial model approach. Modeling the individual items just makes more sense when you have the individual items. It is hard to think of examples where such a restriction would be reasonable for delinquency items.

So here a simple update is to include a dummy variable for each item. I also cluster the standard errors by caseid, since the items are nested within individuals.

*Model 3
glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

And here are the results:

.     glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

Iteration 0:   log pseudolikelihood = -11748.056  
Iteration 1:   log pseudolikelihood = -11740.418  
Iteration 2:   log pseudolikelihood = -11740.417  
Iteration 3:   log pseudolikelihood = -11740.417  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,970
                                                  Scale parameter =          1
Deviance         =  23480.83406                   (1/df) Deviance =   1.175805
Pearson          =  19949.15609                   (1/df) Pearson  =   .9989562

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.177042
Log pseudolikelihood = -11740.41703               BIC             =  -174291.8

                             (Std. Err. adjusted for 1,000 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3858319   .0334536   -11.53   0.000    -.4513996   -.3202641
             |
       group |
          2  |  -.0620222   .1350231    -0.46   0.646    -.3266626    .2026182
          3  |   -.049852   .1340801    -0.37   0.710    -.3126442    .2129403
          4  |   .0549271   .1383412     0.40   0.691    -.2162167    .3260709
          5  |  -.1329942   .1374758    -0.97   0.333    -.4024419    .1364535
          6  |  -.4103578   .1401212    -2.93   0.003    -.6849904   -.1357253
          7  |  -.3145033   .1452201    -2.17   0.030    -.5991296   -.0298771
          8  |  -.2803599   .1367913    -2.05   0.040    -.5484659    -.012254
          9  |  -.4543686   .1431314    -3.17   0.002    -.7349011   -.1738362
         10  |  -.5962359   .1457941    -4.09   0.000    -.8819872   -.3104847
             |
    question |
          2  |   .1453902   .1074383     1.35   0.176    -.0651851    .3559654
          3  |   .1643203   .1094113     1.50   0.133     -.050122    .3787625
          4  |   .1262597   .1077915     1.17   0.241    -.0850078    .3375272
          5  |   .1830563    .105033     1.74   0.081    -.0228047    .3889173
          6  |   .3609468   .1051123     3.43   0.001     .1549304    .5669633
          7  |    .524749    .100128     5.24   0.000     .3285017    .7209963
          8  |   .5768412   .1000354     5.77   0.000     .3807754     .772907
          9  |   .7318797   .1021592     7.16   0.000     .5316513    .9321081
         10  |    .571682   .1028169     5.56   0.000     .3701646    .7731994
         11  |    .874362   .0998021     8.76   0.000     .6787535     1.06997
         12  |   .8928982   .0998285     8.94   0.000     .6972379    1.088559
         13  |   .8882734   .1023888     8.68   0.000      .687595    1.088952
         14  |   .9887095   .0989047    10.00   0.000     .7948599    1.182559
         15  |   1.165517   .0977542    11.92   0.000     .9739222    1.357111
         16  |   1.230355   .0981687    12.53   0.000     1.037948    1.422762
         17  |   1.260403   .0977022    12.90   0.000      1.06891    1.451896
         18  |   1.286065    .098823    13.01   0.000     1.092376    1.479755
         19  |   1.388013   .0987902    14.05   0.000     1.194388    1.581638
         20  |   1.623689   .0999775    16.24   0.000     1.427737    1.819642
             |
       _cons |  -1.336376   .1231097   -10.86   0.000    -1.577666   -1.095085
------------------------------------------------------------------------------

You can now see that the predicted values for each individual item are much more reasonable. In fact they are a near perfect fit.

predict prob_mod3, mu
by question: summarize outcome_ prob_mod2 prob_mod3

And the results:

.     by question: summarize outcome_ prob_mod2 prob_mod3

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .183    .0672242   .0475809   .4785903

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .205    .0729937   .0546202   .5149203

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .208    .0737455    .055606   .5196471

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .202    .0722336   .0536408   .5101407

If you want, you can also test whether a "treatment" effect (or here the level of a person's self control) has differential effects across the different delinquency items.

*Model 4
glm outcome_ self_control i.group i.question c.self_control#i.question, family(binomial) link(logit) cluster(caseid)
*can do a test of all the interactions equal to zero at once
testparm c.self_control#i.question

I’ve omitted this output, but here of course the effect of self control is simulated to be the same across the different items, so one fails to reject the null that the interaction terms are all zero.

Given the way I simulated the data, the actual correct model is a random effects one. You should notice that in each of the prior models the effect of self control is too small. One way to estimate that model in Stata is below:

*Model 5
melogit outcome_ self_control i.group i.question || caseid:

And here are the results:

. melogit outcome_ self_control i.group i.question || caseid:

Fitting fixed-effects model:

Iteration 0:   log likelihood = -11748.056  
Iteration 1:   log likelihood = -11740.418  
Iteration 2:   log likelihood = -11740.417  
Iteration 3:   log likelihood = -11740.417  

Refining starting values:

Grid node 0:   log likelihood =  -10870.54

Fitting full model:

Iteration 0:   log likelihood =  -10870.54  
Iteration 1:   log likelihood = -10846.176  
Iteration 2:   log likelihood = -10845.969  
Iteration 3:   log likelihood = -10845.969  

Mixed-effects logistic regression               Number of obs     =     20,000
Group variable:          caseid                 Number of groups  =      1,000

                                                Obs per group:
                                                              min =         20
                                                              avg =       20.0
                                                              max =         20

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(29)     =    1155.07
Log likelihood = -10845.969                     Prob > chi2       =     0.0000
------------------------------------------------------------------------------
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.4744173   .0372779   -12.73   0.000    -.5474807    -.401354
             |
       group |
          2  |  -.0648018   .1642767    -0.39   0.693    -.3867783    .2571746
          3  |  -.0740465   .1647471    -0.45   0.653    -.3969449    .2488519
          4  |    .036207   .1646275     0.22   0.826     -.286457     .358871
          5  |  -.1305605   .1645812    -0.79   0.428    -.4531337    .1920126
          6  |  -.5072909   .1671902    -3.03   0.002    -.8349776   -.1796042
          7  |  -.3732567    .165486    -2.26   0.024    -.6976032   -.0489102
          8  |  -.3495889   .1657804    -2.11   0.035    -.6745126   -.0246653
          9  |  -.5593725   .1675276    -3.34   0.001    -.8877205   -.2310245
         10  |  -.7329717   .1673639    -4.38   0.000    -1.060999   -.4049445
             |
    question |
          2  |   .1690546   .1240697     1.36   0.173    -.0741177    .4122268
          3  |    .191157   .1237894     1.54   0.123    -.0514657    .4337797
          4  |   .1467393   .1243586     1.18   0.238    -.0969991    .3904776
          5  |   .2130531   .1235171     1.72   0.085     -.029036    .4551422
          6  |   .4219282   .1211838     3.48   0.000     .1844123    .6594441
          7  |   .6157484   .1194133     5.16   0.000     .3817027    .8497941
          8  |   .6776651   .1189213     5.70   0.000     .4445837    .9107465
          9  |   .8626735   .1176486     7.33   0.000     .6320865    1.093261
         10  |   .6715272   .1189685     5.64   0.000     .4383532    .9047012
         11  |   1.033571   .1167196     8.86   0.000     .8048051    1.262338
         12  |    1.05586    .116615     9.05   0.000     .8272985    1.284421
         13  |   1.050297   .1166407     9.00   0.000     .8216858    1.278909
         14  |   1.171248   .1161319    10.09   0.000     .9436331    1.398862
         15  |   1.384883   .1154872    11.99   0.000     1.158532    1.611234
         16  |   1.463414   .1153286    12.69   0.000     1.237375    1.689454
         17  |   1.499836   .1152689    13.01   0.000     1.273913    1.725759
         18  |   1.530954   .1152248    13.29   0.000     1.305117     1.75679
         19  |   1.654674   .1151121    14.37   0.000     1.429058    1.880289
         20  |   1.941035   .1152276    16.85   0.000     1.715193    2.166877
             |
       _cons |  -1.591796   .1459216   -10.91   0.000    -1.877797   -1.305795
-------------+----------------------------------------------------------------
caseid       |
   var(_cons)|   1.052621   .0676116                      .9281064     1.19384
------------------------------------------------------------------------------
LR test vs. logistic model: chibar2(01) = 1788.90     Prob >= chibar2 = 0.0000

This model comes the closest to recovering the correct effect of self control (-0.5). The estimate is still a bit small (at -0.47), but it is within one standard error of the true value. (Another way to estimate this model is to use xtlogit, but with melogit you can actually extract the random effects. That will have to wait until another blog post though.)

Another way to think about this model is in relation to item-response theory, where individuals have a latent estimate of how smart they are, and questions have a latent easiness/hardness. In Stata you might fit that with the code below, but a warning: it takes a while to converge. (I am not sure why; the fixed effects for the questions are fairly symmetric, so assuming a random effect distribution should not be too far off. If you have any thoughts as to why, let me know!)

*Model 6
melogit outcome_ self_control i.group || caseid: || question:

For an academic reference to this approach see Osgood et al. (2002). Long story short, model the individual items, but pool them together in one model!

Don’t include temporal lags of crime in cross-sectional crime models

In my 311 and crime paper a reviewer requested I conduct cross-lagged models. That is, predict crime in 2011 while controlling for prior counts of crime in 2010, in addition to the other specific variables of interest (here 311 calls for service). In the supplementary material I detail why this is difficult with Poisson models, as the endogenous effect will often be explosive in Poisson models, something that does not happen as often in linear models.

There is a second problem with cross-lagged models that I do not discuss there, and it has to do with how what I believe is a reasonable data generating process for crime at places can cause cross-lagged models to be biased. This is based on the fact that crime at places tends to be very temporally stable (see David Weisburd’s, or Martin Andresen’s, or my work showing that). So when you incorporate temporal lags of crime in models, this makes the other variables of interest (311 calls, alcohol outlets, other demographics, whatever) biased, because they cause crime in the prior time period. This is equivalent to controlling for an intermediate outcome. For examples of this see some of the prior work on the relationship between crime and disorder by Boggess and Maskaly (2014) or O’Brien and Sampson (2015).1

So for Boggess and Maskaly (BM) and O’Brien and Sampson (OS), their simplified cross-lagged model is:

(1) Crime_post = B0*Crime_pre + B1*physicaldisorder_pre

Where the post and pre periods are yearly counts of crime and indicators of physical disorder. My paper subsequently does not include the prior counts of crime, but does lag the physical disorder measures by a year to ensure they are exogenous.

(2) Crime_post = B1*physicaldisorder_pre

There are a few reasons to do these lags. The most obvious is to make the explanatory variable of broken windows exogenous, by making sure it is in the past. Lags of the crime counts themselves are most often included strictly as a control variable. There are some examples where crime begets more crime directly, such as retaliatory violence (or see Rosenfeld, 2009), but most folks who fit the cross-lagged models do not make this argument.

Now, my whole argument rests on what I think is an appropriate model explaining counts of crime at places. Continuing with the physical disorder example, I think a reasonable cross-sectional model of crime at places is that there are some underlying characteristics of locations that tend to be pretty stable over fairly long periods of time, and then we have more minor stuff like physical disorder that provide small exogenous shocks to the system over time.

(3) Crime_i = B0*(physicaldisorder_i) + Z_i

Where crime at location i is a function of some fixed characteristic Z. I can’t prove this model is correct, but I believe it is better supported by data. To support this position, I would refer to the incredibly high correlations between counts of crime at places from year to year. This is true of every crime dataset I have worked with (at every spatial unit of analysis), and is a main point of Shaw and McKay’s work, plus Rob Sampson’s for neighborhoods in Chicago, as well as David Weisburd’s work on trajectories of crime at street segments in Seattle. Again, this very high correlation doesn’t strike me as reasonably explained by crime causing more crime; what is more likely is that there is a set of fixed characteristics that impacts criminal behavior at certain locations.

If a model of crime is like that in (3), there are then two problems with the prior equations. The first problem for both (1) and (2) is that lagging physical disorder measures by a year does not make any sense. The idea behind physical disorder (a.k.a. broken windows) is that visible signs of disorder prime people to behave in a particular way. The priming presumably needs to be recent to affect behavior. But this can simply be solved by not lagging physical disorder by a year in the model. The lagged physical disorder effect might approximate the contemporaneous effect, if physical disorder itself is temporally consistent over long periods. So if say we replace physical disorder with locations of bars, the lagged effect of bars likely does not make any difference, because bars don’t turn over that much (and when they do they are often just replaced by another bar).

But what if you still include the lags of crime counts? One may think that this controls for the omitted Z_i effect, but doing so is very bad for the estimates of the other exogenous variables, especially lagged ones or temporally consistent ones. You are probably better off with the omitted random effect, because crime in the prior year is an intermediate outcome. I suspect this bias can be very large, and likely biases the effects of the other variables towards zero by quite a lot. This is because the effect of the fixed characteristic is large, the effect of the exogenous characteristic is smaller, and the two are likely correlated at least to a small amount.

To show this I conduct a simulation. SPSS Code here to replicate it. The true model I simulated is:

(4)  BW_it = 0.2*Z_i + ew_it
(5)  Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it

I generated this for 25,000 locations and two time points (the t subscript), and all the variables are set to have a variance of 1 (all variables are normally distributed). The error terms (ew_it and ec_it) are not correlated with each other, and their variances are set to whatever value is necessary so the resultant variable on the left hand side has a variance of 1. With so many observations one simulation run is pretty representative of what would happen even if I replicated the simulation multiple times. This specification makes BW (which stands for broken windows) and Z_i correlated.
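The linked replication code is in SPSS; as a rough sketch, here is the same data generating process in Python (numpy and statsmodels are my own choice here, and the error standard deviations are simply set so each variable has variance 1). The estimates it spits out will differ a little from the numbers reported below, but the pattern is the same.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 25000

# fixed characteristic of each location
Z = rng.normal(0, 1, n)

# eq (4): BW_it = 0.2*Z_i + ew_it, error variance chosen so Var(BW) = 1
sd_ew = np.sqrt(1 - 0.2**2)
BW1 = 0.2*Z + rng.normal(0, sd_ew, n)
BW2 = 0.2*Z + rng.normal(0, sd_ew, n)

# eq (5): Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it, again with Var(Crime) = 1
# (Cov(BW, Z) = 0.2, so the leftover error variance is 1 - 0.856 = 0.144)
sd_ec = np.sqrt(1 - (0.1**2 + 0.9**2 + 2*0.1*0.9*0.2))
C1 = 5 + 0.1*BW1 + 0.9*Z + rng.normal(0, sd_ec, n)
C2 = 5 + 0.1*BW2 + 0.9*Z + rng.normal(0, sd_ec, n)

# cross-lagged model: post crime on lagged BW and lagged crime, Z omitted
xlag = sm.add_constant(np.column_stack([BW1, C1]))
print(sm.OLS(C2, xlag).fit().params)  # lag BW pushed toward zero or negative, lag crime large

# contemporaneous BW only, Z still omitted (inflated relative to the true 0.1)
print(sm.OLS(C2, sm.add_constant(BW2)).fit().params)

# eq (6): controls that proxy for Z_i pull the BW effect back toward 0.1
Oth = 0.5*Z[:, None] + rng.normal(0, np.sqrt(1 - 0.5**2), (n, 10))
for k in [1, 2, 3, 10]:
    xk = sm.add_constant(np.column_stack([BW2, Oth[:, :k]]))
    print(k, sm.OLS(C2, xk).fit().params[1])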

In my run, what happens when we fit the cross-lagged model? The effect estimates are subsequently:

Lag BW:   -0.07
Lag Crime: 0.90

Yikes – effect of BW is in the opposite direction and nearly as large as the true effect. What about if you just include the lag of BW?

Lag BW: 0.22

The reason this is closer to the true effect is because of some round-about luck. Since BW_it is correlated with the fixed effect Z_i, the lag of BW has a slight correlation with future BW. This potentially changes how we view the effects of disorder on crime though. If BW is more variable, we can make a stronger argument that it is exogenous to other omitted variables. If it is temporally consistent it is harder to make that argument (it should also reduce the correlation with Z_i).

Still, the only reason this lag has a positive effect is that Z_i is omitted. For us to make the argument that this approximates the true effect, we have to make the argument the model has a very important omitted variable. Something one could only do as an act of cognitive dissonance.

How about use the contemporaneous effect of BW, but still include the lag counts of crime?

BW:        0.13
Lag Crime: 0.86

That is not as bad, because the lag of crime is now not an intermediate outcome. Again though, if we switch BW with something more consistent in time, like locations of bars, the lag of crime will again be an intermediate outcome, and will subsequently bias the effect. So what about a model of the contemporaneous effect of BW, omitting Z_i? The contemporaneous effect of BW will still be biased, since Z_i is omitted from the model.

BW: 0.32

But a way to reduce this bias is to introduce other control variables that approximate the omitted Z_i. Here I generate a set of 10 covariates that are a function of Z_i, but are otherwise not correlated with BW or with each other.

(6) Oth_it = 0.5*Z_i + eoth_it

Including these covariates in the model progressively reduces the bias. Here is a table of the reduction in the BW effect as more of the covariates are added in; e.g. "with 2" means the model includes two of the control variables.

BW (with 0):  0.32
BW (with 1):  0.25
BW (with 2):  0.21
BW (with 3):  0.19
BW (with 10): 0.14

So if you include other cross-sectional covariates in an attempt to control for Z_i, it brings the effect of BW closer to its true effect. This is what I believe happens in the majority of social science research that uses strictly cross-sectional models, and it is a partial defense of what people sometimes refer to as kitchen sink models.

So in brief, I think using lags of explanatory variables and lags of crime in the same model is a very bad idea, and can bias the effect estimates quite a lot.

More generally, I believe using lags of explanatory variables and lags of crime counts in cross-sectional models is a bad idea for most research designs. It is true that lagging makes their effects exogenous, but the lag does not capture the more contemporaneous effect of the variable, and so we may be underestimating the effect to a very large extent. Whether or not the temporal lag affects crime has to do with how the explanatory variable itself arises over time, and so the effect estimated by the temporal lag is likely to be misleading (and may be biased upward or downward depending on other parts of the model).

For most cross-lagged models, I think incorporating prior crime counts is likely to introduce more bias than it solves. I believe simply using a cross-sectional model with a reasonable set of control variables will get you closer to the real effect estimates than the cross-lagged models. If you think Z_i is correlated with a variable of interest (or lags of crime really do cause future crime), I think you need to do the extra step of collecting multiple time measures and fitting a real panel data model, not just a cross-lagged one.

I’m still not sure though when you are better off fitting a panel model versus expanding the time window for the cross-section. For one example, I think you are better off estimating the effects of demographic variables in a cross-sectional model, as opposed to a panel one, over a short period of time (say less than 10 years). This is because demographic shifts simply don’t occur very fast, so there is little variance within units for a short panel.


  1. I actually came up with the idea of using 311 calls independently of Dan O’Brien’s work, see my prospectus in 2013 in which I proposed the analysis. So I’m not totally crazy, although I was a little bummed to miss the timing a bit! Four years between proposing and publishing the work is a bit depressing as well.

Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is of the null that the coefficient equals zero. Frequently there are other more interesting tests though, and this is one I’ve come across often: testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples where I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of a difference that I noted before to construct it. Since the standard error squared is the variance around a parameter estimate, the standard error of the difference is sqrt(1^2 + 2^2) =~ 2.23, which assumes the covariance between the two estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure the estimated effect of poverty is appreciably different between the two models.
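That back-of-the-envelope check is only a few lines in Python, using the made-up estimates above:

import numpy as np
from scipy import stats

b1, se1 = 3.0, 1.0  # poverty effect in the base model
b2, se2 = 2.0, 2.0  # poverty effect with the extra controls

diff = b1 - b2
se_diff = np.sqrt(se1**2 + se2**2)  # assumes the covariance of the two estimates is zero
z = diff/se_diff
print(diff, se_diff, z, 2*stats.norm.sf(abs(z)))  # 1.0, ~2.24, ~0.45, p ~ 0.65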

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other people’s work. The assumption of zero covariance between the parameter estimates is not as big of a deal as it may seem. In large samples these covariances tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending it is zero is not a terrible folly.

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make an equivalent mistake to the moderator scenario, and say that the effect of poverty is larger for property than violent because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller.

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated an equation of the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercepts, and the B_1? terms are the treatment effects. Here you can stack the data and estimate an interaction term to get the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term, B_3c, as the difference in the effect for females relative to males. The standard error of this interaction takes into account the covariance term, unlike estimating two totally separate equations would. (You can stack the property and violent crime outcomes I mentioned earlier in an analogous way to the subgroup example.)
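Here is a quick sketch of that stacked model in Python with statsmodels, using made-up data and a linear probability model just to show the interaction term (the variable names are hypothetical):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# made-up stacked data, males and females in one dataset
rng = np.random.default_rng(6)
n = 4000
df = pd.DataFrame({'female': rng.integers(0, 2, n),
                   'treat': rng.integers(0, 2, n)})
df['recid'] = rng.binomial(1, 0.4 - 0.10*df['treat'] - 0.05*df['female']*df['treat'])

# the coefficient on treat:female is B_3c, the difference in treatment effects,
# and its standard error already accounts for the covariance of the estimates
fit = smf.ols('recid ~ treat + female + treat:female', data=df).fit()
print(fit.params)
print(fit.conf_int())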

The final, fourth example is the simplest: two regression coefficients in the same equation. One example is from my dissertation, the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol, such as liquor stores, bars, and gas stations, have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (this includes bars and restaurants). So let’s say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then let’s say we also have the variance-covariance matrix of the parameter estimates, which most stat software will return for you if you ask it:

                 Liquor_Stores   Bars
Liquor_Stores     0.01
Bars             -0.0002         0.0025

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. You can take the ratio of the difference and its standard error, here 0.12/0.11, and treat that as a test statistic from a normal distribution, so the usual rule that it needs to be plus or minus two to be statistically significant at the 0.05 level applies. Here the difference is not statistically significant.
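And here is the same Wald test in Python, plugging in the estimates and the covariance from the tables above:

import numpy as np
from scipy import stats

b_liquor, b_bars = 0.36, 0.24
var_liquor, var_bars = 0.01, 0.0025
cov = -0.0002  # off-diagonal of the variance-covariance matrix

diff = b_liquor - b_bars
se_diff = np.sqrt(var_liquor + var_bars - 2*cov)
z = diff/se_diff
print(diff, se_diff, z, 2*stats.norm.sf(abs(z)))  # z is about 1.1, not significant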

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.


There are two alternative ways to do this test though. One is by doing a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]
 

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]
 

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of Bars + Liquor Stores on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.

I give an example of doing this in R on CrossValidated. This test is nice because it extends to testing multiple coefficients at once, e.g. if I wanted to test bars = liquor stores = convenience stores. The prior individual Wald tests are not as convenient for testing the equality of more than two coefficients.
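Here is a sketch of that likelihood ratio test with Python and statsmodels; the data are simulated stand-ins and the column names are made up:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# fake data just so the sketch runs; swap in the real counts
rng = np.random.default_rng(1)
df = pd.DataFrame({'Bars': rng.poisson(1, 5000),
                   'LiquorStores': rng.poisson(0.5, 5000)})
df['Crime'] = rng.poisson(np.exp(0.24*df['Bars'] + 0.36*df['LiquorStores']))

# Model 1: separate effects, Model 2: one common effect for the sum
full = smf.poisson('Crime ~ Bars + LiquorStores', data=df).fit(disp=0)
df['AlcTotal'] = df['Bars'] + df['LiquorStores']
reduced = smf.poisson('Crime ~ AlcTotal', data=df).fit(disp=0)

# likelihood ratio test with one degree of freedom
lr = 2*(full.llf - reduced.llf)
print(lr, stats.chi2.sf(lr, df=1))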


Here is another way though to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (lack of intercept does not matter for discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test of the B2 coefficient is our test of interest. The logic goes like this: we can expand [eq. 2] to be:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So the test of B2 against zero is a test of whether b1 equals b2. B2 is a little tricky to interpret as an effect size for how much larger b1 is than b2, since it is only half of that difference. An easier way to estimate the effect size is to insert (X-Z)/2 on the right hand side instead, and the coefficient (and confidence interval) for that term will then be the estimate of how much larger the effect of X is than Z.

Note that this gives an estimate equivalent to conducting the Wald test by hand as I mentioned before.
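Here is a small simulated check of that reparametrization in Python with statsmodels (an OLS example with made-up coefficients of 0.36 and 0.24):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5000
df = pd.DataFrame({'x': rng.normal(size=n), 'z': rng.normal(size=n)})
df['y'] = 0.36*df['x'] + 0.24*df['z'] + rng.normal(size=n)

orig = smf.ols('y ~ x + z', data=df).fit()
# slope on I(x + z) recovers (b1 + b2)/2, slope on I((x - z)/2) recovers b1 - b2,
# so the t-test on the second term is the test that b1 = b2
repar = smf.ols('y ~ I(x + z) + I((x - z)/2)', data=df).fit()
print(orig.params)
print(repar.params)
print(repar.pvalues)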

Translating between the dispersion term in a negative binomial regression and random variables in SPSS

NOTE!! – when I initially posted this I was incorrect; I thought SPSS listed the dispersion term in the form of Var(x) = mean + mean*dispersion. But I was wrong, and it is Var(x) = 1 + mean*dispersion (the same as Stata’s, what Cameron and Trivedi call the NB2 model, as cited in the Long and Freese Stata book for categorical variables). The simulation in the original post worked out because in my example I used a mean of 1; here I update it to have a mean of 2 to show the calculations are correct. (Also note that this parametrization is equivalent to Var(x) = mean*(1 + mean*dispersion), see Stata’s help for nbreg.)

When estimating a negative binomial regression equation in SPSS, it returns the dispersion parameter in the form of:

Var(x) = 1 + mean*dispersion

When generating random variables from the negative binomial distribution, SPSS does not take the parameters in this form, but in the more usual N and P notation (number of successes and probability of success). Stealing a bit from the R documentation for dnbinom, I was able to translate between the two with just a tedious set of algebra. So with our original distribution being:

Mean = mu
Variance = 1 + mu*a

R has an alternative representation closer to SPSS’s based on:

Mean = mu
Variance = mu + mu^2/x

Some tedious algebra will reveal that in this notation x = mu^2/(1 - mu + a*mu) (note to future self: using Solve in Wolfram Alpha could have saved some time, paper, and ink). Also, R’s help for dnbinom states that in the original N and P notation p = x/(x + mu). So here with mu and a (again, a is the dispersion term as reported by GENLIN in SPSS) we can solve for p.

x = mu^2/(1 - mu + a*mu)
p = x/(x + mu)

And since p is solved, R lists the mean of the distribution in the N and P notation as:

n*(1-p)/p = mu

So with p solved we can figure out N as equal to:

mu*p/(1-p) = n

So to reiterate, if you have a mean of 2 and dispersion parameter of 4, the resultant N and P notation would be:

mu = 2
a = 4
x = mu^2/(1 - mu + a*mu) = 2^2/(1 - 2 + 4*2) = 4/7
p = x/(x + mu) = (4/7)/(4/7 + 2) = 2/9
n = mu*p/(1-p) = 2*(2/9)/(7/9) = 4/7

Here we can see that in the N and P notation the same negative binomial model results in a fractional number of successes, which some may be surprised is even a possibility. (There is likely an easier way to do this translation, but forgive me, I am not a mathematician!)
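You can double check that translation in Python with scipy, whose nbinom uses the same size and prob convention as R's dnbinom:

from scipy.stats import nbinom

def nb_size_prob(mu, a):
    # translate the mean/dispersion form used above into size (n) and prob (p)
    x = mu**2/(1 - mu + a*mu)
    p = x/(x + mu)
    return x, p

n, p = nb_size_prob(mu=2, a=4)
print(n, p)               # 4/7 and 2/9, as in the worked example
print(nbinom.mean(n, p))  # 2, the original mean
print(nbinom.var(n, p))   # 9, i.e. 1 + mu*a in this parametrization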

Now we would be finished, but unfortunately SPSS’s negative binomial random functions only take integer values and do not take values of N less than 1 (R’s dnbinom does). So we have to do another translation of the N and P notation to the gamma distribution to be able to draw random numbers in SPSS. Another representation of the negative binomial model is a mixture of Poisson distributions, with the distribution of the mixtures being from a gamma distribution. Wikipedia lists a translation from the N and P notation to a gamma with shape = N and scale = P/(1-P).

So I wrapped these computations up in an SPSS macro that takes the mean and the dispersion parameter, calculates N and P under the hood, and then draws a random variable from the associated negative binomial distribution.

DEFINE !NegBinRV (mu = !TOKENS(1)
       /disp = !TOKENS(1) 
       /out = !TOKENS(1) )
COMPUTE #x = !mu**2/(1 - !mu + !disp*!mu).
COMPUTE #p = #x / (#x + !mu).
COMPUTE #n = !mu*#p/(1 - #p).
COMPUTE #G = RV.GAMMA(#n,#p/(1 - #p)).
COMPUTE !Out = RV.POISSON(#G).
FORMATS !Out (F5.0).
!ENDDEFINE.
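If you want to sanity check the gamma mixture logic outside of SPSS, here is a quick numpy sketch. Note that numpy's gamma is parameterized by shape and scale (mean = shape*scale), so I pick the scale so that the mixing distribution has a mean of mu:

import numpy as np

rng = np.random.default_rng(3)
mu, disp = 2, 4

nb_size = mu**2/(1 - mu + disp*mu)  # 4/7, as in the macro
p = nb_size/(nb_size + mu)          # 2/9

# gamma-Poisson mixture: a scale of (1-p)/p gives the gamma a mean of nb_size*(1-p)/p = mu
draws = rng.poisson(rng.gamma(shape=nb_size, scale=(1 - p)/p, size=200000))
print(draws.mean(), draws.var())    # roughly 2 and 9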

I am not sure if it is possible to use this gamma representation and native SPSS functions to calculate the corresponding CDF and PDF of the negative binomial distribution. But we can use R to do that. Here is an example of keeping the mean at 1 and varying the dispersion parameter between 0 and 5.

BEGIN PROGRAM R.
library(ggplot2)
x <- expand.grid(0:10,1:5)
names(x) <- c("Int","Disp")
mu <- 1
x$PDF <- mapply(dnbinom, x=x$Int, size=mu^2/(1 - mu + x$Disp*mu), mu=mu)
#add in poisson 
t <- data.frame(cbind(0:10,rep(0,11),dpois(0:10,lambda=1)))
names(t) <- c("Int","Disp","PDF")
x <- rbind(t,x)
p <- ggplot(data = x, aes(x = Int, y = PDF, group = as.factor(Disp))) + geom_line()
p
#for the CDF
x$CDF <- ave(x$PDF, x$Disp, FUN = cumsum) 
END PROGRAM.

Here you can see how the larger dispersion term can easily approximate the zero inflation typical in criminal justice data (see an applied example from my work). R will not take a dispersion parameter of zero in this notation (as the size computation would involve division by zero and is not defined), so I just tacked on the Poisson distribution with a mean of 1 to represent a dispersion of zero.

Here is an example of generating random data from a negative binomial distribution with a mean of 2 and a dispersion parameter of 4. I then grab the PDF from R, and superimpose them both on a chart in SPSS (or perhaps I should call it a PMF, since it only has support on integer values). You can see the simulation with 10,000 observations is a near perfect fit (so a good sign I did not make any mistakes!)

*Simulation In SPSS.
INPUT PROGRAM.
LOOP Id = 1 TO 10000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME RandNB.

!NegBinRV mu = 2 disp = 4 out = NB.

*Making separate R dataset of PDF.
BEGIN PROGRAM R.
mu <- 2
disp <- 4
x <- 0:11
pdf <- dnbinom(x=x,size=mu^2/(1 - mu + disp*mu),mu=mu)
#add in larger than 10
pdf[max(x)+1] <- 1 - sum(pdf[-(max(x)+1)])
MyDf <- data.frame(cbind(x,pdf))
END PROGRAM.
EXECUTE.
STATS GET R FILE=* /GET DATAFRAME=MyDf DATASET=PDF_NB.
DATASET ACTIVATE PDF_NB.
FORMATS x (F2.0).
VALUE LABELS x 11 '11 or More'.

*Now superimposing bar plot and PDF from separate datasets.
DATASET ACTIVATE RandNB.
RECODE NB (11 THRU HIGHEST = 11)(ELSE = COPY) INTO NB_Cat.
FORMATS NB_Cat (F2.0).
VALUE LABELS NB_Cat 11 '11 or More'.

GGRAPH
  /GRAPHDATASET NAME="Data" DATASET='RandNB' VARIABLES=NB_Cat[LEVEL=ORDINAL] COUNT()[name="COUNT"] 
  /GRAPHDATASET NAME="PDF" DATASET='PDF_NB' VARIABLES=x pdf
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: Data=userSource(id("Data"))
  DATA: NB_Cat=col(source(Data), name("NB_Cat"), unit.category())
  DATA: COUNT=col(source(Data), name("COUNT"))
  SOURCE: PDF=userSource(id("PDF"))
  DATA: x=col(source(PDF), name("x"), unit.category())
  DATA: den=col(source(PDF), name("pdf"))
  TRANS: den_per = eval(den*100)
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(summary.percent(NB_Cat*COUNT)), shape.interior(shape.square))
  ELEMENT: point(position(x*den_per), color.interior(color.black), size(size."8"))
END GPL.

Poisson regression and crazy predictions

Here is a problem I’ve encountered a few times in my own work (and others’) with Poisson regression models and the exponential link function. It came up recently in some discussions on the scatterplot blog by Jeremy Freese (see 1 & 2) critiquing the PNAS paper on the effect of female-named hurricanes on death tolls, so I figured I would expand on those thoughts a little here.

So the problem when you estimate a Poisson regression model is that the exponential link function can become explosive for explanatory variables that have a large range. To be clear, we have a Poisson regression model of the form (here E[Y] means the expected value of Y):

log(E[Y]) = B1*(X)
    E[Y]  = e^(B1*X)

If X has a small range this may be fine, but if X has a large range it can become problematic. Consider if Y are hurricane deaths, X is monetary damage of the hurricane, and B1 = 0.01. Lets say the monetary damage ranges from 1 to 1000 (imagine these are in thousands of dollars, so range between $1,000 and $1 million). What happens with the predictions?

E[Deaths] = e^(0.01*   1) =     1.01
E[Deaths] = e^(0.01*   5) =     1.05
E[Deaths] = e^(0.01*  10) =     1.11
E[Deaths] = e^(0.01*  50) =     1.65
E[Deaths] = e^(0.01* 100) =     2.71
E[Deaths] = e^(0.01* 500) =   148
E[Deaths] = e^(0.01*1000) = 22026

These predictions are invariant to linear transformations of X; that is, Z-scoring X doesn’t change the predictions (just as expressing X in [dollars/1000] makes no difference compared to including [X] on the right hand side). The coefficient B1 will simply be scaled by the appropriate inverse transformation. Also I’d note that expressed as an incident rate ratio the effect would be e^0.01 = 1.01. This appears on its face a totally innocuous effect, and only in consideration of the variation in X does it appear to be absurd.

You can see that if the range of X were smaller, say between 1 and 100, the predictions might be fine. The predictions between 1 and 100 only vary by 1.7 deaths. The problem with these explosive predictions at larger values is that they are nonsense for most of social scientific research. A simple sanity check to see if this is occurring is to check the predicted value from your Poisson regression equation at the low end of X versus the high end (and just pretend all of the other explanatory variables are set to 0) exactly as I have done here. If the high end is crazy, you will need to consider some alternative model specification (or be very clear that the model can not be extrapolated to the larger values of X).

A useful alternate parametrization is simply to log X, and in this case when exponentiating the right hand side, it will make the predictor a power of the original metric.

So imagine we fit the model:

log(E[Y]) = B2*(log(X))
    E[Y]  = e^(B2*log(X))
          = X^B2

Lets say here that B2 = 0.5. What happens to our predictions again?

E[Deaths] =    1^0.5 =  1
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

Those predictions look a little easier to swallow at the larger values. Notice also the differences in predictions at the smaller values? There is more discrimination among the smaller values than on the original scale, while the larger values are suppressed. Let’s consider the predictions side by side for easier comparison.

   X      B1   B2
   -      --   --
   1       1    1
   5       1    2
  10       1    3
  50       2    7
 100       3   10
 500     148   22
1000   22026   32

A frequent problem with logging the explanatory variables is that they contain zeroes. A simple alternative is to treat log(0) as 0 and then have a separate dummy variable equal to 1 when X = 0. This model may not make Occam happy, as it implies a discontinuity at 0, but it is in my opinion a small price to pay. Also, if there are a lot of zeroes, it doesn’t strike me as totally unrealistic to have a mixture of what happens at 0 and what happens at the higher values. So the full model written out would be:

log(E[Y]) = B3*D + B4*(log(X))

But the model is essentially discontinuous. When X=0, we treat log(X)=0 and D=1, so the model reduces to:

log(E[Y]) = B3*D :When X = 0

When X>0, D=0 and the model reduces to:

log(E[Y]) = B4*(log(X)) :When X > 0

Now, it certainly would be weird if B3>>0, as this would imply a high spike at 0, and then at an X value of 1 the prediction goes back down to 1 and then increases with X. If we expect B4 to be positive, then a negative value of B3 (or one very close to 0) would make the most sense. It is still a discontinuity in the function, but one that may make theoretical sense. So imagine we fit the equation log(E[Y]) = B3*D + B4*(log(X)); let’s say B4 is 0.5 (the same as B2), and that B3 is equal to -0.1. This would then make the set of predictions go:

E[Deaths] =  e^-0.1  =  0.9
E[Deaths] =    1^0.5 =  1.0
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

So in this made up example the discontinuity pretty much fits right in with the rest of the function. We may consider other non-linear transformations of X as well (splines or higher powers), but frequently an additional problem is that the bulk of the data lie in the lower end of the range. So for our dollars example, if it is highly right skewed, there may be only a few values at 100 or higher. These can be highly influential if you use powers of X (e.g. include X^2, X^3, etc. on the right hand side), so splines are a better choice, but essentially no matter how you fit the function it will be hard to verify the fit at these values or extrapolate to those tails. So the fit of the original function may be fine, but it just implies unrealistic marginal effects in the tails.
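To make the zero-dummy parametrization concrete, here is a sketch of setting it up in Python with statsmodels; the data and column names are made up, with the same B3 and B4 values as above:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# toy skewed X with a pile of zeros, standing in for the damage example
rng = np.random.default_rng(4)
damage = np.round(rng.lognormal(mean=2, sigma=1.5, size=5000))
df = pd.DataFrame({'damage': damage})

# dummy for X = 0 and log(X) with log(0) treated as 0
df['zero'] = (df['damage'] == 0).astype(int)
df['log_damage'] = np.log(np.maximum(df['damage'], 1))

# simulate deaths from the B3/B4 values used above, then fit the model
df['deaths'] = rng.poisson(np.exp(-0.1*df['zero'] + 0.5*df['log_damage']))
fit = smf.poisson('deaths ~ zero + log_damage', data=df).fit(disp=0)
print(fit.params)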

So how do we verify one equation over the other? Visualizing count data in scatterplots tends to be harder than visualizing continuous data, especially if there is a stock pile of data at 0. The problem is simply exacerbated if the explanatory variable has a similar right skew: there will be a large mass near the origin of the plot and very sparse data everywhere else.

My simple suggestion is to just bin the data at X values, which is very easy if X is integer valued, and then plot the mean and standard error of Y within those bins. As Poisson regression and its variants rely on asymptotic properties, if the error bars are too variable to deduce a pattern you should be concerned that your sample size isn’t large enough to begin with.
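Here is a minimal sketch of that binned plot in pandas and matplotlib, with made-up data and names:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# toy data: integer valued x and a count outcome y
rng = np.random.default_rng(5)
x = rng.integers(0, 20, 10000)
y = rng.poisson(np.exp(0.05*x))
df = pd.DataFrame({'x': x, 'y': y})

# mean and standard error of y within each integer bin of x
grp = df.groupby('x')['y'].agg(['mean', 'sem'])
plt.errorbar(grp.index, grp['mean'], yerr=2*grp['sem'], fmt='o')
plt.xlabel('x (binned)')
plt.ylabel('mean of y (plus or minus 2 standard errors)')
plt.show()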

If the bulk of the data only have a small range over X, then it will be hard in practice to differentiate between the two model parametrizations I suggest here (with the typical noisy data we have in the social sciences). So you may prefer logging the X variable simply to prevent the dramatic explosions in the tails of the data right from the start.

I do feel comfortable saying that if the ratio of the largest to smallest value of the independent variable is over 100, you should check the predictions of the exponential link function very closely (if the smallest value is 0, just estimate the ratio as if the smallest value is 1). If this ratio is 100 or larger, unless the coefficient in the Poisson regression equation is very small (<<0.01), the predictions may explode into very implausible ranges for the larger X values.

Negative Binomial regression and predicted probabilities in SPSS

For my dissertation I have been estimating negative binomial regression models predicting the counts of crimes at small places (i.e. street segments and intersections). When evaluating the fit of Poisson regression models and their variants, you typically make a line plot of the observed percent of each integer value versus the percent predicted by the model. This is particularly pertinent for data that have a high proportion of zeros, as the negative binomial may still under-predict the number of zeros.

I mistakenly thought that to make such a plot you could simply estimate the predicted value following the negative binomial regression model and then round the predictions. But I was incorrect, and to make the typical predicted versus observed plot you need to estimate the probability of an observation taking an integer value, and then take the mean of that probability over all the observations. That mean will subsequently be the predicted percent given the model. Fortunately I caught my mistake before I gave some talks on my work recently, and I will show how to make said calculations in SPSS. I have posted the data to replicate this work at this dropbox link, and so you can download the data and follow along.

First, I got some help on how to estimate the predicted probabilities via an answer to my question at CrossValidated. So that question lists the formula one needs to estimate the predicted probability for any integer value N after the negative binomial model. To calculate that value though we need to make some special SPSS functions, the factorial and the complete gamma function. Both have SPSS tech help pages showing how to calculate them.

For the factorial we can use a general relationship with the LNGAMMA function.


DEFINE !FACT (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(LNGAMMA((!1)+1)) )
!ENDDEFINE.

And for the complete gamma function we can use a relationship to the CDF of the gamma function.


DEFINE !GAMMAF (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(-1)/(!1)/(CDF.GAMMA(1,(!1),1) - CDF.GAMMA(1,(!1)+1,1)) )
!ENDDEFINE.

And given these two functions, we can create a macro that takes the parameters below and returns the predicted probability we are interested in:

  • out – new variable name for predicted probability of taking on that integer value
  • PredN – the predicted mean of the variable conditional on the covariates
  • Disp – estimate of the dispersion parameter
  • Int – the integer value being predicted

DEFINE !PredNB (Out = !TOKENS(1)
               /PredN = !TOKENS(1)
                        /Disp = !TOKENS(1)
                        /Int = !TOKENS(1) )
COMPUTE #a = (!Disp)**(-1).
COMPUTE #mu = !PredN.
COMPUTE #Y = !Int.
COMPUTE #1 = (!GAMMAF(#Y + #a))/(!FACT(#Y)*!GAMMAF(#a)).
COMPUTE #2 = (#a/(#a+#mu))**#a.
COMPUTE #3 =  (#mu/(#a + #mu))**#Y.
COMPUTE !Out =  #1*#2*#3.
!ENDDEFINE.
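
As a quick example of calling the macro (the variable names and values here are made up for illustration: PredMean stands in for the predicted mean saved from your negative binomial model, and 0.22 for the estimated dispersion parameter), this computes the predicted probability of a zero count for every observation:

*Hypothetical predicted mean variable (PredMean) and dispersion estimate for illustration.
COMPUTE DispEst = 0.22.
!PredNB Out = NB0 PredN = PredMean Disp = DispEst Int = 0.
EXECUTE.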

But to make our plot we want to estimate this predicted probability over a range of values, so I created a helper macro that, instead of taking only one integer value, takes the ending integer value and calculates the predicted probabilities for zero through that value.


DEFINE !PredNBRange (Num = !TOKENS(1)
                    /Mean = !TOKENS(1)
                    /Disp = !TOKENS(1)
                    /Stub = !TOKENS(1) )
!DO !I = 0 !TO !Num
  !LET !Base = !CONCAT(!Stub,!I)
  !PredNB Out = !Base PredN = !Mean Disp = !Disp Int = !I.
!DOEND 
!ENDDEFINE.
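
And then to generate the full set of predicted probabilities for the plot, the helper macro is called once with the ending integer (again using the made-up PredMean and DispEst variables from the example above):

*Creates NB0 through NB8, the predicted probabilities of 0 through 8 for each observation.
!PredNBRange Num = 8 Mean = PredMean Disp = DispEst Stub = NB.
EXECUTE.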

The example data and code I have posted compares these values to the ones predicted from Stata, and shows my function agrees with Stata to about 7 decimal places. I won’t go through all of those commands here, but I will show how to make the predicted versus observed proportions plot after you have a vector of predicted probabilities (you can download all of the code and data at the link I referenced earlier in the post).

So let’s say that you have a vector NB0 TO NB8, and these are the predicted probabilities of the integer values 0 to 8 for the observations in your dataset. To get the mean of those predicted probabilities over all observations, you can use the AGGREGATE command. Having no variables specified on the BREAK subcommand tells SPSS to aggregate over all records in the dataset. Here I export the file to a new dataset named PredNBAgg.


DATASET DECLARE PredNBAgg.
AGGREGATE OUTFILE='PredNBAgg'
  /BREAK = 
  /NB0 TO NB8 = MEAN(NB0 TO NB8).

Now, to merge with the observed proportions later on, I reshape the dataset so the mean values are all in the same column using VARSTOCASES. Here I also make a category for the predicted probability of being 9 or higher (which isn’t typical for these types of plots, but something I believe is useful).


DATASET ACTIVATE PredNBAgg.
COMPUTE NB9_Plus = 1 - SUM(NB0 TO NB8).
VARSTOCASES /MAKE NBPred FROM NB0 TO NB9_Plus /INDEX Int.
COMPUTE Int = Int - 1. /*Index starts at 1 instead of 0 */.

Now I reactivate my original dataset, here named PredNegBin, calculate the binned observed values (with observations 9 and larger recoded to just 9) and then aggregate those values.


DATASET ACTIVATE PredNegBin.
RECODE TotalCrime (9 THRU HIGHEST = 9)(ELSE = COPY) INTO Int.
DATASET DECLARE PredObsAgg.
AGGREGATE OUTFILE='PredObsAgg'
  /BREAK = Int
  /TotalObs = N.

To get the observed proportions within each category, I need to do another aggregation to get the total number of observations, and then divide the total for each integer value by the total number of observations.


DATASET ACTIVATE PredObsAgg.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK = 
  /TotalN=SUM(TotalObs).
COMPUTE PercObs = TotalObs / TotalN.

Now we can merge the two aggregated datasets together. I also close the old PredNBAgg dataset and define a value label so it is clear that the 9 integer category is really 9 and larger.


MATCH FILES FILE = *
  /FILE = 'PredNBAgg'
  /BY Int.
DATASET CLOSE PredNBAgg.
VALUE LABELS Int 9 '9+'.

Now at this point you could make the plot with the predicted and observed proportions in separate variables, but this would take two ELEMENT statements within a GGRAPH command (and I like to make line plots with both the lines and the points, so it would actually take 4 ELEMENT statements). So what I do here is reshape the data one more time with VARSTOCASES, and make a categorical variable to identify whether the proportion is the observed value or the predicted value from the model. Then you can make your chart.


VARSTOCASES /MAKE Dens FROM PercObs NBPred /Index Type /DROP TotalObs TotalN.
VALUE LABELS Type 
 1 'Observed'
 2 'Predicted'.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Int Dens Type
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Int=col(source(s), name("Int"), unit.category())
  DATA: Type=col(source(s), name("Type"), unit.category())
  DATA: Dens=col(source(s), name("Dens"))
  GUIDE: axis(dim(1), label("Total Crimes on Street Units"))
  GUIDE: axis(dim(2), label("Percent of Streets"))
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.black),("2",color.red)))
  ELEMENT: line(position(Int*Dens), color.interior(Type))
  ELEMENT: point(position(Int*Dens), color.interior(Type), color.exterior(color.white), size(size."7"))
END GPL.

And voilà, here you can see the predicted values are so close to the observed that it is difficult to even see the observed values. Here, instead of creating a legend, I manually added labels to the chart. A better chart may be to plot the difference between the observed and predicted proportions (especially if you were comparing multiple Poisson models), but it should be quite plain to see that the negative binomial fits the observed data quite well in this instance.

Similar to Paul Allison’s experience, even with nearly 64% of the observations being zero, the negative binomial model fits just fine. I recently fit some other models with the same data (but a different outcome) in which the proportion of zeros was nearer to 90%. In that instance the negative binomial model would not converge, so estimating a zero-inflated model was necessary. Here though it is clearly not necessary, and I would prefer the negative binomial model over a ZIP (or hurdle) model, as I see no obvious reason to take on the complication of a separate equation for the zeros in addition to the count equation.