Pooling multiple outcomes into one regression equation

Something that came up for many of my students this last semester in my Seminar in Research class is that many were interested in multiple outcomes. The most common one is examining different types of delinquency for juveniles (often via surveys), but it comes up in quite a few other designs as well (e.g. different crime outcomes for spatial research, different measures of perceptions towards police, different measures of fear of crime, etc.).

Most of the time students default to estimating separate equations for each of these outcomes, but in most circumstances I was telling the students they should pool these outcomes into one model. I think that is the better default for the majority of situations. So say we have a situation with two outcomes, violent crimes and property crimes, and we have one independent variable we are interested in, say whether an individual was subjected to a particular treatment. We might then estimate two separate equations:

E[# Violent Crimes]  = B0v + B1v*(Treatment) 
    
E[# Property Crimes] = B0p + B1p*(Treatment)

By saying that I think by default we should think about pooling is basically saying that B1v is going to be close to equal to B1p in the two equations. Pooling the models together both lets us test that assertion, as well as get a better estimate of the overall treatment effect. So to pool the models we would stack the outcomes together, and then estimate something like:

E[# Crimes (by type)] = B0 + B1*(Treatment) + B2*(Outcome = Violent) + B3(Treatment*Outcome = Violent)

Here the B3 coefficient tests whether the treatment effect is different for the violent crime outcome as opposed to the property crime, and the dummy variable B2 effect controls for any differences in the levels of the two overall (that is, you would expect violent incidents to be less common than property crime incidents).

Because you will have multiple measures per individual, you can correct for that (by clustering the standard errors). But in the case you have many outcomes you might also want to consider a multi-level model, and actually estimate random effects for individuals and outcomes. So say instead of just violent and property crimes, but had a survey listing for 20 different types of delinquency. In that case you might want to do a model that looks like:

Prob(Delinquency_ij) = f[B0 + B1*(Treatment_j) + d_j + g_i]

Where one is estimating a multi-level logistic regression equation for delinquency type i within individual j, and the g_i and d_j are the random effects for delinquency types and individuals respectively. In the case you do not have many outcomes (say only 10), the random effect distribution might be hard to estimate. In that case I would just use fixed effects for the outcome dummy variables. But I can imagine the random effects for persons are of interest in many different study designs. And this way you get one model — instead of having to interpret 20+ models.

Also you can still estimate differential treatment effects across the different items if you want to, such as by looking at the interaction of the outcome types and the treatment. But in most cases in criminology I have come across treatments are general. That is, we would expect them to decrease/increase all crime types, not just some specific violent or property crime types. So to default pooling the treatment effect estimate makes sense.


To go a bit farther — juvenile delinquency is not my bag, but offhand I don’t understand why those who examine surveys of delinquency items use that multi-level model more often. Often times people aggregate the measures altogether into one overall scale, such as saying someone checked yes to 2 out of 10 violent crime outcomes, and checked yes to 5 out of 10 property crime outcomes. Analyzing those aggregated outcomes is another type of pooling, but one I don’t think is appropriate, mainly because it ignores the overall prevalence for the different items. For example, you might have an item such as "steal a car", and another that is "steal a candy bar". The latter is much more serious and subsequently less likely to occur. Going with my prior examples, pooling items together like this would force the random effects for the individual delinquency types, g_i, to all equal zero. Just looking at the data one can obviously tell that is not a good assumption.

Here I will provide an example via simulation to demonstrate this in Stata. First I generate an example dataset that has 1,000 individuals and 20 yes/no outcomes. They way the data are simulated is that each individual has a specific amount of self_control that decreases the probability of an outcome (with a coefficient of -0.5), they are nested within a particular group (imagine a different school) that affect whether the outcome occurs or not. In addition to this, each individual has a random intercept (drawn from a normal distribution), and each question has a fixed different prevalence.

*Stata simulation
clear
set more off
set seed 10
set obs 1000
generate caseid = _n
generate group = ceil(caseid/100) 
generate self_control = rnormal(0,1)
generate rand_int = rnormal(0,1)

*generating 20 outcomes that just have a varying intercept for each
forval i = 1/20 { 
  generate logit_`i' = -0.4 -0.5*self_control -0.1*group + 0.1*(`i'-10) + rand_int
  generate prob_`i' = 1/(1 + exp(-1*logit_`i'))
  generate outcome_`i' = rbinomial(1,prob_`i')
}
drop logit_* prob_* rand_int
summarize prob_*

And here is that final output:

. summarize prob_*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      prob_1 |      1,000    .1744795    .1516094   .0031003   .9194385
      prob_2 |      1,000    .1868849     .157952   .0034252   .9265418
      prob_3 |      1,000     .199886    .1642414   .0037841   .9330643
      prob_4 |      1,000      .21348    .1704442   .0041804   .9390459
      prob_5 |      1,000    .2276601    .1765258    .004618   .9445246
-------------+---------------------------------------------------------
      prob_6 |      1,000     .242416    .1824513   .0051012   .9495374
      prob_7 |      1,000    .2577337    .1881855   .0056347   .9541193
      prob_8 |      1,000    .2735951    .1936933   .0062236   .9583033
      prob_9 |      1,000     .289978    .1989401   .0068736    .962121
     prob_10 |      1,000    .3068564    .2038919    .007591   .9656016
-------------+---------------------------------------------------------
     prob_11 |      1,000    .3242004    .2085164   .0083827   .9687729
     prob_12 |      1,000    .3419763    .2127823   .0092562   .9716603
     prob_13 |      1,000    .3601469    .2166605   .0102197   .9742879
     prob_14 |      1,000    .3786715    .2201237   .0112824   .9766776
     prob_15 |      1,000    .3975066    .2231474   .0124542   .9788501
-------------+---------------------------------------------------------
     prob_16 |      1,000    .4166057    .2257093    .013746   .9808242
     prob_17 |      1,000    .4359203    .2277906   .0151697   .9826173
     prob_18 |      1,000       .4554    .2293751   .0167384   .9842454
     prob_19 |      1,000     .474993    .2304504   .0184663   .9857233
     prob_20 |      1,000    .4946465    .2310073   .0203689   .9870643

You can see from this list that each prob* variable then has a different overall prevalence, from around 17% for prob_1, climbing to around 50% for prob_20.

Now if you wanted to pool the items into one overall delinquency scale, you might estimate a binomial regression model (note this is not a negative binomial model!) like below (see Britt et al., 2017 for discussion).

*first I will show the binomial model in Britt
egen delin_total = rowtotal(outcome_*)
*Model 1
glm delin_total self_control i.group, family(binomial 20) link(logit)

Which shows for the results (note that the effect of self-control is too small, it should be around -0.5):

. glm delin_total self_control i.group, family(binomial 20) link(logit)

Iteration 0:   log likelihood =  -3536.491  
Iteration 1:   log likelihood = -3502.3107  
Iteration 2:   log likelihood = -3502.2502  
Iteration 3:   log likelihood = -3502.2502  

Generalized linear models                         No. of obs      =      1,000
Optimization     : ML                             Residual df     =        989
                                                  Scale parameter =          1
Deviance         =  4072.410767                   (1/df) Deviance =   4.117706
Pearson          =  3825.491931                   (1/df) Pearson  =    3.86804

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

                                                  AIC             =     7.0265
Log likelihood   = -3502.250161                   BIC             =  -2759.359

------------------------------------------------------------------------------
             |                 OIM
 delin_total |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

One of the things I wish the Britt paper mentioned was that the above binomial model is equivalent to the a logistic regression model on the individual outcomes — but one that forces the predictions for each item to be the same across a person. So if you reshape the data from wide to long you can estimate that same binomial model as a logistic regression on the 0/1 outcomes.

*reshape wide to long
reshape long outcome_, i(caseid) j(question)
*see each person now has 20 questions each
*tab caseid

*regression model with the individual level data, should be equivalent to the aggregate binomial model
*Model 2
glm outcome_ self_control i.group, family(binomial) link(logit)

And here are the results:

. glm outcome_ self_control i.group, family(binomial) link(logit)

Iteration 0:   log likelihood = -12204.638  
Iteration 1:   log likelihood = -12188.762  
Iteration 2:   log likelihood = -12188.755  
Iteration 3:   log likelihood = -12188.755  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,989
                                                  Scale parameter =          1
Deviance         =  24377.50934                   (1/df) Deviance =   1.219546
Pearson          =  19949.19243                   (1/df) Pearson  =   .9980085

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.219975
Log likelihood   = -12188.75467                   BIC             =  -173583.3

------------------------------------------------------------------------------
             |                 OIM
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

So you can see that Model 1 and Model 2 are exactly the same (in terms of estimates for the regression coefficients).

Model 2 though should show the limitations of using the binomial model — it predicts the same probability for each delinquency item, even though prob_1 is less likely to occur than prob_20. So for example, if we generate the predictions of this model, we can see that each question has the same predicted value.

predict prob_mod2, mu
sort question
by question: summarize outcome_ prob_mod2

And here are the results for the first four questions:

.     by question: summarize outcome_ prob_mod2

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

By construction, the binomial model on the aggregated totals is a bad fit to the data. It predicts that each question should have a probability of around 32% of occurring. Although you can’t fit the zero-inflated model discussed by Britt via the individual level logit approach (that I am aware of), that approach has the same limitation as the generic binomial model approach. Modeling the individual items just makes more sense when you have the individual items. It is hard to think of examples where such a restriction would be reasonable for delinquency items.

So here a simple update is to include a dummy variable for each item. Here I also cluster according to whether the item is nested within an individual caseid.

*Model 3
glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

And here are the results:

.     glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

Iteration 0:   log pseudolikelihood = -11748.056  
Iteration 1:   log pseudolikelihood = -11740.418  
Iteration 2:   log pseudolikelihood = -11740.417  
Iteration 3:   log pseudolikelihood = -11740.417  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,970
                                                  Scale parameter =          1
Deviance         =  23480.83406                   (1/df) Deviance =   1.175805
Pearson          =  19949.15609                   (1/df) Pearson  =   .9989562

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.177042
Log pseudolikelihood = -11740.41703               BIC             =  -174291.8

                             (Std. Err. adjusted for 1,000 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3858319   .0334536   -11.53   0.000    -.4513996   -.3202641
             |
       group |
          2  |  -.0620222   .1350231    -0.46   0.646    -.3266626    .2026182
          3  |   -.049852   .1340801    -0.37   0.710    -.3126442    .2129403
          4  |   .0549271   .1383412     0.40   0.691    -.2162167    .3260709
          5  |  -.1329942   .1374758    -0.97   0.333    -.4024419    .1364535
          6  |  -.4103578   .1401212    -2.93   0.003    -.6849904   -.1357253
          7  |  -.3145033   .1452201    -2.17   0.030    -.5991296   -.0298771
          8  |  -.2803599   .1367913    -2.05   0.040    -.5484659    -.012254
          9  |  -.4543686   .1431314    -3.17   0.002    -.7349011   -.1738362
         10  |  -.5962359   .1457941    -4.09   0.000    -.8819872   -.3104847
             |
    question |
          2  |   .1453902   .1074383     1.35   0.176    -.0651851    .3559654
          3  |   .1643203   .1094113     1.50   0.133     -.050122    .3787625
          4  |   .1262597   .1077915     1.17   0.241    -.0850078    .3375272
          5  |   .1830563    .105033     1.74   0.081    -.0228047    .3889173
          6  |   .3609468   .1051123     3.43   0.001     .1549304    .5669633
          7  |    .524749    .100128     5.24   0.000     .3285017    .7209963
          8  |   .5768412   .1000354     5.77   0.000     .3807754     .772907
          9  |   .7318797   .1021592     7.16   0.000     .5316513    .9321081
         10  |    .571682   .1028169     5.56   0.000     .3701646    .7731994
         11  |    .874362   .0998021     8.76   0.000     .6787535     1.06997
         12  |   .8928982   .0998285     8.94   0.000     .6972379    1.088559
         13  |   .8882734   .1023888     8.68   0.000      .687595    1.088952
         14  |   .9887095   .0989047    10.00   0.000     .7948599    1.182559
         15  |   1.165517   .0977542    11.92   0.000     .9739222    1.357111
         16  |   1.230355   .0981687    12.53   0.000     1.037948    1.422762
         17  |   1.260403   .0977022    12.90   0.000      1.06891    1.451896
         18  |   1.286065    .098823    13.01   0.000     1.092376    1.479755
         19  |   1.388013   .0987902    14.05   0.000     1.194388    1.581638
         20  |   1.623689   .0999775    16.24   0.000     1.427737    1.819642
             |
       _cons |  -1.336376   .1231097   -10.86   0.000    -1.577666   -1.095085
------------------------------------------------------------------------------

You can now see that the predicted values for each individual item are much more reasonable. In fact they are a near perfect fit.

predict prob_mod3, mu
by question: summarize outcome_ prob_mod2 prob_mod3

And the results:

.     by question: summarize outcome_ prob_mod2 prob_mod3

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .183    .0672242   .0475809   .4785903

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .205    .0729937   .0546202   .5149203

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .208    .0737455    .055606   .5196471

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .202    .0722336   .0536408   .5101407

If you want, you can also test whether any "treatment" effect (or here the level of a persons self control), has differential effects across the different delinquency items.

*Model 4
glm outcome_ self_control i.group i.question (c.self_control#i.question), family(binomial) link(logit) cluster(caseid)
*can do a test of all the interactions equal to zero at once
testparm c.self_control#i.question

I’ve omitted this output, but here of course the effect of self control is simulated to be the same across the different items, so one would fail to reject the null that any of the interaction terms are non-zero.

Given the way I simulated the data, the actual correct model is a random effects one. You should notice in each of the prior models the effect of self control is too small. One way to estimate that model in Stata is to below:

*Model 5
melogit outcome_ self_control i.group i.question || caseid:

And here are the results:

. melogit outcome_ self_control i.group i.question || caseid:

Fitting fixed-effects model:

Iteration 0:   log likelihood = -11748.056  
Iteration 1:   log likelihood = -11740.418  
Iteration 2:   log likelihood = -11740.417  
Iteration 3:   log likelihood = -11740.417  

Refining starting values:

Grid node 0:   log likelihood =  -10870.54

Fitting full model:

Iteration 0:   log likelihood =  -10870.54  
Iteration 1:   log likelihood = -10846.176  
Iteration 2:   log likelihood = -10845.969  
Iteration 3:   log likelihood = -10845.969  

Mixed-effects logistic regression               Number of obs     =     20,000
Group variable:          caseid                 Number of groups  =      1,000

                                                Obs per group:
                                                              min =         20
                                                              avg =       20.0
                                                              max =         20

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(29)     =    1155.07
Log likelihood = -10845.969                     Prob > chi2       =     0.0000
------------------------------------------------------------------------------
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.4744173   .0372779   -12.73   0.000    -.5474807    -.401354
             |
       group |
          2  |  -.0648018   .1642767    -0.39   0.693    -.3867783    .2571746
          3  |  -.0740465   .1647471    -0.45   0.653    -.3969449    .2488519
          4  |    .036207   .1646275     0.22   0.826     -.286457     .358871
          5  |  -.1305605   .1645812    -0.79   0.428    -.4531337    .1920126
          6  |  -.5072909   .1671902    -3.03   0.002    -.8349776   -.1796042
          7  |  -.3732567    .165486    -2.26   0.024    -.6976032   -.0489102
          8  |  -.3495889   .1657804    -2.11   0.035    -.6745126   -.0246653
          9  |  -.5593725   .1675276    -3.34   0.001    -.8877205   -.2310245
         10  |  -.7329717   .1673639    -4.38   0.000    -1.060999   -.4049445
             |
    question |
          2  |   .1690546   .1240697     1.36   0.173    -.0741177    .4122268
          3  |    .191157   .1237894     1.54   0.123    -.0514657    .4337797
          4  |   .1467393   .1243586     1.18   0.238    -.0969991    .3904776
          5  |   .2130531   .1235171     1.72   0.085     -.029036    .4551422
          6  |   .4219282   .1211838     3.48   0.000     .1844123    .6594441
          7  |   .6157484   .1194133     5.16   0.000     .3817027    .8497941
          8  |   .6776651   .1189213     5.70   0.000     .4445837    .9107465
          9  |   .8626735   .1176486     7.33   0.000     .6320865    1.093261
         10  |   .6715272   .1189685     5.64   0.000     .4383532    .9047012
         11  |   1.033571   .1167196     8.86   0.000     .8048051    1.262338
         12  |    1.05586    .116615     9.05   0.000     .8272985    1.284421
         13  |   1.050297   .1166407     9.00   0.000     .8216858    1.278909
         14  |   1.171248   .1161319    10.09   0.000     .9436331    1.398862
         15  |   1.384883   .1154872    11.99   0.000     1.158532    1.611234
         16  |   1.463414   .1153286    12.69   0.000     1.237375    1.689454
         17  |   1.499836   .1152689    13.01   0.000     1.273913    1.725759
         18  |   1.530954   .1152248    13.29   0.000     1.305117     1.75679
         19  |   1.654674   .1151121    14.37   0.000     1.429058    1.880289
         20  |   1.941035   .1152276    16.85   0.000     1.715193    2.166877
             |
       _cons |  -1.591796   .1459216   -10.91   0.000    -1.877797   -1.305795
-------------+----------------------------------------------------------------
caseid       |
   var(_cons)|   1.052621   .0676116                      .9281064     1.19384
------------------------------------------------------------------------------
LR test vs. logistic model: chibar2(01) = 1788.90     Prob >= chibar2 = 0.0000

In that model it is the closest to estimating the correct effect of self control (-0.5). It is still small (at -0.47), but the estimate is within one standard error of the true value. (Another way to estimate this model is to use xtlogit, but with melogit you can actually extract the random effects. That will have to wait until another blog post though.)

Another way to think about this model is related to item-response theory, where individuals can have a latent estimate of how smart they are, and questions can have a latent easiness/hardness. In Stata you might fit that by the code below, but a warning it takes awhile to converge. (Not sure why, the fixed effects for questions are symmetric, so assuming a random effect distribution should not be too far off. If you have any thoughts as to why let me know!)

*Model 6
melogit outcome_ self_control i.group || caseid: || question:

For an academic reference to this approach see Osgood et al.,(2002). Long story short, model the individual items, but pool them together in one model!

New working paper: Mapping attitudes towards the police at micro places

I have a new preprint posted, Mapping attitudes towards the police at micro places. This is work with Jasmine Silver, as well as Rob Worden and Sarah McLean. See the abstract:

We demonstrate the utility of mapping community satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. In each survey, respondents provided the nearest intersection to their address. We use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city, which shows broader neighborhood patterns of satisfaction as well as small area hot spots of dissatisfaction. Our results show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime and police activity. Models predicting satisfaction with police show that local counts of violent crime are the strongest predictors of attitudes towards police, even above individual level predictors of race and age.

In this article we make what are analogs of hot spot maps of crime, but measure dissatisfaction with the police.

 

One of the interesting findings is that these hot spots do not align nicely with census tracts (the tracts are generalized, we cannot divulge the location of the city). So the areas identified by each procedure would be much different.

 

As always, feel free to comment or send me an email if you have feedback on the article.

Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definately be improved both in coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

Presentation at ASC: Crime Data Visualization for the Future

At the upcoming American Society of Criminology conference in Philadelphia I will be presenting a talk, Crime Data Visualization for the Future. Here is the abstract:

Open data is a necessary but not sufficient condition for data to be transparent. Understanding how to reduce complicated information into informative displays is an important step for those wishing to understand crime and how the criminal justice system works. I focus the talk on using simple tables and graphs to present complicated information using various examples in criminal justice. Also I describe ways to effectively evaluate the size of effects in regression models, and make black box machine learning models more interpretable.

But I have written a paper to go with the talk as well. You can download that paper here. As always, if you have feedback/suggestions let me know.

Here are some example graphs of plotting the predictions from a random forest model predicting when restaurants in Chicago will fail their inspections.

I present on Wednesday 11/15 at 11 am. You can see the full session here. Here is a quick rundown of the other papers — Marcus was the one who put together the panel.

  • A Future Proposal for the Model Crime Report – Marcus Felson
  • Crime Data Warehouses and the future of Big Data in Criminology – Martin Andresen
  • Can We Unify Criminal Justice Data, Like the Dutch and the Nordics? – Michael Mueller-Smith

So it should be a great set of talks.


I also signed up to present a poster, Mapping Attitudes Towards the Police at Micro Places. This is work with Albany Finn Institute folks, including Jasmine Silver, Sarah McLean, and Rob Worden. Hopefully I will have a paper to share about that soon, but for a teaser on that here is an example map from that work, showing hot spots of dissatisfaction with the police estimated via inverse distance weighting. Update: for those interested, see here for the paper and here for the poster. Stop on by Thursday to check it out!

And here is the abstract:

We demonstrate the utility of mapping community satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. In each survey, respondents provided the nearest intersection to their address. We use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city, which shows broader neighborhood patterns of satisfaction as well as small area hot spots of dissatisfaction. Our results show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime and police activity. Models predicting satisfaction with police show that local counts of violent crime are the strongest predictors of attitudes towards police, even above individual level predictors of race and age.

My Protips for College

Last week I gave a lecture to prospective undergrad students here at UT Dallas. At the end of the talk I gave some pro-tips for undergrads — things if I could go back in time and talk to myself this is what I would say. Here are those points, with a bit more personal elaboration than I gave during my talk. This is for undergrads going for a four year degree, it is not really applicable to graduate students (and mostly not applicable to community college students).

Pick a Major

My first major piece of advice is that all students should pick a major. You should not go undeclared.

Personally I did not have this problem — I haphazardly picked criminal justice and enjoyed it enough to stick with it. I had several friends though get stuck in the quagmire of never being able to decide a major.

There are a few reasons I have this piece of advice. One is that you should really think about why you are going to college in the first place — you should have some cognizable career aspirations that guide why you are there. Those aspirations will obviously guide what major you choose. You don’t have to worry though if you are not fully committed to that major. It is totally normal for an 18 year old to not be mature enough to make that sort of commitment. If you go for biology, realize that lab work sucks, you can change majors. So picking a major at the start is not like marriage — it is more like going on a date.

Part of the issue of going undeclared is people have the expectation that they will figure out a subject they really like later on (via a class) and then choose a major. Haphazardly taking classes does not guarantee this though. This is especially the case with introduction classes — which can often suck compared to higher level classes. Going again with the analogy to dating, you should have multiple dates (classes) in a particular major before you decide if it is/is not for you (you could always get a crappy teacher in one class as well). Being undeclared you only get minimal exposure to any particular subject. So it is better to pick something, with the mind that you might change later on, as the class that changes your mind may never come.

Finally, you should always have those career goals in mind. You shouldn’t decide to be a professional tennis player because you liked that GenEd tennis course you were randomly assigned. So ultimately you need to base what you want to major in on factors outside of how much you like the subject material. Some career opportunities are not strongly tied to a major — you can be a cop with an English degree if you want. That is not always the case though.

An additional point related to this is that you have more opportunities/majors in larger universities. If you pick the small private university and you decide you don’t want to major in education, you will be hamstrung in what other potential majors you can choose from, which leads to my next point.

More Interesting >> Easier

Second piece of advice — don’t pick easier classes over ones you think will be more interesting.

For college students, this is really the first opportunity you have in your education to shape your schedule. (You have some agency in high school, but nothing like college.) When you go for a four year degree, about half of your college courses will be general education (GenEd). For example at Bloomsburg IIRC I had to take a collection of four classes in each area of Humanities, Math & Sciences, and Social Sciences — as well as additional writing classes. These you have to choose a particular subject area (e.g. Economics, History), but any class in that department counts towards your GenEd requirements.

Now the seemingly obvious choice is to take the introduction classes in these majors, as they will be easier. This is a mistake I made. First, the intro classes can be mind numbingly boring. Second, they aren’t guaranteed to even be less work than higher level courses. Third, and most important, it ends up being much easier to do the work/show up to an interesting class than it is to wade through a really boring one. Let me say that again another way — you will not go to class if it is really boring, and maybe fail. It is easier to put the work in for classes you enjoy.

This works the same when picking a major. STEM has a reputation as being harder. Yes, aero-space engineering is really hard, but the STEM field is not filled with geniuses — they are normal people. I’d hasten to say writing English good is easy not so. (Most social sciences you will write — alot.) Painting something people want to buy is probably alot harder than learning about biology. Teaching elementary children is madness. Take chemistry if you think it is interesting — don’t be afraid because of some stereotype that you think it will be really hard.

The example that most comes to mind for me was a history general education requirement. I had the option one semester to do either a 300 level course on Japanese history, or the 100 level course on something like 16th century western European history. I chose the western European course, being concerned the 300 level course would be more work. It was a terrible decision — the 100 level course was quite a bit of work, requiring writing several essays and reading very boring material. This variation is pretty typical — variation between teachers is much larger than variation in how hard the 100, 200, 300 level courses are. About the only guarantee is that a mass lecture course will be less work (mostly exams) compared to smaller courses, but those can be real stinkers. (Do research on the professors, grab a syllabus, ask someone who took the course previously, go talk to them during officer hours or send an email. You can often weed out if it is good class or a stinker before having to sit through it. Avoid generic classes taught by someone for the very first time.)

Some courses have pre-requisites, but in many cases you can ask permission to take the course even if you have not fulfilled them. You should just use your best judgment about whether the pre-requisites are really necessary or are just bureaucratic. My communities and crime undergrad has a pre-requisite for introduction to criminology, but I highly doubt an engaged student would not be able to follow along. Differential equations you should probably have a good grasp of calculus though (but something like linear algebra would probably be ok to take with just high school mathematics, so it is just dependent on the context).

In retrospect I would have also chosen various stats/econometrics courses in the economics department over the introduction to macro economics. It is really hard to convince yourself to wake up at 8 in the morning Tuesdays and Thursdays if you really hate the course. Again I did not take those higher level econ courses as I was worried about being able to pass, although I shouldn’t have been. Which leads to my next point.

Don’t be Afraid to Fail!

Third piece of advice is related to the second — don’t be afraid to fail. So you should not forgo a particular path because you are concerned about failing. It is ok to fail.

It is ok to fail for a few reasons. One is that most universities have built in the ability to drop courses. So if you take a physics class and can’t keep up with the material, you can drop the course about half way through the semester. It is pretty easy to make up those credits over the summer (if you even need to make them up at all). The system is made so you can fail a few times and it has no impact on your GPA.

The second reason it is ok to attempt harder classes is because it is ok to have B’s/C’s. Students have been primed for their whole education about the importance of grades. I’m here to tell you now that grades don’t matter all that much. The sky will not fall down if you get a C. The difference between say a 3.3 GPA and a 3.8 GPA means very little for the majority of potential career opportunities.

Now, I’m not saying you can totally stink up the joint with grades (many schools have a minimum GPA requirement), but most folks aren’t trying to get into John Hopkins medical school — you will be fine if you take a harder course and get a C — and you will probably learn more in that hard course than taking something easy but boring. You are ultimately paying for your education. You might as well take advantage of the opportunities to learn.

I do have a few C’s on my undergraduate transcript. One is that macro-economics I talked about earlier — I only showed up to around half of the classes. Another though is the Design of Experiments — the first statistics course in the math department I took. I had to get permission to be in the course, and I did not understand everything. But after the semester the professor told me I was one of the best students in the class (despite the C – the material was difficult). I subsequently took quite a few more stats classes from that professor (I got all A’s from then on out I’m pretty sure). Taking those harder classes really prepared me for graduate school compared to many within my cohort, as well as gave me a leg up in many more research oriented positions while I was at SUNY Albany.

Talk on Scholars Day – Crime in Space and Time

I will be giving a talk tomorrow (10/21/17) at Scholars Day here at UT Dallas (where we get visits from prospective students). Here is the synopsis of my talk:

Synopsis: In this lecture, Dr. Andrew Wheeler will discuss his research on the spatial and temporal patterns of crime. He will discuss whether recent homicide trends are atypical given historical data and if you can predict which neighborhoods in Dallas have the most crime. He will also discuss what to expect from an education in criminology and the social sciences in general.

I will be at JSOM 2.106 from 11 to 11:45. Here is a bit of a sneak peak. (You will also get some Han’s Rosling style animated charts of homicide trends!)

I will also discuss some of my general pro-tips for incoming college students. I will expand that into a short post next week, but if you want that advice a few days ahead come to my talk!

Some notes on PChange – estimating when trajectories cross over time

J.C. Barnes and company published a paper in JQC not too long ago and came up with a metric, PChange, to establish the number of times trajectories cross in a sample. This is more of interest to life course folks, although it is not totally far fetched to see it applied to trajectories of crime at places. Part of my interest in it was simply that it is an interesting statistical question — when two trajectories with errors cross. A seemingly simple question that has a few twists and turns. Here are my subsequent notes on that metric.

The Domain Matters

First, here is an example of the trajectories not crossing:

This points to an important assumption about those lines not crossing though that was never mentioned in the Barnes paper — the domain matters. For instance, if we draw those rays further back in time what happens?

They cross! This points to an important piece of information when evaluating PChange — the temporal domain in which you examine the data matters. So if you have a sample of juvenile delinquency measures from 14-18 you would find less change than a similar sample from 12-20.

This isn’t really a critique of PChange — it is totally reasonable to only want to examine changes within a specific domain. Who cares if delinquency trajectories cross when people are babies! But it should be an important piece of information researchers use in the future if they use PChange — longer samples will show more change. It also won’t be fair to compare PChange for samples of different lengths.

A Functional Approach to PChange

For above you may ask — how would you tell if a trajectory crosses outside of the domain of the data? The answer to that question is you estimate an underlying function of the trajectory — some type of model where the outcome is a function of age (or time). With that function you can estimate the trajectory going back in time or forward in time (or in between sampled measurements). You may not want to rely on data outside of the domain (its standard error will be much higher than data within the time domain, forecasting is always fraught with peril!), but the domain of your sample is ultimately arbitrary. So what about the question will the trajectories ever cross? Or would the trajectories have crossed if I had data for ages 12-20 instead of just 16-18? Or would they have crossed if I checked the juveniles at age 16 1/2 instead of only at 16?

So actually instead of the original way the Barnes paper formulated PChange, here is how I thought about calculating PChange. First you estimate the underlying trajectory for each individual in your sample, then you take the difference of those trajectories.

y_i = f(t)
y_j = g(t)
y_delta = f(t) - g(t) = d(t)

Where y_i is the outcome y for observation i, and y_j is the outcome y for observation j. t is a measure of time, and thus the anonymous functions f and g represent growth models for observations i and j over time. y_delta is then the difference between these two functions, which I represent as the new function d(t). So for example the functions for each individual might be quadratic in time:

y_i = b0i + b1i(t) + b2i(t^2)
y_j = b0j + b1j(t) + b2j(t^2)

Subsequently the difference function will also be quadratic, and can be simply represented as:

y_delta = (b0i - b0j) + (b1i - b1j)*t + (b2i - b2j)*t^2

Then for the trajectories to cross (or at least touch), y_delta just then has to equal zero at some point along the function. If this were math, and the trajectories had no errors, you would just set d(t) = 0 and solve for the roots of the equation. (Most people estimating models like these use functions that do have roots, like polynomials or splines). If you cared about setting the domain, you would then just check if the roots are within the domain of interest — if they are, the trajectories cross, if they are not, then they do not cross. For data on humans with age, obviously roots for negative human years will not be of interest. But that is a simple way to solve the domain problem – if you have an underlying estimate of the trajectory, just see how often the trajectories cross within equivalent temporal domains in different samples.

I’d note that the idea of having some estimate of the underlying trajectory is still relevant even within the domain of the data — not just extrapolating to time periods outside. Consider two simple curves below, where the points represent the time points where each individual was measured.

So while the two functions cross, when only considering the sampled locations, like is done in Barnes et al.’s PChange, you would say these trajectories do not cross, when in actuality they do. It is just the sampled locations are not at the critical point in the example for these two trajectories.

This points to another piece of contextual information important to interpreting PChange — the number of sample points matter. If you have samples every 6 months, you will likely find more changes than if you just had samples every year.

I don’t mean here to bag on Barnes original metric too much — his PChange metric does not rely on estimating the underlying functional form, and so is a non-parametric approach to identifying change. But estimating the functional form for each individual has some additional niceties — one is that you do not need the measures to be at equivalent sample locations. You can compare someone measured at 11, 13, and 18 to someone who is measured at 12, 16, and 19. For people analyzing stuff for really young kids I bet this is a major point — the underlying function at a specific age is more important then when you conveniently measured the outcome. For older kids though I imagine just comparing the 12 to 11 year old (but in the same class) is probably not a big deal for delinquency. It does make it easier though to compare say different cohorts in which the measures are not at nice regular intervals (e.g. Add Health, NLYS, or anytime you have missing observations in a longitudinal survey).

In the end you would only want to estimate an underlying functional form if you have many measures (more so than 3 in my example), but this typically ties in nicely with what people modeling the behavior over time are already doing — modeling the growth trajectories using some type of functional form, whether it is a random effects model or a group based trajectory etc., they give you an underlying functional form. If you are willing to assume that model is good enough to model the trajectories over time, you should think it is good enough to calculate PChange!

The Null Matters

So this so far would be fine and dandy if we had perfect estimates of the underlying trajectories. We don’t though, so you may ask, even though y_delta does not exactly equal zero anywhere, its error bars might be quite wide. Wouldn’t we then still infer that there is a high probability the two trajectories cross? This points to another hidden assumption of Barnes PChange — the null matters. In the original PChange the null is that the two trajectories do not cross — you need a sufficient change in consecutive time periods relative to the standard error to conclude they cross. If the standard error is high, you won’t consider the lines to cross. Consider the simple table below:

Period A_Level A_SE B_Level B_SE
1      4         1    1.5   0.5     
2      5         1    3     0.5
3      6         1    4.5   0.5
4      7         1    6     0.5

Where A_Level and B_Level refer to the outcome for the four time periods, and A_SE and B_SE refer to the standard errors of those measurements. Here is the graph of those two trajectories, with the standard error drawn as areas for the two functions (only plus minus one standard error for each line).

And here is the graph of the differences — assuming that the covariance between the two functions is zero (so the standard error of the difference equals sqrt(A_SE^2 + B_SE^2)). Again only plus/minus one standard error.

You can see that the line never crosses zero, but the standard error area does. If our null is H0: y_delta = 0 for any t, then we would fail to reject the null in this example. So in Barnes original PChange these examples lines would not cross, whereas with my functional approach we don’t have enough data to know they don’t cross. This I suspect would make a big difference in many samples, as the standard error is going to be quite large unless you have very many observations and/or very many time points.

If one just wants a measure of crossed or did not cross, with my functional approach you could set how wide you want to draw your error bars, and then estimate whether the high or low parts of that bar cross zero. You may not want a discrete measure though, but a probability. To get that you would integrate the probability over the domain of interest and calculate the chunk of the function that cross zero. (Just assume the temporal domain is uniform across time.)

So in 3d, my difference function would look like this, whereas to the bottom of the wall is the area to calculate the probability of the lines crossing, and the height of the surface plot is the PDF at that point. (Note the area of the density is not normalized to sum to 1 in this plot.)

This surface graph ends up crossing more than was observed in my prior 2d plots, as I was only plotting 1 standard error. Here imagine that the top green part of the density is the mean function — which does not cross zero — but then you have a non-trivial amount of the predicted density that does cross the zero line.

In the example where it just crosses one time by a little, it seems obvious to consider the small slice as the probability of the two lines crossing. I think to extend this to not knowing to test above or below the line you could calculate the probability on either side of the line, take the minimum, and then double that minimum for your p-value. So if say 5% of the area is below the line in my above example, you would double it and say the two-tailed p-value of the lines crossing is p = 0.10. Imagine the situation in which the line mostly hovers around 0, so the mass is about half on one side and half on the other. In that case the probability the lines cross seems much higher than 50%, so doubling seems intuitively reasonable.

So if you consider this probability to be a p-value, with a very small p-value you would reject the null that the lines cross. Unlike most reference distributions for p-values though, you can get a zero probability estimate of the lines crossing. You can aggregate up those probabilities as weights when calculating the overall PChange for the sample. So you may not know for certain if two trajectories cross, but you may be able to say these two trajectories cross with a 30% probability.

Again this isn’t to say that PChange is bad — it is just different. I can’t give any reasoning whether assuming they do cross (my functional approach) or assuming they don’t cross (Barnes PChange) is better – they are just different, but would likely make a large difference in the estimated number of crossings.

Population Change vs Individual Level Change

So far I have just talked about trying to determine whether two individual lines cross. For my geographic analysis of trajectories in which I have the whole population (just a sample in time), this may be sufficient. You can calculate all pairwise differences and then calculate PChange (I see no data based reason to use the permutation sample approach Barnes suggested – we don’t have that big of samples, we can just calculate all pairwise combinations.)

But for many of the life course researchers, they are more likely to be interested in estimating the population of changes from the samples. Here I will show how you can do that for either random effects models, or for group based trajectory models based on the summary information. This takes PChange from a sample metric to a population level metric implied via your models you have estimated. This I imagine will be much easier to generalize across samples than the individual change metrics, which will be quite susceptible to outlier trajectories, especially in small samples.

First lets start with the random effects model. Imagine that you fit a linear growth model — say the random intercept has a variance of 2, and the random slope has a variance of 1. So these are population level metrics. The fixed effects and the covariance between the two random effect terms will be immaterial for this part, as I will discuss in a moment.

First, trivially, if you selected two random individuals from the population with this random effects distribution, the probability their underlying trajectories cross at some point is 1. The reason is for linear models, two lines only never cross if the slopes are perfectly parallel. Which sampling from a continuous random distribution has zero probability of them being exactly the same. This does not generalize to more complicated functions (imagine parabolas concave up and concave down that are shifted up and down so they never cross), but should be enough of a motivation to make the question only relevant for a specified domain of time.

So lets say that we are evaluating the trajectories over the range t = [10,20]. What is the probability two individuals randomly sampled from the population will cross? So again with my functional difference approach, we have

y_i = b0i + b1i*t
y_j = b0j + b1j*t
y_delta = (b0i - b0j) + (b1i - b1j)*t

Where in this case the b0 and b1 have prespecified distributions, so we know the distribution of the difference. Note that in the case with no covariates, the fixed effects will cancel out when taking the differences. (Generalizing to covariates is not as straightforward, you could either assume they are equal so they cancel out, or you could have them vary according to additional distributions, e.g. males have an 90% chance of being drawn versus females have a 10% chance, in that case the fixed effects would not cancel out.) Here I am just assuming they cancel out. Additionally, taking the difference in the trajectories also cancels out the covariance term, so you can assume the covariance between (b0i - b0j) and (b1i - b1j) is zero even if b0 and b1 have a non-zero covariance for the overall model. (Post is long enough — I leave that as an exercise for the reader.)

For each of the differences the means will be zero, and the variance will be the two variances added together, e.g. b0i - b0j will have a mean of zero and a variance of 2 + 2 = 4. The variance of the difference in slopes will then be 2. Now to figure out when the two lines will cross.

If you make a graph where the X axis is the difference in the intercepts, and the Y axis is the difference in the slopes, you can then mark off areas that indicate the two lines will cross given the domain. Here for example is a sampling of where the lines cross – red is crossing, grey is not crossing.

So for example, say we had two random draws:

y_i = 1   + 0.5*t
y_j = 0.5 + 0.3*t
y_delta = 0.5 + 0.2*t

This then shows that the two lines do not cross when only evaluating t between 10 and 20. They have already diverged that far out (you would need negative t to have the lines cross). Imagine if y_delta = -6 + 0.2*t though, this line does cross zero though, at t = 10 this function equals -1, whereas at t = 20 the function equals 4.

If you do another 3d plot you can plot the bivariate PDF. Again integrate the chunks of areas in which the function crosses zero, and voila, you get your population estimate.

This works in a similar manner to higher order polynomials, but you can’t draw it in a nice graph though. I’m blanking at the moment of a way to find these areas offhand in a nice way — suggestions welcome!

This gets a bit tricky thinking about in relation to individual level change. This approach does not assume any error in the random draws of the line, but assumes the draws will have a particular distribution. So the PChange does not come from adding up whether individual lines in your sample cross, it comes from the estimated distribution of what the difference in two randomly drawn lines would look like that is implied by your random effects model. Think if based on your random effect distribution you randomly drew two lines, calculated if they crossed, and then did this simulation a very large number of times. The integrations I’m suggesting are just an exact way to calculate PChange instead of the simulation approach.

If you were to do individual change from your random effects model you would incorporate the standard error of the estimated slope and intercept for the individual observation. This is for your hypothetical population though, so I see no need to incorporate any error.

Estimating population level change from group based trajectory models via my functional approach is more straightforward. First, with my functional approach you would assume individuals who share the same latent trajectory will cross with a high probability, no need to test that. Second, for testing whether two individual trajectories cross you would use the approach I’ve already discussed around individual lines and gain the p-value I mentioned.

So for example, say you had a probability of 25% that a randomly drawn person from group A would cross a randomly drawn person from Group B. Say also that Group A has 40/100 of the sample, and Group B is 60/100. So then we have three different groups: A to A, B to B, and A to B. You can then break down the pairwise number in each group, as well as the number of crosses below.

Compare   N    %Cross Cross
A-A      780    100    780
B-B     1770    100   1770
A-B     2400     25    600
Total   4950     64   3150

So then we have a population level p-change estimate of 64% from our GBTM. All of these calculations can be extended to non-integers, I just made them integers here to simplify the presentation.

Now, that overall PChange estimate may not be real meaningful for GBTM, as the denominator includes pairwise combinations of folks in the same trajectory group, which is not going to be of much interest. But just looking at the individual group solutions and seeing what is the probability they cross could be more informative. For example, although Barnes shows the GBTM models in the paper as not crossing, depending on how wide the standard errors of the functions are (that aren’t reported), this functional approach would probably assign non-zero probability of them crossing (think low standard error for the higher group crossing a high standard error for the low group).


Phew — that was a long post! Let me know in the comments if you have any thoughts/concerns on what I wrote. Simple question — whether two lines cross — not a real simple solution when considering the statistical nature of the question though. I can’t be the only person to think about this though — if you know of similar or different approaches to testing whether two lines cross please let me know in the comments.

Notes on using UCR data for class projects

Students in my classes often want to use UCR reported data for projects. One thing many don’t realize though is that the UCR data reported to the FBI is only aggregate statistics at regular intervals for the entire jurisdiction. So for example one can’t look at hot spots using reported UCR data.

If you do have a hypothesis that can be reasonably examined using monthly or yearly data at the jurisdiction level, here are a few notes on using UCR data. First is that you can get the most detailed downloads of data from ICPSR. That link has data series going back to 1960, and ends up being about two years behind (e.g. it is close to the end of 2017, and only 2015 data is available).

The datasets on ICPSR have monthly data for Part 1 crime types, as well as some information on arrests and clearances. Also they have all of the individual agencies, along with their ORI code. The ORI code allows you to link agencies over time.

While the FBI does have a page for more up to date UCR data (they just released the 2016 stats, so they are about a year behind), they are much more limited in the types of tables they disseminate. There typically is one table for Part 1 crime rates for individual large cities for each year, but otherwise it is aggregated to different city sizes. So most data analyses need to use the ICPSR data — the data directly from the FBI is not detailed enough.

For those wishing to map the data, it ends up being a bit tricky. Most people in the US are probably under the jurisdiction of at least two police departments — the local PD and the state police. Many people are also under the jurisdiction of a local sheriff. So many of these police agencies have overlapping boundaries. There is no easy source of the geographic boundaries for the police departments, but the ICPSR data does contain the zipcode for the headquarters for the police department. This won’t be accurate for state police — but should be suitable for mapping purposes for local agencies and sheriffs (sheriffs are sometimes organized at the county level). If you want polygon data for jurisdictional boundaries you will need to search for individual agencies and political boundaries — there is no easy source to download them all at once. Many rural areas will have police departments cover multiple towns, but if you stick to more urban areas you might be able to use city boundaries.

The ICPSR data has crime reports aggregated to the county level, so if that level of aggregation is not problematic you may use that data directly. You should be aware of many of the complaints about UCR data quality though. Mike Maltz has written a bit about it, but there are quite a few other folks who have noticed problems with reporting in the UCR data. The main problem to watch out for is missing data being accidentally reported as zero crimes occurring.

To stack datasets from different years from ICPSR is not too difficult if you are not going too far back in time. But if you go back to the older data, ICPSR changed the variable order. The variables are simply listed as V1 TO V100 something, so for example V15 in 1979 is not the same variable as V15 in 2005. My notes say they used the same variable order from 1998-2015, but you will want to check that yourself (I downloaded the SPSS files, it would not surprise me if the datasets differed for some of the years.)

Some additional resources students may want to familiarize themselves with to gather UCR data more quickly are the FBI UCR data tool and Mike Maltz’s cleaned up dataset and notes on how he made it. You should probably just use Mike Maltz’s dataset if you are using data over time.

If you are just interested in yearly homicides, I have provided a dataset of cleaned up homicides that goes back to 1960, see my paper that goes along with that dataset on graphing temporal homicide trends (mapping those trends could be an interesting project as well!)

New working paper: The effect of housing demolitions on crime in Buffalo, New York

I have a new working paper up, The effect of housing demolitions on crime in Buffalo, New York. This is in conjunction with my colleagues Dae-Young Kim and Scott Phillips, who are at SUNY Buffalo. Below is the abstract.

Objectives: From 2010 through 2015, the city of Buffalo demolished over 2,000 residences. This study examines whether those demolitions resulted in crime reductions.

Methods: Analysis was conducted at micro places matching demolished parcels to comparable control parcels with similar levels of crime. In addition, spatial panel regression models were estimated at the census tract and quarterly level, taking into account demographic characteristics of neighborhoods.

Results: We find that at the micro place level, demolitions cause a steep drop in reported crime at the exact parcel, and result in additional crime decreases at buffers of up to 1,000 feet away. At the census tract level, results indicated that demolitions reduced Part 1 crimes, but the effect was not statistically significant across different models.

Conclusions: While concerns over crime and disorder are common for vacant houses, the evidence that housing demolitions are an effective crime reduction solution is only partially supported by the analyses here. Future research should compare demolitions in reference to other neighborhood revitalization processes.

As always, if you have feedback/comments let me know.

And here are a few maps from the paper!

Don’t include temporal lags of crime in cross-sectional crime models

In my 311 and crime paper a reviewer requested I conduct cross-lagged models. That is, predict crime in 2011 while controlling for prior counts of crime in 2010, in addition to the other specific variables of interest (here 311 calls for service). In the supplementary material I detail why this is difficult with Poisson models, as the endogenous effect will often be explosive in Poisson models, something that does not happen as often in linear models.

There is a second problem though with cross-lagged models I don’t discuss though, and it has to do with how what I think a reasonable data generating process for crime at places can cause cross-lagged models to be biased. This is based on the fact that crime at places tends to be very temporally stable (see David Weisburd’s, or Martin Andresen’s, or my work showing that). So when you incorporate temporal lags of crime in models, this makes the other variables of interest (311 calls, alcohol outlets, other demographics, whatever) biased, because they cause crime in the prior time period. This is equivalent to controlling for an intermediate outcome. For examples of this see some of the prior work on the relationship between crime and disorder by Boggess and Maskaly (2014) or O’Brien and Sampson (2015).1

So Boggess and Maskaley (BM) and O’Brien and Sampson (OS) their simplified cross-lagged model is:

(1) Crime_post = B0*Crime_pre + B1*physicaldisorder_pre

Where the post and pre periods are yearly counts of crime and indicators of physical disorder. My paper subsequently does not include the prior counts of crime, but does lag the physical disorder measures by a year to ensure they are exogenous.

(2) Crime_post = B1*physicaldisorder_pre

There are a few reasons to do these lags. The most obvious is to make explanatory variable of broken windows exogenous, by making sure it is in the past. The reasons for including lags of crime counts are most often strictly as a control variable. There are some examples where crime begets more crime directly, such as retaliatory violence, (or see Rosenfeld, 2009) but most folks who do the cross-lagged models do not make this argument.

Now, my whole argument rests on what I think is an appropriate model explaining counts of crime at places. Continuing with the physical disorder example, I think a reasonable cross-sectional model of crime at places is that there are some underlying characteristics of locations that tend to be pretty stable over fairly long periods of time, and then we have more minor stuff like physical disorder that provide small exogenous shocks to the system over time.

(3) Crime_i = B0*(physicaldisorder_i) + Z_i

Where crime at location i is a function of some fixed characteristic Z. I can’t prove this model is correct, but I believe it is better supported by data. To support this position, I would refer to the incredibly high correlations between counts of crime at places from year to year. This is true of every crime dataset I have worked with (at every spatial unit of analysis), and is a main point of Shaw and McKay’s work plus Rob Sampsons for neighborhoods in Chicago, as well as David Weisburd’s work on trajectories of crime at street segments in Seattle. Again, this very high correlation doesn’t strike me as reasonably explained by crime causes more crime, what is more likely is that there are a set of fixed characteristics that impact criminal behavior at a certain locations.

If a model of crime is like that in (3), there are then two problems with the prior equations. The first problem for both (1) and (2) is that lagging physical disorder measures by a year does not make any sense. The idea behind physical disorder (a.k.a. broken windows) is that visible signs of disorder prime people to behave in a particular way. The priming presumably needs to be recent to affect behavior. But this can simply be solved by not lagging physical disorder by a year in the model. The lagged physical disorder effect might approximate the contemporaneous effect, if physical disorder itself is temporally consistent over long periods. So if say we replace physical disorder with locations of bars, the lagged effect of bars likely does not make any difference, between bars don’t turn over that much (and when they do they are oft just replaced by another bar).

But what if you still include the lags of crime counts? One may think that this controls for the omitted Z_i effect, but the effect is very bad for the other exogenous variables, especially lagged ones or temporally consistent ones. You are probably better off with the omitted random effect, because crime in the prior year is an intermediate outcome. I suspect this bias can be very large, and likely biases the effects of the other variables towards zero by quite alot. This is because effect of the fixed characteristic is large, the effect of the exogenous characteristic is smaller, and the two are likely correlated at least to a small amount.

To show this I conduct a simulation. SPSS Code here to replicate it. The true model I simulated is:

(4)  BW_it = 0.2*Z_i + ew_it
(5)  Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it`

I generated this for 25,000 locations and two time points (the t subscript), and all the variables are set to have a variance of 1 (all variables are normally distributed). The error terms (ew_it and ec_it) are not correlated, and are set to whatever value is necessary so the resultant variable on the left hand side has a variance of 1. With so many observations one simulation run is pretty representative of what would happen even if I replicated the simulation multiple times. This specification makes both BW (to stand for broken windows) and Z_i correlated.

In my run, what happens when we fit the cross-lagged model? The effect estimates are subsequently:

Lag BW:   -0.07
Lag Crime: 0.90

Yikes – effect of BW is in the opposite direction and nearly as large as the true effect. What about if you just include the lag of BW?

Lag BW: 0.22

The reason this is closer to the true effect is because of some round-about-luck. Since BW_it is correlated with the fixed effect Z_i, the lag of BW has a slight correlation to the future BW. This potentially changes how we view the effects of disorder on crime though. If BW is more variable, we can make a stronger argument that it is exogenous of other omitted variables. If it is temporally consistent it is harder to make that argument (it should also reduce the correlation with Z_i).

Still, the only reason this lag has a positive effect is that Z_i is omitted. For us to make the argument that this approximates the true effect, we have to make the argument the model has a very important omitted variable. Something one could only do as an act of cognitive dissonance.

How about use the contemporaneous effect of BW, but still include the lag counts of crime?

BW:        0.13
Lag Crime: 0.86

That is not as bad, because the lag of crime is now not an intermediate outcome. Again though, if we switch BW with something more consistent in time, like locations of bars, the lag will be an intermediate outcome, and will subsequently bias the effect. So what about a model of the contemporaneous effect of BW, omitting Z_i? The contemporaneous effect of BW will still be biased, since Z_i is omitted from the model.

BW: 0.32

But a way to reduce this bias is to introduce other control variables that approximate the omitted Z_i. Here I generate a set of 10 covariates that are a function of Z_i, but are otherwise not correlated with BW nor each other.

(6) Oth_it = 0.5*Z_i + eoth_it

Including these covariates in the model progressively reduces the bias. Here is a table for the reduction in the BW effect for the more of the covariates you add in, e.g. with 2 means it includes two of the control variables in the model.

BW (with 0):  0.32
BW (with 1):  0.25
BW (with 2):  0.21
BW (with 3):  0.19
BW (with 10): 0.14

So if you include other cross-sectional covariates in an attempt to control for Z_i it brings the effect of BW closer to its true effect. This is what I believe happens in the majority of social science research that use strictly cross-sectional models, and is a partial defense of what people sometimes refer to kitchen sink models.

So in brief, I think using lags of explanatory variables and lags of crime in the same model are very bad, and can bias the effect estimates quite alot.

So using lags of explanatory variables and lags of crime counts in cross-sectional models I believe are a bad idea for most research designs. It is true that it makes it their effects exogenous, but it doesn’t eliminate the more contemporaneous effect of the variable, and so we may be underestimating the effect to a very large extent. Whether of not the temporal lag effects crime has to do with how the explanatory variable itself arises, and so the effect estimated by the temporal lag is likely to be misleading (and may be biased upward or downward depending on other parts of the model).

Incorporating prior crime counts is likely to introduce more bias than it solves I think for most cross-lagged models. I believe simply using a cross-sectional model with a reasonable set of control variables will get you closer to the real effect estimates than the cross-lagged models. If you think Z_i is correlated with a variable of interest (or lags of crime really do cause future crime) I think you need to do the extra step and have multiple time measures and fit a real panel data model, not just a cross lagged one.

I’m still not sure though when you are better off fitting a panel model versus expanding the time for the cross-section though. For one example, I think you are better off estimating the effects of demographic variables in a cross-sectional model, as opposed to a panel one, over a short period of time, (say less than 10 years). This is because demographic shifts simply don’t occur very fast, so there is little variance within units for a short panel.


  1. I actually came up with the idea of using 311 calls independently of Dan O’Brien’s work, see my prospectus in 2013 in which I proposed the analysis. So I’m not totally crazy – although was alittle bummed to miss the timing abit! Four years between proposing and publishing the work is a bit depressing as well.