Some notes on synthetic control and Hogan/Kaplan

This will be a long one, but I have some notes on synthetic control and the back-and-forth between two groups. So first, if you aren’t familiar, Tom Hogan published an article on the progressive District Attorney (DA) in Philadelphia, Larry Krasner, in which Hogan estimates that Krasner’s time in office contributed to a large increase in the number of homicides. The control homicides are estimated using a statistical technique called synthetic control, in which you derive estimates of the homicide trend to compare Philly against, based on a weighted average of comparison cities.

Kaplan and colleagues (KNS from here on) then published a critique of various methods Hogan used to come up with his estimate. KNS provided estimates using different data and a different method to derive the weights, showing that Philadelphia did not have increased homicides after Krasner was elected. For reference, the pieces are Hogan (2022a), Kaplan et al. (2022), and Hogan’s reply (2022b), all cited in full at the end of the post.

Part of the reason I am writing this is if people care enough, you could probably make similar back and forths on every synth paper. There are many researcher degrees of freedom in the process, and in turn you can make reasonable choices that lead to different results.

I think it is worthwhile digging into those in more detail though. For a summary of the method notes I discuss for this particular back and forth:

Researchers determine the treatment estimate they want (counts vs rates) – solvers misbehaving is not a reason to change your treatment effect of interest
The default synth estimator when matching on counts and pop can have some likely unintended side-effects (NYC pretty much has to be one of the donor cities in this dataset)
Covariate balancing is probably a red-herring (so the data issues Hogan critiques in response to KNS are mostly immaterial)

In my original draft I had a note that this post would not be in favor of Hogan nor KNS, but in reviewing the sources more closely, nothing I say here conflicts with KNS (and I will bring a few more critiques of Hogan’s estimates that KNS do not mention). So I can’t argue much with KNS’s headline that Hogan’s estimates are fatally flawed.

An overview of synthetic control estimates

To back up and give an overview of what synth is for general readers, imagine we have a hypothetical city A with homicide counts 10 15 30, where the 30 is after a new DA has been elected. Is the 30 more homicides than you would have expected absent that new DA? To answer this, we need to estimate a counterfactual trend – what the homicide count would have been in a hypothetical world in which a new progressive DA was not elected. You can see the city homicides increased the prior two years, from 10 to 15, so you may say “ok, I expected it to continue to increase at the same linear trend”, in which case you would have expected it to increase to 20. So the counterfactual estimated increase in that scenario is observed - counterfactual, here 30 - 20 = 10, an estimated increase of 10 homicides that can be causally attributed to the progressive DA.

Social scientists tend not to simply extrapolate prior trends from the same location into the future. There could be widespread changes that occur everywhere that caused the increase in city A. If homicide rates accelerated in every city in the country, even those without a new progressive DA, it is likely something else is causing those increases. So say we compare city A to city B, and city B had a homicide count trend during the same time period of 10 15 35. Before the new DA in city A, cities A and B had the same pre-trend (both 10 15). In the post time period, city B increased to 35 homicides. So if we use city B as the counterfactual estimate, the progressive DA reduced homicides by 5, again observed - counterfactual = 30 - 35 = -5. So even though city A increased, it increased less than we expected based on the comparison city B.
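The arithmetic for the two counterfactuals can be written out in a few lines. This is only a sketch of the toy numbers in the text (cities A and B are hypothetical, and the variable names are mine):

```python
# Hypothetical cities from the text: two pre-period years, then one post-period year
city_a = [10, 15, 30]
city_b = [10, 15, 35]

# Counterfactual 1: extend city A's own linear pre-trend one year forward
trend_cf = city_a[1] + (city_a[1] - city_a[0])   # 15 + 5 = 20
effect_trend = city_a[2] - trend_cf              # observed - counterfactual

# Counterfactual 2: use city B's post-period value
# (defensible here only because A and B match exactly in the pre-period)
effect_city_b = city_a[2] - city_b[2]

print(effect_trend, effect_city_b)
```

Same observed data, two different counterfactuals, and the sign of the estimated effect flips.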

Note that this is not a hypothetical concern, it is a pretty basic one that you should always be concerned about when examining macro level crime data. There have been national level homicide increases over the time period when Krasner has been in office (Yim et al., 2020, and see this blog post for updates). U.S. city homicide rates tend to be very correlated with each other (McDowall & Loftin, 2009).

So even though Philly has increased in homicide counts/rates when Krasner has been in office, the question is are those increases higher or lower than we would expect. That is where the synthetic control method comes in, we don’t have a perfect city B to compare to Philadelphia, so we create our own “synthetic” counter-factual, based on a weighted average of many different comparison cities.

To make the example simple, imagine we have two potential control cities and homicide trends, city C1 0 30 20, and city C2 20 0 30. Neither looks like a good comparison to city A that has trends 10 15 30. But if we do a weighted average of C1 and C2, with the weights 0.5 for each city, when combined they are a perfect match for the two pre-treatment periods:

C1  C2 Average cityA
 0  20   10     10
30   0   15     15
20  30   25     30

This is what the synthetic control estimator does, although instead of giving equal weights it determines the optimal weights to match the pre-treatment time period given many potential donors. In real data, for example, C1 and C2 may be given weights of 0.2 and 0.8 to give the correct balance based on the time periods prior to treatment.
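To make the weight search concrete, here is a minimal sketch using the toy C1/C2 numbers above. With two donors and weights that sum to 1 there is only one free weight, so a simple grid search stands in for the real optimizer (this is not what the R synth package does internally, just an illustration):

```python
import numpy as np

# Rows are the three time periods (two pre-treatment, one post-treatment)
c1 = np.array([0.0, 30.0, 20.0])
c2 = np.array([20.0, 0.0, 30.0])
city_a = np.array([10.0, 15.0, 30.0])
pre = slice(0, 2)  # the first two periods are pre-treatment

# w is the weight on C1, (1 - w) the weight on C2; both stay in [0, 1]
grid = np.linspace(0, 1, 101)
loss = [np.sum((w * c1[pre] + (1 - w) * c2[pre] - city_a[pre]) ** 2) for w in grid]
w_best = grid[int(np.argmin(loss))]

# Apply the pre-period weights to all periods to get the synthetic city
synthetic = w_best * c1 + (1 - w_best) * c2
effect = city_a[2] - synthetic[2]  # observed - counterfactual in the post period
print(w_best, synthetic, effect)
```

The search only ever sees the pre-treatment rows; the post-period row is used just once, at the end, to read off the estimate.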

The fundamental problem with synth

The rub with estimating the synth weights is that there is no one correct way to estimate the weights – you have more numbers to estimate than data points. In the Hogan paper, he has 5 pre time periods, 2010-2014, and he has 82 potential donors (99 other of the largest cities in the US minus 17 progressive prosecutors). So you need to learn 82 numbers (the weights) based on 5 data points.

Side note: you can also consider matching on covariates additional data points, although I will go into more detail on how matching on covariates is potentially a red-herring. Hogan I think uses an additional 5*3=15 time varying points (pop, cleared homicide, homicide clearance rates), and maybe 3 additional time invariant (median income, 1 prosecutor categorization, and homicides again!). So maybe has 5 + 15 + 3 = 23 data points to match on (so same fundamental problem, 23 numbers to learn 82 weights). I am just going to quote the full passage on Hogan (2022a) here where he discusses covariate matching:

The number of homicides per year is the dependent variable. The challenge with this synthetic control model is to use variables that both produce parallel trends in the pre-period and are sufficiently robust to power the post-period results. The model that ultimately delivered the best fit for the data has population, cleared homicide cases, and homicide clearance rates as regular predictors. Median household income is passed in as the first special predictor. The categorization of the prosecutors and the number of homicides are used as additional special predictors. For homicides, the raw values are passed into the model. Abadie (2021) notes that the underlying permutation distribution is designed to work with raw data; using log values, rates, or other scaling techniques may invalidate results.

This is the reason why replication code is necessary – it is very difficult for me to translate this to what Hogan actually did. “Special” predictors here are code words for the R synth package for time-invariant predictors. (I don’t know based on verbal descriptions how Hogan used time-invariant for the prosecutor categorization for example, just treats it as a dummy variable?) Also only using median income – was this the only covariate, or did he do a bunch of models and choose the one with the “best” fit (it seems maybe he did do a search, but doesn’t describe the search, only the end selected result).

I don’t know what Hogan did or did not do to fit his models. The solution isn’t to have people like me and KNS guess or have Hogan just do a better job verbally describing what he did, it is to release the code so it is transparent for everyone to see what he did.

So how do we estimate those 82 weights? Well, we typically have restrictions on the potential weights – such as the weights need to be positive numbers, and the weights should sum to 1. These are for a mix of technical and theoretical reasons (having the weights not be too large can reduce the variance of the estimator is a technical reason; we don’t want negative weights as we don’t think there are bizarro comparison areas that have opposite world trends is a theoretical one).

These are reasonable but ultimately arbitrary – there are many different ways to accomplish this weight estimation. Hogan (2022a) uses the R synth package, KNS use a newer method also advocated by Abadie & L’Hour (2021) (very similar, but tries to match to the closest single city, instead of weights for multiple cities). Abadie (2021) lists probably over a dozen different procedures researchers have suggested over the past decade to estimate the synth weights.

The reason I bring this up is because when you have a problem with 82 parameters and 5 data points, the problem isn’t “what estimator provides good fit to in-sample data” – you should be able to figure out an estimator that accomplishes good in-sample fit. The issue is whether that estimator is any good out-of-sample.
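A small sketch of why in-sample fit cannot settle the question. It reuses the toy C1/C2 donors from earlier and adds a third donor, C3, which is purely hypothetical and made up for this illustration. Several valid weight vectors (non-negative, summing to 1) reproduce city A's pre-period exactly, yet they disagree about the post period:

```python
import numpy as np

# Columns: C1, C2 (from the toy example) and C3 (hypothetical, added for illustration)
donors = np.array([
    [0.0, 20.0, 10.0],    # pre-period 1
    [30.0, 0.0, 15.0],    # pre-period 2
    [20.0, 30.0, 50.0],   # post-period
])
city_a_pre = np.array([10.0, 15.0])

candidates = {
    "half C1, half C2": np.array([0.5, 0.5, 0.0]),
    "all C3": np.array([0.0, 0.0, 1.0]),
    "mixture": np.array([0.25, 0.25, 0.5]),
}

perfect_fit = {}
post_predictions = {}
for name, w in candidates.items():
    # In-sample: does this weight vector reproduce the pre-period exactly?
    perfect_fit[name] = bool(np.allclose(donors[:2] @ w, city_a_pre))
    # Out-of-sample: what counterfactual does it imply for the post period?
    post_predictions[name] = float(donors[2] @ w)

print(perfect_fit)
print(post_predictions)
```

All three fit the pre-period perfectly, and the implied counterfactuals range from 25 to 50. With 82 donors and 5 data points the same thing happens at a much larger scale.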

Rates vs Counts

So besides the estimator used, you can break down 3 different arbitrary researcher data decisions that likely impact the final inferences:

outcome variable (homicide counts vs homicide per capita rates)
pre-intervention time periods (Hogan uses 2010-2014, KNS go back to 2000)
covariates used to match on

Let’s start with the outcome variable question, counts vs rates. So first, as quoted above, Hogan cites Abadie (2021) for saying you should prefer counts to rates, “Abadie (2021) notes that the underlying permutation distribution is designed to work with raw data; using log values, rates, or other scaling techniques may invalidate results.”

This has it backwards though – the researcher chooses whether it makes sense to estimate treatment effects on the count scale vs rates. You don’t goal switch your outcome because you think the computer can’t give you a good estimate for one outcome. So imagine I show you a single city over time:

        Y0    Y1    Y2
Count   10    15    20
Pop   1000  1500  2000

You can see although the counts are increasing, the rate is consistent over the time period. There are times I think counts make more sense than rates (such as cost-benefit analysis), but probably in this scenario the researcher would want to look at rates (as the shifting denominator is a simple explanation causing the increase in the counts).
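Working through the toy table above (the per 1,000 scaling is just my choice for readability):

```python
# Toy single-city series from the table above
counts = [10, 15, 20]
pop = [1000, 1500, 2000]

# Rate per 1,000 residents in each period
rates = [1000 * c / p for c, p in zip(counts, pop)]
print(rates)
```

The counts double while the rate is flat, so a count-scale analysis would attribute the increase to something other than the population growth that fully explains it.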

Hogan (2022b) is correct in saying that the population is not shifting over time in Philly very much, but this isn’t a reason to prefer counts. It suggests the estimator should not make a difference when using counts vs rates, which just points to the problematic findings in KNS (that making different decisions results in different inferences).

Now onto the point that Abadie (2021) says using rates is wrong for the permutation distribution – I don’t understand what Hogan is talking about here. You can read Abadie (2021) for yourself if you want. I don’t see anything about the permutation inferences and rates.

So maybe Hogan mis-cited and meant another Abadie paper – Abadie himself uses rates for various projects (he uses per-capita rates in the 2021 cited paper, Abadie et al., (2010) uses rates for another example), so I don’t think Abadie thinks rates are intrinsically problematic! Let me know if there is some other paper I am unaware of. I honestly can’t steelman any reasonable source where Hogan (2022a) came up with the idea that counts are good and rates are bad though.

Again, even if they were, it is not a reason to prefer counts vs rates, you would change your estimator to give you the treatment effect estimate you wanted.

Side note: Where I thought the idea with the problem with rates was going (before digging in and not finding any Abadie work actually saying there are issues with rates), was increased variance estimates with homicide data. So Hogan (2022a) estimates for the synth weights Detroit (0.468), New Orleans (NO) (0.334), and New York City (NYC) (0.198), here are those cities’ homicide rates graphed (spreadsheet with data + notes on sources).

You can see NO’s rate is very volatile, so is not a great choice for a matched estimator if using rates. (I have NO as an example in Wheeler & Kovandzic (2018), that much variance though is fairly normal for high crime not too large cities in the US, see Baltimore for example for even more volatility.) I could foresee someone wanting to make a weighted synth estimator for rates, either make the estimator a population weighted average, or penalize the variance for small rates. Maybe you can trick microsynth to do a pop weighted average out of the box (Robbins et al., 2017).

To discuss the Hogan results specifically, I suspect for example NYC being a control city with high weight in the Hogan paper, which superficially may seem good (both large cities on the east coast), actually isn’t a very good control area considering the differences in homicide trends (either rates or counts) over time. (I am also not so sure about describing NYC and New Orleans as “post-industrial” by Hogan (2022a) either. I mean this is true to the extent that all urban areas in the US are basically post-industrial, but they are not rust belt cities like Detroit.)

Here is for reference counts of homicides in Philly, Detroit, New Orleans, and NYC going back further in time:

NYC has such a crazy drop in the 90s, so let’s use the post 2000 data that KNS used to zoom in on the graph.

I think KNS are reasonable here to use 2000 as a cut point – it is more empirical based (post crime drop), in which you could argue the 90’s are a “structural break”, and that homicides settled down in most cities around 2000 (but still typically had a gradual decline). Given the strong national homicide trends though across cities (here is an example I use for class, superimposing Dallas/NYC/Chicago), I think using even back to the 60’s is easily defensible (moreso than limiting to post 2010).

It depends on how strict you want to be whether you consider these 3 cities “good” matches for the counts post 2010 in Hogan’s data. Detroit seems a good match on the levels and ok match on trends. NO is ok match on trends. NYC and NO balance each other in terms of matching levels, NYC has steeper declines though (even during the 2010-2014 period).

The last graph though shows where the estimated increases from Hogan (2022a) come from. Philly went up and those 3 other cities went down from 2015-2018 (and had small upward bumps in 2019).

Final point in this section, be careful what you wish for with sparse weights and sum to 1 in the synth estimate. What this means in practice when using counts and matching on pop size, is that you need lines that are above and below Philly on those dimensions. So to get a good match on pop, it needs to select at least one of NYC/LA/Houston (Chicago was eliminated due to having a progressive prosecutor). To get a good match on homicide counts, it also has to pick at least one city with more homicides per year as well, which limits the options to New York and Detroit (LA/Houston have lower overall homicide counts than Philly).

You can’t do the default Abadie approach for NYC for example (matching on counts and pop) – it will always have a bad fit when using comparison cities in the US as the donor pool. You either need to allow the weights to sum to larger than 1, or the lasso approach with an intercept is another option (so you only match on trend, not levels).
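The reason is that non-negative weights that sum to 1 can only produce values between the smallest and largest donor. A quick sketch with made-up numbers (the donor values and the target are hypothetical, not real city data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical donor values for one matching variable (made-up units)
donors = np.array([2.0, 4.0, 7.0, 9.0])

# Draw many weight vectors that are non-negative and sum to 1
weights = rng.dirichlet(np.ones(len(donors)), size=10_000)
synthetic = weights @ donors

# Every convex combination stays inside the donor range
low, high = donors.min(), donors.max()
inside = bool(np.all((synthetic >= low - 1e-9) & (synthetic <= high + 1e-9)))

# A treated unit larger than every donor can never be matched on levels;
# the closest any convex combination can get is the largest donor itself
target = 12.0
best_gap = target - high
print(inside, best_gap)
```

Allowing an intercept, or weights that sum to more than 1, is what lets the synthetic series move outside that range.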

Because matching on trends is what matters for proper identification in this design, not levels, this is all sorts of problematic with the data at hand. (This is also a potential problem with the KNS estimator as well. KNS note though they don’t trust their estimate offhand, their reasonable point is that small changes in the design result in totally different inferences.)

Covariates and Out of Sample Estimates

For sake of argument, say I said Hogan (2022a) is bunk, because it did not match on “per-capita annual number of cheese-steaks consumed”. Even though on its face this covariate is nonsense, how do you know it is nonsense? In the synthetic control approach, there is no empirical, falsifiable way to know whether a covariate is a correct one to match on. There is no way to know that median income is better than cheese-steaks.

If you wish for more relevant examples, Philly has obviously more issues with street consumption of opioids than Detroit/NOLA/NYC, which others have shown relationships to homicide and has been getting worse over the time Krasner has been in office (Rosenfeld et al., 2023). (Or more simply social disorganization is the more common way that criminologists think about demographic trends and crime.)

This uncertainty in “what demographics to control for” is ok though, because matching on covariates is neither necessary nor sufficient to ensure you have estimated a good counter-factual trend. Abadie in his writings intended for covariates to be more like fuzzy guide-rails – they are qualitative things that you think the comparison areas should be similar on.

Because there is effectively an infinite pool of potential covariates to match on, I prefer the approach of simply limiting the donor pool a priori – Hogan limiting to large cities is on its face reasonable. Including other covariates is not necessary, and does not make the synth estimate more or less robust. Whether KNS used good or bad data for covariates is entirely a red-herring as to the quality of the final synth estimate.

Side note: I don’t doubt that Hogan got advice to not share data and code. It is certainly not normative in criminology to do this. It creates a bizarre situation though, in which someone can try to replicate Hogan by collating original sources, and then Hogan always comes back and says “no, the data you have are wrong” or “the approach you did is not exactly replicating my work”.

I get that collating data takes a long time, and people want to protect their ability to publish in the future. (Or maybe just limit their exposure to their work being criticized.) It is blatantly antithetical to verifying the scientific integrity of people’s work though.

Even if Hogan is correct though in the covariates that KNS used are wrong, it is mostly immaterial to the quality of the synth estimates. It is a waste of time for outside researchers to even bother to replicate Hogan’s covariates he used.

So I used the idea of empirical/falsifiable – can anything associated with synth be falsifiable? Why yes it can – the typical approach is to do some type of leave-one-out estimate. It may seem odd because synth estimates an underlying match to a temporal trend in the treated location, but there is nothing temporal about the synth estimate. You could jumble up the years in the pre-treatment sample and still would estimate the same weights.

Because of this, you can leave-a-year-out in the pre-treatment time period, run your synth algorithm, and then predict that left out year. A good synth estimator will be close to the observed value for the out of sample estimates in the pre-treated time period (and as a side bonus, you can use that variance estimate to estimate the error in the post-trend years).
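A rough sketch of that leave-a-year-out check. Everything here is an assumption for illustration: the data are simulated, and the weights are fit with a simple projected gradient descent onto the simplex (non-negative, sum to 1), which I swapped in to keep the example self-contained. It is not the solver in the R synth package, and the function names are my own:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_weights(X, y, n_iter=2000):
    """Least squares fit of y on donor columns of X with simplex-constrained weights.
    Fixed iteration count, no convergence check (sketch only)."""
    w = np.full(X.shape[1], 1.0 / X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = project_simplex(w - step * grad)
    return w

def leave_one_year_out(X_pre, y_pre):
    """Drop each pre-treatment year, refit, and predict the dropped year."""
    errors = []
    for t in range(len(y_pre)):
        keep = np.arange(len(y_pre)) != t
        w = fit_weights(X_pre[keep], y_pre[keep])
        errors.append(y_pre[t] - X_pre[t] @ w)
    return np.array(errors)

# Simulated data: 8 pre-treatment years, 10 donors sharing a national trend
rng = np.random.default_rng(42)
n_years, n_donors = 8, 10
national = np.linspace(100, 60, n_years)
X_pre = national[:, None] * rng.uniform(0.5, 1.5, n_donors)
X_pre = X_pre + rng.normal(0, 5, (n_years, n_donors))
true_w = np.zeros(n_donors)
true_w[:3] = [0.5, 0.3, 0.2]
y_pre = X_pre @ true_w + rng.normal(0, 5, n_years)

errors = leave_one_year_out(X_pre, y_pre)
rmse = float(np.sqrt(np.mean(errors ** 2)))
w_full = fit_weights(X_pre, y_pre)
print(errors.round(1), round(rmse, 2))
```

The out-of-sample RMSE is the number you would compare across design choices (say a 5 year vs a 15 year pre-period), rather than the in-sample fit.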

That is a relatively simple way to determine if the Hogan 5 year vs KNS 15 year time periods are “better” synth controls (my money is on KNS for that one). Because Hogan has not released data/code, I am not going to go through that trouble. As I said in the side note earlier, I could try to do that, and Hogan could simply come back and say “you didn’t do it right”.

This also would settle the issue of “over-fit”. You actually cannot just look at the synth weights, and say that if they are sparse they are not over-fit and if not sparse are over-fit. So for reference, you have in Hogan essentially fitting 82 weights based on 5 datapoints, and he identified a fit with 3 non-zero weights. Flip this around, and say I had 5 data points and fit a model with 3 parameters, it is easily possible that the 3 parameter model in that scenario is overfit.

Simultaneously, it is not necessary to have a sparse weights matrix. Several alternative methods to estimate synth will not have sparse weights (I am pretty sure Xu (2017) will not have sparse weights, and microsynth estimates are not sparse either for just two examples). Because US cities have such clear national level trends, a good estimator in this scenario may have many tiny weights (where good here is low bias and variance out of sample). Abadie thinks sparse weights are good to make the model more interpretable (and prevent poor extrapolation), but that doesn’t mean by default a not sparse solution is bad.

To be clear, KNS admit that their alternative results are maybe not trustworthy due to not sparse weights, but this doesn’t imply Hogan’s original estimates are themselves “OK”. I think maybe a correct approach with city level homicide rate data will have non-sparse weights, due to the national level homicide trend that is common across many cities.

Wrapping Up

If Crim and Public Policy still did response pieces maybe I would go through the trouble of doing the cross validation and making a different estimator (although I would be unlikely to be an invited commenter). But I wanted to at least do this write up, as like I said at the start I think you could do this type of critique with the majority of synth papers in criminology being published at the moment.

To just give my generic (hopefully practical) advice to future crim work:

don’t worry about matching on covariates, worry about having a long pre-period
with the default methods, you need to worry about whether you have enough “comparable” units – this is in terms of levels, not just trends
the only way to know the quality of the modeling procedure in synth is to do out of sample estimates.

Bullet points 2/3 are perhaps not practical – most criminologists won’t have the capability to modify the optimization procedure to the situation at hand (I spent a few days trying without much luck to do my penalized variants suggested, sharing so others can try out themselves, I need to move onto other projects!) Also takes a bit of custom coding to do the out of sample estimates.

For many realistic situations though, I think criminologists need to go beyond just pointing and clicking in software, especially for this underdetermined system of equations synthetic control scenario (more weights to estimate than data points). I did a prior blog post on how I think many state level synth designs are effectively underpowered (and suggested using lasso estimates with conformal intervals). I think that is a better default in this scenario as well compared to the typical synth estimators, although you have plenty of choices.

Again I had initially written this as trying to two side the argument, and not being for or against either set of researchers. But sitting down and really reading all the sources and arguments, KNS are correct in their critique. Hogan is essentially hiding behind not releasing data and code, and in that scenario can make an endless set of (ultimately trivial) responses to anyone who publishes a replication/critique.

Even if some of the numbers KNS collated are wrong, it does not make Hogan’s estimates right.

References

Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American Statistical Association, 105(490), 493-505.
Abadie, A., & L’hour, J. (2021). A penalized synthetic control estimator for disaggregated data. Journal of the American Statistical Association, 116(536), 1817-1834.
Hogan, T.P. (2022a) De‐prosecution and death: A synthetic control analysis of the impact of de‐prosecution on homicides. Criminology & Public Policy, 21(3), 489-534.
Hogan, T.P. (2022b) DE-PROSECUTION AND DEATH: A CORDIAL REPLY TO KAPLAN, NADDEO & SCOTT.
Kaplan, J., Naddeo, J., & Scott, T. (2022) De-prosecution and death: A comment on the fatal flaws in Hogan (2022).
McDowall, D., & Loftin, C. (2009). Do US city crime rates follow a national trend? The influence of nationwide conditions on local crime patterns. Journal of Quantitative Criminology, 25, 307-324.
Robbins, M. W., Saunders, J., & Kilmer, B. (2017). A framework for synthetic control methods with high-dimensional, micro-level data: evaluating a neighborhood-specific crime intervention. Journal of the American Statistical Association, 112(517), 109-126.
Rosenfeld, R., Roth, R., & Wallman, J. (2023). Homicide and the opioid epidemic: a longitudinal analysis. Homicide Studies, 27(3), 321-337.
Wheeler, A. P., & Kovandzic, T. V. (2018). Monitoring volatile homicide trends across US cities. Homicide Studies, 22(2), 119-144.
Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1), 57-76.
Yim, H. N., Riddell, J. R., & Wheeler, A. P. (2020). Is the recent increase in national homicide abnormal? Testing the application of fan charts in monitoring national homicide trends over time. Journal of Criminal Justice, 66, 101656.

by Andy Wheeler on July 9, 2023 • Permalink

https://andrewpwheeler.com/2023/07/09/some-notes-on-synthetic-control-and-hogan-kaplan/

Using quantile regression to evaluate police response times

Jeff Asher recently had a post on analysis of response times across many agencies. One nitpick though (and ditto for prior analyses I have seen, such as Scott Mourtgos and company), is that you should not use linear models (or means in general) to describe response time distributions. They are very heavily right skewed, and so the mean tends to be not representative of the bulk of the data.

When evaluating changes in response time, imagine two simplistic scenarios. One, every single call increases by 5 minutes, so what used to be 5 is now 10, 20 is now 25, 60 is now 65, etc. That is probably not realistic for response times; it is probably the calls in the tail (ones that take a very long time to wait for an opening in the queue) that change. E.g. 5 is still 5, 20 is still 20, but 60 is now 120. In the latter scenario, the left tail of the distribution does not change, only the right tail. In both scenarios the mean shifts.
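A tiny sketch of those two scenarios, using just the three example wait times from the paragraph above (so a toy illustration, not real call data):

```python
import numpy as np

base = np.array([5.0, 20.0, 60.0])          # example wait times in minutes
shift_all = base + 5                        # scenario 1: every call is 5 minutes slower
shift_tail = np.array([5.0, 20.0, 120.0])   # scenario 2: only the longest wait changes

for label, x in [("base", base), ("shift all", shift_all), ("shift tail", shift_tail)]:
    print(label, round(x.mean(), 1), np.median(x), x.min())
```

The mean goes up in both scenarios, so the mean alone cannot tell them apart. The median and the minimum only move in the first one.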

I think a natural way to model the problem, instead of focusing on means, is to use quantile regression. It allows you to look at the entire distribution (both the left and right tails) and still control for individual level circumstances. Additionally, emergency agencies often set goals along the lines of “I want to respond to 90% of emergency events within X minutes”. Quantile regression is a great tool to describe that 90% mark. Here I am going to show an example using the New Orleans calls for service data and python.

First, we can download the data right inside of python without saving it directly to disk. I am going to be showing off estimating quantile regression with the statsmodels library. I do the analysis for 19 through 22, but NOLA has calls for service going back to the early 2010s if folks are interested.

import pandas as pd
import statsmodels.formula.api as smf

# Download data, combo 19/20/21/22
y19 = 'https://data.nola.gov/api/views/qf6q-pp4b/rows.csv?accessType=DOWNLOAD'
y20 = 'https://data.nola.gov/api/views/hp7u-i9hf/rows.csv?accessType=DOWNLOAD'
y21 = 'https://data.nola.gov/api/views/3pha-hum9/rows.csv?accessType=DOWNLOAD'
y22 = 'https://data.nola.gov/api/views/nci8-thrr/rows.csv?accessType=DOWNLOAD'
yr_url = [y19,y20,y21,y22]
res_pd = [pd.read_csv(url) for url in yr_url]
data = pd.concat(res_pd,axis=0) # a little over 1.7 million

Now we do some data munging. Here I eliminate self initiated events, as well as those with missing data. There then are just a handful of cases that have 0 minute arrivals, which to be consistent with Jeff’s post I also eliminate. I create a variable, minutes, that is the minutes between the time created and the time arrived on scene (not cleared).

# Prepping data
data = data[data['SelfInitiated'] == 'N'].copy() # no self init
data = data[~data['TimeArrive'].isna()].copy()   # some missing arrive
data['begin'] = pd.to_datetime(data['TimeCreate'])
data['end'] = pd.to_datetime(data['TimeArrive'])
dif = data['end'] - data['begin']
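# NOTE: .dt.seconds ignores the days component, so gaps over 24 hours wrap
# and negative gaps show up as just under 24 hours;
# dif.dt.total_seconds()/60 would keep the full signed duration instead
data['minutes'] = dif.dt.seconds/60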
data = data[data['minutes'] > 0].copy() # just a few left over 0s

# Lets look at the distribution
data['minutes'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

For quantiles, for the entire sample the median time is around 20 minutes, the 10th percentile is under 3 minutes and the 90th percentile is around 5 hours. Using the mean here (which in Jeff’s post varies from 50 to 146 minutes over the same 4 year period), can be somewhat misleading.

An important component of response times is differentiating between different priority calls. In NOLA’s data, higher numbers are higher priority. Zero priority calls are things NOLA says don’t necessarily need an officer at all. So it could be those “0 priority” calls are really just dragging the overall average down over time, although they may have little to do with clearance rates or public safety overall. The priority category field also has sub-categories, e.g. 1A is higher priority than 1B. To keep the post simple I just break down by the leading integer values, not the letter sub-categories.

# Priority just do 1/2/3
# 3 is highest priority
data['PriorCat'] = data['Priority'].str[0]
# Only 5 cases of 3s, will eliminate these as well
data.groupby('PriorCat')['minutes'].describe()

Here you can really see the right skewness – for priority 2 calls over the entire sample, the mean is 25 minutes but the median is under 10 minutes. A benefit of quantile regression I will use in a bit is that the few outlying cases (beyond the quantiles of interest) really don’t impact the analysis. So those cases that take almost 24 hours (I imagine they are just auto-filled in like that in the data) really don’t impact estimates of smaller quantiles. But they can have noticeable influence on mean estimates.

Some final data munging, to further simplify I drop the 16 cases of priority 3s and 4s, and add in a few more categorical covariates for hour of the day, and look at months over time as categorical. (These decisions are more so to make the results easier to parse in a blog post in simpler tables, it would take more work to model a non-linear continuous over time variable, say via a spline, and make a reasonable ordinal encoding for the sub-priority categories.)

# only worry about 0/1/2s
data = data[data['PriorCat'].isin(['0','1','2'])].copy()
# Total in the end almost 600k cases

# Some factor date variables
def dummy_stats(vdate,begin_date):
    bd = pd.to_datetime(begin_date)
    year = vdate.dt.year
    month = vdate.dt.month
    week_day = vdate.dt.dayofweek
    hour = vdate.dt.hour
    diff_days = (vdate - bd).dt.days
    # if binary, turn week/month into dummy variables
    return diff_days, week_day, hour, month, year

dn, wd, hr, mo, yr = dummy_stats(data['begin'],'1/1/2022')
data['Hour'] = hr
data['Month'] = mo
data['Year'] = yr

# Lets just look at months over time
data['MoYr'] = data['Year'] + data['Month']/100

Now finally onto the modeling stuff. For those familiar with regression, quantile regression instead of predicting the mean predicts a quantile of the distribution. Here I show predicting the 50th quantile (the median). For those not familiar with regression, this is not all that different than doing a pivot table/group by, but aggregating by quantiles instead of means. Regression is somewhat different than the simpler pivot table, since you “condition on” other continuous factors (here I “control for” hour of day), but in broad strokes is similar.
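To make "predicting a quantile" a bit more concrete: quantile regression minimizes the check (pinball) loss instead of squared error. Below is a minimal sketch with no covariates at all, using made-up minutes, showing that the single guess that minimizes that loss is just the sample quantile. The function names are my own for illustration, not anything from statsmodels.

```python
def pinball_loss(values, guess, q):
    """Average check (pinball) loss of a single guess for quantile q."""
    total = 0.0
    for v in values:
        err = v - guess
        # Under-predictions are weighted by q, over-predictions by (1 - q)
        total += q * err if err >= 0 else (q - 1) * err
    return total / len(values)

def best_guess(values, q):
    # Search only over observed values; a minimizer always sits on one of them
    return min(sorted(values), key=lambda g: pinball_loss(values, g, q))

# Made-up response times in minutes, with one long outlier
minutes = [3, 5, 6, 8, 9, 11, 14, 20, 30, 45, 300]

med = best_guess(minutes, 0.5)
p90 = best_guess(minutes, 0.9)
print(med, p90)
```

The 300 minute call has no pull on the median here, which is the same robustness noted earlier. The regression version does the same minimization, but lets the guess depend on covariates.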

Here I use a patsy “R style” formula, and fit a categorical covariate for the 0/1/2 categories, hour of day, and the time varying months over time (to see the general trends). The subsequent regression table is big, so will show in parts:

# Quantile regression for median
mod = smf.quantreg("minutes ~ C(PriorCat, Treatment(reference='2')) + C(Hour) + C(MoYr)", data)
res50 = mod.fit(q=0.5)
res50.summary()

First, I use priority 2 events as the referent category, so you can see (in predicting the median of the distribution) that priority 1 events have a median 24 minutes longer than priority 2, and priority 0 events have a median two hours longer. You can see some interesting patterns in the hour of the day effects (which are for the overall effects, not broken down by priority). So there are likely shift changes at 06:00, 14:00, and 22:00 that result in longer wait times.

But of most interest are patterns over time, here is the latter half of that table, showing median estimates over the months in this sample.

You could of course make the model more complicated, e.g. look at spatial effects or incorporate other direct measures of capacity/people on duty. But here it is complicated enough for an illustrative blog post. January-2019 is the referent category month, and we can see some slight decreases of a few minutes around the start of the pandemic, but median times have clearly been increasing fairly noticeably starting later in 2021.

As opposed to interpreting regression coefficients, I think it is easier to see model predictions. We can just make sample data points, here at noon over the different months, and do predictions over each different priority category:

# Predictions for different categories
hour = 12
prior_cat = [0,1,2]
oos = data.groupby(['PriorCat','MoYr'],as_index=False)['Hour'].size()
oos['Hour'] = 12
oos['Q50'] = res50.predict(oos)

print(oos[oos['PriorCat'] == '0'])
print(oos[oos['PriorCat'] == '1'])
print(oos[oos['PriorCat'] == '2'])

So here for priority 0, 130 has crept up to 143.

And for priority 1, median times 35 to 49.

Note that the way I estimated the regression equation, the increase/decrease per month is forced to be the same across the different priority calls. So, the increase among priority 2 calls is again around 13 minutes according to the model.

But this assumption is clearly wrong. Remember my earlier “fast” and “slow” example, with only the slow calls increasing. That would suggest the distributions for the priority calls will likely have different changes over time. E.g. priority 0 may increase by a lot, but priority 2 will be almost the same. You could model this in the formula via an interaction effect, e.g. something like "minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", but to make the computer spit out a solution a bit faster, I will subset the data to just priority 2 calls.

Here the power of quantile regression is we can look at different distributions. Estimating extreme quantiles is tough, but looking at the 10th/90th (as well as the median) is pretty typical. I do those three quantiles, and generate model predictions over the months (again assuming a call at 12).

# To save time, I am only going to analyze
# Priority 2 calls now
p2 = data[data['PriorCat'] == '2'].copy()
m2 = smf.quantreg("minutes ~ C(MoYr) + C(Hour)", p2)
oos2 = oos[oos['PriorCat'] == '2'].copy()

# loop over different quantiles
qlist = [0.1, 0.5, 0.9]
for q in qlist:
    res = m2.fit(q=q)
    oos2[f'Q_{q}'] = res.predict(oos2)

oos2

So you can see my story about fast and slow calls plays out, even when restricted to purportedly high risk calls. When looking at just priority 2 calls in New Orleans, the 10th percentile stays very similar over the period, although it does have a slight increase from under 4 to almost 5 minutes. The 50th percentile has slightly more growth, from 10 minutes to 13 minutes. The 90th percentile though has more volatility: it grew from 30 to 60 in small increases in 2022, and late 2022 has fairly dramatic further growth to 70/90 minutes. And you can see how the prior model, which did not break out priority 0/1 calls, changed this estimate for the priority 2 left tail as well. (So those groups likely also had large shifts across the entire set.)

So my earlier scenario is overly simplistic, we can see some increase in the left tails of the distribution as well in this analysis. But, the majority of the increase is due to changes in the long right tail – calls that used to take less than 30 minutes are now taking 90 minutes to arrive. Which still likely has implications for satisfaction with police and reporting behavior, maybe not so much though with clearance or direct public safety.
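The fast/slow story is easy to mimic with a toy simulation. This is only a sketch with made-up ranges (80% fast calls, 20% slow calls, and only the slow calls getting slower), not anything estimated from the NOLA data:

```python
import random
import statistics

random.seed(10)

def simulate(n, slow_range):
    """Mix of 80% fast calls and 20% slow calls (minutes); ranges are made up."""
    out = []
    for _ in range(n):
        if random.random() < 0.8:
            out.append(random.uniform(3, 12))        # fast calls, unchanged
        else:
            out.append(random.uniform(*slow_range))  # slow calls
    return out

before = simulate(5000, (20, 40))
after = simulate(5000, (40, 100))   # only the slow calls get slower

def q10_50_90(x):
    deciles = statistics.quantiles(x, n=10)  # 9 cut points: 10%, ..., 90%
    return deciles[0], deciles[4], deciles[8]

for label, x in [("before", before), ("after", after)]:
    lo, mid, hi = q10_50_90(x)
    print(f"{label}: 10th={lo:.1f}  50th={mid:.1f}  90th={hi:.1f}")
```

In this toy setup the 10th and 50th percentiles stay put while the 90th moves a lot, and the mean would move too. The real data are messier, since the left tail shifts a little as well.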

No easy answers here in terms of giving internet advice to New Orleans. If working with NOLA, I would like to get estimates of officer capacity per shift, so I could incorporate into the quantile regression model that factor directly. That would allow you to precisely quantify how officer capacity impacts the distribution of response times. So not just “response times are going up” but “the decrease in capacity from A to B resulted in X increase in the 90th percentile of response times”. So if NOLA had goals set they could precisely state where officer capacity needed to be to have a shot of obtaining that goal.

Simulating Group Based Trajectories (in R)

The other day I pointed out on Erwin Kalvelagen’s blog how mixture models are a solution to fit regression models with multiple lines (where which particular function/line an observation belongs to is not known in advance).

I am a big fan of simulating data when testing out different algorithms, for the simple reason that it is often difficult to know how an estimator will behave with your particular data. So we have a bunch of circumstances with mixture models (in particular here I am focusing on repeated measures group based traj type mixture models) where it is hard to know upfront how they will do. Do you want to estimate group based trajectories, but have big N and small T? Or the other way, small N and big T? (Larger sample sizes tend to result in identifying more mixtures as you might imagine (Erosheva et al., 2014).) Do you have sparse Poisson data? Or high count Poisson data? Do you have 100,000 data points, and want to know whether the software can handle data that big and how long it may take? These are all good reasons to do a simulation, to see how the estimator behaves when you know the correct answer.

These are relevant no matter what the particular algorithm – so the points are all the same for k-medoids for example (Adepeju et al., 2021; Curman et al., 2015). Or whatever clustering algorithm you want to use in this circumstance. So here I show a few different simulations showing:

GBTM can recover the correct underlying equations
AIC/BIC fit stats have a difficult time distinguishing the correct number of groups
If the underlying model is a random effects instead of latent clusters, AIC/BIC performs quite well

The last part is because GBTM models have a habit of spitting out solutions, even if the true underlying data process has no discrete groups. This is what Skardhamar (2010) did in his article. It was focused on life course, but it applies equally to the spatial analysis GBTM myself and others have done as well (Curman et al., 2015; Weisburd et al., 2004; Wheeler et al., 2016). I’ve pointed out in the past that even if the fit for GBTM looks good, the underlying data can suggest a random effects model will work quite well, and Greenberg (2016) makes pretty much the same point as well.

Example in R

In the past I have shown how to use the crimCV package to fit these group based traj models, specifically zero-inflated Poisson models (Nielsen et al., 2014). Here I will show a different package, the R flexmix package (Grün & Leisch, 2007). This will be Poisson mixtures, but they have an example of doing zip models in their docs if you want.

So first, I load in the flexmix library, set the seed, and generate longitudinal data for three different Poisson models. One thing to note here, mixture models don’t assign an observation 100% to an underlying mixture, but the data I simulate here is 100% in a particular group.

################################################
library("flexmix")
set.seed(10)

# Generate simulated data
n <- 200 #number of individuals
t <- 10   #number of time periods
dat <- expand.grid(t=1:t,id=1:n)

# Setting up underlying 3 models
time <- dat$t
p1 <- 3.5 - time
p2 <- 1.3 + -1*time + 0.1*time^2
p3 <- 0.15*time
p_mods <- data.frame(p1,p2,p3)

# Selecting one of these by random
# But have different underlying probs
latent <- sample(1:3, n, replace=TRUE, prob=c(0.35,0.5,0.15))
dat$lat <- expand.grid(t=1:t,lat=latent)$lat
dat$sel_mu <- p_mods[cbind(1:(n*t), dat$lat)]
dat$obs_pois <- rpois(n=n*t,lambda=exp(dat$sel_mu))
################################################

Now that is the hard part really – figuring out exactly how you want to simulate your data. Here it would be relatively simple to increase the number of people/areas or time period. It would be more difficult to figure out underlying polynomial functions of time.
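For readers who think in Python, here is a rough sketch of the same data generating process using only the standard library. The three log-mean equations and group probabilities are copied from the R code above. The helper names (log_mu, rpois) are my own, and the random draws will not match R's even with the same seed.

```python
import math
import random

random.seed(10)
n, t = 200, 10   # individuals, time periods (same sizes as the R code)

def log_mu(group, time):
    """Log of the Poisson mean for each latent group (same equations as above)."""
    if group == 1:
        return 3.5 - time
    if group == 2:
        return 1.3 - time + 0.1 * time ** 2
    return 0.15 * time

def rpois(lam):
    """Draw one Poisson count with Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# One latent group per individual, with unequal group probabilities
latent = random.choices([1, 2, 3], weights=[0.35, 0.5, 0.15], k=n)

rows = []  # long format: one row per individual-period
for i, g in enumerate(latent, start=1):
    for time in range(1, t + 1):
        mu = math.exp(log_mu(g, time))
        rows.append({"id": i, "t": time, "lat": g, "obs_pois": rpois(mu)})

print(len(rows), rows[0])
```

Changing n, t, or the equations in log_mu is all it takes to try a different scenario (big N and small T, sparser counts, and so on).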

Next part we fit a 3 mixture model, then assign the highest posterior probabilities back into the original dataset, and then see how we do.

################################################
# Now fitting flexmix model
mod3 <- flexmix(obs_pois ~ time + I(time^2) | id, 
                model = FLXMRglm(family = "poisson"),
                data = dat, k = 3)
dat$mix3 <- clusters(mod3)

# Seeing if they overlap with true labels
table(dat$lat, dat$mix3)/t
################################################

So you can see that the identified groupings are quite good. Only 4 individuals out of 200 are mis-placed in this example.
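One wrinkle when reading that cross-tab: the cluster numbers a mixture model returns are arbitrary, so fitted cluster 1 need not be true group 1. A small sketch of scoring the agreement under the best relabeling, with toy labels (hypothetical, not the simulation output) and a helper name I made up:

```python
from collections import Counter
from itertools import permutations

def best_agreement(true_labels, fitted_labels, k):
    """Share of units classified correctly under the best relabeling of clusters."""
    counts = Counter(zip(true_labels, fitted_labels))  # the cross-tab cells
    best = 0
    for perm in permutations(range(1, k + 1)):
        # perm[j-1] is the true group that fitted cluster j is mapped to
        hits = sum(counts[(perm[j - 1], j)] for j in range(1, k + 1))
        best = max(best, hits)
    return best / len(true_labels)

# Toy labels (hypothetical): the fitted model calls true group 1 "3", and so on
true_labels   = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
fitted_labels = [3, 3, 3, 1, 1, 2, 2, 2, 2, 2]

print(best_agreement(true_labels, fitted_labels, k=3))
```

Trying every permutation is fine for a handful of groups. For many clusters you would want an assignment algorithm instead.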

Next we can see if the underlying equations were properly recovered (you can have good separation between groups, but the polynomial fit may be garbage).

# Seeing if the estimated functions are close
rm3 <- refit(mod3)
summary(rm3)

This shows the equations are really as good as you could expect. The standard errors are as wide as they are because this isn’t really all that large a data sample for generalized linear models.

So this shows that the model does well if I feed in the correct underlying equation (almost; I could technically submit different equations with/without quadratic terms, for example). But what about the real world situation in which you do not know the correct number of groups? Here I fit models for 1 to 8 groups, and then use the typical AIC/BIC to see which group it selects:

################################################
# If I look at different groups will AIC/BIC
# pick the right one?

group <- 1:8
left_over <- group[!(group %in% 3)]
aic <- rep(-1, 8)
bic <- rep(-1, 8)
aic[3] <- AIC(mod3)
bic[3] <- BIC(mod3)

for (i in left_over){
  mod <- flexmix(obs_pois ~ time + I(time^2) | id, 
                 model = FLXMRglm(family = "poisson"),
                 data = dat, k = i)
  aic[i] <- AIC(mod)
  bic[i] <- BIC(mod)
}

fit_stats <- data.frame(group,aic,bic)
fit_stats
################################################

Here it actually fit the same model for 3/5 groups (sometimes even if you tell flexmix to fit 5 groups, it will only return a smaller number). You can see that the fit stats for group 4 through are almost the same. So while AIC/BIC did technically pick the right number in this simulated example, it is cutting the margin pretty close to picking 4 groups in this data instead of 3.

So the simulation Skardhamar (2010) did was slightly different than this so far. What he did was simulate data with no underlying trajectory groups, and then showed GBTM tended to spit out solutions. Here I will show that is the case as well. I simulate random intercepts and a simple linear trend over time.

################################################
# Simulate random effects model
library(lme4)
rand_eff <- rnorm(n=n,0,1.5)
dat$re <- expand.grid(t=1:t,re=rand_eff)$re
dat$re_pois <- rpois(n=n*t,lambda=exp(dat$sel_mu))
dat$mu_re <- 3 + -0.2*time + dat$re
dat$re_pois <- rpois(n=n*t,lambda=exp(dat$mu_re))

re_mod <- glmer(re_pois ~ 1 + time + (1 | id), 
                data = dat, family = poisson(link = "log"))
summary(re_mod)
################################################

So you can see that the random effects model is all fine and dandy – recovers both the fixed coefficients, as well as estimates the correct variance for the random intercepts.

So here I go and see how the AIC/BIC compares for the random effects models vs GBTM models for 1 to 8 groups (I stuff the random effects model in the first row for group 0):

################################################
# Test AIC/BIC for random effects vs GBTM
group <- 0:8
left_over <- 1:8
aic <- rep(-1, 9)
bic <- rep(-1, 9)
aic[1] <- AIC(re_mod)
bic[1] <- BIC(re_mod)

for (i in left_over){
  mod <- flexmix(re_pois ~ time + I(time^2) | id, 
                 model = FLXMRglm(family = "poisson"),
                 data = dat, k = i)
  aic[i+1] <- AIC(mod)
  bic[i+1] <- BIC(mod)
}

fit_stats <- data.frame(group,aic,bic)
fit_stats
################################################

So it ends up that flexmix will not give us any solutions with more than 2 groups. But the random effects fit statistics are so much smaller (by either AIC or BIC) than those for the GBTM that you wouldn’t likely make that mistake here.

I am not 100% sure how well we can rely on AIC/BIC for these different models (R does not count the individual intercepts as a degree of freedom here, so k=3 instead of k=203). But no reasonable accounting of k would flip the AIC/BIC results for these particular simulations.
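The size of that accounting choice is simple arithmetic, since the penalty terms do not depend on the data: AIC = 2k - 2*logLik and BIC = k*log(n) - 2*logLik. A sketch follows; the log-likelihood is a made-up placeholder (it cancels out of the differences), and using the 2,000 rows as n for BIC is my assumption, you could argue for the 200 individuals instead.

```python
import math

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n_obs):
    return k * math.log(n_obs) - 2 * loglik

loglik = -5000.0      # hypothetical log-likelihood, NOT from the fitted model
n_obs = 200 * 10      # individuals x time periods in the simulation

# Extra penalty from counting 203 parameters instead of 3
aic_shift = aic(loglik, 203) - aic(loglik, 3)
bic_shift = bic(loglik, 203, n_obs) - bic(loglik, 3, n_obs)
print(aic_shift, round(bic_shift, 1))
```

So counting the individual intercepts only changes the conclusion if the random effects model beats the GBTM models by less than these shifts.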

One of the things I will need to experiment with more, I really like the idea of using out of sample data to validate these models instead of AIC/BIC – no different than how Nielsen et al. (2014) use leave one out CV. I am not 100% sure if that is possible in this set up with flexmix, will need to investigate more. (You can have different types of cross validation in that context, leave entire groups out, or forecast missing data within an observed group.)

References

Adepeju, M., Langton, S., & Bannister, J. (2021). Anchored k-medoids: a novel adaptation of k-medoids further refined to measure long-term instability in the exposure to crime. Journal of Computational Social Science, 1-26.

Curman, A. S., Andresen, M. A., & Brantingham, P. J. (2015). Crime and place: A longitudinal examination of street segment patterns in Vancouver, BC. Journal of Quantitative Criminology, 31(1), 127-147.

Erosheva, E. A., Matsueda, R. L., & Telesca, D. (2014). Breaking bad: Two decades of life-course data analysis in criminology, developmental psychology, and beyond. Annual Review of Statistics and Its Application, 1, 301-332.

Greenberg, D. F. (2016). Criminal careers: Discrete or continuous?. Journal of Developmental and Life-Course Criminology, 2(1), 5-44.

Grün, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51(11), 5247-5252.

Nielsen, J. D., Rosenthal, J. S., Sun, Y., Day, D. M., Bevc, I., & Duchesne, T. (2014). Group-based criminal trajectory analysis using cross-validation criteria. Communications in Statistics-Theory and Methods, 43(20), 4337-4356.

Skardhamar, T. (2010). Distinguishing facts and artifacts in group-based modeling. Criminology, 48(1), 295-320.

Weisburd, D., Bushway, S., Lum, C., & Yang, S. M. (2004). Trajectories of crime at places: A longitudinal study of street segments in the city of Seattle. Criminology, 42(2), 283-322.

Wheeler, A. P., Worden, R. E., & McLean, S. J. (2016). Replicating group-based trajectory models of crime at micro-places in Albany, NY. Journal of Quantitative Criminology, 32(4), 589-612.

by Andy Wheeler on March 5, 2021 • Permalink

Posted in data science, R, Regression

Tagged group-based-trajectory, Poisson, simulation

https://andrewpwheeler.com/2021/03/05/simulating-group-based-trajectories-in-r/

Using regularization to generate synthetic controls and conformal prediction for significance tests

When viewing past synthetic control results, one of the things that has struck me is that the matching of the pre-trends is really good, almost too good in many cases (it appears to be fitting to noise, although you may argue that is a feature in terms of matching exogenous shocks). For example, if you end up having a pre-treatment series of 10 years, and you have a potential donor pool the size of 30, you could technically pick 10 of them at random, fit a linear regression predicting the 10 observations in the treated unit, based on 10 covariates of the donor pool outcomes over the same pre time period, and get perfect predictions (ignoring the typical constraints one places on the coefficients).
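That "perfect predictions" claim is easy to check on pure noise. Here is a minimal sketch with 4 pre-periods and 4 donors instead of 10 and 10, everything simulated, and a small hand-rolled solver (my own helper, not from any package):

```python
import random

random.seed(1)
T = 4  # pre-treatment periods; also the number of donors picked

# Pure noise: the donors carry no information about the treated unit
treated = [random.gauss(0, 1) for _ in range(T)]
donors = [[random.gauss(0, 1) for _ in range(T)] for _ in range(T)]  # rows = periods

def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

weights = solve(donors, treated)
fitted = [sum(d * w for d, w in zip(row, weights)) for row in donors]
max_err = max(abs(f - y) for f, y in zip(fitted, treated))
print(max_err)  # floating point noise only: a "perfect" pre-period fit to noise
```

The weights here are unconstrained, so this is the worst case. The usual non-negativity and sum-to-one constraints make exact interpolation harder, but the same over-fitting worry applies.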

So how do we solve that problem? One solution is to use regularized regression results (e.g. ridge regression, lasso), when the number of predictors is greater than the number of observations. So I can cast the matching procedure into a regression problem to generate the weights. Those regression procedures are typically used for forecasting, but don’t have well defined standard errors, and so subsequently are typically only used for point forecasts. One way to make inferences though is to generate the synthetic weights (here using lasso regression), and then use conformal prediction intervals to do our hypothesis testing of counterfactual trends.
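To give a feel for what a lasso with bounded coefficients is doing under the hood, here is a bare-bones coordinate descent sketch in plain Python. It minimizes (1/2n)*RSS + lam*sum|b_j| with each weight clipped to [lower, upper] and an unpenalized intercept. This is my own toy illustration of that flavor of problem on made-up data, not a reimplementation of any package, and the function name is hypothetical.

```python
def box_lasso(x, y, lam, lower=0.0, upper=1.0, n_iter=200):
    """Coordinate descent for a lasso with box constraints on the weights.

    x: list of rows (pre-periods), each a list of donor outcomes; y: treated outcomes.
    Assumes no donor column is constant. Returns (intercept, weights).
    """
    n, p = len(y), len(x[0])
    # Center everything so the unpenalized intercept drops out of the updates
    x_bar = [sum(row[j] for row in x) / n for j in range(p)]
    y_bar = sum(y) / n
    xc = [[row[j] - x_bar[j] for j in range(p)] for row in x]
    yc = [v - y_bar for v in y]
    z = [sum(row[j] ** 2 for row in xc) / n for j in range(p)]
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of donor j with the residual that leaves donor j out
            rho = sum(
                xc[i][j] * (yc[i] - sum(xc[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            # Soft-threshold (the lasso step), then clip to the box
            if rho > lam:
                raw = (rho - lam) / z[j]
            elif rho < -lam:
                raw = (rho + lam) / z[j]
            else:
                raw = 0.0
            b[j] = min(max(raw, lower), upper)
    b0 = y_bar - sum(x_bar[j] * b[j] for j in range(p))
    return b0, b

# Toy data (made up): treated unit is 0.6*donor0 + 0.4*donor1 + 5; donor2 is unrelated
x = [[10, 20, 7], [12, 18, 3], [15, 25, 9], [11, 30, 4], [14, 22, 8], [18, 27, 2]]
y = [0.6 * r[0] + 0.4 * r[1] + 5 for r in x]
b0, w = box_lasso(x, y, lam=0.01)
print(round(b0, 2), [round(v, 3) for v in w])
```

The penalty shrinks the two real donors slightly below their true weights and zeroes out the unrelated one, which is the variable selection behavior you want when there are more donors than pre-periods.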

Here I walk through an example using state panel crime data in R, full code and data can be downloaded here.

A Synthetic Control Example

So first, these are the packages we need to replicate the results. conformalInference is not on CRAN yet, so use devtools to install it.

#library(devtools)
#install_github(repo="ryantibs/conformal", subdir="conformalInference")
library(conformalInference)
library(glmnet)
library(Synth)

Then I have prepped a nice state panel dataset of crime rates and counts from 1960 through 2014. I set a hypothetical treatment start year in 2005 just so I have a nice 10 years post data for illustration. That is a pretty good length pre-panel though, and a good number of potential donors.

MyDir <- "C:\\Users\\axw161530\\Desktop\\SynthIdeas"
setwd(MyDir)

TreatYear <- 2005

LongData <- read.csv("CrimeStatebyState_Edited.csv")
summary(LongData)

Next I prep my data, currently it is in long panel format, but I need it in wide format to fit the regression equations I want. I am just matching on violent crime rates here. I take out NY, as it is missing a few years of data. (This dataset also includes DC.) Then I split it up into my pre intervention and post intervention set.

#Changing the data to wide for just the violent offenses
wide <- LongData[,c('State','Year','Violent.Crime.rate')]
names(wide)[3] <- 'VCR'
wide <- reshape(wide, idvar="Year", timevar="State", direction="wide")
summary(wide)
#Take out NY because of NAs
wide <- wide[,c(1:33,35:52)]

wide_pre <- as.matrix(wide[wide$Year < TreatYear,])
wide_post <- as.matrix(wide[wide$Year >= TreatYear,])

Now onto the good stuff, we can estimate our lasso regression using the pre-data to get our weights. This constrains the coefficients to be between 0 and 1, but does not have the constraint that they sum to 1. I just choose Alabama as an example treated unit. I intentionally chose a state and year that should not have any effects, for illustration and to check the coverage of my technique vs more traditional analyses.

You can see in my notes this is different than traditional synth in that it has an intercept as well. I was surprised, but the predictions in sample were really bad without the intercept no matter how I sliced it.

res <- glmnet(x=wide_pre[,3:51],y=wide_pre[,2],family="gaussian",
       lower.limits=0,upper.limits=1,intercept=TRUE,standardize=FALSE,
       alpha=1) #need the intercept, predictions suck otherwise

Even though this does not constrain the coefficients to sum to 1, it ends up with weights really close to that ideal anyway (sum of the non-intercept coefficients is just over 1.01). When I use crossvalidation it does not choose weights that sum to unity, but in sample the above code and the cv.glmnet are really similar in terms of predictions.

co_ridge <- as.matrix(coef(res))
fin <- co_ridge[,"s99"]
active <- fin[fin > 0] #Does not include intercept

If you print active we then have our state weights (the intercept is pretty tiny, -22, so I am not quite sure why eliminating the intercept was causing such problems in this example). North Carolina just sneaks in, but otherwise the synthetic control is a mix of Arkansas, California, Illinois, Kentucky, and Texas. The intercept is just a level shift, so we are still matching curves otherwise, so that does not bother me very much.

VCR.AR 0.2078156362
VCR.CA 0.1201658279
VCR.IL 0.1543015666
VCR.KY 0.2483613907
VCR.NC 0.0002896238
VCR.TX 0.2818272850

If we look at our predictions for the pre-time period, Alabama had the typical crime path, with a big rise going into the early 90’s and then a fall afterward (black line), and our in-sample predictions from the lasso regression are decent.

pre_pred <- predict(res,newx=wide_pre[,3:51],s=min(res$lambda)) #for not cv results

plot(wide_pre[,1],wide_pre[,2],type='l',xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_pre[,1],pre_pred,bg='red',pch=21) #Not too shabby
legend(1960,800,legend=c("Observed Alabama","Predicted"),col=c("black","black"), pt.bg=c("black","red"), lty=c(1,NA), pch=c(NA,21))

Now to evaluate post intervention, we are going to generate conformal prediction intervals using a jackknife approach. Basically doing all the jazz of above, but leaving one pre year out at a time, and trying to predict Alabama’s violent crime rate for that left out year. Repeat that same process for all prior years, and we can get a calculation of the standard error of our prediction. Then apply that standard error to future years, so we can tell if the observed trend is different than the counterfactual we estimated (given the counterfactual has errors). I generate both 90% prediction intervals, as well as 99% prediction intervals.

train_fun <- function(x, y, out=NULL){
  return( glmnet(x,y,alpha=1,standardize=FALSE,intercept=TRUE,nlambda=100,
                lower.limits=0,upper.limits=1,family="gaussian")
  )
}

pred_fun = function(out, newx) {
    return(predict(out, newx, s=min(out$lambda)))
}

limits_10 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
                                 train.fun=train_fun,predict.fun=pred_fun,alpha=0.10,
                                 verbose=TRUE)

limits_01 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
               train.fun=train_fun,predict.fun=pred_fun,alpha=0.01,
               verbose=TRUE)

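#post_pred is not defined in the snippets shown; assumed here to mirror pre_pred above
post_pred <- predict(res,newx=wide_post[,3:51],s=min(res$lambda))

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(150,650),xlab='',ylab='Violent Crime Rate per 100,000')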
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
legend("topright",legend=c("Observed Alabama","Predicted","90% Pred. Int.","99% Pred. Int."),cex=0.7,
       col=c("black","black","grey","grey"), pt.bg="red", lty=c(1,NA,1,1), pch=c(NA,21,NA,NA), lwd=c(1,1,1,3))

Then at the end of the above code snippet I made a plot. Black line is observed for Alabama from 05-14. Red dots are the estimated counterfactual based on the pre-weights. The lighter grey lines are then the prediction intervals. So we can see it is just outside the 90% intervals 3 times in the later years (would only expect 1 time), but all easily within the 99% intervals.

Note these are prediction intervals, not confidence intervals. Thinking about it I honestly don’t know whether we want prediction or confidence intervals in this circumstance, but prediction will be wider.
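For intuition on what the jackknife intervals are doing, here is a stripped down sketch: leave each pre-period out, refit, record the held-out absolute error, and use a high quantile of those errors as the half-width. I swap in a one-donor least squares line for the lasso and use made-up series, and the exact quantile rule inside the conformalInference package may differ from the one I use here.

```python
import math

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def jackknife_interval(x, y, x_new, alpha=0.10):
    """Prediction interval from leave-one-out absolute residuals."""
    n = len(y)
    loo_resid = []
    for i in range(n):
        # Refit without period i, then predict the held-out period
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        loo_resid.append(abs(y[i] - (b0 + b1 * x[i])))
    loo_resid.sort()
    # Conservative empirical quantile of the held-out errors
    rank = min(n, math.ceil((1 - alpha) * (n + 1)))
    half_width = loo_resid[rank - 1]
    b0, b1 = fit_line(x, y)
    pred = b0 + b1 * x_new
    return pred - half_width, pred, pred + half_width

# Made-up pre-period series: one donor (x) and the treated unit (y)
donor   = [400, 420, 450, 500, 560, 600, 580, 540, 500, 470]
treated = [410, 425, 470, 510, 555, 615, 590, 530, 505, 480]
lo, pred, up = jackknife_interval(donor, treated, x_new=450, alpha=0.10)
print(round(lo, 1), round(pred, 1), round(up, 1))
```

Note the interval has the same width for every post year, since it only depends on the held-out pre-period errors. That matches the constant-width bands in the plot above.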

So this approach just matches on the pre-treated same outcome observations. To match on additional covariates, you can add them in as rows into the pre-treatment dataset (although you would want to normalize the values to a similar mean and standard deviation as the pre-treated outcome series).
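The normalization step could be as simple as a linear rescale. A sketch with hypothetical numbers (the log population values and crime rates below are invented, and rescale_like is just a name I picked):

```python
import statistics

def rescale_like(covariate, outcome):
    """Linearly rescale covariate to have the outcome's mean and standard deviation."""
    c_mean, c_sd = statistics.mean(covariate), statistics.pstdev(covariate)
    o_mean, o_sd = statistics.mean(outcome), statistics.pstdev(outcome)
    return [o_mean + o_sd * (c - c_mean) / c_sd for c in covariate]

log_pop = [15.2, 17.1, 14.8, 16.0]             # hypothetical covariate, one value per state
vcr_pre = [250.0, 610.0, 480.0, 390.0, 520.0]  # hypothetical pre-period outcome values
new_row = rescale_like(log_pop, vcr_pre)
print([round(v, 1) for v in new_row])
```

Without something like this, a covariate on a very different scale than the crime rates would either dominate the fit or be ignored.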

You may also add in other covariates, like functions of time (although this changes the nature of the identification). So for example say you incorporate a linear and quadratic trend in time, and lasso only chooses those two time factors and no control areas. You are doing something more akin to interrupted time series analysis at that point (the counterfactual is simply based on your estimate of the pre-trend). Which I think is OK sometimes, but is quite different than using control areas to hopefully capture random shocks.

Comparing to Traditional Synth results

To see whether my error intervals are similar to the placebo approach, I used the old school synth R package. It isn’t 100% comparable, as it makes you match on at least one covariate, so here I choose to also match on the average logged population over the pre-treatment period.

#NY is missing years
LongData_MinNY <- LongData[as.character(LongData$State) != "NY",c("State","Year","Violent.Crime.rate","Population")]
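LongData_MinNY$StateNum <- as.numeric(as.factor(LongData_MinNY$State)) #as.factor so this also works when State is read in as character (the default in R >= 4.0)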
LongData_MinNY$State <- as.character(LongData_MinNY$State)
LongData_MinNY$LogPop <- log(LongData_MinNY$Population)    

state_nums <- unique(LongData_MinNY$StateNum)
    
dataprep.out <- dataprep(foo = LongData_MinNY,
                         dependent = "Violent.Crime.rate",
                         predictors = c("LogPop"),
                         unit.variable = "StateNum",
                         unit.names.variable = "State",
                         time.variable = "Year",
                         treatment.identifier = 2,
                         controls.identifier = state_nums[!state_nums %in% 2],
                         time.optimize.ssr = 1960:(TreatYear-1),
                         time.predictors.prior = 1960:(TreatYear-1),
                         time.plot = 1960:2014
                         )

synth_res <- synth(dataprep.out)
synth_tables <- synth.tab(dataprep.res = dataprep.out, synth.res = synth_res)
synth_tables$tab.w #a bunch of little weights across the board
path.plot(synth.res = synth_res, dataprep.res = dataprep.out, tr.intake=TreatYear,Xlab='',Ylab='Violent Crime Rate per 100,000',
      Legend=c("Alabama","Synthetic Control"), Legend.position=c("topleft"))

Looking at the weights, it is a bunch of little ones for many different states. Looking at the plot, it doesn’t appear to be any better fit than the lasso approach.

And then I just do the typical approach and use placebo checks to do inference. I loop over my 49 placebos (-1 state for NY, but +1 state because this list includes DC).

#Dataframes to stuff the placebos check results into
Predicted <- data.frame(dataprep.out$Y0plot %*% synth_res$solution.w)
names(Predicted) <- "TreatPred"

Pred_MinTreat <- data.frame(TreatPred = Predicted$TreatPred - LongData_MinNY[LongData_MinNY$StateNum == 2,"Violent.Crime.rate"])

#Now I just need to loop over the other states and collect their results for the placebo tests

placebos <- state_nums[!state_nums %in% 2]
for (i in placebos){
  dataprep.plac <- dataprep(foo = LongData_MinNY,
                           dependent = "Violent.Crime.rate",
                           predictors = c("LogPop"),
                           unit.variable = "StateNum",
                           unit.names.variable = "State",
                           time.variable = "Year",
                           treatment.identifier = i,
                           controls.identifier = state_nums[!state_nums %in% i],
                           time.optimize.ssr = 1960:(TreatYear-1),
                           time.predictors.prior = 1960:(TreatYear-1),
                           time.plot = 1960:2014
  )
  synth_resP <- synth(dataprep.plac)
  synth_tablesP <- synth.tab(dataprep.res = dataprep.plac, synth.res = synth_resP)
  nm <- paste0("S.",i)
  Predicted[,nm] <- dataprep.plac$Y0plot %*% synth_resP$solution.w
  Pred_MinTreat[,nm] <- Predicted[,nm] - LongData_MinNY[LongData_MinNY$StateNum == i,"Violent.Crime.rate"]
}

If you look at the synth estimates for Alabama (grey circles), they are almost exactly the same as the lasso predictions (red circles), even though the weights are very different.

PredRecent <- Predicted[1960:2014 >= TreatYear,]
DiffRecent <- Pred_MinTreat[1960:2014 >= TreatYear,]

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(100,700),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
points(wide_post[,1],PredRecent$TreatPred,bg='grey',pch=21)
legend("topright",legend=c("Observed Alabama","Lasso Pred.","90% Pred. Int.","99% Pred. Int.","Synth Pred."),cex=0.6,
       col=c("black","black","grey","grey"), pt.bg=c(NA,"red",NA,NA,"grey"), lty=c(1,NA,1,1,NA), pch=c(NA,21,NA,NA,21), lwd=c(1,1,1,3,1))

But when we look at variation in our placebo results (thin, purple lines), they are much wider than our conformal prediction intervals.

plot(wide_post[,1],wide_post[,2]-post_pred,type='l',ylim=c(-500,500),xlab='',ylab='Observed - Predicted (Violent Crime Rates)')
points(wide_post[,1],post_pred-post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_01$lo-post_pred,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up-post_pred,col='grey',lwd=3)

for (i in 2:ncol(PredRecent)){
  lines(wide_post[,1],DiffRecent[,i],col='#9400D340',lwd=0.5)
}

legend(x=2005.5,y=-700,legend=c("Observed Alabama","Lasso Pred.","99% Pred. Int.","Placebos"),
       col=c("black","black","grey",'#9400D3'), pt.bg=c(NA,"red",NA,NA), lty=c(1,NA,1,1), 
       pch=c(NA,21,NA,NA), lwd=c(1,1,3,0.5), xpd=TRUE, horiz=TRUE, cex = 0.45)

So I was hoping they would be the same (conformal would cover the placebo at the expected rate), but alas they are not. So I’m not sure if my conformal intervals are too small, or the placebo checks are extra noisy. I can’t prove it, but I suspect the placebo checks are somewhat noisy, mainly because there will always be some intervention that is idiosyncratic to specific donors over long periods of time that makes them no longer good counterfactuals. This seems especially true if you consider predictions further out from the treatment year. Although I find the logic of the placebo checks pretty convincing, so I am somewhat torn.

Since we have in this example 49 donors, the two-tailed p-value for being outside the placebos would be 2/(49+1)=0.04. Here we would need an intervention that either increased or decreased violent crime rates by about 400 per 100,000, pretty much an impossible standard given a baseline of only 400 crimes per 100,000 as of 2004. The 99% conformal intervals are still pretty wide, with an increase/decrease of about 150 violent crimes per 100,000 needed to be a significant change. The two lines way outside 400 happen to be Alaska and Wyoming, not DC, so maybe a tiny population state results in a higher volatility problem. But besides them there are a bunch of placebo states around plus/minus 300 as well.
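The 2/(49+1) arithmetic generalizes to a simple rank-based calculation. A sketch, where the placebo gaps are hypothetical numbers and the function is my own naming (doubling the smaller one-sided rank p-value is one common convention, not the only one):

```python
def placebo_p_value(treated_effect, placebo_effects):
    """Two-tailed rank p-value: double the smaller one-sided placebo p-value."""
    n = len(placebo_effects)
    upper = (sum(e >= treated_effect for e in placebo_effects) + 1) / (n + 1)
    lower = (sum(e <= treated_effect for e in placebo_effects) + 1) / (n + 1)
    return min(1.0, 2 * min(upper, lower))

placebos = list(range(-24, 25))   # 49 hypothetical placebo gaps
p_outside = placebo_p_value(500, placebos)   # treated gap beyond every placebo
p_inside = placebo_p_value(0, placebos)      # treated gap in the middle of the pack
print(p_outside, p_inside)

# Smallest p-value you can ever get with only 10 donors
print(placebo_p_value(500, list(range(10))))
```

The last line is the point about small donor pools: with few placebos the smallest attainable p-value is bounded well away from zero, no matter how large the effect is.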

So caveat emptor if you want to use this idea in your own work, I don’t know if my suggestion is good or bad. Here it suggests it’s more diagnostic (smaller intervals) than the placebo checks, and it isn’t limited by the number of potential donors in setting the alpha level for your tests (e.g. if you only have 10 potential donors, the smallest two-tailed p-value the placebo check can give is 2/(10+1), about 0.18, so you are effectively limited to roughly 82% intervals).

Since this is just one example, there are a few things I would need to know before recommending it more generally. One is that it may not work with smaller pre time series and/or a smaller donor pool. (Not sure of any better way of checking than via a ton of different simulations.)

More general notes

Doing some more lit review while preparing this post, I appear to be like 15th in line to suggest this approach (so don’t take it as novel). In terms of using the lasso to estimate the synth weights, it seems Susan Athey and colleagues proposed something similar in addition to using other machine learning techniques. Also see Amjad et al. 2018 in the Journal of Machine Learning Research, and this workshop by Alex Hollingsworth and Coady Wing. I am not even the first one to think to use conformal prediction intervals apparently, see this working paper (Chernozhukov, Wuthrich, and Zhu, 2019) posted just a few weeks prior.

There is another R package, gsynth, that appears to solve the problem of p > n via a variable reduction technique (Xu, 2017). Xu also discusses how incorporating more information is really making different identification assumptions. So again just getting good predictions/minimizing the in-sample mean square error is not necessarily the right approach to get correct causal inferences.

Just a blog post, so again can’t say if this is an improvement over other work offhand. This is just illustrative that the bounds for the conformal prediction may be smaller than the typical permutation based approach. Casting it as a regression problem I intuitively grok more, and think opens up more possibilities. For example, you may want to use binomial logistic models instead of linear for the fitting process (so takes into account more volatility for smaller population states).

by Andy Wheeler on December 6, 2019 • Permalink

Posted in R, Regression

Tagged lasso, regularized-regression, synthetic-control

https://andrewpwheeler.com/2019/12/06/using-regularization-to-generate-synthetic-controls-and-conformal-prediction-for-significance-tests/

Some more testing coefficient contrasts: Multinomial models and indirect effects

Testing the equality of two coefficients is one of my more popular posts. This is a good thing, as often the more interesting hypotheses test two parameters against each other, as opposed to a strict null hypothesis of a coefficient against zero. Every now and then I get questions about applying this idea to new situations where it is not always straightforward to figure out how. So here are a few examples using demonstration R code.

Multinomial Models

One question I received about applying the advice was to test coefficients across different contrasts in multinomial models. It may not seem obvious, but the general approach of extracting out the coefficients and the covariance between those estimates works the same way as most regression equations.

So in a quick example in R:

library(nnet)
data(mtcars)
library(car)

mtcars$cyl <- as.factor(mtcars$cyl)  
mtcars$am <- as.factor(mtcars$am)  
mod <- multinom(cyl ~ am + hp, data=mtcars, Hess=TRUE)
summary(mod)

And the estimates for mod are:

> summary(mod)
Call:
multinom(formula = cyl ~ am + hp, data = mtcars)

Coefficients:
  (Intercept)       am1        hp
6   -42.03847  -3.77398 0.4147498
8   -92.30944 -26.27554 0.7836576

Std. Errors:
  (Intercept)       am1        hp
6    27.77917  3.256003 0.2747842
8    31.93525 46.854100 0.2559052

Residual Deviance: 7.702737 
AIC: 19.70274

So say we want to test whether the hp effect is the same for 6 cylinders vs 8 cylinders. To test that, we just grab the covariance and construct our test:

#Example constructing test by hand
v <- vcov(mod)
c <- coef(mod)
dif <- c[1,3] - c[2,3]
se <- sqrt( v[3,3] + v[6,6] - 2*v[3,6])
z <- dif/se
#difference in coefficients, its standard error, and two-tailed p-value
dif;se;2*(1 - pnorm(abs(z)))

We end up with a p-value of 0.0002505233, so we would reject the null that these two effects are equal to one another. Note that to get the variance-covariance estimates for the parameters you need to set Hess=TRUE in the multinom call.

Another easier way though is to use the car library’s linearHypothesis function to conduct the same test:

> linearHypothesis(mod,c("6:hp = 8:hp"),test="Chisq")
Linear hypothesis test

Hypothesis:
6:hp - 8:hp = 0

Model 1: restricted model
Model 2: cyl ~ am + hp

  Df  Chisq Pr(>Chisq)    
1                         
2  1 13.408  0.0002505 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
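As a small check on how the two tests relate, here is a sketch in base R that only uses the chi-square value printed above. With one degree of freedom the Wald chi-square is the square of the z statistic from the by-hand test, so both routes should give the same p-value (about 0.00025).

```r
chisq <- 13.408             # Wald chi-square copied from the linearHypothesis output
z_abs <- sqrt(chisq)        # |z| from the by-hand test, since chisq = z^2 with 1 df

p_chisq <- pchisq(chisq, df = 1, lower.tail = FALSE)
p_z     <- 2 * (1 - pnorm(z_abs))

# the two p-values agree up to floating point error
c(p_chisq = p_chisq, p_z = p_z)
```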

You can see although this is in terms of a Chi-square test, it results in the same p-value. The Wald test however can be extended to testing multiple coefficient equalities, and a popular one for multinomial models is to test if any coefficients change across different levels of the dependent categories. The idea behind that test is to see if you can collapse that category with another that is equivalent.

To do that test, I created a function that does all of the contrasts at once:

#Creating function to return tests for all coefficient equalities at once
all_tests <- function(model){
  v <- colnames(coef(model))
  d <- rownames(coef(model))
  allpairs <- combn(d,2,simplify=FALSE)
  totn <- length(allpairs) + length(d)
  results <- data.frame(ord=1:totn)
  results$contrast <- ""
  results$test <- ""
  #initialize numeric columns as NA (assigning NULL does not create a column)
  results$Df <- NA_real_
  results$Chisq <- NA_real_
  results$pvalue <- NA_real_
  iter <- 0
  for (i in allpairs){
    iter <- iter + 1
    l <- paste0(i[1],":",v)
    r <- paste0(i[2],":",v)
    test <- paste0(l," = ",r)
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i[1]," vs ",i[2])
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]    
  }
  ref <- model$lab[!(model$lab %in% d)]
  for (i in d){
    iter <- iter + 1
    test <- paste0(i,":",v," = 0")
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i," vs ",ref)
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]  
  }
  return(results)
}

Not only does this construct the test of the observed categories, but also tests whether each set of coefficients is simultaneously zero, which is the appropriate contrast for the referent category.

> all_tests(mod)
  ord contrast                                                            test Df        Chisq       pvalue
1   1   6 vs 8 6:(Intercept) = 8:(Intercept) and 6:am1 = 8:am1 and 6:hp = 8:hp  3    17.533511 0.0005488491
2   2   6 vs 4                    6:(Intercept) = 0 and 6:am1 = 0 and 6:hp = 0  3     5.941417 0.1144954481
3   3   8 vs 4                    8:(Intercept) = 0 and 8:am1 = 0 and 8:hp = 0  3 44080.662112 0.0000000000

User beware of multiple testing with this, as I am not sure as to the appropriate post-hoc correction here when examining so many hypotheses. This example with just three is obviously not a big deal, but with more categories you get n choose 2, or (n*(n-1))/2 total contrasts.
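One possible way to handle this, offered only as a sketch and not as the settled answer for this family of tests, is to pass the p-values through base R's `p.adjust`. Holm and Bonferroni are generic corrections that do not use the dependence between the contrasts, so treat them as conservative options.

```r
# p-values copied from the all_tests(mod) output above
p <- c("6 vs 8" = 0.0005488491, "6 vs 4" = 0.1144954481, "8 vs 4" = 0)

# number of pairwise contrasts among 3 outcome categories
n_contrasts <- choose(3, 2)

# generic corrections; neither is specific to multinomial models
p_holm <- p.adjust(p, method = "holm")
p_bonf <- p.adjust(p, method = "bonferroni")
round(cbind(raw = p, holm = p_holm, bonferroni = p_bonf), 4)
```

Holm is never more conservative than Bonferroni, so it is usually the better default of the two when you have no more specific correction in mind.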

Testing the equality of multiple indirect effects

Another example I was asked about recently was whether you could use the same procedure to test indirect effects (popular in moderation and mediation analysis). Those end up being a bit more tricky, as to define the variance and covariance between those indirect effects we are not just adding and subtracting values of the original parameters, but are multiplying them.

Thus to estimate the standard errors and covariances of indirect effects folks often use the delta method. In R using the lavaan library, here is an example (just taken from a code snippet Yves Rosseel posted himself) to estimate the variance-covariance matrix of the model-defined indirect parameters.

#function taken from post in
#https://groups.google.com/forum/#!topic/lavaan/skgZRyzqtYM
library(lavaan)
vcov.def <- function(model){
  m <- model
  orig <- vcov(m)
  free <- m@Fit@x
  jac <- lavaan:::lavJacobianD(func = m@Model@def.function, x = free)
  vcov_def <- jac %*% orig %*% t(jac)
  estNames <- subset(parameterEstimates(m),op==":=")
  row.names(vcov_def) <- estNames$lhs
  colnames(vcov_def) <- estNames$lhs
  #I want to print the covariance table estimates to make sure the
  #labels are in the correct order
  estNames$se2 <- sqrt(diag(vcov_def))
  estNames$difSE <- estNames$se - estNames$se2
  print(estNames[,c('lhs','se','se2','difSE')])
  print('If difSE is not zero, labels are not in right order')
  return(vcov_def)
}
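To see what the `jac %*% orig %*% t(jac)` line in that function is doing, here is a minimal base R sketch of the delta method for a single product a*d. The estimates and covariance matrix are made up for illustration (they are not from any fitted model), and the Jacobian is written out by hand instead of being pulled from lavaan.

```r
# Hypothetical estimates and covariance matrix for two free parameters a and d
est <- c(a = 0.5, d = 0.7)
V <- matrix(c(0.010, 0.001,
              0.001, 0.004), nrow = 2,
            dimnames = list(names(est), names(est)))

# The defined parameter g(a, d) = a*d has Jacobian [dg/da, dg/dd] = [d, a]
jac <- matrix(c(est["d"], est["a"]), nrow = 1)
var_ad <- jac %*% V %*% t(jac)

# The same quantity written out term by term
var_ad_manual <- est["d"]^2 * V["a", "a"] + est["a"]^2 * V["d", "d"] +
                 2 * est["a"] * est["d"] * V["a", "d"]

# delta-method standard error of a*d
sqrt(drop(var_ad))
```

With several defined parameters the Jacobian simply gets one row per parameter, and the off-diagonal entries of the result are the covariances needed to test one indirect effect against another.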

Now here is an example of testing individual parameter estimates for indirect effects.

set.seed(10)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
M <- 0.5*X1 + 0.4*X2 + 0.3*X3 + rnorm(n)
Y <- 0.1*X1 + 0.2*X2 + 0.3*X3 + 0.7*M + rnorm(n)
Data <- data.frame(X1 = X1, X2 = X2, X3 = X3, Y = Y, M = M)
model <- ' # direct effect
             Y ~ X1 + X2 + X3 + d*M
           # mediator
             M ~ a*X1 + b*X2 + c*X3
           # indirect effects
             ad := a*d
             bd := b*d
             cd := c*d
         '
model_SP.fit <- sem(model, data = Data)
summary(model_SP.fit)

#now apply to your own sem model
defCov <- vcov.def(model_SP.fit)

Unfortunately as far as I know, the linearHypothesis function does not work for lavaan objects, so if we want to test whether two indirect effects are equal (ad = bd) we need to construct the test by hand. But with the vcov.def function we have the covariance terms we need.

#testing hypothesis that "ad = bd"
#so doing "ad - bd = 0"
model_SP.param <- parameterEstimates(model_SP.fit)
model_SP.defined <- subset(model_SP.param, op==":=")
dif <- model_SP.defined$est[1] - model_SP.defined$est[2]
var_dif <- defCov[1,1] + defCov[2,2] - 2*defCov[1,2]
#so the test standard error of the difference is 
se_dif <- sqrt(var_dif)
#and the test statistic is
tstat <- dif/se_dif 
#two tailed p-value
dif;se_dif;2*(1 - pnorm(abs(tstat)))

To test whether all three indirect parameters are equal to each other at once, one way is to estimate a restricted model, and then use a likelihood ratio test of the restricted vs the full model. It is pretty easy in lavaan to create coefficient restrictions, just give the coefficients that were varying the same label. Since every indirect effect here shares the same d, forcing a = b = c is the same as forcing ad = bd = cd:

restrict_model <- ' # direct effect
                      Y ~ X1 + X2 + X3 + d*M
                    # mediator
                      M ~ a*X1 + a*X2 + a*X3
                    # indirect effects
                      ad := a*d
                  '

model_SP.restrict <- sem(restrict_model, data = Data)
lavTestLRT(model_SP.fit, model_SP.restrict)

If folks know of an easier way to do the Wald tests via lavaan models let me know, I would be interested!

by Andy Wheeler on April 9, 2018 • Permalink

Posted in R, Regression

Tagged contrasts, hypothesis-testing, multinomial, sem

https://andrewpwheeler.com/2018/04/09/some-more-testing-coefficient-contrasts-multinomial-models-and-indirect-effects/

Pooling multiple outcomes into one regression equation

Something that came up for many of my students this last semester in my Seminar in Research class is that many were interested in multiple outcomes. The most common one is examining different types of delinquency for juveniles (often via surveys), but it comes up in quite a few other designs as well (e.g. different crime outcomes for spatial research, different measures of perceptions towards police, different measures of fear of crime, etc.).

Most of the time students default to estimating separate equations for each of these outcomes, but in most circumstances I was telling the students they should pool these outcomes into one model. I think that is the better default for the majority of situations. So say we have a situation with two outcomes, violent crimes and property crimes, and we have one independent variable we are interested in, say whether an individual was subjected to a particular treatment. We might then estimate two separate equations:

E[# Violent Crimes]  = B0v + B1v*(Treatment) 
    
E[# Property Crimes] = B0p + B1p*(Treatment)

Saying that we should pool by default is basically saying that B1v is going to be close to B1p in the two equations. Pooling the models together lets us both test that assertion and get a better estimate of the overall treatment effect. So to pool the models we would stack the outcomes together, and then estimate something like:

E[# Crimes (by type)] = B0 + B1*(Treatment) + B2*(Outcome = Violent) + B3*(Treatment)*(Outcome = Violent)

Here the B3 coefficient tests whether the treatment effect is different for the violent crime outcome as opposed to the property crime, and the dummy variable B2 effect controls for any differences in the levels of the two overall (that is, you would expect violent incidents to be less common than property crime incidents).
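Here is a minimal sketch of that stacking in R using simulated data. The sample size, the Poisson link, and the coefficient values are all assumptions for illustration, and the standard errors from this plain glm call are not adjusted for having two rows per person.

```r
set.seed(1)
n <- 500
treat <- rbinom(n, 1, 0.5)
# same treatment effect (-0.3) for both outcomes, violent crime is less common
y_violent  <- rpois(n, exp(-0.5 - 0.3 * treat))
y_property <- rpois(n, exp( 0.5 - 0.3 * treat))

# stack the two outcomes into long format, one row per person-outcome
long <- data.frame(id      = rep(1:n, times = 2),
                   treat   = rep(treat, times = 2),
                   violent = rep(c(1, 0), each = n),
                   crimes  = c(y_violent, y_property))

# B1 = treat, B2 = violent, B3 = treat:violent
pooled <- glm(crimes ~ treat * violent, family = poisson, data = long)

# separate equations for comparison: B1p = B1 and B1v = B1 + B3
b_prop <- coef(glm(crimes ~ treat, family = poisson,
                   data = subset(long, violent == 0)))["treat"]
b_viol <- coef(glm(crimes ~ treat, family = poisson,
                   data = subset(long, violent == 1)))["treat"]
coef(pooled)
```

With the interaction included the pooled model reproduces the two separate treatment effects, and the test of whether they differ is just the test on the `treat:violent` coefficient. Dropping the interaction gives the single pooled treatment effect.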

Because you will have multiple measures per individual, you can correct for that (by clustering the standard errors). But in the case you have many outcomes you might also want to consider a multi-level model, and actually estimate random effects for individuals and outcomes. So say instead of just violent and property crimes, you had a survey listing 20 different types of delinquency. In that case you might want to do a model that looks like:

Prob(Delinquency_ij) = f[B0 + B1*(Treatment_j) + d_j + g_i]

Where one is estimating a multi-level logistic regression equation for delinquency type i within individual j, and the g_i and d_j are the random effects for delinquency types and individuals respectively. In the case you do not have many outcomes (say only 10), the random effect distribution might be hard to estimate. In that case I would just use fixed effects for the outcome dummy variables. But I can imagine the random effects for persons are of interest in many different study designs. And this way you get one model — instead of having to interpret 20+ models.

Also you can still estimate differential treatment effects across the different items if you want to, such as by looking at the interaction of the outcome types and the treatment. But in most cases in criminology I have come across treatments are general. That is, we would expect them to decrease/increase all crime types, not just some specific violent or property crime types. So pooling the treatment effect estimate by default makes sense.

To go a bit farther: juvenile delinquency is not my bag, but offhand I don’t understand why those who examine surveys of delinquency items don’t use that multi-level model more often. Often times people aggregate the measures altogether into one overall scale, such as saying someone checked yes to 2 out of 10 violent crime outcomes, and checked yes to 5 out of 10 property crime outcomes. Analyzing those aggregated outcomes is another type of pooling, but one I don’t think is appropriate, mainly because it ignores the overall prevalence for the different items. For example, you might have an item such as "steal a car", and another that is "steal a candy bar". The former is much more serious and subsequently less likely to occur. Going with my prior examples, pooling items together like this would force the random effects for the individual delinquency types, g_i, to all equal zero. Just looking at the data one can obviously tell that is not a good assumption.

Here I will provide an example via simulation to demonstrate this in Stata. First I generate an example dataset that has 1,000 individuals and 20 yes/no outcomes. The way the data are simulated is that each individual has a specific amount of self_control that decreases the probability of an outcome (with a coefficient of -0.5), and each individual is nested within a particular group (imagine a different school) that affects whether the outcome occurs or not. In addition to this, each individual has a random intercept (drawn from a normal distribution), and each question has a different fixed prevalence.

*Stata simulation
clear
set more off
set seed 10
set obs 1000
generate caseid = _n
generate group = ceil(caseid/100) 
generate self_control = rnormal(0,1)
generate rand_int = rnormal(0,1)

*generating 20 outcomes that just have a varying intercept for each
forval i = 1/20 { 
  generate logit_`i' = -0.4 -0.5*self_control -0.1*group + 0.1*(`i'-10) + rand_int
  generate prob_`i' = 1/(1 + exp(-1*logit_`i'))
  generate outcome_`i' = rbinomial(1,prob_`i')
}
*check the true probabilities before dropping the intermediate variables
summarize prob_*
drop logit_* prob_* rand_int

And here is that final output:

. summarize prob_*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      prob_1 |      1,000    .1744795    .1516094   .0031003   .9194385
      prob_2 |      1,000    .1868849     .157952   .0034252   .9265418
      prob_3 |      1,000     .199886    .1642414   .0037841   .9330643
      prob_4 |      1,000      .21348    .1704442   .0041804   .9390459
      prob_5 |      1,000    .2276601    .1765258    .004618   .9445246
-------------+---------------------------------------------------------
      prob_6 |      1,000     .242416    .1824513   .0051012   .9495374
      prob_7 |      1,000    .2577337    .1881855   .0056347   .9541193
      prob_8 |      1,000    .2735951    .1936933   .0062236   .9583033
      prob_9 |      1,000     .289978    .1989401   .0068736    .962121
     prob_10 |      1,000    .3068564    .2038919    .007591   .9656016
-------------+---------------------------------------------------------
     prob_11 |      1,000    .3242004    .2085164   .0083827   .9687729
     prob_12 |      1,000    .3419763    .2127823   .0092562   .9716603
     prob_13 |      1,000    .3601469    .2166605   .0102197   .9742879
     prob_14 |      1,000    .3786715    .2201237   .0112824   .9766776
     prob_15 |      1,000    .3975066    .2231474   .0124542   .9788501
-------------+---------------------------------------------------------
     prob_16 |      1,000    .4166057    .2257093    .013746   .9808242
     prob_17 |      1,000    .4359203    .2277906   .0151697   .9826173
     prob_18 |      1,000       .4554    .2293751   .0167384   .9842454
     prob_19 |      1,000     .474993    .2304504   .0184663   .9857233
     prob_20 |      1,000    .4946465    .2310073   .0203689   .9870643

You can see from this list that each prob* variable then has a different overall prevalence, from around 17% for prob_1, climbing to around 50% for prob_20.

Now if you wanted to pool the items into one overall delinquency scale, you might estimate a binomial regression model (note this is not a negative binomial model!) like below (see Britt et al., 2017 for discussion).

*first I will show the binomial model in Britt
egen delin_total = rowtotal(outcome_*)
*Model 1
glm delin_total self_control i.group, family(binomial 20) link(logit)

And here are the results (note that the estimated effect of self-control is too small in magnitude, it should be around -0.5):

. glm delin_total self_control i.group, family(binomial 20) link(logit)

Iteration 0:   log likelihood =  -3536.491  
Iteration 1:   log likelihood = -3502.3107  
Iteration 2:   log likelihood = -3502.2502  
Iteration 3:   log likelihood = -3502.2502  

Generalized linear models                         No. of obs      =      1,000
Optimization     : ML                             Residual df     =        989
                                                  Scale parameter =          1
Deviance         =  4072.410767                   (1/df) Deviance =   4.117706
Pearson          =  3825.491931                   (1/df) Pearson  =    3.86804

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

                                                  AIC             =     7.0265
Log likelihood   = -3502.250161                   BIC             =  -2759.359

------------------------------------------------------------------------------
             |                 OIM
 delin_total |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

One of the things I wish the Britt paper mentioned was that the above binomial model is equivalent to a logistic regression model on the individual outcomes, but one that forces the predictions for each item to be the same within a person. So if you reshape the data from wide to long you can estimate that same binomial model as a logistic regression on the 0/1 outcomes.

*reshape wide to long
reshape long outcome_, i(caseid) j(question)
*see each person now has 20 questions each
*tab caseid

*regression model with the individual level data, should be equivalent to the aggregate binomial model
*Model 2
glm outcome_ self_control i.group, family(binomial) link(logit)

And here are the results:

. glm outcome_ self_control i.group, family(binomial) link(logit)

Iteration 0:   log likelihood = -12204.638  
Iteration 1:   log likelihood = -12188.762  
Iteration 2:   log likelihood = -12188.755  
Iteration 3:   log likelihood = -12188.755  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,989
                                                  Scale parameter =          1
Deviance         =  24377.50934                   (1/df) Deviance =   1.219546
Pearson          =  19949.19243                   (1/df) Pearson  =   .9980085

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.219975
Log likelihood   = -12188.75467                   BIC             =  -173583.3

------------------------------------------------------------------------------
             |                 OIM
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
             |
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
             |
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------

So you can see that Model 1 and Model 2 are exactly the same (in terms of estimates for the regression coefficients).
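For readers who do not have Stata handy, the same equivalence can be checked with a short R sketch. This is a smaller simulation of my own construction (200 people, no group effects, no random intercept), not a port of the data above, so the numbers will not match the Stata output.

```r
set.seed(10)
n <- 200; k <- 20
x <- rnorm(n)
# n x k matrix of 0/1 items; item j has its prevalence shifted by 0.1*(j - 10)
items <- sapply(1:k, function(j) rbinom(n, 1, plogis(-0.4 - 0.5 * x + 0.1 * (j - 10))))
total <- rowSums(items)

# aggregated binomial model: successes and failures out of k items per person
agg_fit <- glm(cbind(total, k - total) ~ x, family = binomial)

# long format: one row per person-item, as.vector() stacks the matrix by column
long_dat <- data.frame(y = as.vector(items), x = rep(x, times = k))
long_fit <- glm(y ~ x, family = binomial, data = long_dat)

# the two rows of coefficients should be identical
rbind(aggregated = coef(agg_fit), long = coef(long_fit))
```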

Model 2 though should show the limitations of using the binomial model — it predicts the same probability for each delinquency item, even though prob_1 is less likely to occur than prob_20. So for example, if we generate the predictions of this model, we can see that each question has the same predicted value.

predict prob_mod2, mu
sort question
by question: summarize outcome_ prob_mod2

And here are the results for the first four questions:

.     by question: summarize outcome_ prob_mod2

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

By construction, the binomial model on the aggregated totals is a bad fit to the data. It predicts that each question should have a probability of around 32% of occurring. Although you can’t fit the zero-inflated model discussed by Britt via the individual level logit approach (that I am aware of), that approach has the same limitation as the generic binomial model approach. Modeling the individual items just makes more sense when you have the individual items. It is hard to think of examples where such a restriction would be reasonable for delinquency items.

So here a simple update is to include a dummy variable for each item. Here I also cluster the standard errors by caseid, since the items are nested within individuals.

*Model 3
glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

And here are the results:

.     glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

Iteration 0:   log pseudolikelihood = -11748.056  
Iteration 1:   log pseudolikelihood = -11740.418  
Iteration 2:   log pseudolikelihood = -11740.417  
Iteration 3:   log pseudolikelihood = -11740.417  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,970
                                                  Scale parameter =          1
Deviance         =  23480.83406                   (1/df) Deviance =   1.175805
Pearson          =  19949.15609                   (1/df) Pearson  =   .9989562

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.177042
Log pseudolikelihood = -11740.41703               BIC             =  -174291.8

                             (Std. Err. adjusted for 1,000 clusters in caseid)
------------------------------------------------------------------------------
             |               Robust
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3858319   .0334536   -11.53   0.000    -.4513996   -.3202641
             |
       group |
          2  |  -.0620222   .1350231    -0.46   0.646    -.3266626    .2026182
          3  |   -.049852   .1340801    -0.37   0.710    -.3126442    .2129403
          4  |   .0549271   .1383412     0.40   0.691    -.2162167    .3260709
          5  |  -.1329942   .1374758    -0.97   0.333    -.4024419    .1364535
          6  |  -.4103578   .1401212    -2.93   0.003    -.6849904   -.1357253
          7  |  -.3145033   .1452201    -2.17   0.030    -.5991296   -.0298771
          8  |  -.2803599   .1367913    -2.05   0.040    -.5484659    -.012254
          9  |  -.4543686   .1431314    -3.17   0.002    -.7349011   -.1738362
         10  |  -.5962359   .1457941    -4.09   0.000    -.8819872   -.3104847
             |
    question |
          2  |   .1453902   .1074383     1.35   0.176    -.0651851    .3559654
          3  |   .1643203   .1094113     1.50   0.133     -.050122    .3787625
          4  |   .1262597   .1077915     1.17   0.241    -.0850078    .3375272
          5  |   .1830563    .105033     1.74   0.081    -.0228047    .3889173
          6  |   .3609468   .1051123     3.43   0.001     .1549304    .5669633
          7  |    .524749    .100128     5.24   0.000     .3285017    .7209963
          8  |   .5768412   .1000354     5.77   0.000     .3807754     .772907
          9  |   .7318797   .1021592     7.16   0.000     .5316513    .9321081
         10  |    .571682   .1028169     5.56   0.000     .3701646    .7731994
         11  |    .874362   .0998021     8.76   0.000     .6787535     1.06997
         12  |   .8928982   .0998285     8.94   0.000     .6972379    1.088559
         13  |   .8882734   .1023888     8.68   0.000      .687595    1.088952
         14  |   .9887095   .0989047    10.00   0.000     .7948599    1.182559
         15  |   1.165517   .0977542    11.92   0.000     .9739222    1.357111
         16  |   1.230355   .0981687    12.53   0.000     1.037948    1.422762
         17  |   1.260403   .0977022    12.90   0.000      1.06891    1.451896
         18  |   1.286065    .098823    13.01   0.000     1.092376    1.479755
         19  |   1.388013   .0987902    14.05   0.000     1.194388    1.581638
         20  |   1.623689   .0999775    16.24   0.000     1.427737    1.819642
             |
       _cons |  -1.336376   .1231097   -10.86   0.000    -1.577666   -1.095085
------------------------------------------------------------------------------

You can now see that the predicted values for each individual item are much more reasonable. In fact they are a near perfect fit.

predict prob_mod3, mu
by question: summarize outcome_ prob_mod2 prob_mod3

And the results:

.     by question: summarize outcome_ prob_mod2 prob_mod3

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .183    .0672242   .0475809   .4785903

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .205    .0729937   .0546202   .5149203

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .208    .0737455    .055606   .5196471

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .202    .0722336   .0536408   .5101407

If you want, you can also test whether any "treatment" effect (here the level of a person’s self control) has differential effects across the different delinquency items.

*Model 4
glm outcome_ self_control i.group i.question (c.self_control#i.question), family(binomial) link(logit) cluster(caseid)
*can do a test of all the interactions equal to zero at once
testparm c.self_control#i.question

I’ve omitted this output, but here the effect of self control is simulated to be the same across the different items, so one would expect to fail to reject the null that all of the interaction terms are zero.

Given the way I simulated the data, the actual correct model is a random effects one. You should notice that in each of the prior models the effect of self control is too small. One way to estimate the random effects model in Stata is below:

*Model 5
melogit outcome_ self_control i.group i.question || caseid:

And here are the results:

. melogit outcome_ self_control i.group i.question || caseid:

Fitting fixed-effects model:

Iteration 0:   log likelihood = -11748.056  
Iteration 1:   log likelihood = -11740.418  
Iteration 2:   log likelihood = -11740.417  
Iteration 3:   log likelihood = -11740.417  

Refining starting values:

Grid node 0:   log likelihood =  -10870.54

Fitting full model:

Iteration 0:   log likelihood =  -10870.54  
Iteration 1:   log likelihood = -10846.176  
Iteration 2:   log likelihood = -10845.969  
Iteration 3:   log likelihood = -10845.969  

Mixed-effects logistic regression               Number of obs     =     20,000
Group variable:          caseid                 Number of groups  =      1,000

                                                Obs per group:
                                                              min =         20
                                                              avg =       20.0
                                                              max =         20

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(29)     =    1155.07
Log likelihood = -10845.969                     Prob > chi2       =     0.0000
------------------------------------------------------------------------------
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.4744173   .0372779   -12.73   0.000    -.5474807    -.401354
             |
       group |
          2  |  -.0648018   .1642767    -0.39   0.693    -.3867783    .2571746
          3  |  -.0740465   .1647471    -0.45   0.653    -.3969449    .2488519
          4  |    .036207   .1646275     0.22   0.826     -.286457     .358871
          5  |  -.1305605   .1645812    -0.79   0.428    -.4531337    .1920126
          6  |  -.5072909   .1671902    -3.03   0.002    -.8349776   -.1796042
          7  |  -.3732567    .165486    -2.26   0.024    -.6976032   -.0489102
          8  |  -.3495889   .1657804    -2.11   0.035    -.6745126   -.0246653
          9  |  -.5593725   .1675276    -3.34   0.001    -.8877205   -.2310245
         10  |  -.7329717   .1673639    -4.38   0.000    -1.060999   -.4049445
             |
    question |
          2  |   .1690546   .1240697     1.36   0.173    -.0741177    .4122268
          3  |    .191157   .1237894     1.54   0.123    -.0514657    .4337797
          4  |   .1467393   .1243586     1.18   0.238    -.0969991    .3904776
          5  |   .2130531   .1235171     1.72   0.085     -.029036    .4551422
          6  |   .4219282   .1211838     3.48   0.000     .1844123    .6594441
          7  |   .6157484   .1194133     5.16   0.000     .3817027    .8497941
          8  |   .6776651   .1189213     5.70   0.000     .4445837    .9107465
          9  |   .8626735   .1176486     7.33   0.000     .6320865    1.093261
         10  |   .6715272   .1189685     5.64   0.000     .4383532    .9047012
         11  |   1.033571   .1167196     8.86   0.000     .8048051    1.262338
         12  |    1.05586    .116615     9.05   0.000     .8272985    1.284421
         13  |   1.050297   .1166407     9.00   0.000     .8216858    1.278909
         14  |   1.171248   .1161319    10.09   0.000     .9436331    1.398862
         15  |   1.384883   .1154872    11.99   0.000     1.158532    1.611234
         16  |   1.463414   .1153286    12.69   0.000     1.237375    1.689454
         17  |   1.499836   .1152689    13.01   0.000     1.273913    1.725759
         18  |   1.530954   .1152248    13.29   0.000     1.305117     1.75679
         19  |   1.654674   .1151121    14.37   0.000     1.429058    1.880289
         20  |   1.941035   .1152276    16.85   0.000     1.715193    2.166877
             |
       _cons |  -1.591796   .1459216   -10.91   0.000    -1.877797   -1.305795
-------------+----------------------------------------------------------------
caseid       |
   var(_cons)|   1.052621   .0676116                      .9281064     1.19384
------------------------------------------------------------------------------
LR test vs. logistic model: chibar2(01) = 1788.90     Prob >= chibar2 = 0.0000

This model comes the closest to estimating the correct effect of self control (-0.5). The estimate is still slightly too small in magnitude (at -0.47), but it is within one standard error of the true value. (Another way to estimate this model is to use xtlogit, but with melogit you can actually extract the random effects. That will have to wait until another blog post though.)

Another way to think about this model is related to item-response theory, where individuals can have a latent estimate of how smart they are, and questions can have a latent easiness/hardness. In Stata you might fit that with the code below, but be warned that it takes a while to converge. (Not sure why, the fixed effects for questions are symmetric, so assuming a random effect distribution should not be too far off. If you have any thoughts as to why let me know!)

*Model 6
melogit outcome_ self_control i.group || caseid: || question:

For an academic reference to this approach see Osgood et al. (2002). Long story short, model the individual items, but pool them together in one model!

by Andy Wheeler on December 21, 2017 • Permalink

Posted in Regression, scholarly, Stata

Tagged regression, scholarly, Stata

https://andrewpwheeler.com/2017/12/21/pooling-multiple-outcomes-into-one-regression-equation/

Don’t include temporal lags of crime in cross-sectional crime models

In my 311 and crime paper a reviewer requested I conduct cross-lagged models. That is, predict crime in 2011 while controlling for prior counts of crime in 2010, in addition to the other specific variables of interest (here 311 calls for service). In the supplementary material I detail why this is difficult with Poisson models, as the endogenous effect will often be explosive in Poisson models, something that does not happen as often in linear models.

There is a second problem with cross-lagged models that I don’t discuss there though, and it has to do with how what I think is a reasonable data generating process for crime at places can cause cross-lagged models to be biased. This is based on the fact that crime at places tends to be very temporally stable (see David Weisburd’s, or Martin Andresen’s, or my work showing that). So when you incorporate temporal lags of crime in models, this biases the estimates for the other variables of interest (311 calls, alcohol outlets, other demographics, whatever), because those variables also cause crime in the prior time period. This is equivalent to controlling for an intermediate outcome. For examples of this see some of the prior work on the relationship between crime and disorder by Boggess and Maskaly (2014) or O’Brien and Sampson (2015).¹

The simplified cross-lagged model used by Boggess and Maskaly (BM) and O’Brien and Sampson (OS) is:

(1) Crime_post = B0*Crime_pre + B1*physicaldisorder_pre

Where the post and pre periods are yearly counts of crime and indicators of physical disorder. My paper subsequently does not include the prior counts of crime, but does lag the physical disorder measures by a year to ensure they are exogenous.

(2) Crime_post = B1*physicaldisorder_pre

There are a few reasons to do these lags. The most obvious is to make the explanatory variable of broken windows exogenous, by making sure it is in the past. Lags of crime counts are most often included strictly as a control variable. There are some examples where crime begets more crime directly, such as retaliatory violence (or see Rosenfeld, 2009), but most folks who do the cross-lagged models do not make this argument.

Now, my whole argument rests on what I think is an appropriate model explaining counts of crime at places. Continuing with the physical disorder example, I think a reasonable cross-sectional model of crime at places is that there are some underlying characteristics of locations that tend to be pretty stable over fairly long periods of time, and then we have more minor stuff like physical disorder that provide small exogenous shocks to the system over time.

(3) Crime_i = B0*(physicaldisorder_i) + Z_i

Where crime at location i is a function of some fixed characteristic Z. I can’t prove this model is correct, but I believe it is better supported by data. To support this position, I would refer to the incredibly high correlations between counts of crime at places from year to year. This is true of every crime dataset I have worked with (at every spatial unit of analysis), and is a main point of Shaw and McKay’s work plus Rob Sampson’s for neighborhoods in Chicago, as well as David Weisburd’s work on trajectories of crime at street segments in Seattle. Again, this very high correlation doesn’t strike me as reasonably explained by crime causing more crime. What is more likely is that there is a set of fixed characteristics that impact criminal behavior at certain locations.

If a model of crime is like that in (3), there are then two problems with the prior equations. The first problem for both (1) and (2) is that lagging physical disorder measures by a year does not make any sense. The idea behind physical disorder (a.k.a. broken windows) is that visible signs of disorder prime people to behave in a particular way. The priming presumably needs to be recent to affect behavior. But this can simply be solved by not lagging physical disorder by a year in the model. The lagged physical disorder effect might approximate the contemporaneous effect, if physical disorder itself is temporally consistent over long periods. So if say we replace physical disorder with locations of bars, lagging bars likely does not make any difference, because bars don’t turn over that much (and when they do they are often just replaced by another bar).

But what if you still include the lags of crime counts? One may think that this controls for the omitted Z_i effect, but the effect is very bad for the other exogenous variables, especially lagged ones or temporally consistent ones. You are probably better off with the omitted random effect, because crime in the prior year is an intermediate outcome. I suspect this bias can be very large, and likely biases the effects of the other variables towards zero by quite a lot. This is because the effect of the fixed characteristic is large, the effect of the exogenous characteristic is smaller, and the two are likely correlated at least to a small amount.

To show this I conduct a simulation. SPSS Code here to replicate it. The true model I simulated is:

(4)  BW_it = 0.2*Z_i + ew_it
(5)  Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it

I generated this for 25,000 locations and two time points (the t subscript), and all the variables are set to have a variance of 1 (all variables are normally distributed). The error terms (ew_it and ec_it) are not correlated, and are set to whatever value is necessary so the resultant variable on the left hand side has a variance of 1. With so many observations one simulation run is pretty representative of what would happen even if I replicated the simulation multiple times. This specification makes BW (to stand for broken windows) and Z_i correlated.

In my run, what happens when we fit the cross-lagged model? The effect estimates are subsequently:

Lag BW:   -0.07
Lag Crime: 0.90

Yikes – effect of BW is in the opposite direction and nearly as large as the true effect. What about if you just include the lag of BW?

Lag BW: 0.22

The reason this is closer to the true effect is because of some round-about-luck. Since BW_it is correlated with the fixed effect Z_i, the lag of BW has a slight correlation to the future BW. This potentially changes how we view the effects of disorder on crime though. If BW is more variable, we can make a stronger argument that it is exogenous of other omitted variables. If it is temporally consistent it is harder to make that argument (it should also reduce the correlation with Z_i).

Still, the only reason this lag has a positive effect is that Z_i is omitted. For us to make the argument that this approximates the true effect, we have to make the argument the model has a very important omitted variable. Something one could only do as an act of cognitive dissonance.

How about using the contemporaneous effect of BW, but still including the lag counts of crime?

BW:        0.13
Lag Crime: 0.86

That is not as bad, because the lag of crime is now not an intermediate outcome. Again though, if we switch BW with something more consistent in time, like locations of bars, the lag will be an intermediate outcome, and will subsequently bias the effect. So what about a model of the contemporaneous effect of BW, omitting Z_i? The contemporaneous effect of BW will still be biased, since Z_i is omitted from the model.

BW: 0.32

But a way to reduce this bias is to introduce other control variables that approximate the omitted Z_i. Here I generate a set of 10 covariates that are a function of Z_i, but are otherwise not correlated with BW nor each other.

(6) Oth_it = 0.5*Z_i + eoth_it

Including these covariates in the model progressively reduces the bias. Here is a table showing how the BW effect shrinks as you add in more of the covariates, e.g. "with 2" means the model includes two of the control variables.

BW (with 0):  0.32
BW (with 1):  0.25
BW (with 2):  0.21
BW (with 3):  0.19
BW (with 10): 0.14

So if you include other cross-sectional covariates in an attempt to control for Z_i it brings the effect of BW closer to its true effect. This is what I believe happens in the majority of social science research that uses strictly cross-sectional models, and is a partial defense of what people sometimes refer to as kitchen sink models.
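The simulation described above can be roughly re-created outside of SPSS. What follows is a minimal Python sketch (not the SPSS code linked earlier), assuming NumPy is available. The seed, the helper names `ols` and `one_period`, and the choice to fit everything by ordinary least squares are assumptions made for illustration, so the estimates it prints will be in the same ballpark as the numbers reported above but will not match them exactly.

```python
import numpy as np

rng = np.random.default_rng(10)   # arbitrary seed (assumption)
n = 25_000                        # number of locations, as in the post

def ols(y, *xs):
    """Return OLS slopes (an intercept is fitted but not returned)."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

# Fixed location characteristic Z_i, shared by both time periods
z = rng.normal(size=n)

def one_period():
    # (4) BW_it = 0.2*Z_i + ew_it, error variance set so Var(BW) = 1
    bw = 0.2 * z + rng.normal(scale=np.sqrt(1 - 0.2**2), size=n)
    # (5) Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it, error variance set so Var(Crime) = 1
    explained = 0.1**2 + 0.9**2 + 2 * 0.1 * 0.9 * 0.2   # Var(0.1*BW + 0.9*Z)
    crime = 5 + 0.1 * bw + 0.9 * z + rng.normal(scale=np.sqrt(1 - explained), size=n)
    return bw, crime

bw1, crime1 = one_period()   # "pre" period
bw2, crime2 = one_period()   # "post" period

cross_lag = ols(crime2, bw1, crime1)     # lag BW + lag crime
lag_only = ols(crime2, bw1)              # lag BW only
contemp_lag = ols(crime2, bw2, crime1)   # contemporaneous BW + lag crime
contemp = ols(crime2, bw2)               # contemporaneous BW, Z_i omitted

# (6) ten proxies for Z_i: Oth = 0.5*Z_i + eoth, each with variance 1
oth = [0.5 * z + rng.normal(scale=np.sqrt(0.75), size=n) for _ in range(10)]
contemp_ctrl = ols(crime2, bw2, *oth)    # contemporaneous BW + 10 controls

print("Lag BW, Lag Crime:", np.round(cross_lag, 2))
print("Lag BW only:      ", np.round(lag_only, 2))
print("BW, Lag Crime:    ", np.round(contemp_lag, 2))
print("BW only:          ", np.round(contemp, 2))
print("BW with 10 ctrls: ", np.round(contemp_ctrl[0], 2))
```

The true BW effect in this setup is 0.1, so the printed estimates show how far each specification lands from it.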

So in brief, I think using lags of explanatory variables and lags of crime counts in the same cross-sectional model is a bad idea for most research designs, and can bias the effect estimates quite a lot. It is true that lagging makes the explanatory variables exogenous, but it doesn’t eliminate the more contemporaneous effect of the variable, and so we may be underestimating the effect to a very large extent. Whether or not the temporal lag affects crime has to do with how the explanatory variable itself arises, and so the effect estimated by the temporal lag is likely to be misleading (and may be biased upward or downward depending on other parts of the model).

For most cross-lagged models, I think incorporating prior crime counts is likely to introduce more bias than it removes. I believe simply using a cross-sectional model with a reasonable set of control variables will get you closer to the real effect estimates than the cross-lagged models. If you think Z_i is correlated with a variable of interest (or lags of crime really do cause future crime) I think you need to do the extra step and have multiple time measures and fit a real panel data model, not just a cross-lagged one.

I’m still not sure when you are better off fitting a panel model versus expanding the time for the cross-section though. For one example, I think you are better off estimating the effects of demographic variables in a cross-sectional model, as opposed to a panel one, over a short period of time (say less than 10 years). This is because demographic shifts simply don’t occur very fast, so there is little variance within units for a short panel.

¹ I actually came up with the idea of using 311 calls independently of Dan O’Brien’s work, see my prospectus in 2013 in which I proposed the analysis. So I’m not totally crazy, although I was a little bummed to miss the timing a bit! Four years between proposing and publishing the work is a bit depressing as well.

Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is of the null that the coefficient equals zero. Frequently there are other more interesting tests though, and this is one I’ve come across often: testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples where I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of the difference that I noted before to construct it. The standard error squared is the variance of the parameter estimate, so sqrt(1^2 + 2^2) =~ 2.24 is the standard error of the difference, which assumes the covariance between the estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure that it is an appreciably different estimate of poverty between the two models.

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other people’s work. The assumption of zero covariance for parameter estimates is not as big of a deal as it may seem. In large samples these tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending it is zero is not a terrible folly.

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make an equivalent mistake to the moderator scenario, and say that the effect of poverty is larger for property than violent because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller.

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated an equation of the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercept, and the B_1? terms are the treatment effects. Here is another example where you can stack the data and estimate an interaction term to estimate the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c*(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term, B_3c, as the different effect on females relative to males. The standard error of this interaction takes into account the covariance term, unlike estimating two totally separate equations would. (You can stack the property and violent crime outcomes I mentioned earlier in an analogous way to the subgroup example.)

The fourth and final example is the simplest: two regression coefficients in the same equation. One example is from my dissertation, the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol (such as liquor stores, bars, and gas stations) have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (this includes bars and restaurants). So let’s say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then let’s say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

                L       B  
Liquor_Stores    0.01
Bars            -0.0002 0.0025

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. You can take the ratio of the difference and its standard error, here 0.12/0.11, and treat that as a test statistic from a normal distribution. The rule that it needs to be beyond plus or minus two to be stat. significant at the 0.05 level applies, so here the difference is not statistically significant.
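The hand calculation can be wrapped in a few lines of code. This is a minimal Python sketch using only the standard library; the function name `wald_difference` is made up for illustration, and the inputs are the coefficient and variance-covariance values from the two tables above, plus the earlier across-model example of 3 (1) versus 2 (2) with the covariance assumed to be zero.

```python
import math

def wald_difference(b1, b2, var1, var2, cov12=0.0):
    """Difference between two coefficients, its standard error, z statistic and p-value.

    Uses Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B) and a normal reference
    distribution. cov12 defaults to zero, the fallback when the covariance
    is not reported.
    """
    diff = b1 - b2
    se = math.sqrt(var1 + var2 - 2.0 * cov12)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return diff, se, z, p

# Liquor stores vs. bars, same equation, covariance taken from the table
diff, se, z, p = wald_difference(0.36, 0.24, 0.01, 0.0025, cov12=-0.0002)
print(f"diff={diff:.2f} se={se:.3f} z={z:.2f} p={p:.3f}")

# Poverty effect across two models, 3 (1) vs. 2 (2), covariance assumed zero
diff_m, se_m, z_m, p_m = wald_difference(3.0, 2.0, 1.0**2, 2.0**2)
print(f"diff={diff_m:.2f} se={se_m:.3f} z={z_m:.2f} p={p_m:.3f}")
```

Passing variances rather than standard errors keeps the function lined up with the variance-covariance matrix that software reports.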

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness, here are two alternative ways to accomplish this test for the last example. One is by doing a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of Bars + Liquor Stores on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.

I give an example of doing this in R on crossvalidated. This test is nice because it extends to testing multiple coefficients, for example if I wanted to test bars = liquor stores = convenience stores. The prior individual Wald tests are not as convenient for testing the equality of more than two coefficients at once.
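To make the likelihood ratio test concrete, here is a minimal Python sketch on simulated data, assuming NumPy is available. Everything specific in it is an assumption for illustration: the simulated counts of bars and liquor stores, the "true" coefficients of 0.24 and 0.36, the seed, and the small Newton-Raphson fitter `fit_poisson` (a stand-in for whatever your stat software does). The p-value uses the fact that a chi-square with one degree of freedom is a squared standard normal.

```python
import math
import numpy as np

def fit_poisson(X, y, max_iter=50, tol=1e-10):
    """Log-link Poisson regression by Newton-Raphson; returns (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    beta[0] = math.log(y.mean())              # start from the intercept-only fit
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        # Newton step: (X' W X)^-1 X'(y - mu) with W = diag(mu)
        step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta)
    # The log(y!) constant is dropped, it cancels in the likelihood ratio
    return beta, float(np.sum(y * np.log(mu) - mu))

rng = np.random.default_rng(42)               # arbitrary seed (assumption)
n = 5_000
bars = rng.poisson(1.0, size=n)               # hypothetical counts per unit
liquor = rng.poisson(0.5, size=n)
crime = rng.poisson(np.exp(0.2 + 0.24 * bars + 0.36 * liquor))

ones = np.ones(n)
# Model 1: separate effects for bars and liquor stores
beta_full, ll_full = fit_poisson(np.column_stack([ones, bars, liquor]), crime)
# Model 2: one shared effect for the sum of bars and liquor stores
beta_red, ll_red = fit_poisson(np.column_stack([ones, bars + liquor]), crime)

lr = max(2.0 * (ll_full - ll_red), 0.0)       # guard against tiny negative rounding
p_value = math.erfc(math.sqrt(lr / 2.0))      # chi-square(1) upper tail
print(f"LR = {lr:.2f}, p = {p_value:.4f}")
```

To test more than two coefficients at once, the reduced model would sum all of the variables being constrained, and the reference distribution would have one degree of freedom per constraint, so the one-degree-of-freedom shortcut used here would no longer apply.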

Here is another way though to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (lack of intercept does not matter for discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test for the B2 coefficient is our test of interest. The logic goes like this: we can expand [eq. 2] to be:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So the test of B2 is a test of the difference between b1 and b2, since b1 - b2 = 2*B2. B2 is a little tricky to interpret in terms of effect size for how much larger b1 is than b2, as it is only half of that difference. An easier way to estimate the effect size is to insert (X-Z)/2 into the right hand side in place of (X-Z). The coefficient for that term, along with its confidence interval, will then be the estimate of how much larger the effect of X is than Z.

Note that this gives an estimate equivalent to conducting the Wald test by hand as I mentioned before.
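The equivalence can be checked numerically. Below is a minimal Python sketch with NumPy on simulated data (the coefficients 0.36 and 0.24, the seed, and the helper `ols` are assumptions for illustration). It fits [eq. 1] and computes the difference and its standard error by hand, then fits the reparameterized model with (X - Z)/2 and reads the same two numbers straight off the second coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)     # arbitrary seed (assumption)
n = 2_000
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)   # correlated regressors, so Cov(b1, b2) is not zero
y = 0.36 * x + 0.24 * z + rng.normal(size=n)

def ols(y, X):
    """OLS without an intercept; returns coefficients and their covariance matrix."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, sigma2 * np.linalg.inv(X.T @ X)

# [eq. 1]: y = b1*X + b2*Z, Wald test by hand
b, cov_b = ols(y, np.column_stack([x, z]))
diff_by_hand = b[0] - b[1]
se_by_hand = np.sqrt(cov_b[0, 0] + cov_b[1, 1] - 2 * cov_b[0, 1])

# Reparameterized: y = B1*(X + Z) + B2*(X - Z)/2, so B2 = b1 - b2
B, cov_B = ols(y, np.column_stack([x + z, (x - z) / 2]))
diff_reparam = B[1]
se_reparam = np.sqrt(cov_B[1, 1])

print(f"by hand:  diff={diff_by_hand:.4f} se={se_by_hand:.4f}")
print(f"reparam:  diff={diff_reparam:.4f} se={se_reparam:.4f}")
```

Because the second design matrix is just a linear transformation of the first, the fitted values are identical and only the meaning of the coefficients changes.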

by Andy Wheeler on October 19, 2016 • Permalink

Posted in Regression, scholarly

Tagged regression, scholarly

https://andrewpwheeler.com/2016/10/19/testing-the-equality-of-two-regression-coefficients/

Translating between the dispersion term in a negative binomial regression and random variables in SPSS

NOTE!! – when I initially posted this I was incorrect, I thought SPSS listed the dispersion term in the form of Var(x) = mean + mean*dispersion. But I was wrong, and it is Var(x) = 1 + mean*dispersion (the same as Stata’s, what Cameron and Trivedi call the NB2 model, as cited in the Long and Freese Stata book for categorical variables.) The simulation in the original post worked out because in my example I used a mean of 1; here I update it to use a mean of 2 to show the calculations are correct. (Also note that this parametrization is equivalent to Var(x) = mean*(1 + mean*dispersion), see Stata’s help for nbreg.)

When estimating a negative binomial regression equation in SPSS, it returns the dispersion parameter in the form of:

Var(x) = 1 + mean*dispersion

When generating random variables from the negative binomial distribution, SPSS does not take the parameters like this, but the more usual N trials with P successes. Stealing a bit from the R documentation for dnbinom, I was able to translate between the two with just a tedious set of algebra. So with our original distribution being:

Mean = mu
Variance = 1 + mu*a

R has an alternative representation closer to SPSS’s based on:

Mean = mu
Variance = mu + mu^2/x

Some tedious algebra will reveal that in this notation x = mu^2/(1 - mu + a*mu) (note to future self, using Solve in Wolfram Alpha could have saved some time, paper and ink). Also, R’s help for dnbinom states that in the original N and P notation p = x/(x + mu). So here with mu and a (again a is the dispersion term as reported by GENLIN in SPSS) we can solve for p.

x = mu^2/(1 - mu + a*mu)
p = x/(x + mu)

And since p is solved, R lists the mean of the distribution in the N and P notation as:

n*(1-p)/p = mu

So with p solved we can figure out N as equal to:

mu*p/(1-p) = n

So to reiterate, if you have a mean of 2 and dispersion parameter of 4, the resultant N and P notation would be:

mu = 2
a = 4
x = mu^2/(1 - mu + a*mu) = 2^2/(1 - 2 + 4*2) = 4/7
p = x/(x + mu) = (4/7)/(4/7 + 2) = 2/9
n = mu*p/(1-p) = 2*(2/9)/(7/9) = 4/7

Here we can see that in the N and P notation the equivalent negative binomial model results in a fractional number of successes, which some might be surprised is even a possibility. (There is likely an easier way to do this translation, but forgive me I am not a mathematician!)
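The arithmetic above is easy to check with exact fractions. This minimal Python sketch takes the variance form stated in this post (Variance = 1 + mu*a) as given and simply codes up the three formulas. The function name `nb_size_prob` is made up for illustration, and it will raise a division by zero error if 1 - mu + a*mu happens to equal zero.

```python
from fractions import Fraction

def nb_size_prob(mu, a):
    """Translate a mean and dispersion term into the x, p and n used above.

    Follows the formulas in the text, which take Variance = 1 + mu*a as given.
    """
    mu, a = Fraction(mu), Fraction(a)
    x = mu**2 / (1 - mu + a * mu)   # from solving 1 + mu*a = mu + mu^2/x
    p = x / (x + mu)
    n = mu * p / (1 - p)            # from n*(1-p)/p = mu
    return x, p, n

x, p, n = nb_size_prob(2, 4)
print(x, p, n)

# Sanity check: plugging x back into mu + mu^2/x should return 1 + mu*a
implied_var = Fraction(2) + Fraction(2) ** 2 / x
print(implied_var)
```

Note that n comes back equal to x, which is expected since both are the size parameter in R’s notation.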

Now we would be finished, but unfortunately SPSS’s negative binomial random functions only take integer values and do not take values of N less than 1 (R’s dnbinom does). So we have to do another translation of the N and P notation to the gamma distribution to be able to draw random numbers in SPSS. Another representation of the negative binomial model is a mixture of Poisson distributions, with the distribution of the mixtures being from a gamma distribution. Wikipedia lists a translation from the N and P notation to a gamma with shape = N and scale = P/(1-P).

So I wrapped these computations up in an SPSS macro that takes the mean and the dispersion parameter, calculates N and P under the hood, and then draws a random variable from the associated negative binomial distribution.

DEFINE !NegBinRV (mu = !TOKENS(1)
       /disp = !TOKENS(1) 
       /out = !TOKENS(1) )
COMPUTE #x = !mu**2/(1 - !mu + !disp*!mu).
COMPUTE #p = #x / (#x + !mu).
COMPUTE #n = !mu*#p/(1 - #p).
COMPUTE #G = RV.GAMMA(#n,#p/(1 - #p)).
COMPUTE !Out = RV.POISSON(#G).
FORMATS !Out (F5.0).
!ENDDEFINE.
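For readers without SPSS, the same gamma-Poisson mixture can be sketched in Python with NumPy. This is not a line-by-line port of the macro: the function name `neg_bin_rv`, the seed and the number of draws are assumptions for illustration. One thing to watch is the gamma parametrization. NumPy’s gamma has a mean of shape*scale, so the sketch passes (1 - p)/p as the scale to get a mean of mu, with p defined as in R’s notation above. That choice is specific to NumPy, and the sketch makes no claim about how SPSS’s RV.GAMMA treats its second argument.

```python
import numpy as np

def neg_bin_rv(mu, disp, size, rng):
    """Draw negative binomial variates as a gamma mixture of Poissons."""
    x = mu**2 / (1 - mu + disp * mu)   # size parameter, same formula as the macro
    p = x / (x + mu)
    n = mu * p / (1 - p)
    # NumPy's gamma has mean shape*scale, so this gives mean n*(1-p)/p = mu
    g = rng.gamma(shape=n, scale=(1 - p) / p, size=size)
    return rng.poisson(g)

rng = np.random.default_rng(0)         # arbitrary seed (assumption)
draws = neg_bin_rv(mu=2.0, disp=4.0, size=200_000, rng=rng)

# Under the variance form used in this post, expect mean 2 and variance 1 + 2*4 = 9
print(round(draws.mean(), 2), round(draws.var(), 2))
```

Comparing the sample mean and variance against mu and 1 + mu*disp is a quick check that the translation was coded consistently.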

I am not sure if it is possible to use this gamma representation and native SPSS functions to calculate the corresponding CDF and PDF of the negative binomial distribution. But we can use R to do that. Here is an example of keeping the mean at 1 and varying the dispersion parameter between 0 and 5.

BEGIN PROGRAM R.
library(ggplot2)
x <- expand.grid(0:10,1:5)
names(x) <- c("Int","Disp")
mu <- 1
x$PDF <- mapply(dnbinom, x=x$Int, size=mu^2/(1 - mu + x$Disp*mu), mu=mu)
#add in poisson 
t <- data.frame(cbind(0:10,rep(0,11),dpois(0:10,lambda=1)))
names(t) <- c("Int","Disp","PDF")
x <- rbind(t,x)
p <- ggplot(data = x, aes(x = Int, y = PDF, group = as.factor(Disp))) + geom_line()
p
#for the CDF
x$CDF <- ave(x$PDF, x$Disp, FUN = cumsum) 
END PROGRAM.

Here you can see how a larger dispersion term can easily approximate the zero inflation typical in criminal justice data (see an applied example from my work). R will not take a dispersion parameter of zero in this notation (the size would involve a division by zero and so is not defined), so I just tacked on the Poisson distribution with a mean of 1 and labeled it as a dispersion of zero.
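
To put rough numbers on the zero inflation point, here is a small Python sketch of the negative binomial PMF with a fractional size. It uses the same mean/size form that R documents for `dnbinom` (prob = size/(size + mu)), and the function name `nb_pmf` is just an illustrative choice. It prints the probability of a zero count at a mean of 1 for each dispersion term, alongside the Poisson value.

```python
import math

def nb_pmf(k, mu, disp):
    """Negative binomial PMF at integer k, allowing a fractional size."""
    size = mu**2 / (1 - mu + disp * mu)  # the post's translation of the dispersion term
    prob = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(prob) + k * math.log(1 - prob))
    return math.exp(log_pmf)

# Probability of a zero count at a mean of 1, by dispersion term
print("Poisson", round(math.exp(-1), 3))
for disp in range(1, 6):
    print(disp, round(nb_pmf(0, mu=1, disp=disp), 3))
```

The share of zeroes climbs as the dispersion term grows even though the mean stays fixed at 1, which is the pattern the line chart shows.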

Here is an example of generating random data from a negative binomial distribution with a mean of 2 and a dispersion parameter of 4. I then grab the PDF from R, and superimpose them both on a chart in SPSS (or perhaps I should call it a PMF, since it only has support on integer values). You can see the simulation with 10,000 observations is a near perfect fit (so a good sign I did not make any mistakes!)

*Simulation In SPSS.
INPUT PROGRAM.
LOOP Id = 1 TO 10000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME RandNB.

!NegBinRV mu = 2 disp = 4 out = NB.

*Making separate R dataset of PDF.
BEGIN PROGRAM R.
mu <- 2
disp <- 4
x <- 0:11
pdf <- dnbinom(x=x,size=mu^2/(1 - mu + disp*mu),mu=mu)
#add in larger than 10
pdf[max(x)+1] <- 1 - sum(pdf[-(max(x)+1)])
MyDf <- data.frame(cbind(x,pdf))
END PROGRAM.
EXECUTE.
STATS GET R FILE=* /GET DATAFRAME=MyDf DATASET=PDF_NB.
DATASET ACTIVATE PDF_NB.
FORMATS x (F2.0).
VALUE LABELS x 11 '11 or More'.

*Now superimposing bar plot and PDF from separate datasets.
DATASET ACTIVATE RandNB.
RECODE NB (11 THRU HIGHEST = 11)(ELSE = COPY) INTO NB_Cat.
FORMATS NB_Cat (F2.0).
VALUE LABELS NB_Cat 11 '11 or More'.

GGRAPH
  /GRAPHDATASET NAME="Data" DATASET='RandNB' VARIABLES=NB_Cat[LEVEL=ORDINAL] COUNT()[name="COUNT"] 
  /GRAPHDATASET NAME="PDF" DATASET='PDF_NB' VARIABLES=x pdf
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: Data=userSource(id("Data"))
  DATA: NB_Cat=col(source(Data), name("NB_Cat"), unit.category())
  DATA: COUNT=col(source(Data), name("COUNT"))
  SOURCE: PDF=userSource(id("PDF"))
  DATA: x=col(source(PDF), name("x"), unit.category())
  DATA: den=col(source(PDF), name("pdf"))
  TRANS: den_per = eval(den*100)
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(summary.percent(NB_Cat*COUNT)), shape.interior(shape.square))
  ELEMENT: point(position(x*den_per), color.interior(color.black), size(size."8"))
END GPL.

by Andy Wheeler on January 3, 2015 • Permalink

Posted in Macro, Regression, SPSS

Tagged negative-binomial, r, SPSS

https://andrewpwheeler.com/2015/01/03/translating-between-the-dispersion-term-in-a-negative-binomial-regression-and-random-variables-in-spss/

Poisson regression and crazy predictions

Here is a problem I’ve encountered a few times in my own work (and in others’) with Poisson regression models and the exponential link function. It came up recently in some discussions on the scatterplot blog by Jeremy Freese (see 1 & 2) critiquing the PNAS paper on the effect of female-named hurricanes on death tolls, so I figured I would expand upon those thoughts a little here.

The problem is that when you estimate a Poisson regression model, the exponential link function can become explosive for explanatory variables that have a large range. To be clear, we have a Poisson regression model of the form (here E[Y] means the expected value of Y):

log(E[Y]) = B1*(X)
    E[Y]  = e^(B1*X)

If X has a small range this may be fine, but if X has a large range it can become problematic. Suppose Y is hurricane deaths, X is the monetary damage of the hurricane, and B1 = 0.01. Let’s say the monetary damage ranges from 1 to 1000 (imagine these are in thousands of dollars, so they range between $1,000 and $1 million). What happens with the predictions?

E[Deaths] = e^(0.01*   1) =     1.01
E[Deaths] = e^(0.01*   5) =     1.05
E[Deaths] = e^(0.01*  10) =     1.11
E[Deaths] = e^(0.01*  50) =     1.65
E[Deaths] = e^(0.01* 100) =     2.72
E[Deaths] = e^(0.01* 500) =   148
E[Deaths] = e^(0.01*1000) = 22026

These predictions are invariant to linear transformations. That is, Z-scoring X doesn’t change the predictions (just as expressing X in [dollars/1000] makes no difference compared to including [X] in its original units on the right hand side). The coefficient B1 will simply be scaled by the appropriate inverse transformation. I’d also note that, expressed as an incidence rate ratio, the effect would be e^0.01 = 1.01. On its face this appears to be a totally innocuous effect, and only when you consider the variation in X does it appear absurd.

You can see that if the range of X were smaller, say between 1 and 100, the predictions might be fine. The predictions between 1 and 100 only vary by 1.7 deaths. The problem with these explosive predictions at larger values is that they are nonsense for most social scientific research. A simple sanity check to see if this is occurring is to check the predicted value from your Poisson regression equation at the low end of X versus the high end (and just pretend all of the other explanatory variables are set to 0), exactly as I have done here. If the high end is crazy, you will need to consider some alternative model specification (or be very clear that the model cannot be extrapolated to the larger values of X).
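
The sanity check is easy to script. Below is a minimal Python sketch using the made-up coefficient from this example. The function name `poisson_prediction` is hypothetical, and it assumes all other explanatory variables are set to 0, as described above.

```python
import math

def poisson_prediction(b, x):
    """Expected count under log(E[Y]) = b*x, with other covariates set to 0."""
    return math.exp(b * x)

b1 = 0.01  # made-up coefficient from the example
for x in (1, 5, 10, 50, 100, 500, 1000):
    print(f"X = {x:>4}  E[Y] = {poisson_prediction(b1, x):10.2f}")

# Compare the low end of X to the high end
low, high = poisson_prediction(b1, 1), poisson_prediction(b1, 1000)
print(f"high/low ratio: {high / low:.0f}")
```

If the ratio between the two ends is far outside anything plausible for the outcome, that is the cue to try a different specification.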

A useful alternative parametrization is simply to log X. When you exponentiate the right hand side, the prediction then becomes a power of X in its original metric.

So imagine we fit the model:

log(E[Y]) = B2*(log(X))
    E[Y]  = e^(B2*log(X))
          = X^B2

Let’s say here that B2 = 0.5. What happens to our predictions again?

E[Deaths] =    1^0.5 =  1
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

Those predictions look a little easier to swallow at the larger values. Notice also the differences in predictions at the smaller values? There is more discrimination among the smaller values than on the original scale, but the larger values are suppressed. Let’s consider the predictions side by side for easier comparison.

   X   e^(B1*X)   X^B2
   -   --------   ----
   1          1      1
   5          1      2
  10          1      3
  50          2      7
 100          3     10
 500        148     22
1000      22026     32
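
The side by side comparison can be reproduced with a few lines of Python. This is only a sketch with the two made-up coefficients from the examples (B1 = 0.01 on X and B2 = 0.5 on log(X)). It also confirms numerically that exponentiating B2*log(X) is the same as raising X to the power B2.

```python
import math

b1, b2 = 0.01, 0.5  # made-up coefficients from the two examples
rows = [(x, math.exp(b1 * x), x ** b2) for x in (1, 5, 10, 50, 100, 500, 1000)]

print(f"{'X':>5} {'e^(B1*X)':>9} {'X^B2':>6}")
for x, linear_x, log_x in rows:
    # exp(B2*log(X)) and X^B2 are the same quantity
    assert math.isclose(math.exp(b2 * math.log(x)), log_x)
    print(f"{x:>5} {linear_x:>9.0f} {log_x:>6.0f}")
```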

A frequent problem with logging explanatory variables is that they contain zeroes. A simple alternative is to treat log(0) as 0 and then include a separate dummy variable, D, equal to 1 when X = 0. This model may not make Occam happy, as it implies a discontinuity at 0, but it is in my opinion a small price to pay. Also, if there are a lot of zeroes, it doesn’t strike me as totally unrealistic to have a mixture of what happens at 0 and what happens at the higher values. So the full model written out would be:

log(E[Y]) = B3*D + B4*(log(X))

But the model is essentially discontinuous. When X=0, we treat log(X)=0 and D=1, so the model reduces to:

log(E[Y]) = B3*D :When X = 0

When X>0, D=0 and the model reduces to:

log(E[Y]) = B4*(log(X)) :When X > 0

Now, it certainly would be weird if B3>>0, as this would imply a high spike at 0, after which Y goes back down to 1 at X = 1 and then increases with X. If we expect B4 to be positive, then a negative value of B3 (or one very close to 0) would make the most sense. It is still a discontinuity in the function, but one that may make theoretical sense. So imagine we fit the equation log(E[Y]) = B3*D + B4*(log(X)), and let’s say B4 is 0.5 (the same as B2) and B3 is equal to -0.1. This would then make the set of predictions go:

E[Deaths] =   e^-0.1 =  0.9
E[Deaths] =    1^0.5 =  1.0
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

So in this made up example the discontinuity pretty much fits right in with the rest of the function. We may consider other non-linear transformations of X as well (splines or higher powers), but frequently an additional problem is that the bulk of the data lie in the lower end of the range. So if our dollars variable were highly right skewed, there may only be a few values at 100 or higher. These can be highly influential if you use powers of X (e.g. include X^2, X^3 etc. on the right hand side), so splines are a better choice, but essentially no matter how you fit the function it will be hard to verify the fit at these values or extrapolate to those tails. So the fit of the original function may be fine, but it just implies unrealistic marginal effects in the tails.
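
For concreteness, the zero-dummy specification described above can be sketched in Python as follows. The function name `predict_with_zero_dummy` is hypothetical, and the coefficients are the made-up values from the example (B3 = -0.1, B4 = 0.5).

```python
import math

def predict_with_zero_dummy(x, b3, b4):
    """E[Y] under log(E[Y]) = b3*D + b4*log(X), with log(0) treated as 0 and D = 1 when X = 0."""
    d = 1 if x == 0 else 0
    log_x = 0.0 if x == 0 else math.log(x)
    return math.exp(b3 * d + b4 * log_x)

for x in (0, 1, 5, 10, 50, 100, 500, 1000):
    print(x, round(predict_with_zero_dummy(x, b3=-0.1, b4=0.5), 1))
```

The first line printed is the X = 0 case, where only the dummy contributes; every other line matches the X^B4 predictions.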

So how do we verify one equation over the other? Visualizing count data in scatterplots tends to be harder than visualizing continuous data, especially if there is a large pile of data at 0. The problem is simply exacerbated if the explanatory variable has a similar right skew: there will be a large mass near the origin of the plot and very sparse data everywhere else.

My simple suggestion is to just bin the data at X values, which is very simple if X is integer valued, and then plot the mean and standard error of the Y value within those bins. As Poisson regression and its variants rely on asymptotic properties, if the error bars are too variable to deduce a pattern you should be concerned your sample size isn’t large enough to begin with.
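
Here is a rough Python sketch of that binning step for an integer valued X. The helper name `binned_means` and the tiny dataset are invented purely for illustration; the resulting means and standard errors would then go into whatever plotting tool you prefer.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def binned_means(xs, ys):
    """Group Y by integer X and return {x: (mean, standard error, n)} for each bin."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[x].append(y)
    out = {}
    for x in sorted(bins):
        vals = bins[x]
        # A standard error needs at least two observations in the bin
        se = stdev(vals) / math.sqrt(len(vals)) if len(vals) > 1 else float("nan")
        out[x] = (mean(vals), se, len(vals))
    return out

# Tiny made-up dataset, only to show the shape of the output
xs = [0, 0, 0, 0, 1, 1, 1, 2, 2, 5]
ys = [0, 1, 0, 1, 1, 2, 0, 3, 1, 4]
for x, (m, se, n) in binned_means(xs, ys).items():
    print(x, round(m, 2), round(se, 2), n)
```

Bins with a single observation get a missing standard error, which is itself a useful flag that the tail of X is too sparse to say much about.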

If the bulk of the data only have a small range over X, then it will be hard in practice to differentiate between the two model parametrizations I suggest here (with the typical noisy data we have in the social sciences). So you may prefer logging the X variable simply to prevent the dramatic explosions in the tails of the data right from the start.

I do feel comfortable saying that if the ratio of the largest to smallest value of the independent variable is over 100, you should check the predictions of the exponential link function very closely (if the smallest value is 0 just estimate the ratio as if the smallest value is 1). If this ratio is 100 or larger, unless the coefficient in the Poisson regression equation is very small (<<.01) the predictions may explode into very implausible ranges for the larger X values.

by Andy Wheeler on June 17, 2014 • Permalink

Posted in Regression, scholarly

Tagged Poisson, scholarly

https://andrewpwheeler.com/2014/06/17/poisson-regression-and-crazy-predictions/
