Social scientists often face a problem when conducting analysis: we have theories that are not tightly coupled to actual measures of individual behavior. A common response is to estimate models of many different, interrelated measures. These can be different outcome variables, e.g. if I know poverty predicts all crimes, does poverty predict both violent crime and property crime at the city level? Or different explanatory variables, e.g. does being a minority reduce your chances of getting a job interview, or does it matter what type of minority you are, e.g. Black, Asian, Hispanic, Native American, etc.?
Another situation is conducting analysis across different units of analysis, e.g. seeing whether a treatment has a different effect for males than for females, or whether a treatment works well in one country but not in another. Or if I find that a policy intervention works at the city level, are the effects spread across all areas of the city, or concentrated in just some neighborhoods?
On their face, these may seem like unique problems. They are not; they are all different variants of subgroup analysis. In my dissertation, in trying to identify situations in which you need to use small geographic units of analysis, I realized that the logic behind choosing a geographic unit of analysis is the same as the logic behind these different subgroup analyses. I outline my reasoning more succinctly in this article, but I will give it a try in a blog post as well.
I am what I would call a “reductionist social scientist”. In plain terms, if we fit a model:
Y = B*X
we can always get more specific, either in terms of explanatory variables:
Y = b1*x1 + b2*x2, where X = x1 + x2
Or in terms of the outcome:
y1 = b1*X
y2 = b2*X, where Y = y1 + y2
Hubert Blalock talks about this in his book Causal Inferences in Nonexperimental Research. I think many social scientists are reductionists in this sense: we can always estimate more specific explanatory variables, or more specific outcomes, or effects within different subgroups, ad nauseam. Thus the problem is not whether we should conduct analysis for some particular subgroup, but when we should be satisfied that the aggregate effect we estimate is good enough, or at least not misleading.
So remember this when evaluating a model: the aggregate effect is a function of the subgroup effects. In linear models the math is easy (I show some examples in my linked paper), but the logic generally holds for non-linear models as well. So when should we be ok with the aggregate effect? We should be ok with it if we expect the direction and size of the effects in the subgroups to be similar. We should not be ok if the effects are countervailing in the subgroups, or if the magnitude of the differences is very large.
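To make the math concrete, here is a minimal simulation sketch in Python (my own illustration, not from the linked paper) of splitting the explanatory variable into two components. When the component effects are equal, the aggregate slope recovers that common effect; when they are countervailing, the aggregate washes out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Aggregate measure X is the sum of two components, x1 and x2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1 + x2])

def aggregate_slope(b1, b2):
    """OLS slope of Y on the aggregate X when the truth is b1*x1 + b2*x2."""
    Y = b1 * x1 + b2 * x2 + rng.normal(size=n)
    return np.linalg.lstsq(X, Y, rcond=None)[0][1]

print(aggregate_slope(0.5, 0.5))   # ~0.5: equal subgroup effects, aggregate is fine
print(aggregate_slope(0.8, 0.2))   # ~0.5: an average of the two, still not misleading
print(aggregate_slope(0.5, -0.5))  # ~0.0: countervailing effects cancel out
```

With equal variances for the two components, the aggregate slope here works out to the simple average of the component slopes; in general it is a variance-weighted combination.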
For some simple examples, take our job interview example for minorities relative to whites:
Prob(Job Interview) = 0.5 + 0*(Minority)
So here the effect is zero; minorities have the same probability of an interview as white individuals, 50%. But let's say we estimate an effect for different minority categories:
Prob(Job Interview) = 0.5 + 0.3(Asian) - 0.3(Black)
Our aggregate effect for minorities is zero, because it is positive for Asian individuals and negative for Black individuals, and in the aggregate these two effects cancel out. That is one situation in which we should be worried.
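A quick simulation sketch of that cancellation (the group shares are my own assumption; the probabilities match the equation above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical applicant pool: half white, Asian and Black split evenly
group = rng.choice(["white", "asian", "black"], size=n, p=[0.5, 0.25, 0.25])
p_interview = 0.5 + 0.3 * (group == "asian") - 0.3 * (group == "black")
interview = rng.random(n) < p_interview

minority = group != "white"
print(interview[~minority].mean())          # ~0.5 for white applicants
print(interview[minority].mean())           # ~0.5 in aggregate: looks like no effect
print(interview[group == "asian"].mean())   # ~0.8
print(interview[group == "black"].mean())   # ~0.2
```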
Now how about the effect of poverty on crime:
All Crime = 5*(Percent in Poverty)
Versus different subsets of crime, violent and property.
Violent Crime = 3*(Percent in Poverty)
Property Crime = 2*(Percent in Poverty)
Here we can see that the subgroups contribute to the total (3 + 2 = 5), but the effect for property crime is slightly smaller than that for violent crime. Here the aggregate effect is not misleading, but the micro level effects may be theoretically interesting.
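The same additive logic in a short simulation sketch (hypothetical data; I use Poisson counts so the crime totals stay non-negative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

poverty = rng.uniform(0, 30, size=n)   # hypothetical percent in poverty
violent = rng.poisson(3 * poverty)     # true slope of 3
property_ = rng.poisson(2 * poverty)   # true slope of 2
all_crime = violent + property_

X = np.column_stack([np.ones(n), poverty])
for y in (violent, property_, all_crime):
    print(np.linalg.lstsq(X, y, rcond=None)[0][1])
# ~3, ~2, ~5: the aggregate slope is the sum of the subgroup slopes
```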
For the final example, let's say we do a gun buy back program, and we estimate the reduction in shootings at the city-wide level. Say we estimate the number of shootings per month:
Shootings in the City = 10 - 5*(Gun Buy Back)
So we say the gun buy back reduced shootings by 5 per month. Maybe we think the reduction is restricted to certain areas of the city. For simplicity, say this city only has two neighborhoods, North and South. So we estimate the effect of the gun buy back in these two neighborhoods:
Shootings in the North Neighborhood = 9 - 5*(Gun Buy Back)
Shootings in the South Neighborhood = 1 - 0*(Gun Buy Back)
Here we find the program only reduced shootings in the North neighborhood; it had no appreciable effect in the South neighborhood. Note that the neighborhood equations sum to the city-wide equation (9 + 1 = 10 and -5 + 0 = -5). The aggregate city level effect is not misleading; we can just be more specific about decomposing that effect across different areas.
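And a simulation sketch of the neighborhood decomposition (hypothetical monthly counts matching the equations above):

```python
import numpy as np

rng = np.random.default_rng(3)
months = 5_000
buyback = rng.integers(0, 2, size=months)   # 0/1 indicator for program months

north = rng.poisson(9 - 5 * buyback)        # true effect of -5
south = rng.poisson(1 + 0 * buyback)        # true effect of 0
city = north + south

for y in (north, south, city):
    print(y[buyback == 1].mean() - y[buyback == 0].mean())
# ~-5, ~0, ~-5: the city-wide effect is the sum of the neighborhood effects
```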
Here I will relate this to some of my recent work — using 311 calls for service to predict crime at micro places in DC.
In a nutshell, I’ve fit models of the form:
Crime = B*(311 Calls for Service)
And I found that 311 calls have a positive, but small, effect on crime.
Over time, either at presentations in person or in peer review, I’ve gotten three different “subgroup” critiques. These are:
- I think you cast the net too wide in 311 calls, e.g. “bulk collections” should not be included
- I think you shouldn’t aggregate all crime on the left hand side, e.g. I think the effect is mostly for robberies
- I think you shouldn’t estimate one effect for the entire city, e.g. I think these signs of disorder matter more in some neighborhoods than others
Now, these are all reasonable questions, but do they call into question my main aggregate finding? Not at all.
For casting the net too wide with 311 calls: do you think that bulk collections have a negative relationship to crime? Unlikely. (I've eliminated them from my current article due to one reviewer complaint, but to be honest I think they should be included. Seeing a crappy couch on the street is not much different from seeing garbage.)
For all crime on the left hand side: do you think 311 calls have a negative effect on some crimes, but a positive effect on others? Again, unlikely. It may be the case that they have larger effects on some crimes than others, but that does not mean the effect on all crime is misleading. So what if the effect is larger for robberies than for other crimes? You can go and build a theory about why that is the case and test it.
For the one estimate across different parts of the city: do you think 311 calls have a negative effect in some parts, and a positive effect in others? Again, unlikely. It may be that the effect is larger in some areas, but overall we expect it to be positive or zero in all areas of the city. The aggregate city-wide effect is not likely to be misleading.
These are all fine as future research questions, but I get frustrated when they are given as reasons to critique my current findings. They don't invalidate the aggregate findings at all.
In response to this, you may think: well, why not conduct all these subgroup analyses? What's the harm? There are a few different harms to conducting these subgroup analyses willy-nilly. They are all related to chasing the noise and then interpreting it.
For each of these subgroups, you will have less power to estimate effects than in the aggregate. Say I test the effect of each individual 311 call type (there are nearly 30 that I sum together). Simply by chance some of these will have null effects or slightly negative effects, and all will be small by themselves. I have no a priori reason to think some have a different effect than others; the theory behind why they are related to crime at all (Broken Windows) does not distinguish between them.
This often ends up being a catch-22 in peer review. You do more specific analysis, by chance a coefficient goes in the wrong direction, and the reviewer interprets it as evidence that your measures and/or model are bunk. In reality they are just over-interpreting noise.
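Here is a simulation sketch of that noise chasing (hypothetical numbers; I assume 30 call types with the exact same small true effect, which is not the actual 311 data). Even though every type has an identical positive effect by construction, splitting them apart produces estimates that scatter, and some will typically come out negative purely by chance:

```python
import numpy as np

rng = np.random.default_rng(4)
n_places, n_types = 2_000, 30

# Every call type has the same small true effect (0.05) by construction
calls = rng.poisson(2, size=(n_places, n_types))
crime = rng.poisson(1 + 0.05 * calls.sum(axis=1))

# Regress crime on all 30 call types separately entered
X = np.column_stack([np.ones(n_places), calls])
coefs = np.linalg.lstsq(X, crime, rcond=None)[0][1:]

print(coefs.round(2))      # estimates scatter widely around the true 0.05
print((coefs < 0).sum())   # a few flip negative purely by chance
```

A reviewer looking at that table of 30 coefficients would be tempted to spin a story about the negative ones, when the only thing going on is sampling noise.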
That is in response to reviewers, but what about conducting subgroup analysis on your own? Here you have to worry about the garden of forking paths. Say I conduct the subgroup analysis for different types of crime outcomes, and they are all in the same direction except for thefts from autos. I then report all of the results except for thefts from autos, because that result does not confirm my theory. This is a large current problem in reproducing social science findings: a subgroup analysis may seem impressive, but you have to worry about which results the researcher cherry picked.
Only reporting confirmatory evidence for some subgroups will always be a problem in data analysis; not even pre-registration of your data plan will solve it. Thus, you should only do subgroup analysis if there is strong theoretical reasoning to think the aggregate effect is misleading. If you think subgroup differences may be theoretically interesting, you should simply plan a new study from the start to assess them on their own.
Given some of the reviews I received for the 311 paper, I am stuffing many of these subgroup analyses into appendices just to preempt reviewer critique. (Ditto for my paper on alcohol outlets and crime I will be presenting at ASC in a few weeks, which will be up on SSRN likely next week.) I don't think it is the right thing to do though; again, I think it is mostly noise mining, but perpetually nit-picky reviewers have basically forced me to do it.