p-values with large samples (AMA)

Vishnu K, a doctoral student in Finance writes in a question:

Dear Professor Andrew Wheeler

Hope you are fine. I am big follower of your blog and have used it heavily to train myself. Since you welcome open questions, I thought of asking one here and I hope you don’t mind.

I was reading the blog Dave Giles and one of his blogs https://davegiles.blogspot.com/2019/10/everythings-significant-when-you-have.html assert that one must adjust for p values when working with large samples. In a related but old post, he says the same

“So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this really bad news. Much, much, smaller p-values are needed before we get all excited about ‘statistically significant’ results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned” https://davegiles.blogspot.com/2011/04/drawing-inferences-from-very-large-data.html#more

In one of the posts of Andrew Gelman, he said the same

“When the sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values.

Large n, likely to find something. Huge n, almost certain to find lots of small p-values” https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/

As Leamer (1978) points if the level of significance should be set as a decreasing function of sample size, is there a formula through which we can check the needed level of significance for rejecting a null?

Context 1: Sample Size is 30, number of explanatory variables are 5

Context 2: Sample Size is 1000, number of explanatory variables are 5

In both contexts cant, we use p-value <.05 or should we fix a very small p-value for context 2 even though both contexts relates to same data set with difference in context 2 being a lot more data points!

Worrying about p-values here is in my opinion the wrong way to think about it. You can focus on the effect size, and even if an effect is significant, it may be substantively too small to influence how you use that information.

Finance I see, so I will try to make a relevant example. Lets say a large university randomizes students to take a financial literacy course, and then 10 years later follows up to see their overall retirement savings accumulated. Say the sample is very large, and we have results of:

Taken Class: N=100,000 Mean=5,000 SD=2,000
   No Class: N=100,000 Mean=4,980 SD=2,000

SE of Difference ~= 9
Mean Difference = 20
T-Stat ~= 2.24
p-value ~= 0.025

So we can see that the treated class saves more! But it is only 20 dollars more over ten years. We have quite a precise estimate. Even though those who took the class save more, do you really think taking the class is worth it? Probably not based on these stats, it is such a trivial effect size given the sample and overall variance of savings.

And then as a follow up from Vishnu:

Thanks a lot Prof Andrew. One final question is, Can we use the Cohen’s d or any other stats for effect size estimation in these cases?

Cohen’s d = (4980 – 5000) ⁄ 2000 = 0.01.

I don’t personally really worry about Cohen’s D to be honest. I like to try to work out the cost-benefits on the scales that are meaningful (although this makes it difficult to compare across different studies). So since I am a criminologist, I will give a crime example:

Treated Areas: 40 crimes
Non-Treated Areas: 50 crimes

Ignore the standard error for this at the moment. Whether a drop in 10 crimes “is worth it” depends on the nature of the treatment and the type of crime. If the drop is simply stealing small items from the store, but the intervention was hire 10 security guards, it is likely not worth it (the 10 guards salary is likely much higher than the 10 items they prevented theft of).

But pretend now that the intervention was nudging police officers to patrol more in hot spots (so no marginal labor cost) and the crimes we examined were shootings. Preventing 10 shootings is a pretty big deal, because they have such large societal costs.

In this scenario the costs-benefits are always on the count scale (how many crimes did you prevent). Doing another scale (like Cohen’s D or incident rate ratios or whatever) just obfuscates how to calculate the costs/benefits in this scenario.

Precision in measures and policy relevance

Too busy to post much recently – will hopefully slow down a bit soon and publish some more technical posts, but just a quick opinion post for this Sunday. Reading a blog post by Callie Burt the other day – I won’t comment on the substantive critique of the Harden book she is discussing (since I have not read it), but this quote struck me:

precise point estimates are generally not of major interest to social scientists. Nearly all of our measures, including our outcome measures, are noisy, (contain error), even biased. In general, what we want to know is whether more of something (education, parental support) is associated with more (or less) of something else (income, education) that we care about, ideally with some theoretical orientation. Frequently the scale used to measure social influences is somewhat arbitrary anyway, such that the precise point estimate (e.g., weeks of schooling) associated with 1 point increase in the ‘social support scale’ is inherently vague.

I think Callie is right, precise point estimates often aren’t of much interest in general criminology. I think this perspective is quite bad though for our field as a whole in terms of scientific advancement. Most criminology work is imprecise (for various reasons), and because of this it has no hope to be policy relevant.

Lets go with Callie’s point about education is associated with income. Imagine we have a policy proposal that increases high school completion rates via allocating more money to public schools (the increased education), and we want to see its improvement on later life outcomes (like income). Whether a social program “is worth it” depends not only whether it is effective in increasing high school completion rates, but by how much and how much return on investment there is those later life outcomes we care about. Programs ultimately have costs; both in terms of direct costs as well as opportunity costs to fund some other intervention.

Here is another more crim example – I imagine most folks by now know that bootcamps are an ineffective alternative to incarceration for the usual recidivism outcomes (MacKenzie et al., 1995). But what folks may not realize is that bootcamps are often cheaper than prison (Kurlychek et al., 2011). So even if they do not reduce recidivism, they may still be worth it in a cost-benefit analysis. And I think that should be evaluated when you do meta-analyses of CJ programs.

Part of why I think economics is eating all of the social sciences lunch is not just because of the credibility revolution, but also because they do a better job of valuating costs and benefits for a wide variety of social programs. These cost estimates are often quite fuzzy, same as more general theoretical constructs Callie is talking about. But we often can place reasonable bounds to know if something is effective enough to be worth more investment.

There are a smattering of crim papers that break this mold though (and to be clear you can often make these same too fuzzy to be worthwhile critiques for many of my papers). For several examples in the policing realm Laura Huey and her Canadian crew have papers doing a deep dive into investigation time spent on cases (Mark et al., 2019). Another is Lisa Tompson and company have a detailed program evaluation of a stalking intervention (Tompson et al., 2021). And for a few papers that I think are very important are Priscilla Hunt’s work on general CJ costs for police and courts given a particular UCR crime (Hunt et al., 2017; 2019).

Those four papers are definitely not the norm in our field, but personally think are much more policy relevant than the vast majority of criminological research – properly estimating the costs is ultimately needed to justify any positive intervention.

References

  • Hunt, P., Anderson, J., & Saunders, J. (2017). The price of justice: New national and state-level estimates of the judicial and legal costs of crime to taxpayers. American Journal of Criminal Justice, 42(2), 231-254.
  • Hunt, P. E., Saunders, J., & Kilmer, B. (2019). Estimates of law enforcement costs by crime type for benefit-cost analyses. Journal of Benefit-Cost Analysis, 10(1), 95-123.
  • Kurlychek, M. C., Wheeler, A. P., Tinik, L. A., & Kempinen, C. A. (2011). How long after? A natural experiment assessing the impact of the length of aftercare service delivery on recidivism. Crime & Delinquency, 57(5), 778-800.
  • MacKenzie, D. L., Brame, R., McDowall, D., & Souryal, C. (1995). Boot camp prisons and recidivism in eight states. Criminology, 33(3), 327-358.
  • Tompson, L., Belur, J., & Jerath, K. (2021). A victim-centred cost–benefit analysis of a stalking prevention programme. Crime Science, 10(1), 1-11.
  • Mark, A., Whitford, A., & Huey, L. (2019). What does robbery really cost? An exploratory study into calculating costs and ‘hidden costs’ of policing opioid-related robbery offences. International Journal of Police Science & Management, 21(2), 116-129.

Knowing when to fold them: A quantitative approach to ending investigations

The recent work on investigations in the criminal justice field has my head turning about potential quantitative applications in this area (check out the John Eck & Kim Rossmo podcasts on Jerry’s site first, then check out the recent papers in Criminology and Public Policy on the topic for a start). One particular problem that was presented to me was detective case loads — detectives are humans, so can only handle so many cases at once. Triage is typically taken at the initial crime reporting stage, with inputs such as seriousness of the offense, the overall probability of the case being solved, and future dangerousness of folks involved being examples of what goes into that calculus to assign a case.

Here I wanted to focus on a different problem though — how long to keep cases open? There are diminishing returns to keeping cases open indefinitely, and so PDs should be able to right size the backend of detective open cases as well as the front end triaging. Here my suggested solution is to estimate a survival model of the probability of a case being solved, and then you can estimate an expected return on investment given the time you put in.

Here is a simplified example. Say the table below shows the (instantaneous) probability of a case being solved per weeks put into the investigation.

Week 1  20%
Week 2  10%
Week 3   5%
Week 4   3%
Week 5   1%

In survival model parlance, this would be the hazard function in discrete time increments. And then we have diminishing probabilities over time, which should also be true (e.g. a higher probability of being solved right away, and gets lower over time). The expected return of investigating this crime at time t is the cumulative probability of the crime being solved at time t, multiplied by whatever value you assign to the case being solved. The costs of investigating will be fixed (based on the detective salary), so is just a multiple of t*invest_costs.

So just to fill in some numbers, lets say that it costs the police department $1,000 a week to keep an investigation going. Also say a crime has a return of $10,000 if it is solved (the latter number will be harder to figure out in practice, as cost of crime estimates are not a perfect fit). So filling in our table, we have below our detective return on investment estimates (note that the cumulative probability of being solved is not simply the sum of the instantaneous probabilities, else it would eventually go over 100%). So return on investment (ROI), at week 1 is 10,000*0.2 = 2,000, at week 2 is 10,000*0.28 = 2,800, etc.

        h(t) solved%  cum-costs   ROI   
Week 1  20%    20%     1,000     2,000
Week 2  10%    28%     2,000     2,800
Week 3   5%    32%     3,000     3,200
Week 4   3%    33%     4,000     3,300
Week 5   1%    34%     5,000     3,400

So the cumulative costs outweigh the total detective resources devoted to the crime by Week 4 here. So in practice (in this hypothetical example) you may say to a detective you get 4 weeks to figure it out, if not solved by then it should be closed (but not cleared), and you should move onto other things. In the long run (I think) this strategy will make sure detective resources are balanced against actual cases solved.

This right sizes investigation lengths from a global perspective, but you also might consider whether to close a case on an individual case-by-case basis. In that case you wouldn’t calculate the sunk cost of the investigation so far, it is just the probability of the case being solved going forward relative to future necessary resources. (You do the same table, just start the cum-costs and solved percent columns from scratch whenever you are making that decision.)

In an actual applied setting, you can estimate the survival function however you want (e.g. you may want a cure mixture-model, so not all cases will result in 100% being solved given infinite time). It is also the case that different crimes will not only have different survival curves, but also will have different costs of crime (e.g. a murder has a greater cost to society than a theft) and probably different investigative resources needed (detective costs may also get lower over time, so are not constant). You can bake that all right into this estimate. So you may say the cost of a murder is infinite, and you should forever keep that case open investigating it. A burglary though may be a very short time interval before it should be dropped (but still have some initial investment).

Another neat application of this is that if you can generate reasonable returns to solving crimes, you can right size your overall detective bureau. That is you can make a quantitative argument I need X more detectives, and they will help solve Y more crimes resulting in Z return on investment. It may be we should greatly expand detective bureaus, but have them only keep many cases open a short time period. I’m thinking of the recent officer shortages in Dallas, where very few cases are assigned at all. (Some PDs have patrol officers take initial detective duties on the crime scene as well.)

There are definitely difficulties with applying this approach. One is that getting the cost of solving a crime estimate is going to be tough, and bridges both quantitative cost of crime estimates (although many of them are sunk costs after the crime has been perpetrated, arresting someone does not undo the bullet wound), likelihood of future reoffending, and ethical boundaries as well. If we are thinking about a detective bureau that is over-booked to begin with, we aren’t deciding on assigning individual cases at that point, but will need to consider pre-empting current investigations for new ones (e.g. if you drop case A and pick up case B, we have a better ROI). And that is ignoring the estimating survival part of different cases, which is tricky using observational data as well (selection biases in what cases are currently assigned could certainly make our survival curve estimates too low or too high).

This problem has to have been tackled in different contexts before (either by actuaries or in other business/medical contexts). I don’t know the best terms to google though to figure it out — so let me know in the comments if there is related work I should look into on solving this problem.