# p-values with large samples (AMA)

Vishnu K, a doctoral student in finance, writes in with a question:

Dear Professor Andrew Wheeler

Hope you are fine. I am a big follower of your blog and have used it heavily to train myself. Since you welcome open questions, I thought I would ask one here, and I hope you don’t mind.

I was reading Dave Giles’s blog, and one of his posts https://davegiles.blogspot.com/2019/10/everythings-significant-when-you-have.html asserts that one must adjust p-values when working with large samples. In a related but older post, he says the same:

“So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this really bad news. Much, much, smaller p-values are needed before we get all excited about ‘statistically significant’ results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned” https://davegiles.blogspot.com/2011/04/drawing-inferences-from-very-large-data.html#more

Andrew Gelman said the same in one of his posts:

“When the sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values.

Large n, likely to find something. Huge n, almost certain to find lots of small p-values” https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/

As Leamer (1978) points out, the level of significance should be set as a decreasing function of sample size. Is there a formula through which we can check the needed level of significance for rejecting a null?

Context 1: Sample Size is 30, number of explanatory variables are 5

Context 2: Sample Size is 1000, number of explanatory variables are 5

In both contexts, can’t we use p-value < .05, or should we fix a much smaller p-value for context 2, even though both contexts relate to the same data set, the difference being that context 2 has a lot more data points?

Worrying about p-values here is, in my opinion, the wrong way to think about it. Focus instead on the effect size: even if an effect is statistically significant, it may be substantively too small to influence how you use that information.

Finance, I see, so I will try to make a relevant example. Let’s say a large university randomizes students to take a financial literacy course, and then 10 years later follows up to see their overall retirement savings accumulated. Say the sample is very large, and we have results of:

```
Taken Class: N=100,000 Mean=5,000 SD=2,000
No Class:    N=100,000 Mean=4,980 SD=2,000

SE of Difference ~= 9
Mean Difference   = 20
T-Stat           ~= 2.24
p-value          ~= 0.025
```

So we can see that the treated group saves more! But it is only 20 dollars more over ten years, and we have quite a precise estimate. Even though those who took the class save more, do you really think taking the class is worth it? Probably not based on these stats; it is a trivial effect size given the sample and the overall variance of savings.
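The arithmetic behind those summary numbers can be replicated in a few lines. A minimal sketch using only the Python standard library (the inputs are the hypothetical summary stats from the example, and the normal approximation is fine at this sample size):

```python
from math import sqrt, erfc

# Hypothetical summary statistics from the example above
n1, m1, sd1 = 100_000, 5_000, 2_000  # taken class
n2, m2, sd2 = 100_000, 4_980, 2_000  # no class

diff = m1 - m2                        # mean difference = 20
se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # SE of the difference ~= 8.94
t = diff / se                         # ~= 2.24
p = erfc(abs(t) / sqrt(2))            # two-sided normal p-value ~= 0.025

print(diff, round(se, 2), round(t, 2), round(p, 3))
```

So a mean difference of 20 dollars is "statistically significant" here purely because the standard error is so small with N = 100,000 per group.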

And then as a follow up from Vishnu:

Thanks a lot, Prof. Andrew. One final question: can we use Cohen’s d or any other statistic for effect size estimation in these cases?

Cohen’s d = (5000 – 4980) ⁄ 2000 = 0.01.

I don’t personally worry much about Cohen’s d, to be honest. I like to work out the cost-benefits on scales that are meaningful (although this makes it difficult to compare across different studies). So since I am a criminologist, I will give a crime example:

```
Treated Areas:     40 crimes
Non-Treated Areas: 50 crimes
```

Ignore the standard error for the moment. Whether a drop of 10 crimes “is worth it” depends on the nature of the treatment and the type of crime. If the crimes are simply thefts of small items from a store, but the intervention was hiring 10 security guards, it is likely not worth it (the 10 guards’ salaries are likely much higher than the value of the 10 thefts they prevented).

But pretend now that the intervention was nudging police officers to patrol more in hot spots (so no marginal labor cost) and the crimes we examined were shootings. Preventing 10 shootings is a pretty big deal, because they have such large societal costs.

In this scenario the costs and benefits are always on the count scale (how many crimes did you prevent). Using another scale (like Cohen’s d or incident rate ratios or whatever) just obfuscates how to calculate the costs and benefits.
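To make the count-scale logic concrete, here is a hedged sketch of the two scenarios; every dollar figure below (guard salaries, the value of a stolen item, the societal cost of a shooting) is an illustrative assumption, not a number from the discussion above:

```python
# Count-scale cost-benefit sketch; all dollar values are illustrative
# assumptions, not estimates from the post.
crimes_prevented = 50 - 40               # non-treated minus treated areas

# Scenario 1: hire 10 security guards, crimes are petty thefts
guard_cost = 10 * 50_000                 # assumed annual salary per guard
theft_savings = crimes_prevented * 30    # assumed value per stolen item
print(theft_savings - guard_cost)        # deeply negative net benefit

# Scenario 2: nudge patrols to hot spots (no marginal labor cost),
# crimes are shootings
shooting_cost = 1_000_000                # assumed societal cost per shooting
print(crimes_prevented * shooting_cost)  # large positive net benefit
```

The same 10-crime reduction flips from clearly not worth it to clearly worth it once the counts are multiplied through by the relevant costs, which is the point of staying on the count scale.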

# How to interpret one sided tests for coefficient differences?

In my ask me anything series, Rob Case writes in a question about interpreting one-sided tests for the difference in coefficients:

Mr. Wheeler,

Thank you for your page https://andrewpwheeler.com/2016/10/19/testing-the-equality-of-two-regression-coefficients/

I did your technique (at the end of the page) of re-running the model with X+Z and X-Z as independent variables (with coefficients B1 and B2, respectively).

I understand:

1. (although you did not say so) that testing whether coefficient b1 (X’s coefficient in the original equation) is LESS THAN coefficient b2 (Z’s coefficient in the original regression) is a one-sided test; and testing whether one coefficient is DIFFERENT from another is a two-sided test
2. that the 90%-confidence t-distribution-critical-values-with-infinite-degrees-of-freedom are 1.282 for one-sided tests and 1.645 for two-sided tests
3. that if the resulting t-stat for the B2 coefficient is say 1.5, then—according to the tests—I should therefore be 90% confident that b1 is in fact less than b2; and I should NOT be 90% confident that b1 is different from b2.

But—according to MY understanding of logic and statistics—if I am 90% confident that b1 is LESS THAN b2, then I would be MORE THAN 90% confident that b1 DIFFERS from b2 (because “differs” includes the additional chance that b1 is greater than b2), i.e. the tests and my logic conflict. What am I doing wrong?

Rob

So I realize null hypothesis statistical testing (NHST) can be tricky to interpret, but the statement in 3 is not consistent with how we do NHST, for several reasons.

So if we have a null hypothesis that Beta1 = Beta2, for reasons to do with the central limit theorem we actually rewrite this to be:

```
Null: Beta1 - Beta2 = 0 => Theta0
```

I’ve noted this new parameter we are testing – the difference in the two coefficients – as Theta0. For NHST we assume this parameter is 0, and then test to see how close our data is to this parameter. So we estimate with our data:

```
b1 - b2 = Diff
DiffZ = Diff/StandardError_Diff
```

Now, to calculate a p-value, we need to say how unlikely our data estimate, DiffZ, is given the assumed null distribution Theta0. So imagine we draw our standard normal distribution curve about Theta0. This then defines the space for NHST; for a typical two-sided test we have (here assuming DiffZ is a negative value):

```
P(Z < DiffZ | Theta0) + P(Z > -DiffZ | Theta0) = Two-tailed p-value
```

We use “less than” to partition the space under the null hypothesis, since the probability of any exact value of DiffZ is zero when the distribution of potential outcomes is continuous. For a one-sided test, you just take the relevant portion of the above, rather than adding the two portions together:

```
P(Z < DiffZ | Theta0) = One-tailed p-value for Beta1 < Beta2
P(Z > -DiffZ | Theta0) = One-tailed p-value for Beta1 > Beta2
```
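Assuming a standard normal reference distribution, these partitions can be computed directly. A minimal Python sketch, with a hypothetical DiffZ of -1.5 (chosen to mirror the t-stat of 1.5 in the question):

```python
from math import sqrt, erfc

def norm_cdf(z):
    """P(Z < z) for a standard normal, via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2))

# Hypothetical standardized difference; matches the t-stat of 1.5 in the
# question, with the sign flipped so DiffZ is negative as assumed above
diff_z = -1.5

one_tail = norm_cdf(diff_z)                    # P(Z < DiffZ | Theta0) ~= 0.067
two_tail = one_tail + (1 - norm_cdf(-diff_z))  # adds P(Z > -DiffZ | Theta0) ~= 0.134

print(round(one_tail, 3), round(two_tail, 3))
```

With DiffZ = -1.5, the one-tailed p-value (about 0.067) falls below 0.10 while the two-tailed p-value (about 0.134) does not, which is exactly the situation in the question: the one-sided test rejects at the 90% level while the two-sided test does not.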

Note here that the test is conditional on the null hypothesis. Statements such as “I should therefore be 90% confident that b1 is in fact less than b2,” which take the complement of the p-value (i.e., 1 – p-value) and interpret it as a meaningful probability, are incorrect.

P-values are basically numerical summaries of how close the data are to the presumed null distribution. Small p-values just indicate that the data are not close to the assumed null distribution. The complement of the p-value is not evidence for the alternative hypothesis; it is just the left-over portion of the null distribution inside the Z values.

Statisticians oftentimes at this point in the conversation suggest Bayesian analysis, interpreting posterior probabilities instead of p-values. I will stop here though, as I am not sure “90% confident” readily translates into a specific Bayesian statement. (It could be people are better off doing non-inferiority/equivalence testing, for example, i.e. changing the null hypothesis.)