This is a real example from my work illustrating regression to the mean. I have a scale measuring impulsivity of offenders. I had an intervention that used cognitive behavioral therapy (CBT) in a boot camp for one group, and business as usual for another (just plain old jail). I have measures of impulsivity at pre, post, and 6 month follow up (what I label as post2). CBT is supposed to reduce impulsivity, and hopefully keep it that way.
I find that those who have gained the most during the intervention tend to revert back to their prior scores once they leave the boot camp. That is, the measure [post - pre], the gain in boot camp, has a negative correlation with [post2 - post], the loss after boot camp. Is this due to the intervention being shitty? No, it is not: this is the result of regression to the mean. The negative correlation does not show any substantive relationship between the values; it will happen even if the impulse scores are totally random.
Note that the definition of covariance is:
Cov(X,Y) = E[(X - E[X])*(Y - E[Y])]
Where E represents the expectation, and Cov(X,Y) of course means the covariance between X and Y. Here for easier equations we can assume the mean of the impulse scale is zero across all three waves, which makes the means of the change scores zero as well (without any loss of generality). So dropping the inner expectations, this equation reduces to:
Cov(X,Y) = E[X*Y]
So defining post - pre = Change1 and post2 - post = Change2, expanding out to the original components we have:
Cov(Change1,Change2) = Cov(post-pre,post2-post) = E[ (post-pre)*(post2-post) ]
The last result can then be expanded to:
E[ post*post2 - post*post - pre*post2 + pre*post ]
Because of the linearity of expectation, these can be further teased out:
E[ post*post2 ] - E[ post*post ] - E[ pre*post2 ] + E[ pre*post]
Note we can rewrite this back into variances and covariances of the original levels:
Cov(post,post2) - Var(post) - Cov(pre,post2) + Cov(pre,post)
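The same identity holds for sample covariances, so it can be checked numerically. Below is a minimal sketch in R; the simulated waves, their coefficients, and the sample size are hypothetical values chosen only for illustration. The waves are deliberately correlated and have non-zero means, to show that the zero-mean assumption really was just a convenience.

```r
set.seed(1)
n <- 500  # hypothetical sample size

# Hypothetical, deliberately correlated waves with non-zero means
pre   <- rnorm(n, mean = 5)
post  <- 0.6 * pre + rnorm(n)
post2 <- 0.3 * pre + 0.5 * post + rnorm(n)

# Left-hand side: covariance of the two change scores
lhs <- cov(post - pre, post2 - post)

# Right-hand side: the same quantity built only from the levels
rhs <- cov(post, post2) - var(post) - cov(pre, post2) + cov(pre, post)

# The two should agree up to floating-point error
all.equal(lhs, rhs)
```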
There are two things to note here. 1) The covariances of the change scores can be written entirely as functions of the variances and covariances of the levels. They do not supply information independent of the levels themselves.
For 2), if the data are random (that is, the covariances between all the levels are zero), the covariance between the change scores will be negative. This is because of the minus sign in front of the variance of the post term. For random data, all the other covariances are zero, leaving only -Var(post). When the three waves also have equal variances, this results in the correlation between the change scores being -1/2.
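To spell out the arithmetic behind the -1/2, here is a short worked version. It assumes the three levels are mutually uncorrelated and share a common variance; the symbol \sigma^2 for that common variance is notation introduced here, not part of the derivation above.

```latex
\begin{align*}
\mathrm{Cov}(\text{Change1}, \text{Change2}) &= -\mathrm{Var}(\text{post}) = -\sigma^2 \\
\mathrm{Var}(\text{Change1}) &= \mathrm{Var}(\text{post}) + \mathrm{Var}(\text{pre}) = 2\sigma^2 \\
\mathrm{Var}(\text{Change2}) &= \mathrm{Var}(\text{post2}) + \mathrm{Var}(\text{post}) = 2\sigma^2 \\
\mathrm{Corr}(\text{Change1}, \text{Change2}) &= \frac{-\sigma^2}{\sqrt{2\sigma^2 \cdot 2\sigma^2}} = -\frac{1}{2}
\end{align*}
```

If the variances differ across waves, the covariance is still negative, but the correlation will generally not be exactly -1/2.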
For a simple example in R:
> set.seed(10)
> n <- 10000 #sample size
> t1 <- rnorm(n) #three random vectors
> t2 <- rnorm(n)
> t3 <- rnorm(n)
> levels <- data.frame(t1,t2,t3)
> differ <- data.frame(c1=t2-t1,c2=t3-t2)
>
> #correlations in levels are approximately zero
> cor(levels)
              t1           t2            t3
t1  1.0000000000  0.001874345 -0.0007006367
t2  0.0018743450  1.000000000 -0.0045967380
t3 -0.0007006367 -0.004596738  1.0000000000
>
> #correlation of differences is -0.5
> cor(differ)
           c1         c2
c1  1.0000000 -0.4983006
c2 -0.4983006  1.0000000
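Real impulsivity scores are not independent across waves, so it may help to see that the same pattern shows up when the levels are strongly correlated. The sketch below assumes a hypothetical setup in which each wave is a stable per-person component plus independent noise; the names trait, w1, w2, w3 and the standard deviations are illustrative assumptions, not taken from my data.

```r
set.seed(10)
n <- 10000                      # sample size

# Hypothetical stable component shared by all three waves
trait <- rnorm(n, sd = 2)

# Each wave = stable component + independent noise
w1 <- trait + rnorm(n)
w2 <- trait + rnorm(n)
w3 <- trait + rnorm(n)

# Levels are now strongly positively correlated (theoretical value 0.8)
cor(cbind(w1, w2, w3))

# The stable component cancels out of both differences, so only the
# noise is left and the correlation should again be close to -0.5
r_change <- cor(w2 - w1, w3 - w2)
r_change
```

Under these assumptions the three covariance terms between the levels cancel, and what remains is minus the noise variance, so the change scores are negatively correlated even though nothing about the "intervention" differs between people.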
Sometimes I see people talk about regression to the mean as if it is a sociological thing, like something that needs to be explained in terms of human behavior. It is not; it is entirely mathematical.
This is also one of the reasons I don’t like using change scores, either as independent or dependent variables. They typically can be rewritten in terms of the levels, and involve coefficient restrictions that can have strange consequences. There are some situations (fixed effects) where they make sense for the dependent variable. I haven’t seen a situation where they make sense as independent variables.
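As a rough illustration of what a coefficient restriction means here: putting post - pre on the right-hand side of a regression is the same as entering post and pre separately but forcing their coefficients to be equal in size and opposite in sign. The sketch below uses a made-up outcome y that depends only on post; all variable names and the data-generating process are hypothetical.

```r
set.seed(10)
n <- 10000

# Hypothetical independent levels and an outcome that depends on post only
pre  <- rnorm(n)
post <- rnorm(n)
y    <- post + rnorm(n)

# Change score as the predictor: implicitly forces b_post = -b_pre
fit_change <- lm(y ~ I(post - pre))

# Levels entered separately: no restriction on the coefficients
fit_levels <- lm(y ~ post + pre)

coef(fit_change)   # slope should land near 0.5 in this setup
coef(fit_levels)   # post should be near 1, pre near 0
```

In this setup the change score model splits the difference: it reports a slope of about one half and implies that pre has an effect of about minus one half, even though pre has no effect on y at all.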