AMA OLS vs Poisson regression

Crazy busy with Crime De-Coder and the day job, so this blog has gone by the wayside for a bit. I am doing more Python training for crime analysts, most recently in Austin.

If you want to get a flavor of the training, I have posted a few example videos on YouTube. Here is an example of going over Quarto markdown documents:

I do these custom for each agency. So I log into your system, do actual queries with your RMS to illustrate. Coding is hard to get started, so part of the idea behind the training is to figure out all of the hard stuff (installation, connecting to your RMS, setting up batch jobs), so it is easier for analysts to get started.


This post comes from a good question I recently received from Lars Lewenhagen at the Swedish police:

In my job I often do evaluations of place-based interventions. Sometimes there is a need to explore the dosage aspect of the intervention. If I want to fit a regression model for this the literature suggests doing a GLM regression predicting the crime counts in the after period with the dosage and crime counts in the before period as covariates. This looks right to me, but the results are often contradictory. Therefore, I contemplated making change in crime counts the dependent variable and doing simple linear regression. I have not seen anyone doing this, so it must be wrong, but why?

And my response was:

Short answer is OLS is probably fine.

Longer answer: to tell whether OLS or a GLM makes more sense, what matters most is the functional form of the dose-response relationship. For example, say your doses were at 0, 1, 2, and 3.

A linear model will look like, for example:

E[Y] = 10 + 3*x

Dose, Y
 0  , 10
 1  , 13
 2  , 16
 3  , 19

E[Y] is the “expected value of Y” (the parameter that is akin to the sample mean). For a Poisson model, it will look like:

log(E[Y]) = 2.2 + 0.3*x

Dose, Y
 0  ,  9.0
 1  , 12.2
 2  , 16.4
 3  , 22.2

So if you plot your mean crime at the different doses, and it is a straight line, then OLS is probably the right model. If you draw the same graph, but use a logged Y axis and it is a straight line, Poisson GLM probably makes more sense.
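The two tables above can be reproduced with a few lines of Python. This is a minimal sketch using only the example coefficients from this post; the function names are just mine for illustration. It also shows the check described above: successive differences are constant under the linear model, while successive ratios (differences on the log scale) are constant under the Poisson model.

```python
import math

doses = [0, 1, 2, 3]

def linear_mean(x, intercept=10.0, slope=3.0):
    # E[Y] = 10 + 3*x
    return intercept + slope * x

def poisson_mean(x, intercept=2.2, slope=0.3):
    # log(E[Y]) = 2.2 + 0.3*x, so E[Y] = exp(2.2 + 0.3*x)
    return math.exp(intercept + slope * x)

linear_vals = [linear_mean(x) for x in doses]
poisson_vals = [poisson_mean(x) for x in doses]

# Straight line on the raw scale: each extra dose adds the same amount
linear_diffs = [b - a for a, b in zip(linear_vals, linear_vals[1:])]

# Straight line on the log scale: each extra dose multiplies by the same factor
poisson_ratios = [b / a for a, b in zip(poisson_vals, poisson_vals[1:])]

print("Dose  Linear  Poisson")
for x, lin, poi in zip(doses, linear_vals, poisson_vals):
    print(f"{x:>4}  {lin:6.1f}  {poi:7.1f}")

print("Linear differences:", linear_diffs)
print("Poisson ratios:", [round(r, 3) for r in poisson_ratios])
```

The Poisson ratio is exp(0.3), roughly a 35% increase per unit of dose, whereas the linear model adds a flat 3 crimes per unit of dose.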

In practice it is very hard to tell the difference between these two curves (you need to collect dose-response data at many points). So going with OLS is not per se good or bad, it is just a different model, and for experiments with only a few dose points it won’t make much of a difference in describing the experiment itself.

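Where the model makes a bigger difference is extrapolating. Take the two models above and look at the prediction for dose=10: the linear model predicts 10 + 3*10 = 40, while the Poisson model predicts exp(2.2 + 0.3*10) = exp(5.2), or about 181. At dose=3 the two were only about 3 crimes apart, so the gap between the models grows very quickly once you leave the observed range.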
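To see the extrapolation point numerically, here is a small sketch that evaluates both example models inside the observed range (dose=3) and well outside it (dose=10). It again uses only the example coefficients from this post, with helper names of my own choosing.

```python
import math

def linear_mean(dose):
    # E[Y] = 10 + 3*dose
    return 10 + 3 * dose

def poisson_mean(dose):
    # E[Y] = exp(2.2 + 0.3*dose)
    return math.exp(2.2 + 0.3 * dose)

for dose in (3, 10):
    lin = linear_mean(dose)
    poi = poisson_mean(dose)
    print(f"dose={dose}: linear={lin:.1f}, poisson={poi:.1f}, gap={poi - lin:.1f}")
```

Inside the sample the two predictions are within a few crimes of each other; at dose=10 the Poisson prediction is more than four times the linear one.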

I figured this would be a good one for the blog. Most of the academic material will talk about the marginal distribution of the variable being modeled (which is not quite right, as the conditional distribution is what matters). Really, for a lot of the examples I look at, linear models are fine, which is why I think the WDD statistic is reasonable (but not always).

For quasi-experiments it is the ratio between treated and control as well, but for a simpler dose-response scenario, you can just plot the means at binned locations of the doses and then see if it is a straight or curved line. In sample it often doesn’t even matter very much, since it is all just fitting mean values. Where it is a bigger deal is extrapolation outside of the sample.
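The binned-means check can be sketched in plain Python. Everything here is an assumption for illustration: the `binned_means` helper is hypothetical, the bin width is arbitrary, and the dose and count values are toy numbers I made up, not real data. The idea is just to average the counts within dose bins and then look at the means on both the raw and the log scale.

```python
import math
from collections import defaultdict

def binned_means(doses, counts, bin_width):
    """Average crime count within equal-width dose bins (keyed by left bin edge)."""
    sums = defaultdict(float)
    n = defaultdict(int)
    for d, y in zip(doses, counts):
        left_edge = math.floor(d / bin_width) * bin_width
        sums[left_edge] += y
        n[left_edge] += 1
    return {b: sums[b] / n[b] for b in sorted(sums)}

# Toy values, made up purely for illustration (not real data)
doses = [0.2, 0.7, 1.1, 1.8, 2.3, 2.9, 3.4, 3.6]
counts = [9, 11, 12, 14, 15, 19, 20, 24]

means = binned_means(doses, counts, bin_width=1.0)

for left_edge, mean_count in means.items():
    # The log is only defined for positive means; bins averaging zero need separate handling
    print(f"dose bin starting at {left_edge:.0f}: "
          f"mean={mean_count:.1f}, log(mean)={math.log(mean_count):.2f}")
```

If the `mean` column climbs in roughly equal steps, a linear model is a reasonable description; if the `log(mean)` column does, a Poisson GLM is the better fit. With only a handful of bins, as here, the two will often be hard to tell apart.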
