Graphing Spline Predictions in SPSS

I might have around 10 blog posts about using splines in regression models – and you are about to get another. Instead of modeling non-linear effects via polynomial terms (e.g. including x^2, x^3 in a model, etc.), splines are a much better default procedure IMO. For a more detailed mathy exposition on splines and a walkthrough of the functions, see my class notes.

So I had a few questions about applying splines in generalized linear models and including control variables in my prior post (on a macro to estimate the spline terms). These include can you use them in different types of generalized linear models (yes), can you include other covariates into the model (yes). For either of those cases, interpreting the splines are more difficult though. I am going to show an example here of how to do that.

Additionally I have had some recent critiques of my paper on CCTV decay effects. One is that the locations of the knots we chose in that paper is arbitrary. So while that is true, one of the reasons I really like splines is that they are pretty robust – you can mis-specify the knot locations, and if you have enough of them they will tend to fit quite a few non-linear functions. (Also a note on posting pre-prints, despite being rejected twice and under review for around 1.5 years, it has over 2k downloads and a handful of citations. The preprint has more downloads than my typical published papers do.)

So here I am going to illustrate these points using some simulated data according to a particular logistic regression equation. So I know the true effect, and will show how mis-located spline knots still recovers the true effect quite closely. This example is in SPSS, and uses my macro on estimating spline basis.

Generating Simulated Data

So first in SPSS, I define the location where I am going to save my files. Then I import my Spline macro.

* Example of splines for generalized linear models 
* and multiple variables.

DATASET CLOSE ALL.
OUTPUT CLOSE ALL.

* Spline Macro.
FILE HANDLE macroLoc /name = "C:\Users\andre\OneDrive\Desktop\Spline_SPSS_Example".
INSERT FILE = "macroLoc\MACRO_RCS.sps".

Second, I create a set of synthetic data, in which I have a linear changepoint effect at x = 0.42. Then I generate observations according to a particular logistic regression model, with not only the non-linear X effects, but also two covariates Z1 (a binary variable) and Z2 (a continuous variable).

*****************************************************.
* Synthetic data.
SET SEED = 10.
INPUT PROGRAM.
LOOP Id = 1 to 10000.
END CASE.
END LOOP.
END file.
END INPUT PROGRAM.
DATASET NAME Sim.

COMPUTE X = RV.UNIFORM(0,1).
COMPUTE #Change = 0.42.
DO IF X <= #Change.
  COMPUTE XDif = 0.
ELSE.
  COMPUTE XDif = X - #Change.
END IF.
COMPUTE Z1 = RV.BERNOULLI(0.5).
COMPUTE Z2 = RV.NORMAL(0,1).  

DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

*This is a linear changepoint at 0.42, other variables are additive.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE yprob = !INVLOGIT(ylogit).
COMPUTE Y = RV.BERNOULLI(yprob).
*These are variables you won't have in practice.
ADD FILES FILE =* /DROP ylogit yprob XDif.
FORMATS Id (F9.0) Y Z1 (F1.0) X Z2 (F3.2).
EXECUTE.
*****************************************************.

Creating Spline Basis and Estimating a Model

Now like I said, the correct knot location is at x = 0.42. Here I generate a set of regular knots over the x input (which varies from 0 to 1), at not the exact true value for the knot.

!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now if you look at your dataset, there are 3 new splinex? variables. (For restricted cubic splines, you get # of knots - 2 new variables, so with 5 knots you get 3 new variables here.)

We are then going to use those new variables in a logistic regression model. We are also going to save our model results to an xml file. This allows us to use that model to score a different dataset for predictions.

GENLIN Y (REFERENCE=0) WITH X splinex1 splinex2 splinex3 Z1 Z2 
  /MODEL X splinex1 splinex2 splinex3 Z1 Z2 
      INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=LOGIT
  /OUTFILE MODEL='macroLoc\LogitModel.xml'. 

And if we look at the coefficients, you will see that the coefficients look offhand very close to the true coefficients, minus splinex2 and splinex3. But we will show in a second that those effects should be of no real concern.

Generating New Data and Plotting Predictions

So you should do this in general with generalized linear models and/or non-linear effects, but to interpret spline effects you can’t really look at the coefficients and know what those mean. You need to make plots to understand what the non-linear effect looks like.

So here in SPSS, I create a new dataset, that has a set of regularly sampled locations along X, and then set the covariates Z1=1 and Z2=0. These set values you may choose to be at some average, such as mean, median, or mode depending on the type of covariate. So here since Z1 can only take on values of 0 and 1, it probably doesn’t make sense to choose 0.5 as the set value. Then I recreate my spline basis functions using the exact sample macro call I did earlier.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  COMPUTE X = #xloc/300.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
COMPUTE Z1 = 1.
COMPUTE Z2 = 0.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now in SPSS, we score this dataset using our prior model xml file we saved. Here this generates the predicted probability from our logistic model.

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

And to illustrate how close our model is, I generate what the true predicted probability should be based on our simulated data.

*Lets also do a line for the true effect to show how well it fits.
COMPUTE #change = 0.42.
DO IF X <= #change.
  COMPUTE xdif = 0.
ELSE.
  COMPUTE xdif = (X - #change).
END IF.
EXECUTE.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE TruePr = !INVLOGIT(ylogit).
FORMATS TruePr PredPr X (F2.1).
EXECUTE.

And now we can put these all into one graph.

DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr TruePr
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: cat(aesthetic(aesthetic.shape), map(("PredPr",shape.solid),("TruePr",shape.dash)))
  ELEMENT: line(position(X*PredPr), shape("PredPr"))
  ELEMENT: line(position(X*TruePr), shape("TruePr")) 
END GPL.

So you can see that even though I did not choose the correct knot location, my predictions are nearly spot on with what the true probability should be.

Generating Predictions Over Varying Inputs

So in practice you can do more complicated models with these splines, such as allowing them to vary over different categories (e.g. interactions with other covariates). Or you may simply want to generate predicted plots such as above, but have a varying set of inputs. Here is an example of doing that; for Z1 we only have two options, but for Z2, since it is a continuous covariate we sample it at values of -2, -1, 0, 1, 2, and generate lines for each of those predictions.

*****************************************************.
* Can do the same thing, but vary Z1/Z2.

DATASET ACTIVATE Sim.
DATASET CLOSE Fixed.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  LOOP #z1 = 0 TO 1.
    LOOP #z2 = -2 TO 2.
      COMPUTE X = #xloc/300.
      COMPUTE Z1 = #z1.
      COMPUTE Z2 = #z2.
      END CASE.
    END LOOP.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

FORMATS Z1 Z2 (F2.0) PredPr X (F2.1).
VALUE LABELS Z1
  0 'Z1 = 0'
  1 'Z1 = 1'.
EXECUTE.

*Now creating a graph of the predicted probabilities over various combos.
*Of input variables.
DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr Z1 Z2
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  DATA: Z1=col(source(s), name("Z1"), unit.category())
  DATA: Z2=col(source(s), name("Z2"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Predicted Probability"))
  GUIDE: axis(dim(3), opposite())
  GUIDE: legend(aesthetic(aesthetic.color), label("Z2"))
  SCALE: cat(aesthetic(aesthetic.color), map(("-2",color."8c510a"),("-1",color."d8b365"),
               ("0",color."f6e8c3"), ("1",color."80cdc1"), ("2",color."018571")))
  ELEMENT: line(position(X*PredPr*Z1), color(Z2))
END GPL.
*****************************************************.

So between all of these covariates, the form of the line does not change much (as intended, I simulated the data according to an additive model).

If you are interested in drawing more lines for Z2, you may want to use a continuous color scale instead of a categorical one (see here for a similar example).

A failed attempt at optimal search paths

Recently saw Kim Rossmo have a paper that describes a Bayesian approach to prioritizing areas for a search for missing persons. So he illustrates an approach to give a probability surface, but that still leaves implicit how individuals are to traverse over that probability space in the search itself.

For an example of where there can be potential ambiguity even with the probability surface, in the surface below we have three hot spots. So if we have four people to search this area, and they can only search a finite connected area (so no hop-scotching around), should we have them split between each of the hot spots, or should they cover one of the hot spots in more detail. (It is hard to tell in my graph, but the hot spot in the central western part of the graph has a higher hump, but is steeper, so top right has more mass but is more spread out.)

I’ve actually failed to be able to generate a decent algorithm to do this though. It is similar to this prior work of mine, but I actually discovered some errors in that work in trying to apply it to this situation (can have disconnected subtours that are complicated paths). So attempted several other variants and have yet to come up with a decent solution.

I tried out a greedy algorithm to solve the problem (pick the highest hump, march like an ant around until you have covered your max tour, and then start again). But this was not good either. But it generated some interesting accidental art. So here is my greedy approach to pick four tours in which they can traverse 300 grid cells, and here it says better to ignore the bottom hotspot and spread around your effort in the other two areas:

I know this is pretty sup-optimal though, as you can continue to generate more tours through this approach and eventually find better ones.

This is going to bug me forever now, but posting a blog to move on. So if you know of a solution please fill me in!

Lit reviews are (almost) functionally worthless

The other day I got an email from ACJS about the most downloaded articles of the year for each of their journals. For The Journal of Criminal Justice Education it was a slightly older piece, How to write a literature review in 2012 by Andrew Denney & Richard Tewksbury, DT from here on. As you can guess by the title of my blog post, it is not my most favorite subject. I think it is actually an impossible task to give advice about how to write a literature review. The reason for this is that we have no objective standards by which to judge a literature review – whether one is good or bad is almost wholly subject to the discretion of the reader.

The DT article I don’t think per se gives bad advice. Use an outline? Golly I suggest students do that too! Be comprehensive in your lit review about covering all relevant work? Well who can argue with that!

I think an important distinction to make in the advice DT give is the distinction between functional actions and symbolic actions. Functional in this context means an action that makes the article better accomplish some specific function. So for example, if I say you should translate complicated regression models to more intuitive marginal effects to make your results more interpretable for readers, that has a clear function (improved readability).

Symbolic actions are those that are merely intended to act as a signal to the reader. So if the advice is along the lines of, you should do this to pass peer review, that is on its face symbolic. DT’s article is nearly 100% about taking symbolic actions to make peer reviewers happy. Most of the advice doesn’t actually improve the content of the manuscript (or in the most charitable interpretation how it improves the manuscript is at best implicit). In DT’s section Why is it important this focus on symbolic actions becomes pretty clear. Here is the first paragraph of that section:

Literature reviews are important for a number of reasons. Primarily, literature reviews force a writer to educate him/herself on as much information as possible pertaining to the topic chosen. This will both assist in the learning process, and it will also help make the writing as strong as possible by knowing what has/has not been both studied and established as knowledge in prior research. Second, literature reviews demonstrate to readers that the author has a firm understanding of the topic. This provides credibility to the author and integrity to the work’s overall argument. And, by reviewing and reporting on all prior literature, weaknesses and shortcomings of prior literature will become more apparent. This will not only assist in finding or arguing for the need for a particular research question to explore, but will also help in better forming the argument for why further research is needed. In this way, the literature review of a research report “foreshadows the researcher’s own study” (Berg, 2009, p. 388).

So the first argument, a lit review forces a writer to educate themselves, may offhand seem like a functional objective. It doesn’t make sense though, as lit. reviews are almost always written ex post research project. The point of writing a paper is not to educate yourself, but educate other people on your research findings. The symbolic motivation for this viewpoint becomes clear in DT’s second point, you need to demonstrate credibility to your readers. In terms of integrity if the advice in DT was ‘consider creating a pre-analysis plan’ or ‘release data and code files to replicate your results’ that would be functional advice. But no, it is important to wordsmith how smart you are so reviewers perceive your work as more credible.

Then the last point in the paragraph, articulating the need for a particular piece of research, is again a symbolic action in DT’s essay. You are arguing to peer reviewers about the need for a particular research question. I understand the spirit of this, but think back to what function does this serve? It is merely a signal to reviewers to say, given finite space in a journal, please publish my paper over some other paper, because my topic is more important.

You actually don’t need a literature review to demonstrate a topic is important and/or needed – you can typically articulate that in a sentence or two. For a paper I reviewed not too long ago on crime reductions resulting from CCTV installations in a European city, I was struck by another reviewers critique saying that the authors “never really motivate the study relative to the literature”. I don’t know about you, but the importance of that study seems pretty obvious to me. But yeah sure, go ahead and pad that citation list with a bunch of other studies looking at the same thing to make some peer reviewers happy. God forbid you simply cite a meta-analysis on prior CCTV studies and move onto better things.

What should a lit review accomplish?

So again I don’t think DT give bad advice – mostly vapid but not obviously bad. DT focus on symbolic actions in lit reviews because as lit reviews are currently performed in CJ/Crim journals, they are almost 100% symbolic. They serve almost no functional purpose other than as a signal to reviewers that you are part of the club. So DT give about the best advice possible navigating a series of arbitrary critiques with no clear standard.

As an example for this position that lit reviews accomplish practically nothing, conduct this personal experiment. The next peer review article you pick up, do not read the literature review section. Only read the abstract, and then the results and conclusion. Without having read the literature review, does this change the validity of a papers findings? It for the most part does not. People get feelings hurt by not being cited (including myself), but even if someone fails to cite some of my work that is related it pretty much never impacts the validity of that persons findings.

So DT give advice about how peer review works now. No doubt those symbolic actions are important to getting your paper published, even if they do not improve the actual quality of the manuscript in any clear way. I rather address the question about what I think a lit review should look like – not what you should do to placate three random people and the editor. So again I think the best way to think about this is via articulating specific functions a lit review accomplishes in terms of improving the manuscript.

Broadening the scope abit to consider the necessity of citations, the majority of citations in articles are perfunctory, but I don’t think people should plagiarize. So when you pull a very specific piece of information from a source, I think it is important to cite that work. Say you are using a survey instrument developed by someone else, citing the work that establishes that instruments reliability and validity, as well as the original population those measures were established on, is certainly useful information to the reader. Sources of information/measures, a recent piece saying the properties of your statistical model are I think other good examples of things to cite in your work. Unfortunately I cannot give a bright line here, I don’t cite Gauss every time I use the normal distribution. But if I am using a code library someone else developed that is important, inasmuch as that if someone wants to do a similar project they could use the same library.

In terms of discussing relevant results in prior studies, again the issue is the boundary of what is relevant is very difficult to articulate. If there is a relevant meta-analysis on a topic, it seems sufficient to me to simply state the results of the meta-analysis. Why do I think that is important though? It helps inform your priors about the current study. So if you say a meta-analysis effect size is X, and the current study has an effect size much larger, it may give you pause. It is also relevant if you are generalizing from the results of the study, it is just another piece of evidence in addition to the meta-analysis, not an island all by itself.

I am not saying discussing prior specific results are not needed entirely, but they do not need to be extensive. So if studies Z, Y, X are similar to yours but all had null results, and you think it was because the sample sizes were too small, that is relevant and useful information. (Again it changes your priors.) But it does not need to be belabored on in detail. The current standard of articulating different theoretical aspects ad-nauseum in Crim/CJ journals does not improve the quality of manuscripts. If you do a hot spots policing experiment, you do not need to review all the different minutia of general deterrence theory. Simply saying this experiment is likely to only accomplish general deterrence, not specific deterrence, seems sufficient to me personally.

When you propose a book you need to say ‘here are some relevant examples’ – I think the same idea would be sufficient for a lit review. OK here is my study, here are a few additional studies I think the reader may be interested in that are related. This accomplishes what contemporary lit reviews do in a much more efficient manner – citing more articles makes it much more difficult to pull out the really relevant related work. So admit this does not improve the quality of the current manuscript in a specific way, but helps the reader identify other sources of interest. (I as a reader typically go through the citation list and note a few articles I am interested in, this helps me accomplish that task much quicker.)

I’ve already sprinkled a few additional pieces of advice in this blog post (marginal effect estimates, pre-analysis plans, sharing data code), although you may say they don’t belong in the lit review. Whatever, those are things that actually improve either the content of the manuscript or the actual integrity of the research, not some spray paint on your flowers.

Relevant Other Work

Changepoints in CCTV Effects

So I am a big fan of using splines in regression equations to model non-linear effects. But a limitation of these is that you need to upfront say how many knots you want, as well as where the knots are. So I have explored a bit on fitting models that can identify the changepoints themselves. It was a tricky road, I tried building some in deep learning using pytorch, then tried variational auto-encoders in pyro, then pystan (marginalizing the changepoint out), and then pymc3 (using different samplers). All of my attempts failed! But when I used the R mcp library (Lindeløv, 2020), it was able to find my changepoint using simulated data. (It uses JAGS under the hood, no idea why JAGS behaved better than my other attempts.)

Usecase: Dropoff effect of CCTV on clearance rates

So in spatial criminology, a popular hypothesis is estimating distance decay effects. Ratcliffe (2012) was the first example of using a changepoint regression model to do this, showing a changepoint in the effect of bars on the spatial density of crime nearby. This has been replicated in Xu & Griffiths (2017), and in my work using machine learning and partial dependence plots I show similar changepoint patterns as well (Wheeler & Steenbeek, 2020).

One example use case though I want to mention is not in terms of estimating the spatial density of crime, but with the characteristics of the crime events themselves. Sometimes people I think mistakenly think since I have spatial data, I need to aggregate it to some areal unit, and then do analysis of that areal unit data. That approach is not per-se wrong, but is sometimes a step removed from what you want, and can result in some tricky inferences.

Take for example a recent paper looking at clearances and using RTM by Kennedy et al. (2020). What they do is spatially aggregate homicides cleared and homicides not cleared, and run RTM on each. You might be tempted to interpret if a factor is selected for both models that it does not impact clearances, but it also depends on the size of the effect. So for example, in Brooklyn for drug markets they report a rate ratio of 3.1 and 2.4 (both at the same spatial distance). To translate this into a clearance rate, you need to add the two density estimates for all cases, and then take the cleared cases as the numerator.

# Example R code
clear <- exp(-0.1 + log(3.1))
nonclear <- exp(-0.1 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.5636364

Here I am treating -0.1 as the intercept. So here this is lower, but close to the overall clearance in Brooklyn, 58%. This 56% will be the estimate iff the intercept for each equation is the same, if they are not though it could change the clearance rate estimate either way. Since the Kennedy paper did not report this, we cannot know. So for instance, if we change the intercept estimates so clearances are higher and non-clearances are lower, we get an estimate that drug markets increase clearances slightly, not decrease them:

clear <- exp(-0.05 + log(3.1))
nonclear <- exp(-0.2 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.6001124

In this example it probably won’t push them too far either way, but takes a bit of work going from the aggregate data analysis to the estimate we want, how those spatial risk factors impact the clearance rate. There is an easier way though – just incorporate your spatial features, such as the distance the nearest crime generator factor, and estimate a model on the micro level incident data. This is what Kennedy et al. (2020) do later in the paper when incorporating the RTM predictions – I just think they should have done the RTM machinery directly on this problem, instead of the two-step approach.

Examples of my work I have done this approach in the past (incorporating spatial data into the micro level incidents) is with fatalities from gun shot wounds (Circo & Wheeler, 2020). We actually investigated non-linear effects though of distance/drive-time, and did not find evidence of that. Going back to the crime clearance example though, another pre-print I examine the effects of CCTV cameras and find a diminishing effect of case clearances given the distance to the camera (Jung & Wheeler, 2019).

So here we use a pre-post design to show there are some selection effects, and we do further analysis to show this camera bump in clearances is only limited to thefts. But we set the splines at 500, 1000, and 1500 feet pre-emptively for the analysis. A reviewer critique of this is that those three locations are arbitrary (which is correct), so here I will see if I do a changepoint model that allows us to find the knot locations if it will show the same ones.

The idea behind this analysis is that CCTV are often used in investigations. Yeondae is an officer in Korea, same as here in the states first things detectives do is to go and grab CCTV footage. Analysis of cameras are often aggregated to their viewsheds, but I think estimating distance decay effects make as much sense. So events closer to the cameras presumably will provide more clear evidence than events at the border of the viewshed. A second point is that even if the event takes place off-camera, there may be evidence cross by the camera viewshed. Detectives will often try to follow individuals across multiple cameras. So both of those factors suggest a distance decay effect both within a cameras viewshed and a decaying effect even outside of the viewshed. (In addition to this, geo coordinates of crime locations are not perfectly accurate measures either, so that could cause effects outside of the viewshed as well.)

Here I am just limiting the data to the post camera data within 3000 feet for thefts, which still is over 26,000 observations. I’ve posted the data/code to follow along here.

Analysis using mcp in R

Again given my hardship in coding this up myself in python, I created a simulated data example and checked the results using mcp (which you can check in my code). Since mcp recovered my simulated changepoint, (and my python attempts did not), going to go ahead with the mcp library! First, we will import my clearance data and get rid of a few missing cases.

#################
library(mcp)
library(ggplot2)
set.seed(10)
#can see I planned on doing this in pytorch at first!
setwd('D:\\Dropbox\\Dropbox\\Documents\\BLOG\\changepoint_pytorch\\Analysis')
theft_clear <- read.csv('PostTheft_CCTV.csv')
theft_clear <- theft_clear[complete.cases(theft_clear),]
#################

So first for a reference, if I assume there is a linear changepoint at 1000 feet, here are what my results look like. Note here that this is not aggregated data to spatial locations, each row in this dataset is a theft offense, whether it was cleared, and the distance to the nearest CCTV camera.

#################
#What are the coefficients if assume a changepoint of 1000 feet
theft_clear$x_dif <- (theft_clear$CAM.DIST - 1000)*(theft_clear$CAM.DIST > 1000)
theft_mod <- glm(formula = 'STATUSi ~ CAM.DIST + x_dif', family = "binomial", data = theft_clear)
summary(theft_mod) #This gives an estimate of 
#################

And here you can visualize the results alittle easier than trying to back out probabilities for the regression equation:

#################
pred_mod <- predict(theft_mod,type='response')
plot(theft_clear$CAM.DIST,pred_mod, main="Changepoint at 1000 ft",
  xlab="Distance from Camera (ft)", ylab="Probability Clearance")
#################

So this shows clearances nearby cameras in Dallas are around 15%, and they trail off to around 9% at 1000 feet. After that they continue to tail off, but are nearly flat. But again that is assuming a change point at 1000 feet. But the mcp package lets us actual estimate the changepoint itself using Bayesian regression. Here is the set up that is equivalent to my formulation earlier, in that the changepoint cannot be discontinuous.

#################
theft_clear$x <- theft_clear$CAM.DIST 
model = list(
  STATUSi | trials(const) ~ 1 + x,
  ~ 0 + x  #joined changing rate
)

fit = mcp(model, data = theft_clear, family = binomial(), iter = 3000, adapt = 500)
#################

And then if you are following along you can go ahead and take a nap (maybe took 2 hours on my machine?), and when we get back summary(fit) gives us:

So we have very similar coefficients to the manual changepoint model earlier, but the changepoint is around 1600 feet, not 1000. (Although note these are Bayesian credible intervals, not frequentist confidence intervals.) And now to make a nice plot of the fitted model.

#Fitted values for new data
newdat <- data.frame(x = (0:300)*10)
newdat$const <- 1
newdat$CAM.DIST <- newdat$x
res <- fitted(fit, newdata = newdat)

p_pred <- ggplot(data=res) + 
  geom_line(size=1.2, color='black', aes(x = x, y = fitted)) + 
  geom_ribbon(alpha=0.5, fill='black', aes(x = x, ymin=Q2.5 , ymax=Q97.5)) + 
  scale_x_continuous(name="Feet from Camera",breaks=seq(0,3000,500),minor_breaks=NULL) + 
  scale_y_continuous(name="P(Clearance)",breaks=seq(0.06,0.16,0.02),minor_breaks=NULL) +
  theme_bw() + theme(panel.grid.major = element_line(colour = 'grey', linetype = 'dashed', size=0.1)) + 
  theme(text = element_text(size=20))

p_pred

So you can see that here it is a nearly linear drop off until 1600 feet, and then starts to climb back up. The climb up I think is likely due to selection effects, but we can’t 100% rule out displacement effects. Displacement effects could occur with cameras if detectives prioritize events around cameras and de-prioritize other events not nearby cameras. Skeptical that applies to thefts in Dallas though, as they very rarely will be assigned a detective at all.

Wrap Up

So this ended up taking me for a few different turns. One of the things I wanted to be able to test multiple changepoints, maybe if I can ever get pymc3 to give me a reasonable fit, this example is a good illustration. That should also maybe say if you should have no changepoint as well. I think maybe it is much harder to fit those models with binomial data though than with continuous (maybe good for another blog post as well, did simulations at first with 1000 observations and that was a bad idea).

One thing that would be good for evaluating whether change points are reasonable are out of sample predictive comparisons. So say estimate a no changepoint model, a linear changepoint model, and then a model with fixed spline locations. Then see which of those better fits the out of sample data. But since this is a blog post, will leave it as is. But this is a simple illustration to extend prior spatial analysis of changepoints in distance decay effects to one example – crime clearances and CCTV cameras – that I think makes alot of sense.

References

Confidence intervals around proportions

So you probably learned about confidence intervals around means in your introductory statistics class. For a refresher, a confidence interval covers a particular statistic at a pre-specified rate. So if I generate 100 90% intervals around a mean, I expect that those confidence intervals would cover the true underlying mean around 90 times out of those 100. So it is a statement about the procedure overall – not any individual test.

This repeated coverage property is often not exactly what we want in statistics. But, I often find examining confidence intervals around samples to be an informative way to quantify uncertainty in estimates. For example, I have a machine learning model serving up predictions to a subsequent auditing process. I expect this to maintain a hit rate above 20%. The past week I only had a hit rate of 30/200 (15%), should I be worried? Probably not, a 95% confidence interval around that proportion is 10% to 21%.

Proportions come up so often that intro stats courses should probably talk more extensively about generating confidence intervals around them. There are many different confidence intervals for proportions, Wikipedia lists 7 different options!

I prefer where possible to use the Clopper-Pearson intervals by default. I will show an examples of generating Clopper-Pearson intervals in Excel and Python. But, another situation I have come across is I want to do these intervals entirely in SQL. For that situation, I will show how to use Agresti–Coull intervals.

Excel Clopper-Pearson

In Excel, if the A column contains the numerator, the B column contains the denominator, and if G1 has the alpha level, this brutish formula gets you the lower bound of your confidence interval;

=IF(A2=0, 0, BETA.INV($G$1/2, A2, B2-A2+1))

A here is your upper bound;

=IF(A2=B2, 1, BETA.INV(1-$G$1/2, A2+1, B2-A2))

And here is a screenshot of the filled in results:

Note for my criminology friends, you can use this for very extreme proportions as well. So say you had a homicide rate of 10 per 100,000, where the observed sample was 30 homicides in a city of 300,000. You can generate a binomial confidence interval around the proportion and then translate back to the rate per 100,000. So in that scenario, it results in a 95% confidence interval of a homicide rate of 6.7 to 14.3.

This is actually the reason I like defaulting to Clopper-Pearson. The other approximations can fail very badly for extreme tail events like this.

Python Clopper-Pearson

Here is a simple function in python to return the Clopper-Pearson intervals. This works for vectorized inputs as well (e.g. numpy arrays or pandas series).

import numpy as np
from scipy.stats import beta

def binom_int(num,den, confint=0.95):
    quant = (1 - confint)/ 2.
    low = beta.ppf(quant, num, den - num + 1)
    high = beta.ppf(1 - quant, num + 1, den - num)
    return (np.nan_to_num(low), np.where(np.isnan(high), 1, high))

And here is an example use:

hits = np.array([0, 1, 2, 3, 97, 98, 99, 100])
tries = np.array([100]*len(hits))
lowCI, highCI = binom_int(hits, tries)

Check out my prior blog post on making smoothed scatterplots on how to plot those proportion spikes in matplotlib.

SQL Agresti–Coull

So as I mentioned previously, I prefer the Clopper-Pearson intervals. This however relies on the availability of a function for the inverse beta distribution. One common situation is I just have all my tables in SQL, and I want to make a dashboard connected to a view of my tables. So the proportion of some event broken downs by days/weeks/months etc.

In that case exporting the data to python and re-uploading to the database can be a bit of a hassle, whereas creating a view is much less work. So here is an example query to calculate the proportion intervals entirely in SQL. So the initial table is a micro level table of events with 0/1 for a particular group. (This screenshot is for Access, but this should work in various databases.)

And then it is a groupby to get the original numerator, denominator, and proportion. Then a few rows calculating the adjusted proportion (add 2 to the numerator and 2*2 to the denominator), then finally this can still produce lower than 0 and higher than 1 intervals, so I cap those off.

/* This is for Access, for others may want to use SQRT() instead of SQR()
   Also may want to use CASE WHEN instead of IIF */
SELECT
   GroupID,
   SUM(Outcome) AS Num,
   COUNT(Outcome) AS Den,
   Num/Den AS Prop,
   Num + 2 AS nadj,
   Den + 2*2 as dadj,
   nadj/dadj as padj,
   2*SQR((padj/nadj)*(1 - padj)) AS zadj,
   IIF( padj < zadj, 0, padj - zadj) AS LowCI,
   IIF( (1 - padj) < zadj, 1, padj + zadj) AS HighCI
FROM ExampleData
GROUP BY GroupID;

This produces a 95% confidence interval for the final two columns. If you wanted to generate say a 99% confidence interval, you would replace the 2’s in the above table with 2.6. (In R you can do qnorm(1 - a/2), where a is 1 - confidence_level, to figure out this constant.)

What you shouldn’t use these intervals for

While I believe many applications of dashboards are well suited to including confidence intervals, confidence intervals (like p-values) are apt to be misinterpreted. One common one is that for a single 95% confidence interval, that does not mean the interval covers the true estimate with a 95% probability. This is an inference for an individual sample that is not possible in frequentist statistics – that summary would be akin to a posterior credible interval in Bayesian statistics. Confidence intervals are about the procedure, if we do this procedure over and over again, in the long run it will cover the true statistic (which we do not observe for any individual sample), according to the level we set.

Another common mistake with confidence intervals is when comparing two different intervals, them overlapping is sometimes interpreted as no difference. But this is a very conservative test (e.g. will fail to reject the null of no differences too often).

So say we were monitoring a process over time, and in October the process was 20% (40/200) and in November it was 28% (168/600). October’s confidence interval is 15% to 26%, and November’s confidence interval is 24% to 32%. So since those intervals overlap, we should conclude there are no differences correct? Not exactly, if we do a direct test for the differences in proportions (akin to a t-test of mean differences), we get a confidence interval of the difference as -14% to -1% (in R prop.test(c(40,168), c(200,600))). So in that direct hypothesis test, we would conclude October’s percent is lower than Novembers percent.

Geoff Cumming suggests that when going from individual confidence intervals to comparisons between groups, one confidence interval needs to cover the point estimate for the other group to conclude the two groups are different.

But that being said, I believe many dashboards would be improved if incorporating such confidence intervals. So although they may not always provide the test of interest, they are a good way to prevent yourself from over-interpreting noisy trends in smaller samples. In the case of comparing two intervals, for most situations I deal with, being conservative in saying this process is not showing differences is a better approach than worrying about minor fluctuations (although just depends on the use case whether that default behavior makes sense.)

So please, when reporting proportions with small samples, provide a confidence interval around those proportions!

Outliers in Distributions

If you google ‘outlier’, all of the results that come up are in terms of individual observation outliers. So if you have a set of transaction data that is 10, 20, 30, 8000, the singlet observation 8000 is an outlier. But for many situations with transaction data, you don’t want to examine individual outlier incidents, but look for systematic patterns. For example, if I am looking at healthcare insurance claims for my work, a single claim that is $100,000 is actually not that rare. But if we have a hospital that has mostly $100,000 claims for a specific treatment, whereas another with similar cases has a range of $50,000 to $100,000, that may signal there is some funny business going on.

There is no singular way to examine outliers in distribution. A plain old t-test of mean differences may make sense for some situations. But a generally more useful way IMO to think about the problem is to examine the distribution of the outcome in CDF space, as opposed to looking at particular moments of the distribution. A t-test basically only looks at the differences in means for the distributions, whereas examining the CDF we are looking for weird patterns at any point in the distribution.

Here is an example of comparing the cost of hospital stays (per length of stay), for a hospital compared to all others from the same datasource (details on the data in a sec). The way to read this graph is that at 10^3 (so $1000 per day claims) for facility 1458, we have around 20% of the claims data are below this value. For the rest of the hospital data, a larger proportion of claims are under a thousand dollars, more like 25%. Since the red line is always below the black line, it also means that the claims at this hospital are pretty much always larger than the claims at all the other hospitals.

For this example analysis, I am using data from New York State health insurance claims data (SPARC). I have posted python code to replicate here (note if you cannot access dropbox links, feel free to email and I will forward).

Here I am specifically analyzing medical, in-patient insurance claims (I dropped surgical claims) for around 300+ hospitals. There are quite a few claims in this data, over 2 million, and the majority of hospitals have plenty of claims to examine (so no hospitals with only 10 claims). I also specifically examine costs per length of stay. Initially I just examined costs, but will get to why I changed to the normalized version towards the end of the post.

Analysis of CDF Outliers

So first what I did was attempt to do a leave-one-out type stat test using the Kolmogorov-Smirnov test. This is a test that looks at the maximum vertical difference between the CDFs I showed earlier. I should have known better though. Given this large of sample size, even with multiple comparison adjustments for false discovery rate, every hospital was considered an outlier. This is sort of the curse of null hypothesis significance testing, it is either underpowered, so you get null results when things should really be flagged, or with large samples everything is flagged.

So what I did first was make a graph of all the different CDFs for each individual hospital. You can see from this plot we have a mass of the distribution that looks very similar in shape, but is shifted left or right. (Hospitals can bill different values, i.e. casemix, so can have the same types of events but have different bills, so that is normal.) But then we have a few outliers really stick out.

To characterize the central mass in this image, what I did was calculate each empirical CDF for each hospital (over 300 in this sample). Then I estimated the CDF for each hospital at a sample of points logspace distributed between $100 to $100,000. Then I took the 90% distribution between the ECDF values. This is easier to show than to say, so in the below pic the grey area is the 90% region for the CDFs. Then you can do stats to see how hospitals may fall outside that band.

So here 1320 is looking good until around 60% of the distribution, and then it is shifted right. There is a kink in the CDF as well, so this suggests really a set of different types of claims, and in that second group it is the outlier. 1320 was the hospital that had the most sample points outside of my grey coverage area, but you could also do outliers in terms of the distance between those two lines (again like a KS test stat), or in the area between those two lines (that is like a version of the Wasserstein distance only considering above/below moves). So here is the hospital that has the largest distance below the band (above the band signals that a hospital has lower claims on average):

Flat lines horizontally signal an absence of data, whereas vertical lines signal a set of claims with the exact same bill. So here we have a set of claims around $1000 per day that look normal, then an abnormal absence of data from $1,000 to $10,000. Then a large spike of claims that end up being around $45k per day.

So this is looking at the distribution relative to other hospitals, but a few examples I am familiar with look for these flat/vertical spikes in the CDF to identify fraud. Mike Maltz has an example of identifying collusion in bids. In another, Chris Stucchio identifies spikes in transaction data signaling potential fraud. Here I am just doing a test relative to other data to identify weird curves, not just flat lines though.

One limitation of this analysis I have conducted here is that it does not take into account the nature of the claims data. So say you had a hospital that specializes in cancer treatment, it may be totally normal for them to have claims that are higher value overall than a more typical hospital that spreads claims among a wider variety of types of visits/treatments. Initially I analyzed just the cost data, and it identified a few big outliers that ended up being hospice/nursing homes. So they had really high dollar value claims, but also really long stays. So when analyzing the claim per length of say, they were totally normal in that central mass.

So ultimately there could be other characteristics in the types of claims hospitals submit that could explain the weird CDF. One way to take that into account is to do a conditional model for the claims, and then do the ECDF tests on those conditional models. One way may be to look at the residuals for each individual hospital, another would be to draw a matched comparison sample. (Greg Ridgeway did this when examining police behavior in the NYPD.)

That would be like making a single comparison line (like my initial black/red line graph). So controlling the false discovery after that will be tough with larger samples (again the typical KS test, even with a matched sample, will likely always reject). So wondering if there is another machine learning way to identify outliers in CDF space, like a mashup of isolation forests and conditional density forests. Essentially I want to fit a model to draw those grey CDF bands, instead of relying on my sample of hospitals to draw the grey band in those latter plots.

Mapping attitudes paper published

My paper (joint work with Jasmine Silver, Rob Worden, and Sarah McLean), Mapping attitudes towards the police at micro places, has been published in the most recent issue of the Journal of Quantitative Criminology. Here is the abstract:

Objectives: We examine satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. We illustrate the utility of this approach by comparing micro- and meso-level aggregations of policing attitudes, as well as by predicting views about the police from crime data at micro places.

Methods: In each survey, respondents provided the nearest intersection to their address. Using that geocoded survey data, we use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city and compare the micro-level pattern of policing attitudes to survey data aggregated to the census tract. We also use spatial and multi-level regression models to estimate the effect of local violent crimes on attitudes towards police, controlling for other individual and neighborhood level characteristics.

Results: We demonstrate that there are no systematic biases for respondents refusing to answer the nearest intersection question. We show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime. Models predicting satisfaction with police show that local counts of violent crime are a strong predictor of attitudes towards police, even above individual level predictors of race and age.

Conclusions: Asking survey respondents to provide the nearest intersection to where they live is a simple approach to mapping attitudes towards police at micro places. This approach provides advantages beyond those of using traditional neighborhood boundaries. Specifically, it provides more precise locations police may target interventions, as well as illuminates an important predictor (i.e., nearby violent crimes) of policing attitudes.

And this was one of my favorites to make maps. We show how to take surveys and create analogs of hot spot maps of negative sentiment towards police. We do this via asking individuals to list their closest intersection (to still give some anonymity), and then create inverse distance weighted maps of negative attitudes towards police.

We also find in this work that nearby crimes are the biggest factor in predicting negative sentiment towards police. This hints that past results aggregating attitudes to neighborhoods is inappropriate, and that police reducing crime is likely to have the best margin in terms of making people more happy with the police in general.

As always, feel free to reach out for a copy of the paper if you cannot access JQC. (Or you could go a view the pre-print.)

Amending the WDD test to incorporate Harm Weights

So I received a question the other day about amending my and Jerry Ratcliffe’s Weighted Displacement Difference (WDD) test to incorporate crime harms (Wheeler & Ratcliffe, 2018). This is a great idea, but unfortunately it takes a small bit of extra work compared to the original (from the analysts perspective). I cannot make it as simple as just piping in the pre-post crime weights into that previous spreadsheet I shared. The reason is a reduction of 10 crimes with a weight of 10 has a different variance than a reduction of 25 crimes with a weight of 4, even though both have the same total crime harm reduction (10*10 = 4*25).

I will walk through some simple spreadsheet calculations though (in Excel) so you can roll this on your own. HERE IS THE SPREADSHEET TO DOWNLOAD TO FOLLOW ALONG. What you need to do is to calculate the traditional WDD for each individual crime type in your sample, and then combine all those weighted WDD’s estimates in the end to figure out your crime harm weighted estimate in the end (with confidence intervals around that estimated effect).

Here is an example I take from data from Worrall & Wheeler (2019) (I use this in my undergrad crime analysis class, Lab 6). This is just data from one of the PFA areas and a control TAAG area I chose by hand.

So first, go through the motions for your individual crimes in calculating the point estimate for the WDD, and then also see the standard error of that estimate. Here is an example of piping in the data for thefts of motor vehicles. The WDD is simple, just pre-post crime counts. Since I don’t have a displacement area in this example, I set those cells to 0. Note that the way I calculate this, a negative number is a good thing, it means crime went down relative to the control areas.

Then you want to place those point estimates and standard errors in a new table, and in those same rows assign your arbitrary weight. Here I use weights taken from Ratcliffe (2015), but these weights can be anything. See examples in Wheeler & Reuter (2020) for using police cost of crime estimates, and Wolfgang et al. (2006) for using surveys on public perceptions of severity. Many of the different indices though use sentencing data to derive the weights. (You could even use negative weights and the calculations here all work, say you had some positive data on community interactions.)

Now we have all we need to calculate the harm-weighted WDD test. The big thing here to note is that the variance of Var(x*harm_weight) = Var(x)*harm_weight^2. So that allows me to use all the same machinery as the original WDD paper to combine all the weights in the end. So now you just need to add a few additional columns to your spreadsheet. The point estimate for the harm reduction is simply the weight multiplied by the point estimate for the crime reduction. The variance though you need to square the standard error, and square the weight, and then multiply those squared results together.

Once that is done, you can pool the harm weighted stats together, see the calculations below the table. Then you can use all the same normal distribution stuff from your intro stats class to calculate z-scores, p-values, and confidence intervals. Here are what the results look like for this particular example.

I think this is actually a really good idea to pool results together. Many place based police interventions are general, in that you might expect them to reduce multiple crime types. Various harm scores are a good way to pool the results, instead of doing many individual tests. A few caveats though, I have not done simulations like I did in the WDD peer reviewed paper, I believe these normal approximations will do OK under the same circumstances though that we suggest it is reasonable to do the WDD test. You should not do the WDD test if you only have a handful of crimes in each area (under 5 in any cell in that original table is a good signal it is too few of crimes).

These crime count recommendations I think are likely to work as well for weighted crime harm. So even if you give murder a really high weight, if you have fewer than 5 murders in any of those original cells, I do not think you should incorporate it into the analysis. The large harm weight and the small numbers do not cancel each other out! (They just make the normal approximation I use likely not very good.) In that case I would say only incorporate individual crimes that you are OK with doing the WDD analysis to begin with on their own, and then pool those results together.

Sometime I need to grab the results of the hot spots meta-analysis by Braga and company and redo the analysis using this WDD estimate. I think the recent paper by Braga and Weisburd (2020) is right, that modeling the IRR directly makes more sense (I use the IRR to do cost-benefit analysis estimates, not Cohen’s D). But even that is one step removed, so say you have two incident-rate-ratios (IRRs), 0.8 and 0.5, the latter is bigger right? Well, if the 0.8 study had a baseline of 100 crimes, that means the reduction is 100 - 0.8*100 = 20, but if the 0.5 study had a baseline of 30 crimes, that would mean a reduction of 30 - 0.5*30 = 15, so in terms of total crimes is a smaller effect. The WDD test intentionally focuses on crime counts, so is an estimate of the actual number of crimes reduced. Then you can weight those actual crime decreases how you want to. I think worrying about the IRR could even be one step too far removed.

References

CrimCon Roundtable: Flipping a Criminal Justice PhD to an alt-academic Data Science Career

This Thursday 11/19/2020 at 1 PM Eastern, I will be participating in a roundtable for the online CrimCon event. This is free for everyone to zoom in, and here is the link to the program, I am on Stream 3!

The title is above — I have been a private sector data scientist at HMS for not quite a year now. I wanted to organize a panel to help upcoming PhD’s in criminal justice get some more exposure to potential data science positions, outside the traditional tenure track. Here is the abstract:

Tenure-track positions in academia are becoming more challenging to obtain, and only a small portion of junior faculty continue in academia to the rank of full professor. Therefore, students may opt to explore alternate options to obtain employment after their PhD is finished. These alternatives to the tenure track are often called “alt-academic” jobs. This roundtable will be focused on discussing various opportunities that exist for PhD’s in criminal justice and behavioral sciences spanning the public sector, the private sector, and non-profits/think tanks. Panelists will also discuss gaps in the typical PhD curriculum, with the goal of aiding current students to identify steps they can take to make themselves more competitive for alt-academic roles.

And here are each of the panelists bios:

Dr. Andrew Wheeler is currently a Data Scientist at HMS working on problems related to predictive modeling and optimization in relation to health insurance claims. Before joining HMS, he received a PhD degree in Criminal Justice from SUNY Albany. While in academia his research focused on collaborating with police departments for various problems including; evaluating crime reduction initiatives, place based and person based predictive modeling, data analytics for crime analysis, and developing models for the efficient and fair delivery of police resources.

Dr. Jennifer Gonzalez is the Senior Director of Population Health at the Meadows Mental Health Policy Institute, where she manages the Institute’s research and data portfolio. She earned her doctoral degree in epidemiology and a M.S. degree in criminal justice. Before joining MMHPI, Dr. Gonzalez was a tenured associate professor at the University of Texas School of Public Health, where she maintained a portfolio of more than $10 million in research funding and published more than one hundred interdisciplinary articles focused on the health of those who come into contact with—and work within—the criminal justice system.

Dr. Kyleigh Clark-Moorman is a Senior Research Associate for the Public Safety Performance Project at The Pew Charitable Trusts, a non-profit public policy organization. Kyleigh began working at Pew in 2019 and completed her PhD in Criminology and Criminal Justice at the University of Massachusetts, Lowell in May 2020. As an early career researcher, Dr. Clark-Moorman’s work has been published in Criminal Justice and Behavior, Criminal Justice Studies, and the Journal of Criminal Justice. In her role at Pew, Kyleigh is responsible for research design and data analysis focused on various criminal justice topics while also working with external partners to produce high-impact reports and analyses to raise awareness and drive public policy.

Matt Vogel is Associate Professor in the School of Criminal Justice at the University at Albany, SUNY and the Director of the Laboratory for Decision Making in Criminology and Criminal Justice. Matt regularly assists local agencies with data and evaluation needs. Some of his ongoing collaborations include assessments of racial representation on capital juries in Missouri, a longitudinal evaluation of a school-based violence reduction program, and the implementation of a police-hospital collaboration to help address retaliatory violence in St. Louis. Prior to joining the faculty at UAlbany, Matt worked in the Department of Criminology and Criminal Justice at the University of Missouri – St. Louis and held a long-term visiting appointment with the Faculty of Architecture at TU Delft (the Netherlands).

If you have any upfront questions you would like addressed by the panel, always feel free to send me a pre-emptive email (or comment below).


Update: The final roundtable is now posted on Youtube. See below for the panels thoughts on pursuing non-tenure track jobs with your social science Phd.

Regression with Simple Weights

I was reminded of this paper by Jung et al. on constructing simple rules via regression recently. So in the past few posts I have talked about how RTM (1,2) is aimed at making simple models. This is via variable selection and/or simplying the inputs to be binary yes/no. But in the end the final equation could be something like:

log(Crime) = -0.56 + 0.6923*NearbyBars + 0.329*HighDensity311

The paper linked above is about making the regression weights simple, so instead of a regression weight of 0.89728, you may just round the regression weight to 1. The Jung paper does a procedure where they use lasso regression and then round the weights. But there is a simpler approach IMO I will illustrate, just amend the lasso weights to push the coefficients to simple integers. (Also reminded by this example of using an iterative linear program to push weights to binary 0/1.)

So in lasso, you estimate your normal regression equation, but put a penalty on the weights that is typically something like lambda*(sum(abs(reg_weights)) - 1)**2. So if you have reg weights that add to more than 1, they are penalized by a particular amount (the lambda is a tuner to make the penalty higher/lower). And in the iterative algorithm to minimize your loss function plus this added penalty, it will converge to regression weights that meet the criteria of in total summing to around 1. Not exactly 1 but close.

You can however swap out that penalty term with whatever you want (or add to it additional penalties). I will show an example of using a penalty term to push regression coefficients towards integer values, creating simple regression weights.

Why Simple Models?

Dan Simpson has a good blog post of the Jung paper and why simple models are sometimes preferable (and I also have a comment why simple models like this tend to work out well for CJ datasets). But here are few quick examples why you might want a simple model results.

Example 1: If you have people in the field who are tabulating data and making quick decisions, it may be they need to use pen/paper and make a quick decision. No time to input results into a computer and pop out a prediction. Imagine a nurse in the ER, or even your general practitioner. There may be quite a bit of utility in making a simple check list that says if +4 on this scale, do a more intensive treatment.

Example 2: You have a complicated, large database. It is easier to create a simple predictive model in SQL to serve up predictions (either because of latency or because of the complexity of the data pipeline). Instead of a complicated random forest, a linear regression with simple weights will be much easier to implement.

Example 3: Transparency. Complicated models are more difficult to understand and monitor. If you have a vested interest in presenting the model to outside parties, it may make sense to sacrifice some accuracy to make the model more interpretable. Also similar to lasso, I suspect these simple weights will reduce the variance of predictions.

The reason that these simple weights work well in practice for many social science examples you could interpret either in a good light or a bad one. For the half-empty interpretation, our models are not well identified – we can literally swap out various weights in our regression equation and get near similar predictions. So it is fools errand to try to find the regression equation that describes the underlying system. But you can flip that around as well, we don’t even need to find the perfect equation, we can identify quite a few good predictive equations. And why not pick a good equation that is easier to interpret?

Pytorch Example

The example set of code here is very simple, so I will just put the python code entirely in this post. First I import my libraries I am using and change my directory.

############################################
import os
import torch
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\regression_simpleweights\analysis'
os.chdir(my_dir)
############################################

Next I read in the data, which I have previously used as an example in prior blog posts on doctor visits for medicare patients. One thing to note here, is that I rescale the independent variables I am using to min/max. So the age variable instead of going from 65-90 like in the original data, now is scaled to be between 0/1. This is a problem intrinsic to lasso as well, in that you can change the scale of the input variables and it changes the weights. Here with the original data, the education variable has a tiny regression coefficient (0.2), but is highly stat significant. So without rescaling that variable, the model said to hell with your penalty and still converged to a solution of that regression weight is 0.2. If you divide the education variable by 5 though, the corresponding regression weight would change to around 1.

###########################################
#Data from Stata, https://www.stata-press.com/data/r16/gsem_mixture.dta
#see pg 501 https://www.stata.com/manuals/sem.pdf

visit_dat = pd.read_stata('gsem_mixture.dta')
y_dat = visit_dat[['drvisits']]
x_vars = ['private','medicaid','age','educ','actlim','chronic']
#rescaling variables to 0/1
x_dat = visit_dat[x_vars]
visit_dat[x_vars] = (x_dat - x_dat.min(axis=0)) / ( x_dat.max(axis=0)  - x_dat.min(axis=0) )
x_dat = visit_dat[x_vars] #intentional not a copy
###########################################

Now in the next part, I estimate the default linear regression model using statsmodels for reference. Then I stuff the results into pytorch tensors (which I will use later as default starting points for the pytorch estimates). Below is a pic of the resulting summary for the regression model (with the scaled variables, so is slightly different than my prior post).

###########################################
#Estimating the same model in statsmodel
#for confirmation of the result

stats_mod = smf.ols(formula='drvisits ~ private + medicaid + age + educ + actlim + chronic',
                    data=visit_dat)
sm_results = stats_mod.fit()
print(sm_results.summary())

#What is the mean squared error
pred = sm_results.get_prediction().summary_frame()
print( ((y_dat['drvisits'] - pred['mean'])**2).mean() )
#169513.0122252265 for sum
#46.10 for mean

#for setting default initial weights
coef_table = sm_results.params
int_ten = torch.tensor([coef_table[0]], dtype=torch.float, requires_grad=True)
coef_ten = torch.tensor(pd.DataFrame(coef_table[1:]).T.to_numpy(), dtype=torch.float, requires_grad=True)
###########################################

Now creating the pytorch model is quite simple. For linear regression it is just one linear layer, and then setting the loss function to mean squared error. Then I create my own simple weight penalization factor in the simp_loss function. This takes the regression weights (not including the bias/intercept term), takes the difference between the observed weight and the rounded weight, takes the absolute value and sums those absolute values up. Then in the loop when I am fitting the model, you can see the loss = criterion(y_pred, y_ten) + 0.4*simp_loss(model) line. For the usual linear regression, it would just be the first criterion term. Here to add in the penalty term is super simple in pytorch, you just add it to the loss. (And you can incorporate additional penalities, the same way ala elastic-net. The Jung paper they put a penalty on the sum of coefficients as per the original lasso as well.)

Then the final part of the code after the loop is just putting the coefficients in a nicer data frame to print. And below the code snippet are the results.

###########################################
#Now estimating OLS model with simple coefficient
#Penalities in Pytorch

torch.manual_seed(10)

model = torch.nn.Sequential( 
  torch.nn.Linear(len(x_vars),1,bias=True)
)

##Initializing weights
#with torch.no_grad():
#    model[0].weight = torch.nn.Parameter(coef_ten)
#    model[0].bias = torch.nn.Parameter(int_ten)

x_ten = torch.tensor( x_dat.to_numpy(), dtype=torch.float)
y_ten = torch.tensor( y_dat.to_numpy(), dtype=torch.float)

criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def simp_loss(mod):
    dif = mod[0].weight - torch.round(mod[0].weight)
    return dif.abs().sum()

for t in range(100000):
    #Forward pass
    y_pred = model(x_ten)
    #Loss
    loss = criterion(y_pred, y_ten) + 0.4*simp_loss(model)
    if t % 1000 == 99:
        print(f'iter: {t}, loss: {loss.item()}') 
    #Zero gradients
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

#Making a nice dataframe of coefficients

coef_vars = ['Inter'] + x_vars
vals = list(model[0].bias.detach().numpy()) + list(model[0].weight.detach().numpy()[0,:])
res = pd.DataFrame(zip(coef_vars, vals), columns=['Var','Coef'])
print( res )
###########################################

Here I did not round the coefficients, so you can see that they are not exactly integer values, but are very close. So this will result in a lower loss than taking the usual linear regression coefficients and rounding them like in the noted Jung paper. It is a more direct approach. Also note that the intercept is not close to an integer value. I did not include the intercept in my penalty term. You could if you wanted to, but for most examples I don’t think it makes much sense to do that.

But one of the things that I have noticed playing around with pytorch more is that it is very difficult to get random initialized weights to converge to the same solution. That identification problem I mentioned earlier. One way is instead of using random initialized weights, is to initialize them to some reasonable values. If you uncomment the lines with torch.no_grad(): in the above code and initialize the weights to start from the unregularized OLS solution, it converges much faster, has a slightly smaller mean square error term, and results in these effects:

So you can see in that solution it is exactly the same as rounding the initial OLS solution (ignoring the intercept again). But that may not always be the case. For example if actlim (activity limitations) and educ (education) had a very high correlation, it may be rounding both down is too big a hit to the fit of the equation, so one may go down and one go up. (You need to estimate the equation to know if things like that will occur.)

And that is all folks! While if I were sharing this more broadly, I would likely make a statsmodel like interface (and it appears they use cvxopt under the hood) instead of pytorch, it is very simple to amend pytorch to return simple weights, just add in the penalty to the loss function. Works the same way for lasso/ridge as it does for the simple weights example I give here.

Next up I want to try to figure out autograd in pytorch good enough to give standard errors for these various regression models I am estimating. While I don’t think hypothesis testing makes sense for these various models I am sharing, seeing a standard error that is very high may have prognostic value. In this case, if you had a very high standard error relative to the simple coefficient, it might suggest you should rescale the variable a different way or drop it entirely.

Also for this example, to be simple in the field it would not only need simple coefficients, but simple inputs as well. Wondering if there is a way to add in threshold layers in deep learning to automatically figure out the best way to make the inputs binary (e.g. above 70, educ below 10, etc.) instead of doing min/max scaling of the inputs.