The WDD test with different pre/post time periods

Eric Piza asked the other day if my and Jerry’s WDD test can be used when the pre/post time periods are different. The answer is yes out of the box, the identification strategy does not rely on equality of time periods. So for example, say we had two years pre and one year post data, and the crime counts in treated/control looked like this:

         Pre  Post 
Treated   80    20
Control  100    50

So then our difference-in-difference Poisson estimate of the treatment effect would be:

(20 - 80) - (50 - 100) =  -10

What the parallel trends assumption means here is that since you saw a decrease in 50 crimes in the control area, you would expect a decrease of 50 crimes in the treated area as well. The variance of this estimate is then 20 + 80 + 50 + 100 = 250, and so the standard error is sqrt(250) ~ 15.8. So this is not a statistically significant effect.

It is hard to interpret this effect size though, since it is not a standard unit of time comparison. Also the variance of the estimate will be larger if you have a longer pre time period, which is the opposite of what you want. We can actually amend the statistic though to be a per-unit-time comparison, which will reduce the variance of the estimate. It ends up being similar to my prior post on adding Harm Weights to the WDD, you can’t just pipe in the per unit time estimates in the spreadsheet I shared, but I will show here how to incorporate them into the estimator (and share some python code to show the estimator behaves as expected in simulations).

So again with a pre-time period of 2 years, and post of 1 year, we could do the prior table as per year estimates.

         Pre  Post 
Treated   40    20
Control   50    50

And here our estimate of the crime reduction effect is different:

(20 - 40) - (50 - 50) =  -20

So with a Poisson variable with a mean of 100, the variance of that variable is also 100. So here we are dividing that 100 by a constant 2 – this changes the variance to 100/(2^2). (Var(X*a) = a^2*Var(X) where X is a random variable and a is a constant.) The post variables are simply divided by one, so does not change their variance. So to carry this forward to our standard error estimate, we would calculate (edit, had some errors in this part, should be using the original counts, not the average counts, but my later python code was correct, so the simulation part is OK):

20/1 + 80/4 + 50/1 + 100/4 = 115

So you can see that our variance estimate here is much smaller, and that the standard error is sqrt(115) ~ 10.7. So here the reduction is not quite a statistically significant reduction in crimes at the 0.05 level. A 95% confidence interval would be -20 +/- 2*10.7 ~ [+1, -41]. Here the WDD estimate is easier to interpret as well, and that confidence interval corresponds to a per year estimate of somewhere between +1 and -41 crimes.

Below I share some python code to conduct simulations similar to the original WDD paper. This code will then establish the estimator has the null distribution as expected (when there are no changes it really is a standard normal distribution) and that the confidence intervals have coverage like you would expect.

Python Simulation Code

For set up, I import the libraries I need (stat distributions, numpy and pandas). I am not going to go into detail into the functions, but it allows you to generate simulated distributions in various ways to conduct analysis of the properties of my time weighted estimator I have specified above.

'''
WDD Simulation with differing time periods
Andy Wheeler
'''

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform

#This works for the scipy functions
np.random.seed(seed=10)

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    return (est, var_est, true_val, z_score)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
    else:
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
    else:
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
    else:
        base['cont1'] = base['cont0']
    base.drop(columns='index',inplace=True)
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['est','var_est','true_val','z_score']
    return pd.concat([base,sim_vals], axis=1)

So for a first example, this code generates treatment/control areas with a Poisson mean of 5 in both the pre/post time periods. But, the pre time period is 4 units of time, and the post time period is only 1 unit. So this means there is no change, and the Z score estimator should on average have a 0 estimate and a standard deviation of 1. I do 10,000 simulations to keep it going a bit faster, but you can up that if you want.

# No change, with baseline of 5 crimes per unit time
sim_dat = make_data(10000, 5, 5, 5, 5, 4, 1)
sim_dat['z_score'].describe()

So here we can see these 10k simulated Poisson data have a mean z-score of 0 and a standard deviation of 1, right like we expected.

So I haven’t extensively tested, but if you have average crime counts well under 5, I would be a bit hesitant to use this estimator. (So you either need larger area aggregations or larger time aggregations.) Although you could do simulations on your own to see how it holds up.

The way I wrote the functions you can also pass in random variables as well, so here is an example with again no change, but the baseline varies uniformily from 5 to 100. And here also the pre time periods are 6, and the post time period is again just 1.

# Can pass in random functions instead of constant values
sim_n = 10000
tf = uniform.rvs(loc=5, scale=100, size=sim_n)

sim_dat2 = make_data(sim_n, tf, None, None, None, 6, 1)
sim_dat2.head()
sim_dat2['z_score'].describe()

So you can see the base simulated dataset pre/post always has the same means, but instead of being a set of constant 5’s, it changes for each row (simulation) in the dataset. And again the null distribution is right on the money with a mean of 0 and standard deviation of 1.

So those are examples of the null distribution of no changes in the time weighted estimator. This establishes that the false positive alpha rates are as you would expect. E.g. if you use the usual p-value < 0.05, if the differences are really 0 you only have a false positive reject the null 5 times out of 100.

But we also want to establish that when there is a difference, the estimator is not biased and that the variance estimates are correct. For the later part looking at the coverage rates of the confidence intervals is one way to do that. So here I show that with my hypothetical example in the intro part of this blog, the 95% and 90% confidence interval coverage rates are exactly as they should be. And the z-score estimate is right about where it should be as well.

# Lets look at the coverage rate for a decline from 40 to 20
def cover(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    cover = ( data['true_val'] > low) & ( data['true_val'] < high )
    return cover

sim_dat3 = make_data(sim_n, 40, 20, 50, 50, 2, 1)
sim_dat3.head()

# This should be centered on 2
sim_dat3['z_score'].describe()

# Should be ~ 0.9
co_90 = cover(sim_dat3, ci=0.9)
co_90.mean()

# Should be ~ 0.95
co_95 = cover(sim_dat3, ci=0.95)
co_95.mean()

So you can see the coverage is right on the money. The estimator is slightly biased downward in this simulation (should get a z-score on average around -2, but here the mean is -1.85). But it is good enough IMO to not worry about much in this situation.

Again, the original estimator without weighted for time is fine, if we do the same motions without doing weighting for different time periods, the coverage is still all fine and dandy.

# Note you can do the same coverage estimate without time weighted
sim_dat4 = make_data(sim_n, 80, 20, 100, 50, 1, 1)
sim_dat4.head()

# This should be around -0.6
sim_dat4['z_score'].describe()

co_90w = cover(sim_dat4, ci=0.9)
co_90w.mean()

co_95w = cover(sim_dat4, ci=0.95)
co_95w.mean()

So you can see again coverage is right on the money, and the z-score estimator actually has less bias than the time weighted one, it is right on the money as expected.

So why would you prefer the time weighted estimator if it shows more bias? It is because it has a lower variance, this code shows the length of the confidence intervals in the simulations.

# Does it make a difference?
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return high - low

len4 = len_ci(sim_dat4)
len4.describe()

len3 = len_ci(sim_dat3)
len3.describe()

So you can see here that the non-time weighted estimator tends to have a confidence interval with a length of 62, whereas the time weighted estimator has a confidence interval on average of 42.

So above establishes that the time weighted estimator behaves as you would expect. You can also use this code to conduct some potential power analyses. So for the time weighted estimator we show, even though the reduction is around 50% in the treated area (going from 40 to 20), the power is not great, around 60%.

# Example power analyses, ONE TAILED
def reject_rate(data, alpha=0.05):
    p_vals = norm.cdf(data['z_score'])
    return p_vals < alpha
    
r3 = reject_rate(sim_dat3)
r3.mean()

So this means if you did this experiment in real life and it was that effective, you would still fail to reject the null of no differences 2/5 times.

But what if we say we will get more historical data? So 4 years back instead of just 2? How does that impact our power estimates?

# How about with more historical data
sim_dat5 = make_data(sim_n, 40, 20, 50, 50, 4, 1)
r5 = reject_rate(sim_dat5)
r5.mean()

The power goes up by alittle, to 0.67. The same is true if we up the post period to 4 time periods instead of 1:

# How about with more post data
sim_dat6 = make_data(sim_n, 40, 20, 50, 50, 4, 4)
r6 = reject_rate(sim_dat6)
r6.mean()

So now in this example you have an over 90% power to detect a crime reduction, going from 40 to 20 per time period (where the control has an average of 50 crimes per time period), if you have 4 pre time periods and 4 post time periods.

Future Stuff

So a few caveats with this. For one, you may think that since dividing per time period reduces the variance, why not divide by smaller time slivers. So instead of one year, why not divide by 365 days?

I have not studied extensively this property of the estimator. So I cannot say how it behaves with more/less time aggregation into smaller Poisson estimates. You will need to take that on yourself if you want to examine very fine time units and very small Poisson counts per unit time. Again I think a baseline rule of thumb that they should not be lower the 5 counts per unit time is the best advice I can give without doing simulations for your exact circumstances.

A second part is that with longer time periods comes the risk that the control areas are not as good. This is a problem intrinsic to synthetic control analysis as well (that I don’t believe anyone has a particular answer to). And I don’t have an answer either.

For the pre-time period, you can check the parallel trends assumption by simply plotting the two time series, they should be close to in step with one another. So that is not a big deal. But with the post time period, I think if you monitor long enough they will eventually depart from one another.

So I think it is best to set up a time period at the start you have committed to doing the experiment. And you can use the power analysis simulations like I showed to help you figure out that period. But it may be possible to extend this WDD estimate to continuously monitor an intervention (see here for example).

New book: Micro geographic analysis of Chicago homicides, 1965-2017

In joint work with Chris Herrmann and Dick Block, we now have a book out – Understanding Micro-Place Homicide Patterns in Chicago (1965 – 2017). It is a Springer Brief book, so I recommend anyone who has a journal article that is too long that this is a potential venue for the work. (Really this is like the length of three journal articles.)

A few things occurred to prompt me to look into this. First, Chicago increased a big spike of homicides in 2016 and 2017. Here is a graph breaking them down between domestic related homicides and all other homicides. You can see all of the volatility is related to non-domestic homicides.

So this (at least to me) begs the question of whether those spiked homicides show similar characteristics compared to historical homicides. Here we focus on long term spatial patterns and micro place grid cells in the city, 150 by 150 meter cells. Dick & Carolyn Block had collated data, including the address where the body was discovered, using detective case notes starting in 1965 (ending in 2000). The data from 2000 through 2017 is the public incident report data released by Chicago PD online. Although Dick and Carolyn’s public dataset is likely well known at this point, Dick has more detailed data than is released publicly on ICPSR and a few more years (through 2000). Here is a map showing those homicide patterns aggregated over the entire long time period.

So we really have two different broad exploratory analyses we employed in the work. One was to examine homicide clustering, and the other was to examine temporal patterns in homicides. For clustering, we go through a ton of different metrics common in the field, and I introduce even one more, Theil’s decomposition for within/between neighborhood clustering. This shows Theil’s clustering metric within neighborhoods in Chicago (based on the entire time period).

So areas around the loop showed more clustering in homicides, but here it appears it is somewhat confounded with neighborhood size – smaller neighborhoods appear to have more clustering. This is sort of par for the course for these clustering metrics (we go through several different Gini variants as well), in that they are pretty fickle. You do a different temporal slice of data or treat empty grid cells differently the clustering metrics can change quite a bit.

So I personally prefer to focus on long term temporal patterns. Here I estimated group based trajectory models using zero-inflated Poisson models. And here are the predicted outputs for those grid cells over the city. You can see unlike prior work David Weisburd (Seattle), myself (Albany), or Martin Andresen (Vancouver) has done, they are much more wavy patterns. This may be due to looking over a much longer horizon than any of those prior works though have.

The big wave, Group 9, ends up being clearly tied to former large public housing projects, which their demolitions corresponds to the downturn.

I have an interactive map to explore the other trajectory groups here. Unfortunately the others don’t show as clear of patterns as Group 9, so it is difficult to answer any hard questions about the uptick in 2016/2017, you could find evidence of homicides dispersing vs homicides being in the same places but at a higher intensity if you slice the data different ways.

Unfortunately the analysis is never ending. Chicago homicides have again spiked this year, so maybe we will need to redo some analysis to see if the more current trends still hold. I think I will migrate away from the clustering metrics though (Gini and Theil), they appear to be too volatile to say much of anything over short term patterns. I think there may be other point pattern analysis that are more diagnostic to really understand emerging/changing spatial patterns.

The coffee next to the cover image is Chris Herrmann’s beans, so go get yourself some as well at Fellowship Coffee!

Publishing in Peer Review?

I am close, but not quite, entirely finished with my current crim/cj peer reviewed papers. Only one paper hangs on, the CCTV clearance paper (with Yeondae Jung). Rejected twice so far (once on R&R from Justice Quarterly), and has been under review in toto around a year and a half so far. It will land somewhere eventually, but who knows where at this point. (The other pre-prints I have on my CV but are not in peer review journals I am not actively seeking to publish anymore.)

Given the typical lags in the peer review process, if you look at my CV I will appear active in terms of publishing in 2020 (6 papers) and 2021 (4 papers and a book). But I have not worked on any peer review paper in earnest since I started working at HMS in December 2019, only copy-editing things I had already produced. (Which still takes a bit of work, for example my Cost of Crime hot spots paper took around 40 hours to respond to reviewers.)

At this point I am not sure if I will pursue any more peer reviewed publications directly in criminology/criminal justice. (Maybe as part of a team in giving support, but not as the lead.) Also we have discussed at my workplace pursuing publications, but that will be in healthcare related projects, not in Crim/CJ.

Part of the reason is that the time it takes to do a peer review publication is quite a bit relative to publishing a simple blog post. Take for instance my recent post on incorporating harm weights into the WDD test. I received the email question for this on Wednesday 11/18, thought about how to tackle the problem overnight, and wrote the blog post that following Thursday morning before my CrimCon presentation, (I took off work to attend the panel with no distractions). So took me around 3 hours in total. Many of my blog posts take somewhat longer, but I definitely do not take any more than 10-20 hours on an individual one (that includes the coding part, the writing part is mostly trivial).

I have attempted to guess as to the relative time it takes to do a peer reviewed publication based on my past work. I averaged around 5 publications per year, worked on average 50 hours a week while I was an academic, and spent something like I am guessing 60% to 80% (or more) of my time on peer review publications. Say I work 51 weeks a year (I definitely did not take any long vacations!, and definitely still put in my regular 50 hours over the summertime), that is 51*50=2550 hours. So that means around (2550*0.6)/5 ~ 300 or (2550*0.8)/5 ~ 400 so an estimate of 300 to 400 hours devoted to an individual peer review publication over my career. This will be high (as it absorbs things like grants I did not get), but is in the ballpark of what I would guess (I would have guessed 200+).

So this is an average. If I had recorded the time, I may have had a paper only take around 100 hours (I don’t think I could squeeze any out in less than that). I have definitely had some take over 400 hours! (My Mapping RTM using Machine Learning I easily spent over 200 hours just writing computer code, not to brag, it was mostly me being inefficient and chasing a few dead ends. But that is a normal part of the research process.)

So it is hard for me to say, OK here is a good blog post that took me 3 hours. Now I should go and spend another 300 to write a peer review publication. Some of that effort to publish in peer review journals is totally legitimate. For me to turn those blog posts into a peer review article I would need a more substantive real-life application (if not multiple real-life applications), and perhaps detailed simulations and comparisons to other techniques for the methods blog posts. But a bunch is just busy work – the front end lit review and answering petty questions from peer reviewers is a very big chunk of that 300 hours (and has very little value added).

My blog posts typically get many more views than my peer review papers do, so I have very little motivation to get the stamp of approval for peer review. So my blog posts take far less time, are more wide read, and likely more accessible than peer reviewed papers. Since I am not on the tenure track and do not get evaluated by peer reviewed publications anymore, there is not much motivation to continue them.

I do have additional ideas I would like to pursue. Fairness and efficiency in siting CCTV cameras is a big one on my mind. (I know how to do it, I just need to put in the work to do the analysis and write it up.) But again, it will likely take 300+ hours for me to finish that project. And I do not think anyone will even end up using it in the end – peer reviewed papers have very little impact on policy. So my time is probably better spent writing a few blog posts and playing video games with all the extra time.

If you are an editor reading this, I still do quite a few peer reviews (so feel free to send me those). I actually have more time to do those promptly since I am not hustling writing papers! I have actually debated on whether it is worth it to start my own peer reviewed journal, or maybe contribute to editing an already existing journals (just joined the JQC editorial board). Or maybe start writing my own crime analysis or methods text books. I think that would be a better use of my time at this point than pursuing individual publications.

A changepoint logistic model in pystan

So the other day I showed how to use the mcp library in R to estimate a changepoint model with an unknown changepoint location. I was able to get a similar model to work in pystan, although it ends up being slower in practice than the mcp library (which uses JAGS under the hood). It also limits the changepoints to a specific grid of values. So offhand there isn’t a specific reason to prefer this approach to the R mcp library, but I post here to show my work. Also I illustrate that with this particular model, using 1000 simulated observations.

To be clear what this model is, instead of the many time series examples floating around about changepoints (like the one in the Stan guide), we have a model with a particular continuous independent variable x, and we are predicting the probability of something based on that x variable. It is not that different, but many of those time series examples the universe to check for changepoints is obvious, only the observed time series locations. But here we have a continuous input (distance a crime event is from a CCTV camera), but we can only check a finite number of locations. It ends up being closer in spirit to this recent post by Keith Goldfield.

So in some quick and dirty text math, here c is the changepoint location and l is the logit function:

l(Prob[y]) = intercept + b1*x; if x <= c
l(Prob[y]) = intercept + b1*x + b2*(x - c); if x > c

This model can be expanded however you want – such as other covariates that do not change with the changepoint. But for this simple simulation I am just looking at the one running variable x and the binary outcome y.

Python Code

So first, I load up the libraries I will be using, then I simulate some data. Here the changepoint is located at 0.42 for the x variable, and in the ylogit line you can see the underlying logistic regression equation.

#################################
# Libraries I am using
import pystan
import numpy as np
import pandas as pd
import statsmodels.api as sm
#################################

#################################
# Creating simulated data
np.random.seed(10)
total_cases = 1000 #30000
x = np.random.uniform(size=total_cases) #[total_cases,1]
change = 0.42
xdif = (x - change)*(x > change)
ylogit = 1.1 + -4.3*x + 2.4*xdif
yprob = 1/(1 + np.exp(-ylogit))
ybin = np.random.binomial(1,yprob)
#################################

When testing out these models, one mistake I made was thinking offhand that 1,000 observations should be plenty. (Easier to run more draws with a smaller dataset.) When I had smaller effect sizes, the logistic coefficients could be pretty badly biased. So I started as a check estimating the logistic model inputting the correct changepoint location. Those biased estimates are pretty much the best case scenario you could hope for in the subsequent MCMC models. So here is an example fitting a logit model inputting the correct location for the changepoint.

#################################
#Statsmodel code to get
#The coefficient estimates 
#And standard errors for the sims
con = [1]*len(x)
xcomb = pd.DataFrame(zip(con,list(x),list(xdif)),columns=['const','x','xdif'])
log_reg = sm.Logit(ybin, xcomb).fit()
print(log_reg.summary()) 
#################################

So you can see that my coefficient estimates and the frequentist standard errors are pretty large even with 1,000 samples. So I shouldn’t expect my later MCMC model to have any smaller credible intervals than above.

So here is the Stan model. I am using pystan here, but of course it would be the same text file if you wanted to fit the model using R. (Just compiles C++ code under the hood.) Of only real note is that I show how to use the softmax function to estimate the actual mean location of the changepoint. Note that that mean summary though only makes sense if you make your grid of changepoint locations regular and fairly fine. (So if you said a changepoint could be at 0.1, 0.36, and 0.87, taking a weighted mean of those three locations doesn’t make sense.)

#################################
#Stan model
changepoint_stan = """
data {
   int<lower=1> N;
   vector[N] x;
   int<lower=0,upper=1> y[N];
   int<lower=1> Samp_Points;
   vector[Samp_Points] change;
}
transformed data {
  real log_unif;
  log_unif = -log(Samp_Points);
}
parameters {
  real intercept;
  real b_x;
  real b_c;
}
transformed parameters {
  vector[Samp_Points] lp;
  real before;
  real after;
  lp = rep_vector(log_unif, Samp_Points);
  for (c in 1:Samp_Points){
    for (n in 1:N){
      before = intercept + b_x*x[n]; 
      after = before + b_c*(x[n] - change[c]);
      lp[c] = lp[c] + bernoulli_logit_lpmf(y[n] | x[n] < change[c] ? before : after );
    }  
  }
}
model {
  intercept ~ normal(0.0, 10.0);
  b_x ~ normal(0.0, 10.0);
  b_c ~ normal(0.0, 10.0);
  target += log_sum_exp(lp);
}
generated quantities {
  vector[Samp_Points] prob_point;
  real change_loc;
  prob_point = softmax(lp);
  change_loc = sum( prob_point .* change );
}
"""
#################################

And finally I show how to prepare the data for pystan (as a dictionary), compile the model, and then draw a ton of samples. I generate a regular grid of 0.01 intervals from 0.03 to 0.97 (can’t have a changepoint outside of the x data locations, which I drew as a random uniform 0,l). Note the more typical default of 1000 tended to not converge, the effective number of samples is quite small for that many. So 5k to 10k samples in my experiments tended to converge. Note that this is not real fast either, took about 40 minutes on my machine (the Stan guesstimates for time were always pretty good ballpark figures).

#################################
# Prepping data and fitting the model

stan_dat = {'N': ybin.shape[0]}
stan_dat['change'] = np.linspace(0.03,0.97,95) #[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
stan_dat['Samp_Points'] = len(stan_dat['change'])
stan_dat['x'] = x
stan_dat['y'] = ybin


sm = pystan.StanModel(model_code=changepoint_stan)
#My examples needed more like 10,000 iterations
#effective sample size very low, took about 40 minutes on my machine
fit = sm.sampling(data=stan_dat, iter=5000, 
                  warmup=500, chains=4, verbose=True)
#Prints some results at the terminal!
print(fit.stansummary(pars=['change_loc','intercept','b_x','b_c']))
#################################

So you can see the results – the credible intervals for the intercept and regression coefficient before the change point are not bad, just slightly larger than the logistic model. The credible interval for the changepoint location and the changepoint effect different are quite uncertain though. The changepoint location covers almost the whole interval I examined. It may be better to plot the individual probabilities, like Goldfield did in his post, as opposed to summarized a mean location for the distribution (which is discrete in the end based on your grid of locations you look at).

So that at least gives a partial warning that you need quite big data samples to effectively identify the changepoint location, at least for this Stan model as I have shown. I haven’t run it on my 26k actual data sample, as it will end up taking my computer around 30 hours to crunch out 10k draws with 4 chains. Next up I rather see if I can get a similar model working in pyro, as my GPU on my personal machine I think will be faster than the C++ code here. (There are probably smarter ways to vectorize the Stan model as well.)

Graphing Spline Predictions in SPSS

I might have around 10 blog posts about using splines in regression models – and you are about to get another. Instead of modeling non-linear effects via polynomial terms (e.g. including x^2, x^3 in a model, etc.), splines are a much better default procedure IMO. For a more detailed mathy exposition on splines and a walkthrough of the functions, see my class notes.

So I had a few questions about applying splines in generalized linear models and including control variables in my prior post (on a macro to estimate the spline terms). These include can you use them in different types of generalized linear models (yes), can you include other covariates into the model (yes). For either of those cases, interpreting the splines are more difficult though. I am going to show an example here of how to do that.

Additionally I have had some recent critiques of my paper on CCTV decay effects. One is that the locations of the knots we chose in that paper is arbitrary. So while that is true, one of the reasons I really like splines is that they are pretty robust – you can mis-specify the knot locations, and if you have enough of them they will tend to fit quite a few non-linear functions. (Also a note on posting pre-prints, despite being rejected twice and under review for around 1.5 years, it has over 2k downloads and a handful of citations. The preprint has more downloads than my typical published papers do.)

So here I am going to illustrate these points using some simulated data according to a particular logistic regression equation. So I know the true effect, and will show how mis-located spline knots still recovers the true effect quite closely. This example is in SPSS, and uses my macro on estimating spline basis.

Generating Simulated Data

So first in SPSS, I define the location where I am going to save my files. Then I import my Spline macro.

* Example of splines for generalized linear models 
* and multiple variables.

DATASET CLOSE ALL.
OUTPUT CLOSE ALL.

* Spline Macro.
FILE HANDLE macroLoc /name = "C:\Users\andre\OneDrive\Desktop\Spline_SPSS_Example".
INSERT FILE = "macroLoc\MACRO_RCS.sps".

Second, I create a set of synthetic data, in which I have a linear changepoint effect at x = 0.42. Then I generate observations according to a particular logistic regression model, with not only the non-linear X effects, but also two covariates Z1 (a binary variable) and Z2 (a continuous variable).

*****************************************************.
* Synthetic data.
SET SEED = 10.
INPUT PROGRAM.
LOOP Id = 1 to 10000.
END CASE.
END LOOP.
END file.
END INPUT PROGRAM.
DATASET NAME Sim.

COMPUTE X = RV.UNIFORM(0,1).
COMPUTE #Change = 0.42.
DO IF X <= #Change.
  COMPUTE XDif = 0.
ELSE.
  COMPUTE XDif = X - #Change.
END IF.
COMPUTE Z1 = RV.BERNOULLI(0.5).
COMPUTE Z2 = RV.NORMAL(0,1).  

DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

*This is a linear changepoint at 0.42, other variables are additive.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE yprob = !INVLOGIT(ylogit).
COMPUTE Y = RV.BERNOULLI(yprob).
*These are variables you won't have in practice.
ADD FILES FILE =* /DROP ylogit yprob XDif.
FORMATS Id (F9.0) Y Z1 (F1.0) X Z2 (F3.2).
EXECUTE.
*****************************************************.

Creating Spline Basis and Estimating a Model

Now like I said, the correct knot location is at x = 0.42. Here I generate a set of regular knots over the x input (which varies from 0 to 1), at not the exact true value for the knot.

!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now if you look at your dataset, there are 3 new splinex? variables. (For restricted cubic splines, you get # of knots - 2 new variables, so with 5 knots you get 3 new variables here.)

We are then going to use those new variables in a logistic regression model. We are also going to save our model results to an xml file. This allows us to use that model to score a different dataset for predictions.

GENLIN Y (REFERENCE=0) WITH X splinex1 splinex2 splinex3 Z1 Z2 
  /MODEL X splinex1 splinex2 splinex3 Z1 Z2 
      INTERCEPT=YES DISTRIBUTION=BINOMIAL LINK=LOGIT
  /OUTFILE MODEL='macroLoc\LogitModel.xml'. 

And if we look at the coefficients, you will see that the coefficients look offhand very close to the true coefficients, minus splinex2 and splinex3. But we will show in a second that those effects should be of no real concern.

Generating New Data and Plotting Predictions

So you should do this in general with generalized linear models and/or non-linear effects, but to interpret spline effects you can’t really look at the coefficients and know what those mean. You need to make plots to understand what the non-linear effect looks like.

So here in SPSS, I create a new dataset, that has a set of regularly sampled locations along X, and then set the covariates Z1=1 and Z2=0. These set values you may choose to be at some average, such as mean, median, or mode depending on the type of covariate. So here since Z1 can only take on values of 0 and 1, it probably doesn’t make sense to choose 0.5 as the set value. Then I recreate my spline basis functions using the exact sample macro call I did earlier.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  COMPUTE X = #xloc/300.
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
COMPUTE Z1 = 1.
COMPUTE Z2 = 0.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

Now in SPSS, we score this dataset using our prior model xml file we saved. Here this generates the predicted probability from our logistic model.

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

And to illustrate how close our model is, I generate what the true predicted probability should be based on our simulated data.

*Lets also do a line for the true effect to show how well it fits.
COMPUTE #change = 0.42.
DO IF X <= #change.
  COMPUTE xdif = 0.
ELSE.
  COMPUTE xdif = (X - #change).
END IF.
EXECUTE.
COMPUTE ylogit = 1.1 + -4.3*x + 2.4*xdif + -0.4*Z1 + 0.2*Z2.
COMPUTE TruePr = !INVLOGIT(ylogit).
FORMATS TruePr PredPr X (F2.1).
EXECUTE.

And now we can put these all into one graph.

DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr TruePr
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Prob"))
  SCALE: cat(aesthetic(aesthetic.shape), map(("PredPr",shape.solid),("TruePr",shape.dash)))
  ELEMENT: line(position(X*PredPr), shape("PredPr"))
  ELEMENT: line(position(X*TruePr), shape("TruePr")) 
END GPL.

So you can see that even though I did not choose the correct knot location, my predictions are nearly spot on with what the true probability should be.

Generating Predictions Over Varying Inputs

So in practice you can do more complicated models with these splines, such as allowing them to vary over different categories (e.g. interactions with other covariates). Or you may simply want to generate predicted plots such as above, but have a varying set of inputs. Here is an example of doing that; for Z1 we only have two options, but for Z2, since it is a continuous covariate we sample it at values of -2, -1, 0, 1, 2, and generate lines for each of those predictions.

*****************************************************.
* Can do the same thing, but vary Z1/Z2.

DATASET ACTIVATE Sim.
DATASET CLOSE Fixed.

INPUT PROGRAM.
LOOP #xloc = 0 TO 300.
  LOOP #z1 = 0 TO 1.
    LOOP #z2 = -2 TO 2.
      COMPUTE X = #xloc/300.
      COMPUTE Z1 = #z1.
      COMPUTE Z2 = #z2.
      END CASE.
    END LOOP.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Fixed.
EXECUTE.
DATASET ACTIVATE Fixed.

*Redoing spline variables.
!rcs x = X loc = [0.1 0.3 0.5 0.7 0.9].

MODEL HANDLE NAME=LogitModel FILE='macroLoc\LogitModel.xml'. 
COMPUTE PredPr = APPLYMODEL(LogitModel, 'PROBABILITY', 1).
EXECUTE.
MODEL CLOSE NAME=LogitModel.

FORMATS Z1 Z2 (F2.0) PredPr X (F2.1).
VALUE LABELS Z1
  0 'Z1 = 0'
  1 'Z1 = 1'.
EXECUTE.

*Now creating a graph of the predicted probabilities over various combos.
*Of input variables.
DATASET ACTIVATE Fixed.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X PredPr Z1 Z2
  /FRAME INNER=YES
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: PredPr=col(source(s), name("PredPr"))
  DATA: TruePr=col(source(s), name("TruePr"))
  DATA: Z1=col(source(s), name("Z1"), unit.category())
  DATA: Z2=col(source(s), name("Z2"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Predicted Probability"))
  GUIDE: axis(dim(3), opposite())
  GUIDE: legend(aesthetic(aesthetic.color), label("Z2"))
  SCALE: cat(aesthetic(aesthetic.color), map(("-2",color."8c510a"),("-1",color."d8b365"),
               ("0",color."f6e8c3"), ("1",color."80cdc1"), ("2",color."018571")))
  ELEMENT: line(position(X*PredPr*Z1), color(Z2))
END GPL.
*****************************************************.

So between all of these covariates, the form of the line does not change much (as intended, I simulated the data according to an additive model).

If you are interested in drawing more lines for Z2, you may want to use a continuous color scale instead of a categorical one (see here for a similar example).

A failed attempt at optimal search paths

Recently saw Kim Rossmo have a paper that describes a Bayesian approach to prioritizing areas for a search for missing persons. So he illustrates an approach to give a probability surface, but that still leaves implicit how individuals are to traverse over that probability space in the search itself.

For an example of where there can be potential ambiguity even with the probability surface, in the surface below we have three hot spots. So if we have four people to search this area, and they can only search a finite connected area (so no hop-scotching around), should we have them split between each of the hot spots, or should they cover one of the hot spots in more detail. (It is hard to tell in my graph, but the hot spot in the central western part of the graph has a higher hump, but is steeper, so top right has more mass but is more spread out.)

I’ve actually failed to be able to generate a decent algorithm to do this though. It is similar to this prior work of mine, but I actually discovered some errors in that work in trying to apply it to this situation (can have disconnected subtours that are complicated paths). So attempted several other variants and have yet to come up with a decent solution.

I tried out a greedy algorithm to solve the problem (pick the highest hump, march like an ant around until you have covered your max tour, and then start again). But this was not good either. But it generated some interesting accidental art. So here is my greedy approach to pick four tours in which they can traverse 300 grid cells, and here it says better to ignore the bottom hotspot and spread around your effort in the other two areas:

I know this is pretty sup-optimal though, as you can continue to generate more tours through this approach and eventually find better ones.

This is going to bug me forever now, but posting a blog to move on. So if you know of a solution please fill me in!

Lit reviews are (almost) functionally worthless

The other day I got an email from ACJS about the most downloaded articles of the year for each of their journals. For The Journal of Criminal Justice Education it was a slightly older piece, How to write a literature review in 2012 by Andrew Denney & Richard Tewksbury, DT from here on. As you can guess by the title of my blog post, it is not my most favorite subject. I think it is actually an impossible task to give advice about how to write a literature review. The reason for this is that we have no objective standards by which to judge a literature review – whether one is good or bad is almost wholly subject to the discretion of the reader.

The DT article I don’t think per se gives bad advice. Use an outline? Golly I suggest students do that too! Be comprehensive in your lit review about covering all relevant work? Well who can argue with that!

I think an important distinction to make in the advice DT give is the distinction between functional actions and symbolic actions. Functional in this context means an action that makes the article better accomplish some specific function. So for example, if I say you should translate complicated regression models to more intuitive marginal effects to make your results more interpretable for readers, that has a clear function (improved readability).

Symbolic actions are those that are merely intended to act as a signal to the reader. So if the advice is along the lines of, you should do this to pass peer review, that is on its face symbolic. DT’s article is nearly 100% about taking symbolic actions to make peer reviewers happy. Most of the advice doesn’t actually improve the content of the manuscript (or in the most charitable interpretation how it improves the manuscript is at best implicit). In DT’s section Why is it important this focus on symbolic actions becomes pretty clear. Here is the first paragraph of that section:

Literature reviews are important for a number of reasons. Primarily, literature reviews force a writer to educate him/herself on as much information as possible pertaining to the topic chosen. This will both assist in the learning process, and it will also help make the writing as strong as possible by knowing what has/has not been both studied and established as knowledge in prior research. Second, literature reviews demonstrate to readers that the author has a firm understanding of the topic. This provides credibility to the author and integrity to the work’s overall argument. And, by reviewing and reporting on all prior literature, weaknesses and shortcomings of prior literature will become more apparent. This will not only assist in finding or arguing for the need for a particular research question to explore, but will also help in better forming the argument for why further research is needed. In this way, the literature review of a research report “foreshadows the researcher’s own study” (Berg, 2009, p. 388).

So the first argument, a lit review forces a writer to educate themselves, may offhand seem like a functional objective. It doesn’t make sense though, as lit. reviews are almost always written ex post research project. The point of writing a paper is not to educate yourself, but educate other people on your research findings. The symbolic motivation for this viewpoint becomes clear in DT’s second point, you need to demonstrate credibility to your readers. In terms of integrity if the advice in DT was ‘consider creating a pre-analysis plan’ or ‘release data and code files to replicate your results’ that would be functional advice. But no, it is important to wordsmith how smart you are so reviewers perceive your work as more credible.

Then the last point in the paragraph, articulating the need for a particular piece of research, is again a symbolic action in DT’s essay. You are arguing to peer reviewers about the need for a particular research question. I understand the spirit of this, but think back to what function does this serve? It is merely a signal to reviewers to say, given finite space in a journal, please publish my paper over some other paper, because my topic is more important.

You actually don’t need a literature review to demonstrate a topic is important and/or needed – you can typically articulate that in a sentence or two. For a paper I reviewed not too long ago on crime reductions resulting from CCTV installations in a European city, I was struck by another reviewers critique saying that the authors “never really motivate the study relative to the literature”. I don’t know about you, but the importance of that study seems pretty obvious to me. But yeah sure, go ahead and pad that citation list with a bunch of other studies looking at the same thing to make some peer reviewers happy. God forbid you simply cite a meta-analysis on prior CCTV studies and move onto better things.

What should a lit review accomplish?

So again I don’t think DT give bad advice – mostly vapid but not obviously bad. DT focus on symbolic actions in lit reviews because as lit reviews are currently performed in CJ/Crim journals, they are almost 100% symbolic. They serve almost no functional purpose other than as a signal to reviewers that you are part of the club. So DT give about the best advice possible navigating a series of arbitrary critiques with no clear standard.

As an example for this position that lit reviews accomplish practically nothing, conduct this personal experiment. The next peer review article you pick up, do not read the literature review section. Only read the abstract, and then the results and conclusion. Without having read the literature review, does this change the validity of a papers findings? It for the most part does not. People get feelings hurt by not being cited (including myself), but even if someone fails to cite some of my work that is related it pretty much never impacts the validity of that persons findings.

So DT give advice about how peer review works now. No doubt those symbolic actions are important to getting your paper published, even if they do not improve the actual quality of the manuscript in any clear way. I rather address the question about what I think a lit review should look like – not what you should do to placate three random people and the editor. So again I think the best way to think about this is via articulating specific functions a lit review accomplishes in terms of improving the manuscript.

Broadening the scope abit to consider the necessity of citations, the majority of citations in articles are perfunctory, but I don’t think people should plagiarize. So when you pull a very specific piece of information from a source, I think it is important to cite that work. Say you are using a survey instrument developed by someone else, citing the work that establishes that instruments reliability and validity, as well as the original population those measures were established on, is certainly useful information to the reader. Sources of information/measures, a recent piece saying the properties of your statistical model are I think other good examples of things to cite in your work. Unfortunately I cannot give a bright line here, I don’t cite Gauss every time I use the normal distribution. But if I am using a code library someone else developed that is important, inasmuch as that if someone wants to do a similar project they could use the same library.

In terms of discussing relevant results in prior studies, again the issue is the boundary of what is relevant is very difficult to articulate. If there is a relevant meta-analysis on a topic, it seems sufficient to me to simply state the results of the meta-analysis. Why do I think that is important though? It helps inform your priors about the current study. So if you say a meta-analysis effect size is X, and the current study has an effect size much larger, it may give you pause. It is also relevant if you are generalizing from the results of the study, it is just another piece of evidence in addition to the meta-analysis, not an island all by itself.

I am not saying discussing prior specific results are not needed entirely, but they do not need to be extensive. So if studies Z, Y, X are similar to yours but all had null results, and you think it was because the sample sizes were too small, that is relevant and useful information. (Again it changes your priors.) But it does not need to be belabored on in detail. The current standard of articulating different theoretical aspects ad-nauseum in Crim/CJ journals does not improve the quality of manuscripts. If you do a hot spots policing experiment, you do not need to review all the different minutia of general deterrence theory. Simply saying this experiment is likely to only accomplish general deterrence, not specific deterrence, seems sufficient to me personally.

When you propose a book you need to say ‘here are some relevant examples’ – I think the same idea would be sufficient for a lit review. OK here is my study, here are a few additional studies I think the reader may be interested in that are related. This accomplishes what contemporary lit reviews do in a much more efficient manner – citing more articles makes it much more difficult to pull out the really relevant related work. So admit this does not improve the quality of the current manuscript in a specific way, but helps the reader identify other sources of interest. (I as a reader typically go through the citation list and note a few articles I am interested in, this helps me accomplish that task much quicker.)

I’ve already sprinkled a few additional pieces of advice in this blog post (marginal effect estimates, pre-analysis plans, sharing data code), although you may say they don’t belong in the lit review. Whatever, those are things that actually improve either the content of the manuscript or the actual integrity of the research, not some spray paint on your flowers.

Relevant Other Work

Changepoints in CCTV Effects

So I am a big fan of using splines in regression equations to model non-linear effects. But a limitation of these is that you need to upfront say how many knots you want, as well as where the knots are. So I have explored a bit on fitting models that can identify the changepoints themselves. It was a tricky road, I tried building some in deep learning using pytorch, then tried variational auto-encoders in pyro, then pystan (marginalizing the changepoint out), and then pymc3 (using different samplers). All of my attempts failed! But when I used the R mcp library (Lindeløv, 2020), it was able to find my changepoint using simulated data. (It uses JAGS under the hood, no idea why JAGS behaved better than my other attempts.)

Usecase: Dropoff effect of CCTV on clearance rates

So in spatial criminology, a popular hypothesis is estimating distance decay effects. Ratcliffe (2012) was the first example of using a changepoint regression model to do this, showing a changepoint in the effect of bars on the spatial density of crime nearby. This has been replicated in Xu & Griffiths (2017), and in my work using machine learning and partial dependence plots I show similar changepoint patterns as well (Wheeler & Steenbeek, 2020).

One example use case though I want to mention is not in terms of estimating the spatial density of crime, but with the characteristics of the crime events themselves. Sometimes people I think mistakenly think since I have spatial data, I need to aggregate it to some areal unit, and then do analysis of that areal unit data. That approach is not per-se wrong, but is sometimes a step removed from what you want, and can result in some tricky inferences.

Take for example a recent paper looking at clearances and using RTM by Kennedy et al. (2020). What they do is spatially aggregate homicides cleared and homicides not cleared, and run RTM on each. You might be tempted to interpret if a factor is selected for both models that it does not impact clearances, but it also depends on the size of the effect. So for example, in Brooklyn for drug markets they report a rate ratio of 3.1 and 2.4 (both at the same spatial distance). To translate this into a clearance rate, you need to add the two density estimates for all cases, and then take the cleared cases as the numerator.

# Example R code
clear <- exp(-0.1 + log(3.1))
nonclear <- exp(-0.1 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.5636364

Here I am treating -0.1 as the intercept. So here this is lower, but close to the overall clearance in Brooklyn, 58%. This 56% will be the estimate iff the intercept for each equation is the same, if they are not though it could change the clearance rate estimate either way. Since the Kennedy paper did not report this, we cannot know. So for instance, if we change the intercept estimates so clearances are higher and non-clearances are lower, we get an estimate that drug markets increase clearances slightly, not decrease them:

clear <- exp(-0.05 + log(3.1))
nonclear <- exp(-0.2 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.6001124

In this example it probably won’t push them too far either way, but takes a bit of work going from the aggregate data analysis to the estimate we want, how those spatial risk factors impact the clearance rate. There is an easier way though – just incorporate your spatial features, such as the distance the nearest crime generator factor, and estimate a model on the micro level incident data. This is what Kennedy et al. (2020) do later in the paper when incorporating the RTM predictions – I just think they should have done the RTM machinery directly on this problem, instead of the two-step approach.

Examples of my work I have done this approach in the past (incorporating spatial data into the micro level incidents) is with fatalities from gun shot wounds (Circo & Wheeler, 2020). We actually investigated non-linear effects though of distance/drive-time, and did not find evidence of that. Going back to the crime clearance example though, another pre-print I examine the effects of CCTV cameras and find a diminishing effect of case clearances given the distance to the camera (Jung & Wheeler, 2019).

So here we use a pre-post design to show there are some selection effects, and we do further analysis to show this camera bump in clearances is only limited to thefts. But we set the splines at 500, 1000, and 1500 feet pre-emptively for the analysis. A reviewer critique of this is that those three locations are arbitrary (which is correct), so here I will see if I do a changepoint model that allows us to find the knot locations if it will show the same ones.

The idea behind this analysis is that CCTV are often used in investigations. Yeondae is an officer in Korea, same as here in the states first things detectives do is to go and grab CCTV footage. Analysis of cameras are often aggregated to their viewsheds, but I think estimating distance decay effects make as much sense. So events closer to the cameras presumably will provide more clear evidence than events at the border of the viewshed. A second point is that even if the event takes place off-camera, there may be evidence cross by the camera viewshed. Detectives will often try to follow individuals across multiple cameras. So both of those factors suggest a distance decay effect both within a cameras viewshed and a decaying effect even outside of the viewshed. (In addition to this, geo coordinates of crime locations are not perfectly accurate measures either, so that could cause effects outside of the viewshed as well.)

Here I am just limiting the data to the post camera data within 3000 feet for thefts, which still is over 26,000 observations. I’ve posted the data/code to follow along here.

Analysis using mcp in R

Again given my hardship in coding this up myself in python, I created a simulated data example and checked the results using mcp (which you can check in my code). Since mcp recovered my simulated changepoint, (and my python attempts did not), going to go ahead with the mcp library! First, we will import my clearance data and get rid of a few missing cases.

#################
library(mcp)
library(ggplot2)
set.seed(10)
#can see I planned on doing this in pytorch at first!
setwd('D:\\Dropbox\\Dropbox\\Documents\\BLOG\\changepoint_pytorch\\Analysis')
theft_clear <- read.csv('PostTheft_CCTV.csv')
theft_clear <- theft_clear[complete.cases(theft_clear),]
#################

So first for a reference, if I assume there is a linear changepoint at 1000 feet, here are what my results look like. Note here that this is not aggregated data to spatial locations, each row in this dataset is a theft offense, whether it was cleared, and the distance to the nearest CCTV camera.

#################
#What are the coefficients if assume a changepoint of 1000 feet
theft_clear$x_dif <- (theft_clear$CAM.DIST - 1000)*(theft_clear$CAM.DIST > 1000)
theft_mod <- glm(formula = 'STATUSi ~ CAM.DIST + x_dif', family = "binomial", data = theft_clear)
summary(theft_mod) #This gives an estimate of 
#################

And here you can visualize the results alittle easier than trying to back out probabilities for the regression equation:

#################
pred_mod <- predict(theft_mod,type='response')
plot(theft_clear$CAM.DIST,pred_mod, main="Changepoint at 1000 ft",
  xlab="Distance from Camera (ft)", ylab="Probability Clearance")
#################

So this shows clearances nearby cameras in Dallas are around 15%, and they trail off to around 9% at 1000 feet. After that they continue to tail off, but are nearly flat. But again that is assuming a change point at 1000 feet. But the mcp package lets us actual estimate the changepoint itself using Bayesian regression. Here is the set up that is equivalent to my formulation earlier, in that the changepoint cannot be discontinuous.

#################
theft_clear$x <- theft_clear$CAM.DIST 
model = list(
  STATUSi | trials(const) ~ 1 + x,
  ~ 0 + x  #joined changing rate
)

fit = mcp(model, data = theft_clear, family = binomial(), iter = 3000, adapt = 500)
#################

And then if you are following along you can go ahead and take a nap (maybe took 2 hours on my machine?), and when we get back summary(fit) gives us:

So we have very similar coefficients to the manual changepoint model earlier, but the changepoint is around 1600 feet, not 1000. (Although note these are Bayesian credible intervals, not frequentist confidence intervals.) And now to make a nice plot of the fitted model.

#Fitted values for new data
newdat <- data.frame(x = (0:300)*10)
newdat$const <- 1
newdat$CAM.DIST <- newdat$x
res <- fitted(fit, newdata = newdat)

p_pred <- ggplot(data=res) + 
  geom_line(size=1.2, color='black', aes(x = x, y = fitted)) + 
  geom_ribbon(alpha=0.5, fill='black', aes(x = x, ymin=Q2.5 , ymax=Q97.5)) + 
  scale_x_continuous(name="Feet from Camera",breaks=seq(0,3000,500),minor_breaks=NULL) + 
  scale_y_continuous(name="P(Clearance)",breaks=seq(0.06,0.16,0.02),minor_breaks=NULL) +
  theme_bw() + theme(panel.grid.major = element_line(colour = 'grey', linetype = 'dashed', size=0.1)) + 
  theme(text = element_text(size=20))

p_pred

So you can see that here it is a nearly linear drop off until 1600 feet, and then starts to climb back up. The climb up I think is likely due to selection effects, but we can’t 100% rule out displacement effects. Displacement effects could occur with cameras if detectives prioritize events around cameras and de-prioritize other events not nearby cameras. Skeptical that applies to thefts in Dallas though, as they very rarely will be assigned a detective at all.

Wrap Up

So this ended up taking me for a few different turns. One of the things I wanted to be able to test multiple changepoints, maybe if I can ever get pymc3 to give me a reasonable fit, this example is a good illustration. That should also maybe say if you should have no changepoint as well. I think maybe it is much harder to fit those models with binomial data though than with continuous (maybe good for another blog post as well, did simulations at first with 1000 observations and that was a bad idea).

One thing that would be good for evaluating whether change points are reasonable are out of sample predictive comparisons. So say estimate a no changepoint model, a linear changepoint model, and then a model with fixed spline locations. Then see which of those better fits the out of sample data. But since this is a blog post, will leave it as is. But this is a simple illustration to extend prior spatial analysis of changepoints in distance decay effects to one example – crime clearances and CCTV cameras – that I think makes alot of sense.

References

Confidence intervals around proportions

So you probably learned about confidence intervals around means in your introductory statistics class. For a refresher, a confidence interval covers a particular statistic at a pre-specified rate. So if I generate 100 90% intervals around a mean, I expect that those confidence intervals would cover the true underlying mean around 90 times out of those 100. So it is a statement about the procedure overall – not any individual test.

This repeated coverage property is often not exactly what we want in statistics. But, I often find examining confidence intervals around samples to be an informative way to quantify uncertainty in estimates. For example, I have a machine learning model serving up predictions to a subsequent auditing process. I expect this to maintain a hit rate above 20%. The past week I only had a hit rate of 30/200 (15%), should I be worried? Probably not, a 95% confidence interval around that proportion is 10% to 21%.

Proportions come up so often that intro stats courses should probably talk more extensively about generating confidence intervals around them. There are many different confidence intervals for proportions, Wikipedia lists 7 different options!

I prefer where possible to use the Clopper-Pearson intervals by default. I will show an examples of generating Clopper-Pearson intervals in Excel and Python. But, another situation I have come across is I want to do these intervals entirely in SQL. For that situation, I will show how to use Agresti–Coull intervals.

Excel Clopper-Pearson

In Excel, if the A column contains the numerator, the B column contains the denominator, and if G1 has the alpha level, this brutish formula gets you the lower bound of your confidence interval;

=IF(A2=0, 0, BETA.INV($G$1/2, A2, B2-A2+1))

A here is your upper bound;

=IF(A2=B2, 1, BETA.INV(1-$G$1/2, A2+1, B2-A2))

And here is a screenshot of the filled in results:

Note for my criminology friends, you can use this for very extreme proportions as well. So say you had a homicide rate of 10 per 100,000, where the observed sample was 30 homicides in a city of 300,000. You can generate a binomial confidence interval around the proportion and then translate back to the rate per 100,000. So in that scenario, it results in a 95% confidence interval of a homicide rate of 6.7 to 14.3.

This is actually the reason I like defaulting to Clopper-Pearson. The other approximations can fail very badly for extreme tail events like this.

Python Clopper-Pearson

Here is a simple function in python to return the Clopper-Pearson intervals. This works for vectorized inputs as well (e.g. numpy arrays or pandas series).

import numpy as np
from scipy.stats import beta

def binom_int(num,den, confint=0.95):
    quant = (1 - confint)/ 2.
    low = beta.ppf(quant, num, den - num + 1)
    high = beta.ppf(1 - quant, num + 1, den - num)
    return (np.nan_to_num(low), np.where(np.isnan(high), 1, high))

And here is an example use:

hits = np.array([0, 1, 2, 3, 97, 98, 99, 100])
tries = np.array([100]*len(hits))
lowCI, highCI = binom_int(hits, tries)

Check out my prior blog post on making smoothed scatterplots on how to plot those proportion spikes in matplotlib.

SQL Agresti–Coull

So as I mentioned previously, I prefer the Clopper-Pearson intervals. This however relies on the availability of a function for the inverse beta distribution. One common situation is I just have all my tables in SQL, and I want to make a dashboard connected to a view of my tables. So the proportion of some event broken downs by days/weeks/months etc.

In that case exporting the data to python and re-uploading to the database can be a bit of a hassle, whereas creating a view is much less work. So here is an example query to calculate the proportion intervals entirely in SQL. So the initial table is a micro level table of events with 0/1 for a particular group. (This screenshot is for Access, but this should work in various databases.)

And then it is a groupby to get the original numerator, denominator, and proportion. Then a few rows calculating the adjusted proportion (add 2 to the numerator and 2*2 to the denominator), then finally this can still produce lower than 0 and higher than 1 intervals, so I cap those off.

/* This is for Access, for others may want to use SQRT() instead of SQR()
   Also may want to use CASE WHEN instead of IIF */
SELECT
   GroupID,
   SUM(Outcome) AS Num,
   COUNT(Outcome) AS Den,
   Num/Den AS Prop,
   Num + 2 AS nadj,
   Den + 2*2 as dadj,
   nadj/dadj as padj,
   2*SQR((padj/nadj)*(1 - padj)) AS zadj,
   IIF( padj < zadj, 0, padj - zadj) AS LowCI,
   IIF( (1 - padj) < zadj, 1, padj + zadj) AS HighCI
FROM ExampleData
GROUP BY GroupID;

This produces a 95% confidence interval for the final two columns. If you wanted to generate say a 99% confidence interval, you would replace the 2’s in the above table with 2.6. (In R you can do qnorm(1 - a/2), where a is 1 - confidence_level, to figure out this constant.)

What you shouldn’t use these intervals for

While I believe many applications of dashboards are well suited to including confidence intervals, confidence intervals (like p-values) are apt to be misinterpreted. One common one is that for a single 95% confidence interval, that does not mean the interval covers the true estimate with a 95% probability. This is an inference for an individual sample that is not possible in frequentist statistics – that summary would be akin to a posterior credible interval in Bayesian statistics. Confidence intervals are about the procedure, if we do this procedure over and over again, in the long run it will cover the true statistic (which we do not observe for any individual sample), according to the level we set.

Another common mistake with confidence intervals is when comparing two different intervals, them overlapping is sometimes interpreted as no difference. But this is a very conservative test (e.g. will fail to reject the null of no differences too often).

So say we were monitoring a process over time, and in October the process was 20% (40/200) and in November it was 28% (168/600). October’s confidence interval is 15% to 26%, and November’s confidence interval is 24% to 32%. So since those intervals overlap, we should conclude there are no differences correct? Not exactly, if we do a direct test for the differences in proportions (akin to a t-test of mean differences), we get a confidence interval of the difference as -14% to -1% (in R prop.test(c(40,168), c(200,600))). So in that direct hypothesis test, we would conclude October’s percent is lower than Novembers percent.

Geoff Cumming suggests that when going from individual confidence intervals to comparisons between groups, one confidence interval needs to cover the point estimate for the other group to conclude the two groups are different.

But that being said, I believe many dashboards would be improved if incorporating such confidence intervals. So although they may not always provide the test of interest, they are a good way to prevent yourself from over-interpreting noisy trends in smaller samples. In the case of comparing two intervals, for most situations I deal with, being conservative in saying this process is not showing differences is a better approach than worrying about minor fluctuations (although just depends on the use case whether that default behavior makes sense.)

So please, when reporting proportions with small samples, provide a confidence interval around those proportions!

Outliers in Distributions

If you google ‘outlier’, all of the results that come up are in terms of individual observation outliers. So if you have a set of transaction data that is 10, 20, 30, 8000, the singlet observation 8000 is an outlier. But for many situations with transaction data, you don’t want to examine individual outlier incidents, but look for systematic patterns. For example, if I am looking at healthcare insurance claims for my work, a single claim that is $100,000 is actually not that rare. But if we have a hospital that has mostly $100,000 claims for a specific treatment, whereas another with similar cases has a range of $50,000 to $100,000, that may signal there is some funny business going on.

There is no singular way to examine outliers in distribution. A plain old t-test of mean differences may make sense for some situations. But a generally more useful way IMO to think about the problem is to examine the distribution of the outcome in CDF space, as opposed to looking at particular moments of the distribution. A t-test basically only looks at the differences in means for the distributions, whereas examining the CDF we are looking for weird patterns at any point in the distribution.

Here is an example of comparing the cost of hospital stays (per length of stay), for a hospital compared to all others from the same datasource (details on the data in a sec). The way to read this graph is that at 10^3 (so $1000 per day claims) for facility 1458, we have around 20% of the claims data are below this value. For the rest of the hospital data, a larger proportion of claims are under a thousand dollars, more like 25%. Since the red line is always below the black line, it also means that the claims at this hospital are pretty much always larger than the claims at all the other hospitals.

For this example analysis, I am using data from New York State health insurance claims data (SPARC). I have posted python code to replicate here (note if you cannot access dropbox links, feel free to email and I will forward).

Here I am specifically analyzing medical, in-patient insurance claims (I dropped surgical claims) for around 300+ hospitals. There are quite a few claims in this data, over 2 million, and the majority of hospitals have plenty of claims to examine (so no hospitals with only 10 claims). I also specifically examine costs per length of stay. Initially I just examined costs, but will get to why I changed to the normalized version towards the end of the post.

Analysis of CDF Outliers

So first what I did was attempt to do a leave-one-out type stat test using the Kolmogorov-Smirnov test. This is a test that looks at the maximum vertical difference between the CDFs I showed earlier. I should have known better though. Given this large of sample size, even with multiple comparison adjustments for false discovery rate, every hospital was considered an outlier. This is sort of the curse of null hypothesis significance testing, it is either underpowered, so you get null results when things should really be flagged, or with large samples everything is flagged.

So what I did first was make a graph of all the different CDFs for each individual hospital. You can see from this plot we have a mass of the distribution that looks very similar in shape, but is shifted left or right. (Hospitals can bill different values, i.e. casemix, so can have the same types of events but have different bills, so that is normal.) But then we have a few outliers really stick out.

To characterize the central mass in this image, what I did was calculate each empirical CDF for each hospital (over 300 in this sample). Then I estimated the CDF for each hospital at a sample of points logspace distributed between $100 to $100,000. Then I took the 90% distribution between the ECDF values. This is easier to show than to say, so in the below pic the grey area is the 90% region for the CDFs. Then you can do stats to see how hospitals may fall outside that band.

So here 1320 is looking good until around 60% of the distribution, and then it is shifted right. There is a kink in the CDF as well, so this suggests really a set of different types of claims, and in that second group it is the outlier. 1320 was the hospital that had the most sample points outside of my grey coverage area, but you could also do outliers in terms of the distance between those two lines (again like a KS test stat), or in the area between those two lines (that is like a version of the Wasserstein distance only considering above/below moves). So here is the hospital that has the largest distance below the band (above the band signals that a hospital has lower claims on average):

Flat lines horizontally signal an absence of data, whereas vertical lines signal a set of claims with the exact same bill. So here we have a set of claims around $1000 per day that look normal, then an abnormal absence of data from $1,000 to $10,000. Then a large spike of claims that end up being around $45k per day.

So this is looking at the distribution relative to other hospitals, but a few examples I am familiar with look for these flat/vertical spikes in the CDF to identify fraud. Mike Maltz has an example of identifying collusion in bids. In another, Chris Stucchio identifies spikes in transaction data signaling potential fraud. Here I am just doing a test relative to other data to identify weird curves, not just flat lines though.

One limitation of this analysis I have conducted here is that it does not take into account the nature of the claims data. So say you had a hospital that specializes in cancer treatment, it may be totally normal for them to have claims that are higher value overall than a more typical hospital that spreads claims among a wider variety of types of visits/treatments. Initially I analyzed just the cost data, and it identified a few big outliers that ended up being hospice/nursing homes. So they had really high dollar value claims, but also really long stays. So when analyzing the claim per length of say, they were totally normal in that central mass.

So ultimately there could be other characteristics in the types of claims hospitals submit that could explain the weird CDF. One way to take that into account is to do a conditional model for the claims, and then do the ECDF tests on those conditional models. One way may be to look at the residuals for each individual hospital, another would be to draw a matched comparison sample. (Greg Ridgeway did this when examining police behavior in the NYPD.)

That would be like making a single comparison line (like my initial black/red line graph). So controlling the false discovery after that will be tough with larger samples (again the typical KS test, even with a matched sample, will likely always reject). So wondering if there is another machine learning way to identify outliers in CDF space, like a mashup of isolation forests and conditional density forests. Essentially I want to fit a model to draw those grey CDF bands, instead of relying on my sample of hospitals to draw the grey band in those latter plots.