A changepoint logistic model in pystan

So the other day I showed how to use the mcp library in R to estimate a changepoint model with an unknown changepoint location. I was able to get a similar model to work in pystan, although it ends up being slower in practice than the mcp library (which uses JAGS under the hood). It also limits the changepoints to a specific grid of values. So offhand there isn’t a specific reason to prefer this approach to the R mcp library, but I post it here to show my work. I also illustrate that, even with 1,000 simulated observations, this particular model only weakly identifies the changepoint location.

To be clear what this model is: instead of the many time series examples floating around about changepoints (like the one in the Stan guide), we have a model with a particular continuous independent variable x, and we are predicting the probability of something based on that x variable. It is not that different, but in many of those time series examples the universe of potential changepoints to check is obvious, only the observed time series locations. Here we have a continuous input (the distance a crime event is from a CCTV camera), so we can only check a finite number of locations. It ends up being closer in spirit to this recent post by Keith Goldfield.

So in some quick and dirty text math, here c is the changepoint location and l is the logit function:

l(Prob[y]) = intercept + b1*x; if x <= c
l(Prob[y]) = intercept + b1*x + b2*(x - c); if x > c

This model can be expanded however you want – such as other covariates that do not change with the changepoint. But for this simple simulation I am just looking at the one running variable x and the binary outcome y.

Python Code

So first, I load up the libraries I will be using, then I simulate some data. Here the changepoint is located at 0.42 for the x variable, and in the ylogit line you can see the underlying logistic regression equation.

#################################
# Libraries I am using
import pystan
import numpy as np
import pandas as pd
import statsmodels.api as sm
#################################

#################################
# Creating simulated data
np.random.seed(10)
total_cases = 1000 #30000
x = np.random.uniform(size=total_cases) #[total_cases,1]
change = 0.42
xdif = (x - change)*(x > change)
ylogit = 1.1 + -4.3*x + 2.4*xdif
yprob = 1/(1 + np.exp(-ylogit))
ybin = np.random.binomial(1,yprob)
#################################

When testing out these models, one mistake I made was thinking offhand that 1,000 observations should be plenty. (It is easier to run more draws with a smaller dataset.) When I had smaller effect sizes, the logistic coefficients could be pretty badly biased. So as a check I started by estimating the logistic model inputting the correct changepoint location; those estimates are pretty much the best case scenario you could hope for in the subsequent MCMC models. So here is an example fitting a logit model inputting the correct location for the changepoint.

#################################
#Statsmodel code to get
#The coefficient estimates 
#And standard errors for the sims
con = [1]*len(x)
xcomb = pd.DataFrame(zip(con,list(x),list(xdif)),columns=['const','x','xdif'])
log_reg = sm.Logit(ybin, xcomb).fit()
print(log_reg.summary()) 
#################################

So you can see that the frequentist standard errors around my coefficient estimates are pretty large even with 1,000 samples. So I shouldn’t expect my later MCMC model to have any smaller credible intervals than these.

So here is the Stan model. I am using pystan here, but of course it would be the same text file if you wanted to fit the model using R. (It just compiles C++ code under the hood.) The only real thing of note is that I show how to use the softmax function to estimate the actual mean location of the changepoint. Note that this mean summary only makes sense if you make your grid of changepoint locations regular and fairly fine. (So if you said a changepoint could be at 0.1, 0.36, and 0.87, taking a weighted mean of those three locations doesn’t make sense.)

#################################
#Stan model
changepoint_stan = """
data {
   int<lower=1> N;
   vector[N] x;
   int<lower=0,upper=1> y[N];
   int<lower=1> Samp_Points;
   vector[Samp_Points] change;
}
transformed data {
  real log_unif;
  log_unif = -log(Samp_Points);
}
parameters {
  real intercept;
  real b_x;
  real b_c;
}
transformed parameters {
  vector[Samp_Points] lp;
  real before;
  real after;
  lp = rep_vector(log_unif, Samp_Points);
  for (c in 1:Samp_Points){
    for (n in 1:N){
      before = intercept + b_x*x[n]; 
      after = before + b_c*(x[n] - change[c]);
      lp[c] = lp[c] + bernoulli_logit_lpmf(y[n] | x[n] < change[c] ? before : after );
    }  
  }
}
model {
  intercept ~ normal(0.0, 10.0);
  b_x ~ normal(0.0, 10.0);
  b_c ~ normal(0.0, 10.0);
  target += log_sum_exp(lp);
}
generated quantities {
  vector[Samp_Points] prob_point;
  real change_loc;
  prob_point = softmax(lp);
  change_loc = sum( prob_point .* change );
}
"""
#################################

And finally I show how to prepare the data for pystan (as a dictionary), compile the model, and then draw a ton of samples. I generate a regular grid of 0.01 intervals from 0.03 to 0.97 (you can’t have a changepoint outside of the x data locations, which I drew as a random uniform over 0,1). Note that the more typical default of 1,000 iterations tended not to converge; the effective number of samples is quite small with that few draws. So 5k to 10k samples in my experiments tended to converge. This is not real fast either, it took about 40 minutes on my machine (the Stan guesstimates for time were always pretty good ballpark figures).

#################################
# Prepping data and fitting the model

stan_dat = {'N': ybin.shape[0]}
stan_dat['change'] = np.linspace(0.03,0.97,95) #[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
stan_dat['Samp_Points'] = len(stan_dat['change'])
stan_dat['x'] = x
stan_dat['y'] = ybin


sm = pystan.StanModel(model_code=changepoint_stan)
#My examples needed more like 10,000 iterations
#effective sample size very low, took about 40 minutes on my machine
fit = sm.sampling(data=stan_dat, iter=5000, 
                  warmup=500, chains=4, verbose=True)
#Prints some results at the terminal!
print(fit.stansummary(pars=['change_loc','intercept','b_x','b_c']))
#################################

So you can see the results – the credible intervals for the intercept and regression coefficient before the change point are not bad, just slightly larger than the logistic model. The credible intervals for the changepoint location and the changepoint effect (b_c) are quite uncertain though. The changepoint location covers almost the whole interval I examined. It may be better to plot the individual probabilities, like Goldfield did in his post, as opposed to summarizing a mean location for the distribution (which is discrete in the end, based on the grid of locations you look at).
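
Since the generated quantities block already returns prob_point, a quick sketch of that type of plot (assuming pystan 2.x and matplotlib) is to extract those posterior draws, average them over the MCMC iterations, and plot them against the grid of candidate locations:

import matplotlib.pyplot as plt

#Posterior draws of the grid probabilities, draws x Samp_Points
prob_samps = fit.extract(pars=['prob_point'])['prob_point']
post_prob = prob_samps.mean(axis=0)

plt.plot(stan_dat['change'], post_prob)
plt.axvline(0.42, color='red', linestyle='dashed') #true changepoint in the simulation
plt.xlabel('Changepoint location')
plt.ylabel('Posterior probability')
plt.show()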

So that at least gives a partial warning that you need quite big data samples to effectively identify the changepoint location, at least for this Stan model as I have shown. I haven’t run it on my 26k actual data sample, as it will end up taking my computer around 30 hours to crunch out 10k draws with 4 chains. Next up I would rather see if I can get a similar model working in pyro, as I think the GPU on my personal machine will be faster than the C++ code here. (There are probably smarter ways to vectorize the Stan model as well.)

Confidence intervals around proportions

So you probably learned about confidence intervals around means in your introductory statistics class. For a refresher, a confidence interval covers a particular statistic at a pre-specified rate. So if I generate 100 90% intervals around a mean, I expect that those confidence intervals would cover the true underlying mean around 90 times out of those 100. So it is a statement about the procedure overall – not any individual test.

This repeated coverage property is often not exactly what we want in statistics. But, I often find examining confidence intervals around samples to be an informative way to quantify uncertainty in estimates. For example, I have a machine learning model serving up predictions to a subsequent auditing process. I expect this to maintain a hit rate above 20%. The past week I only had a hit rate of 30/200 (15%), should I be worried? Probably not, a 95% confidence interval around that proportion is 10% to 21%.

Proportions come up so often that intro stats courses should probably talk more extensively about generating confidence intervals around them. There are many different confidence intervals for proportions, Wikipedia lists 7 different options!

I prefer where possible to use the Clopper-Pearson intervals by default. I will show examples of generating Clopper-Pearson intervals in Excel and Python. But another situation I have come across is wanting to do these intervals entirely in SQL. For that situation, I will show how to use Agresti–Coull intervals.

Excel Clopper-Pearson

In Excel, if the A column contains the numerator, the B column contains the denominator, and G1 has the alpha level, this brutish formula gets you the lower bound of your confidence interval:

=IF(A2=0, 0, BETA.INV($G$1/2, A2, B2-A2+1))

And here is your upper bound:

=IF(A2=B2, 1, BETA.INV(1-$G$1/2, A2+1, B2-A2))

And here is a screenshot of the filled in results:

Note for my criminology friends, you can use this for very extreme proportions as well. So say you had a homicide rate of 10 per 100,000, where the observed sample was 30 homicides in a city of 300,000. You can generate a binomial confidence interval around the proportion and then translate back to the rate per 100,000. So in that scenario, it results in a 95% confidence interval of a homicide rate of 6.7 to 14.3.

This is actually the reason I like defaulting to Clopper-Pearson. The other approximations can fail very badly for extreme tail events like this.

Python Clopper-Pearson

Here is a simple function in python to return the Clopper-Pearson intervals. This works for vectorized inputs as well (e.g. numpy arrays or pandas series).

import numpy as np
from scipy.stats import beta

def binom_int(num, den, confint=0.95):
    quant = (1 - confint)/ 2.
    low = beta.ppf(quant, num, den - num + 1)
    high = beta.ppf(1 - quant, num + 1, den - num)
    #beta.ppf returns nan when num=0 (lower) or num=den (upper), so fill with 0/1
    return (np.nan_to_num(low), np.where(np.isnan(high), 1, high))

And here is an example use:

hits = np.array([0, 1, 2, 3, 97, 98, 99, 100])
tries = np.array([100]*len(hits))
lowCI, highCI = binom_int(hits, tries)
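
And as a quick check against the earlier examples, both the auditing hit rate and the homicide rate translation line up with the figures quoted above:

#Check against the earlier examples
print( binom_int(30, 200) )      #roughly 10% to 21%, the auditing hit rate example
low, high = binom_int(30, 300000)
print( low*100000, high*100000 ) #roughly 6.7 to 14.3 homicides per 100,000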

Check out my prior blog post on making smoothed scatterplots on how to plot those proportion spikes in matplotlib.

SQL Agresti–Coull

So as I mentioned previously, I prefer the Clopper-Pearson intervals. This however relies on the availability of a function for the inverse beta distribution. One common situation is that I just have all my tables in SQL, and I want to make a dashboard connected to a view of my tables, e.g. the proportion of some event broken down by days/weeks/months etc.

In that case exporting the data to python and re-uploading to the database can be a bit of a hassle, whereas creating a view is much less work. So here is an example query to calculate the proportion intervals entirely in SQL. So the initial table is a micro level table of events with 0/1 for a particular group. (This screenshot is for Access, but this should work in various databases.)

And then it is a group by to get the original numerator, denominator, and proportion. Then a few extra columns calculate the adjusted proportion (add 2 to the numerator and 2*2 to the denominator). Finally, since this can still produce intervals lower than 0 or higher than 1, I cap those off.

/* This is for Access, for others may want to use SQRT() instead of SQR()
   Also may want to use CASE WHEN instead of IIF */
SELECT
   GroupID,
   SUM(Outcome) AS Num,
   COUNT(Outcome) AS Den,
   Num/Den AS Prop,
   Num + 2 AS nadj,
   Den + 2*2 as dadj,
   nadj/dadj as padj,
   2*SQR((padj/dadj)*(1 - padj)) AS zadj,
   IIF( padj < zadj, 0, padj - zadj) AS LowCI,
   IIF( (1 - padj) < zadj, 1, padj + zadj) AS HighCI
FROM ExampleData
GROUP BY GroupID;

This produces a 95% confidence interval for the final two columns. If you wanted to generate say a 99% confidence interval, you would replace the 2’s in the above query with 2.6. (In R you can do qnorm(1 - a/2), where a is 1 - confidence_level, to figure out this constant.)
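
For spot-checking the SQL view, here is a quick python sketch of the same Agresti–Coull (plus four) calculation:

import numpy as np

def agresti_coull(num, den, z=2):
    nadj = num + z**2/2 #adjusted numerator
    dadj = den + z**2   #adjusted denominator
    padj = nadj/dadj    #adjusted proportion
    zadj = z*np.sqrt(padj*(1 - padj)/dadj)
    return max(padj - zadj, 0), min(padj + zadj, 1)

print( agresti_coull(30, 200) ) #close to the Clopper-Pearson hit rate example earlier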

What you shouldn’t use these intervals for

While I believe many applications of dashboards are well suited to including confidence intervals, confidence intervals (like p-values) are apt to be misinterpreted. One common mistake is thinking that a single 95% confidence interval covers the true parameter with a 95% probability. This is an inference for an individual sample that is not possible in frequentist statistics – that summary would be akin to a posterior credible interval in Bayesian statistics. Confidence intervals are about the procedure: if we do this procedure over and over again, in the long run it will cover the true statistic (which we do not observe for any individual sample) according to the level we set.

Another common mistake with confidence intervals is when comparing two different intervals: if they overlap, that is sometimes interpreted as no difference. But this is a very conservative test (e.g. it will fail to reject the null of no difference too often).

So say we were monitoring a process over time, and in October the process was 20% (40/200) and in November it was 28% (168/600). October’s confidence interval is 15% to 26%, and November’s confidence interval is 24% to 32%. So since those intervals overlap, we should conclude there are no differences, correct? Not exactly. If we do a direct test for the difference in proportions (akin to a t-test of mean differences), we get a confidence interval for the difference of -14% to -1% (in R, prop.test(c(40,168), c(200,600))). So in that direct hypothesis test, we would conclude October’s percent is lower than November’s percent.
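
Here is a rough python version of that direct test (a plain Wald interval for the difference in proportions; prop.test adds a continuity correction, so the endpoints will differ slightly):

import numpy as np
from scipy.stats import norm

x1, n1 = 40, 200   #October
x2, n2 = 168, 600  #November
p1, p2 = x1/n1, x2/n2
se_dif = np.sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 )
z = norm.ppf(0.975)
print( p1 - p2 - z*se_dif, p1 - p2 + z*se_dif ) #around -0.15 to -0.01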

Geoff Cumming suggests that when going from individual confidence intervals to comparisons between groups, one confidence interval needs to cover the point estimate for the other group to conclude the two groups are different.

But that being said, I believe many dashboards would be improved by incorporating such confidence intervals. So although they may not always provide the test of interest, they are a good way to prevent yourself from over-interpreting noisy trends in smaller samples. In the case of comparing two intervals, for most situations I deal with, being conservative and saying a process is not showing differences is a better approach than worrying about minor fluctuations (although it just depends on the use case whether that default behavior makes sense).

So please, when reporting proportions with small samples, provide a confidence interval around those proportions!

Outliers in Distributions

If you google ‘outlier’, all of the results that come up are in terms of individual observation outliers. So if you have a set of transaction data that is 10, 20, 30, 8000, the single observation 8000 is an outlier. But for many situations with transaction data, you don’t want to examine individual outlier incidents, but look for systematic patterns. For example, if I am looking at healthcare insurance claims for my work, a single claim that is $100,000 is actually not that rare. But if we have a hospital that has mostly $100,000 claims for a specific treatment, whereas another with similar cases has a range of $50,000 to $100,000, that may signal there is some funny business going on.

There is no single way to examine outliers in a distribution. A plain old t-test of mean differences may make sense for some situations. But a generally more useful way IMO to think about the problem is to examine the distribution of the outcome in CDF space, as opposed to looking at particular moments of the distribution. A t-test basically only looks at the differences in means for the distributions, whereas when examining the CDF we are looking for weird patterns at any point in the distribution.

Here is an example of comparing the cost of hospital stays (per length of stay) for one hospital against all others from the same datasource (details on the data in a sec). The way to read this graph is that at 10^3 (so $1,000 per day claims) for facility 1458, around 20% of the claims are below this value. For the rest of the hospital data, a larger proportion of claims are under a thousand dollars, more like 25%. Since the red line is always below the black line, it also means that the claims at this hospital are pretty much always larger than the claims at all the other hospitals.

For this example analysis, I am using data from the New York State health insurance claims data (SPARCS). I have posted python code to replicate here (note if you cannot access dropbox links, feel free to email and I will forward).

Here I am specifically analyzing medical, in-patient insurance claims (I dropped surgical claims) for around 300+ hospitals. There are quite a few claims in this data, over 2 million, and the majority of hospitals have plenty of claims to examine (so no hospitals with only 10 claims). I also specifically examine costs per length of stay. Initially I just examined costs, but will get to why I changed to the normalized version towards the end of the post.

Analysis of CDF Outliers

So first what I did was attempt a leave-one-out type stat test using the Kolmogorov-Smirnov test. This is a test that looks at the maximum vertical difference between the CDFs I showed earlier. I should have known better though. Given this large a sample size, even with multiple comparison adjustments for the false discovery rate, every hospital was considered an outlier. This is sort of the curse of null hypothesis significance testing: it is either underpowered, so you get null results when things should really be flagged, or with large samples everything is flagged.
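
For reference, that leave-one-out KS test was along these lines (a sketch; the claims dataframe and the Hospital/CostPerDay columns are made up names standing in for the actual claims data):

import pandas as pd
from scipy.stats import ks_2samp
from statsmodels.stats.multitest import multipletests

res = []
for hosp, sub in claims.groupby('Hospital'):
    rest = claims.loc[claims['Hospital'] != hosp, 'CostPerDay']
    ks_stat, pval = ks_2samp(sub['CostPerDay'], rest)
    res.append( (hosp, ks_stat, pval) )

ks_res = pd.DataFrame(res, columns=['Hospital','KS','pval'])
#false discovery rate adjustment, with samples this big everything still gets flagged
ks_res['Flag'] = multipletests(ks_res['pval'], method='fdr_bh')[0]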

So what I did instead was first make a graph of all the different CDFs for each individual hospital. You can see from this plot we have a mass of the distribution that looks very similar in shape, but is shifted left or right. (Hospitals can bill different values, i.e. casemix, so they can have the same types of events but different bills, so that is normal.) But then we have a few outliers that really stick out.

To characterize the central mass in this image, what I did was calculate the empirical CDF for each hospital (over 300 in this sample). Then I estimated the CDF for each hospital at a sample of points logspace distributed between $100 to $100,000. Then I took the middle 90% range of the ECDF values across hospitals at each of those points. This is easier to show than to say, so in the below pic the grey area is the 90% region for the CDFs. Then you can do stats to see how hospitals may fall outside that band.
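
In code the band construction looks something like below (again a sketch using the same hypothetical claims dataframe), evaluating each hospital’s ECDF on a logspaced grid and then taking the 5th/95th percentiles across hospitals at each grid point:

import numpy as np

grid = np.logspace(2, 5, 100) #$100 to $100,000

def ecdf_at(vals, grid):
    vals = np.sort(np.asarray(vals))
    return np.searchsorted(vals, grid, side='right')/len(vals)

hosp_ids, ecdfs = [], []
for hosp, sub in claims.groupby('Hospital'):
    hosp_ids.append(hosp)
    ecdfs.append( ecdf_at(sub['CostPerDay'], grid) )
ecdfs = np.array(ecdfs)

low_band = np.percentile(ecdfs, 5, axis=0)
high_band = np.percentile(ecdfs, 95, axis=0)
#count how many grid points each hospital falls outside the band
outside = ((ecdfs < low_band) | (ecdfs > high_band)).sum(axis=1)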

So here 1320 is looking good until around 60% of the distribution, and then it is shifted right. There is a kink in the CDF as well, so this suggests really a set of different types of claims, and in that second group it is the outlier. 1320 was the hospital that had the most sample points outside of my grey coverage area, but you could also do outliers in terms of the distance between those two lines (again like a KS test stat), or in the area between those two lines (that is like a version of the Wasserstein distance only considering above/below moves). So here is the hospital that has the largest distance below the band (above the band signals that a hospital has lower claims on average):

Flat lines horizontally signal an absence of data, whereas vertical lines signal a set of claims with the exact same bill. So here we have a set of claims around $1000 per day that look normal, then an abnormal absence of data from $1,000 to $10,000. Then a large spike of claims that end up being around $45k per day.

So this is looking at the distribution relative to other hospitals, but a few examples I am familiar with look for these flat/vertical spikes in the CDF to identify fraud. Mike Maltz has an example of identifying collusion in bids. In another, Chris Stucchio identifies spikes in transaction data signaling potential fraud. Here I am just doing a test relative to other data to identify weird curves, not just flat lines though.

One limitation of this analysis I have conducted here is that it does not take into account the nature of the claims data. So say you had a hospital that specializes in cancer treatment, it may be totally normal for them to have claims that are higher value overall than a more typical hospital that spreads claims among a wider variety of types of visits/treatments. Initially I analyzed just the cost data, and it identified a few big outliers that ended up being hospice/nursing homes. So they had really high dollar value claims, but also really long stays. So when analyzing the claims per length of stay, they were totally normal in that central mass.

So ultimately there could be other characteristics in the types of claims hospitals submit that could explain the weird CDF. One way to take that into account is to do a conditional model for the claims, and then do the ECDF tests on those conditional models. One way may be to look at the residuals for each individual hospital, another would be to draw a matched comparison sample. (Greg Ridgeway did this when examining police behavior in the NYPD.)

That would be like making a single comparison line (like my initial black/red line graph). So controlling the false discovery rate after that will be tough with larger samples (again the typical KS test, even with a matched sample, will likely always reject). So I am wondering if there is another machine learning way to identify outliers in CDF space, like a mashup of isolation forests and conditional density forests. Essentially I want to fit a model to draw those grey CDF bands, instead of relying on my sample of hospitals to draw the grey band in those latter plots.

CrimCon Roundtable: Flipping a Criminal Justice PhD to an alt-academic Data Science Career

This Thursday 11/19/2020 at 1 PM Eastern, I will be participating in a roundtable for the online CrimCon event. This is free for everyone to zoom in, and here is the link to the program, I am on Stream 3!

The title is above — I have been a private sector data scientist at HMS for not quite a year now. I wanted to organize a panel to help upcoming PhD’s in criminal justice get some more exposure to potential data science positions, outside the traditional tenure track. Here is the abstract:

Tenure-track positions in academia are becoming more challenging to obtain, and only a small portion of junior faculty continue in academia to the rank of full professor. Therefore, students may opt to explore alternate options to obtain employment after their PhD is finished. These alternatives to the tenure track are often called “alt-academic” jobs. This roundtable will be focused on discussing various opportunities that exist for PhD’s in criminal justice and behavioral sciences spanning the public sector, the private sector, and non-profits/think tanks. Panelists will also discuss gaps in the typical PhD curriculum, with the goal of aiding current students to identify steps they can take to make themselves more competitive for alt-academic roles.

And here are each of the panelists bios:

Dr. Andrew Wheeler is currently a Data Scientist at HMS working on problems related to predictive modeling and optimization in relation to health insurance claims. Before joining HMS, he received a PhD degree in Criminal Justice from SUNY Albany. While in academia his research focused on collaborating with police departments on various problems, including evaluating crime reduction initiatives, place based and person based predictive modeling, data analytics for crime analysis, and developing models for the efficient and fair delivery of police resources.

Dr. Jennifer Gonzalez is the Senior Director of Population Health at the Meadows Mental Health Policy Institute, where she manages the Institute’s research and data portfolio. She earned her doctoral degree in epidemiology and a M.S. degree in criminal justice. Before joining MMHPI, Dr. Gonzalez was a tenured associate professor at the University of Texas School of Public Health, where she maintained a portfolio of more than $10 million in research funding and published more than one hundred interdisciplinary articles focused on the health of those who come into contact with—and work within—the criminal justice system.

Dr. Kyleigh Clark-Moorman is a Senior Research Associate for the Public Safety Performance Project at The Pew Charitable Trusts, a non-profit public policy organization. Kyleigh began working at Pew in 2019 and completed her PhD in Criminology and Criminal Justice at the University of Massachusetts, Lowell in May 2020. As an early career researcher, Dr. Clark-Moorman’s work has been published in Criminal Justice and Behavior, Criminal Justice Studies, and the Journal of Criminal Justice. In her role at Pew, Kyleigh is responsible for research design and data analysis focused on various criminal justice topics while also working with external partners to produce high-impact reports and analyses to raise awareness and drive public policy.

Matt Vogel is Associate Professor in the School of Criminal Justice at the University at Albany, SUNY and the Director of the Laboratory for Decision Making in Criminology and Criminal Justice. Matt regularly assists local agencies with data and evaluation needs. Some of his ongoing collaborations include assessments of racial representation on capital juries in Missouri, a longitudinal evaluation of a school-based violence reduction program, and the implementation of a police-hospital collaboration to help address retaliatory violence in St. Louis. Prior to joining the faculty at UAlbany, Matt worked in the Department of Criminology and Criminal Justice at the University of Missouri – St. Louis and held a long-term visiting appointment with the Faculty of Architecture at TU Delft (the Netherlands).

If you have any upfront questions you would like addressed by the panel, always feel free to send me a pre-emptive email (or comment below).


Update: The final roundtable is now posted on Youtube. See below for the panel’s thoughts on pursuing non-tenure track jobs with your social science PhD.

Regression with Simple Weights

I was reminded of this paper by Jung et al. on constructing simple rules via regression recently. So in the past few posts I have talked about how RTM (1,2) is aimed at making simple models. This is via variable selection and/or simplifying the inputs to be binary yes/no. But in the end the final equation could be something like:

log(Crime) = -0.56 + 0.6923*NearbyBars + 0.329*HighDensity311

The paper linked above is about making the regression weights simple, so instead of a regression weight of 0.89728, you may just round the regression weight to 1. The Jung paper does a procedure where they use lasso regression and then round the weights. But there is a simpler approach IMO that I will illustrate: just amend the penalty term to push the coefficients to simple integers. (I was also reminded of this example of using an iterative linear program to push weights to binary 0/1.)

So in lasso, you estimate your normal regression equation, but add a penalty on the weights, typically lambda*sum(abs(reg_weights)). This is equivalent to constraining the total of the absolute regression weights to be under some budget, so if your weights in total are larger than that budget they are penalized by a particular amount (the lambda is a tuner to make the penalty higher/lower). And in the iterative algorithm to minimize your loss function plus this added penalty, it will converge to regression weights that approximately meet that constraint. Not exactly, but close.

You can however swap out that penalty term with whatever you want (or add to it additional penalties). I will show an example of using a penalty term to push regression coefficients towards integer values, creating simple regression weights.

Why Simple Models?

Dan Simpson has a good blog post on the Jung paper and why simple models are sometimes preferable (and I also have a comment on why simple models like this tend to work out well for CJ datasets). But here are a few quick examples of why you might want simple model results.

Example 1: If you have people in the field who are tabulating data and making quick decisions, it may be they need to use pen/paper and make a quick decision. No time to input results into a computer and pop out a prediction. Imagine a nurse in the ER, or even your general practitioner. There may be quite a bit of utility in making a simple check list that says if +4 on this scale, do a more intensive treatment.

Example 2: You have a complicated, large database. It is easier to create a simple predictive model in SQL to serve up predictions (either because of latency or because of the complexity of the data pipeline). Instead of a complicated random forest, a linear regression with simple weights will be much easier to implement.

Example 3: Transparency. Complicated models are more difficult to understand and monitor. If you have a vested interest in presenting the model to outside parties, it may make sense to sacrifice some accuracy to make the model more interpretable. Also similar to lasso, I suspect these simple weights will reduce the variance of predictions.

The reason that these simple weights work well in practice for many social science examples you could interpret either in a good light or a bad one. For the half-empty interpretation, our models are not well identified – we can literally swap out various weights in our regression equation and get near similar predictions. So it is a fool’s errand to try to find the one regression equation that describes the underlying system. But you can flip that around as well: we don’t even need to find the perfect equation, we can identify quite a few good predictive equations. And why not pick a good equation that is easier to interpret?

Pytorch Example

The example set of code here is very simple, so I will just put the python code entirely in this post. First I import my libraries I am using and change my directory.

############################################
import os
import torch
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\regression_simpleweights\analysis'
os.chdir(my_dir)
############################################

Next I read in the data, which I have previously used as an example in prior blog posts on doctor visits for medicare patients. One thing to note here is that I rescale the independent variables I am using via min/max scaling. So the age variable, instead of going from 65-90 like in the original data, is now scaled to be between 0/1. This is a problem intrinsic to lasso as well, in that you can change the scale of the input variables and it changes the weights. Here with the original data, the education variable has a tiny regression coefficient (0.2), but is highly stat significant. So without rescaling that variable, the model said to hell with your penalty and still converged to a solution where that regression weight stayed at 0.2. If you divide the education variable by 5 though, the corresponding regression weight would change to around 1.

###########################################
#Data from Stata, https://www.stata-press.com/data/r16/gsem_mixture.dta
#see pg 501 https://www.stata.com/manuals/sem.pdf

visit_dat = pd.read_stata('gsem_mixture.dta')
y_dat = visit_dat[['drvisits']]
x_vars = ['private','medicaid','age','educ','actlim','chronic']
#rescaling variables to 0/1
x_dat = visit_dat[x_vars]
visit_dat[x_vars] = (x_dat - x_dat.min(axis=0)) / ( x_dat.max(axis=0)  - x_dat.min(axis=0) )
x_dat = visit_dat[x_vars] #intentional not a copy
###########################################

Now in the next part, I estimate the default linear regression model using statsmodels for reference. Then I stuff the results into pytorch tensors (which I will use later as default starting points for the pytorch estimates). Below is a pic of the resulting summary for the regression model (with the scaled variables, so is slightly different than my prior post).

###########################################
#Estimating the same model in statsmodel
#for confirmation of the result

stats_mod = smf.ols(formula='drvisits ~ private + medicaid + age + educ + actlim + chronic',
                    data=visit_dat)
sm_results = stats_mod.fit()
print(sm_results.summary())

#What is the mean squared error
pred = sm_results.get_prediction().summary_frame()
print( ((y_dat['drvisits'] - pred['mean'])**2).mean() )
#169513.0122252265 for sum
#46.10 for mean

#for setting default initial weights
coef_table = sm_results.params
int_ten = torch.tensor([coef_table[0]], dtype=torch.float, requires_grad=True)
coef_ten = torch.tensor(pd.DataFrame(coef_table[1:]).T.to_numpy(), dtype=torch.float, requires_grad=True)
###########################################

Now creating the pytorch model is quite simple. For linear regression it is just one linear layer, and then setting the loss function to mean squared error. Then I create my own simple weight penalization factor in the simp_loss function. This takes the regression weights (not including the bias/intercept term), takes the difference between the observed weight and the rounded weight, takes the absolute value, and sums those absolute values up. Then in the loop when I am fitting the model, you can see the loss = criterion(y_pred, y_ten) + 0.4*simp_loss(model) line. For the usual linear regression, it would just be the first criterion term. Adding in the penalty term is super simple in pytorch, you just add it to the loss. (And you can incorporate additional penalties the same way, ala elastic-net. In the Jung paper they put a penalty on the sum of coefficients as per the original lasso as well.)

Then the final part of the code after the loop is just putting the coefficients in a nicer data frame to print. And below the code snippet are the results.

###########################################
#Now estimating OLS model with simple coefficient
#Penalities in Pytorch

torch.manual_seed(10)

model = torch.nn.Sequential( 
  torch.nn.Linear(len(x_vars),1,bias=True)
)

##Initializing weights
#with torch.no_grad():
#    model[0].weight = torch.nn.Parameter(coef_ten)
#    model[0].bias = torch.nn.Parameter(int_ten)

x_ten = torch.tensor( x_dat.to_numpy(), dtype=torch.float)
y_ten = torch.tensor( y_dat.to_numpy(), dtype=torch.float)

criterion = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def simp_loss(mod):
    dif = mod[0].weight - torch.round(mod[0].weight)
    return dif.abs().sum()

for t in range(100000):
    #Forward pass
    y_pred = model(x_ten)
    #Loss
    loss = criterion(y_pred, y_ten) + 0.4*simp_loss(model)
    if t % 1000 == 99:
        print(f'iter: {t}, loss: {loss.item()}') 
    #Zero gradients
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

#Making a nice dataframe of coefficients

coef_vars = ['Inter'] + x_vars
vals = list(model[0].bias.detach().numpy()) + list(model[0].weight.detach().numpy()[0,:])
res = pd.DataFrame(zip(coef_vars, vals), columns=['Var','Coef'])
print( res )
###########################################

Here I did not round the coefficients, so you can see that they are not exactly integer values, but are very close. So this will result in a lower loss than taking the usual linear regression coefficients and rounding them like in the noted Jung paper. It is a more direct approach. Also note that the intercept is not close to an integer value. I did not include the intercept in my penalty term. You could if you wanted to, but for most examples I don’t think it makes much sense to do that.

But one of the things that I have noticed playing around with pytorch more is that it is very difficult to get randomly initialized weights to converge to the same solution. That is the identification problem I mentioned earlier. One way around this, instead of using randomly initialized weights, is to initialize them to some reasonable values. If you uncomment the lines with torch.no_grad(): in the above code and initialize the weights to start from the unregularized OLS solution, it converges much faster, has a slightly smaller mean square error term, and results in these effects:

So you can see in that solution it is exactly the same as rounding the initial OLS solution (ignoring the intercept again). But that may not always be the case. For example, if actlim (activity limitations) and educ (education) had a very high correlation, it may be that rounding both down is too big a hit to the fit of the equation, so one may go down and one go up. (You need to estimate the equation to know if things like that will occur.)

And that is all folks! If I were sharing this more broadly I would likely make a statsmodels-like interface instead of using pytorch directly (and it appears they use cvxopt under the hood), but it is very simple to amend pytorch to return simple weights: just add the penalty to the loss function. It works the same way for lasso/ridge as it does for the simple weights example I give here.

Next up I want to try to figure out autograd in pytorch well enough to give standard errors for these various regression models I am estimating. While I don’t think hypothesis testing makes sense for these various models I am sharing, seeing a standard error that is very high may have prognostic value. In this case, if you had a very high standard error relative to the simple coefficient, it might suggest you should rescale the variable a different way or drop it entirely.

Also for this example, to be simple in the field it would not only need simple coefficients, but simple inputs as well. Wondering if there is a way to add in threshold layers in deep learning to automatically figure out the best way to make the inputs binary (e.g. above 70, educ below 10, etc.) instead of doing min/max scaling of the inputs.

A latent variable approach to RTM using hidden layers in deep learning

Sorry about the long title! Previously I have blogged about how to use deep learning to generate an RTM like model, using variable selection and positive constraints. Deep learning frameworks often do not rely on variable selection like that though, they more often leverage hidden layers. For social scientists familiar with structural equation modelling, these hidden layers are very much akin to formative latent variables. (More traditionally folks use reflective latent variables in factor analysis, so the latent variable causes the observed measures. This is the reverse: the observed measures cause/define the latent variable, and we find the loadings that best predict some outcome further down the stream.)

In a nutshell, instead of the typical RTM way of picking the best single variable to use, e.g. Alcohol Density < 100 meters OR Alcohol Density < 500 meters, it allows both to contribute to a latent variable, call it AlcoholDens, but allows those weights to vary. Then I see how well the AlcoholDens latent variable predicts crime. I will show later in the results that the loadings are often spread out among different density/distance measures in this sample, suggesting the approach of just picking one is perhaps misguided.

I’ve posted the data and code to follow along here. There are two py files, 00_RTMHidden.py runs the main analysis, but dl_rtm_funcs.py has various functions used to build the deep learning model in pytorch. I am just going to hit some of the highlights instead of walking through bit by bit.

Some helper functions

First, last blog post I simply relied on using Poisson loss. This time, I took some effort to figure out my own loss function for the negative binomial model. Here I am using the NB2 form, and you can see I took the likelihood function from the Stata docs (they are a really great reference for various regression model info). To incorporate this into your deep learning model, you need to add a single parameter in your model, here I call it disp.

import torch

#Log likelihood taken from Stata docs, pg 11
#https://www.stata.com/manuals13/rnbreg.pdf
def nb2_loss(actual, log_pred, disp):
    m = 1/disp.exp()
    mu = log_pred.exp()
    p = 1/(1 + disp.exp()*mu)
    nll = torch.lgamma(m + actual) - torch.lgamma(actual+1) - torch.lgamma(m)
    nll += m*torch.log(p) + actual*torch.log(1-p)
    return -nll.mean()

A second set of helper functions I will illustrate at the end of the post is evaluating the fit for Poisson/Negative Binomial models. I’ve discussed these metrics before, they are just a python rewrite of older SPSS code I made.

from scipy.special import gamma, factorial
import pandas as pd

def pred_nb(mu, disp, int_y):
    inv_disp = 1/disp
    p1 = gamma(int_y + inv_disp) / ( factorial(int_y)*gamma(inv_disp) )
    p2 = ( inv_disp / (inv_disp + mu) ) ** inv_disp
    p3 = ( mu / (inv_disp + mu) ) ** int_y
    pfin = p1*p2*p3
    return pfin
    
def nb_fit(mu, obs, disp, max_y):
    res = []
    cum_fit = mu - mu
    for i in range(max_y+1):
        pred_fit = pred_nb(mu=mu, disp=disp, int_y=i)
        pred_obs = (obs == i)
        res.append( (str(i), pred_obs.mean(), pred_fit.mean(), pred_obs.sum(), pred_fit.sum()) )
        cum_fit += pred_fit
    fin_fit = 1 - cum_fit
    fin_obs = (obs > max_y)
    res.append( (str(max_y+1)+'+', fin_obs.mean(), fin_fit.mean(),
                  fin_obs.sum(), fin_fit.sum()) )
    dat = pd.DataFrame(res, columns=['Int','Obs','Pred','ObsN','PredN'])
    return dat

Main Analysis

Now onto the main analysis. Skipping the data loading (it is near copy-paste from my prior RTM Deep Learning post), here are the main guts to building and fitting the RTM model.

model = dl_rtm_funcs.RTM_hidden(gen_list=[alc_set,metro_set,c311_set], 
                                gen_names=['AlcOutlets','MetroEntr','Dens311'])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4

for t in range(5001):
    #Forward pass
    y_pred = model(comb_ten)
    #Loss 
    loss_insample = dl_rtm_funcs.nb2_loss(y_ten, y_pred, model.dispersion)
    optimizer.zero_grad()
    loss_insample.backward() #retain_graph=True
    optimizer.step()
    if t % 100 == 0:
        loss_out = dl_rtm_funcs.nb2_loss(out_ten, y_pred, model.dispersion)
        print(f'iter {t}: loss in = {loss_insample.item():.5f}, loss out = {loss_out.item():.5f}')

And in terms of iterations, on my machine this takes less than 20 seconds to do the 5000 iterations, and it has clearly peaked out by then (both in sample 2011 and out of sample 2012).

I’ve loaded the RTM model object with a few helper functions, so if you then run print( model.coef_table() ), you get out the final regression coefficients, including the dispersion term. For the negative binomial models in my dissertation, the dispersion term tended to be around ~4 for many models, so this corresponds pretty closely with my prior work.

These have interpretations as latent variables representing the effect of nearby alcohol outlets (both distance and density), metro entrances (just distance), and 311 calls for service (just density). Similar to original RTM, I have restricted the crime generator effects to be positive.

I also have another helper function, model.loadings(), that gives you a nice table. Here this shows how the original variables contribute to the latent variable. So here are the loadings for the distance to the nearest metro.

You can see that the dummy variables for met_dis_300 (meters) and smaller all contribute to the latent variable. So instead of picking one variable in the end, it allows multiple variables to contribute to the latent risk score. It may make more sense in this set up to encode variables as not cumulative, e.g. < 50 meters, < 100 meters, but orthogonal, e.g. [0,50), [50,100), etc., but I just stuck with the prior data in the same format for now. I force the loadings to sum to 1 and be positive, so the latent variables still have a very apples-to-apples comparison in terms of effect sizes.
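
The actual layer is in the dl_rtm_funcs file, but as a rough sketch (not the exact implementation), one way to get loadings that are positive and sum to 1 is to keep unconstrained raw weights and push them through a softmax:

import torch

class LatentLoad(torch.nn.Module):
    #Rough sketch of a formative latent variable with constrained loadings
    def __init__(self, n_inputs):
        super().__init__()
        self.raw = torch.nn.Parameter(torch.zeros(n_inputs))
    def forward(self, x):
        loads = torch.softmax(self.raw, dim=0) #positive, sums to 1
        return x @ loads #weighted sum of the input dummies

#e.g. combining five distance dummies into a single latent risk score
lat_alc = LatentLoad(5)
risk_score = lat_alc(torch.rand(10, 5))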

Here are the loadings for alcohol outlets, so we have both some distance and density effects in the end.

And here are the loadings for 311 density variables:

So you can see for the last one, only the furthest away had an effect at all. Which is contra to the broken windows theory! But it also shows that this is more general than the original RTM approach. If it should only be one variable, the model will learn that, but if it should be more it will incorporate a wider array of weights.

Next is to check out how well the model does overall. For calibration for Poisson/Negative Binomial models, I just detach my pytorch tensors, and feed them into my functions to do the evaluations.

#Calibration for Negative Binomial predictions
pred_pd = pd.Series( y_pred.exp().detach().numpy() )
disp_val = model.dispersion.exp().item()

nb_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                             disp=disp_val, max_y=10)
print( nb_fit )

So this shows that the model is pretty well calibrated in terms of overall predictions. Both the observed and predicted distributions have around 83% zeroes. I predict a few more 3/4 crime areas than observed, and my tails are somewhat thinner than they should be, but only by a tiny bit. (No doubt this would improve if I incorporated more covariates, I kept it simple to debug on purpose.)

We can also shrink the negative binomial dispersion term and see what our model would predict in the usual Poisson case (the mean function is the same, it is just changing the variance). The Poisson is the limit as the dispersion goes to zero, so to approximate it just pass in a small dispersion term.

#A small dispersion approximates the Poisson
pois_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                               disp=0.01, max_y=10)
print( pois_fit )

You can see that the Poisson model is a much worse fit. Underpredicting zero crime areas by 6%, and areas with over 10 crimes should pretty much never happen according to the Poisson model.

We should be assessing these metrics out of sample as well, and you can see that given crime is very historically stable, the out of sample 2012 violent crime counts are similarly well calibrated.

Finally, I have suggested in the past to use a weighted ROC curve as a metric for crime counts. Here is a simple example of doing that in python.

from sklearn.metrics import roc_curve, auc

crime_data['Weights'] = crime_data['Viol_2012'].clip(1)
crime_data['Outcome'] = crime_data['Viol_2012'].clip(0,1)

fpr, tpr, thresh = roc_curve(crime_data['Outcome'], pred_pd, sample_weight=crime_data['Weights'])
weighted_auc = auc(fpr, tpr)
print( weighted_auc ) 

So you can see the AUC is nothing to brag about here, 0.61 (it is only 0.63 in the 2011 sample). But again I am sure I could get that up by quite a bit by incorporating more covariates into the model.

A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?

You may think this is a bit far-fetched compared to my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if there is also an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis, below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.

Frontmatter

Here I am using pulp to fit the linear programming, reading in the data, and I am making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.

############################################################
#FRONT MATTER 

import pandas as pd
import pulp
import os

os.chdir(r'D:\Dropbox\Dropbox\Documents\BLOG\TURF_Analysis\Analysis')

#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

#Need 27 total of match up simulated data
main = ['Steak',
        'Pizza',
        'FriedChicken',
        'BBQChicken',
        'GrilledSalmon',
        'FriedHaddock',
        'LemonHaddock',
        'Roast',
        'Burger',
        'Wings']

sides = ['CeaserSalad',
         'IcebergSalad',
         'TomatoSoup',
         'OnionSoup',
         'Peas',
         'BrusselSprouts',
         'GreenBeans',
         'Corn',
         'DeviledEggs',
         'Pickles']

desserts = ['ChoclateIceCream',
            'VanillaIceCream',
            'ChocChipCookie',
            'OatmealCookie',
            'Brownie',
            'Blondie',
            'CherryPie']

#Renaming columns
surv_data.columns = main + sides + desserts

#Replacing the likert scale data with 0/1
surv_data.replace([1,2,3],0,inplace=True)
surv_data.replace([4,5],1,inplace=True)

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']
############################################################

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.

############################################################
#TRADITIONAL TURF ANALYSIS

k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
    
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model
P.solve()

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )

print(picked) 

#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people
############################################################

One of the things I have done here is to create a ‘weight’ variable associated with each customer. Here I set all of the customer weights equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however use the likert scale rating 1 to 5 directly: subtract 1 and divide by 4 to rescale it to the 0 to 1 range. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_Dec[i] decision variable to a continuous value, and then cap it at 1 (so you can partially reach customers), as in the sketch below.
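
Here is a sketch of that partial reach variant (likert_weight is a hypothetical dictionary keyed by customer and product, with values between 0 and 1):

#Sketch of the partial reach variant, customer variables now continuous
Ppart = pulp.LpProblem("TURF_partial", pulp.LpMaximize)
Cust_Part = pulp.LpVariable.dicts("Partial Reach", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpContinuous)
Prod_Part = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Ppart += pulp.lpSum( Cust_Part[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    Ppart += Cust_Part[i] <= pulp.lpSum( Prod_Part[j]*likert_weight[i,j] for j in Prod_Index )
Ppart += pulp.lpSum( Prod_Part[j] for j in Prod_Index ) == k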

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.

############################################################
#CONSTRAINTS ON ITEMS SELECTED

#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides (but could have more)
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
P2.solve()
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
print(picked2)
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]
############################################################

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.
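
Here is a sketch of that trade-off curve, just re-solving the original model in a loop over the number of items (re-using Cust_Index, Prod_Index, surv_items, and cust_weight from above):

#Reach vs menu size trade-off curve
reach_curve = []
for k_items in range(3, 11):
    Pk = pulp.LpProblem("TURF_k", pulp.LpMaximize)
    CustK = pulp.LpVariable.dicts("Cust", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
    ProdK = pulp.LpVariable.dicts("Prod", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Pk += pulp.lpSum( CustK[i] * cust_weight[i] for i in Cust_Index )
    for i in Cust_Index:
        p_sel = surv_items.loc[i] == 1
        sel_prod = list(p_sel.index[p_sel])
        Pk += CustK[i] <= pulp.lpSum( ProdK[j] for j in sel_prod )
    Pk += pulp.lpSum( ProdK[j] for j in Prod_Index ) == k_items
    Pk.solve()
    reach_curve.append( (k_items, pulp.value(Pk.objective)) )

print(reach_curve)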

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc., but each has a different cost. So instead of the constraint sum(Prod_j) == k where you select so many products, you have the constraint sum(Prod_j*Cost_j) <= Budget, where each product is associated with a particular cost.
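
In pulp that budget version just swaps out the product count constraint for something like below (cost_j and budget are hypothetical names here, akin to the cost dictionary in the next section):

#Hypothetical budget constraint in place of the == k constraint
P += pulp.lpSum( Prod_Dec[j]*cost_j[j] for j in Prod_Index ) <= budget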

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach subject to constraints on the items or the budget). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people). Our formula would then be below (where I make up slightly different costs for each of the buffet options).

############################################################
#MINIMIZE COST

#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100 #target reach, a bit over the 93 needed for 50%
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results
Pmin.solve()

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
cost = 0
for num,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (num+1,j,cost_prod[j]) )
        cost += cost_prod[j] #running total, should match the objective below
print(picked)
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5
############################################################

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we only need to serve 3 items at a minimum to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate reach constraints, one for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).
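
As a hedged sketch of those group constraints for the minimize cost model above, where fem_index and male_index are hypothetical lists splitting Cust_Index by group and the reach targets are made up:

n_fem, n_male = 60, 40 #made up group reach targets
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in fem_index ) >= n_fem
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in male_index ) >= n_male
Pmin.solve() #re-solve with the extra group constraints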

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I'm also wondering if there is a way to incorporate the marginal benefit of sales into this, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won't necessarily buy every flavor if I visit the shop). Similarly for marketing adverts, some forms may have better reach but worse conversion rates.
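
One hedged way to set up that mash-up is a weighted objective, reach minus lambda times cost. The lam value below is a made-up tuning weight you would have to pick yourself; this is a sketch of the idea, not a recommendation for a particular weight:

lam = 5.0 #made up weight balancing reach against cost
Pbi = pulp.LpProblem("TURF_BiObjective", pulp.LpMaximize)
CustBi = pulp.LpVariable.dicts("CustBi", list(Cust_Index), cat=pulp.LpBinary)
ProdBi = pulp.LpVariable.dicts("ProdBi", list(Prod_Index), cat=pulp.LpBinary)
Pbi += ( pulp.lpSum( CustBi[i]*cust_weight[i] for i in Cust_Index )
         - lam*pulp.lpSum( ProdBi[j]*cost_prod[j] for j in Prod_Index ) )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    Pbi += CustBi[i] <= pulp.lpSum( ProdBi[j] for j in p_sel.index[p_sel] )
Pbi.solve()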

Making smoothed scatterplots in python

The other day I made a blog post on my notes on making scatterplots in matplotlib. One big reason to make scatterplots though is when you are interested in a predictive relationship. Typically you want to look at the conditional value of the Y variable based on the X variable. Here are some example exploratory data analysis plots to accomplish that task in python.

I have posted the code to follow along on github here, in particular smooth.py has the functions of interest, and below I have various examples (that are saved in the Examples_Conditional.py file).

Data Prep

First to get started, I am importing my libraries and loading up some of the data from my dissertation on crime in DC at street units. My functions are in the smooth module. Also I change the default matplotlib theme using smooth.change_theme(). The only difference from my prior posts is that I don't have gridlines by default here (they can be a bit busy).

#################################
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import os
import sys

mydir = r'D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\Smooth'
data_loc = r'https://dl.dropbox.com/s/79ma3ldoup1bkw6/DC_CrimeData.csv?dl=0'
os.chdir(mydir)

#My functions
sys.path.append(mydir)
import smooth
smooth.change_theme()

#Dissertation dataset, can read from dropbox
DC_crime = pd.read_csv(data_loc)
#################################

Binned Conditional Plots

The first set of examples, I bin the data and estimate the conditional means and standard deviations. So here in this example I estimate E[Y | X = 0], E[Y | X = 1], etc, where Y is the total number of part 1 crimes and x is the total number of alcohol licenses on the street unit (e.g. bars, liquor stores, or conv. stores that sell beer).

The function name is mean_spike, and you pass in at a minimum the dataframe, x variable, and y variable. I by default plot the spikes as +/- 2 standard deviations, but you can set it via the mult argument.

####################
#Example binning and making mean/std dev spike plots

smooth.mean_spike(DC_crime,'TotalLic','TotalCrime')

####################

This example works out because licenses are just whole numbers, so they can be binned directly. You can pass in any X variable that can be binned in the end, so you could even pass in a string for the X variable. If you don't like the resulting format of the plot, you can just pass plot=False,ret_data=True for arguments, and you get back the aggregated data that I use to build the plots.

mean_lic = smooth.mean_spike(DC_crime,'TotalLic','TotalCrime',
                             plot=False,ret_data=True)

Another example I am frequently interested in is proportions and confidence intervals. In that case the function uses exact binomial confidence intervals at the 99% confidence level. Here I clip the burglary data to 0/1 values and then estimate proportions.

####################
#Example with proportion confidence interval spike plots

DC_crime['BurgClip'] = DC_crime['OffN3'].clip(0,1)
smooth.prop_spike(DC_crime,'TotalLic','BurgClip')

####################
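
For reference, those exact binomial intervals are Clopper-Pearson style intervals per bin. I am not showing the internals of prop_spike here, but a hedged sketch of computing one such interval yourself with statsmodels (the 12 out of 40 counts are made up):

from statsmodels.stats.proportion import proportion_confint
#Exact (Clopper-Pearson) 99% interval for e.g. 12 hits out of 40 observations
low, high = proportion_confint(count=12, nobs=40, alpha=0.01, method='beta')
print( (low, high) )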

A few things to note: I clip out bins with only 1 observation in them for both of these plots. I also do not have an argument to save the plot. This is because I typically only use these for exploratory data analysis; it is pretty rare I use these plots in a final presentation or paper.

I will need to update these in the future to jitter the data slightly to be able to superimpose the original data observations. The next plots are a bit easier to show that though.

Restricted Cubic Spline Plots

Binning like I did prior works out well when you have only a few bins of data. If you have continuous inputs though it is tougher. In that case, typically what I want to do is estimate a functional relationship in a regression equation, e.g. Y ~ f(x), where f(x) is pretty flexible to identify potential non-linear relationships.

Many analysts are taught the loess linear smoother for this. But I do not like loess very much; in my experience it is often both locally too wiggly and globally too smooth, and the weighting function has no really good default.

Another popular choice is to use generalized additive model smoothers. My experience with these (in R) is better than loess, but they IMO tend to be too aggressive, and identify overly complicated functions by default.

My favorite approach to this is actually from Frank Harrell's Regression Modeling Strategies: just pick a regular set of restricted cubic spline knots along your data. It is arbitrary where to set the knot locations for the splines, but my experience is they are very robust (so changing the knot locations only tends to change the estimated functional form by a tiny bit).
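
For reference, here is a hedged sketch of one common restricted cubic spline basis (Harrell's parameterization); the smooth.rcs function I use below may differ in scaling details:

def rcs_basis(x, knots):
    #Linear term plus k-2 truncated cubic terms, scaled by (t_k - t_1)**2
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    pos3 = lambda v: np.clip(v, 0, None)**3 #truncated cubic, (v)+^3
    basis = [x]
    for j in range(k - 2):
        term = ( pos3(x - t[j])
                 - pos3(x - t[k-2])*(t[k-1] - t[j])/(t[k-1] - t[k-2])
                 + pos3(x - t[k-1])*(t[k-2] - t[j])/(t[k-1] - t[k-2]) )
        basis.append(term/(t[k-1] - t[0])**2)
    return np.column_stack(basis)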

I have class notes on restricted cubic splines I think are a nice introduction. First, I am going to make the same dataset from my class notes, the US violent crime rate from 1985 through 2010.

years = pd.Series(list(range(26)))
vcr = [1881.3,
       1995.2,
       2036.1,
       2217.6,
       2299.9,
       2383.6,
       2318.2,
       2163.7,
       2089.8,
       1860.9,
       1557.8,
       1344.2,
       1268.4,
       1167.4,
       1062.6,
        945.2,
        927.5,
        789.6,
        734.1,
        687.4,
        673.1,
        637.9,
        613.8,
        580.3,
        551.8,
        593.1]

yr_df = pd.DataFrame(zip(years,years+1985,vcr), columns=['y1','years','vcr'])

I have a function that allows you to append the spline basis to a dataframe. If you don't pass in a data argument, it returns a dataframe of the basis functions.

#Can append rcs basis to dataframe
kn = [3.0,7.0,12.0,21.0]
smooth.rcs(years,knots=kn,stub='S',data=yr_df)

I also have a function in the code to get Harrell's suggested knot locations for the data. This works for 3 to 7 knots (it will throw an error if you pass a number outside that range). Here it suggests the locations [1.25, 8.75, 16.25, 23.75].

#If you want to use Harrell's rules to suggest knot locations
smooth.sug_knots(years,4)

Note if you have integer data these rules don't work out so well (you can get redundant suggested knot locations), so Harrell's defaults don't work with my alcohol license data. But that is one of the reasons I like restricted cubic splines: I just pick regular locations along the X data and they tend to work well. So here is a regression plot passing in those knot locations kn = [3.0,7.0,12.0,21.0] I defined a few paragraphs ago, and the plot draws a few vertical guides to show the knot locations.

#RCS plot
smooth.plot_rcs(yr_df,'y1','vcr',knots=kn)

Note that the error bands in the plot are confidence intervals around the mean, not prediction intervals. One of the nice things about this is that under the hood I use the statsmodels GLM interface, so if you want to swap the underlying family to Poisson (I am going back to my DC crime data here), you just pass it in via the fam argument:

#Can pass in a family argument for logit/Poisson models
smooth.plot_rcs(DC_crime,'TotalLic','TotalCrime', knots=[3,7,10,15],
                fam=sm.families.Poisson(), marker_size=12)

This is a really great example of the utility of splines. I will show later, but a linear Poisson model for the alcohol license effect extrapolates very poorly and ends up being explosive. Here though, even at the larger values the conditional fit stays right in line with the observed data. (And I swear I did not fiddle with the knot locations, they are just what I picked offhand to spread them out on the X axis.)

And if you want to do a logistic regression:

smooth.plot_rcs(DC_crime,'TotalLic','BurgClip', knots=[3,7,10,15],
                fam=sm.families.Binomial(),marker_alpha=0)

I'm not sure how to do this in a way you can get prediction intervals (I know how to do it for Gaussian models, but not for the other GLM families, and prediction intervals probably don't make sense for binomial data anyway). But one thing I could expand on in the future is to do quantile regression instead of GLM models.
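
As a rough sketch of that quantile regression idea (not something plot_rcs currently does), you could fit the conditional median with the statsmodels formula interface on a spline basis:

import statsmodels.formula.api as smf
#Conditional median of crime as a smooth function of licenses
med_fit = smf.quantreg('TotalCrime ~ bs(TotalLic,df=4,degree=3)', DC_crime).fit(q=0.5)
print( med_fit.summary() )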

Smooth Plots by Group

Sometimes you want to do the smoothed regression plots with interactions per group. I have two helper functions to do this. One is group_rcs_plot. Here I use the good old iris data to illustrate, for reasons I will explain in a second.

#Superimposing rcs on the same plot
iris = sns.load_dataset('iris')
smooth.group_rcs_plot(iris,'sepal_length','sepal_width',
               'species',colors=None,num_knots=3)

If you pass in the num_knots argument, the knot locations are different for each subgroup of data (which I like as a default). If you pass in the knots argument and the locations, they are the same though for each subgroup.

Note that the way I estimate the models here, I fit three separate models on the subsetted data frames; I do not estimate a stacked model with group interactions. So the error bands will be a bit wider than they would be for the stacked model.

Sometimes superimposing many different groups is tough to visualize. So then a good option is to make a set of small multiple plots. To help with this, I’ve made a function loc_error, to pipe into seaborn’s small multiple set up:

#Small multiple example
g = sns.FacetGrid(iris, col='species',col_wrap=2)
g.map_dataframe(smooth.loc_error, x='sepal_length', y='sepal_width', num_knots=3)
g.set_axis_labels("Sepal Length", "Sepal Width")

And here you can see that the knot locations are different for each subset, and this plot by default includes the original observations.

Using the Formula Interface for Plots

Finally, I've been experimenting a bit with a formula interface for the input, more similar to the way ggplot in R allows you to do this. So this is a new function, plot_form, and here is an example Poisson linear model:

smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ TotalLic',
                 fam=sm.families.Poisson(), marker_size=12)

You can see the explosive effect I talked about, which is common for Poisson/negative binomial models.

Here with the formula interface you can do other things, such as a polynomial regression:

#Can do polynomial terms
#Note: patsy treats ** as a formula operator, so to guarantee literal powers
#you may want to wrap them in I(), e.g. I(TotalLic**2) + I(TotalLic**3)
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ TotalLic + TotalLic**2 + TotalLic**3',
                 fam=sm.families.Poisson(), marker_size=12)

Which here ends up being almost indistinguishable from the linear terms. You can do other smoothers that are available in the patsy library as well, here are bsplines:

#Can do other smoothers
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ bs(TotalLic,df=4,degree=3)',
                 fam=sm.families.Poisson(), marker_size=12)

I don't really have a good reason to prefer restricted cubic splines to b-splines; I am just more familiar with restricted cubic splines (and this plot does not show the knot locations that were chosen by default, although you could pass in knot locations to the bs function).

You can also do other transformations of the x variable. So here taking the square root of the total number of licenses helps with the explosive effect somewhat:

#Can do transforms of the X variable
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ np.sqrt(TotalLic)',
                 fam=sm.families.Poisson(), marker_size=12)
             

In the prior blog post about explosive Poisson models I also showed a broken stick type model if you wanted to log the x variable but it has zero values.

#Can do multiple transforms of the X variable
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ np.log(TotalLic.clip(1)) + I(TotalLic==0)',
                 fam=sm.families.Poisson(), marker_size=12)

Technically this "works" if you transform the Y variable as well, but the resulting plot is misleading, and the prediction interval is for the transformed variable. E.g. if you pass a formula 'np.log(TotalCrime+1) ~ TotalLic', you would need to exponentiate the predictions and subtract 1 to get back to the original scale (and then the line won't be the mean anymore, but the confidence intervals are OK).

I will need to see if I can figure out patsy and sympy to be able to do the inverse transformation to even do that. That type of transform to the y variable directly probably only makes sense for linear models, and then I would also maybe need to do a Duan type smearing estimate to get the mean effect right.

RTM Deep Learning Style

In my quest to better understand deep learning, I have attempted to replicate some basic models I am familiar with in criminology, just typical OLS and the more complicated group based trajectory models. Another example I will illustrate is doing a variant of Risk Terrain Modeling.

The typical way RTM is done is:

Data Prep Part:

  1. Create a set of independent variables for crime generators (e.g. bars, subway stops, liquor stores, etc.) that are either the distance to the nearest or the kernel density estimate
  2. Turn these continuous estimates into dummy variables, e.g. if within 100 meters = 1, else = 0. For kernel density they typically z-score and if a z-score > 2 the dummy variable equals 1.
  3. Do 2 for varying distance/bandwidth selections, e.g. 100 meters, 200 meters, etc. So you end up with a collection of distance variables, e.g. Bars_100, Bars_200, Bars_400, etc.

Modeling Part

  4. Fit a Lasso regression predicting your crime outcome constraining all of the variables to be positive. (So RTM will never say a crime generator has a negative effect.)
  5. For the variables that passed this Lasso stage, then do a variable selection routine. So instead of the final model having Bars_100 and Bars_400, it will only choose one of those variables.

For the modeling part, we can replicate various parts of these in a deep learning environment. For constraining the coefficients to be positive, when you see folks refer to a "ReLU" or rectified linear unit function, all this means here is that the coefficients are constrained to be positive. For the variable selection part, I needed to hack my own approach: it ends up being a combo of a custom dropout scheme and then pruning in deep learning lingo.
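
As a minimal sketch of that positive-coefficient idea (this is not the code in rtm_dl_funcs.py, just an illustration), you can keep the raw weights unconstrained and pass them through a ReLU before they multiply the inputs:

import torch

class PosEffects(torch.nn.Module):
    def __init__(self, n_vars):
        super().__init__()
        #Raw weights are unconstrained, the effects used are relu(raw) >= 0
        self.raw = torch.nn.Parameter(torch.randn(n_vars))
    def forward(self, X):
        #X is an [n_obs, n_vars] float tensor, returns the non-negative weighted sum
        return X @ torch.relu(self.raw).to(X.dtype)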

Although RTM is typically done on raster grid cells for the spatial unit of analysis, this is not a requirement. You can do all these steps on vector (e.g. street segments) or other areal spatial units of analysis.

Here I illustrate using street units (intersections and street segments) from DC. The crime generator data I take from my dissertation (and I have a few pubs in Crime & Delinquency based on that work). The crime data I illustrate using 2011 violent Part 1 UCR (homicide, agg assault, robbery, no rape in the public data).

The crime dataset is over time, and I describe it in an analysis (with Billy Zakrzewski) examining pre/post crime around DC medical marijuana dispensaries.

The data and code to replicate can be downloaded here. It is python, and for the deep learning model I used pytorch.

RTM Example in Python

So I will walk through briefly my second script, 01_DeepLearningRTM.py. The first script, 00_DataPrep.py, does the data prep, so this data file already has the crime generator variables prepared in the manner RTM typically creates them. (The rtm_dl_funcs.py file has the functions to do the feature extraction and create the deep learning model; doing the distance/density feature extraction in scikit-learn is very slick, only a few lines of code.)

So first I just define the libraries I will be using, and import my custom rtm functions I created.

######################################################
import numpy as np
import pandas as pd
import torch
device = torch.device("cuda:0")
import os
import sys

my_dir = r'C:\Users\andre\OneDrive\Desktop\RTM_DeepLearning'
os.chdir(my_dir)
sys.path.append(my_dir)
import rtm_dl_funcs
######################################################

The next set of code grabs the crime data, and then defines my variable sets. I have plenty more crime generator data from my dissertation, but to make it easier on myself I just focus on distance to metro entrances, the density of 311 calls (a measure of disorder), and the distance and density of alcohol outlets (this includes bars/liquor stores/gas stations that sell beer, etc.).

Among these variable sets, the final selected model will only choose one within each set. But I have also included the ability for the model to incorporate other variables that will just enter in no matter what (and are not constrained to be positive). This is mostly to incorporate an intercept into the regression equation, but here I also include the percent of sidewalk encompassing one of my street units (based on the Voronoi tessellation), and a dummy variable for whether the street unit is an intersection. (I also planned on including the area of the tessellation, but it ended up being an explosive effect; my dissertation shows its effect is highly non-linear, so I didn't want to worry about splines here for simplicity.)

######################################################
#Get the Prepped Data
crime_data = pd.read_csv('Prepped_Crime.csv')

#Variable sets for each
db = [50, 100, 200, 300, 400, 500, 600, 700, 800]
metro_set = ['met_dis_' + str(i) for i in db]
alc_set = ['alc_dis_' + str(i) for i in db]
alc_set += ['alc_kde_' + str(i) for i in db]
c311_set = ['c31_kde_' + str(i) for i in db]

#Creating a few other generic variables
crime_data['PercSidewalk'] = crime_data['SidewalkArea'] / crime_data['AreaMinWat']
crime_data['Const'] = 1
const_li = ['Const','Intersection','PercSidewalk']
full_set = const_li + alc_set + metro_set + c311_set
######################################################

The next set of code turns my data into a set of torch tensors, then I grab the size of my independent variable sets, which I will end up needing when initializing my pytorch model.

Then I set the seed (to be able to reproduce the results), create the model, and set the loss function and optimizer. I use a Poisson loss function (will need to figure out negative binomial another day).

######################################################
#Now creating the torch tensors
x_ten = torch.tensor(crime_data[full_set].to_numpy(), dtype=float)
y_ten = torch.tensor(crime_data['Viol_2011'].to_numpy(), dtype=float)
out_ten = torch.tensor(crime_data['Viol_2012'].to_numpy(), dtype=float)

#These I need to initialize the deep learning model
gen_lens = [len(alc_set), len(metro_set), len(c311_set)]
    
#Creating the model 
torch.manual_seed(10)

model = rtm_dl_funcs.RTM_torch(const=len(const_li), 
                               gen_list=gen_lens)
criterion = torch.nn.PoissonNLLLoss(log_input=True, reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4
print( model )
######################################################

If you look at the printed out model, it gives a nice summary of the different layers. We have our one layer for the fixed coefficients, and another three sets for our alcohol outlets, 311 calls, and metro entrances. We then have a final cancel layer. The idea behind the final cancel layer is that the variable selection routine in RTM can still end up not selecting any variables for a set. I ended up not using it here though, as it was too aggressive in this example. (So will need to tinker with that some more!)

The variable selection routine is very volatile – so if you have very correlated inputs, you can essentially swap one for the other and get near equivalent predictions. I often see folks who do RTM analyses say something along the lines of, “OK this RTM selected A, and this RTM selected B, so they are different effects in these two samples” (sometimes pre/post, other times comparing different areas, and other times different crime outcomes). I think this is probably wrong though to make that inference, as there is quite a bit of noise in the variable selection process (and the variable selection process itself precludes making inferences on the coefficients themselves).

My deep learning example inherited the same problems. So if you change the initialized weights, it may end up selecting totally different inputs in the end. To get the variable selection routine to at least select the same crime generator variables in my tests, I do a burn in period in which I implement a random dropout scheme. So instead of the typical dropout, for every forward pass it does a random dropout that keeps only one variable, selected at random, out of each crime generator set. After that converges, I then use a pruning layer to only pick the coefficient that has the largest effect, and again do a large set of iterations to make sure the results converged. So different means but same ends to the typical RTM steps 4 and 5 above. I also have, like I said, a ReLU transformation after each layer, so the crime generator variables are always positive and any negative effects will be pruned out.
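
As a rough sketch of that burn-in idea (again not the exact code in rtm_dl_funcs.py), for each crime generator set you can build a mask that randomly keeps only one variable per forward pass:

def random_set_mask(gen_lens):
    #gen_lens is like [18, 9, 9], the size of each crime generator variable set
    masks = []
    for g in gen_lens:
        m = torch.zeros(g)
        m[torch.randint(g, (1,))] = 1.0 #keep one variable in the set at random
        masks.append(m)
    return torch.cat(masks) #multiply elementwise with the generator weights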

One thing that is nice about deep learning is that it can be quite fast. Here each of these 10,000 iteration sets take less than a minute on my desktop with a GPU. (I’ve been prototyping models with more parameters and more observations at work on my laptop with just a CPU that only take like 10 to 20 minutes).

######################################################
#Burn in part, random dropout
for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

#Switching to pruning all but the largest effects
model.l1_prune()

for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten, mask_type=None, cancel=False)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

print( model.coef_df(nm_li=full_set, cancel=False) )
######################################################

And this prints out the results (as incident rate ratios), so you can see it selected 50 meters alcohol kernel density, 50 meters distance to the nearest metro station, and kernel density for 311 calls with an 800 meter bandwidth.

I have in the code another example model when using a different seed. So testing out on around 5 different seeds it always selected these same distance/density variables, but the coefficients are slightly different each time. Here is an example from setting the seed to 12.

These models are nothing to brag about; if you z-score the predictions and set the threshold to above 2 (the typical RTM approach), the PAI is only around 3 in-sample for 2011, and slightly lower out of sample for 2012. It is a tough prediction task: the mean number of violent crimes per street unit per year is only 0.3. Violent crime is fortunately very rare!
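
For reference, here is a hedged sketch of the PAI calculation implied there, where pred and crime are hypothetical numpy arrays of per-unit predictions and observed crime counts:

def pai_at_z(pred, crime, z_cut=2.0):
    z_pred = (pred - pred.mean())/pred.std()
    flag = z_pred > z_cut #flagged street units
    hit_rate = crime[flag].sum()/crime.sum() #share of crime captured
    area_pct = flag.mean() #share of units flagged
    return hit_rate/area_pct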

But with only three different risk variables, we can do a quick conjunctive analysis, and look at the areas of overlap.

######################################################
#Adding model 1 predictions back into the dataset
pred_mod1 = pd.Series(model(x=x_ten, mask_type=None, cancel=False).exp().detach().numpy())
crime_data['Pred_M1'] = pred_mod1

#Check out the areas of overlapping risk
mod1_coef = model.coef_df(nm_li=full_set, cancel=False)
risk_vars = list(set(mod1_coef['Variable']) - set(const_li))
conj_set = crime_data.groupby(risk_vars, as_index=False)[['Const','Pred_M1','Viol_2012']].sum()
print(conj_set)
######################################################

In this table Const is the total number of street units selected, Pred_M1 is the expected number of crimes via Model 1, and then I show how well it conforms to the predictions out of sample 2012. So you can see in the aggregate the predictions are not too far off. There only ends up being one street unit that overlaps for all three risk factors in the study area.

I believe the predictions would be better if I included more crime generator variables. But ultimately, by the nature of how RTM works, it trades off accuracy for simpler models. Which is fair, as it helps to ease how a police department (or some other entity) responds to the predictions.

But this trade off results in predictions that don’t fare as well compared with more complicated models. For example I show (with Wouter Steenbeek) that random forests do much better than RTM. To make those models more interpretable we did local decompositions for hot spots, so say this hot spot is 30% alcohol outlets, 20% nearby apartments, etc.

So there are no doubt more extensions of RTM you could do in a deep learning framework, but they will likely always result in more complicated and less interpretable models. Also I don't think this code will produce better results than the traditional RTM software; the only major benefit of this code is that it will run faster, minutes instead of overnight for most jobs.

Creating high crime sub-tours

I was nerdsniped a bit by this paper, Targeting Knife-Enabled Homicides For Preventive Policing: A Stratified Resource Allocation Model by Vincent Hariman and Larry Sherman (HS from here on).

In it, HS attempt to define a touring schedule based on knife crime risk at the lower super output area (LSOA) level in London. So here are the identified high risk areas:

And here is HS's suggested hot spot tour schedule.

This is ad-hoc, but an admirable attempt to figure out a reasonable schedule. As you can see in their tables, the ‘high’ knife crime risk areas still only have a handful of homicides, so if reducing homicides is the objective, this program is a bit dead in the water (I’ve written about the lack of predictive ability of the model here).

I don't think defining tours to visit everywhere makes sense, but I do think a somewhat smaller question in scope does: how to figure out geographically informed tours for hot spot areas. So instead of the single grid cell target a la PredPol, pick out multiple areas to visit for hot spots. (I don't imagine the 41 LSOA areas are geographically contiguous either; it would make more sense to pick a tour of connected areas than of areas very far apart.)

Officers don’t tend to like single tiny areas either really, and I think it makes more sense to widen the scope a bit. So here is my attempt to figure those reasonable tours out.

Defining the Problem

The way I think about that problem is like this: look at the hypothetical diagram below. We have two choices for the hot spot location we are targeting, where the crime counts for locations are noted in the text label.

In the select-the-top-hot-spot (e.g. PredPol) approach, you would select the singlet grid cell in the top left, since it is the highest intensity. We have another choice though, the more spread out hot spot in the lower right. Even though it is a lower density, it ends up capturing more crime overall.

I subsequently formulated an integer linear program to try to tackle the problem of finding good sub-tours through the graph that cumulatively capture more crime. So with the above graph, if I select two subtours, I get these results (where nodes are identified by their (x,y) position):

  • ['Begin', (1, 4), 'End']
  • ['Begin', (4, 0), (4, 1), (3, 1), (3, 0), (2, 0), 'End']

So it can select singlet areas if they are islands (the (1,4) area in the top left), but will grow to wind through areas. Also note that the way I have programmed this network, it doesn't skip the zero-count area at (4,1); the tour needs to pass through at least one such area to cover the bottom right cluster unless it doubles back on itself.

I will explain the meaning of the begin and end nodes below in my description of the linear program. It ends up being sort of a mash-up of traveling salesman type vehicle routing and min cost max flow type problems.

The Linear Program

The way I think about this problem formulation is like this: we have a directed graph, in which you can say, OK I start from location A, then can go to B, then go to C. In my set of decision variables, I have choices that look like this, where the first subscript denotes the from node, and the second subscript denotes the to node.

D_ab := node a -> node b
D_bc := node b -> node c

etc. In our subsequent linear program, the destination node is the node for which we calculate our cumulative crime density statistics. So if node B had 10 crimes and 0.1 square kilometers, we would have a density of 100 crimes per square kilometer.

Now to make this formulation work, we need to add in a set of special nodes into our usual location network. These nodes I call ‘Begin’ and ‘End’ nodes (you may also call them source/sink nodes though). The begin nodes all look like this:

D_{begin,a}
D_{begin,b}
D_{begin,c}

So you do that for every node in your network. Then you have End nodes as well, e.g.

D_{a,end}
D_{b,end}
D_{c,end}

In this formulation, since we are only concerned about the crime stats for the to node, not the from node, the Begin nodes just inherit the crime density stats from the original node data. For the end nodes though, you just set their objective value stats to zero (they are only relevant to define the constraints).

Now here is my linear program formulation:

Maximize 
  Sum [ D_ij ( CrimeDensity_j - DensityPenalty_j ) ]

Subject To:

 1. Sum( D_in for each neighbor of n ) <= 1, 
      for each original node n
 2. Sum( D_in for each neighbor of n ) =  Sum( D_ni for each neighbor of n ), 
      for each original node n
 3. Sum( D_bi for each begin node ) = k routes
 4. Sum( D_ie for each end node ) = k routes
 5. Sum( D_ij + D_ji ) <= 1, for each unique i,j pair
 6. D_ij is an element of {0,1}

Constraint 1 is a flow constraint. If a node has an incoming edge set to one, it cannot have any other incoming edge set to one (so a location can only be chosen once).

Constraint 2 says that if an incoming edge into a node is selected, one of the outgoing edges from that node needs to be selected as well (flow conservation).

Constraints 3 & 4 determine the number of k tours/routes to choose in the end. Since the begin/end nodes are special we have k routes going out of the begin nodes, and k routes going into the end nodes.

With just these constraints, I found you can still get micro-cycles, something like X -> Z -> X. Constraint 5 (defined over each unique undirected pair) prevents this from happening.

Constraint 6 is just setting the decision variables to binary 0/1. So it is a mixed integer linear program.

The final thing to note is the objective function: I have CrimeDensity_j - DensityPenalty_j, so what exactly is DensityPenalty? This is a value that penalizes visiting areas that are below this threshold. Basically the density penalty sets an approximate threshold for the minimum density a tour should contain.

I suggest a default of a predictive accuracy index of 10. Where do I get 10 you ask? Weisburd’s law of crime concentration suggests 5% of the areas should contain 50% of the crime, which is a PAI of 0.5/0.05 = 10. In my example with DC data then I just calculate the actual density of crime per unit area that corresponds to a PAI of 10.

You can adjust this though: if you prefer smaller tours of higher crime density you would up the value; if you prefer longer tours, decrease it.
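
As a quick hedged sketch of that default: a PAI of 10 corresponds to a density of ten times the overall crime per unit area (crime_counts and areas here are hypothetical per-area arrays, not the actual column names in the posted data).

#PAI_i = (crime_i/total_crime)/(area_i/total_area) = density_i*total_area/total_crime
density_penalty = 10*crime_counts.sum()/areas.sum() #density corresponding to PAI = 10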

This is the best way I could figure out how to trade off the idea of spreading out the targeted hot spot vs selecting the best areas. If you spread out you will ultimately have a lower density. This turns it into a soft penalty in the objective to try to keep the selected tours at a particular density threshold (and it will scoop up better tours if they are available). For a while I tried to figure out if I could maximize the PAI metric directly, but with larger areas the PAI will always go down, so you need to define the objective some other way.
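
To make the formulation concrete, here is a hedged pulp sketch of the program above. This is not the exact code I posted; it assumes nodes is a list of area ids, edges[i] is a symmetric list of contiguous neighbors of area i, dens[i] is the crime density of area i, penalty is the density penalty just discussed, and k is the number of tours.

import pulp

def subtour_lp(nodes, edges, dens, penalty, k):
    #Directed arcs between contiguous areas, plus Begin/End arcs for each node
    arcs = [(i, j) for i in nodes for j in edges[i]]
    arcs += [('Begin', j) for j in nodes]
    arcs += [(i, 'End') for i in nodes]
    P = pulp.LpProblem("SubTours", pulp.LpMaximize)
    D = pulp.LpVariable.dicts("D", arcs, cat=pulp.LpBinary)
    #Objective, crime density minus the penalty for each destination node
    P += pulp.lpSum( D[a]*(dens[a[1]] - penalty) for a in arcs if a[1] != 'End' )
    for n in nodes:
        into_n = [D[('Begin', n)]] + [D[(i, n)] for i in edges[n]]
        out_n = [D[(n, 'End')]] + [D[(n, j)] for j in edges[n]]
        P += pulp.lpSum(into_n) <= 1 #constraint 1, visit a node at most once
        P += pulp.lpSum(into_n) == pulp.lpSum(out_n) #constraint 2, flow conservation
    P += pulp.lpSum( D[('Begin', j)] for j in nodes ) == k #constraint 3, k tours out
    P += pulp.lpSum( D[(i, 'End')] for i in nodes ) == k   #constraint 4, k tours in
    seen = set()
    for i in nodes:
        for j in edges[i]:
            if (j, i) not in seen:
                P += D[(i, j)] + D[(j, i)] <= 1 #constraint 5, no two-node cycles
                seen.add((i, j))
    P.solve()
    return [a for a in arcs if D[a].varValue == 1] #selected arcs form the tours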

In this formulation I only consider linked nodes (unlike the usual traveling salesman problem, which uses a completely connected distance graph). That makes it much more manageable. With N as the number of nodes/areas and E as the number of directed edges between those areas, we will then have:

  • 2*N + E decision variables
  • 2 + 2*N + E/2 constraints

Generally if you are doing directly connected areas in geographic networks (i.e. contiguity connections), you will have fewer than 8 neighbors per area (typically more like an average of 6). So in the case of the roughly 4,000 London lower super output areas, if I chose tours I would guess it would end up being fewer than 2*4,000 + 8*4,000 = 40,000 decision variables, and fewer constraints than that.

Since that is puny (and I would suggest doing this at a smaller geographic resolution) I tested it out on a harder network. I used the data from my dissertation, a network of 21,506 street units (both street segments and intersections) in Washington, D.C. The contiguity I use for these micro units is based on the Voronoi tessellation, so it tends to have more neighbors than you would get with a strictly road based network connectivity. Still in the end it ends up being a shade fewer than 200k decision variables and 110k constraints. So I think it is a better test of whether the problem can feasibly be solved in the wild.

Example with DC Data

Here I have posted the python code and data used for this analysis. I end up having a nice function where you just submit your network with the appropriate attributes and out pop the different tours.

So I end up doing examples of 4 and 8 subtours based on 2011 violent UCR crime data (agg assaults, robberies, and homicides, no rapes in the public data). I use for the penalty that PAI = 10 threshold, so it should limit tours to approximately that value. It only ends up taking 2 minutes for the model to converge for the 4 tours and less than 2.5 minutes for the 8 tours on my desktop. So it should not be a big problem to up the decision variables to more sub-areas and still solve it in real life applications.

The area estimates are in square meters, hence the high numbers. But on the right you can see that each sub-tour has a PAI above 10.

Here is an interactive map for you to zoom into each 4 subtour example. Below is a screenshot of one of the subtours. You can see that since I have defined my connected areas in terms of Voronoi tessellations, they don't exactly follow the street network.

For the 8 tour example, it ends up returning several zero tours, so it is not possible in this data to generate 8 sub-tours that meet that PAI >= 10 threshold.

You can see that the tours end up having higher PAI values, but lower overall crime counts.

You may think, why does it not pick at least singlet areas with at least one crime? It ends up being that I weight areas here by their size (this formulation would be better with grid cells of equal area). So my objective function is technically Sum [ D_ij * w_j * ( CrimeDensity_j - DensityPenalty_j ) ], where w_j is the percent of the total area (the denominator in the PAI calculation). So it ends up picking the tiniest areas, as they result in the smallest penalty to the objective function (w_j is tiny). I think this is OK though in the end; I would rather know that some of the tours are worthless.

You can also see I get one subtour that is just under the PAI 10 threshold. That is possible here, but it should be only slightly below in the worst case. The way the objective function works, it is pretty tricky to pick out subtours below that PAI value that still make a positive contribution to the overall objective function.

Future Directions

The main thing I wish I could do with the current algorithm (but can’t the way the linear program is set up), is to have minimum and maximum tour area/length constraints. I think I can maybe do this by adapting this code (I’m not sure how to do the penalties/objectives though). So if others have ideas let me know!

I admit that this may be overkill, and maybe just doing more typical crime clustering algorithms may be sufficient. E.g. doing DBSCAN hot spots like I did here.

But this is my best shake at the problem for now!