A latent variable approach to RTM using hidden layers in deep learning

Sorry about the long title! Previously I have blogged about how to use Deep Learning to generate an RTM like model variable selection and positive constraints. Deep learning frameworks often do not rely on variable selection like that though, they more often leverage hidden layers. For social scientists familiar with structural equation modelling, these hidden layers are very much akin to formative latent variables. (More traditionally folks use reflective latent variables in factor analysis, so the latent variable causes the observed measures. This is the obverse, the observed measures cause/define the latent variable, and we find the loadings that best predict some outcome further down the stream.)

In a nutshell, instead of the typical RTM way of picking the best variable to use, e.g. Alcohol Density < 100 meters OR Alcohol Density < 500 meters, it allows both to contribute to a latent variable, call it AlcoholDens, but allows those weights to vary. Then I see how well the AlcoholDens latent variable predicts crime. I will show later in the results that the loadings are often spread out among different density/distance measures in this sample, suggesting the approach just pick one is perhaps misguided.

I’ve posted the data and code to follow along here. There are two py files, 00_RTMHidden.py runs the main analysis, but dl_rtm_funcs.py has various functions used to build the deep learning model in pytorch. I am just going to hit some of the highlights instead of walking through bit by bit.

Some helper functions

First, last blog post I simply relied on using Poisson loss. This time, I took some effort to figure out my own loss function for the negative binomial model. Here I am using the NB2 form, and you can see I took the likelihood function from the Stata docs (they are a really great reference for various regression model info). To incorporate this into your deep learning model, you need to add a single parameter in your model, here I call it disp.

#Log likelihood taken from Stata docs, pg 11 
#https://www.stata.com/manuals13/rnbreg.pdf
def nb2_loss(actual, log_pred, disp):
    m = 1/disp.exp()
    mu = log_pred.exp()
    p = 1/(1 + disp.exp()*mu)
    nll = torch.lgamma(m + actual) - torch.lgamma(actual+1) - torch.lgamma(m)
    nll += m*torch.log(p) + actual*torch.log(1-p)
    return -nll.mean()

A second set of helper functions I will illustrate at the end of the post is evaluating the fit for Poisson/Negative Binomial models. I’ve discussed these metrics before, they are just a python rewrite of older SPSS code I made.

def pred_nb(mu, disp, int_y):
    inv_disp = 1/disp
    p1 = gamma(int_y + inv_disp) / ( factorial(int_y)*gamma(inv_disp) )
    p2 = ( inv_disp / (inv_disp + mu) ) ** inv_disp
    p3 = ( mu / (inv_disp + mu) ) ** int_y
    pfin = p1*p2*p3
    return pfin
    
def nb_fit(mu, obs, disp, max_y):
    res = []
    cum_fit = mu - mu
    for i in range(max_y+1):
        pred_fit = pred_nb(mu=mu, disp=disp, int_y=i)
        pred_obs = (obs == i)
        res.append( (str(i), pred_obs.mean(), pred_fit.mean(), pred_obs.sum(), pred_fit.sum()) )
        cum_fit += pred_fit
    fin_fit = 1 - cum_fit
    fin_obs = (obs > max_y)
    res.append( (str(max_y+1)+'+', fin_obs.mean(), fin_fit.mean(),
                  fin_obs.sum(), fin_fit.sum()) )
    dat = pd.DataFrame(res, columns=['Int','Obs','Pred','ObsN','PredN'])
    return dat

Main Analysis

Now onto the main analysis. Skipping the data loading (it is near copy-paste from my prior RTM Deep Learning post), here are the main guts to building and fitting the RTM model.

model = dl_rtm_funcs.RTM_hidden(gen_list=[alc_set,metro_set,c311_set], 
                                gen_names=['AlcOutlets','MetroEntr','Dens311'])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4

for t in range(5001):
    #Forward pass
    y_pred = model(comb_ten)
    #Loss 
    loss_insample = dl_rtm_funcs.nb2_loss(y_ten, y_pred, model.dispersion)
    optimizer.zero_grad()
    loss_insample.backward() #retain_graph=True
    optimizer.step()
    if t % 100 == 0:
        loss_out = dl_rtm_funcs.nb2_loss(out_ten, y_pred, model.dispersion)
        print(f'iter {t}: loss in = {loss_insample.item():.5f}, loss out = {loss_out.item():.5f}')

And in terms of iterations, on my machine this takes less than 20 seconds to do the 5000 iterations, and it has clearly peaked out by then (both in sample 2011 and out of sample 2012).

I’ve loading the RTM model object with a few helper functions, so if you then run print( model.coef_table() ), you get out the final regression coefficients, including the dispersion term. For my negative binomial models for my dissertation, the dispersion term tended to be around ~4 for many models, so this corresponds pretty closely with my prior work.

These have interpretations as latent variables representing the effect of nearby alcohol outlets (both distance and density), metro entrances (just distance), and 311 calls for service (just density). Similar to original RTM, I have restricted the crime generator effects to be positive.

I also have another helper function, model.loadings(), that gives you a nice table. Here this shows how the original variables contribute to the latent variable. So here are the loadings for the distance to the nearest metro.

You can see that the dummy variables for met_dis_300 (meters) and smaller all contribute to the latent variable. So instead of picking one variable in the end, it allows multiple variables to contribute to the latent risk score. It may make more sense in this set up to encode variables as not cumulative, e.g. < 50 meters, < 100 meters, but orthogonal, e.g. [0,50),[50,100), etc.), but just stuck with the prior data in the same format for now. I force the loadings to sum to 1 and be positive, so the latent variables still have a very apples-to-apples comparison in terms of effect sizes.

Here are the loadings for alcohol outlets, so we have both some distance and density effects in the end.

And here are the loadings for 311 density variables:

So you can see for the last one, only the furthest away had an effect at all. Which is contra to the broken windows theory! But also shows that this is more general than the original RTM approach. If it only should be one variable the model will learn that, but if it should be more it will incorporate a wider array of weights.

Next is to check out how well the model does overall. For calibration for Poisson/Negative Binomial models, I just detach my pytorch tensors, and feed them into my functions to do the evaluations.

#Calibration for Negative Binomial predictions
pred_pd = pd.Series( y_pred.exp().detach().numpy() )
disp_val = model.dispersion.exp().item()

nb_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                             disp=disp_val, max_y=10)
print( nb_fit )

So this shows that the model is pretty well calibrated in terms of overall predictions. Both samples predict 83% zeroes. I predict a few more 3/4 crime areas than observed, and my tails are somewhat thinner than they should be, but only by a tiny bit. (No doubt this would improve if I incorporated more covariates, kept it simple to debug on purpose.)

We can ignore the negative binomial dispersion term and see what our model would predict in the usual Poisson case (the mean functions are the same, it is just changing the variance). To do this, just pass in a dispersion term of 1.

pois_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                               disp=1, max_y=10)
print( pois_fit )

You can see that the Poisson model is a much worse fit. Underpredicting zero crime areas by 6%, and areas with over 10 crimes should pretty much never happen according to the Poisson model.

We should be assessing these metrics out of sample as well, and you can see that given crime is very historically stable, the out of sample 2012 violent crime counts are similarly well calibrated.

Finally, I have suggested in the past to use a weighted ROC curve as a metric for crime counts. Here is a simple example of doing that in python.

crime_data['Weights'] = crime_data['Viol_2012'].clip(1)
crime_data['Outcome'] = crime_data['Viol_2012'].clip(0,1)

fpr, tpr, thresh = roc_curve(crime_data['Outcome'], pred_pd, sample_weight=crime_data['Weights'])
weighted_auc = auc(fpr, tpr)
print( weighted_auc ) 

So you can see the AUC is nothing to brag about here, 0.61 (it is only 0.63 in the 2011 sample). But again I am sure I could get that up by quite a bit by incorporating more covariates into the model.

RTM Deep Learning Style

In my quest to better understand deep learning, I have attempted to replicate some basic models I am familiar with in criminology, just typical OLS and the more complicated group based trajectory models. Another example I will illustrate is doing a variant of Risk Terrain Modeling.

The typical way RTM is done is:

Data Prep Part:

  1. create a set of independent variables for crime generators (e.g. bars, subway stops, liquor stores, etc.) that are either the distance to the nearest or the kernel density estimate
  2. Turn these continuous estimates into dummy variables, e.g. if within 100 meters = 1, else = 0. For kernel density they typically z-score and if a z-score > 2 the dummy variable equals 1.
  3. Do 2 for varying distance/bandwidth selections, e.g. 100 meters, 200 meters, etc. So you end up with a collection of distance variables, e.g. Bars_100, Bars_200, Bars_400, etc.

Modeling Part

  1. Fit a Lasso regression predicting your crime outcome constraining all of the variables to be positive. (So RTM will never say a crime generator has a negative effect.)
  2. For the variables that passed this Lasso stage, then do a variable selection routine. So instead of the final model having Bars_100 and Bars_400, it will only choose one of those variables.

For the modeling part, we can replicate various parts of these in a deep learning environment. For the constrain the coefficients to be positive, when you see folks refer to a “RelU” or the rectified linear unit function, all this means is that the coefficients are constrained to be positive. For the variable selection part, I needed to hack my own – it ends up being a combo of a custom dropout scheme and then pruning in deep learning lingo.

Although RTM is typically done on raster grid cells for the spatial unit of analysis, this is not a requirement. You can do all these steps on vector (e.g. street segments) or other areal spatial units of analysis.

Here I illustrate using street units (intersections and street segments) from DC. The crime generator data I take from my dissertation (and I have a few pubs in Crime & Delinquency based on that work). The crime data I illustrate using 2011 violent Part 1 UCR (homicide, agg assault, robbery, no rape in the public data).

The crime dataset is over time, and I describe in an analysis (with Billy Zakrzewski) on examining pre/post crime around DC medical marijuana dispensaries.

The data and code to replicate can be downloaded here. It is python, and for the deep learning model I used pytorch.

RTM Example in Python

So I will walk through briefly my second script, 01_DeepLearningRTM.py. The first script, 00_DataPrep.py, does the data prep, so this data file already has the crime generator variables prepared in the manner RTM typically creates them. (The rtm_dl_funcs.py has the functions to do the feature extraction and create the deep learning model, to do distance/density in sci-kit is very slick, only a few lines of code.)

So first I just define the libraries I will be using, and import my custom rtm functions I created.

######################################################
import numpy as np
import pandas as pd
import torch
device = torch.device("cuda:0")
import os
import sys

my_dir = r'C:\Users\andre\OneDrive\Desktop\RTM_DeepLearning'
os.chdir(my_dir)
sys.path.append(my_dir)
import rtm_dl_funcs
######################################################

The next set of code grabs the crime data, and then defines my variable sets. I have plenty more crime generator data from my dissertation, but to make it easier on myself I just focus on distance to metro entrances, the density of 311 calls (a measure of disorder), and the distance and density of alcohol outlets (this includes bars/liquor stores/gas stations that sell beer, etc.).

Among these variable sets, the final selected model will only choose one within each set. But I have also included the ability for the model to incorporate other variables that will just enter in no-matter what (and are not constrained to be positive). This is mostly to incorporate an intercept into the regression equation, but here I also include the percent of sidewalk encompassing one of my street units (based on the Voronoi tessellation), and a dummy variable for whether the street unit is an intersection. (I also planned on included the area of the tessalation, but it ended up being an explosive effect, my dissertation shows its effect is highly non-linear, so didn’t want to worry about splines here for simplicity.)

######################################################
#Get the Prepped Data
crime_data = pd.read_csv('Prepped_Crime.csv')

#Variable sets for each
db = [50, 100, 200, 300, 400, 500, 600, 700, 800]
metro_set = ['met_dis_' + str(i) for i in db]
alc_set = ['alc_dis_' + str(i) for i in db]
alc_set += ['alc_kde_' + str(i) for i in db]
c311_set = ['c31_kde_' + str(i) for i in db]

#Creating a few other generic variables
crime_data['PercSidewalk'] = crime_data['SidewalkArea'] / crime_data['AreaMinWat']
crime_data['Const'] = 1
const_li = ['Const','Intersection','PercSidewalk']
full_set = const_li + alc_set + metro_set + c311_set
######################################################

The next set of code turns my data into a set of torch tensors, then I grab the size of my independent variable sets, which I will end up needing when initializing my pytorch model.

Then I set the seed (to be able to reproduce the results), create the model, and set the loss function and optimizer. I use a Poisson loss function (will need to figure out negative binomial another day).

######################################################
#Now creating the torch tensors
x_ten = torch.tensor(crime_data[full_set].to_numpy(), dtype=float)
y_ten = torch.tensor(crime_data['Viol_2011'].to_numpy(), dtype=float)
out_ten = torch.tensor(crime_data['Viol_2012'].to_numpy(), dtype=float)

#These I need to initialize the deep learning model
gen_lens = [len(alc_set), len(metro_set), len(c311_set)]
    
#Creating the model 
torch.manual_seed(10)

model = rtm_dl_funcs.RTM_torch(const=len(const_li), 
                               gen_list=gen_lens)
criterion = torch.nn.PoissonNLLLoss(log_input=True, reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4
print( model )
######################################################

If you look at the printed out model, it gives a nice summary of the different layers. We have our one layer for the fixed coefficients, and another three sets for our alcohol outlets, 311 calls, and metro entrances. We then have a final cancel layer. The idea behind the final cancel layer is that the variable selection routine in RTM can still end up not selecting any variables for a set. I ended up not using it here though, as it was too aggressive in this example. (So will need to tinker with that some more!)

The variable selection routine is very volatile – so if you have very correlated inputs, you can essentially swap one for the other and get near equivalent predictions. I often see folks who do RTM analyses say something along the lines of, “OK this RTM selected A, and this RTM selected B, so they are different effects in these two samples” (sometimes pre/post, other times comparing different areas, and other times different crime outcomes). I think this is probably wrong though to make that inference, as there is quite a bit of noise in the variable selection process (and the variable selection process itself precludes making inferences on the coefficients themselves).

My deep learning example inherited the same problems. So if you change the initialized weights, it may end up selecting totally different inputs in the end. To get the variable selection routine to at least select the same crime generator variables in my tests, I do a burn in period in which I implement a random dropout scheme. So instead of the typical dropout, for every forward pass it does a random dropout to only select one variable randomly out of each crime generator set. After that converges, I then use a pruning layer to only pick the coefficient that has the largest effect, and again do a large set of iterations to make sure the results converged. So different means but same ends to the typical RTM steps 4 and 5 above. I also have like I said a ReLU transformation after each layer, so the crime generator variables are always positive, any negative effects will be pruned out.

One thing that is nice about deep learning is that it can be quite fast. Here each of these 10,000 iteration sets take less than a minute on my desktop with a GPU. (I’ve been prototyping models with more parameters and more observations at work on my laptop with just a CPU that only take like 10 to 20 minutes).

######################################################
#Burn in part, random dropout
for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

#Switching to pruning all but the largest effects
model.l1_prune()

for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten, mask_type=None, cancel=False)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

print( model.coef_df(nm_li=full_set, cancel=False) )
######################################################

And this prints out the results (as incident rate ratios), so you can see it selected 50 meters alcohol kernel density, 50 meters distance to the nearest metro station, and kernel density for 311 calls with an 800 meter bandwidth.

I have in the code another example model when using a different seed. So testing out on around 5 different seeds it always selected these same distance/density variables, but the coefficients are slightly different each time. Here is an example from setting the seed to 12.

These models are nothing to brag about, using the typical z-score the predictions and set the threshold to above 2, the PAI is only around 3 (both for in-sample 2011 and out of sample 2012 is slightly lower). It is a tough prediction task – the mean number of violent crimes per street unit per year is only 0.3. Violent crime is fortunately very rare!

But with only three different risk variables, we can do a quick conjunctive analysis, and look at the areas of overlap.

######################################################
#Adding model 1 predictions back into the dataset
pred_mod1 = pd.Series(model(x=x_ten, mask_type=None, cancel=False).exp().detach().numpy())
crime_data['Pred_M1'] = pred_mod1

#Check out the areas of overlapping risk
mod1_coef = model.coef_df(nm_li=full_set, cancel=False)
risk_vars = list(set(mod1_coef['Variable']) - set(const_li))
conj_set = crime_data.groupby(risk_vars, as_index=False)['Const','Pred_M1','Viol_2012'].sum()
print(conj_set)
######################################################

In this table Const is the total number of street units selected, Pred_M1 is the expected number of crimes via Model 1, and then I show how well it conforms to the predictions out of sample 2012. So you can see in the aggregate the predictions are not too far off. There only ends up being one street unit that overlaps for all three risk factors in the study area.

I believe the predictions would be better if I included more crime generator variables. But ultimately the nature of how RTM works it trades off accuracy for simple models. Which is fair – it helps to ease the nature of how a police department (or some other entity) responds to the predictions.

But this trade off results in predictions that don’t fare as well compared with more complicated models. For example I show (with Wouter Steenbeek) that random forests do much better than RTM. To make those models more interpretable we did local decompositions for hot spots, so say this hot spot is 30% alcohol outlets, 20% nearby apartments, etc.

So there is no doubt more extensions for RTM you could do in a deep learning framework, but they will likely always result in more complicated and less interpretable models. Also here I don’t think this code will be better than the traditional RTM folks, the only major benefit of this code is it will run faster – minutes instead of overnight for most jobs.