Open Source Criminology Related Network Datasets

So I am a big proponent of open source data analysis. There is a problem with using criminal justice data sources though – they often have private information that prevents us from sharing the data. For example, I have posted quite a few of my projects here (mostly spatial data analysis), but there are a few I cannot share. For example, I worked on a paper with chronic offender predictions, and I cannot share that data (Wheeler et al., 2019). The outcome, being a victim or perpetrator of gun violence, is so rare that by itself basically makes it impossible to publicly share the data without exposing the individuals under study.

One good resource all criminologists should be aware of is ICPSR, in particular NACJD. Many datasets on there though anymore are restricted, in that you need to get IRB permission and ICPSR permision to download the dataset to use. (Which typically takes like 2~3 months in my experience doing it a few times, which includes both your local Uni IRB and the ICPSR process.) For example here is one I went through the motions to get to (in the end) validate different survival prediction methods.

ICPSR is a great resource to be able to handle sharing potentially sensitive data. But this falls short in two areas. One is in teaching – you cannot go through the IRB ritual in a timely enough fashion to be able to use those datasets in a course environment. The other is in terms of methods, so for example if you wanted to say your model provides better predictions than some other model, they should be established on the same datasets. Current state of affairs in criminology in this regard is pretty bad to be curt – most everybody uses their own data they have access to. So much of the research on different risk assessment instruments for bail/probation/parole are pretty much impossible to say one is better than another.

One example type of data source that is almost entirely missing from NACJD (that I am aware of) is social network datasets relevant for criminology/criminal justice. So I have started a spreadsheet to collate different open source network datasets relevant for criminologists. So I have some from my work and a few other random examples I have come across on the internet.


I have made that spreadsheet open, so anyone should be able to edit in more sources. (Feel free to include links to ICPSR as well, but if you do edit a note to say whether it is restricted access or not.) For here I would be interested in really large networks, for example would love to try to replicate Marie’s work on gang network transitions (Oullet et al., 2019a).

And also while I am here, Jacob Young has created a very nice introductory course to social network analysis. I have a brief lecture in my advanced research design class, but Jacob’s is much more thorough (and he is more of an expert in this area than I am for sure).

I will add to that spreadsheet over time as well. I have made a separate sheet for survival analysis datasets. I would be particularly keen for example criminal justice examples. So for network analysis we have examples of looking at use-of-force networks (Oullet et al., 2019b), and for survival analysis I would be interested in a time to solve example dataset. Unfortunately for the solved cases, NIBRS is a good resource but has a large confound in they don’t measure whether a case was ever assigned to a detective.

Feel free to add whatever in that spreadsheet, but what I was thinking was oriented towards different methods (again as a main motivation is for teaching). So for example if you knew of datasets for age-period-cohort modelling, or for estimating group-based-trajectory models, I think those would be good examples to start new sheets and collate different data sources.


  • Ouellet, M., Bouchard, M., & Charette, Y. (2019a). One gang dies, another gains? The network dynamics of criminal group persistence. Criminology, 57(1), 5-33.
  • Ouellet, M., Hashimi, S., Gravel, J., & Papachristos, A. V. (2019b). Network exposure and excessive use of force: Investigating the social transmission of police misconduct. Criminology & Public Policy, 18(3), 675-704.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The accuracy of the violent offender identification directive tool to predict future gun violence. Criminal Justice and Behavior, 46(5), 770-788.

Using the Google Vision and Streetview API to Explore Hotspots

So previously I have shown how to automate the process of downloading google street view imagery (for individual addresses & running down a street). One interesting application is to then code those streetview images. There are many applications in criminology of coding these images for disorder. So Rob Sampson initially had the idea of ecometrics, in which he used systematic social observations via taking a video going down various streets to code physical disorder, such as garbage on the street (Raudenbush & Sampson, 1999). Others than leveraged Google streetview imagery to do those same audits instead of collecting their own footage (Bader et al., 2017).

Those are all someone looks at the images and a human says, there is XYZ in this photo and ABC in this photo. I was interested in testing out the Google Vision API to automate identifying parts of the images. So instead of a human manually reviewing, you build a score automatically. See for example work on identifying the percieved safety of streets (Naik et al., 2014).

Here I was motivated by some recent work of a colleague, Nate Connealy, in which he used this imagery to identify the differences in hot spots vs not hot spots (Connealy, 2020). Also I am pretty sure I saw George Mohler present on this at some ASC before I had the idea (it was similar to this paper, Khorshidi et al., 2019, not 100% sure it was the same one though). For an overview of crim applications using streetview and google maps, which also span CPTED type analyses, check out Vandeviver (2014).

So with Google’s automated vision API, if I submit this photo of a parking garage (this is actually the image I get if I submit the address Bad Address, Dallas, TX to the streetview API, so take in mind errors like that in my subsequent analysis).

You get back these labels, where the first item is the description and the second is the ‘score’ for whether the item is in the image:

('Architecture', 0.817379355430603),
('Floor', 0.7577666640281677),
('Room', 0.7444316148757935),
('Building', 0.7440816164016724),
('Parking', 0.7051371335983276),
('Ceiling', 0.6624311208724976),
('Flooring', 0.6004095673561096),
('Wood', 0.5958532094955444),
('House', 0.5928719639778137),
('Metal', 0.5114516019821167)

So I don’t tell Google what to look for, it just gives me back a ton of different labels depending on what it detects in the image. So what I do here is based on my hotspot work (Wheeler & Reuter, 2020), I grab a sample of 300 addresses inside my Dallas based hot spot areas, and 300 addresses outside of hot spots. (These addresses are based on crime data themselves, so similar to Nate’s work I only sample locations that at least have 1 crime).

So this isn’t a way to do predictions, but I think it is potentially interesting application of exploratory data analysis for hot spots or high crime areas.

Python Code Snippet

I am just going to paste the python code-snippet in its entirety.

Grabbing streetview images and detecting
labels using the google vision API

from import vision
import pandas as pd
import io
import os
import urllib
import time


add_dat = pd.read_csv('Sampled_Adds.csv')
add_dat['FullAdd'] = add_dat['IncidentAddress'] + ", DALLAS, TX"

# Code to download image based on address 

myloc = r"./Images" #replace with your own location
key = "&key=????YourKeyHere????" 

def GetStreet(Add,SaveLoc,Name):
  base = ""
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Name + ".jpg"
  loc_tosav = os.path.join(SaveLoc,fi)
  urllib.request.urlretrieve(MyUrl, loc_tosav)

# Code to get the google vision API labels
# for the image

client = vision.ImageAnnotatorClient.from_service_account_json('Geo Dallas-b5543ff0bb6d.json')

def LabelImage(ImageLoc):
    # Loads the image into memory
    with, 'rb') as image_file:
        content =
    image = vision.types.Image(content=content)
    response = client.label_detection(image=image)
    labels = response.label_annotations
    res = []
    if response.error.message:
        print(f'Error for image {ImageLoc}')
        print(f'Error Message {response.error.message}')
        res.append( ('Error', 1.0 ) )
        res = []
        for l in labels:
            res.append( (l.description , l.score) )
    return res

#A random parking garage!
GetStreet('Bad Address, Dallas, TX',myloc,'Bad_Address')    
long_tup = []
for index, row in add_dat.iterrows():
    #Name of the image
    nm = str(index) + "_" + str(row['Inside'])
    #Download the image    
    #Get the labels
    labs = LabelImage(os.path.join(myloc,nm + '.jpg'))
    #Build the new data tuples
    for l in labs:
        long_dat = (index, nm +'.jpg', row['Inside'], row['FullAdd'], l[0], l[1])
    #Sleep for a second to not spam the servers
    print(f'Done with index {index}')

long_dat = pd.DataFrame(long_tup, 

To get this to work you need a few things. First, you need to enable both the Vision API and the Streetview API in your Google API console. The streetview API has a key you can get directly from the API console (as described in my prior posts). But the vision API is different, and you can download a json file with all the necessary info and feed it into the client call. Once that is all done, you have it set up to query both API’s to get the images and then get the labels. But this is quick and dirty, it does not check for errors in either.

Here is a screenshot of some of the images downloaded, you can see that the streetview API doesn’t fail when their is no image available, it just does a mostly blank gray screenshot.

Analyzing the Results

I am not above just piping the results into an Excel document and doing some quick pivot tables. (I like doing that when there are many categories I want to explore quickly.) So here is a pivot table of the sum of the scores across the 300 outside hotspot (column 0) and 300 inside (column 1) images. So you can see the label of property is in more than half of the images for each (since the score value is never above 1). But property is more common outside hot spots than it is inside hot spots.

Here are contrast coded sums, so these identify the different labels that are more common in either hotspots or outside of hotspots. So outside of hotspots trees and plants appear more common (see Kondo et al., 2017 and Kondo’s other work on the topic). Inside hotspots we have more cars & asphault for examples.

This is just a quick and dirty analysis though. I do not take into account here missing images. The Screenshot label shows missing images are more common inside hotspots. And here since I use the addresses sometimes it gives me a shot of the street instead of the view perpendicular to the street. (I am not 100% sure the best way to do it, if you geocode and then use the lat/lon, you may not have the right view of the property either depending on the geocoding engine, so maybe going with the address directly is better?)

Future Work

In terms of predictive applications, I think using the streetview imagery is not likely to improve crime forecasts, that it is really only worthwhile for EDA or theory testing. In terms of predictive analysis, I actually think using the satellite imagery has more potential (see Jay, 2020 for an example, although that isn’t predictive but causal analysis).

So prior work has used 311 calls for service to identify high disorder areas (Magee, 2020; O’Brien & Winship, 2017; Wheeler, 2018), so I wonder if you can specifically build an image detector to identify particular disorder aspects that are not redundant with 311 calls. And also perhaps scales directly relevant to CPTED. The Google Vision labels are a bit superficial to really use for many theory crim applications I am afraid, but is an interesting exploratory data analysis to check them out.


Incorporating treatment non-compliance into call-ins

I have previously published work on identifying optimal individuals to prioritize for call-ins in Focused Deterrence interventions. The idea is we want to identify optimal people to spread the message, so you call in a small number of individuals and they should spread the message to the remaining group. There are better people than others to seed the message to to make sure it spreads throughout the network.

I knew of a direct improvement on that algorithm I published (very similar to the TURF problem I described the other day). But the bigger issue was that even when you call in individuals they do not always come to the meeting – treatment non-compliance. When working with state parole and/or local probation, the police department can ask those agencies to essentially make people come in, but otherwise it is voluntary.

The TURF problem I did the other day gave me a bit of inspiration on how to tackle that treatment non-compliance problem though. In a nutshell when you calculate whether someone is reached (via being directly connected to someone called-in), they can be partially reached based on the probability of the selected nodes treatment compliance. I have posted the code to follow along on dropbox here. I won’t go through the whole thing, but just some highlights.

The Model

First, in some quick and dirty text math, the model is:

Maximize Sum( R_i )

Subject to:

  • R_i <= Sum( S_j*p_j ) for each i
  • Sum( S_j ) = k
  • S_i element of [0,1]
  • R_i <= 1 for each i

Here i refers to an individual node in the gang/group network.

The first constraint R_i <= Sum( S_j*p_j ), the j’s are the nodes that are connected to i (and i itself). The p_j are the estimates that an individual will comply with coming into the call-in. For one agency we worked with for that project, they guessed that those who don’t need to come in comply about 1/6th of the time, so I use that estimate here in my examples, and give people who are on probation/parole a 1 for the probability of compliance.

Second constraint is we can only call in so many people, here k. The model solves very fast, so you can generate results for various k until you get the reach you want to in the end. (You could do the model the other way, minimize S_i while constraining the minimized acceptable reach, e.g. Sum( R_i ) >= threshold, I don’t suggest this in practice though, as when dealing with compliance there may be no feasible solution that gets you the amount of reach in the network you want.)

For the third constraint, the decision variables S_i are binary 0/1’s, but the R_i are continuous. But the trick here is that the last constraint, R_i <= 1, means that the expected reach is capped at 1. Here is a way to think about this, imagine you want to know the chance that person A is reached, and they are connected to two called-in individuals, who each have a 40% chance at complying with the treatment (coming to the call-in). The expected times person A would be reached then is additive in the probabilities, 0.4 + 0.4 = 0.8. If we had 3 people connected to A again at 40% apiece, the expected number of times A would be reached is then 0.4 + 0.4 + 0.4 = 1.2. So a person can be reached multiple times. (Note this is not the probability a person is reached at least once! It is a non-linear problem to model that.)

But if we took away the last constraint, what would happen is that the algorithm would just pick the nodes that had the highest number of neighbors. Since we are maximizing expected reach, if we had a sample of two people, the expected reach values of [2.5, 0] would be preferable to [1, 1], although clearly we rather have the reach spread out. So to prevent that, I cap the expected reach variable at 1, R_i <= 1 for each i, so this spreads out the selected individuals. So in the end the expected number of times people are reached are a lower bound estimate, but those are only people who are expected to receive the message multiple times.

This is a bit of a hack, but in my tests works quite well. I attempted to model the non-linear problem of estimating the probabilities at the person level and still maximizing the expected reach (in the code I have an example of using the CVXR R package). But it was quite fickle in when it would return a solution. So I am focusing on the linear program here, which is not perfect, but is an improvement over my prior published work.

Some Python Snippets

So for my example code, I am using City 4 Gang 4 from my paper. The reason is this was the largest network, and my original algorithm performed the worst. 99 nodes, and my original algorithm identified a 33 person dominant set, but Borgotti’s tool (that uses a genetic algorithm) identified a 29 dominant set.

Here is an example of calling my function to select the individuals for a call-in based on the non-compliance estimates. (g4 is the networkx graph object, the second arg is the number of individuals, and compliance is the node attribute that has the probability of treated compliance.) If we call in only 5 people, we still expect a reach of 29 individuals. Here there ends up being some highly connected people on parole/probation, so they have a 1 probability of complying with the treatment.

A consequence of this algorithm is that if you pipe in 1’s for the treatment compliance, you basically get an improvement to my original algorithm. So for a test we can see if I get the same minimal dominating set as Borgotti did for his algorithm here, where const is just everybody complies 100% of the time.

And yep we get a dominating set (all 99 people are reached). What happens if we go down one, and only select 28 people?

We only reach 98 out of the 99. So it appears a 29 set is the minimal dominating set here. But like I said the treatment non-compliance is a big deal in this setting. What is our expected reach if we take that into account, but still call-in 29 people?

It is still pretty high, around 2/3s of the network, but is still much smaller. Also if you look at the overlap between the constant versus non-compliance model, they select quite a few different individuals. It makes a big difference.

Here is a graph I made of selecting 20 individuals. Red means I selected that person, pink means they are reached at least some, and the size of the reach is proportion to the node. Then grey folks I wouldn’t expect to be reached by the message (at least by first degree connections).

So you can see that most of the people selected have that full 1 expected reach, so the algorithm does prioritize individuals on probation/parole who have a 100% expected compliance. But you can see a few folks who have a lower compliance who are selected as they are in places in the network not covered by those on probation/parole.

I have a tough time getting network layouts to look nice in python (even with the same layout algorithms, I feel like igraph in R just looks much better out of the box).

Future Work

Out of the box, this algorithm could incorporate several different pieces of information. So here I use the non-compliance estimate as a constant, but you could have varying estimates for that based on some other model no problem (e.g. older individuals comply more often than younger, etc.). Also another interesting extension (if you could get estimates) would be the probability a called-in individual spreads the message. In the part Sum( S_j*p_j ) it would just be something like Sum( S_j*p_cj*p_sj ), where p_cj is the compliance probability for attending, and p_sj is the probability to spread the message to those they are connected to.

Getting worthwhile estimates for either of those things will be tough though. Only way I can see it is via some shoe leather qualitative or survey approach.

Simulating runs of events

I still lurk on the Cross Validated statistics site every now and then. There was a kind of common question about the probability of a run of events occurring, and the poster provided a nice analytic solution to the problem using Markov Chains and absorbing states I was not familiar with.

I was familiar with a way to approximate the answer though using a simple simulation, and encoding data via run length encoding. Run length encoding works like this, if you have an original sequence that is AABBBABBBB, then the run length encoded version of this sequence is:


This is a quite convenient sparse data format to be familiar with. E.g. if you are using tensors in various deep learning libraries, you can encode the data like this and then stack the tensor. But the stacked tensor is just a view, so it doesn’t take up as much memory as the initial full tensor.

Using this encoding also makes a simulation to answer the question, how often do runs of 5+ occur in this hypothetical experiment quite easy to estimate. You just calculate the run length encoded version of the data, and see if any of the lengths are equal to or greater than 5. Below are code snippets in R and Python.

While the analytic solution is of course preferable when you can figure it out, simulations are nice to test whether the solution is correct, as well as to provide an answer when you are not familiar with how to analytically derive a solution.

R Code

R has a native run-length encoding command, rle. The reason is that runs tests are a common time series technique for looking at randomness. Encourage you to run the code yourself to see how my simulated answer lines up with the analytic answer provided on the stats site!

# R Code
die <- 1:6
run_sim <- function(rolls=1000, conseq=5){
    test <- sample(die,rolls,TRUE)
    res <- max(rle(test)$lengths) >= conseq

sims <- 1000000
results <- replicate(sims, run_sim(), TRUE)
print( mean(results) )

Python Code

The python code is very similar to the R code. Main difference is there is no native run length encoding command in numpy or scipy I am aware of (although there should be)! So I edited a function I found from Stackoverflow to accomplish the rle.

# Python code

import numpy as np

# Edited from
# input numpy arrary, return tuple (lengths, vals)
def rle(ia):
    y = np.array(ia[1:] != ia[:-1])         # pairwise unequal (string safe)
    i = np.append(np.where(y), len(ia) - 1) # must include last element
    z = np.diff(np.append(-1, i))           # run lengths
    return (z, ia[i])

die = list(range(6))

def run_sim(rolls=1000, conseq=5):
    rlen, vals = rle(np.random.choice(a=die,size=rolls,replace=True))
    return rlen.max() >= conseq

sims = 1000000
results = [run_sim() for i in range(sims)]
print( sum(results)/len(results) )

I debated on expanding this post to show how to do these simulations in parallel, this is a bit of a cheesy experiment to show though. To do 1 million simulations on my machine still only takes like 10~20 seconds for each of these code snippets. So that will have to wait until another post!

You may be thinking why do I care about runs of dice rolls? Well, it can be extended to many different types of time series monitoring problems. For example, when I worked as a crime analyst at Troy I thought about this in terms of analyzing domestic violent reports. They were too numerous for me to read through every report, so I needed to devise a system to identify if there were anomalous patterns in the recent number of reports. You could devise a test here, say how many days of 10+ reports in a row, and see how frequently you would expect that occur in say a year of monitoring. The simulations above could easily be amended to do that, via doing simulations of the Poisson distribution instead of dice rolls, or assigning weights to particular outcomes.

Street Network Distances and Correlations

Wouter Steenbeek (a friend and co-author for a few articles) has a few recent blog posts replicating some of my prior work replicating some of my work on street network vs Euclidean distances in Albany, NY (Wouters, 1, 2) and my posts (1,2).

In Wouter’s second post, he was particularly interested in checking out shorter distances (as that is what we are often interested in in criminology, checking crime clustering). When doing that, the relationship between network and Euclidean distances sometimes appear less strong, so my initial statement that they tend to be highly correlated is incorrect.

But this is an artifact for the correlation between any two measures – worth pointing out in general for analysis. If you artificially restrict the domain of one variable the correlation always goes down. See some examples on the cross-validated site (1, 2) that illustrate this with nicer graphs than I can whip up in a short time.

But for a quick idea about the issue, imagine a scenario where you slice out Euclidean distances in some X bin width, and check the scatterplot between Euclidean and network distances. So you will get less variation on the X axis, and more variation on the Y axis. Now take this to the extreme, and slice on Euclidean distances at only one value, say 100 meters exactly. In this scatterplot, there is no X variation, it is just a vertical line of points. So in that scenario the correlation is 0.

So I should not say the correlation between the two measures is high, as this is not always true – you can construct an artificial sample in which that statement is false. So a more accurate statement is that you can use the Euclidean distance to predict the network distance fairly accurately, or that the linear relationship between Euclidean and network distances is quite regular – no matter what the Euclidean distance is.

My analysis I have posted the python code here. But for a quick rundown, I grab the street networks for a buffer around Albany, NY using the osmnx library (so it is open street map network data). I convert this street network to an undirected graph (so no worrying about one-way streets) in a local projection. Then using all of the intersections in Albany (a few over 4000), I calculate all of the pairwise distances (around 8.7 million pairs, takes my computer alittle over a day to crunch it out in the background).

So again, the overall correlation is quite high:

But if you chunk the data up into tinier intervals, here 200 meter intervals, the correlations are smaller (an index of 100 means [0-200), 300 means [200-400), etc.).

But this does not mean the linear relationship between the two change. Here is a comparison of the linear regression line for the whole sample (orange), vs a broken-stick type model (the blue line). Imagine you take a slice of data, e.g. all Euclidean distances in the bin [100-200) and fit a regression line. And then do the same for the Euclidean distances [200-300) etc. The blue line here are those regression fits for each of those individual binned estimates. You can see that the two estimates are almost indistinguishable, so the relationship doesn’t change if you subset the data to shorter distances.

Technically the way I have drawn the blue line is misleading, I should have breaks in the line (it is not forced to be connected between bins, like my post on restricted cubic splines is). But I am too lazy to write code to do those splits at the moment.

Now, what does this mean exactly? So for research designs that may want to use network distances and an independent variable, e.g. look at prison visitation as a function of distance, or in my work on patrol redistricting I had to impute some missing travel time distances, these are likely OK to use typical Euclidean distances. Even my paper on survivability for gun shot fatality shows improved accuracy estimates using network distances, but very similar overall effects compared to using Euclidean distances.

So while here I have my computer crunch out the network distances for a day, where the Euclidean distances with the same data only takes a second, e.g. using scipy.spatial.distance. So it depends on the nature of the analysis whether that extra effort is worth it. (It helps to have good libraries ease the work, like here I used osmnx for python, and Wouter showed R code using sf to deal with the street networks, hardest part is the networks are often not stored in a way that makes doing the routing very easy. Neither of those libraries were available back in 2014.) Also note you only need to do the network calculations once and then can cache them (and I could have made these network computations go faster if I parallelized the lookup). So it is slightly onerous to do the network computations, but not impossible.

So where might it make a difference? One common use of these network distances in criminology is for analyses like Ripley’s K or near-repeat patterns. I don’t believe using network distances makes a big deal here, but I cannot say for sure. What I believe happens is that using network distances will dilate the distances, e.g. if you conclude two point patterns are clustered starting at 100 meters using Euclidean distances, then if using network it may spread out further and not show clustering until 200 meters. I do not think it would change overall inferences, such as where you make an inference whether two point patterns are clustered or not. (One point is does make a difference is doing spatial permutations in Ripley’s K, you should definitely restrict the simulations to generating hypothetical distributions on the street network and not anywhere in the study area.)

Also Stijn Ruiter makes the point (noted in Wouter’s second post), that street networks may be preferable for prediction purposes. Stijn’s point is related to spatial units of analyses, not to Euclidean vs Network distances. You could have a raster spatial unit of analysis but incorporate street network statistics, and vice-versa could have a vector street unit spatial unit of analysis and use Euclidean distance measures for different measures related to those vector units.

Wouter’s post also brought up another idea I’ve had for awhile, that when using spatial buffers around areas they can be bad control areas, as even if you normalize the area they have a very tiny sliver of network distance attributable to them. I will need to show that for another blog post though. (This was mostly my excuse to learn osmnx to do the routing!)

Recent Papers on Hot Spots of Crime in Dallas

So I have two different papers that were published recently. Both are on hot spots in Dallas, so might as well discuss them together.

For each I have posted the code to replicate the results (and that spreadsheet has links to preprints as well).

For each as a bit of a background as to the motivation for the projects, Dallas has had official hot spots, named TAAG (Target Area Action Grid). These were clearly larger than what would be considered best practice in identifying hot spots (they were more like entire neighborhoods). I realize ‘best practices’ is a bit wishy-washy, but the TAAG areas ended up covering around 20% of the city (a smidge over 65 square miles). Here is a map of the 2017 areas. There were 54 TAAG areas that covered, so on average each is alittle over 1 square mile.

Additionally I knew the Dallas police department was interested in purchasing the RTM software to do hot spots. And a separate group, the Dallas Crime Task Force was interested in using the software as well for non-police related interventions.

So I did these projects on my own (with my colleagues Wouter and Sydney of course). It wasn’t paid work for any of these groups (I asked DPD if they were interested, and had shared my results with folks from CPAL before that task force report came out, but nothing much came of it unfortunately). But my results for Dallas data are very likely to generalize to other places, so hopefully they will be helpful to others.

Machine Learning to Predict and Understand Hot Spots

So I see the appeal for folks who want to use RTM. It is well validated in both theory and practice, and Joel has made a nice software as a service app. But I knew going in that I could likely improve upon the predictions compared to RTM.

RTM tries to find a middle ground between prediction and causality (which isn’t a critique, it is sort of what we are all doing). RTM in the end spits out predictions that are like “Within 800 feet of a Subway Entrances is Risk Factor 1” and “The Density of Bars within 500 Feet is Risk Factor 2”. So it prefers simple models, that have prognostic value for PDS (or other agencies) to identify potential causal reasons for why that location is high crime. And subsequently helps to not only identify where hot spots are, but frame the potential interventions an agency may be interested it.

But this simplicity has a few drawbacks. One is that it is a global model, e.g. “800 feet within a subway entrance” applies to all subway entrances in the city. Most crime generators have a distribution that makes it so most subway entrances are relatively safe, only a few end up being high crime (for an example). Another is that it forces the way that different crime generators predict crime to be a series of step functions, e.g. “within 600 ft” or “a high density within 1000 ft”. In reality, most geographic processes follow a distance decay function. E.g. if you are looking at the relationship between check-cashing stores and street robbery, there are likely to be more very nearby the store, and it tails off in a gradual process the further away you get.

So I fit a more complicated random forest model that has neither of those limitations and can learn much more complicated functions, both in terms of distance to crime generators as well as spatially varying over the city. But because of that you don’t get the simple model interpretation – they are fundamentally conflicting goals. In terms of predictions either my machine learning model or a simpler comparison of using prior crime = future crime greatly outperforms RTM for several different predictive metrics.

So this shows the predictions are better for RTM no matter how you slice the hot spot areas, but again you lose out the prognostic value of RTM. To replace that, I show local interpretability scores for hot spots. I have an online map here for an example. If you click on one of the high crime predicted areas, it gives you a local breakdown of the different variables that contributes to the risk score.

So it is still more complicated than RTM, but gets you a local set of factors that potentially contribute to why places are hot spots. (It is still superficial in terms of causality, but PDs aren’t going to be able to get really well identified causal relationships for these types of predictions.)

Return on Investment for Hot Spots Policing

The second part of this is that Dallas is no doubt in a tight economic bind. And this was even before all the stuff about reforming police budgets. So policing academics have been saying PDs should shift many more resources from reactive to proactive policing for years. But how to make the argument that it is in police departments best interest to shift resources or invest in additional resources?

To do this I aimed to calculate a return on investment on investing in hot spots policing. Priscilla Hunt (from RAND) recently came up with labor cost estimates for crime specifically relevant for police departments. So if an aggravated assault happens PDs (in Texas) typically spend around $8k in labor costs to respond to the crime and investigate (it is $125k for a homicide). So based on this, you can say, if I can prevent 10 agg assaults, I then save $80k in labor costs. I use this logic to estimate a return on investment for PDs to do hot spots policing.

So first I generate hot spots, weighting for the costs of those crimes. Here is an interactive map to check them out, and below is a screenshot of the map.

I have an example of then calculating a return on investment for the hot spot area that captured the most crime. I get this estimate by transforming meta-analysis estimates of hot spots policing, estimating an average crime reduction, and then backing out how much labor costs that would save a police department. So in this hot spot, an ROI for hot spots policing (for 1.5 years) is $350k.

That return would justify at least one (probably more like two) full time officers just to be assigned to that specific hot spot. So if you actually hire more officers, it will be around net-zero in terms of labor costs. If you shift around current officers it should be a net gain in labor resources for the PD.

So most of the hot spots I identify in the study if you do this ROI calculation likely aren’t hot enough to justify hot spots policing from this ROI perspective (these would probably never justify intensive overtime that is typical of crackdown like interventions). But a few clearly are, and definitely should be the targets of some type of hot spot intervention.

A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?

You may think this is a bit far-fetched to my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if also there is an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis, below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.


Here I am using pulp to fit the linear programming, reading in the data, and I am making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.


import pandas as pd
import pulp
import os


#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

#Need 27 total of match up simulated data
main = ['Steak',

sides = ['CeaserSalad',

desserts = ['ChoclateIceCream',

#Renaming columns
surv_data.columns = main + sides + desserts

#Replacing the likert scale data with 0/1

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.


k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )


#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people

One of the things I have done here is to create a ‘weight’ variable associated with each customer. So here I say all of the customers weights are all equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however consider the likert scale rating 1 to 5 directly, subtract 1 and divide by 4. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_i decision variable to a continuous value, and then maybe cap it at 1 (so you can partially reach customers).

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.


#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides (but could have 0)
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc, but each have different costs. So instead of the constraint Prod_j == k where you select so many products, you have the constraint Prod_j*Cost_j <= Budget, where each product is associated with a particular cost.

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach while minimizing cost). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people). Our formula would then be below (where I make up slightly different costs for buffet each of the buffet options).


#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (n+1,j,cost_prod[j]) )
        cost += cost_prod[j]
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we need at a minimum to serve only 3 items to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate constraints for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I’m wondering as well if there is a way to incorporate marginal benefits of sales into this as well, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won’t necessarily buy every flavor if I visit the shop). Similar for marketing adverts some forms may have better reach, but may have worse conversion rates.

CrimRxiv, Alt-Journal Contributions, and Mike Maltz’s Retrospective

As I’m sure followers of mine know, I am a big proponent of posting pre-prints. Spearheaded by Scott Jacques, he has started a specifically criminology focused pre-print server title CrimRxiv. It is still in beta but anyone can contribute a paper if they want.

One of the things me and Scott have been jamming about is how to leverage crimrxiv to make a journal that not only takes advantage of all the goodies on the internet, such as being able to embed interactive graphics or other rich media directly in a journal articles. But to really widen the scope of what ‘counts’ in terms of scholarly contribution. Why can’t things like a cool app, or a really good video lecture you edited, or a blog post illustrating code be put on the same level with journal articles?

Part of the reason I am writing this blog post is that I saw Michael Maltz recently publish a retrospective on his career on This isn’t a typical journal article, but despite that there is no reason why you shouldn’t share such pieces. So I was able to convince Mike to post A Retrospective Look at My Professional Life to crimrxiv. When he first posted it on here was my response on how Mike (despite never having crossed paths) has influenced my career.

Hi Michael and thank you for sharing,

I’ve followed your work since a grad student at Albany. I initially got hooked on data viz based on Tufte’s book. When I looked for examples of criminologists discussing data viz you were the only one I found. That was sometime around 2010, so you had that chapter in the handbook of quantitative crim. You also had another article about drawing glyphs to illustrate life course transitions I was familiar with.

When I finished my classes at SUNY, I then worked at Troy as a crime analyst while finishing my dissertation. I doubt any of the coffee shops were the same from your time, but I did like walking over to Famous hotdogs for lunch every now and then.

Most of my work at the PD was making time series graphs and maps. No regression, so most of my stats training was not particularly useful. Even my mapping course I took focused on areal data analysis was not terribly relevant.

I tried to do similar projects to your glyph life-courses with interval censored crime data, but I was never really successful with that, they always ended up being too complicated with even moderately large crime datasets, see and for my attempts.

What was much more helpful was simply doing monitoring metrics over time, simple running means, and then I just inverted the PDF of the Poisson to give error bars, e.g. Then cases that were outside the error bands signified an anomalous pattern. In Troy there was an arrest of a single prolific person breaking into cars, and the trend went from a creeping 10 year high to a 10 year low instantly in those graphs.

So there again we have your work on the Poisson distribution and operations research in that JQC article. Also sometime in there I saw a comment you made on Andrew Gelman’s blog pointing to your work with error bands for BJS. Took that ‘fan chart’ idea later on and provided error bands for city level and USA level homicide trends, e.g. Most of popular discussion of large scale crime trends is misguided over-interpreting short term noise in my opinion.

So all my degrees are in criminal justice, but I have been focusing more on linear programming over time borrowing from operations researchers as well, I’ve found that taking outputs from a predictive model and then applying a decision analysis to specifically articulate strategies CJ agencies should take is much more fruitful than the typical way academic research is done.

Thank you again for sharing your story and best, Andy Wheeler

New paper out: Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims

A recent paper with Gio Circo, Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims, was published in Applied Spatial Analysis and Policy. In this work, me and Gio estimate the marginal effect that drive time distances to the nearest Level 1 trauma center have on the probability a victim dies of a gun shot wound, using open Philadelphia data.

If you do not have access to that published version, here is a pre-print version. (And you can always email me or Gio and ask for a copy.) Also because we use open data, we have posted the data and code used for the analysis. (Gio did most of the work!)

For a bit of the background on the project, Gio had another paper estimating a similar model using Detroit data. But Gio estimated those models with aggregate data. I was familiar with more detailed Philly shooting data, as I used it for an example hot spot cluster map in my GIS crime mapping class.

There are two benefits to leveraging micro data instead of the aggregated data. One is that you can incorporate micro level incident characteristics into the model. The other is that you can get the exact XY coordinates where the incident occurred. And using those exact coordinates we calculate drive time distances to the hospital, which offer a slight benefit in terms of leave-one-out cross-validated accuracy compared to Euclidean distances.

So in terms of incident level characteristics, the biggest factor in determining your probability of death is not the distance to the nearest hospital, but where you physically get shot on your body. Here is a marginal effect plot from our models, showing how the joint effect of injury location (as different colors) and the drive time distance impact the probability of death. So if you get shot in the head vs the torso, you have around a 30% jump in the probability of death from that gun shot wound. Or if you get shot in an extremity you have a very low probability of death as well.

But you can see from that the margins for drive times are not negligible. So if you are nearby a hospital and shot in the torso your probability of dying is around 20%, whereas if you are 30 minutes away your probability rises to around 30%. You can then use this to map out isochrone type survivability estimates over the city. This example map is if you get shot in the torso, and the probability of death based on the drive time distance to the nearest Level 1 trauma location.

Fortunately many shootings do not occur in the northern most parts of Philadelphia, here is a map of the number of shootings over the city for our sample.

You can subsequently use these models to either do hypothetical take a trauma center away or add a trauma center. So given the density of shootings and drive time distances, it might make sense for Philly to invest in a trauma center in the shooting hot spot in the Kensington area (northeast of Temple). (You could technically figure out an ‘optimal’ location given the distribution of shootings, but since you can’t just plop down a hospital wherever it would make more sense to do hypothetical investments in current hospitals.)

For a simplified example, imagine you had 100 shootings in the torso that were an average 20 minutes away. The average probability of death in that case is around 25% (so ~25 homicides). If you hypothetically have a location that is only 5 minutes away, the probability goes down to more like 20% (so ~20 homicides). So in that hypothetical, the distance margin would have prevented 5 deaths.

One future piece of research I would be interested in examining is pre-post Shotspotter. So in that article Jen Doleac is right in that the emipirical evidence for Shotspotter reducing shootings is pretty flimsy, but preventing mortality by getting to the scene faster may be one mechanism that ShotSpotter can justify its cost.

Discrete time survival models in python

Sorry in the advance for the long post! I’ve wanted to tackle a project on estimating discrete time survival models for awhile now, and may have a relevant project at work where I can use this. So have been crunching out some of this code I am going to share for the last two weeks.

I personally only have one example in my career of estimating discrete time models, I used discrete time models to estimate propensity scores in my demolitions and crime reduction paper (Wheeler et al., 2018), since the demolitions did not occur at all once, but happened over several years. (In that paper I estimated the discrete time models, and then did matches in random cohorts.)

But I was interested in discrete time survival models for one reason – they allow you to estimate very non-linear hazard functions that you cannot with traditional survival models. For Cox models, to do predictions you need to rely on a estimate of the baseline hazard function, and for parametric models (e.g. Weibull) they often can only have monotonic or flat functions (so can’t be low risk and then high risk in a short period). For a good reference about evaluating predictions for survival models, I suggest Haider et al. (2020), and for a general reference for discrete time survival models I suggest the little Sage green book by Paul Allison (Allison, 2014).

For traditional recidivism studies in criminology (e.g. after someone is paroled), I don’t believe the function to be too bumpy like this, so I don’t think prior studies are misleading (e.g. Denver, 2019). But I do think they are worth examining to see if that is the case. For another use case, for chronic offender based police predictions, I think individuals may have more bumpy risk profiles, e.g. you commit a crime and then lay low (so lower risk), or get victimized and may want retaliation (so high risk). A prior work I looked at a year horizon for offender predictions (Wheeler et al., 2019), so I wanted to extend that to shorter time intervals, but never quite got the chance. (Another benefit of discrete time models is that they can incorporate time varying factors no problem the way the model is set up.)

I have code illustrating discrete time models saved on github here. The data I use to illustrate the analysis is taken from Ruderman et al. (2015). This is recidivism for a fairly large cohort. (I don’t think discrete time makes much sense for small samples, you probably need 1000+ to even really consider it I would guess.)

The code ends up being too long to walk though in a blog post. So here are some quick notes/tables/plots, and I encourage you to go check out the github page to dive deeper if you want.

The Discrete Time Model Setup

The main thing to realize about the discrete time modeling set up is that you just turn your survival data problem into a format you can leverage logistic regression (or whatever binary prediction machine learning model you want). So if we have an original set of survival data that looks like:

ID Time Outcome
 A   4    1
 B   3    0

We then explode this dataset into a long format that looks like this:

ID Time Outcome
 A   1     0
 A   2     0
 A   3     0
 A   4     1
 B   1     0
 B   2     0
 B   3     0

So you can see ID A was exploded to 4 observations, and the Outcome variable is only set to 1 at the final time period. For person B, they are exploded 3 observations, but the outcome variable is always set to 0.

Then you model Outcome as a function of time and other covariates, which can be either constant per person or time varying. This then gets you a model that estimates the instant probability of death (or failure) in a particular time sliver. The way I think about it is like this – we can predict whether you will commit a crime sometime within the next week (the cumulative probability over the entire week), or within a particular sliver of time (the probability of committing a crime Friday at 10 pm). Discrete time models pick a sliver of time, e.g. Friday, and calculate the instant probability within that bin.

But then we don’t want to rely on the traditional binary metrics to evaluate this model – we will often want to go from the instant probabilities in a time sliver to the cumulative probabilities. You can take those model estimates though at aggregate them back up to examine over the weekly time horizon example though. So if we have predictions for a new person C that looked like this:

ID Time InstantProb
 A   1     0.2
 A   2     0.1
 A   3     0.3
 A   4     0.05

We could then calculate the cumulative probability of failure over these four time periods. So the failure in time period 1 is just 0.2. For time period 2 it is 1 - [(1-0.2)*(1-0.1)] = 0.28. You just then accumulate those individual specific probabilities into cumulative failure probabilities over particular time horizons, which you can then incorporate into cost-benefit analysis for how you will use those predictions in practice. For various metrics we will then examine not just the instant probability our model spits out, but also the cumulative probability of failure.

The main issue with these models is that when exploding the dataset it can result in large samples. So my initial sample of just over 13k observations, when I expand to observed weeks ends up being over 1 million observations. That is not a big deal though, I can still easily do whatever models I want with that data on my personal machine. Probably don’t need to worry about it for most statistical computing projects until maybe you are dealing with over 20 million observations I would bet for most out of the box desktop computers anymore.

Modeling Notes

In the github page the script prepares the dataset (transforming to the long format). The original Ruderman data has repeated events, but for simplicity I only take out the first events for individuals, which ends up being just over 13k observations. I then split this into a training dataset and a test dataset, and a set the test dataset to 3k cases.

My temporal unit of analysis I transform into weeks since release, and only examine the discrete time models up to 104 weeks (so two years). Here is a traditional KM plot based on the exploded discrete time training dataset.

But really what we are modeling in this set up is the instant hazard, not the cumulative hazard. So here is a plot of the instant probability of recidivism.

You can see that in the first week out, almost 1.4% of the individuals recidivate. There are ups and downs, but the instant probability continues to decrease and slightly flatten out out to 100 weeks. So you can see how over those two years we go from an original dataset of over 10k to around 3k due to censoring.

Part of the reason I was interested in examining discrete time models is that I was wondering if the instant hazard was bumpy and had some ups and downs when people are first exposed.

But this data appears fairly smooth, so in the end I end up fitting a logistic regression model with restricted cubic splines for time with knot locations at [4,10,20,40,60,80]. I also incorporate various interactions with the some of the time invariant covariates in the original Ruderman data (age at first arrest, male, overcrowding, concentrated disadvantage index, and offense category dummies).

I initially tried my GoTo machine learning models of random forests and XGBoost, but they performed quite poorly. Tree based models aren’t very well suited to estimate very tiny probabilities I am afraid. So that will need some more tinkering to see if I can use those machine learning models more effectively in this circumstance. I’m wondering if doing a different loss function makes sense (so do the loss based on the cumulative hazard instead of the instant). Here also I did not regularize the logit model, but with time varying factors that may make sense.

The Haider paper looks at the R MLTR package, which is similar to here but slightly different, in that they are modeling the cumulative hazard directly instead of the instant hazard. (So instead of chopping off the 1’s and the end of the vector, you keep padding them on for observations.) So in that case you want to enforce monotonic constraints on the time effect.

Checking Out Individual Predictions

The remaining sections in the blog post are all taken from the second script. So first, after you generate your predictions on the training data, you can then pull out a particular individual and check out our predictions for their cumulative survival probability based on our predictive model. The red line shows that this individual actually recidivated at 45 weeks, in which their cumulative risk was just above 20%.

The cumulative probability will never be super interesting though – in that even if you had a very wiggly instant hazard the cumulative hazard is always monotonically increasing. So if you check out the instant hazard this will show how a persons risk level varies over time.

So we can see here that person 39 had a predicted high risk when they are first released, but gradual decreases in a few steps over time. The way I have modeled this using restricted cubic splines it has to be smooth, but you could say incorporate dummy variables for the first 10 weeks, in which case this prediction could be quite bumpy.

Given this always shows monotonically decreasing hazard, you wouldn’t be able to exactly fit that function using parametric models, but they would be not too far off. So this dataset doesn’t appear to be a real great showcase of the utility of discrete time models!

But doing some plots of the instant hazards may be interesting to try to identify particular different risk profiles, or maybe even use some clustering (like group based trajectory models) to identify particular latent risk profiles. (It may be most people are smoothly decreasing, but some people have bumpier profiles.)

Evaluating Model Calibration

Haider et al. (2020) break down predictive metrics to evaluate survival models into two types: calibration is that the model predictions match actual cases, e.g. if my model says the probability of failure is 20%, does the data actually show failure in 20% of the cases. The other is discrimination, can I rank individuals as high risk to low risk, and do the high risk ones have the negative outcome more frequently.

While the Haider paper has various metrics, I am kind of confused how to do them in practice. My confusion mostly stems from the test dataset will ultimately have censoring in it as well, so the calibration metrics need to take this into account. Here are my attempts at a few plots that take on the task of checking model calibration.

First, I’ve previously discussed what I call a lift calibration chart. I adjust it here though to account for the fact that we have interval censoring, and I create ignorance bounds for the actual proportion of failures in the dataset.

This is for the full sample, which I expanded out and did calculations for up to 104 weeks for everyone. You can do a slice of the data though for a particular time period and check the same calibration. So here is an example checking calibration at one year out.

The earlier in time the smaller the ignorance bands will be (as there will be less censoring in sample). Here is what the created dataset looks like to illustrate how the ignorance bands are calculated.

The CumHazard column is my predicted line, which I break down into 20 bins for that yearly plot (so with 3000 training dataset observations, results in bins of 150 observations). Then you can see the LowTrue column (in Bin 1) signifies I observed 19 failures in that set of observations, but there ending up being a total of 27 observations censored in that bin, 46 - 19. So the actual proportion in the data could either be 19/150 (all censored never recidivate) or 46/150 (all of those who were censored would end up recidivating). I would suggest for notes on ignorance bounds like these (which also apply to ECDF type functions), Ferson et al. (2007).

I’d note that this is the same way you generate data for a Hosmer-Lemeshow test for logit models, but I don’t bother with the Chi-Square test. For large samples it will always reject, and small samples it may mean you just have low power, not that your model is well calibrated. So doing that stat test is a lose-lose IMO. But you can just make the plot to see whether your predictions are on the mark, or if they are low or high on average. Here we can see that they hug the lower ignorance band, so are not too bad. But may be a shade too low (more people recidivate than predicted).

This calibration is examining the probability, but another way to think about calibration here is calibrated in terms of time, e.g. I say something will happen in 30 weeks, does it actually happen in 30 weeks? Here is my attempt at a plot to check that out. Using the test dataset, I generate the usual KM estimate. Then based on the predicted probabilites, I generate simulated outcomes for individuals (here 99), and then plot the range of those outcomes on the same chart.

So here you can see that my predicted failure times are somewhat longer than observed in the data (simulation bands slightly below observed for the later time periods). These two charts are likely not in contradiction, the error bands in each show somewhat observe patterns, so they both hint at my model is conservative in assigning risk. But it is not too shabby in terms of calibration (you should have seen some of these plots when I was trying random forest and XGBoost models!).

I’m wondering offhand if I have some edge effects going on. So maybe even if I am only interested in examining a time horizon of two years, I should still tack on longer time periods for the initial models.

Both of these charts you can subset the data and look at the same chart, so here is an example table generated for simulations based on 332 test dataset females. Because the sample is smaller, the simulated bands are wider, so the observed KM cumulative hazard estimate appears well inside the bands here for the female subsample. (Probably because of less diagnostic ability to identify tiny bits off in the calibration.)

Evaluating Model Discrimination

The second way you might evaluate survival predictions is in terms of rankings, can I discriminate in my model between individuals who are high risk and who are low risk. One of the crazy things about these individual level survival curves is that they can cross! So imagine we had a set of two individuals and are looking at a horizon of four periods:

ID Time InstProb CumProb
A   1      0.1     0.1
A   2      0.1     0.19
A   3      0.1     0.271
A   4      0.1     0.3439

B   1      0.2     0.2
B   2      0.1     0.28
B   3      0.05    0.316
B   4      0.01    0.32284

So person B is at higher risk right away. So if we ranked these individuals for who was more likely to recidivate, ID B will be ranked higher for periods 1, 2 and 3. But by period 4, ID A is at higher risk in terms of their cumulative probability of recidivating.

The simplest metric to evaluate discrimination IMO is AUC (which is related to the concordance metric). And to do that you just do slices of particular weeks, and then calculate the AUC based on the cumulative failure probability estimate at that time period.

So you can see here that it is pretty meh – only AUC stats around 0.6 for my logit model. So better than the random 0.5, but not by much. Even though my model appears to be reasonably calibrated, it is nothing to brag to grandma about being able to identify people at different risk levels for recidivism, not matter the time horizon I am interested in.

For this estimate I just dropped censored observations, so I am not sure how to deal with them in this case. If you have suggestions or references let me know! But offhand I don’t think they are too off, the earlier time periods should have less censoring, but they are all pretty close in terms of the overall metric.

Future Stuff?

Besides seeing how others have dealt with censoring in their prediction metrics, another metric introduced in the Haider et al. (2020) paper is a Brier Score that is both a calibration and discrimination metric.

Also for folks interested in survival analysis in python, I suggest to check out statsmodel or the lifelines packages.