A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?

You may think this is a bit far-fetched to my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if also there is an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis, below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.

Frontmatter

Here I am using pulp to fit the linear programming, reading in the data, and I am making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.

############################################################
#FRONT MATTER 

import pandas as pd
import pulp
import os

os.chdir('D:\Dropbox\Dropbox\Documents\BLOG\TURF_Analysis\Analysis')

#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

#Need 27 total of match up simulated data
main = ['Steak',
        'Pizza',
        'FriedChicken',
        'BBQChicken',
        'GrilledSalmon',
        'FriedHaddock',
        'LemonHaddock',
        'Roast',
        'Burger',
        'Wings']

sides = ['CeaserSalad',
         'IcebergSalad',
         'TomatoSoup',
         'OnionSoup',
         'Peas',
         'BrusselSprouts',
         'GreenBeans',
         'Corn',
         'DeviledEggs',
         'Pickles']

desserts = ['ChoclateIceCream',
            'VanillaIceCream',
            'ChocChipCookie',
            'OatmealCookie',
            'Brownie',
            'Blondie',
            'CherryPie']

#Renaming columns
surv_data.columns = main + sides + desserts

#Replacing the likert scale data with 0/1
surv_data.replace([1,2,3],0,inplace=True)
surv_data.replace([4,5],1,inplace=True)

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']
############################################################

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.

############################################################
#TRADITIONAL TURF ANALYSIS

k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
    
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model
P.solve()

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )

print(picked) 

#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people
############################################################

One of the things I have done here is to create a ‘weight’ variable associated with each customer. So here I say all of the customers weights are all equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however consider the likert scale rating 1 to 5 directly, subtract 1 and divide by 4. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_i decision variable to a continuous value, and then maybe cap it at 1 (so you can partially reach customers).

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.

############################################################
#CONSTRAINTS ON ITEMS SELECTED

#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides (but could have 0)
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
P2.solve()
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
print(picked2)
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]
############################################################

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc, but each have different costs. So instead of the constraint Prod_j == k where you select so many products, you have the constraint Prod_j*Cost_j <= Budget, where each product is associated with a particular cost.

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach while minimizing cost). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people). Our formula would then be below (where I make up slightly different costs for buffet each of the buffet options).

############################################################
#MINIMIZE COST

#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results
Pmin.solve()

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (n+1,j,cost_prod[j]) )
        cost += cost_prod[j]
print(picked)
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5
############################################################

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we need at a minimum to serve only 3 items to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate constraints for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I’m wondering as well if there is a way to incorporate marginal benefits of sales into this as well, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won’t necessarily buy every flavor if I visit the shop). Similar for marketing adverts some forms may have better reach, but may have worse conversion rates.

CrimRxiv, Alt-Journal Contributions, and Mike Maltz’s Retrospective

As I’m sure followers of mine know, I am a big proponent of posting pre-prints. Spearheaded by Scott Jacques, he has started a specifically criminology focused pre-print server title CrimRxiv. It is still in beta but anyone can contribute a paper if they want.

One of the things me and Scott have been jamming about is how to leverage crimrxiv to make a journal that not only takes advantage of all the goodies on the internet, such as being able to embed interactive graphics or other rich media directly in a journal articles. But to really widen the scope of what ‘counts’ in terms of scholarly contribution. Why can’t things like a cool app, or a really good video lecture you edited, or a blog post illustrating code be put on the same level with journal articles?

Part of the reason I am writing this blog post is that I saw Michael Maltz recently publish a retrospective on his career on Academia.edu. This isn’t a typical journal article, but despite that there is no reason why you shouldn’t share such pieces. So I was able to convince Mike to post A Retrospective Look at My Professional Life to crimrxiv. When he first posted it on Academia.edu here was my response on how Mike (despite never having crossed paths) has influenced my career.


Hi Michael and thank you for sharing,

I’ve followed your work since a grad student at Albany. I initially got hooked on data viz based on Tufte’s book. When I looked for examples of criminologists discussing data viz you were the only one I found. That was sometime around 2010, so you had that chapter in the handbook of quantitative crim. You also had another article about drawing glyphs to illustrate life course transitions I was familiar with.

When I finished my classes at SUNY, I then worked at Troy as a crime analyst while finishing my dissertation. I doubt any of the coffee shops were the same from your time, but I did like walking over to Famous hotdogs for lunch every now and then.

Most of my work at the PD was making time series graphs and maps. No regression, so most of my stats training was not particularly useful. Even my mapping course I took focused on areal data analysis was not terribly relevant.

I tried to do similar projects to your glyph life-courses with interval censored crime data, but I was never really successful with that, they always ended up being too complicated with even moderately large crime datasets, see https://andrewpwheeler.com/2013/02/28/interval-graph-for-viz-temporal-overlap-in-crime-events/ and https://andrewpwheeler.com/2014/10/02/stacking-intervals/ for my attempts.

What was much more helpful was simply doing monitoring metrics over time, simple running means, and then I just inverted the PDF of the Poisson to give error bars, e.g. https://andrewpwheeler.com/2016/06/23/weekly-and-monthly-graphs-for-monitoring-crime-patterns-spss/. Then cases that were outside the error bands signified an anomalous pattern. In Troy there was an arrest of a single prolific person breaking into cars, and the trend went from a creeping 10 year high to a 10 year low instantly in those graphs.

So there again we have your work on the Poisson distribution and operations research in that JQC article. Also sometime in there I saw a comment you made on Andrew Gelman’s blog pointing to your work with error bands for BJS. Took that ‘fan chart’ idea later on and provided error bands for city level and USA level homicide trends, e.g. https://apwheele.github.io/MathPosts/FanChart_NewOrleans.html. Most of popular discussion of large scale crime trends is misguided over-interpreting short term noise in my opinion.

So all my degrees are in criminal justice, but I have been focusing more on linear programming over time borrowing from operations researchers as well, https://andrewpwheeler.com/2020/05/29/an-intro-to-linear-programming-for-criminologists/. I’ve found that taking outputs from a predictive model and then applying a decision analysis to specifically articulate strategies CJ agencies should take is much more fruitful than the typical way academic research is done.

Thank you again for sharing your story and best, Andy Wheeler

New paper out: Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims

A recent paper with Gio Circo, Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims, was published in Applied Spatial Analysis and Policy. In this work, me and Gio estimate the marginal effect that drive time distances to the nearest Level 1 trauma center have on the probability a victim dies of a gun shot wound, using open Philadelphia data.

If you do not have access to that published version, here is a pre-print version. (And you can always email me or Gio and ask for a copy.) Also because we use open data, we have posted the data and code used for the analysis. (Gio did most of the work!)

For a bit of the background on the project, Gio had another paper estimating a similar model using Detroit data. But Gio estimated those models with aggregate data. I was familiar with more detailed Philly shooting data, as I used it for an example hot spot cluster map in my GIS crime mapping class.

There are two benefits to leveraging micro data instead of the aggregated data. One is that you can incorporate micro level incident characteristics into the model. The other is that you can get the exact XY coordinates where the incident occurred. And using those exact coordinates we calculate drive time distances to the hospital, which offer a slight benefit in terms of leave-one-out cross-validated accuracy compared to Euclidean distances.

So in terms of incident level characteristics, the biggest factor in determining your probability of death is not the distance to the nearest hospital, but where you physically get shot on your body. Here is a marginal effect plot from our models, showing how the joint effect of injury location (as different colors) and the drive time distance impact the probability of death. So if you get shot in the head vs the torso, you have around a 30% jump in the probability of death from that gun shot wound. Or if you get shot in an extremity you have a very low probability of death as well.

But you can see from that the margins for drive times are not negligible. So if you are nearby a hospital and shot in the torso your probability of dying is around 20%, whereas if you are 30 minutes away your probability rises to around 30%. You can then use this to map out isochrone type survivability estimates over the city. This example map is if you get shot in the torso, and the probability of death based on the drive time distance to the nearest Level 1 trauma location.

Fortunately many shootings do not occur in the northern most parts of Philadelphia, here is a map of the number of shootings over the city for our sample.

You can subsequently use these models to either do hypothetical take a trauma center away or add a trauma center. So given the density of shootings and drive time distances, it might make sense for Philly to invest in a trauma center in the shooting hot spot in the Kensington area (northeast of Temple). (You could technically figure out an ‘optimal’ location given the distribution of shootings, but since you can’t just plop down a hospital wherever it would make more sense to do hypothetical investments in current hospitals.)

For a simplified example, imagine you had 100 shootings in the torso that were an average 20 minutes away. The average probability of death in that case is around 25% (so ~25 homicides). If you hypothetically have a location that is only 5 minutes away, the probability goes down to more like 20% (so ~20 homicides). So in that hypothetical, the distance margin would have prevented 5 deaths.

One future piece of research I would be interested in examining is pre-post Shotspotter. So in that article Jen Doleac is right in that the emipirical evidence for Shotspotter reducing shootings is pretty flimsy, but preventing mortality by getting to the scene faster may be one mechanism that ShotSpotter can justify its cost.

Discrete time survival models in python

Sorry in the advance for the long post! I’ve wanted to tackle a project on estimating discrete time survival models for awhile now, and may have a relevant project at work where I can use this. So have been crunching out some of this code I am going to share for the last two weeks.

I personally only have one example in my career of estimating discrete time models, I used discrete time models to estimate propensity scores in my demolitions and crime reduction paper (Wheeler et al., 2018), since the demolitions did not occur at all once, but happened over several years. (In that paper I estimated the discrete time models, and then did matches in random cohorts.)

But I was interested in discrete time survival models for one reason – they allow you to estimate very non-linear hazard functions that you cannot with traditional survival models. For Cox models, to do predictions you need to rely on a estimate of the baseline hazard function, and for parametric models (e.g. Weibull) they often can only have monotonic or flat functions (so can’t be low risk and then high risk in a short period). For a good reference about evaluating predictions for survival models, I suggest Haider et al. (2020), and for a general reference for discrete time survival models I suggest the little Sage green book by Paul Allison (Allison, 2014).

For traditional recidivism studies in criminology (e.g. after someone is paroled), I don’t believe the function to be too bumpy like this, so I don’t think prior studies are misleading (e.g. Denver, 2019). But I do think they are worth examining to see if that is the case. For another use case, for chronic offender based police predictions, I think individuals may have more bumpy risk profiles, e.g. you commit a crime and then lay low (so lower risk), or get victimized and may want retaliation (so high risk). A prior work I looked at a year horizon for offender predictions (Wheeler et al., 2019), so I wanted to extend that to shorter time intervals, but never quite got the chance. (Another benefit of discrete time models is that they can incorporate time varying factors no problem the way the model is set up.)

I have code illustrating discrete time models saved on github here. The data I use to illustrate the analysis is taken from Ruderman et al. (2015). This is recidivism for a fairly large cohort. (I don’t think discrete time makes much sense for small samples, you probably need 1000+ to even really consider it I would guess.)

The code ends up being too long to walk though in a blog post. So here are some quick notes/tables/plots, and I encourage you to go check out the github page to dive deeper if you want.

The Discrete Time Model Setup

The main thing to realize about the discrete time modeling set up is that you just turn your survival data problem into a format you can leverage logistic regression (or whatever binary prediction machine learning model you want). So if we have an original set of survival data that looks like:

ID Time Outcome
 A   4    1
 B   3    0

We then explode this dataset into a long format that looks like this:

ID Time Outcome
 A   1     0
 A   2     0
 A   3     0
 A   4     1
 B   1     0
 B   2     0
 B   3     0

So you can see ID A was exploded to 4 observations, and the Outcome variable is only set to 1 at the final time period. For person B, they are exploded 3 observations, but the outcome variable is always set to 0.

Then you model Outcome as a function of time and other covariates, which can be either constant per person or time varying. This then gets you a model that estimates the instant probability of death (or failure) in a particular time sliver. The way I think about it is like this – we can predict whether you will commit a crime sometime within the next week (the cumulative probability over the entire week), or within a particular sliver of time (the probability of committing a crime Friday at 10 pm). Discrete time models pick a sliver of time, e.g. Friday, and calculate the instant probability within that bin.

But then we don’t want to rely on the traditional binary metrics to evaluate this model – we will often want to go from the instant probabilities in a time sliver to the cumulative probabilities. You can take those model estimates though at aggregate them back up to examine over the weekly time horizon example though. So if we have predictions for a new person C that looked like this:

ID Time InstantProb
 A   1     0.2
 A   2     0.1
 A   3     0.3
 A   4     0.05

We could then calculate the cumulative probability of failure over these four time periods. So the failure in time period 1 is just 0.2. For time period 2 it is 1 - [(1-0.2)*(1-0.1)] = 0.28. You just then accumulate those individual specific probabilities into cumulative failure probabilities over particular time horizons, which you can then incorporate into cost-benefit analysis for how you will use those predictions in practice. For various metrics we will then examine not just the instant probability our model spits out, but also the cumulative probability of failure.

The main issue with these models is that when exploding the dataset it can result in large samples. So my initial sample of just over 13k observations, when I expand to observed weeks ends up being over 1 million observations. That is not a big deal though, I can still easily do whatever models I want with that data on my personal machine. Probably don’t need to worry about it for most statistical computing projects until maybe you are dealing with over 20 million observations I would bet for most out of the box desktop computers anymore.

Modeling Notes

In the github page the script 00_PrepData.py prepares the dataset (transforming to the long format). The original Ruderman data has repeated events, but for simplicity I only take out the first events for individuals, which ends up being just over 13k observations. I then split this into a training dataset and a test dataset, and a set the test dataset to 3k cases.

My temporal unit of analysis I transform into weeks since release, and only examine the discrete time models up to 104 weeks (so two years). Here is a traditional KM plot based on the exploded discrete time training dataset.

But really what we are modeling in this set up is the instant hazard, not the cumulative hazard. So here is a plot of the instant probability of recidivism.

You can see that in the first week out, almost 1.4% of the individuals recidivate. There are ups and downs, but the instant probability continues to decrease and slightly flatten out out to 100 weeks. So you can see how over those two years we go from an original dataset of over 10k to around 3k due to censoring.

Part of the reason I was interested in examining discrete time models is that I was wondering if the instant hazard was bumpy and had some ups and downs when people are first exposed.

But this data appears fairly smooth, so in the end I end up fitting a logistic regression model with restricted cubic splines for time with knot locations at [4,10,20,40,60,80]. I also incorporate various interactions with the some of the time invariant covariates in the original Ruderman data (age at first arrest, male, overcrowding, concentrated disadvantage index, and offense category dummies).

I initially tried my GoTo machine learning models of random forests and XGBoost, but they performed quite poorly. Tree based models aren’t very well suited to estimate very tiny probabilities I am afraid. So that will need some more tinkering to see if I can use those machine learning models more effectively in this circumstance. I’m wondering if doing a different loss function makes sense (so do the loss based on the cumulative hazard instead of the instant). Here also I did not regularize the logit model, but with time varying factors that may make sense.

The Haider paper looks at the R MLTR package, which is similar to here but slightly different, in that they are modeling the cumulative hazard directly instead of the instant hazard. (So instead of chopping off the 1’s and the end of the vector, you keep padding them on for observations.) So in that case you want to enforce monotonic constraints on the time effect.

Checking Out Individual Predictions

The remaining sections in the blog post are all taken from the second 01_EvalTime.py script. So first, after you generate your predictions on the training data, you can then pull out a particular individual and check out our predictions for their cumulative survival probability based on our predictive model. The red line shows that this individual actually recidivated at 45 weeks, in which their cumulative risk was just above 20%.

The cumulative probability will never be super interesting though – in that even if you had a very wiggly instant hazard the cumulative hazard is always monotonically increasing. So if you check out the instant hazard this will show how a persons risk level varies over time.

So we can see here that person 39 had a predicted high risk when they are first released, but gradual decreases in a few steps over time. The way I have modeled this using restricted cubic splines it has to be smooth, but you could say incorporate dummy variables for the first 10 weeks, in which case this prediction could be quite bumpy.

Given this always shows monotonically decreasing hazard, you wouldn’t be able to exactly fit that function using parametric models, but they would be not too far off. So this dataset doesn’t appear to be a real great showcase of the utility of discrete time models!

But doing some plots of the instant hazards may be interesting to try to identify particular different risk profiles, or maybe even use some clustering (like group based trajectory models) to identify particular latent risk profiles. (It may be most people are smoothly decreasing, but some people have bumpier profiles.)

Evaluating Model Calibration

Haider et al. (2020) break down predictive metrics to evaluate survival models into two types: calibration is that the model predictions match actual cases, e.g. if my model says the probability of failure is 20%, does the data actually show failure in 20% of the cases. The other is discrimination, can I rank individuals as high risk to low risk, and do the high risk ones have the negative outcome more frequently.

While the Haider paper has various metrics, I am kind of confused how to do them in practice. My confusion mostly stems from the test dataset will ultimately have censoring in it as well, so the calibration metrics need to take this into account. Here are my attempts at a few plots that take on the task of checking model calibration.

First, I’ve previously discussed what I call a lift calibration chart. I adjust it here though to account for the fact that we have interval censoring, and I create ignorance bounds for the actual proportion of failures in the dataset.

This is for the full sample, which I expanded out and did calculations for up to 104 weeks for everyone. You can do a slice of the data though for a particular time period and check the same calibration. So here is an example checking calibration at one year out.

The earlier in time the smaller the ignorance bands will be (as there will be less censoring in sample). Here is what the created dataset looks like to illustrate how the ignorance bands are calculated.

The CumHazard column is my predicted line, which I break down into 20 bins for that yearly plot (so with 3000 training dataset observations, results in bins of 150 observations). Then you can see the LowTrue column (in Bin 1) signifies I observed 19 failures in that set of observations, but there ending up being a total of 27 observations censored in that bin, 46 - 19. So the actual proportion in the data could either be 19/150 (all censored never recidivate) or 46/150 (all of those who were censored would end up recidivating). I would suggest for notes on ignorance bounds like these (which also apply to ECDF type functions), Ferson et al. (2007).

I’d note that this is the same way you generate data for a Hosmer-Lemeshow test for logit models, but I don’t bother with the Chi-Square test. For large samples it will always reject, and small samples it may mean you just have low power, not that your model is well calibrated. So doing that stat test is a lose-lose IMO. But you can just make the plot to see whether your predictions are on the mark, or if they are low or high on average. Here we can see that they hug the lower ignorance band, so are not too bad. But may be a shade too low (more people recidivate than predicted).

This calibration is examining the probability, but another way to think about calibration here is calibrated in terms of time, e.g. I say something will happen in 30 weeks, does it actually happen in 30 weeks? Here is my attempt at a plot to check that out. Using the test dataset, I generate the usual KM estimate. Then based on the predicted probabilites, I generate simulated outcomes for individuals (here 99), and then plot the range of those outcomes on the same chart.

So here you can see that my predicted failure times are somewhat longer than observed in the data (simulation bands slightly below observed for the later time periods). These two charts are likely not in contradiction, the error bands in each show somewhat observe patterns, so they both hint at my model is conservative in assigning risk. But it is not too shabby in terms of calibration (you should have seen some of these plots when I was trying random forest and XGBoost models!).

I’m wondering offhand if I have some edge effects going on. So maybe even if I am only interested in examining a time horizon of two years, I should still tack on longer time periods for the initial models.

Both of these charts you can subset the data and look at the same chart, so here is an example table generated for simulations based on 332 test dataset females. Because the sample is smaller, the simulated bands are wider, so the observed KM cumulative hazard estimate appears well inside the bands here for the female subsample. (Probably because of less diagnostic ability to identify tiny bits off in the calibration.)

Evaluating Model Discrimination

The second way you might evaluate survival predictions is in terms of rankings, can I discriminate in my model between individuals who are high risk and who are low risk. One of the crazy things about these individual level survival curves is that they can cross! So imagine we had a set of two individuals and are looking at a horizon of four periods:

ID Time InstProb CumProb
A   1      0.1     0.1
A   2      0.1     0.19
A   3      0.1     0.271
A   4      0.1     0.3439

B   1      0.2     0.2
B   2      0.1     0.28
B   3      0.05    0.316
B   4      0.01    0.32284

So person B is at higher risk right away. So if we ranked these individuals for who was more likely to recidivate, ID B will be ranked higher for periods 1, 2 and 3. But by period 4, ID A is at higher risk in terms of their cumulative probability of recidivating.

The simplest metric to evaluate discrimination IMO is AUC (which is related to the concordance metric). And to do that you just do slices of particular weeks, and then calculate the AUC based on the cumulative failure probability estimate at that time period.

So you can see here that it is pretty meh – only AUC stats around 0.6 for my logit model. So better than the random 0.5, but not by much. Even though my model appears to be reasonably calibrated, it is nothing to brag to grandma about being able to identify people at different risk levels for recidivism, not matter the time horizon I am interested in.

For this estimate I just dropped censored observations, so I am not sure how to deal with them in this case. If you have suggestions or references let me know! But offhand I don’t think they are too off, the earlier time periods should have less censoring, but they are all pretty close in terms of the overall metric.

Future Stuff?

Besides seeing how others have dealt with censoring in their prediction metrics, another metric introduced in the Haider et al. (2020) paper is a Brier Score that is both a calibration and discrimination metric.

Also for folks interested in survival analysis in python, I suggest to check out statsmodel or the lifelines packages.

Citations

Making smoothed scatterplots in python

The other day I made a blog post on my notes on making scatterplots in matplotlib. One big chunk of why you want to make scatterplots though is if you are interested in a predictive relationship. Typically you want to look at the conditional value of the Y variable based on the X variable. Here are some example exploratory data analysis plots to accomplish that task in python.

I have posted the code to follow along on github here, in particular smooth.py has the functions of interest, and below I have various examples (that are saved in the Examples_Conditional.py file).

Data Prep

First to get started, I am importing my libraries and loading up some of the data from my dissertation on crime in DC at street units. My functions are in the smooth set of code. Also I change the default matplotlib theme using smooth.change_theme(). Only difference from my prior posts is I don’t have gridlines by default here (they can be a bit busy).

#################################
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import os
import sys

mydir = r'D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\Smooth'
data_loc = r'https://dl.dropbox.com/s/79ma3ldoup1bkw6/DC_CrimeData.csv?dl=0'
os.chdir(mydir)

#My functions
sys.path.append(mydir)
import smooth
smooth.change_theme()

#Dissertation dataset, can read from dropbox
DC_crime = pd.read_csv(data_loc)
#################################

Binned Conditional Plots

The first set of examples, I bin the data and estimate the conditional means and standard deviations. So here in this example I estimate E[Y | X = 0], E[Y | X = 1], etc, where Y is the total number of part 1 crimes and x is the total number of alcohol licenses on the street unit (e.g. bars, liquor stores, or conv. stores that sell beer).

The function name is mean_spike, and you pass in at a minimum the dataframe, x variable, and y variable. I by default plot the spikes as +/- 2 standard deviations, but you can set it via the mult argument.

####################
#Example binning and making mean/std dev spike plots

smooth.mean_spike(DC_crime,'TotalLic','TotalCrime')

mean_lic = smooth.mean_spike(DC_crime,'TotalLic','TotalCrime',
                             plot=False,ret_data=True)
####################

This example works out because licenses are just whole numbers, so it can be binned. You can pass in any X variable that can be binned in the end. So you could pass in a string for the X variable. If you don’t like the resulting format of the plot though, you can just pass plot=False,ret_data=True for arguments, and you get the aggregated data that I use to build the plots in the end.

mean_lic = smooth.mean_spike(DC_crime,'TotalLic','TotalCrime',
                             plot=False,ret_data=True)

Another example I am frequently interested in is proportions and confidence intervals. Here it uses exact binomial confidence intervals at the 99% confidence level. Here I clip the burglary data to 0/1 values and then estimate proportions.

####################
#Example with proportion confidence interval spike plots

DC_crime['BurgClip'] = DC_crime['OffN3'].clip(0,1)
smooth.prop_spike(DC_crime,'TotalLic','BurgClip')

####################

A few things to note about this is I clip out bins with only 1 observation in them for both of these plots. I also do not have an argument to save the plot. This is because I typically only use these for exploratory data analysis, it is pretty rare I use these plots in a final presentation or paper.

I will need to update these in the future to jitter the data slightly to be able to superimpose the original data observations. The next plots are a bit easier to show that though.

Restricted Cubic Spline Plots

Binning like I did prior works out well when you have only a few bins of data. If you have continuous inputs though it is tougher. In that case, typically what I want to do is estimate a functional relationship in a regression equation, e.g. Y ~ f(x), where f(x) is pretty flexible to identify potential non-linear relationships.

Many analysts are taught the loess linear smoother for this. But I do not like loess very much, it is often both locally too wiggly and globally too smooth in my experience, and the weighting function has no really good default.

Another popular choice is to use generalized additive model smoothers. My experience with these (in R) is better than loess, but they IMO tend to be too aggressive, and identify overly complicated functions by default.

My favorite approach to this is actually then from Frank Harrell’s regression modeling strategies. Just pick a regular set of restricted cubic splines along your data. It is arbitrary where to set the knot locations for the splines, but my experience is they are very robust (so chaning the knot locations only tends to change the estimated function form by a tiny bit).

I have class notes on restricted cubic splines I think are a nice introduction. First, I am going to make the same dataset from my class notes, the US violent crime rate from 85 through 2010.

years = pd.Series(list(range(26)))
vcr = [1881.3,
       1995.2,
       2036.1,
       2217.6,
       2299.9,
       2383.6,
       2318.2,
       2163.7,
       2089.8,
       1860.9,
       1557.8,
       1344.2,
       1268.4,
       1167.4,
       1062.6,
        945.2,
        927.5,
        789.6,
        734.1,
        687.4,
        673.1,
        637.9,
        613.8,
        580.3,
        551.8,
        593.1]

yr_df = pd.DataFrame(zip(years,years+1985,vcr), columns=['y1','years','vcr'])

I have a function that allows you to append the spline basis to a dataframe. If you don’t pass in a data argument, in returns a dataframe of the basis functions.

#Can append rcs basis to dataframe
kn = [3.0,7.0,12.0,21.0]
smooth.rcs(years,knots=kn,stub='S',data=yr_df)

I also have in the code set Harrell’s suggested knot locations for the data. This ranges from 3 to 7 knots (it will through an error if you pass a number not in that range). This here suggests the locations [1.25, 8.75, 16.25, 23.75].

#If you want to use Harrell's rules to suggest knot locations
smooth.sug_knots(years,4)

Note if you have integer data here these rules don’t work out so well (can have redundant suggested knot locations). So Harell’s defaults don’t work with my alcohol license data. But it is one of the reasons I like these though, I just pick regular locations along the X data and they tend to work well. So here is a regression plot passing in those knot locations kn = [3.0,7.0,12.0,21.0] I defined a few paragraphs ago, and the plot does a few vertical guides to show the knot locations.

#RCS plot
smooth.plot_rcs(yr_df,'y1','vcr',knots=kn)

Note that the error bands in the plot are confidence intervals around the mean, not prediction intervals. One of the nice things though about this under the hood, I used statsmodels glm interface, so if you want you can change the underlying link function to Poisson (I am going back to my DC crime data here), you just pass it in the fam argument:

#Can pass in a family argument for logit/Poisson models
smooth.plot_rcs(DC_crime,'TotalLic','TotalCrime', knots=[3,7,10,15],
                fam=sm.families.Poisson(), marker_size=12)

This is a really great example for the utility of splines. I will show later, but a linear Poisson model for the alcohol license effect extrapolates very poorly and ends up being explosive. Here though the larger values the conditional effect fits right into the observed data. (And I swear I did not fiddle with the knot locations, there are just what I picked out offhand to spread them out on the X axis.)

And if you want to do a logistic regression:

smooth.plot_rcs(DC_crime,'TotalLic','BurgClip', knots=[3,7,10,15],
                fam=sm.families.Binomial(),marker_alpha=0)

I’m not sure how to do this in a way you can get prediction intervals (I know how to do it for Gaussian models, but not for the other glm families, prediction intervals probably don’t make sense for binomial data anyway). But one thing I could expand on in the future is to do quantile regression instead of glm models.

Smooth Plots by Group

Sometimes you want to do the smoothed regression plots with interactions per groups. I have two helper functions to do this. One is group_rcs_plot. Here I use the good old iris data to illustrate, which I will explain why in a second.

#Superimposing rcs on the same plot
iris = sns.load_dataset('iris')
smooth.group_rcs_plot(iris,'sepal_length','sepal_width',
               'species',colors=None,num_knots=3)

If you pass in the num_knots argument, the knot locations are different for each subgroup of data (which I like as a default). If you pass in the knots argument and the locations, they are the same though for each subgroup.

Note that the way I estimate the models here I estimate three different models on the subsetted data frame, I do not estimate a stacked model with group interactions. So the error bands will be a bit wider than estimating the stacked model.

Sometimes superimposing many different groups is tough to visualize. So then a good option is to make a set of small multiple plots. To help with this, I’ve made a function loc_error, to pipe into seaborn’s small multiple set up:

#Small multiple example
g = sns.FacetGrid(iris, col='species',col_wrap=2)
g.map_dataframe(smooth.loc_error, x='sepal_length', y='sepal_width', num_knots=3)
g.set_axis_labels("Sepal Length", "Sepal Width")

And here you can see that the not locations are different for each subset, and this plot by default includes the original observations.

Using the Formula Interface for Plots

Finally, I’ve been experimenting a bit with using the input in a formula interface, more similar to the way ggplot in R allows you to do this. So this is a new function, plot_form, and here is an example Poisson linear model:

smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ TotalLic',
                 fam=sm.families.Poisson(), marker_size=12)

You can see the explosive effect I talked about, which is common for Poisson/negative binomial models.

Here with the formula interface you can do other things, such as a polynomial regression:

#Can do polynomial terms
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ TotalLic + TotalLic**2 + TotalLic**3',
                 fam=sm.families.Poisson(), marker_size=12)

Which here ends up being almost indistinguishable from the linear terms. You can do other smoothers that are available in the patsy library as well, here are bsplines:

#Can do other smoothers
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ bs(TotalLic,df=4,degree=3)',
                 fam=sm.families.Poisson(), marker_size=12)

I don’t really have a good reason to prefer restricted cubic splines to bsplines, I am just more familiar with restricted cubic splines (and this plot does not illustrate the knot locations that were by default chosen, although you could pass in knot locations to the bs function).

You can also do other transformations of the x variable. So here if you take the square root of the total number of licenses helps with the explosive effect somewhat:

#Can do transforms of the X variable
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ np.sqrt(TotalLic)',
                 fam=sm.families.Poisson(), marker_size=12)
             

In the prior blog post about explosive Poisson models I also showed a broken stick type model if you wanted to log the x variable but it has zero values.

#Can do multiple transforms of the X variable
smooth.plot_form(data=DC_crime,x='TotalLic',y='TotalCrime',
                 form='TotalCrime ~ np.log(TotalLic.clip(1)) + I(TotalLic==0)',
                 fam=sm.families.Poisson(), marker_size=12)

Technically this “works” if you transform the Y variable as well, but the resulting plot is misleading, and the prediction interval is for the transformed variable. E.g. if you pass a formula 'np.log(TotalCrime+1) ~ TotalLic', you would need to exponentiate the the predictions and subtract 1 to get back to the original scale (and then the line won’t be the mean anymore, but the confidence intervals are OK).

I will need to see if I can figure out patsy and sympy to be able to do the inverse transformation to even do that. That type of transform to the y variable directly probably only makes sense for linear models, and then I would also maybe need to do a Duan type smearing estimate to get the mean effect right.

Making aoristic density maps in R

I saw Jerry the other day made/updated an R package to do aoristic analysis. A nice part of this is that it returns the weights breakdown for individual cases, which you can then make maps of. My goto hot spot map for data visualization, kernel density maps, are a bit tough to work with weighted data though in R (tough is maybe not the right word, to use ggplot it takes a bit of work leveraging other packages). So here are some notes on that.

I have provided the data/code here. It is burglaries in Dallas, specifically I filter out just for business burglaries.

R Code Snippet

First, for my front end I load the libraries I will be using, and change the working directory to where my data is located.

############################
library(aoristic) #aoristic analysis 
library(rgdal)    #importing spatial data
library(spatstat) #weighted kde
library(raster)   #manipulate raster object
library(ggplot2)  #for contour graphs
library(sf)       #easier to plot sf objects

my_dir <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\aoristic_maps_R\\data_analysis"
setwd(my_dir)
############################

Next I just have one user defined function, this takes an input polygon (the polygon that defines the borders of Dallas here), and returns a raster grid covering the bounding box. It also have an extra data field, to say whether the grid cell is inside/outside of the boundary. (This is mostly convenient when creating an RTM style dataset to make all the features conform to the same grid cells.)

###########################
#Data Manipulation Functions

#B is border, g is size of grid cell on one side
BaseRaster <- function(b,g){
    base_raster <- raster(ext = extent(b), res=g)
    projection(base_raster) <- crs(b)
    mask_raster <- rasterize(b, base_raster, getCover=TRUE) #percentage of cover, 0 is outside
    return(mask_raster)
}
###########################

The next part I grab the datasets I will be using, a boundary file for Dallas (in which I chopped off the Lochs, so will not be doing an analysis of boat house burglaries today), and then the crime data. R I believe you always have to convert date-times when reading from a CSV (it never smartly infers that a column is date/time). And then I do some other data fiddling – Jerry has a nice function to check and make sure the date/times are all in order, and then I get rid of points outside of Dallas using the sp over function. Finally the dataset is for both residential/commercial, but I just look at the commercial burglaries here.

###########################
#Get the datasets

#Geo data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
base_Dallas <- BaseRaster(b=boundary,g=200) 
base_df <- as.data.frame(base_Dallas,long=TRUE,xy=TRUE)

#Crime Data
crime_dat <- read.csv('Burglary_Dallas.csv', stringsAsFactors=FALSE)
#prepping time fields
crime_dat$Beg <- as.POSIXct(crime_dat$StartingDateTime, format="%m/%d/%Y %H:%M:%OS")
crime_dat$End <- as.POSIXct(crime_dat$EndingDateTime, format="%m/%d/%Y %H:%M:%OS")

#cleaning up data
aor_check <- aoristic.datacheck(crime_dat, 'XCoordinate', 'YCoordinate', 'Beg', 'End')
coordinates(crime_dat) <- crime_dat[,c('XCoordinate', 'YCoordinate')]
crs(crime_dat) <- crs(boundary)
over_check <- over(crime_dat, boundary)
keep_rows <- (aor_check$aoristic_datacheck == 0) & (!is.na(over_check$city))
crime_dat_clean <- crime_dat[keep_rows,]

#only look at business burgs to make it go abit faster
busi_burgs <- crime_dat_clean[ crime_dat_clean$UCROffense == 'BURGLARY-BUSINESS', ]
###########################

The next part preps the aoristic weights. First, the aoristic.df function is from Jerry’s aoristic package. It returns the weights broken down by 168 hours per day of the week. Here I then just collapse across the weekdays into the same hour, which to do that is simple, just add up the weights.

After that it is some more geographic data munging using the spatstat package to do the heavy lifting for the weighted kernel density estimate, and then stuffing the result back into another data frame. My bandwidth here, 3000 feet, is a bit large but makes nicer looking maps. If you do this smaller you will have a more bumpy and localized hot spots in the kernel density estimate.

###########################
#aoristic weights

#This takes like a minute
res_weights <- aoristic.df(busi_burgs@data, 'XCoordinate', 'YCoordinate', 'Beg', 'End')

#Binning into same hourly bins
for (i in 1:24){
    cols <- (0:6*24)+i+5
    lab <- paste0("Hour",i)
    res_weights[,c(lab)] <- rowSums(res_weights[,cols])
}

#Prepping the spatstat junk I need
peval <- rasterToPoints(base_Dallas)[,1:2]
spWin <- as.owin(as.data.frame(peval))
sp_ppp <- as.ppp(res_weights[,c('x_lon','y_lat')],W=spWin) #spp point pattern object

#Creating a dataframe with all of the weighted KDE
Hour_Labs <- paste0("Hour",1:24)

for (h in Hour_Labs){
  sp_den <- density.ppp(sp_ppp,weights=res_weights[,c(h)],
                        sigma=3000,
                        edge=FALSE,warnings=FALSE)
  sp_dat <- as.data.frame(sp_den)
  kd_raster <- rasterFromXYZ(sp_dat,res=res(base_Dallas),crs=crs(base_Dallas))
  base_df[,c(h)] <- as.data.frame(kd_raster,long=TRUE)$value
}
###########################

If you are following along, you may be wondering why all the hassle? It is partly because I want to use ggplot to make maps, but for its geom_contour it does not except weights, so I need to do the data manipulation myself to supply ggplot the weighted data in the proper format.

First I turn my Dallas boundary into a simple feature sf object, then I create my filled contour graph, supplying the regular grid X/Y and the Z values for the first Hour of the day (so between midnight and 1 am).

###########################
#now making contour graphs

dallas_sf <- st_as_sf(boundary)

#A plot for one hour of the day
hour1 <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Hour1), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [0-1)') + 
  theme_void() + theme(legend.position = "none")
hour1

png('Hour1.png', height=5, width=5, units="in", res=1000, type="cairo") 
hour1
dev.off()
###########################

Nice right! I have in the code my attempt to make a super snazzy small multiple plot, but that was not working out so well for me. But you can then go ahead and make up other slices if you want. Here is an example of taking an extended lunchtime time period.

###########################
#Plot for the afternoon time period
base_df$Afternoon <- rowSums(base_df[,paste0("Hour",10:17)])

afternoon <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Afternoon), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [9:00-17:00)') + 
  theme_void() + theme(legend.position = "none")
afternoon
###########################

So you can see that the patterns only slightly changed compared to the midnight prior graph.

Note that these plots will have different breaks, but you could set them to be equal by simply specifying a breaks argument in the geom_contour_filled call.

I will leave it up so someone who is more adept at R code than me to make a cool animated viz over time from this. But that is a way to mash up the temporal weights in a map.

Deleted Twitter Account

I’ve decided to delete Twitter. It is for multiple reasons in the end.

Reason 1, I was definitely addicted to it. Checked it quite often during the daytime. Deleting off of my phone (and ditto for email) was a good first step, but I still checked it quite a bit when I was on my personal computer.

Reason 2 — there is a XKCD comic about staying up arguing with people on the internet. I was constantly tempted to do this on Twitter. It is never really worth it. Many of the examples that come to mind I did this — had a comment stream with Pete Kraska the other day about grant funding, and in the past Travis Pratt over pre-prints — Pete/Travis had an ounce of truth in their initial statements, but made sweeping generalizations that don’t describe the majority of people (which included me, hence my urge to respond). While they likely did not intend to say something directly about me, they did so in making general stereotyping comments.

I respect each as scholars, but they just have ill-informed opinions in those cases. You would think criminologists would be less likely to attribute the malice of a few to widespread groups of individuals, but so it goes. No doubt I have bad/wrong opinions all the time as well.

Reason 3, a former colleague the other day was upset I liked a tweet that was a critique of their work. This is just one example, but there are a million different things people could take offense to. I am not interested in even the potential of saying or doing something that would result in a sandbag onslaught I’ve seen several times on Twitter. I of course do not intentionally mean to hurt peoples feelings, but I do not feel like defending minor stuff like that either. Worrying about things like that is just not good for my mental health.

There are of course good things I will be missing out on. I initially joined Twitter to keep up on the news. Between Google Scholar and CrimPapers I can keep up on academic work. (Actual news I should definately not be getting my info from tweets!) But the biggest benefit in the end was there are several internet friends I only met on Twitter and would not have had the opportunity to meet without Twitter.

And of course it was nice to tweet a blog post and get a dozen likes (or say something snarky and get 30). So my work will have less exposure than before, but honestly it was not much to begin with. My last post had more likes (around a dozen) than referrals from Twitter (around half that!) Not like tweeting my blog posts resulted in 1000’s of views, more like a few dozen extra most of the time (and a few hundred extra in the best of times). So I will just continue to write blog posts, and they will have a few less views than before. I wish my blog had bigger reach but it is really just my place for creative output.

I encourage folks to always reach out and send me an email to keep in touch if you are one of my former Twitter friends. Academia can be a lonely place in normal times, and with isolating in the pandemic I can’t even imagine what it would be like without my family. I don’t think my time spent on Twitter was good for my personal well being though in the end, even though it did definitely help me be part of a larger community of colleagues and friends. 

RTM Deep Learning Style

In my quest to better understand deep learning, I have attempted to replicate some basic models I am familiar with in criminology, just typical OLS and the more complicated group based trajectory models. Another example I will illustrate is doing a variant of Risk Terrain Modeling.

The typical way RTM is done is:

Data Prep Part:

  1. create a set of independent variables for crime generators (e.g. bars, subway stops, liquor stores, etc.) that are either the distance to the nearest or the kernel density estimate
  2. Turn these continuous estimates into dummy variables, e.g. if within 100 meters = 1, else = 0. For kernel density they typically z-score and if a z-score > 2 the dummy variable equals 1.
  3. Do 2 for varying distance/bandwidth selections, e.g. 100 meters, 200 meters, etc. So you end up with a collection of distance variables, e.g. Bars_100, Bars_200, Bars_400, etc.

Modeling Part

  1. Fit a Lasso regression predicting your crime outcome constraining all of the variables to be positive. (So RTM will never say a crime generator has a negative effect.)
  2. For the variables that passed this Lasso stage, then do a variable selection routine. So instead of the final model having Bars_100 and Bars_400, it will only choose one of those variables.

For the modeling part, we can replicate various parts of these in a deep learning environment. For the constrain the coefficients to be positive, when you see folks refer to a “RelU” or the rectified linear unit function, all this means is that the coefficients are constrained to be positive. For the variable selection part, I needed to hack my own – it ends up being a combo of a custom dropout scheme and then pruning in deep learning lingo.

Although RTM is typically done on raster grid cells for the spatial unit of analysis, this is not a requirement. You can do all these steps on vector (e.g. street segments) or other areal spatial units of analysis.

Here I illustrate using street units (intersections and street segments) from DC. The crime generator data I take from my dissertation (and I have a few pubs in Crime & Delinquency based on that work). The crime data I illustrate using 2011 violent Part 1 UCR (homicide, agg assault, robbery, no rape in the public data).

The crime dataset is over time, and I describe in an analysis (with Billy Zakrzewski) on examining pre/post crime around DC medical marijuana dispensaries.

The data and code to replicate can be downloaded here. It is python, and for the deep learning model I used pytorch.

RTM Example in Python

So I will walk through briefly my second script, 01_DeepLearningRTM.py. The first script, 00_DataPrep.py, does the data prep, so this data file already has the crime generator variables prepared in the manner RTM typically creates them. (The rtm_dl_funcs.py has the functions to do the feature extraction and create the deep learning model, to do distance/density in sci-kit is very slick, only a few lines of code.)

So first I just define the libraries I will be using, and import my custom rtm functions I created.

######################################################
import numpy as np
import pandas as pd
import torch
device = torch.device("cuda:0")
import os
import sys

my_dir = r'C:\Users\andre\OneDrive\Desktop\RTM_DeepLearning'
os.chdir(my_dir)
sys.path.append(my_dir)
import rtm_dl_funcs
######################################################

The next set of code grabs the crime data, and then defines my variable sets. I have plenty more crime generator data from my dissertation, but to make it easier on myself I just focus on distance to metro entrances, the density of 311 calls (a measure of disorder), and the distance and density of alcohol outlets (this includes bars/liquor stores/gas stations that sell beer, etc.).

Among these variable sets, the final selected model will only choose one within each set. But I have also included the ability for the model to incorporate other variables that will just enter in no-matter what (and are not constrained to be positive). This is mostly to incorporate an intercept into the regression equation, but here I also include the percent of sidewalk encompassing one of my street units (based on the Voronoi tessellation), and a dummy variable for whether the street unit is an intersection. (I also planned on included the area of the tessalation, but it ended up being an explosive effect, my dissertation shows its effect is highly non-linear, so didn’t want to worry about splines here for simplicity.)

######################################################
#Get the Prepped Data
crime_data = pd.read_csv('Prepped_Crime.csv')

#Variable sets for each
db = [50, 100, 200, 300, 400, 500, 600, 700, 800]
metro_set = ['met_dis_' + str(i) for i in db]
alc_set = ['alc_dis_' + str(i) for i in db]
alc_set += ['alc_kde_' + str(i) for i in db]
c311_set = ['c31_kde_' + str(i) for i in db]

#Creating a few other generic variables
crime_data['PercSidewalk'] = crime_data['SidewalkArea'] / crime_data['AreaMinWat']
crime_data['Const'] = 1
const_li = ['Const','Intersection','PercSidewalk']
full_set = const_li + alc_set + metro_set + c311_set
######################################################

The next set of code turns my data into a set of torch tensors, then I grab the size of my independent variable sets, which I will end up needing when initializing my pytorch model.

Then I set the seed (to be able to reproduce the results), create the model, and set the loss function and optimizer. I use a Poisson loss function (will need to figure out negative binomial another day).

######################################################
#Now creating the torch tensors
x_ten = torch.tensor(crime_data[full_set].to_numpy(), dtype=float)
y_ten = torch.tensor(crime_data['Viol_2011'].to_numpy(), dtype=float)
out_ten = torch.tensor(crime_data['Viol_2012'].to_numpy(), dtype=float)

#These I need to initialize the deep learning model
gen_lens = [len(alc_set), len(metro_set), len(c311_set)]
    
#Creating the model 
torch.manual_seed(10)

model = rtm_dl_funcs.RTM_torch(const=len(const_li), 
                               gen_list=gen_lens)
criterion = torch.nn.PoissonNLLLoss(log_input=True, reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4
print( model )
######################################################

If you look at the printed out model, it gives a nice summary of the different layers. We have our one layer for the fixed coefficients, and another three sets for our alcohol outlets, 311 calls, and metro entrances. We then have a final cancel layer. The idea behind the final cancel layer is that the variable selection routine in RTM can still end up not selecting any variables for a set. I ended up not using it here though, as it was too aggressive in this example. (So will need to tinker with that some more!)

The variable selection routine is very volatile – so if you have very correlated inputs, you can essentially swap one for the other and get near equivalent predictions. I often see folks who do RTM analyses say something along the lines of, “OK this RTM selected A, and this RTM selected B, so they are different effects in these two samples” (sometimes pre/post, other times comparing different areas, and other times different crime outcomes). I think this is probably wrong though to make that inference, as there is quite a bit of noise in the variable selection process (and the variable selection process itself precludes making inferences on the coefficients themselves).

My deep learning example inherited the same problems. So if you change the initialized weights, it may end up selecting totally different inputs in the end. To get the variable selection routine to at least select the same crime generator variables in my tests, I do a burn in period in which I implement a random dropout scheme. So instead of the typical dropout, for every forward pass it does a random dropout to only select one variable randomly out of each crime generator set. After that converges, I then use a pruning layer to only pick the coefficient that has the largest effect, and again do a large set of iterations to make sure the results converged. So different means but same ends to the typical RTM steps 4 and 5 above. I also have like I said a ReLU transformation after each layer, so the crime generator variables are always positive, any negative effects will be pruned out.

One thing that is nice about deep learning is that it can be quite fast. Here each of these 10,000 iteration sets take less than a minute on my desktop with a GPU. (I’ve been prototyping models with more parameters and more observations at work on my laptop with just a CPU that only take like 10 to 20 minutes).

######################################################
#Burn in part, random dropout
for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

#Switching to pruning all but the largest effects
model.l1_prune()

for t in range(10000):
    #Forward pass
    y_pred = model(x=x_ten, mask_type=None, cancel=False)
    #Loss
    loss_insample = criterion(y_pred, y_ten)
    optimizer.zero_grad()
    loss_insample.backward(retain_graph=True)
    optimizer.step()
    if t % 1000 == 0:
        print(f'loss: {loss_insample.item()}' )

print( model.coef_df(nm_li=full_set, cancel=False) )
######################################################

And this prints out the results (as incident rate ratios), so you can see it selected 50 meters alcohol kernel density, 50 meters distance to the nearest metro station, and kernel density for 311 calls with an 800 meter bandwidth.

I have in the code another example model when using a different seed. So testing out on around 5 different seeds it always selected these same distance/density variables, but the coefficients are slightly different each time. Here is an example from setting the seed to 12.

These models are nothing to brag about, using the typical z-score the predictions and set the threshold to above 2, the PAI is only around 3 (both for in-sample 2011 and out of sample 2012 is slightly lower). It is a tough prediction task – the mean number of violent crimes per street unit per year is only 0.3. Violent crime is fortunately very rare!

But with only three different risk variables, we can do a quick conjunctive analysis, and look at the areas of overlap.

######################################################
#Adding model 1 predictions back into the dataset
pred_mod1 = pd.Series(model(x=x_ten, mask_type=None, cancel=False).exp().detach().numpy())
crime_data['Pred_M1'] = pred_mod1

#Check out the areas of overlapping risk
mod1_coef = model.coef_df(nm_li=full_set, cancel=False)
risk_vars = list(set(mod1_coef['Variable']) - set(const_li))
conj_set = crime_data.groupby(risk_vars, as_index=False)['Const','Pred_M1','Viol_2012'].sum()
print(conj_set)
######################################################

In this table Const is the total number of street units selected, Pred_M1 is the expected number of crimes via Model 1, and then I show how well it conforms to the predictions out of sample 2012. So you can see in the aggregate the predictions are not too far off. There only ends up being one street unit that overlaps for all three risk factors in the study area.

I believe the predictions would be better if I included more crime generator variables. But ultimately the nature of how RTM works it trades off accuracy for simple models. Which is fair – it helps to ease the nature of how a police department (or some other entity) responds to the predictions.

But this trade off results in predictions that don’t fare as well compared with more complicated models. For example I show (with Wouter Steenbeek) that random forests do much better than RTM. To make those models more interpretable we did local decompositions for hot spots, so say this hot spot is 30% alcohol outlets, 20% nearby apartments, etc.

So there is no doubt more extensions for RTM you could do in a deep learning framework, but they will likely always result in more complicated and less interpretable models. Also here I don’t think this code will be better than the traditional RTM folks, the only major benefit of this code is it will run faster – minutes instead of overnight for most jobs.

Notes on making scatterplots in matplotlib and seaborn

Many of my programming tips, like my notes for making Leaflet maps in R or margins plots in Stata, I’ve just accumulated doing projects over the years. My current workplace is a python shop though, so I am figuring it out all over for some of these things in python. I made some ugly scatterplots for a presentation the other day, and figured it would be time to spend alittle time making some notes on making them a bit nicer.

For prior python graphing post examples, I have:

For this post, I am going to use the same data I illustrated with SPSS previously, a set of crime rates in Appalachian counties. Here you can download the dataset and the python script to follow along.

Making scatterplots using matplotlib

So first for the upfront junk, I load my libraries, change my directory, update my plot theme, and then load my data into a dataframe crime_dat. I technically do not use numpy in this script, but soon as I take it out I’m guaranteed to need to use np. for something!

################################################################
import pandas as pd
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

my_dir = r'C:\Users\andre\OneDrive\Desktop\big_scatter'
os.chdir(my_dir)

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
crime_dat = pd.read_csv('Rural_appcrime_long.csv')
################################################################

First, lets start from the base scatterplot. After defining my figure and axis objects, I add on the ax.scatter by pointing the x and y’s to my pandas dataframe columns, here Burglary and Robbery rates per 100k. You could also instead of starting from the matplotlib objects start from the pandas dataframe methods (as I did in my prior histogram post). I don’t have a good reason for using one or the other.

Then I set the axis grid lines to be below my points (is there a way to set this as a default?), and then I set my X and Y axis labels to be nicer than the default names.

################################################################
#Default scatterplot
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'])
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter01.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

You can see here the default point markers, just solid blue filled circles with no outline, when you get a very dense scatterplot just looks like a solid blob. I think a better default for scatterplots is to plot points with an outline. Here I also make the interior fill slightly transparent. All of this action is going on in the ax.scatter call, all of the other lines are the same.

################################################################
#Making points have an outline and interior fill
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'], 
           c='grey', edgecolor='k', alpha=0.5)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter02.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

So that is better, but we still have quite a bit of overplotting going on. Another quick trick is to make the points smaller and up the transparency by setting alpha to a lower value. This allows you to further visualize the density, but then makes it a bit harder to see individual points – if you started from here you might miss that outlier in the upper right.

Note I don’t set the edgecolor here, but if you want to make the edges semitransparent as well you could do edgecolor=(0.0, 0.0, 0.0, 0.5), where the last number of is the alpha transparency tuner.

################################################################
#Making the points small and semi-transparent
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(crime_dat['burg_rate'], crime_dat['rob_rate'], c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter03.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

This dataset has around 7.5k rows in it. For most datasets of anymore than a hundred points, you often have severe overplotting like you do here. One way to solve that problem is to bin observations, and then make a graph showing the counts within the bins. Matplotlib has a very nice hexbin method for doing this, which is easier to show than explain.

################################################################
#Making a hexbin plot
fig, ax = plt.subplots(figsize=(6,4))
hb = ax.hexbin(crime_dat['burg_rate'], crime_dat['rob_rate'], 
               gridsize=20, edgecolors='grey', 
               cmap='inferno', mincnt=1)
ax.set_axisbelow(True)
ax.set_xlabel('Burglary Rate per 100,000')
ax.set_ylabel('Robbery Rate per 100,000')
cb = fig.colorbar(hb, ax=ax)
plt.savefig('Scatter04.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

So for the hexbins I like using the mincnt=1 option, as it clearly shows areas with no points, but then you can still spot the outliers fairly easy. (Using white for the edge colors looks nice as well.)

You may be asking, what is up with that outlier in the top right? It ends up being Letcher county in Kentucky in 1983, which had a UCR population estimate of only 1522, but had a total of 136 burglaries and 7 robberies. This could technically be correct (only some local one cop town reported, and that town does not cover the whole county), but I’m wondering if this is a UCR reporting snafu.

It is also a good use case for funnel charts. I debated on making some notes here about putting in text labels, but will hold off for now. You can add in text by using ax.annotate fairly easy by hand, but it is hard to automate text label positions. It is maybe easier to make interactive graphs and have a tooltip, but that will need to be another blog post as well.

Making scatterplots using seaborn

The further examples I show are using the seaborn library, imported earlier as sns. I like using seaborn to make small multiple plots, but it also has a very nice 2d kernel density contour plot method I am showing off.

Note this does something fundamentally different than the prior hexbin chart, it creates a density estimate. Here it looks pretty but creates a density estimate in areas that are not possible, negative crime rates. (There are ways to prevent this, such as estimating the KDE on a transformed scale and retransforming back, or reflecting the density back inside the plot would probably make more sense here, ala edge weighting in spatial statistics.)

Here the only other things to note are used filled contours instead of just the lines, I also drop the lowest shaded area (I wish I could just drop areas of zero density, note dropping the lowest area drops my outlier in the top right). Also I have a tough go of using the default bandwidth estimators, so I input my own.

################################################################
#Making a contour plot using seaborn
g = sns.kdeplot(crime_dat['burg_rate'], crime_dat['rob_rate'], 
                shade=True, cbar=True, gridsize=100, bw=(500,50),
                cmap='plasma', shade_lowest=False, alpha=0.8)
g.set_axisbelow(True)
g.set_xlabel('Burglary Rate per 100,000')
g.set_ylabel('Robbery Rate per 100,000')
plt.savefig('Scatter05.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################ 

So far I have not talked about the actual marker types. It is very difficult to visualize different markers in a scatterplot unless they are clearly separated. So although it works out OK for the Iris dataset because it is small N and the species are clearly separated, in real life datasets it tends to be much messier.

So I very rarely use multiple point types to symbolize different groups in a scatterplot, but prefer to use small multiple graphs. Here is an example of turning my original scatterplot, but differentiating between different county areas in the dataset. It is a pretty straightforward update using sns.FacetGrid to define the group, and then using g.map. (There is probably a smarter way to set the grid lines below the points for each subplot than the loop.)

################################################################
#Making a small multiple scatterplot using seaborn
g = sns.FacetGrid(data=crime_dat, col='subrgn', 
                   col_wrap=2, despine=False, height=4)
g.map(plt.scatter, 'burg_rate', 'rob_rate', color='grey', 
       s=12, edgecolor='k', alpha=0.5)
g.set_titles("{col_name}")
for a in g.axes:
    a.set_axisbelow(True)
g.set_xlabels('Burglary Rate per 100,000')
g.set_ylabels('Robbery Rate per 100,000')
plt.savefig('Scatter06.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

And then finally I show an example of making a small multiple hexbin plot. It is alittle tricky, but this is an example in the seaborn docs of writing your own sub-plot function and passing that.

To make this work, you need to pass an extent for each subplot (so the hexagons are not expanded/shrunk in any particular subplot). You also need to pass a vmin/vmax argument, so the color scales are consistent for each subplot. Then finally to add in the color bar I just fiddled with adding an axes. (Again there is probably a smarter way to scoop up the plot coordinates for the last plot, but here I just experimented till it looked about right.)

################################################################
#Making a small multiple hexbin plot using seaborn

#https://github.com/mwaskom/seaborn/issues/1860
#https://stackoverflow.com/a/31385996/604456
def loc_hexbin(x, y, **kwargs):
    kwargs.pop("color", None)
    plt.hexbin(x, y, gridsize=20, edgecolor='grey',
               cmap='inferno', mincnt=1, 
               vmin=1, vmax=700, **kwargs)

g = sns.FacetGrid(data=crime_dat, col='subrgn', 
                  col_wrap=2, despine=False, height=4)
g.map(loc_hexbin, 'burg_rate', 'rob_rate', 
      edgecolors='grey', extent=[0, 9000, 0, 500])
g.set_titles("{col_name}")
for a in g.axes:
    a.set_axisbelow(True)
#This goes x,y,width,height
cax = g.fig.add_axes([0.55, 0.09, 0.03, .384])
plt.colorbar(cax=cax, ax=g.axes[0])
g.set_xlabels('Burglary Rate per 100,000')
g.set_ylabels('Robbery Rate per 100,000')
plt.savefig('Scatter07.png', dpi=500, bbox_inches='tight')
plt.show()
################################################################

Another common task with scatterplots is to visualize a smoother, e.g. E[Y|X] the expected mean of Y conditional on X, or you could do any other quantile, etc. That will have to be another post though, but for examples I have written about previously I have jittering 0/1 data, and visually weighted regression.

Notes on making Leaflet maps in R

The other day I wrote a blog post for crimrxiv about posting interactive graphics on their pre-print sharing service. I figured it would be good to share my notes on making interactive maps, and to date I’ve mostly created these using the R leaflet library.

The reason I like these interactive maps is they allow you to zoom in and look at hot spots of crime. With the slippy base maps you can then see, oh OK this hot spot is by a train station, or an apartment complex, etc. It also allows you to check out specific data labels via pop-ups as I will show.

I’m using data from my paper on creating cost of crime weighted hot spots in Dallas (that will be forthcoming in Police Quarterly soonish). But I have posted a more direct set of replicating code for the blog post here.

R Code

So first for the R libraries I am using, I also change the working directory to where I have my data located on my Windows machine.

##########################################################
#This code creates a nice leaflet map of my DBSCAN areas

library(rgdal)       #read in shapefiles
library(sp)          #spatial objects
library(leaflet)     #for creating interactive maps
library(htmlwidgets) #for exporting interactive maps

#will need to change baseLoc if replicating on your machine
baseLoc <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\leaflet_R_examples\\Analysis"
setwd(baseLoc)
##########################################################

Second, I read in my shapefiles using the rgdal library. This is important, as it includes the projection information. To plot the spatial objects on a slippy map they need to be in the Web Mercator projection (or technically no projection, just a coordinate reference system for the globe). As another trick I like with these basemaps, for the outlined area (the Dallas boundary here), it is easier to plot as a line spatial object, as opposed to an empty filled polygon. You don’t need to worry about the order of the layers as much that way.

##########################################################
#Get the boundary data and DBSCAN data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
dbscan_areas <- readOGR(dsn="db_scan.shp",layer="db_scan")

#Now convert to WGS
DalLatLon <- spTransform(boundary,CRS("+init=epsg:4326"))
DallLine <- as(DalLatLon, 'SpatialLines') #Leaflet useful for boundaries to be lines instead of areas
dbscan_LatLon <- spTransform(dbscan_areas,CRS("+init=epsg:4326") )

#Quick and Dirty plot to check projections are OK
plot(DallLine)
plot(dbscan_LatLon,add=TRUE,col='blue')
##########################################################

Next part, I have a custom function I have made to make pop-up labels for these leaflet maps. First I need to read in a table with the data info for the hot spot areas and merge that into the spatial object. Then the way my custom function works is I pass it the dataset, then I have arguments for the variables I want, and the way I want them labeled. The function does the work of making the labels bolded and putting in line breaks into the HTML. (No doubt others have created nice libraries to do HTML tables/graphs inside the pop-ups that I am unaware of.) If you check out the final print statement, it shows the HTML it built for one of the labels, <strong>ID: </strong>1<br><strong>$ (Thousands): </strong>116.9<br><strong>PAI: </strong>10.3<br><strong>Street Length (Miles): </strong>0.4

##########################################################
#Function for labels

#read in data
crime_stats <- read.csv('ClusterStats_wlen.csv', stringsAsFactors=FALSE)
dbscan_stats <- crime_stats[crime_stats$type == 'DBSCAN',]
dbscan_stats$clus_id <- as.numeric(dbscan_stats$AreaStr) #because factors=False!

#merge into the dbscan areas
dbscan_LL <- merge(dbscan_LatLon,dbscan_stats)

LabFunct <- function(data,vars,labs){
  n <- length(labs)
  add_lab <- paste0("<strong>",labs[1],"</strong>",data[,vars[1]])
  for (i in 2:n){
    add_lab <- paste0(add_lab,"<br><strong>",labs[i],"</strong>",data[,vars[i]])
  }
  return(add_lab)
}

#create labels
vs <- c('AreaStr', 'val_th', 'PAI_valth_len', 'LenMile')
#Lazy, so just going to round these values
for (v in vs[-1]){
  dbscan_LL@data[,v] <- round(dbscan_LL@data[,v],1)
}  
lb <- c('ID: ','$ (Thousands): ','PAI: ','Street Length (Miles): ')
diss_lab <- LabFunct(dbscan_LL@data, vs, lb)

print(diss_lab[1]) #showing off just one
##########################################################

Now finally onto the hotspot map. This is a bit to chew over, so I will go through bit-by-bit.

##########################################################
HotSpotMap <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "CartoDB Lite") %>%
  addPolylines(data=DallLine, color='black', weight=4, group="Dallas Boundary") %>%
  addPolygons(data=dbscan_LL,color = "blue", weight = 2, opacity = 1.0, 
              fillOpacity = 0.5, group="DBSCAN Areas",popup=diss_lab, 
              highlight = highlightOptions(weight = 5,bringToFront = TRUE)) %>%
  addLayersControl(baseGroups = c("Open Street Map","CartoDB Lite"),
                   overlayGroups = c("Dallas Boundary","DBSCAN Areas"),
                   options = layersControlOptions(collapsed = FALSE))  %>%
  addScaleBar(position = "bottomleft", options = scaleBarOptions(maxWidth = 100, 
              imperial = TRUE, updateWhenIdle = TRUE))
                      
HotSpotMap #this lets you view interactively

#or save to a HTML file to embed in webpage
saveWidget(HotSpotMap,"HotSpotMap.html", selfcontained = TRUE)
##########################################################

First I create the empty leaflet() object. Because I am superimposing multiple spatial layers, I don’t worry about setting the default spatial layer. Second, I add in two basemap providers, OpenStreetMap and the grey scale CartoDB positron. Positron is better IMO for visualizing global data patterns, but the open street map is better for when you zoom in and want to see exactly what is around a hot spot area. Note when adding in a layer, I give it a group name. This allows you to later toggle which provider you want via a basegroup in the layers control.

Next I add in the two spatial layers, the Dallas Boundary lines and then the hot spots. For the DBSCAN hot spots, I include a pop-up diss_lab for the dbscan hot spot layer. This allows you to click on the polygon, and you get the info I stuffed into that label vector earlier. The HTML is to make it print nicely.

Finally then I add in a layers control, so you can toggle layers on/off. Basegroups mean that only one of the options can be selected, it doesn’t make sense to have multiple basemaps selected. Overlay you can toggle on/off as needed. Here the overlay doesn’t matter much due to the nature of the map, but if you have many layers (e.g. a hot spot map and a choropleth map of demographics) being able to toggle the layers on/off helps a bit more.

Then as a final touch I add in a scale bar (that automatically updates depending on the zoom level). These aren’t my favorite with slippy maps, as I’m not even 100% sure what location the scale bar refers to offhand (the center of the map? Or literally where the scale bar is located?) But when zoomed into smaller areas like a city I guess it is not misleading.

Here is a screenshot of this created map zoomed out to the whole city using the Positron grey scale base map. So it is tough to visualize the distribution of hot spots from this. If I wanted to do that in a static map I would likely just plot the hot spot centroids, and then make the circles bigger for areas that capture more crime.

But since we can zoom in, here is another screenshot zoomed in using the OpenStreetMap basemap, and also illustrating what my pop-up labels look like.

I’m too lazy to post this exact map, but it is very similar to one I posted for my actual hot spots paper if you want to check it out directly. I host it on GitHub for free.

Here I did not show how to make a choropleth map, but Jacob Kaplan in his R book has a nice example of that. And in the future I will have to update this to show how to do the same thing in python using the Folium library. I used Folium in this blog post if you want to dig into an example though for now.

Some more examples

For some other examples of what is possible in Leaflet maps in R, here are some examples I made for my undergrad Communities and Crime class. I had students submit prediction assignments (e.g. predict the neighborhood with the most crime in Dallas, predict the street segment in Oak Cliff with the most violent crime, predict the bar with the most crimes nearby, etc.) I would then show the class the results, as well as where other students predicted. So here are some screen shots of those maps.

Choropleth

Graduated Points

Street Segment Viz