Clumpy hotspots

I read an article by Tim Hart the other day (part of a special issue I will have an article in as well soon). In it he evaluated hot spot methods not only by how well they forecast crime, but also by how clumpy the resulting hot spots are. Some hot spot methods, such as risk terrain modeling (Caplan et al., 2020; Fox et al., 2021), machine learning models (Wheeler & Steenbeek, 2020), or self-exciting point process models (Mohler et al., 2018), can by their nature produce discontinuous hot spots. Here is an example of an RTM map I made in Yoo & Wheeler (2019) for homeless related crime in Los Angeles, and you can see the high risk areas are quite spotty:

For other hot spot methods, like hierarchical clustering (Wheeler & Reuter, 2020) or kernel density maps, this is not as big an issue. Here is an example kernel density map also from Yoo & Wheeler (2019) based on the same data:

So you can see how the hot spots in the kernel density map are spatially contiguous, whereas the RTM example can be little hot spots all over the jurisdiction. It is obviously easier to patrol a single contiguous area than many islands spread over the entire jurisdiction. So it may make sense to accept a contiguous area that captures somewhat fewer crimes in place of speckled areas that are all over the map.

Adepeju et al. (2016) were the first to use a particular statistic, the clumpiness index, to evaluate different hot spot methods. Their figure below is a pretty good depiction of the idea: count up the number of internal edges to a hot spot (when a hot spot grid cell neighbors another hot spot), and the number of external edges. Then it is just a particular formula to make the index range from -1 to 1 given different sized hot spots.
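As a rough illustration of the edge counting only (not the index formula itself), here is a minimal sketch. The function name `rook_edge_counts` and the toy cell lists are made up for this example, and cells are assumed to be (row, col) positions on a regular grid with rook contiguity.

```python
def rook_edge_counts(hot_cells):
    """Count internal and external rook edges for a set of (row, col) hot spot cells."""
    hot = set(hot_cells)
    internal = 0  # sides shared between two hot spot cells
    external = 0  # sides of a hot spot cell facing a non-hot cell (or the map edge)
    for r, c in hot:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if (r + dr, c + dc) in hot:
                internal += 1
            else:
                external += 1
    # Every shared side was visited from both of its cells, so halve that count
    return internal // 2, external


clumped = [(0, 0), (0, 1), (1, 0), (1, 1)]   # one 2x2 block
speckled = [(0, 0), (0, 2), (2, 0), (2, 2)]  # four isolated cells

print(rook_edge_counts(clumped))   # (4, 8)
print(rook_edge_counts(speckled))  # (0, 16)
```

Both selections cover four cells, but the block has four internal edges and the speckled pattern has none, which is the contrast the clumpiness index is built to summarize.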

So here I flip this idea on its head a bit: instead of using a particular hot spot technique and seeing how clumpy it is, I formulate a linear program that, given a prediction, trades off a smaller number of predicted crimes in the hot spot against making the hot spot areas more clumpy. I illustrate my clumpy hot spots using just prior data to predict future data, in particular thefts from motor vehicles in Raleigh, North Carolina.

I have posted the data/code on github here. It is a bit too long to embed the code directly in the blog post, but just see the file. The crime data and Raleigh border I downloaded from the Raleigh open data website.

A Linear Program to Clump Hot Spots

So for some quick and dirty math in text, the linear program I formulate is:

Maximize { Sum[ theta*S_i*Crime_i + (1 - theta)*E_i ] }
Subject To:
    1) Sum( S_i ) = k
    2) E_i <= Sum(S_n for n in neighbors(i) ) for each i
    3) E_i <= S_i for each i
    4) S_i element of {0,1}, E_i >= 0 (and can be continuous)

The idea behind this is that if theta=1, this is the same as just taking whatever your input areas are and ranking them to pick the top k areas. The first constraint says you need to choose the total area you want: if you have 10,000 grid cells of 500 by 500 feet as your spatial units of analysis, and you wanted the top 1% of the city, that is 100 grid cells, so you would choose k=100 in that scenario. S_i are the decision variables for the final selected hot spot areas. For Crime_i I use prior counts of crime in the grid cell, but it could be the predicted value from whatever model as well.

The second and third constraints determine the values for the second set of decision variables, E_i. These are the decision variables that encode the interconnected links when a selected grid cell touches another grid cell. Constraint 2 sets E_i to the total number of neighbors of i that are selected, except constraint 3 says if S_i is 0 E_i needs to be 0 as well.
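To make the trade-off concrete, here is a small brute-force sketch on a made-up 3 by 3 grid. It enumerates every choice of k cells and does not call an LP solver, so it only works for toy sizes. Following the prose description above, E_i is taken to be the number of selected rook neighbors of a selected cell (and zero for unselected cells); the crime counts and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical crime counts on a 3x3 grid, keyed by (row, col)
crime = {
    (0, 0): 9, (0, 1): 1, (0, 2): 8,
    (1, 0): 1, (1, 1): 1, (1, 2): 1,
    (2, 0): 7, (2, 1): 6, (2, 2): 6,
}

def rook_neighbors(cell):
    r, c = cell
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]

def score(selected, theta):
    """theta * crimes captured + (1 - theta) * directed interior links."""
    sel = set(selected)
    crimes = sum(crime[c] for c in sel)
    # E_i summed over cells: selected neighbors of each selected cell
    links = sum(1 for c in sel for n in rook_neighbors(c) if n in sel)
    return theta * crimes + (1 - theta) * links

def best_selection(k, theta):
    return max(combinations(sorted(crime), k), key=lambda s: score(s, theta))

print(best_selection(3, 1.0))  # ((0, 0), (0, 2), (2, 0)): plain top-k ranking
print(best_selection(3, 0.2))  # ((2, 0), (2, 1), (2, 2)): one connected strip
```

With theta=1 the three highest cells are picked even though they are scattered (24 crimes, no links). With theta=0.2 the connected bottom row wins, capturing 19 crimes.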

In this formulation, S_i need to be integer variables, but the E_i are defined by the sum of S_i, so they can be continuous. In this formulation if you have N grid cells (or whatever spatial units of analysis), this results in 2*N decision variables, and 2*N + 1 constraints. You could maybe save a few constraints here by working with an undirected graph instead of a directed one (in essence this double counts, a-b and b-a would count as two links). But this will just make it 1.5*N constraints instead of 2*N. So not a big deal probably. I did have some issues solving this using pulp’s default coin/GLPK solver. But CPLEX solved it no problem. (My example is a total of 20,986 grid cells of 500 by 500 feet, and I use rook contiguity like the Adepeju article as well. Using CPLEX it solves the models in just a few seconds.)

In this formulation you can think of theta as trading off crimes in the hot spot vs interior edges. So imagine you had theta=0.9, and you had a solution with 200 crimes and 100 interior edges. The objective function in that scenario would be 0.9*200 + 0.1*100 = 190. Now imagine you had an alternative scenario with 190 crimes, but 200 internal edges, the objective function would be 0.9*190 + 0.1*200 = 191. So you are saying, it is ok to have hot spots capture a smaller number of crimes, if they are more connected.
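The arithmetic above can be checked with a few lines. This is only a sketch of the objective function as written in the text; the function name `objective` is made up here.

```python
def objective(theta, crimes, interior_edges):
    """Weighted objective: theta * crimes + (1 - theta) * interior edges."""
    return theta * crimes + (1 - theta) * interior_edges

first = objective(0.9, 200, 100)
second = objective(0.9, 190, 200)
print(round(first, 6), round(second, 6))  # 190.0 191.0

# Break-even rate: how many extra interior edges offset losing one crime
for theta in (0.9, 0.5, 0.2):
    print(theta, round(theta / (1 - theta), 6))
# 0.9 9.0
# 0.5 1.0
# 0.2 0.25
```

So at theta=0.9 the program gives up one crime only if it gains more than nine interior edges, while at theta=0.5 the exchange rate is one for one.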

Normal Hotspots vs Clumpy Ones in Raleigh

The open data I use for Raleigh, North Carolina for the NIBRS dataset goes back to June 2014, and has data updated in the beginning of March 2021. I pull out larcenies from motor vehicles, and for the historical train dataset use car larcenies from 2014 through 2019 (n = 17,681). For the test dataset I use car larcenies in 2020 and what is available so far in 2021 (n = 3,376). Again these are grid cells generated over the city boundaries at 500 by 500 foot intervals. For illustration I grab out the top 1% of the city (209 grid cells). I use a train/test split because performance on out of sample test data will typically be lower than on the data used to pick the hot spots. Here are the PAI stats for train vs test when selecting the top 1%.

For all subsequent selections I always use the historical training data to select the hot spots, and the test dataset to evaluate the PAI.

If we do the typical approach of just taking the highest crime grid cells based on the historical data, here are the results both for the PAI and the CI (clumpiness index). For those not familiar, PAI is % Crime Capture/% Area, so if the denominator is 1%, and the PAI (for the test data) is 17, that means the hot spots capture 17% of the total thefts from vehicles. The CI ranges from -1 (spread apart) to 1 (entirely clustered). Here it is just over 0, suggesting these are basically randomly distributed in terms of clustering.
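For reference, the PAI calculation is short enough to sketch directly. The function name `pai` is made up; the first two calls use round illustrative numbers, and the last one plugs in the training counts quoted in this post (3,664 of 17,681 thefts in 209 of 20,986 grid cells).

```python
def pai(crimes_in_hotspots, total_crimes, hotspot_area, total_area):
    """Predictive accuracy index: share of crime captured / share of area covered."""
    pct_crime = crimes_in_hotspots / total_crimes
    pct_area = hotspot_area / total_area
    return pct_crime / pct_area

# 17% of crime in 1% of the area
print(round(pai(17, 100, 1, 100), 2))  # 17.0
# 50% of crime in 5% of the area
print(round(pai(50, 100, 5, 100), 2))  # 10.0

# Training data counts from this post, cells used as the area unit
train_pai = pai(3664, 17681, 209, 20986)
print(round(train_pai, 1))
```

Any area unit works as long as the hot spot area and total area use the same one, since only the ratio matters.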

You may think that near spatial randomness in terms of clumping seems at odds with the idea that crime clusters! But it is not really: a consistent finding about crime hot spots is that they are intensely localized, and often you can go down the street and be in a low crime area (Harries, 2006). It is the same idea as when people say high crime neighborhoods are often spotty in the interior: they tend to have mostly low crime areas and just a few specific hot spots.

OK, so now to show off my linear program. So what happens if we use theta=0.9?

The total crime numbers here are for the historical data, and it ends up capturing the exact same number of crimes as simply selecting the top 1% does (3,664). But it switches the selection of one of the areas. What happens here is that we have ties: even with very little weight assigned to the interior connections, the program will prioritize tied crime areas that are connected to other chosen hot spots (whereas before the ties were broken at random in the way I chose the top 1%). So if you have many ties at the threshold for your hot spot, this is a great way to prioritize particular tied areas.

What happens if we turn down theta to 0.5? So this is saying you would trade off one for one – one interior edge is equal to one crime.

You can see that it changed the selections slightly more here, trading off 24 areas compared to the original rank-only solution. Let’s check out the map and the CI:

The CI value is now 0.17 (up from 0.08). You can see some larger blobs, but it is still pretty spread apart. But the reduction in the total number of crimes captured is pretty small, going from a PAI of 17 to now a PAI of 16. How about if we crank down theta even more to 0.2?

This trades off a much larger number of areas and total amount of crime – over half of the chosen grid cells are flipped in this scenario. In the subsequent map you can see the hot spots are much more clumpy now, and have a CI of 0.64.

The PAI of 12.6 is a bit of a hit as well, but is not too shabby still. I typically take a PAI of 10 to be the ballpark of what is reasonable based on Weisburd’s Law of Crime Concentration – 5% of the areas contain 50% of the crime (which is a PAI of 10).

So this shows one linear programming approach to trade off clumpy chosen areas vs disconnected speckles over the map. It may be the case though that other approaches are more reasonable, such as using some type of clustering to begin with. E.g. I could use DBSCAN on the gridded predicted values (Wheeler & Reuter, 2020) and see how clumpy those hot spots are. This approach is nice though if you have a fixed area you want to cover.

Why Raleigh?

For a bit of personal news, I will be moving to the Raleigh area here shortly. I recently negotiated to be 100% remote at my job – so I will still be at HMS (or since we were recently purchased I might be employed by Gainwell I guess by the time I move). So looking forward to the new adventure back on the east coast but still in more temperate climates than PA or NY!


Incorporating treatment non-compliance into call-ins

I have previously published work on identifying optimal individuals to prioritize for call-ins in Focused Deterrence interventions. The idea is we want to identify optimal people to spread the message, so you call in a small number of individuals and they should spread the message to the remaining group. Some people are better than others to seed the message with, to make sure it spreads throughout the network.

I knew of a direct improvement on that algorithm I published (very similar to the TURF problem I described the other day). But the bigger issue was that even when you call in individuals they do not always come to the meeting – treatment non-compliance. When working with state parole and/or local probation, the police department can ask those agencies to essentially make people come in, but otherwise it is voluntary.

The TURF problem I did the other day gave me a bit of inspiration on how to tackle that treatment non-compliance problem though. In a nutshell, when you calculate whether someone is reached (via being directly connected to someone called in), they can be partially reached based on the selected nodes’ probability of treatment compliance. I have posted the code to follow along on dropbox here. I won’t go through the whole thing, but just some highlights.

The Model

First, in some quick and dirty text math, the model is:

Maximize Sum( R_i )

Subject to:

  • R_i <= Sum( S_j*p_j ) for each i
  • Sum( S_j ) = k
  • S_i element of {0,1}
  • R_i <= 1 for each i

Here i refers to an individual node in the gang/group network.

The first constraint R_i <= Sum( S_j*p_j ), the j’s are the nodes that are connected to i (and i itself). The p_j are the estimates that an individual will comply with coming into the call-in. For one agency we worked with for that project, they guessed that those who don’t need to come in comply about 1/6th of the time, so I use that estimate here in my examples, and give people who are on probation/parole a 1 for the probability of compliance.

The second constraint is that we can only call in so many people, here k. The model solves very fast, so you can generate results for various k until you get the reach you want in the end. (You could do the model the other way, minimize Sum( S_i ) while constraining the minimum acceptable reach, e.g. Sum( R_i ) >= threshold. I don’t suggest this in practice though, as when dealing with compliance there may be no feasible solution that gets you the amount of reach in the network you want.)

For the third constraint, the decision variables S_i are binary 0/1’s, but the R_i are continuous. But the trick here is that the last constraint, R_i <= 1, means that the expected reach is capped at 1. Here is a way to think about this, imagine you want to know the chance that person A is reached, and they are connected to two called-in individuals, who each have a 40% chance at complying with the treatment (coming to the call-in). The expected times person A would be reached then is additive in the probabilities, 0.4 + 0.4 = 0.8. If we had 3 people connected to A again at 40% apiece, the expected number of times A would be reached is then 0.4 + 0.4 + 0.4 = 1.2. So a person can be reached multiple times. (Note this is not the probability a person is reached at least once! It is a non-linear problem to model that.)
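A quick sketch of the numbers in that paragraph, contrasting the additive expected count, the capped version used in the linear program, and the probability of being reached at least once. The last quantity assumes each called-in person complies independently of the others, which is an assumption added here for illustration; the function names are made up.

```python
def expected_times_reached(probs):
    """Additive expected number of times a person is reached."""
    return sum(probs)

def capped_reach(probs):
    """What the linear program credits: the expected count capped at 1."""
    return min(1.0, sum(probs))

def prob_reached_at_least_once(probs):
    """Non-linear alternative, assuming independent compliance."""
    miss = 1.0
    for p in probs:
        miss *= (1 - p)
    return 1 - miss

for probs in ([0.4, 0.4], [0.4, 0.4, 0.4]):
    print(
        round(expected_times_reached(probs), 3),
        round(capped_reach(probs), 3),
        round(prob_reached_at_least_once(probs), 3),
    )
# 0.8 0.8 0.64
# 1.2 1.0 0.784
```

The product over misses is what makes the at-least-once version non-linear in the selection variables, which is why the capped sum is used as a stand-in.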

But if we took away the last constraint, what would happen is that the algorithm would just pick the nodes that had the highest number of neighbors. Since we are maximizing expected reach, if we had a sample of two people, the expected reach values of [2.5, 0] would be preferable to [1, 1], although clearly we would rather have the reach spread out. So to prevent that, I cap the expected reach variable at 1, R_i <= 1 for each i, which spreads out the selected individuals. So in the end the expected number of times people are reached is a lower bound estimate, but the undercount only applies to people who are expected to receive the message multiple times.
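Here is a toy illustration of why the cap matters. It brute-forces every pair of call-ins on a made-up eight person network, so it is a sketch of the objective and not the pulp model from the posted code. Nodes a and b are assumed to be on probation/parole (compliance of 1), everyone else complies 1/6th of the time, and all names are hypothetical.

```python
from itertools import combinations

# Hypothetical network: a dense cluster around a and b, plus a small group around e
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "h"),
         ("b", "c"), ("b", "d"), ("e", "f"), ("e", "g")]
comply = {n: 1 / 6 for n in "abcdefgh"}
comply["a"] = 1.0  # assumed on probation/parole
comply["b"] = 1.0  # assumed on probation/parole

# Closed neighborhoods: each node plus its direct connections
closed = {n: {n} for n in comply}
for u, v in edges:
    closed[u].add(v)
    closed[v].add(u)

def reach(selected, cap=True):
    """Sum over nodes of R_i, where R_i = Sum(p_j) over selected j connected to i."""
    total = 0.0
    for i in comply:
        r = sum(comply[j] for j in selected if j in closed[i])
        total += min(1.0, r) if cap else r
    return total

def best(k, cap=True):
    return max(combinations(sorted(comply), k), key=lambda s: reach(s, cap))

print(best(2, cap=False))  # ('a', 'b'): piles reach onto the same cluster
print(best(2, cap=True))   # ('a', 'e'): spreads to the uncovered group
print(round(reach(("a", "b")), 2), round(reach(("a", "e")), 2))  # 5.0 5.5
```

Without the cap, a and b together score 9 because the same five people are counted twice. With the cap, calling in e (even at 1/6 compliance) adds more than a redundant second seed in the cluster.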

This is a bit of a hack, but in my tests works quite well. I attempted to model the non-linear problem of estimating the probabilities at the person level and still maximizing the expected reach (in the code I have an example of using the CVXR R package). But it was quite fickle in when it would return a solution. So I am focusing on the linear program here, which is not perfect, but is an improvement over my prior published work.

Some Python Snippets

So for my example code, I am using City 4 Gang 4 from my paper. The reason is this was the largest network, and the one where my original algorithm performed the worst. It has 99 nodes, and my original algorithm identified a 33 person dominating set, but Borgatti’s tool (that uses a genetic algorithm) identified a 29 person dominating set.

Here is an example of calling my function to select the individuals for a call-in based on the non-compliance estimates. (g4 is the networkx graph object, the second arg is the number of individuals, and compliance is the node attribute that has the probability of treated compliance.) If we call in only 5 people, we still expect a reach of 29 individuals. Here there ends up being some highly connected people on parole/probation, so they have a 1 probability of complying with the treatment.

A consequence of this algorithm is that if you pipe in 1’s for the treatment compliance, you basically get an improvement to my original algorithm. So for a test we can see if I get the same minimal dominating set as Borgatti did for his algorithm here, where const is just everybody complies 100% of the time.

And yep we get a dominating set (all 99 people are reached). What happens if we go down one, and only select 28 people?

We only reach 98 out of the 99. So it appears a 29 set is the minimal dominating set here. But like I said the treatment non-compliance is a big deal in this setting. What is our expected reach if we take that into account, but still call-in 29 people?

It is still pretty high, around 2/3s of the network, but is still much smaller. Also if you look at the overlap between the constant versus non-compliance model, they select quite a few different individuals. It makes a big difference.

Here is a graph I made of selecting 20 individuals. Red means I selected that person, pink means they are reached at least some, and the size of the node is proportional to the expected reach. Then grey folks I wouldn’t expect to be reached by the message (at least by first degree connections).

So you can see that most of the people selected have that full 1 expected reach, so the algorithm does prioritize individuals on probation/parole who have a 100% expected compliance. But you can see a few folks who have a lower compliance who are selected as they are in places in the network not covered by those on probation/parole.

I have a tough time getting network layouts to look nice in python (even with the same layout algorithms, I feel like igraph in R just looks much better out of the box).

Future Work

Out of the box, this algorithm could incorporate several different pieces of information. So here I use the non-compliance estimate as a constant, but you could have varying estimates for that based on some other model no problem (e.g. older individuals comply more often than younger, etc.). Also another interesting extension (if you could get estimates) would be the probability a called-in individual spreads the message. In the part Sum( S_j*p_j ) it would just be something like Sum( S_j*p_cj*p_sj ), where p_cj is the compliance probability for attending, and p_sj is the probability to spread the message to those they are connected to.

Getting worthwhile estimates for either of those things will be tough though. Only way I can see it is via some shoe leather qualitative or survey approach.

A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?
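For a problem this small you can simply enumerate every menu, which makes the objective easy to see before moving to a linear program. The people, flavors, and preferences below are made up for the ice cream truck example.

```python
from itertools import combinations

# Hypothetical customers and the flavors each one likes
likes = {
    "ann": {"chocolate"},
    "bob": {"vanilla", "neapolitan"},
    "cat": {"strawberry"},
    "dan": {"chocolate", "vanilla", "strawberry", "neapolitan", "mint"},
    "eve": {"mint"},
    "fay": {"pistachio"},
}
flavors = sorted(set().union(*likes.values()))

def reach(menu):
    """Number of people who like at least one flavor on the menu."""
    menu = set(menu)
    return sum(1 for liked in likes.values() if liked & menu)

def best_menu(k):
    """Brute force: try every menu of k flavors and keep the best reach."""
    return max(combinations(flavors, k), key=reach)

print(reach(best_menu(2)))  # 3
print(reach(best_menu(5)))  # 6
```

Someone who likes nearly everything (dan) is reached by almost any menu, so the flavors that matter are the ones that pick up people with narrow tastes. Enumeration blows up quickly as the number of products grows, which is where the linear program comes in.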

You may think this is a bit far afield from my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if also there is an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis. Below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.


Here I am using pulp to fit the linear program, reading in the data, and making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.


import pandas as pd
import pulp
import os


#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

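#Need 27 total to match up with the simulated data
#(names and order match the cost_prod dictionary used later)
main = ['Steak','Pizza','FriedChicken','BBQChicken','GrilledSalmon',
        'FriedHaddock','LemonHaddock','Roast','Burger','Wings']

sides = ['CeaserSalad','IcebergSalad','TomatoSoup','OnionSoup','Peas',
         'BrusselSprouts','GreenBeans','Corn','DeviledEggs','Pickles']

desserts = ['ChoclateIceCream','VanillaIceCream','ChocChipCookie',
            'OatmealCookie','Brownie','Blondie','CherryPie']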

#Renaming columns
surv_data.columns = main + sides + desserts

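#Replacing the likert scale data with 0/1
#(assumed cutoff, the recode itself is not shown here: ratings of 4 or 5 count as liked)
surv_data = (surv_data >= 4).astype(int)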

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.


k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model
P.solve()

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )
print(picked)


#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people

One of the things I have done here is to create a ‘weight’ variable associated with each customer. So here I say all of the customers weights are all equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however consider the likert scale rating 1 to 5 directly, subtract 1 and divide by 4. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_i decision variable to a continuous value, and then maybe cap it at 1 (so you can partially reach customers).
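A small sketch of that rescaling and the capped partial reach for a single customer. The ratings and the helper names `likert_weight` and `partial_reach` are made up for illustration. In the linear program the continuous reach variable would settle at this capped sum, since the objective pushes it up against whichever bound is tighter.

```python
def likert_weight(rating):
    """Rescale a 1 to 5 likert rating to the 0 to 1 range."""
    return (rating - 1) / 4

def partial_reach(ratings, menu):
    """Sum of the weights for the selected items, capped at 1."""
    total = sum(likert_weight(ratings[item]) for item in menu)
    return min(1.0, total)

# Hypothetical ratings for one customer
ratings = {"Steak": 5, "Peas": 2, "Corn": 3}

print(partial_reach(ratings, ["Peas", "Corn"]))   # 0.75
print(partial_reach(ratings, ["Steak", "Corn"]))  # 1.0
```

So two lukewarm items can add up to a mostly reached customer, while a single top-rated item reaches them fully.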

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.


#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts (but could have 0)
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
P2.solve()
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
print(picked2)
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.
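As a sketch of what that trade-off curve looks like, here is a brute-force version on made-up survey data with five items. In practice you would re-solve the pulp model for each k; the enumeration here just stands in for the solver so the example is self-contained.

```python
from itertools import combinations

# Hypothetical survey: the set of items each respondent likes
likes = [
    {"Peas"},
    {"Peas", "Corn"},
    {"Corn", "Wings"},
    {"Wings"},
    {"Blondie"},
    {"Peas", "Blondie"},
    {"Pizza"},
]
items = sorted(set().union(*likes))

def reach(menu):
    """Respondents who like at least one item on the menu."""
    menu = set(menu)
    return sum(1 for liked in likes if liked & menu)

def reach_curve(max_k):
    """Best achievable reach for each menu size from 1 to max_k."""
    return {k: max(reach(m) for m in combinations(items, k))
            for k in range(1, max_k + 1)}

print(reach_curve(5))  # {1: 3, 2: 5, 3: 6, 4: 7, 5: 7}
```

The curve flattens once everyone is reached, so in this toy data a fifth item buys nothing.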

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc, but each has a different cost. So instead of the constraint Sum( Prod_j ) == k where you select so many products, you have the constraint Sum( Prod_j*Cost_j ) <= Budget, where each product is associated with a particular cost.
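To show the budget version of the constraint, here is a brute-force sketch over a handful of made-up ad channels. The costs, audiences, and the function name `best_within_budget` are all hypothetical, and enumerating every subset only works because there are four channels.

```python
from itertools import combinations

# Hypothetical ad channels: (cost, set of people reached)
ads = {
    "billboard": (5, {1, 2, 3, 4}),
    "radio": (3, {3, 4, 5, 8}),
    "newspaper": (2, {5, 6}),
    "flyers": (1, {7}),
}

def reach(combo):
    """Unduplicated count of people reached by the chosen channels."""
    return len(set().union(*(ads[a][1] for a in combo)))

def best_within_budget(budget):
    """Maximize reach subject to Sum(cost of selected channels) <= budget."""
    feasible = (
        combo
        for r in range(len(ads) + 1)
        for combo in combinations(sorted(ads), r)
        if sum(ads[a][0] for a in combo) <= budget
    )
    return max(feasible, key=reach)

choice = best_within_budget(6)
print(choice, reach(choice))  # ('flyers', 'newspaper', 'radio') 6
```

Note the number of channels selected is no longer fixed: three cheap channels beat the billboard plus flyers, which costs the same but reaches only five people.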

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach subject to a limit on the number of products or on cost). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people; in the code I set the target to n = 100, a bit over 50%). Our formula would then be below (where I make up slightly different costs for each of the buffet options).


#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results
Pmin.solve()

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
cost = 0 #running total, should match the objective value
for num,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (num+1,j,cost_prod[j]) )
        cost += cost_prod[j]
print(picked)
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we need at a minimum to serve only 3 items to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate constraints for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I’m wondering as well if there is a way to incorporate marginal benefits of sales into this as well, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won’t necessarily buy every flavor if I visit the shop). Similar for marketing adverts some forms may have better reach, but may have worse conversion rates.

Creating high crime sub-tours

I was nerdsniped a bit by this paper, Targeting Knife-Enabled Homicides For Preventive Policing: A Stratified Resource Allocation Model by Vincent Hariman and Larry Sherman (HS from here on).

In it, HS attempt to define a touring schedule based on knife crime risk at the lower super output area in London. So here are the identified high risk areas:

And here are HS’s suggested hot spot tours schedule.

This is ad-hoc, but an admirable attempt to figure out a reasonable schedule. As you can see in their tables, the ‘high’ knife crime risk areas still only have a handful of homicides, so if reducing homicides is the objective, this program is a bit dead in the water (I’ve written about the lack of predictive ability of the model here).

I don’t think defining tours to visit everywhere makes sense, but I do think a somewhat smaller in scope question, how to figure out geographically informed tours for hot spot areas does. So instead of the single grid cell target ala PredPol, pick out multiple areas to visit for hot spots. (I don’t imagine the 41 LSOA areas are geographically contiguous either, e.g. it would make more sense to pick a tour for areas connected than for areas very far apart.)

Officers don’t tend to like single tiny areas either really, and I think it makes more sense to widen the scope a bit. So here is my attempt to figure those reasonable tours out.

Defining the Problem

The way I think about that problem is like this, look at the hypothetical diagram below. We have two choices for the hot spot location we are targeting, where the crime counts for locations are noted in the text label.

In the select the top hot spot (e.g. PredPol) approach, you would select the singlet grid cell in the top left, it is the highest intensity. We have another choice though, the more spread out hot spot in the lower right. Even though it is a lower density, it ends up capturing more crime overall.

I subsequently formulated an integer linear program to try to tackle the problem of finding good sub-tours through the graph that cumulatively capture more crime. So with the above graph, if I select two subtours, I get the results as (where nodes are identified by their (x,y) position):

  • ['Begin', (1, 4), 'End']
  • ['Begin', (4, 0), (4, 1), (3, 1), (3, 0), (2, 0), 'End']

So it can select singlet areas if they are islands (the (1,4) area in the top left), but will grow to wind through areas. Also note that the way I have programmed this network, it doesn’t skip the zero area (4,1) (it needs to go through at least one in the bottom right unless it doubles back on itself).

I will explain the meaning of the begin and end nodes below in my description of the linear program. It ends up being sort of a mash-up of traveling salesman type vehicle routing and min cost max flow type problems.

The Linear Program

The way I think about this problem formulation is like this: we have a directed graph, in which you can say, OK I start from location A, then can go to B, then go to C. In my set of decision variables, I have choices that look like this, where the first subscript denotes the from node, and the second subscript denotes the to node.

D_ab := node a -> node b
D_bc := node b -> node c

etc. In our subsequent linear program, the destination node is the node that we calculate our cumulative crime density statistics. So if node B had 10 crimes and 0.1 square kilometers, we would have a density of 100 crimes per square kilometer.

Now to make this formulation work, we need to add in a set of special nodes into our usual location network. These nodes I call ‘Begin’ and ‘End’ nodes (you may also call them source/sink nodes though). The begin nodes all look like this:


So you do that for every node in your network. Then you have End nodes as well, e.g.


In this formulation, since we are only concerned about the crime stats for the to node, not the from node, the Begin nodes just inherit the crime density stats from the original node data. For the end nodes though, you just set their objective value stats to zero (they are only relevant to define the constraints).

Now here is my linear program formulation:

  Maximize Sum [ D_ij ( CrimeDensity_j - DensityPenalty_j ) ]

Subject To:

 1. Sum( D_in for each neighbor of n ) <= 1, 
      for each original node n
 2. Sum( D_in for each neighbor of n ) =  Sum( D_ni for each neighbor of n ), 
      for each original node n
 3. Sum( D_bi for each begin node ) = k routes
 4. Sum( D_ie for each end node ) = k routes
 5. Sum( D_ij + D_ji ) <= 1, for each unique i,j pair
 6. D_ij is an element of {0,1}

Constraint 1 is a flow constraint. If a node has an incoming edge set to one, it cannot have any other incoming edge set to one (so a location can only be chosen once).

Constraint 2 is a constraint that says if an incoming edge is selected for a node, one of its outgoing edges needs to be selected as well (flow in equals flow out).

Constraints 3 & 4 determine the number of k tours/routes to choose in the end. Since the begin/end nodes are special we have k routes going out of the begin nodes, and k routes going into the end nodes.

With just these constraints, you can still get micro-cycles I found. So something like, X -> Z -> X. Constraint 5 (for only the undirected edges) prevents this from happening.

Constraint 6 is just setting the decision variables to binary 0/1. So it is a mixed integer linear program.
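To make the constraints a bit more concrete, here is a minimal sketch that checks a candidate set of tours against constraints 1 through 5. The function name `check_tours` and the list-of-nodes representation are my own for illustration (they are not from the posted code), and I use a single 'Begin'/'End' label as in the toy output above.

```python
from collections import Counter

def check_tours(tours, k):
    """Check tours (node lists wrapped in 'Begin'/'End') against constraints 1-5."""
    # Each consecutive pair in a tour is a selected directed edge, D_ij = 1
    edges = [(t[i], t[i + 1]) for t in tours for i in range(len(t) - 1)]
    inflow = Counter(j for i, j in edges if j != 'End')
    outflow = Counter(i for i, j in edges if i != 'Begin')
    nodes = set(inflow) | set(outflow)
    res = {}
    # 1: each original node has at most one incoming edge
    res['c1'] = all(inflow[n] <= 1 for n in nodes)
    # 2: flow in equals flow out for each original node
    res['c2'] = all(inflow[n] == outflow[n] for n in nodes)
    # 3 & 4: exactly k edges leave Begin and k edges enter End
    res['c3'] = sum(1 for i, j in edges if i == 'Begin') == k
    res['c4'] = sum(1 for i, j in edges if j == 'End') == k
    # 5: no micro-cycles, never both i -> j and j -> i
    edge_set = set(edges)
    res['c5'] = not any((j, i) in edge_set for i, j in edge_set)
    return res

# The two sub-tours from the toy example
tours = [['Begin', (1, 4), 'End'],
         ['Begin', (4, 0), (4, 1), (3, 1), (3, 0), (2, 0), 'End']]
print(check_tours(tours, k=2))

# A tour that doubles back on itself violates constraints 1 and 5
bad = [['Begin', (0, 0), (0, 1), (0, 0), 'End']]
print(check_tours(bad, k=1))
```

This only validates a solution after the fact; it is the solver that actually searches over the edges.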

The final thing to note is the objective function, I have CrimeDensity_j - DensityPenalty_j, so what exactly is DensityPenalty? This is a value that penalizes visiting areas that are below this threshold. Basically the way that this works is that, the density penalty sets an approximate threshold for the minimum density a tour should contain.

I suggest a default of a predictive accuracy index of 10. Where do I get 10 you ask? Weisburd’s law of crime concentration suggests 5% of the areas should contain 50% of the crime, which is a PAI of 0.5/0.05 = 10. In my example with DC data then I just calculate the actual density of crime per unit area that corresponds to a PAI of 10.
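As a small sketch of that conversion, here is how a target PAI translates into a crime density threshold. The totals below are made up for illustration (they are not the DC numbers), and the function names are my own.

```python
def pai(crime_in, area_in, crime_total, area_total):
    """Predictive accuracy index: share of crime divided by share of area."""
    return (crime_in / crime_total) / (area_in / area_total)

def density_for_pai(target_pai, crime_total, area_total):
    """Crime density (crimes per unit area) that corresponds to a target PAI."""
    return target_pai * crime_total / area_total

# Hypothetical totals for a whole city
crime_total = 5000      # crimes
area_total = 150.0      # square kilometers

penalty = density_for_pai(10, crime_total, area_total)
print(round(penalty, 1))  # crimes per sq km needed to hit a PAI of 10

# Check: an area with 10% of the crime in 1% of the space has a PAI of 10
print(round(pai(500, 1.5, crime_total, area_total), 2))
```

So the density penalty is just the citywide density multiplied by the PAI you want the tours to reach.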

You can adjust this though, if you prefer smaller tours of higher crime density you would up the value. If you prefer longer tours decrease it.

This is the best way I could figure out how to trade off the idea of spreading out the targeted hot spot vs selecting the best areas. If you spread out you will ultimately have a lower density. This turns it into a soft objective penalty to try to keep the selected tours at a particular density threshold (and will scoop up better tours if they are available). For a while I tried to figure out if I could maximize the PAI metric, but with larger areas the PAI will always go down, so you need to define the objective some other way.

In this formulation I only consider linked nodes (unlike the usual traveling salesman in which it is a completely linked distance graph). That makes it much more manageable. If you have N as the number of nodes/areas, and E as the number of directed edges between those areas, we will then have:

  • 2*N + E decision variables
  • 2 + 2*N + E/2 constraints

Generally if you are doing directly connected areas in geographic networks (i.e. contiguity connections), you will have less than 8 (typically more like an average of 6) neighbors per each area. So in the case of the 4k London lower super output areas, if I chose tours I would guess it would end up being fewer than 2*4,000 + 8*4,000 = 40,000 decision variables, and then fewer than that constraints.
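Here is that back-of-the-envelope count as a tiny helper, just applying the two formulas above. The function name is mine, and the London inputs are the rough upper bounds from the paragraph (4,000 areas, 8 neighbors each), not actual data.

```python
def problem_size(n_nodes, n_directed_edges):
    """Decision variables and constraints for the sub-tour program."""
    # One Begin edge and one End edge per node, plus the directed edges
    n_vars = 2 * n_nodes + n_directed_edges
    # 2 route-count constraints (3 & 4), 2 per node (1 & 2),
    # and 1 per undirected pair of neighbors (5)
    n_cons = 2 + 2 * n_nodes + n_directed_edges // 2
    return n_vars, n_cons

print(problem_size(4000, 8 * 4000))  # (40000, 24002)
```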

Since that is puny (and I would suggest doing this at a smaller geographic resolution) I tested it out on a harder network. I used the data from my dissertation, a network of 21,506 street units (both street segments and intersections) in Washington, D.C. The contiguity I use for these micro units is based on the Voronoi tessellation, so it tends to have more neighbors than you would with a strictly road based network connectivity. Still in the end it ends up being a shade fewer than 200k decision variables and 110k constraints. So I think it is a better test of whether the problem can feasibly be solved in the wild.

Example with DC Data

Here I have posted the python code and data used for this analysis, I end up having a nice function that you just submit your network with the appropriate attributes and out pops the different tours.

So I end up doing examples of 4 and 8 subtours based on 2011 violent UCR crime data (agg assaults, robberies, and homicides, no rapes in the public data). I use for the penalty that PAI = 10 threshold, so it should limit tours to approximately that value. It only ends up taking 2 minutes for the model to converge for the 4 tours and less than 2.5 minutes for the 8 tours on my desktop. So it should not be a big problem to up the decision variables to more sub-areas and still be solvable in real life applications.

The area estimates are in square meters, hence the high numbers. But on the right you can see that each sub-tour has a PAI above 10.

Here is an interactive map for you to zoom into each subtour in the 4 subtour example. Below is a screenshot of one of the subtours. You can see that since I have defined my connected areas in terms of Voronoi tessellations, they don’t exactly follow the street network.

For the 8 tour example, it ends up returning several zero tours, so it is not possible in this data to generate 8 sub-tours that meet that PAI >= 10 threshold.

You can see that it ends up being the tours have higher PAI values, but lower overall crime counts.

You may think, why does it not pick at least singlet areas with at least one crime? It ends up being that I weight areas here by their area (this formulation would be better with grid cells of equal area), so my objective function is technically Sum [ D_ij * w_j *( CrimeDensity_j - DensityPenalty_j ) ], where w_j is the percent of the total area (so the denominator in the PAI calculation). So it ends up picking areas that are the tiniest areas, as they result in the smallest penalty to the objective function (w_j is tiny). I think this is OK though in the end – I would rather know that some of the tours are worthless.
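A quick sketch of why the tiny areas win out, using the weighted term w_j * ( CrimeDensity_j - DensityPenalty_j ). All the numbers here are hypothetical, and `contribution` is just a name I made up for the per-area objective term.

```python
def contribution(crimes, area, total_area, penalty):
    """Area-weighted objective term, w_j * (density_j - penalty)."""
    w = area / total_area        # share of the total area
    density = crimes / area      # crimes per unit area
    return w * (density - penalty)

total_area = 1000.0   # hypothetical total study area
penalty = 5.0         # hypothetical density threshold

tiny_zero = contribution(0, 1.0, total_area, penalty)    # tiny area, no crime
big_zero = contribution(0, 20.0, total_area, penalty)    # large area, no crime
big_one = contribution(1, 20.0, total_area, penalty)     # large area, one crime

print(round(tiny_zero, 4), round(big_zero, 4), round(big_one, 4))
```

All three are negative (none meet the threshold), but the tiny zero-crime area costs the least, so if the program is forced to return another tour that is the one it grabs.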

You can also see I get one subtour that is just under the PAI 10 threshold. Again possible here, but should be only slightly below in the worst case scenario. The way the objective function works is that it is pretty tricky to pick out subtours below that PAI value but still have a positive contribution to the overall objective function.

Future Directions

The main thing I wish I could do with the current algorithm (but can’t the way the linear program is set up), is to have minimum and maximum tour area/length constraints. I think I can maybe do this by adapting this code (I’m not sure how to do the penalties/objectives though). So if others have ideas let me know!

I admit that this may be overkill, and maybe just doing more typical crime clustering algorithms may be sufficient. E.g. doing DBSCAN hot spots like I did here.

But this is my best shake at the problem for now!

An example of soft constraints in linear programming

Most of the prior examples of linear programming on my site use hard constraints. These are examples where I say to the model, “only give me results that strictly meet these criteria”, like “only select 40 cases to audit”, or “keep the finding rate over 50%”, etc.

There are alternative ways though to tell the model, “I want to select a finding rate over 50%, but still potentially consider those with lower finding rates”. One way to do that is via soft constraints, modifying the objective function directly to penalize (or favor) particular outcomes. For example, say you knew you could translate a 1% finding rate difference over 50% to a value of $1000. So if our original model is:

Maximize Sum{D_i*Return_i} 

Subject To
  D_i element of (0,1) #decision variables are 0/1
  Sum{D_i} = 100       #so select 100 cases

We would then place an additional penalty term that looks like this:

Maximize Sum{D_i*Return_i} + Sum{D_i*[(prob_i - 0.5)*100*1000]}  

Subject To
  D_i element of (0,1)
  Sum{D_i} = 100

So instead of a subject to constraint that says we need to be over 50% finding rate, we added a second penalty term for solutions that have an under 50% finding rate. So here each selected case with a 49% finding rate takes a hit of $1000 in the objective function. This example is also similar to a bi-objective function; here I just set an exact translation between finding rates and returns, but in practice you often don’t have that exact translation.
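Here is a minimal sketch of that soft constraint on made-up data. Since the only hard constraint left is "select k cases", the optimum is simply the top k by the adjusted score, so I skip the solver and sort instead. I measure the gap from 50% in percentage points so that one point is worth `per_point` dollars, as described in the text; the function name and the numbers are mine.

```python
def pick_cases(returns, probs, k, per_point=1000.0):
    """Pick k cases maximizing return plus a soft finding-rate penalty."""
    # (p - 0.5) * 100 is the gap from 50% in percentage points
    score = [r + (p - 0.5) * 100 * per_point for r, p in zip(returns, probs)]
    # With only a 'select k' constraint, the optimum is the top k scores
    order = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
    return sorted(order[:k])

returns = [9000, 8000, 5000, 4000]   # hypothetical expected returns
probs = [0.30, 0.55, 0.60, 0.90]     # hypothetical finding rates

print(pick_cases(returns, probs, 2, per_point=0))     # [0, 1]
print(pick_cases(returns, probs, 2, per_point=1000))  # [2, 3]
```

With no penalty you take the two biggest returns; with the penalty the low finding rate case gets pushed out in favor of the high finding rate ones.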

It just depends on your situation whether hard constraints or soft constraints make sense. In many situations you can swap one for the other, so different means same ends. For a good example of this, my allocating police resources paper on reducing disproportionate minority contact uses hard constraints (Wheeler, 2020), and George Mohler and colleagues have a very similar paper which uses soft constraints (Mohler et al., 2018). I imagine these will end up being very similar ends, although in that circumstance I prefer my hard constraint approach, as with George’s you need to fiddle with the magnitude of the penalty term. Also I don’t tend to like changing the loss function for statistical/machine learning models, I just like changing what you do with the info after you have fit your model (Kleinberg et al., 2018).

Here I provide an example of where I think soft constraints make a bit of sense though. Imagine you have continuous predictive outputs, you need to make a binary yes/no decision among those options, but those predictive outputs also have a variance. An example of where this comes up is if you are making loan decisions, you want your portfolio to have a high return, but you also want to lower the variance of those returns as well.

For a simple example, imagine you are the lending institution and you have the choice between two scenarios:

  • Scenario A: lend 1 person $100,000 with an expected return of $8,000, with a variance of $4,000
  • Scenario B: lend 2 people $50,000, with an expected return of each for $4,000 each, with a variance of $1,000 for each loan

Since you expect to make the same amount of money under each scenario, option B is preferable if the loans are independent of each other (e.g. one going under does not cause the other to go under). In that case, variances are additive, so the total variance of option B is $2,000, so has much less volatility than does the A scenario.

(Sorry to my criminology friends, this example is generic but I strain to find a criminal justice example to apply it to. It would not be crazy that you have low volatility vs high volatility hot spots. So you may want to choose a consistent hot spot as opposed to a fleeting one for an intervention. But I don’t think that will happen in practice quite like that. Choosing among expensive high risk/reward vs inexpensive low risk/reward treatment regimes in corrections settings may also make sense, but that is crazy pie in the sky technocratic given the current state of affairs as well.)

Example with Lending Club Data

So to illustrate an example with actual data, I’ve provided Python code fitting a predictive model to Lending Club data on loans. (I got the original dataset from Kaggle.) I am just going to highlight some key points here in the blog post. You will need to go to the code to see everything.

First, I’ve been introduced to this dataset as predicting a binary default/no-default. I have code doing that in the code snippet as well, and it performed OK. But it was very uncalibrated as to whether my portfolio made money – so even though the default estimates were pretty well spot on, my portfolios did not make much money. This is because people who default pay back some of their loans, and also quite a few people in the dataset pay back the loans fast, so the lenders don’t make as much interest as you would expect at the start of the loan.

So I cut out the middle man and just estimated a random forest model predicting the actual money one made on the loan. I only kept cases that are either 'Fully Paid','Charged Off', 'Default', so I don’t model loans that are still ongoing. I end up modeling then the value total_pymnt - loan_amnt. You can look into the code to see the variables I included in the model, but one of the neat things about regression random forests is that you can not only get the mean prediction, you can also look at the variance over all the trees. See below a function to do that (in the 01 py file):

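#Libraries used in this snippet
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

#Fit random forest model (train/test, x_vars, y_var are prepared in the 01 py file)
model = RandomForestRegressor(n_estimators=1000, 
                              min_samples_leaf=100)
model.fit(train[x_vars], train[y_var])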

#Check the predicted vs actual on the test set
y_pred = model.predict(test[x_vars]) #predicted mean
test['y_pred'] = y_pred

#I want an estimate of the variance
def tree_var(X, rf_mod):
    per_tree_pred = [pd.Series(tree.predict(X), index=X.index) for tree in rf_mod.estimators_]
    pd_res = pd.concat(per_tree_pred, axis=1)
    pd_var = pd_res.var(axis=1)
    return pd_var

test['y_var'] = tree_var(X=test[x_vars], rf_mod=model)

And that predicted value and variance are then what I feed into my subsequent linear programming problem (in the 02 py file). The model in some more text is:

Maximize Sum{D_i*(prediction - lambda*variance)}

Subject To:
  D_i element of (0,1)              #decision variables
  Sum{D_i*loan_amount_i} <= 300000  #only have so much $ to loan, so no leveraging

Where lambda is the tuner – higher variances will get higher penalties. So going back to our two loan example, if lambda = 1, scenario A it would be 8000 - 1*4000 = 4000, and scenario B would be 8000 - 1*2000 = 6000, so that penalty would choose scenario B over A. Whereas without the penalty the two scenarios are exactly the same.
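To see the penalty at work on the two loan scenarios, here is a brute-force sketch. It enumerates every subset of loans under the budget instead of calling a solver, which is only workable for toy problems like this one. The dollar figures come from scenarios A and B; the function name and the $100,000 budget are my own choices for the illustration.

```python
from itertools import combinations

# name -> (loan amount, expected return, variance), from scenarios A and B
loans = {'A': (100000, 8000, 4000),
         'B1': (50000, 4000, 1000),
         'B2': (50000, 4000, 1000)}

def best_portfolio(loans, budget, lam):
    """Enumerate every subset within budget, return the best penalized one."""
    best, best_score = (), 0.0   # the empty portfolio scores zero
    names = sorted(loans)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            if sum(loans[n][0] for n in combo) > budget:
                continue         # violates the budget constraint
            # Loans assumed independent, so the variances simply add up
            score = sum(loans[n][1] - lam * loans[n][2] for n in combo)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score

print(best_portfolio(loans, 100000, lam=1))  # (('B1', 'B2'), 6000)
```

With lam=0 the two scenarios tie at 8000 (the sketch just returns whichever it finds first), so it is only the variance penalty that breaks the tie in favor of the two smaller loans.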

Since the lambda value is arbitrary, I illustrate the approach selecting portfolios of loans that are a total of $300,000 (I divided the loans by $1000 to make the numbers a bit easier to view). This is a totally held out sample of around 5k loans. So you can see my first model (with no constraints):

And my second model with a higher lambda value of 1 selects more (smaller) loans, and reduces the variance. You can see since we have the actual outcomes, I can show that both portfolios turned a profit, each above what I predicted. But the standard deviation for the second portfolio is cut not quite in half.

So you can see in that one selection it worked out OK, but this does not verify that my variance estimates are correct (they are no doubt too small, as you can see the actual returns are way higher than I predicted).

To test them out though I do a simulation. I draw 1000 cases out of those 5000, and then again pick $300k in loans. I do that process 1000 times under a set of different lambda penalty terms, [0, 0.5, 1.0, 2.0, 3.0]. For those simulations, here is the overall distribution of the returns under different penalty terms for the variance.

Note that those histograms have different X axes. It is easier to see the moments of the spread in a boxplot:

So here you can see each scenario has pretty near the same median return (somewhere around $30k), but the penalties reduce the variance. The higher penalties end up selecting portfolios that always at least make money, whereas with the lower penalty terms you do end up losing money in some scenarios.

Unfortunately I did not beat the market in my simple weekend experimentation, so don’t sink a bunch of money into Lending Club based on this! The average returns starting from $300k are something like an annualized rate of return of around 3% (over 3 years) for the smaller simulation pick from 1000. These include loans of 60 months as well though. So even though my linear program with the penalty term did a good job of reducing the risk, these returns aren’t good enough for me to put a bunch of my money into Lending Club.

But there are no doubt improvements to be made both to the modeling as well as the portfolio selection. For modeling I would be tempted to try out a discrete time survival model for payments over time, but that would be more work than I could do in a weekend. (Also I only incorporated the easy continuous variables here I could prepare in just a few minutes, so maybe more feature engineering would boost my results.)

I could also adapt the linear program to take into account covariance between the loans, but not sure how to estimate them (a multi-level model perhaps?). You also may want to do some sort of conditional value at risk approach in the linear program, say instead of piping in the variance from random forests, count up how often you lose money and put that as a penalty or constraint on the system.


An intro to linear programming for criminologists

Erik Alda made the point the other day on twitter that we are one of the few crim folks that do anything related to linear programming. I think it is crazy useful – much more so than say teaching myself some new regression technique or a programming language.

I don’t quite remember the motivation to learn it. I think I kept seeing repeated applications in papers I read, but was also totally baffled by it; I did not understand people’s notation for it at all. In retrospect that was because it is not statistics. You are optimizing a function by estimating some parameters (there is nothing stochastic about it, so there is no statistical inference). So it is more like finding the min/max of a function in calculus.

I think the best way to think about linear programming is in terms of decision analysis. We have a set of options among which we need to choose some action. So we make the choices that either maximize or minimize some objective, but also take into account constraints on the decisions we can make.

For social scientists here is an example that hopefully illustrates the difference between statistics and linear programming. Say we are interested in conducting a hot spots policing randomized experiment. So we define our top 20 crime hot spots in the city, and randomly assign 10 of them to receive the hot spots treatment. Linear programming is basically the obverse of this: given our 20 hot spot areas, which are the best 10 locations to choose for our intervention?

This problem as stated you might be thinking is trivial – just rank each of the 20 hot spots by the total number of crimes, and then choose the top 10. Where linear programming really helps though is if you have constraints on the final choices you make. Say you did not want to choose hot spots that are within 1 mile of each other (to spread out the hot spot interventions throughout the city). There is no simple way to sort your hot spots to obey that constraint, but you can encode that in the linear program and have the computer solve it quite easily.
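As a toy sketch of that distance constraint, here is a brute-force search over five made-up hot spots (crime counts and locations are invented, and `choose_spots` is a name I picked). With only a handful of candidates you can enumerate every combination; the point of the linear program is to get the same answer when enumeration is not feasible.

```python
from itertools import combinations
from math import dist

# Hypothetical hot spots: name -> (crime count, (x, y) location in miles)
spots = {'h1': (50, (0.0, 0.0)),
         'h2': (45, (0.5, 0.0)),
         'h3': (30, (3.0, 0.0)),
         'h4': (20, (3.0, 2.0)),
         'h5': (10, (6.0, 6.0))}

def choose_spots(spots, k, min_dist):
    """Best k hot spots by total crime, with every pair at least min_dist apart."""
    best, best_crime = None, -1
    for combo in combinations(sorted(spots), k):
        far_enough = all(dist(spots[a][1], spots[b][1]) >= min_dist
                         for a, b in combinations(combo, 2))
        total = sum(spots[s][0] for s in combo)
        if far_enough and total > best_crime:
            best, best_crime = combo, total
    return best, best_crime

print(choose_spots(spots, 2, min_dist=0.0))  # (('h1', 'h2'), 95)
print(choose_spots(spots, 2, min_dist=1.0))  # (('h1', 'h3'), 80)
```

Without the constraint you get the simple top 2 ranking; with it, h2 is dropped because it sits half a mile from h1.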

There is no shortage of ways you could expand the complexity of this example hot spot decision analysis. Say you had two different types of hot spot treatments, and that they had different efficacy in different areas (one was good for property crime, and the other was better for violent crime). You might think of this as doing two separate decision analyses, where a constraint is that an area can only be assigned one of the two interventions.

Here I will provide some code examples in python using the pulp library to illustrate some more examples using data you can see in action, as well as different ways to think about linear programming problems in practice. (Technically the examples I give are all mixed integer linear programs, as the decision variables are binary 0/1.)

Formulating Objectives and Constraints

For this example I will be simulating data, but imagine a case you are an analyst for the IRS, and you want to determine which business tax returns to audit. We want to audit cases that both have a high probability of being fraudulent, as well as cases in which the total amount of the underpayment is large. (If you want a more typical criminology example, imagine assigning criminal cases to detectives, some cases have more costs, e.g. homicide vs burglary, and some cases have different probabilities of being solvable. This type of decision problem is very common in my experience – pretty much any time you have to make a binary choice, and those choices have variable costs/benefits.)

First I start off by simulating some data (the only libraries we need are numpy and pulp). So I simulate 1000 business tax returns, which have an estimate of the probability they are fraud, prob_fraud, and an estimate of the amount they underpaid, underpay_est.

import numpy as np
import pulp

#Simulate data for costs and probabilities

total_cases = 1000
underpay_est = np.random.uniform(1000,100000,total_cases)
prob_fraud = np.random.uniform(0,1,total_cases)
exp_return = prob_fraud*underpay_est


The objective we will be maximizing then is the expected return of auditing a tax return, exp_return, which is simply the probability of fraud multiplied by the amount of the underpayment. For a simple example, say we have a case where fraud is estimated to be 50%, and the estimate of the underpayment amount is $10,000. So our expected return for auditing that case is $5,000.

We need these two estimates external to our linear programming problem, and they themselves can be informed by predictive models (or simpler estimates, e.g. underpayment is proportional ~30% of deductions or something like that).

Now we have all we need to set up our linear programming problem. I am going to choose 100 cases out of these 1000 to audit. Hopefully the code is documented enough to see the creation of the decision variables (each tax return either gets a 1 if it is chosen, or a 0 if it is not), the original objective function that we are maximizing, and the results.

#Setting up the problem
case_index = list(range(total_cases))
tot_audit = 100

#Basic Problem
P = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
D = pulp.LpVariable.dicts("Decision Variable", [i for i in case_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Objective Function
P += pulp.lpSum( D[i]*exp_return[i] for i in case_index)
#Constraint on total number of cases audited
P += pulp.lpSum( D[i] for i in case_index ) == tot_audit
#Solve the problem
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return
print( (dec_np * exp_return).sum() )
#Should be the same
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )

If you are following along in python, you will see that the total expected return is 7,287,915, and the estimated hit rate (or clearance rate) of the audits is around 0.88.

This example would be no different if we just chose the top 100 cases based on the expected return. Say that you thought the hit rate though of 88% was too low. You will choose cases that are big dollar amounts, but not necessarily a very high probability. So you may say I want my clearance rate to be over 90% overall. That is a simple constraint to add into the above model.

#Updating the problem to constrain on the hit rate
#Above a particular threshold
hit_rate = 0.9
P += pulp.lpSum( D[i]*prob_fraud[i] for i in case_index ) >= hit_rate*tot_audit
#Re-solve with the added constraint
P.solve()
#Get the decision variables
dec_list = [D[i].varValue for i in case_index]
dec_np = np.asarray(dec_list)
#Expected return is slightly lower than before
print( pulp.value(P.objective) )
#Hit rate of cases
print( (dec_np * prob_fraud).sum()/tot_audit )

So now the total expected return is lower than without the constraint, 7,229,140 (so a reduction of about $60k), but our expected hit rate is just above 90%.

You may be thinking, “why not just eliminate cases with a probability of lower than 90%”, and then amongst those left over select the highest expected return. That meets your constraints, but has a lower expected return than this program! Think of this program as more tit-for-tat. High expected return / lower probability audits can still be selected with this model, but you need to balance them out with some high probability cases in response to tip the scales to meet the overall hit rate objective.
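A tiny made-up example shows the difference between the two approaches. I use brute-force enumeration here rather than pulp, since with four cases you can just check every pair; the probabilities and dollar amounts are invented for illustration.

```python
from itertools import combinations

# Hypothetical cases: (probability of fraud, estimated underpayment)
cases = [(0.85, 100000), (1.00, 20000), (0.92, 30000), (0.97, 10000)]
k, hit_rate = 2, 0.9

def exp_ret(idx):
    """Total expected return for a set of case indices."""
    return sum(cases[i][0] * cases[i][1] for i in idx)

# Approach 1: drop every case under 90%, then take the top k by expected return
keep = [i for i, (p, u) in enumerate(cases) if p >= hit_rate]
filtered = sorted(keep, key=lambda i: cases[i][0] * cases[i][1], reverse=True)[:k]

# Approach 2: only require the average probability of the chosen set to be >= 90%
feasible = [c for c in combinations(range(len(cases)), k)
            if sum(cases[i][0] for i in c) >= hit_rate * k]
constrained = max(feasible, key=exp_ret)

print(sorted(filtered), round(exp_ret(filtered)))  # [1, 2] 47600
print(constrained, round(exp_ret(constrained)))    # (0, 1) 105000
```

The big 85% case gets thrown out by the filter, but the constrained version keeps it by pairing it with a sure thing, and more than doubles the expected return while still averaging over 90%.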

Trade-Offs and the Frontier Curve

So you may be thinking, ok the trade-off to get a 90% clearance was not too bad in terms of total extra taxes collected. So why not set the constraint to 95%? When you add constraints, they can only lower the objective function (or leave it unchanged). The question then becomes quantifying that trade-off.

You can subsequently vary the hit rate constraint, and see how much it changes the total expected return. Here is an example of doing that, each model only takes around a second to complete.

#Drawing the trade-off in hit rate vs expected return

hit_rate = np.linspace(0.85, 0.95, 30)
total_return = []

#Function to estimate the model
def const_hit_rate(er, prob, tot_n, hr):
    c_index = range(len(er))
    Prob = pulp.LpProblem("Choosing Cases to Audit", pulp.LpMaximize)
    Dec = pulp.LpVariable.dicts("Decision Variable", [i for i in c_index], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Prob += pulp.lpSum( Dec[i]*er[i] for i in c_index)
    Prob += pulp.lpSum( Dec[i] for i in c_index ) == tot_n
    Prob += pulp.lpSum( Dec[i]*prob[i] for i in c_index ) >= hr*tot_n
    Prob.solve()
    dec_li = [Dec[i].varValue for i in c_index]
    dec_np = np.asarray(dec_li)
    return pulp.value(Prob.objective), dec_np

for h in hit_rate:
    print(f'Estimating hit rate {h}')
    obj, dec_res = const_hit_rate(exp_return, prob_fraud, 100, h)
    total_return.append(obj)


For this simulated data example, there end up being pretty severe trade-offs in the total return after you get above 91% hit rates, so from this it may not be worth the trade-off to get a much higher hit rate in practice. Just depends on how much you are willing to trade-off one for the other.

There are other ways to formulate this trade off (via bi-objective functions/soft-constraints, or weighted ranking schemes), but the blog post is long enough as is!

Other Potential Applications

So in terms of my work, I have examples of using linear programs to make spatial location decisions, encode fairness constraints into predictive policing, and allocate treatment assignment with network spillovers.

Erik Alda and Joseph Ferrandino have conducted frontier analysis of different criminal justice organizations, which is based on estimating the frontier curve above from data (instead of a pre-specified objective function).

That is about it for criminologists that I know of, but there are plenty of applications towards criminal justice topics using linear programming (or related concepts). It is most popular among operations researchers, of which Laura Albert is one of my favorites. (Criminal Justice as a field might not exist if not for Albert Blumstein, who was also a very influential operations researcher.)

One of the things that makes this different from more traditional quantitative work in the social sciences is that again it is not statistics – we are not testing hypotheses. The contribution is simply formulating the decision problem in a tractable way that can be solved, and the drawing of the trade-offs I showed above.

It is one of the ways I really like it though – unlike saying how your regression model can be used to inform decisions, this much more explicitly shows the utility of the results of those models in some practice.

Optimal treatment assignment with network spillovers

This post is motivated by a recent piece by Wood and Papachristos (2019) (WP from here on), which finds that if you treat an individual at high risk for gun shot victimization, there are positive spillover effects on the individuals they are connected to. This creates a tricky problem in identifying the best individuals to intervene with given finite resources. This is because you may not want to just choose the people with the highest risk – the best bang for your buck will be folks who are some function of high risk and connected to others with high risk (as well as those in areas of the network not already treated).

For a simplified example consider the network below, with individuals’ baseline probabilities of future risk noted in the nodes. Let’s say the local treatment effect reduces the probability to 0, and the spillover effect reduces the probability by half, and you can only treat 1 node. Who do you treat?

We could select the person with the highest baseline probability (B), and the reduced effect ends up being 0.5(B) + 0.1(E) = 0.6 (the 0.1 is for the spillover effect for E). We could choose node A, which has the second highest baseline probability and the most connections, and the reduced effect is 0.4(A) + 0.05(C) + 0.05(D) + 0.1(E) = 0.6. But it ends up that in this network the optimal node to choose is E, because the spillovers to A and B justify choosing a lower probability individual, 0.2(E) + 0.2(A) + 0.25(B) = 0.65.
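You can check that arithmetic for every node with a few lines of plain Python. The edge list here is what the sums above imply (A connects to C, D and E; B connects to E), and the function name is mine.

```python
# Baseline probabilities and edges read off the worked example in the text
base = {'a': 0.4, 'b': 0.5, 'c': 0.1, 'd': 0.1, 'e': 0.2}
edges = [('a', 'c'), ('a', 'd'), ('a', 'e'), ('b', 'e')]

# Build an undirected neighbor lookup
neigh = {n: set() for n in base}
for u, v in edges:
    neigh[u].add(v)
    neigh[v].add(u)

def reduction(treated):
    """Total risk reduction from treating a single node."""
    local = base[treated]                                # treated node drops to 0
    spill = sum(0.5 * base[n] for n in neigh[treated])   # neighbors are halved
    return round(local + spill, 2)

res = {n: reduction(n) for n in base}
print(res)                     # e has the largest total reduction, 0.65
print(max(res, key=res.get))   # e
```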

Using this idea of a local treatment effect and a spillover effect, I formulated an integer linear program:

\text{Maximize} \{ \sum_{i = 1}^n (L_i\cdot p_{li} + S_i \cdot p_{si}) \}

Where p_{li} is the reduction in the probability due to the local effect, and p_{si} is the reduction in the probability due to the spillover effect. These probabilities are fixed values you know at the onset, e.g. estimated from some model like in Wheeler, Worden, and Silver (2019) (and Papachristos has related work using the network itself to estimate risk). Each node, i, then gets two decision variables; L_i will equal 1 if that node is treated, and S_i will equal 1 if the node gets a spillover effect (depending on who is treated). Actually the findings in WP show that these effects are not additive (you don’t get extra effects if you are treated and your neighbors are treated, or if you have multiple neighbors treated), and this makes it easier to keep the problem on the probability scale. So we then have our constraints:

  1. L_i , S_i \in \{ 0,1 \}
  2. \sum L_i = K
  3. S_i \leq 1 + -1\cdot L_i , \forall \text{ Node}_i
  4. \sum_{\text{neigh}(i)} L_j \geq S_i , \forall \text{ Node}_i

Constraint 1 is that these are binary 0/1 decision variables. Constraint 2 is we limit the number of people treated to K (a value that we choose). Constraint 3 ensures that if a local decision variable is set to 1, then the spillover variable has to be set to 0. If the local is 0, it can be either 0 or 1. Constraint 4 looks at the neighbor relations. For Node i, if any of its neighbors' local treatment decision variables is set to 1, the spillover decision variable can be set to 1; if none of them are, it is forced to 0.
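To see the four constraints doing their job, here is a brute-force sketch in plain Python that enumerates every 0/1 assignment on the toy network and keeps the best feasible one. It is not a solver, just a check on the logic. The edge list is inferred from the worked example, K = 2 is a hypothetical budget chosen to show the non-additive spillover, and the function names are illustrative.

```python
from itertools import product

nodes = ["a", "b", "c", "d", "e"]
# Assumption: edges inferred from the worked example
edges = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "e")]
p_l = {"a": 0.4, "b": 0.5, "c": 0.1, "d": 0.1, "e": 0.2}    # local reductions
p_s = {"a": 0.2, "b": 0.25, "c": 0.05, "d": 0.05, "e": 0.1}  # spillover reductions
K = 2  # hypothetical number of treated nodes

neigh = {n: set() for n in nodes}
for u, v in edges:
    neigh[u].add(v)
    neigh[v].add(u)

def feasible(L, S, K):
    if sum(L.values()) != K:                      # constraint 2
        return False
    for i in nodes:
        if S[i] > 1 - L[i]:                       # constraint 3
            return False
        if sum(L[j] for j in neigh[i]) < S[i]:    # constraint 4
            return False
    return True

def objective(L, S):
    return sum(p_l[i] * L[i] + p_s[i] * S[i] for i in nodes)

best_val, best_L, best_S = -1.0, None, None
# Constraint 1: every variable is 0 or 1, so enumerate all 2^(2n) combinations
for bits in product([0, 1], repeat=2 * len(nodes)):
    L = dict(zip(nodes, bits[:len(nodes)]))
    S = dict(zip(nodes, bits[len(nodes):]))
    if feasible(L, S, K):
        val = objective(L, S)
        if val > best_val:
            best_val, best_L, best_S = val, L, S

treated = sorted(n for n in nodes if best_L[n])
spill = sorted(n for n in nodes if best_S[n])
print(treated, spill, round(best_val, 2))
```

With K = 2 this picks A and B to treat, with C, D, and E as spillovers, for a total of 1.1. Node E neighbors both treated nodes but only counts once, which is the non-additive behavior the constraints are meant to encode. Enumeration like this blows up quickly, so it is only useful for tiny networks.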

So in the end, if the number of nodes is n, we have 2*n decision variables and 2*n + 1 constraints. I find it easier just to look at code sometimes, so here is this simple network and problem formulated in python using networkx and pulp. (Here is a full file of the code and data used in this post.) (Update: I swear I’ve edited this inline code snippet multiple times, if it does not appear I have coded constraints 3 & 4, check out the above linked code file. Maybe it is causing problems being rendered.)

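import pulp
import networkx

Nodes = ['a','b','c','d','e']
# Edges as implied by the worked example above (a-c, a-d, a-e, b-e)
Edges = [('a','c'),
         ('a','d'),
         ('a','e'),
         ('b','e')]

p_l = {'a': 0.4, 'b': 0.5, 'c': 0.1, 'd': 0.1,'e': 0.2}
p_s = {'a': 0.2, 'b': 0.25, 'c': 0.05, 'd': 0.05,'e': 0.1}
K = 1

G = networkx.Graph()
G.add_nodes_from(Nodes)
G.add_edges_from(Edges)

P = pulp.LpProblem("Choosing_Network_Intervention", pulp.LpMaximize)
L = pulp.LpVariable.dicts("Treated_Units", Nodes, lowBound=0, upBound=1, cat=pulp.LpInteger)
S = pulp.LpVariable.dicts("Spillover_Units", Nodes, lowBound=0, upBound=1, cat=pulp.LpInteger)

# Objective: total reduction from local and spillover effects
P += pulp.lpSum( p_l[i]*L[i] + p_s[i]*S[i] for i in Nodes)
# Constraint 2: treat exactly K nodes
P += pulp.lpSum( L[i] for i in Nodes ) == K

for i in Nodes:
    # Constraint 3: a treated node cannot also get the spillover
    P += S[i] <= 1 - L[i]
    # Constraint 4: spillover only if at least one neighbor is treated
    ne = G.neighbors(i)
    P += pulp.lpSum( L[j] for j in ne ) >= S[i]

P.solve()

#Should select e for local, and a & b for spillover
print(pulp.value(P.objective))
for n in Nodes:
    print([n, L[n].varValue, S[n].varValue])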

And this returns the correct results, that node E is chosen in this example, and A and B have the spillover effects. In the linked code I provided a nicer function to just pipe in your network, your two probability reduction estimates, and the number of treated units, and it will pipe out the results for you.

For an example with a larger network for just proof of concept, I conducted the same analysis, choosing 20 people to treat in a network of 311 nodes I pulled from Rostami and Mondani (2015). I simulated some baseline probabilities to pipe in, and made it so the local treatment effect was a 50% reduction in the probability, and a spillover effect was a 20% reduction. Here red squares are treated, pink circles are the spill-over, and non-treated are grey circles. It did not always choose the locally highest probability (largest nodes), but did tend to choose highly connected folks also with a high probability (but also chose some isolate nodes with a high probability as well).

This problem is solved in an instant. And I think out of the box this will work for even large networks of say over 100,000 nodes (I have let CPLEX churn on problems with near half a million decision variables on my desktop overnight). I need to check myself to make 100% sure though. A simple way to make the problem smaller if needed though is to conduct the analysis on subsets of connected components, and then shuffle the results back together.
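As a rough sketch of the splitting step, here is one way to pull out connected components with a breadth-first search in plain Python (networkx also has a `connected_components` function that does the same job). The nodes f, g, and h are hypothetical additions to the toy network so there is more than one component. How to divide the K treated slots across components is left open here; the post does not specify it.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return a list of components, each a sorted list of node labels."""
    neigh = {n: set() for n in nodes}
    for u, v in edges:
        neigh[u].add(v)
        neigh[v].add(u)
    seen = set()
    comps = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search outward from an unvisited node
        comp = []
        queue = deque([start])
        seen.add(start)
        while queue:
            cur = queue.popleft()
            comp.append(cur)
            for nb in neigh[cur]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(sorted(comp))
    return comps

# Toy network plus hypothetical extra nodes f, g (connected) and h (isolate)
nodes = ["a", "b", "c", "d", "e", "f", "g", "h"]
edges = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "e"), ("f", "g")]
components = connected_components(nodes, edges)
print(components)
```

Each component could then be passed to the optimization on its own, since no spillover can cross between components.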

Looking at the results, it is very similar to my choosing representatives work (Wheeler et al., 2019), and I think you could get similar results with just piping in 1’s for each of the local and spillover probabilities. One of the things I want to work on going forward though is treatment non-compliance. So if we are talking about giving some of these folks social services, they don’t always take up your offer (this is a problem in choosing reps for call-ins as well). WP actually relied on this to draw control nodes in their analysis. I thought for a bit the problem with treatment non-compliance in this setting was intractable, but another paper on a totally different topic (Bogle et al., 2019) has given me some recent hope that it can be solved.

This same idea is also related to hot spots policing (think spatial diffusion of benefits). And I have some ideas about that to work on in the future as well (e.g. how wide of a net to cast when doing hot spots interventions given geographical constraints).


  • Bogle, J., Bhatia, N., Ghobadi, M., Menache, I., Bjørner, N., Valadarsky, A., & Schapira, M. (2019). TEAVAR: striking the right utilization-availability balance in WAN traffic engineering. In Proceedings of the ACM Special Interest Group on Data Communication (pp. 29-43).
  • Rostami, A., & Mondani, H. (2015). The complexity of crime network data: A case study of its consequences for crime control and the study of networks. PloS ONE, 10(3), e0119309.
  • Wheeler, A. P., McLean, S. J., Becker, K. J., & Worden, R. E. (2019). Choosing Representatives to Deliver the Message in a Group Violence Intervention. Justice Evaluation Journal, Online First.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The Accuracy of the Violent Offender Identification Directive Tool to Predict Future Gun Violence. Criminal Justice and Behavior, 46(5), 770-788.
  • Wood, G., & Papachristos, A. V. (2019). Reducing gunshot victimization in high-risk social networks through direct and spillover effects. Nature Human Behaviour, 1-7.


New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). Y axis is the number of violent crimes over a long time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though. You could use totally objective crime data (say swap out reported crime for acoustic gunshot sensor data) and you would still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.
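To make the trade-off concrete, here is a toy brute-force sketch in plain Python. It is not the linear program from the paper, just the same idea on four made-up areas. Areas A (30 crimes, 90% minority stops) and B (28 crimes, 50%) follow the hypothetical above; areas C and D, the stop counts of 100 per area, and the 80% cap are all assumptions for illustration, as is framing the constraint as a share of stops.

```python
from itertools import combinations

# (predicted crimes, expected minority stops, expected total stops)
# All values are hypothetical illustration data
areas = {
    "A": (30, 90, 100),
    "B": (28, 50, 100),
    "C": (29, 95, 100),
    "D": (20, 60, 100),
}

def best_hot_spots(areas, n_select, max_minority_share=None):
    """Pick n_select areas maximizing predicted crime, optionally capping
    the share of expected stops that are of minorities."""
    best_ids, best_crime = None, -1
    for ids in combinations(sorted(areas), n_select):
        crime = sum(areas[i][0] for i in ids)
        minority = sum(areas[i][1] for i in ids)
        stops = sum(areas[i][2] for i in ids)
        # Skip selections that break the equity cap
        if max_minority_share is not None and minority > max_minority_share * stops:
            continue
        if crime > best_crime:
            best_ids, best_crime = ids, crime
    return best_ids, best_crime

unconstrained = best_hot_spots(areas, 2)
constrained = best_hot_spots(areas, 2, max_minority_share=0.80)
print(unconstrained)
print(constrained)
```

Without the cap the top two crime areas (A and C, 59 crimes) are chosen, with 92.5% of expected stops being of minorities. With the 80% cap the selection switches to A and B (58 crimes, 70% of stops), giving up one predicted crime to meet the constraint.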

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade-off was 3 crimes vs 30, that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the unconstrained hot spots. So in the hypothetical it would go from 30 crimes to around 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential pop. will typically result in similar large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That paper uses a totally different method, but I think it will produce pretty similar outputs in practice; you can tune the penalty factor in a similar way to changing the linear program constraint, I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.