Injury rates at Amazon warehouses

I follow several of the News & Observer (The Raleigh/Durham newspaper) newsletters, and Brian Gordon and Tyler Dukes had a story recently about fainting and ambulance runs at the Amazon warehouse in Raleigh, Open Source: Ambulances at Amazon. He did some great sleuthing, and showed that while the number on its face seemed high (an ambulance call around 1 out of 3 days) the rate of ambulance runs when accounting for the size of the workforce is pretty similar to other warehouses.

Here I will show an example of downloading the OSHA injury data to show a similar finding. Using python it is pretty quick work.

So first can import the libraries we need (the typical scientific stack). I download the OSHA data for 2021, and I calculate injury rates per person work year, so how to interpret these are at the workplace level. Per full time people per year, it is the expected number of injuries across the workforce.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta

inj_2021url = "https://www.osha.gov/sites/default/largefiles/ITA-data-cy2021.zip"
inj_dat = pd.read_csv(inj_2021url)
# Calculate injuries per person full year
inj_dat['InjPerYear'] = (inj_dat['total_injuries']/inj_dat['total_hours_worked'])*2080

We can filter out warehouse workers via NAICS code 493110. I also just limit to warehouses in North Carolina. Sorting by the injury rate, Amazon is not even in the top 10 in the state:

warehouses = inj_dat[inj_dat['naics_code'] == 493110].copy()
warehouses_nc = warehouses[warehouses['state'] == 'NC'].reset_index(drop=True)
warehouses_nc['AmazonFlag'] = 1*(warehouses_nc['company_name'].str.find('Amazon.com') >= 0)

# Rate per year of work per person, 2080 
warehouses_nc.sort_values('InjPerYear',ascending=False,ignore_index=True,inplace=True)
warehouses_nc.head(10)

But note that I don’t think Bonded Logistics is a terribly dangerous place. One thing you need to watch out for when evaluating rate data is that places with smaller denominators (here lower total hours worked) tend to be more volatile. So a useful plot is to plot the total hours work (cumulative for the entire warehouse) against the overall rate of injuries per hour worked.

fig, ax = plt.subplots(figsize=(12,6))
ax.scatter(nam_ware['total_hours_worked'], nam_ware['InjPerYear'], 
           c='grey', s=30, edgecolor='k', alpha=0.5, label='Other Warehouses')
ax.scatter(amz_ware['total_hours_worked'], amz_ware['InjPerYear'], 
           c='blue', s=80, edgecolor='k', alpha=0.9, label='Amazon Warehouses')
ax.set_axisbelow(True)
ax.set_xlabel('Total Warehouse Hours Worked')
ax.set_ylabel('Injury Rate per Person Work Year (2080 hours)')
ax.legend(loc='upper right')
plt.savefig('InjRate.png', dpi=500, bbox_inches='tight')

You can see by this plot the Amazon warehouses have the largest total number of hours worked (by quite a few) relative to many other warehouses in North Carolina. But their overall rate of injuries is right in line with the rest of the crowd. Looking at the overall rate, it is around 0.04 (so you would expect around 1/20 full time workers to have an injury per year at a warehouse according to this data).

tot_rate = warehouses_nc['total_injuries'].sum()/warehouses_nc['total_hours_worked'].sum()
print(tot_rate*2080)

If we do this plot again, but add funnel bound lines to show the typical volatility we would expect with estimating these rates:

# Binomial confidence interval
def binom_int(num,den,confint=0.95):
    quant = (1 - confint)/ 2.
    low = beta.ppf(quant, num, den - num + 1)
    high = beta.ppf(1 - quant, num + 1, den - num)
    return (np.nan_to_num(low), np.where(np.isnan(high), 1, high))

den = np.geomspace(1000,8700000,500)
num = den*tot_rate
low_int, high_int = binom_int(num,den,0.99)
high_int = high_int*2080

fig, ax = plt.subplots(figsize=(12,6))
ax.plot(den,high_int, c='k', linewidth=0.5)
ax.hlines(tot_rate*2080,1000,8700000,colors='k', linewidths=0.5)
ax.scatter(nam_ware['total_hours_worked'], nam_ware['InjPerYear'], 
           c='grey', s=30, edgecolor='k', alpha=0.5, label='Other Warehouses')
ax.scatter(amz_ware['total_hours_worked'], amz_ware['InjPerYear'], 
           c='blue', s=80, edgecolor='k', alpha=0.5, label='Amazon Warehouses')
ax.set_axisbelow(True)
ax.set_xlabel('Total Warehouse Hours Worked')
ax.set_ylabel('Injury Rate per Person Work Year (2080 hours)')
plt.xscale('log', basex=10)
ax.legend(loc='upper right')
ax.annotate('Straight line is average overall injury rate\nCurved line is Binomial 99% Interval', 
            xy = (0.00, -0.13), xycoords='axes fraction')
plt.savefig('InjRate_wBin.png', dpi=500, bbox_inches='tight')

So you can see even Bonded Logistics is well within the average rate you would expect to still be consistent with the average overall injury rate relative to all the other warehouses in North Carolina.

As a note, I imagine I saw someone using this data recently looking at police departments in a criminal justice paper (I have in my notes police departments are NAICS code 922120). (Maybe Justin Nix/Michael Sierra-Arévalo/Ian Adams?) But sorry do not remember the paper (so I owe credit to someone else for pointing out this data, but not sure who).

Another way to do the analysis is to calculate the lower/upper confidence intervals per the rates, and then sort by the lower confidence interval. This way you can filter out high rate variance locations.

# Can look at police departments
# NAICS code 922120
police = inj_dat[inj_dat['naics_code'] == 922120].copy()
low_police, high_police = binom_int(police['total_injuries'],police['total_hours_worked'])
police['low_rate'] = low_police*2080
police.sort_values('low_rate',ascending=False,ignore_index=True,inplace=True)
check_fields = ['establishment_name','city','state','total_injuries','total_hours_worked','InjPerYear','low_rate']
police[check_fields].head(10)

So you can see we have some funny business going on with the LA data reporting (which OSHA mentions on the data webpage). Maybe it is just admin duty, so people are already injured and get assigned to those bureaus (not sure why LAPD reports seperate bureaus at all).

Using IO objects in python to read data

Just a quick post on a technique I’ve used a few times recently, in particular when reading web data.

First for a very quick example, in python when reading data with pandas, it often expects a filename on disk. For pandas, e.g. pd.read_csv('my_file.csv'). But if you happen to already have the contents of the csv in a text object in memory, you can use io.StringIO to just read that object.

import pandas as pd
from io import StringIO

# Example csv file inline
examp_csv = """a,b
1,x
2,z"""

pd.read_csv(StringIO(examp_csv))

Where this has come up for me recently is reading in different data from web servers. For example, here is Cary’s API for crime data, you can batch download the whole thing at the below url, but via this approach I currently get an SSL error:

# Town of Cary CSV for crimes
cary_url = 'https://data.townofcary.org/explore/dataset/cpd-incidents/download/?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true&csv_separator=%2C'

# Returns SSL Error for me
cary_df = pd.read_csv(cary_url)

Note I don’t know the distinction in web server tech that causes this (as sometimes you can just grab a CSV via url, here is an example I have grabbing PPP loan data or with the NIJ recidivism data).

But we can grab the data via requests, and use the same StringIO trick I just showed to get this data:

# Using string IO for reading text
import requests
res_cary = requests.get(cary_url)
cary_df = pd.read_csv(StringIO(res_cary.text))
cary_df

Again I don’t know why some servers you need to go through this approach, but this works for me for Socrata and CartoDB api’s for different cities open data. I also used in recently for converting geojson in ESRI’s api.

The second example I want to show is downloading zipfiles. For this, we will use io.BytesIO instead of StringIO. The census stores various data in zipfiles on their FTP server:

# Example 2, grabbing zipped contents
import zipfile
from io import BytesIO

census_url = 'https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/2019_5yr_Summary_FileTemplates.zip'
req = requests.get(census_url)

# Can use BytesIO for this content
zf = zipfile.ZipFile(BytesIO(req.content))

The zipfile library would be equivalent to reading/extracting a zipfile already on disk. But when downloading there is no need to save to disk, then deal with that file. BytesIO here cuts out the middleman.

Then we gan either grab a specific file inside of our zf object, or extract all the contents one-by-one:

# Now can loop through the list
# or grab specific file
zf.filelist[0]
temp_geo = pd.read_excel(zf.open('2019_SFGeoFileTemplate.xlsx'))
temp_geo.T

Hot spots of crime in Raleigh and home buying

So my realtor, Ellen Pitts (who is highly recommended, helped us a ton remotely moving into Raleigh), has a YouTube channel where she talks about real estate trends. Her most recent video she discussed a bit about crime in Raleigh relative to other cities because of the most recent shooting.

My criminologist hot take is that generally most cities in the US are relatively low crime. So Ellen shows Dallas has quite a few more per-capita shootings than Raleigh, but Dallas is quite safe “overall”. Probably somewhat contra to what most people think, the cities that in my opinion really have the most crime problems tend to be smaller rust belt cities. I love Troy, NY (where I was a crime analyst for a few years), but Troy is quite a bit rougher around the edges than Raleigh or Dallas.

So this post is more about, you have already chosen to move to Raleigh – if I am comparing house 1 and house 2 (or looking at general neighborhoods), do I need to worry about crime in this specific location?

So for a few specific resources/strategies for the home hunter. Not just in Raleigh, but many cities now have an open data portal. You can often look at crime. Here is an example with the Raleigh open data:

So if you have a specific address in mind, you can go and see the recent crime around that location (cities often fuzz the address a bit, so the actual points are just nearby on that block of the street). Blue dots in that screenshot are recent crimes in 2022 against people (you can click on each dot and get a more specific breakdown). Be prepared when you do this – crime is everywhere. But that said the vast majority of minor crime incidents should not deter you from buying a house or renting at a particular location.

Note I recommend looking at actual crime data (points on a map) for this. Several vendors release crime stats aggregated to neighborhoods or zipcodes, but these are of very low quality. (Often they “make up” data when it doesn’t exist, and when data does exist they don’t have a real great way to rank areas of low or high crime.)

For the more high level, should I worry about this neighborhood, I made an interactive hotspot map.

For the methodology, I focused on crimes that I would personally be concerned with as a homeowner. If I pull larceny crimes, I am sure the Target in North Hills would be a hotspot (but I would totally buy a condo in North Hills). So this pulls the recent crime data from Raleigh open data starting in 2020, but scoops up aggravated assaults, interpersonal robberies, weapon violations, and residential burglaries. Folks may be concerned about drug incidents and breaking into cars as well, but my experience those also do not tend to be in residential areas. The python code to replicate the map is here.

Then I created DBScan clusters that had at least 34 crimes – so these areas average at least one of these crimes per month over the time period I sampled. Zooming in, even though I tried to filter for more potentially residential related crimes, you can see the majority of these hot spots of crime are commercial areas in Raleigh. So for example you can zoom in and check out the string of hot spots on Capital Blvd (and if you click a hot spot you can get breakdowns of specific crime stats I looked at):

Very few of these hot spots are in residential neighborhoods – most are in more commercial areas. So when considering looking at homes in Raleigh, there are very few spots I would worry about crime at all in the city when making a housing choice. If moving into a neighborhood with a higher proportion of renters I think is potentially more important long term signal than crime here in Raleigh.

Hyperparameter tuning for Random Forests

Motivated to write this post based on a few different examples at work. One, we have periodically tried different auto machine learning (automl) libraries at work (with quite mediocre success). They are OK for a baseline, not so much for production. Two, a fellow data scientist was trying some simple hyperparameter search in random forests, and was searching over all the wrong things.

For those not familiar, automl libraries, such as data robot or databricks automl, typically do a grid search over different models given a particular problem (e.g. random forest, logistic regression, XGBoost, etc.). And within this you can have variants of similar models, e.g. RFMod1(max-depth=5) vs RFMod2(max-depth=10). Then given a train/test split, see which model wins in the test set, and then suggests promoting that model into production.

My experience at work with these different libraries (mostly tabular medical records data), they choose XGBoost variants at a very high frequency. I think this is partially due to poor defaults for the hyperparameter search in several of these automl libraries. Here I will just be focused on random forests, and to follow along I have code posted on Github.

Here I am using the NIJ recidivism challenge data I have written about in the past, and for just some upfront work I am loading in various libraries/functions (I have the recid functions in the githup repo). And then making it easy to just load in the train/test data (with some smart feature engineering already completed).

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import recid # my custom NIJ recid funcs

# Read in data and feature engineering
train, test = recid.fe()

Now first, in terms of hyperparameter tuning you need to understand the nature of the model you are fitting, and how those hyperparameters interact with the model. In terms of random forests for example, I see several automl libraries (and my colleague) doing a search over the number of trees (or estimators in sklearn parlance). Random forests the trees are independent, and so having more trees is always better. It is sort of like saying would you rather me do 100 simulations or 1,000,000 simulations to estimate a parameter – the 1,000,000 will of course have a more accurate estimate (although may be wasteful, as the extra precision is not needed).

So here, using the NIJ defined train/test split, and a set of different fixed parameters (close to what I typically default to for random forests). I show how AUC or the Brier Score changes with the number of trees, over a grid from 10 to 1000 trees:

# Loop over tree sizes
ntrees = np.arange(10,1010,10)
auc, bs = [], []

# This is showing number of estimators is asymptotic
# should approach best value with higher numbers
# but has some variance in out of sample test
for n in ntrees:
    print(f'Fitting Model for ntrees:{n} @ {datetime.now()}')
    # Baseline model
    mod = RandomForestClassifier(n_estimators=n, max_depth=5, min_samples_split=50)
    # Fit Model
    mod.fit(train[recid.xvars], train[recid.y1])
    # Evaluate AUC/BrierScore out of sample
    pred_out = mod.predict_proba(test[recid.xvars])[:,1]
    auc.append(roc_auc_score(test[recid.y1], pred_out))
    bs.append(brier_score_loss(test[recid.y1], pred_out))

# Making a plot of the 
fig, (ax1, ax2) = plt.subplots(2, figsize=(6,8))
ax1.plot(ntrees, auc)
ax1.set_title('AUC')
ax2.plot(ntrees, bs)
ax2.set_title('Brier Score')
plt.savefig('Ntree_grid.png', dpi=500, bbox_inches='tight')

We can see that the relationship is noisy, but the trend clearly decreases with tree size, and perhaps asymptotes post 200 some trees for both metrics in this particular set of data.

So of course it depends on the dataset, but when I see automl libraries choosing trees in the range of 10, 50, 100 for random forests I roll my eyes a bit. You always get more accurate (in a statistical sense), with more trees. You would only choose that few for convenience in fitting and time savings. The R library ranger has a better default of 500 trees IMO (sklearns 100 is often too small in doing tests). But there isn’t much point in trying to hyperparameter tune this – your hyperparameter library may choose a smaller number of trees in a particular run, but this is due to noise in the tuning process itself.

So what metrics do matter for random forests? All machine learning models you need to worry about over-fitting/under-fitting. This depends on the nature of the data (number of rows, can fit more parameters, fewer columns less of opportunity to overfit). Random forests this is dependent on the complexity of the trees – more complex trees can overfit the data. If you have more rows of data, you can fit more complex trees. So typically I am doing searches over (in sklearn parlance) max-depth (how deep a tree can grow), min-samples-split (can grow no more trees is samples are too tiny in a leaf node). Another parameter to search for is how many columns to subsample as well (more columns can find more complex trees).

It can potentially be a balancing act – if you have more samples per leaf, it by default will create less complex trees. Many of the other hyperparameters limiting the complexity of the trees are redundant with these as well (so you could really swap out with max-depth). Here is an example of using the Optuna library a friend recommended to figure out the best parameters for this particular data set, using a train/eval approach (so this splits up the training set even further):

# Consistent train/test split for all evaluations
tr1, tr2 = train_test_split(train, test_size=2000)

def objective(trial):
    param = {
        "n_estimators": 500,
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 10, 200, step=10),
        "max_features": trial.suggest_int("max_features", 3, 15),
    }
    mod = RandomForestClassifier(**param)
    mod.fit(tr1[recid.xvars], tr1[recid.y1])
    pred_out = mod.predict_proba(tr2[recid.xvars])[:,1]
    auc = roc_auc_score(tr2[recid.y1], pred_out)
    return auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
trial = study.best_trial

print(f"Best AUC {trial.value}")
print("Best Params")
print(trial.params)

So here it chooses fairly deep trees, 9, but limited complexity via the sample size in a leaf (90). The number of features per tree is alsio slightly larger than default (here 25 features, so default is sqrt(25)). So it appears my personal defaults were slightly under-fitting the data.

My experience regression prefers smaller depth of trees but lower sample sizes (splitting to 1 is OK), but for binary classification limiting sample sizes in trees is preferable.

One thing to pay attention to in these automl libraries, a few do not retrain on the full dataset with the selected hyperparameters. For certain scenarios you don’t want to do this (e.g. if using the eval set to do calibrations/prediction intervals), but if you don’t care about those then you typically want to retrain on the full set. Another random personal observation, random forests really only start to outperform simpler regression strategies at 20k cases, with smaller sample sizes and further splitting train/eval/test, the hyperparameter search is very noisy and mostly a waste of time. If you only have a few hundred cases just fit a reasonable regression model and call it a day.

So here I retrain on the full test set, and then look at the AUC/Brier Score compared to my default model.

# Lets refit according to these params for the full
# training set
mod2 = RandomForestClassifier(n_estimators=500,**trial.params)
mod2.fit(train[recid.xvars], train[recid.y1])
pred_out = mod2.predict_proba(test[recid.xvars])[:,1]
auc_tuned = roc_auc_score(test[recid.y1], pred_out)
bs_tuned = brier_score_loss(test[recid.y1], pred_out)
print(f"AUC tuned {auc_tuned:.4f} vs AUC default {auc[-1]:.4f}")
print(f"Brier Score tuned {bs_tuned:.4f} vs default {bs[-1]:.4f}")

It is definately worth hyperparameter tuning once you have your feature set down, but lift of ~0.01 AUC (which is typical for hypertuned tabular models in my experience) is not a big deal with initial model building. Figuring out a smart way to encode some relevant feature (based on domain knowledge of the problem you are dealing with) typically has more lift than this in my experience.

Column storage for wide datasets

The notion of big data is quite a bit overhyped. I do get some exposure to it at work, but many tools like Hadoop are not needed over doing things in chunks on more typical machines. One thing though I have learned more generally about databases, many relational databases (such as postgres) store data under the hood like this:

Index1|1|'a'|Index2|2|'b'.....|Index9999|1|'z'

So they are stacked cumulatively in an underlying file format. And then to access the information the system has to know “ok I need to find Index2 and column 2”, so it calculates offsets for where those pieces of information should be located in the file.

This however is not so nice for databases that have many columns. In particular administrative record databases that have few rows, but many thousands of columns (often with many missing fields and low dimensional categories), this default format is not very friendly. In fact many databases have limits on the number of columns a table can have. (“Few rows” is relative, but really I am contrasting with sensor data that often has billions/trillions of rows, but very few columns.)

In these situations, a columnar database (or data format) makes more sense. Instead of having to calculate many large number of offsets, the database often will represent the key-column pairs in some easier base format, and then will only worry about grabbing specific columns. Both representing the underlying data in a more efficient manner will lower on computer disk space (e.g. only takes up 1 gig instead of many gigabytes) as well as improve input/output operations on the data (e.g. it is faster to read the data/write new data).

I will show some examples via the American Community Survey data for micro areas. I have saved the functions/code here on github to follow along.

So first, I load in pandas and DuckDB. DuckDB is an open source database that uses by default a columnar storage format, but can consider it a very similar drop in replacement for sqllite for persistantly storing data.

import pandas as pd
import duckdb
import os

# my local census functions
# grabbing small area 2019 data
# here defaults to just Texas/Delaware
# still takes a few minutes
import census_funcs
cen2019_small, fields2019 = census_funcs.get_data(year=2019)
print(cen2019_small.shape) # (21868, 17145)

Here I grab the small area data for just Texas/Delaware (this includes mostly census tracts and census block groups in the census FTP data). You can see not so many rows, almost 22k, but quite a few columns, over 17k. This is after some data munging to drop entirely missing/duplicated columns even. The census data just has very many slightly different aggregate categories.

Next I save these data files to disk. For data that does not change very often, you just want to do this process a single time and save that data somewhere you can more easily access it. No need to re-download the data from the internet/census site everytime you want to do a new analysis.

I don’t have timing data here, but you can just experiment to see the parquet data format is quite a bit faster to save (and csv formats are the slowest). DuckDB is smart and just knows to look at your local namespace to find the referenced pandas dataframe in the execute string.

# Save locally to CSV file
cen2019_small.to_csv('census_test.csv')

# Trying zip compression
cen2019_small.to_csv('census_test.csv.zip',compression="zip")

# Save to parquet files
cen2019_small.to_parquet('census.parquet.gzip',compression='gzip',engine='pyarrow')

# Save to DuckDB, should make the DB if does not exist
con = duckdb.connect(database='census.duckdb',read_only=False)
con.execute('CREATE TABLE census_2019 AS SELECT * FROM cen2019_small')

If we then go and check out the datasizes on disk, csv for just these two states is 0.8 gigs (the full set of data for all 50 states is closer to 20 gigs). Using zip compression reduces this to around 1/4th of the size for the csv file. Using parquet format (which can be considered an alternative fixed file format to CSV although columnar oriented) and gzip compression is pretty much the same as zip compression for this data (not many missing values or repeat categories of numbers), but if you have repeat categorical data with a bit of missing data should be slightly better compression (I am thinking NIBRS here). The entire DuckDB database is actually smaller than the CSV file (I haven’t checked closely, I try to coerce the data to smarter float/int formats before saving, but there are probably even more space to squeeze out of my functions).

# Lets check out the file sizes
files = ['census_test.csv','census_test.csv.zip',
         'census.parquet.gzip','census.duckdb']

for fi in files:
    file_size = os.path.getsize(fi)
    print(f'File size for {fi} is {file_size/1024**3:,.1f} gigs')

# File size for census_test.csv is 0.8 gigs
# File size for census_test.csv.zip is 0.2 gigs
# File size for census.parquet.gzip is 0.2 gigs
# File size for census.duckdb is 0.7 gigs

The benefit of having a database here like DuckDB is for later SQL querying, as well as the ability to save additional tables. If I scoop up the 2018 data (that has slightly different columns), I can save to an additional table. Then later on downstream applications can select out the limited columns/years as needed (unlikely any real analysis workflow needs all 17,000 columns).

# Can add in another table into duckdb
cen2018_small, fields2018 = census_funcs.get_data(year=2018)
con.execute('CREATE TABLE census_2018 AS SELECT * FROM cen2018_small')
con.close()

ducksize = os.path.getsize(files[-1])
print(f'File size for {files[-1]} with two tables is {ducksize/1024**3:,.1f} gigs')
# File size for census.duckdb with two tables is 1.5 gigs

Sqllite is nice, as you can basically share a sqllite file and you know your friend will be able to open it. I have not worked with DuckDB that much, but hopefully it has similar functionality and ability to share without too much headache.

I have not worried about doing timing – I don’t really care about write timings of these data formats compared to CSV (slow read timing is annoying, but 10 seconds vs 1 minute to read data is not a big deal). But it is good practice to not be gluttonous with on disk space, which saving a bunch of inefficient csv files can be a bit wasteful.

Legends in python

Legends in python and matplotlib can be a little tricky (it is quite a web of different objects). Here are a few notes I have collected over different projects. First, one thing python does not do, even if you name things the same, it does not combine them in a legend. First example will show superimposing lines and error bars – even though I only have two labels, they each get their own slot in the resulting legend.

import numpy as np
import geopandas as gpd

# Need a bunch of junk from matplotlib
import matplotlib
from matplotlib import pyplot as plt
from matplotlib import patches
from matplotlib.legend_handler import HandlerPatch

# Making simple fake data
x = [0,1,2,3]
y1 = np.array([1,1,1,1])
y1l = y1 - 0.5
y1h = y1 + 0.5

y2 = np.array([2,2,2,2])
y2l = y2 - 0.1
y2h = y2 + 0.1

# Default, does not combine legends
fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
ax.legend(bbox_to_anchor=(1.0, 0.5))
plt.savefig('Leg01.png', dpi=500, loc="center left", bbox_inches='tight')

You can combine the legend items by scooping out the original objects, here via the ax.get_* function to get the labels and the “handles”. You can think of handles as just points/lines/polygons that refer to individual parts of the legend. And so combining the lines and polygons together, we can make the legend how we want it.

fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
# Can combine legend items
handler, labeler = ax.get_legend_handles_labels()
hd = [(handler[0],handler[1]),
       (handler[2],handler[3])]
lab = ['y1','y2']
ax.legend(hd, lab, loc="center left", bbox_to_anchor=(1, 0.5))
plt.savefig('Leg02.png', dpi=500, bbox_inches='tight')

I made a simple function combo_legend(ax) to combine items that have the same label in such matplotlib figures. So they do not need to per se be collections of similar items, just have the same text label.

# Can pass these to legend
def combo_legend(ax):
    handler, labeler = ax.get_legend_handles_labels()
    hd = []
    labli = list(set(labeler))
    for lab in labli:
        comb = [h for h,l in zip(handler,labeler) if l == lab]
        hd.append(tuple(comb))
    return hd, labli

# Combo line/error/scatter
fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
ax.scatter(x,y1,c='blue', edgecolor='white',
           s=30,zorder=4,label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
hd, lab = combo_legend(ax)
ax.legend(hd, lab, loc="center left", bbox_to_anchor=(1, 0.5))
plt.savefig('Leg03.png', dpi=500, bbox_inches='tight')

Note that the set() call makes it so the order may not be how you want. You can just rearrange the objects you get back from combo_legend to sort them in the order you want.

Now for the second example in the post, is going to show some examples munging with geopandas objects/plots.

So default geopandas uses a continuous color ramp:

# Get counties for all US
county_url = r'https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip'
us_county = gpd.read_file(county_url)
# Get counties for North Carolina
nc = us_county[us_county['STATEFP'] == '37'].copy()
nc['wat_prop'] = nc['AWATER']/(nc['AWATER'] + nc['ALAND'])

# Plot proportion area water
fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop', edgecolor='grey', linewidth=0.2, ax=ax, legend=True)
plt.savefig('Leg04.png', dpi=1000, bbox_inches='tight')

I have a tough time with continuous color ramps. The geopandas mapping user guide has an example of making a nicer sized continuous legend, but overall I find making classed choropleth maps much easier in general. geopandas objects have a special set of arguments, where you can pass information to class the map and make a legend:

# Make the legend not the raster scale
bin_edge = np.arange(0.2,0.9,0.2)
leg_args = {'loc': 'lower left',
            'prop': {'size': 9},
            'title':'Prop. Water'}

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)
plt.savefig('Leg05.png', dpi=1000, bbox_inches='tight')

But now we have circles. ArcGIS has the ability to use different glyphs in its legends, and it is common to use a blocky type glyph for geopolitical boundaries (which have unnatural straight lines).

If you go down the rabbit hole of geopandas objects, in this scenario the matplotlib handler is actually the same Line2D type object as if you call ax.plot(). So hacking together from a matplotlib doc tutorial, we can make a custom handler to draw our prespecified glyph. (Don’t take this as a nice example legend glyph, just threw something together very quick!) I also add in different labels to signify the boundary edges for the choropleth bins.

class MapHandler:
    def legend_artist(self, legend, orig_handle, fontsize, handlebox):
        x0, y0 = handlebox.xdescent, handlebox.ydescent
        # Can handle lines or polygons
        try:
            face_col = orig_handle.get_markerfacecolor()
        except:
            face_col = orig_handle.get_facecolor()
        width, height = handlebox.width, handlebox.height
        # Ugly shape, can make your own!
        xy = [[0,0],
              [1,0.2],
              [0.7,1],
              [0.05,0.4]]
        xy2 = []
        for x,y in xy:
            xt = x*width
            yt = y*height
            xy2.append([xt,yt])
        patch = patches.Polygon(xy2, fc=face_col)
        handlebox.add_artist(patch)
        return patch

leg_args2 = leg_args.copy()
leg_args2['handler_map'] = {matplotlib.lines.Line2D: MapHandler()}
leg_args2['labels'] = ['[0.0 to 0.2)','[0.2 to 0.4)','[0.4 to 0.6)','[0.6 to 0.8]']

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args2,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)
plt.savefig('Leg06.png', dpi=1000, bbox_inches='tight')

This will be sufficient for many peoples choropleth map needs, but often times in maps you superimpose additional things, such as roads or other types of boundaries. So it may be necessary to go back into the legend and rebuild it entirely.

Here is an example of adding in something totally different, an ellipse, into the legend:

# Taken from that same matplotlib doc linked earlier
class HandlerEllipse(HandlerPatch):
    def create_artists(self, legend, orig_handle,
                       xdescent, ydescent, width, height, fontsize, trans):
        center = 0.5 * width - 0.5 * xdescent, 0.5 * height - 0.5 * ydescent
        p = patches.Ellipse(xy=center, width=width + xdescent,
                             height=height + ydescent)
        self.update_prop(p, orig_handle, legend)
        p.set_transform(trans)
        return [p]

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args2,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)

# ax.get_legend_handles_labels() does not work here
leg = ax.get_legend()
handlers = leg.legendHandles
p = patches.Ellipse(xy=(0,0),width=2,height=1,facecolor='red')
handlers += [p]
new_labs = ['a','b','c','d','Whatever']
new_map = leg.get_legend_handler_map()
new_map[matplotlib.patches.Polygon] = MapHandler()
new_map[matplotlib.patches.Ellipse] = HandlerEllipse()
ax.legend(handlers, new_labs, handler_map=new_map, loc='lower left')
plt.savefig('Leg07.png', dpi=1000, bbox_inches='tight')

Checking for Stale RSS feeds in Python

So while I like RSS feeds, one issue I have noted over the years is that people/companies change the url for the feed. This ends up being a silent error in most feed readers (I swear the google reader had some metrics for this, but that was so long ago).

I have attempted to write code to check for not working feeds at least a few different times and given up. RSS feeds are quite heterogeneous if you look closely. Aliquote had his list though that I started going through to update my own, and figured it would be worth a shot to again filter out old/dormant feeds when checking out new blogs.

So first we will get the blog feeds and Christophe’s tags for each blog:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import ssl
import warnings # get warnings for html parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# aliquotes feed list
feeds = r'https://aliquote.org/pub/urls~'
gcont = ssl.SSLContext()
head = {'User-Agent': 'Chrome/104.0.0.0'}
response = urlopen(Request(feeds,headers=head),context=gcont).read()
list_feeds = response.splitlines()

I do not really know anything about SSL and headers (have had some small nightmares recently on work machines with ZScaler and getting windows subsystem for linux working with poetry/pip). So caveat emptor.

Next we have this ugly function I wrote. It could be cleaned up no doubt, but RSS feeds can fail (either website down, or the url is old/bad), so wrap everything in a try. Also most feeds have items/pubdate tags, but we have a few that have different structure. So all the if/elses are just to capture some of these other formats as best I can.

# Ugly Function to grab RSS last update
def getrss_stats(feed,context=gcont,header=head):
    try:
        req = Request(feed,headers=header)
        response = urlopen(req,context=context)
        soup = BeautifulSoup(response,'html.parser')
        all_items = soup.find_all("item")
        all_pub = soup.find_all("published")
        all_upd = soup.find_all("updated")
        if len(all_items) > 0:
            totn = len(all_items)
            if all_items[0].pubdate:
                last_dt = all_items[0].pubdate.text
            elif all_items[0].find("dc:date"):
                last_dt = all_items[0].find("dc:date").text
            else:
                last_dt = '1/1/1900'
        elif len(all_pub) > 0:
            totn = len(all_pub)
            last_dt = all_pub[0].text
        elif len(all_upd) > 0:
            totn = len(all_upd)
            last_dt = all_upd[0].text
        else:
            totn = 0 # means able to get response
            last_dt = '1/1/1900'
        return [feed,totn,last_dt]
    except:
        return [feed,None,None]

This returns a list of the feed url you put in, as well as the total number of posts in the feed, and the (hopefully) most recent date of a posting. Total numbers is partially a red herring, people may only publish to the feed some limited number of recent blog posts. I fill in missing data from a parsed feed as 0 and ‘1/1/1900’. If the response just overall was bad you get back None values.

So now I can loop over the list of our feeds. Here I only care about those with stats/python/sql (not worrying about Julia yet!):

# Looping over the list
# only care about certain tags
tg = set(['stats','rstats','python','sql'])
fin_stats = []

# Takes a few minutes
for fd in list_feeds:
    fdec = fd.decode().split(" ")
    rss, tags = fdec[0], fdec[1:]
    if (tg & set(tags)):
        rss_stats = getrss_stats(rss)
        rss_stats.append(tags)
        fin_stats.append(rss_stats.copy())

These dates are a bit of a mess. But this is the quickest way to clean them up I know of via some pandas magic:

# Convert data to dataframe
rss_dat = pd.DataFrame(fin_stats,columns=['RSS','TotItems','LastDate','Tags'])

# Coercing everything to a nicer formatted day
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].fillna('1/1/1900'),errors='coerce',utc=False)
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].astype(str).str[:11])

rss_dat.sort_values(by='LastDate',ascending=False,inplace=True,ignore_index=True)
print(rss_dat.head(10))

And you can see (that links to csv file) that most of Christophe’s feed is up to date. But some are dormant (Peter Norvig’s blog RSS is dormant, but his github he posts snippets/updates), and some have moved to different locations (such as Frank Harrell’s).

So for a feed reader to do something like give a note when a feed has not been updated for say 3 months I think would work for many feeds. Some people have blogs that are not as regular (which is fine), but many sites, such as journal papers, something is probably wrong if no updates for several months.

Assessing Categorical Effects in Machine Learning Models

One model diagnostic for machine learning models I like are accumulated local effects (ALE), see Wheeler & Steenbeek (2021) for an example and Molnar (2020) for a canonical mathematical reference. With these we get some ex-ante interpretability of models – I use this for mostly EDA of the final fitted model. Here is an example of seeing the diffusion effect of DART stations on robberies in Dallas from my cited paper:

So the model is behaving as expected – nearby DART stations causes an uptick, and that slowly diffuses away. And the way the ML model is set up it can estimate that diffusion effect, I did not apriori specify what that should look like.

These are essentially average/marginal effects (or approximate derivatives) for complicated machine learning models. In short pseudoish/python code, pretend we have a set of data D and a model, the local effect of variable x at the value 5 is something like:

D['x'] = 5 # set all the data for x to value 5
pred5 = mod.predict(D)
D['x'] = 5 + 0.001 # change value x by just alittle
predc = mod.predict(D)
loc_eff = (pred5 - predc)/0.001
print(loc_eff.mean())

So in shorthand [p(y|x) - p(y|x+s)]/s, so s is some small change (approximate the continuous derivative via finite differences). Then you generate these effects (over your sample), for various values of x, and then make a plot.

Many people say that this only applies to numerical features. So say we have a categorical effect with three variables, a/b/c. We could calculate p(y|a) and p(y|b), but because a - b is not defined (what is the difference in categorical variables), we cannot have anything like a derivative in the ALE for categorical features.

This seems to me though short sited. While we cannot approximate a derivative, the value p(y|a) - p(y|b) is pretty interpretable without the division – this is the predicted difference if we switch from category a to category b. Here I think a decent default is to simply do p(y|cat) - mean(p(y|other cat)), and then you can generate average categorical effects for each category (with very similar interpretation to ALEs). For those who know about regression contrasts, this would be like saying we have dummy variables for A/B/C, and the effect of A is contrast coded via 1,-1/2,-1/2.

Here is a simple example in python. For data see my prior post on the NIJ recidivism challenge or mine and Gio’s working paper (Circo & Wheeler, 2021). Front end cleaning up the data is very similar. I use a catboost model here.

import catboost
import numpy as np
import pandas as pd

# Ommitted Code, see 
# https://andrewpwheeler.com/2021/07/24/variance-of-leaderboard-metrics-for-competitions/
# for how pdata is generated
pdata = prep_data(full_data)

# Original train/test split
train = pdata[pdata['Training_Sample'] == 1].copy()
test = pdata[pdata['Training_Sample'] == 0].copy()

# estimate model, treat all variables as categorical
y_var = 'Recidivism_Arrest_Year1'
x_vars = list(pdata)
x_vars.remove(y_var)
x_vars.remove('Training_Sample')
cat_vars = list( set(x_vars) - set(more_clip) )

cb = catboost.CatBoostClassifier(cat_features=cat_vars)
cb.fit(train[x_vars],train[y_var])

Now we can do the hypothetical change the category and see how it impacts the predicted probabilities (you may prefer to do this on the logit scale, but since it is conditional on all other covariates it should be OK IMO). Here I calculate the probabilities over each of the individual PUMAs in the sample.

# Get the differences in probabilities swapping
# out each county, conditional on other factors
pc = train.copy()
counties = pd.unique(train['Residence_PUMA']).tolist()
res_vals = []
for c in counties:
    pc['Residence_PUMA'] = c
    pp = cb.predict_proba(pc[x_vars])[:,1]
    res_vals.append(pd.Series(pp))

res_pd = pd.concat(res_vals,axis=1)
res_pd.columns = counties
res_pd

So you can see for the person in the first row, if they were in PUMA 16, they would have a predicted probability of recidivism of 0.140. If you switched them to PUMA 24, it changes to 0.136. So you can see the PUMA overall doesn’t appear to have much of an impact on the recidivism prediction in this catboost model.

Now here is leave one out centering as I stated before. So we compare PUMA 16 to the average of all other PUMAs, within each row.

# Now mean center
n = res_pd.shape[1]
row_sum = res_pd.sum(axis=1)
row_adj = (-1*res_pd).add(row_sum,axis=0)/(n-1)
ycent = res_pd - row_adj
ycent

And now we can do various column aggregations to get the average categorical effects per each category. You can do whatever aggregation you want (means/medians/percentiles). (I’ve debated on making my own library to make ALEs a bit more general and return variance estimates as well.)

# Now can get mean/sd/ptils
mn = ycent.mean(axis=0)
sd = ycent.std(axis=0)
low = ycent.quantile(0.025,axis=0)
hig = ycent.quantile(0.975,axis=0)
fin_stats = pd.concat([mn,sd,low,hig],axis=1)
# Cleaning up the data
fin_stats.columns = ['Mean','Std','Low','High']
fin_stats.reset_index(inplace=True)
fin_stats.rename(columns={"index":"PUMA"}, inplace=True)
fin_stats.sort_values(by='Mean',ascending=False,
                      inplace=True,ignore_index=True)
fin_stats

And we can see that while I sorted the PUMAs and PUMA 25 has a mean effect of 0.03, its standard deviation is quite high. The only PUMA that the percentiles do not cover 0 is PUMA 13, with a negative effect of -0.06.

Like I said, I like these more so for model checking/EDA/face validity. Here I would dig into further PUMA 25/13, make sure nothing funny is going on with the data (and whether I should try to tease out more features from these in real life if I had access to the source data, e.g. smaller aggregations). The other PUMAs though are quite unremarkable and have spreads of +/-2 percentage points pretty consistently.

References

  • Circo, G., & Wheeler, A.P. (2021). National Institute of Justice Recidivism Forecasting Challange Team “MCHawks” Performance Analysis. CrimRXiv.
  • Molnar, C. (2020). Interpretable machine learning. Ebook.
  • Wheeler, A. P., & Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37(2), 445-480.

Using linear programming to assess spatial access

So one of the problems I have been thinking about at work is assessing spatial access to providers. Some common metrics are ‘distance to nearest’, or combining distance as well as total provider capacity into one metric, the two-step floating catchment method (2SFCA).

So imagine we are trying to evaluate the coverage of opioid treatment facilities relative to the spatial distribution of people abusing opioids. Say for example we have two areas, A and B. Area A has a facility that can treat 50 people, but has a total of 100 people who need service. Now Area B has no treatment provider, but only has 10 people that need service.

If you looked at distance to nearest provider, location B would look worse than A. So you may think to yourself – we should open up a facility in Area B. But if you look at the total number of people, as well as the capacity of the treatment provider in Area A, it would probably be a better investment to expand the capacity of treatment center A. It has more potential demand.

The 2SFCA partially captures this, but I am going to show a linear programming approach that can decompose the added benefit of adding capacity to a particular provider, or adding in potential new providers. I have the posted these functions on github, but can walk through the linear programming model, and how to assess potential policy interventions in that framework.

So first lets make just a simple set of data, we have 10 people with XY coordinates, and 2 service providers. The two service providers have capacity 3 and 5, so we cannot cover 2 of the people.

# Python example code
import numpy as np
import pandas as pd
import pulp

# locations of people
id = range(10)
px = [1,1,3,3,4,5,6,6,9,9]
py = [2,8,1,8,4,6,1,2,5,8]
peop = pd.DataFrame(zip(id,px,py),columns=['id','px','py'])

# locations of 2 providers & capacity 3/5
hid = [1,2]
hx = [1,8]
hy = [1,8]
hc = [3,5] # so cant cover 2 people
prov = pd.DataFrame(zip(hid,hx,hy,hc),columns=['hid','hx','hy','hc'])

Now for the subsequent model I show, it is going to assign the people to the nearest possible provider, given the distance constraints. To make the model feasible, we need to add in a slack provider that can soak up the remaining people not covered. We set this location to somewhere very far away, so the model will only assign people to this slack provider in the case no other options are available.

# add in a slack location to the providers
# very out of the way and large capacity
prov_slack = prov.iloc[[0],:].copy()
prov_slack['hid'] = 999
prov_slack['hx'] = 999
prov_slack['hy'] = 999
prov_slack['hc'] = peop.shape[0]
prov_add = pd.concat([prov,prov_slack],axis=0)

Now the decision model looks at all pairwise combinations between people and providers (potentially eliminating combinations that are too far away to be reasonable). So I do the cross join between the people/providers, and then calculate the Euclidean distance between them. I set the distance for the slack provider to a very high value (here 9999).

# distance between people/providers
peop['const'] = 1
prov_add['const'] = 1
cross = pd.merge(peop,prov_add,on='const')
cross['dist'] = np.sqrt( (cross['px']-cross['hx'])**2 + 
                         (cross['py']-cross['hy'])**2 )
cross.set_index(['id','hid'],inplace=True)
# setting max distance for outside provider to 9999
cross.loc[cross.xs(999,level="hid",drop_level=False).index,"dist"] = 9999

Now we are ready to fit our linear programming model. First our objective is to minimize distances. (If all inputs are integers, you can use continuous decision variables and it will still return integer solutions.)

# now setting up linear program
# each person assigned to a location
# locations have capacity
# minimize distance traveled

# Minimize distances
P = pulp.LpProblem("MinDist",pulp.LpMinimize)

# each pair gets a decision variable
D = pulp.LpVariable.dicts("DA",cross.index.tolist(),
                          lowBound=0, upBound=1, cat=pulp.LpContinuous)

# Objective function based on distance
P += pulp.lpSum(D[i]*cross.loc[i,'dist'] for i in cross.index)

And we have two types of constraints, one is that each person is assigned a single provider in the end:

# Each person assigned to a single provider
for p in peop['id']:
    provl = cross.xs(p,0,drop_level=False).index
    P += pulp.lpSum(D[i] for i in provl) == 1, f"pers_{p}"

As a note later on, I will expand this model to include multiple people from a single source (e.g. count of people in a zipcode). For that expanded model, this constraint turns into pulp.lpSum(...) == tot where tot is the total people in a single area.

The second constraint is that providers have a capacity limit.

# Each provider capacity constraint
for h in prov_add['hid']:
    peopl = cross.xs(h,level=1,drop_level=False)
    pid = peopl.index
    cap = peopl['hc'].iloc[0] # should be a constant
    P += pulp.lpSum(D[i] for i in pid) <= cap, f"prov_{h}"

Now we can solve the model, and look at who was assigned to where:

# Solve the model
P.solve(pulp.PULP_CBC_CMD()) 
# CPLEX or CBC only ones I know of that return shadow
pulp.value(P.objective) #print objective 20024.33502494309


# Get the person distances
res_pick = []
for ph in cross.index:
    res_pick.append(D[ph].varValue)

cross['picked'] = res_pick
cross[cross['picked'] > 0.99]

So we can see two people were assigned the slack provider 999. Note that some people are not even assigned the closest – person 7 (6,2) is assigned to provider 2 at a distance of 6.3, it is only a distance of 5.1 away from provider 1. Because provider 1 has limited capacity though, they are assigned to provider 2 in the end.

In this framework, we can get the shadow price for the constraints, which says if we relax the constraint, how much it will improve our objective value.So if we add 1 capacity to provider 1, we will improve our objective by -9994.

# Get the shadow constraints per provider
o = [{'name':name, 'shadow price':c.pi, 'slack': c.slack} 
     for name, c in P.constraints.items()]
sc = pd.DataFrame(o)
print(sc)

I have helper functions at the github link above, so I don’t need to go through all of these motions again. You input your people matrix and the field names for the id, x, y, totn values.

And then you input your provider matrix with the field names for the providerid, x, y, prov_capacity (note this is not the matrix with the additional slack provider, my code adds that in automatically). The final two arguments limit the potential locations (e.g. here saying can’t assign a person to a provider over 12 distance away). And the last argument sets the super high distance penalty to people are not assigned.

# load in model functions
from assign_funcs import ProvAssign

# Const=1 is the total people per area
m1 = ProvAssign(peop,
                ['id','px','py','const'],
                prov,
                ['hid','hx','hy','hc'],
                12,
                9999)

m1.solve(pulp.PULP_CBC_CMD())
# see the same objective as before

Now we can go ahead and up our capacity at provider 1 by 1, and see how the objective is reduced by -9994:

# if we up the provider capacity for
# prov by 1, the model objective goes 
# down by -9994
prov['hc'] = [4,5]

m2 = ProvAssign(peop,
                ['id','px','py','const'],
                prov,
                ['hid','hx','hy','hc'],
                12,
                9999)
m2.solve(pulp.PULP_CBC_CMD())
m1.obj - m2.obj # 9994!

Like I said, I extended this to the scenario that you don’t have individual people, but have multiple counts of people in a spatial area. What can happen in this scenario is one source location can send people to multiple providers. Here you can see that source location 4 (total of 12 people), 5 were sent to provider 1, and 7 were sent to provider 2.

# Can make these multiple people, 100 total
peop['tot'] = [10,15,5,10,12,11,20,6,9,2]
prov['cap'] = [40,50] # should assign 10 people to slack

m3 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov,
                ['hid','hx','hy','cap'],
                12,
                9999)
m3.solve(pulp.PULP_CBC_CMD())
# Can see the assignments from one
# source can spill over into multiple
# provider locations
m3.assign

So one of the things I like about this approach I already showed, we can do hypothetical scenarios ‘add capacity’ and see how it improves overall travel. Another potential intervention is to just place dummy 0 capacity providers over the study area, then look at the shadow constraints to see the best locations to open new facilities. Here I add in a potential provider at location (5,5).

# Can add in hypothetical providers with
# no capacity (make new providers)
# and check out the shadow
p2 = prov_slack.copy()
p2['hid'] = 10
p2['hx'] = 5
p2['hy'] = 5
p2['hc'] = 0
p2['cap'] = 0
prov_add = pd.concat([prov,p2],axis=0)


m4 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                10,
                9999)

# we now don't have source9 and prov1
m4.cross

Here in the code, I also have a function to limit potential assignments. Here if I set that limit to 10, it only just prevents id 9 (9,8) from being assigned to provider 1 (1,1), which is a distance of 10.6 away. This is useful with larger datasets, in which you may not be able to fit all of the pairwise distances into memory and model. (Although this model is pretty simple, you may be able to look at +1 million pairwise combos and solve in a reasonable time with open source solvers.)

Now we can go ahead and solve this model. I have a bunch of helper functions as well, so after solving we can check out the shadow price matrix:

# Solve m4 model
m4.solve(pulp.PULP_CBC_CMD())

# can see the new provider 10
# if we added capacity would
# decrease distance travelled by -9996.2426
m4.shadow

And based on this, it does look like we will have the best improvement by adding capacity at our hypothetical new provider 10, as oppossed to adding capacity at provider 1 or 2. Lets see what happens if you add 10 capacity to our hypothetical provider:

# Now lets add capacity for new provider
# by 10, should the objective go down
# by 10*-9996.2426 ?
prov_add['cap'] = [40,50,10]

m5 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                12,
                9999)
m5.solve(pulp.PULP_CBC_CMD())

# Not quite -9996.2 but close!
(m5.obj - m4.obj)/10

We can see that the objective was not quite reduced by the expected, but is close. If it is not feasible to add a totally new provider, but simpler to give resources to expand current ones, we can see what will happen if we expand provider 1 by 10.

# we could add capacity
# to provider 1 by 10
# as well
prov_add['cap'] = [50,50,0]

m6 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                10,
                9999)
m6.solve(pulp.PULP_CBC_CMD())

# Not quite -9993.4 but close!
(m6.obj - m4.obj)/10

In addition to doing different policy interventions in this approach, I provide different metrics to assess distance traveled and coverage. So going back to model 4, we can look at which areas have limited coverage. It only ends up that source area 1 has 10/15 people not covered. The rest of the areas are covered.

# Can look at sources and see stats
# for distance travelled as well
# as potential non-coverage
m4.source_stats

The fields are Trav is the average distance travelled for people who are covered (so could have high coverage but those people have to go a ways). Tot is the total number of people within that particular source area, and picked/not-covered are those assigned a provider/not assigned a provider (so go to the slack) respectively.

I additionally I have stats available rolled up to providers, mostly based on the not covered nearby. One way that the shadow doesn’t work, if you only have 10 people nearby, it doesn’t make sense to expand capacity to any more than 10 people. Here you can see the shadow price, as well as those not covered that are nearby to that provider.

# Can look at providers
# and see which ones would result in
# most reduced travel given increased coverage
# Want high NotCover and low shadow price
m4.prov_stats

The other fields, hc is the listed capacity. Trav is the total travel for those not covered. The last field bisqw, is a way to only partially count not covered based on distance. It uses the bi-square kernel weight, based on the max distance field you provided in the model object. So if many people are nearby it will get a higher weight.

Here good add capacity locations are those with low shadow prices, and high notcovered/bisqw fields.

Note these won’t per se give the best potential places to add capacity. It may be the best new solution is to spread out new capacity – so here instead of adding 10 to a single provider, add a few to different providers.

You could technically figure that out by slicing out those leftover people in the m4.source_stats that have some not covered, adding in extra capacity to each provider, and then rerunning the algorithm and seeing where they end up.

Again I like this approach as it lets you articulate different policy proposals and see how that changes the coverage/distance travelled. Although no doubt there are other ways to formulate the problem that may make sense, such as maximizing coverage or making a coverage/distance bi-objective function.

Estimating Criminal Lifetime Value

At work I am currently working on estimating/forecasting healthcare spending. Similar to work I have done on forecasting person level crime risks (Wheeler et al., 2019), I build the predictive model dataset like this:

CrimeYear2020 PriorCrimeA PriorCrimeB
     0              2          3
     1              5          0
     0              0          0

etc. So I flatten people to a single row, and as covariates include prior cumulative crime histories. Most people do this similarly in the healthcare setting, so it looks like:

SpendingYear2020 PriorComorbidA PriorComorbidB
     3000              1          2
      500              3          0
    10000              0          0

Or sometimes people do a longitudinal dataset, where it is a spending*year*person panel (see Lauffenburger et al., 2020 for an example). I find this approach annoying for a few reasons though. One, it requires arbitrary temporal binning, which is somewhat problematic in these transaction level databases. We are talking for chronic offenders a few crimes per year is quite high, and ditto in the healthcare setting a few procedures a year can be very costly. So there is not much data to estimate the underlying function over time.

A second aspect I think is bad is that it doesn’t take into account the recency of the patterns. So the variables on the right hand side can be very old or very new. And with transaction level databases it is somewhat difficult to define how to estimate the lookback – do you consider it normalized by time? The VOID paper I mentioned we evaluated the long term scores, but the PD that does that chronic offender system has two scores – one a cumulative history and another a 90 day history to attempt to deal with that issue (again ad-hoc).

One approach to this issue from marketing research I have seen from very similar types of transactions databases are models to estimate Customer Lifetime Value (Fader et al. 2005). These models in the end generate a dataset that looks like this:

Person    RecentMonths  TotalEvents AveragePurchase
  A            3             5            $50
  B            1             2           $100
  C            9             8            $25

TotalEvents should be straightforward, RecentMonths just is a measure of the time since the last purchase, and then you have the average value of the purchases. And using just this data, estimates the probability of any future purchases, as well as projects the total value of the future average purchases. So here I use an example of this approach, using the Wolfgang Philly cohort public data. I am not going into the model more specifically (read some of the Bruce Hardie notes to get a flavor).

I have created some python code to follow along and apply these same customer lifetime value estimates to chronic offender data. Most examples of weighting crime harm apply it to spatial areas (Mitchell, 2019; Wheeler & Reuter, 2021), but you can apply it the same to chronic offender lists (Liggins et al., 2019).

Example Criminal Lifetime Value in Python

First, install the lifetimes python library – Cam’s documentation is excellent and makes the data manipulation/modelling quite simple.

Here I load in the transaction level crime data, e.g. it just have person A, 1/5/1960, 1000, where the 1000 is a crime seriousness index created by Wolfgang. Then the lifetimes package has some simple functions to turn our data into the frequency/recency format.

Note that for these models, you drop the first event in the series. To build a model to do train/test, I also split the data into evens before 1962, and use 1962 as the holdout test period.

import lifetimes as lt
import pandas as pd

# Just the columns from dataset II
# ID, SeriousScore, Date
df = pd.read_csv('PhilData.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Creating the cumulative data
# Having holdout for one year in future
sd = lt.utils.calibration_and_holdout_data(df,'ID','Date',
              calibration_period_end='12-31-1961',
              observation_period_end='12-31-1962',
              freq='M',
              monetary_value_col='SeriousScore')

# Only keeping people with 2+ events in prior period
sd = sd[sd['frequency_cal'] > 0].copy()
sd.head()

Recency_cal is how many months since a prior crime (starting in 1/1/1962), frequency is the total number of events (minus 1, so number of repeat events technically), and the monetary_value_cal here is the average of the crime seriousness across all the events. The way this function works, the variables with the subscript _cal are in the training period, and _holdout are events in the 1962 period. For subsequent models I subset out people with at least 2 events total in the modeling.

Now we can fit a model to estimate the predicted number of future crimes a person will commit – so this does not take into account the seriousness of those crimes. The final groupby statement shows the predicted number of crimes vs those actually committed, broken down by number of crimes in the training time period. You can see the model is quite well calibrated over the entire sample.

# fit BG model
bgf = lt.BetaGeoFitter(penalizer_coef=0)
bgf.fit(sd['frequency_cal'],sd['recency_cal'],sd['T_cal'])

# look at fit of BG model
t = 12
sd['pred_events'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, sd['frequency_cal'], sd['recency_cal'],sd['T_cal'])
sd.groupby('frequency_cal',as_index=False)[['frequency_holdout','pred_events']].sum() # reasonable

Now we can fit a model to estimate the average crime severity score for an individual as well. Then you can project a future cumulative score for an offender (here over a horizon of 1 year), by multiple the predicted number of events times the estimate of the average severity of the events, what I label as pv here:

# See conditional seriousness
sd['pred_ser'] = ggf.conditional_expected_average_profit(
                              sd['frequency_cal'],
                              sd['monetary_value_cal'])

sd['pv'] = sd['pred_ser']*sd['pred_events']
sd['cal_tot_val'] = sd['monetary_value_holdout']*sd['frequency_holdout']
# Not great correlation, around 0.2
vc = ['frequency_holdout','monetary_value_holdout','cal_tot_val','pred_events','pv']
sd[vc].corr()

The correlation between pv and the holdout cumulative crime severity cal_tot_val, is not great at 0.26. But lets look at this relative to the more typical approach analysts will do, simply rank prior offenders based on either total number of events or the crime seriousness:

# Lets look at this method via just ranking prior
# seriousness or frequency
sd['rank_freq'] = sd['frequency_cal'].rank(method='first',ascending=True)
sd['rank_seri'] = (sd['monetary_value_cal']*sd['frequency_cal']).rank(method='first',ascending=True)
vc += ['rank_freq','rank_seri']
sd[vc].corr()[vc[-3:]]

So we can see that pv outperforms ranking based on total crimes (rank_freq), or ranking based on the cumulative serious score for offenders (rank_seri) in terms of the correlation for either the total number of future events or the cumulative crime harm.

If we look at capture rates, e.g. pretend we highlight the top 50 chronic offenders for intervention, we can see the criminal lifetime value pv estimate outperforms either simple ranking scheme by quite a bit:

# Look at capture rates by ranking
topn = 50
res_summ = []
for v in vc[-3:]:
    rank = sd[v].rank(method='first',ascending=False)
    locv = sd[rank <= topn].copy()
    tot_crimes = locv['frequency_holdout'].sum()
    tot_ser = locv['cal_tot_val'].sum()
    res_summ.append( [v,tot_crimes,tot_ser,topn] )

res_df = pd.DataFrame(res_summ,columns=['Var','TotCrimes','TotSer','TotN'])
res_df

In terms of the seriousness projection, it is reasonably well calibrated over the entire sample, but has a very tiny variance – it basically just predicts the average crime serious score over the sample and assigns that as the prediction going forward:

# Cumulative stats over sample reasonable
# variance much too small
sd[['cal_tot_val','pv']].describe()

So what this means is that if say Chicago READI wanted to do estimates to reasonably justify the max dollar cost for their program (over a large number of individuals) that would be reasonable. And this is how most marketing people use this info, average benefits of retaining a customer.

For individual projections though, e.g. I think OffenderB will generate between [low,high] crime harm in the next year, this is not quite up to par. I am hoping though to pursue these models further, maybe either in a machine learning/regression framework to estimate the parameters directly, or to use mixture models in an equivalent way that marketers use “segmentation” to identify different types of customers. Knowing the different way people have formulated models though is very helpful to be able to build a machine learning framework, which you can incorporate covariates.

References