Hot spots of crime in Raleigh and home buying

So my realtor, Ellen Pitts (who is highly recommended, helped us a ton moving into Raleigh remotely), has a YouTube channel where she talks about real estate trends. In her most recent video she discussed crime in Raleigh relative to other cities, prompted by the most recent shooting.

My criminologist hot take is that generally most cities in the US are relatively low crime. So Ellen shows Dallas has quite a few more per-capita shootings than Raleigh, but Dallas is quite safe “overall”. Probably somewhat contra to what most people think, the cities that in my opinion really have the most crime problems tend to be smaller rust belt cities. I love Troy, NY (where I was a crime analyst for a few years), but Troy is quite a bit rougher around the edges than Raleigh or Dallas.

So this post is more about, you have already chosen to move to Raleigh – if I am comparing house 1 and house 2 (or looking at general neighborhoods), do I need to worry about crime in this specific location?

So here are a few specific resources/strategies for the home hunter. Not just Raleigh, but many cities now have an open data portal where you can often look at crime. Here is an example with the Raleigh open data:

So if you have a specific address in mind, you can go and see the recent crime around that location (cities often fuzz the address a bit, so the actual points are just nearby on that block of the street). Blue dots in that screenshot are recent crimes in 2022 against people (you can click on each dot and get a more specific breakdown). Be prepared when you do this – crime is everywhere. But that said the vast majority of minor crime incidents should not deter you from buying a house or renting at a particular location.

Note I recommend looking at actual crime data (points on a map) for this. Several vendors release crime stats aggregated to neighborhoods or zipcodes, but these are of very low quality. (Often they “make up” data when it doesn’t exist, and when data does exist they don’t have a really great way to rank areas of low or high crime.)

For the more high level question – should I worry about this neighborhood? – I made an interactive hotspot map.

For the methodology, I focused on crimes that I would personally be concerned with as a homeowner. If I pulled larceny crimes, I am sure the Target in North Hills would be a hotspot (but I would totally buy a condo in North Hills). So this pulls the recent crime data from the Raleigh open data starting in 2020, and only scoops up aggravated assaults, interpersonal robberies, weapon violations, and residential burglaries. Folks may be concerned about drug incidents and breaking into cars as well, but in my experience those also do not tend to occur in residential areas. The python code to replicate the map is here.

Then I created DBScan clusters that had at least 34 crimes – so these areas average at least one of these crimes per month over the time period I sampled. Zooming in, even though I tried to filter for more potentially residential related crimes, you can see the majority of these hot spots of crime are commercial areas in Raleigh. So for example you can zoom in and check out the string of hot spots on Capital Blvd (and if you click a hot spot you can get breakdowns of specific crime stats I looked at):

Very few of these hot spots are in residential neighborhoods – most are in more commercial areas. So when looking at homes in Raleigh, there are very few spots in the city where I would worry about crime at all when making a housing choice. Whether you are moving into a neighborhood with a higher proportion of renters is, I think, a potentially more important long-term signal than crime here in Raleigh.
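
For reference, the clustering step described above is essentially DBSCAN with a minimum of 34 points per cluster. Here is a minimal sketch of that step – the fake coordinates and the eps distance are made-up stand-ins, see the linked python code for the actual parameters:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Fake stand-in for the filtered incident data
# (the real code pulls the Raleigh open data)
crimes = pd.DataFrame({'X_feet': np.random.uniform(0, 30000, 1000),
                       'Y_feet': np.random.uniform(0, 30000, 1000)})

# min_samples=34 matches the at-least-34-crimes threshold described above
# the 1500 foot eps is a made-up value just for illustration
db = DBSCAN(eps=1500, min_samples=34).fit(crimes[['X_feet','Y_feet']])
crimes['hot_spot'] = db.labels_  # -1 = noise, not part of any hot spot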

Hyperparameter tuning for Random Forests

Motivated to write this post based on a few different examples at work. One, we have periodically tried different auto machine learning (automl) libraries at work (with quite mediocre success). They are OK for a baseline, not so much for production. Two, a fellow data scientist was trying some simple hyperparameter search in random forests, and was searching over all the wrong things.

For those not familiar, automl libraries, such as DataRobot or Databricks automl, typically do a grid search over different models given a particular problem (e.g. random forest, logistic regression, XGBoost, etc.). And within this you can have variants of similar models, e.g. RFMod1(max-depth=5) vs RFMod2(max-depth=10). Then, given a train/test split, the library sees which model wins in the test set and suggests promoting that model into production.

My experience at work with these different libraries (mostly tabular medical records data) is that they choose XGBoost variants at a very high frequency. I think this is partially due to poor defaults for the hyperparameter search in several of these automl libraries. Here I will just be focused on random forests, and to follow along I have code posted on Github.

Here I am using the NIJ recidivism challenge data I have written about in the past, and for just some upfront work I am loading in various libraries/functions (I have the recid functions in the github repo). And then making it easy to just load in the train/test data (with some smart feature engineering already completed).

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import recid # my custom NIJ recid funcs

# Read in data and feature engineering
train, test = recid.fe()

Now first, in terms of hyperparameter tuning you need to understand the nature of the model you are fitting, and how those hyperparameters interact with the model. In terms of random forests for example, I see several automl libraries (and my colleague) doing a search over the number of trees (or estimators in sklearn parlance). In random forests the trees are independent, and so having more trees is always better. It is sort of like asking would you rather I do 100 simulations or 1,000,000 simulations to estimate a parameter – the 1,000,000 will of course have a more accurate estimate (although it may be wasteful, as the extra precision is not needed).

So here, using the NIJ defined train/test split and a set of fixed parameters (close to what I typically default to for random forests), I show how the AUC and the Brier Score change with the number of trees, over a grid from 10 to 1000 trees:

# Loop over tree sizes
ntrees = np.arange(10,1010,10)
auc, bs = [], []

# This is showing number of estimators is asymptotic
# should approach best value with higher numbers
# but has some variance in out of sample test
for n in ntrees:
    print(f'Fitting Model for ntrees:{n} @ {datetime.now()}')
    # Baseline model
    mod = RandomForestClassifier(n_estimators=n, max_depth=5, min_samples_split=50)
    # Fit Model
    mod.fit(train[recid.xvars], train[recid.y1])
    # Evaluate AUC/BrierScore out of sample
    pred_out = mod.predict_proba(test[recid.xvars])[:,1]
    auc.append(roc_auc_score(test[recid.y1], pred_out))
    bs.append(brier_score_loss(test[recid.y1], pred_out))

# Making a plot of the AUC/Brier Score over tree sizes
fig, (ax1, ax2) = plt.subplots(2, figsize=(6,8))
ax1.plot(ntrees, auc)
ax1.set_title('AUC')
ax2.plot(ntrees, bs)
ax2.set_title('Brier Score')
plt.savefig('Ntree_grid.png', dpi=500, bbox_inches='tight')

We can see that the relationship is noisy, but performance clearly improves with more trees (the AUC rises and the Brier Score falls), and perhaps asymptotes past 200 some trees for both metrics in this particular set of data.

So of course it depends on the dataset, but when I see automl libraries choosing trees in the range of 10, 50, 100 for random forests I roll my eyes a bit. You always get more accuracy (in a statistical sense) with more trees. You would only choose that few for convenience in fitting and time savings. The R library ranger has a better default of 500 trees IMO (sklearn’s default of 100 is often too small in my tests). But there isn’t much point in trying to hyperparameter tune this – your hyperparameter library may choose a smaller number of trees in a particular run, but this is due to noise in the tuning process itself.

So what hyperparameters do matter for random forests? With all machine learning models you need to worry about over-fitting/under-fitting. This depends on the nature of the data (more rows let you fit more parameters, fewer columns means less opportunity to overfit). For random forests this is dependent on the complexity of the trees – more complex trees can overfit the data. If you have more rows of data, you can fit more complex trees. So typically I am doing searches over (in sklearn parlance) max_depth (how deep a tree can grow) and min_samples_split (no more splits if the samples in a node are too few). Another parameter to search over is how many columns to subsample as well (more columns can find more complex trees).

It can potentially be a balancing act – requiring more samples per split will by default create less complex trees. Many of the other hyperparameters limiting the complexity of the trees are redundant with these as well (so you could really swap them out with max_depth). Here is an example of using the Optuna library a friend recommended to figure out the best parameters for this particular data set, using a train/eval approach (so this splits up the training set even further):

# Consistent train/test split for all evaluations
tr1, tr2 = train_test_split(train, test_size=2000)

def objective(trial):
    param = {
        "n_estimators": 500,
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 10, 200, step=10),
        "max_features": trial.suggest_int("max_features", 3, 15),
    }
    mod = RandomForestClassifier(**param)
    mod.fit(tr1[recid.xvars], tr1[recid.y1])
    pred_out = mod.predict_proba(tr2[recid.xvars])[:,1]
    auc = roc_auc_score(tr2[recid.y1], pred_out)
    return auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
trial = study.best_trial

print(f"Best AUC {trial.value}")
print("Best Params")
print(trial.params)

So here it chooses fairly deep trees (a max depth of 9), but limits complexity via the minimum sample size to split (90). The number of features per tree is also slightly larger than the default (the data has 25 features, so the default is sqrt(25) = 5). So it appears my personal defaults were slightly under-fitting the data.

In my experience regression problems prefer a smaller tree depth but lower minimum sample sizes (splitting down to 1 is OK), while for binary classification limiting the sample sizes in leaves is preferable.

One thing to pay attention to with these automl libraries: a few do not retrain on the full dataset with the selected hyperparameters. For certain scenarios you don’t want to do this (e.g. if using the eval set to do calibrations/prediction intervals), but if you don’t care about those then you typically want to retrain on the full set. Another random personal observation: random forests really only start to outperform simpler regression strategies at around 20k cases – with smaller sample sizes, and further splitting into train/eval/test, the hyperparameter search is very noisy and mostly a waste of time. If you only have a few hundred cases just fit a reasonable regression model and call it a day.

So here I retrain on the full training set, and then look at the AUC/Brier Score compared to my default model.

# Lets refit according to these params for the full
# training set
mod2 = RandomForestClassifier(n_estimators=500,**trial.params)
mod2.fit(train[recid.xvars], train[recid.y1])
pred_out = mod2.predict_proba(test[recid.xvars])[:,1]
auc_tuned = roc_auc_score(test[recid.y1], pred_out)
bs_tuned = brier_score_loss(test[recid.y1], pred_out)
print(f"AUC tuned {auc_tuned:.4f} vs AUC default {auc[-1]:.4f}")
print(f"Brier Score tuned {bs_tuned:.4f} vs default {bs[-1]:.4f}")

It is definitely worth hyperparameter tuning once you have your feature set down, but a lift of ~0.01 AUC (which is typical for hypertuned tabular models in my experience) is not a big deal during initial model building. Figuring out a smart way to encode some relevant feature (based on domain knowledge of the problem you are dealing with) typically has more lift than this in my experience.

Column storage for wide datasets

The notion of big data is quite a bit overhyped. I do get some exposure to it at work, but many tools like Hadoop are not needed when you can do things in chunks on more typical machines. One thing I have learned more generally about databases though: many relational databases (such as postgres) store data under the hood like this:

Index1|1|'a'|Index2|2|'b'.....|Index9999|1|'z'

So they are stacked cumulatively in an underlying file format. And then to access the information the system has to know “ok I need to find Index2 and column 2”, so it calculates offsets for where those pieces of information should be located in the file.

This however is not so nice for databases that have many columns. In particular, for administrative record databases that have few rows but many thousands of columns (often with many missing fields and low-dimensional categories), this default format is not very friendly. In fact many databases have limits on the number of columns a table can have. (“Few rows” is relative, but really I am contrasting with sensor data that often has billions/trillions of rows, but very few columns.)

In these situations, a columnar database (or data format) makes more sense. Instead of having to calculate a large number of offsets, the database represents each column in some easier base format, and only worries about grabbing the specific columns you ask for. Representing the underlying data in a more efficient manner both lowers the disk space used (e.g. it only takes up 1 gig instead of many gigabytes) and improves input/output operations on the data (e.g. it is faster to read the data/write new data).
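
Conceptually, compared to the row-stacked layout shown earlier, the same data in a columnar layout looks more like:

Index1|Index2|.....|Index9999
1|2|.....|1
'a'|'b'|.....|'z'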

I will show some examples via the American Community Survey data for micro areas. I have saved the functions/code here on github to follow along.

So first, I load in pandas and DuckDB. DuckDB is an open source database that uses a columnar storage format by default, but you can consider it a very similar drop-in replacement for SQLite for persistently storing data.

import pandas as pd
import duckdb
import os

# my local census functions
# grabbing small area 2019 data
# here defaults to just Texas/Delaware
# still takes a few minutes
import census_funcs
cen2019_small, fields2019 = census_funcs.get_data(year=2019)
print(cen2019_small.shape) # (21868, 17145)

Here I grab the small area data for just Texas/Delaware (this includes mostly census tracts and census block groups in the census FTP data). You can see not so many rows, almost 22k, but quite a few columns, over 17k. This is after some data munging to drop entirely missing/duplicated columns even. The census data just has very many slightly different aggregate categories.

Next I save these data files to disk. For data that does not change very often, you just want to do this process a single time and save that data somewhere you can more easily access it. No need to re-download the data from the internet/census site every time you want to do a new analysis.

I don’t have timing data here, but you can just experiment to see the parquet data format is quite a bit faster to save (and csv formats are the slowest). DuckDB is smart and just knows to look at your local namespace to find the referenced pandas dataframe in the execute string.

# Save locally to CSV file
cen2019_small.to_csv('census_test.csv')

# Trying zip compression
cen2019_small.to_csv('census_test.csv.zip',compression="zip")

# Save to parquet files
cen2019_small.to_parquet('census.parquet.gzip',compression='gzip',engine='pyarrow')

# Save to DuckDB, should make the DB if does not exist
con = duckdb.connect(database='census.duckdb',read_only=False)
con.execute('CREATE TABLE census_2019 AS SELECT * FROM cen2019_small')

If we then go and check out the data sizes on disk, the csv for just these two states is 0.8 gigs (the full set of data for all 50 states is closer to 20 gigs). Using zip compression reduces this to around 1/4th of the size of the csv file. Using parquet format (which can be considered an alternative flat file format to CSV, although columnar oriented) with gzip compression is pretty much the same as zip compression for this data (not many missing values or repeat categories of numbers), but if you have repeat categorical data with a bit of missing data it should give slightly better compression (I am thinking NIBRS here). The entire DuckDB database is actually smaller than the CSV file (I haven’t checked closely, I try to coerce the data to smarter float/int formats before saving, but there is probably even more space to squeeze out of my functions).

# Lets check out the file sizes
files = ['census_test.csv','census_test.csv.zip',
         'census.parquet.gzip','census.duckdb']

for fi in files:
    file_size = os.path.getsize(fi)
    print(f'File size for {fi} is {file_size/1024**3:,.1f} gigs')

# File size for census_test.csv is 0.8 gigs
# File size for census_test.csv.zip is 0.2 gigs
# File size for census.parquet.gzip is 0.2 gigs
# File size for census.duckdb is 0.7 gigs

The benefit of having a database here like DuckDB is for later SQL querying, as well as the ability to save additional tables. If I scoop up the 2018 data (that has slightly different columns), I can save to an additional table. Then later on downstream applications can select out the limited columns/years as needed (unlikely any real analysis workflow needs all 17,000 columns).

# Can add in another table into duckdb
cen2018_small, fields2018 = census_funcs.get_data(year=2018)
con.execute('CREATE TABLE census_2018 AS SELECT * FROM cen2018_small')
con.close()

ducksize = os.path.getsize(files[-1])
print(f'File size for {files[-1]} with two tables is {ducksize/1024**3:,.1f} gigs')
# File size for census.duckdb with two tables is 1.5 gigs
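
As a quick sketch of that downstream querying (this is just DuckDB’s standard execute/df API – the first few columns are grabbed as stand-ins for whatever variables a real analysis actually needs):

# Reconnect read-only and pull just a handful of columns
con = duckdb.connect(database='census.duckdb', read_only=True)

# grab the column names without reading any rows
cols = con.execute('SELECT * FROM census_2019 LIMIT 0').df().columns.tolist()

# then select only the few columns you need (first five here as a stand-in)
few = ', '.join(f'"{c}"' for c in cols[:5])
sub = con.execute(f'SELECT {few} FROM census_2019 LIMIT 10').df()
con.close()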

SQLite is nice, as you can basically share a SQLite file and you know your friend will be able to open it. I have not worked with DuckDB that much, but hopefully it has similar functionality and the ability to share files without too much headache.

I have not worried about doing timing – I don’t really care about write timings of these data formats compared to CSV (slow read timing is annoying, but 10 seconds vs 1 minute to read data is not a big deal). But it is good practice not to be gluttonous with disk space, and saving a bunch of inefficient csv files can be a bit wasteful.

Legends in python

Legends in python and matplotlib can be a little tricky (it is quite a web of different objects). Here are a few notes I have collected over different projects. First, one thing matplotlib does not do: even if you give things the same label, it does not combine them in a legend. The first example shows superimposing lines and error bars – even though I only have two labels, each artist gets its own slot in the resulting legend.

import numpy as np
import geopandas as gpd

# Need a bunch of junk from matplotlib
import matplotlib
from matplotlib import pyplot as plt
from matplotlib import patches
from matplotlib.legend_handler import HandlerPatch

# Making simple fake data
x = [0,1,2,3]
y1 = np.array([1,1,1,1])
y1l = y1 - 0.5
y1h = y1 + 0.5

y2 = np.array([2,2,2,2])
y2l = y2 - 0.1
y2h = y2 + 0.1

# Default, does not combine legends
fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))
plt.savefig('Leg01.png', dpi=500, bbox_inches='tight')

You can combine the legend items by scooping out the original objects, here via the ax.get_* function to get the labels and the “handles”. You can think of handles as just points/lines/polygons that refer to individual parts of the legend. And so combining the lines and polygons together, we can make the legend how we want it.

fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
# Can combine legend items
handler, labeler = ax.get_legend_handles_labels()
hd = [(handler[0],handler[1]),
       (handler[2],handler[3])]
lab = ['y1','y2']
ax.legend(hd, lab, loc="center left", bbox_to_anchor=(1, 0.5))
plt.savefig('Leg02.png', dpi=500, bbox_inches='tight')

I made a simple function combo_legend(ax) to combine items that have the same label in such matplotlib figures. So they do not need to per se be collections of similar items, just have the same text label.

# Can pass these to legend
def combo_legend(ax):
    handler, labeler = ax.get_legend_handles_labels()
    hd = []
    labli = list(set(labeler))
    for lab in labli:
        comb = [h for h,l in zip(handler,labeler) if l == lab]
        hd.append(tuple(comb))
    return hd, labli

# Combo line/error/scatter
fig, ax = plt.subplots()
# First Line
ax.plot(x,y1,zorder=3, color='blue',alpha=0.9,label='y1')
ax.fill_between(x,y1l,y1h,alpha=0.2,zorder=2,color='blue',label='y1')
ax.scatter(x,y1,c='blue', edgecolor='white',
           s=30,zorder=4,label='y1')
# Second Line
ax.plot(x,y2,zorder=3, color='k',alpha=0.9,label='y2')
ax.fill_between(x,y2l,y2h,alpha=0.2,zorder=2,color='k',label='y2')
hd, lab = combo_legend(ax)
ax.legend(hd, lab, loc="center left", bbox_to_anchor=(1, 0.5))
plt.savefig('Leg03.png', dpi=500, bbox_inches='tight')

Note that the set() call makes it so the order may not be how you want. You can just rearrange the objects you get back from combo_legend to sort them in the order you want.
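
For example, to put the combined entries in an explicit order (assuming you know the label strings up front):

# Rearrange combo_legend output into a preferred label order
hd, lab = combo_legend(ax)
want = ['y1','y2'] # the order you actually want
ordr = [lab.index(w) for w in want]
hd = [hd[i] for i in ordr]
lab = [lab[i] for i in ordr]
ax.legend(hd, lab, loc="center left", bbox_to_anchor=(1, 0.5))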

The second example in the post shows some munging with geopandas objects/plots.

By default geopandas uses a continuous color ramp:

# Get counties for all US
county_url = r'https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip'
us_county = gpd.read_file(county_url)
# Get counties for North Carolina
nc = us_county[us_county['STATEFP'] == '37'].copy()
nc['wat_prop'] = nc['AWATER']/(nc['AWATER'] + nc['ALAND'])

# Plot proportion area water
fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop', edgecolor='grey', linewidth=0.2, ax=ax, legend=True)
plt.savefig('Leg04.png', dpi=1000, bbox_inches='tight')

I have a tough time with continuous color ramps. The geopandas mapping user guide has an example of making a nicer sized continuous legend, but overall I find making classed choropleth maps much easier in general. geopandas objects have a special set of arguments, where you can pass information to class the map and make a legend:

# Make the legend not the raster scale
bin_edge = np.arange(0.2,0.9,0.2)
leg_args = {'loc': 'lower left',
            'prop': {'size': 9},
            'title':'Prop. Water'}

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)
plt.savefig('Leg05.png', dpi=1000, bbox_inches='tight')

But now we have circles. ArcGIS has the ability to use different glyphs in its legends, and it is common to use a blocky type glyph for geopolitical boundaries (which have unnatural straight lines).

If you go down the rabbit hole of geopandas objects, in this scenario the matplotlib handler is actually the same Line2D type object as if you call ax.plot(). So hacking together from a matplotlib doc tutorial, we can make a custom handler to draw our prespecified glyph. (Don’t take this as a nice example legend glyph, just threw something together very quick!) I also add in different labels to signify the boundary edges for the choropleth bins.

class MapHandler:
    def legend_artist(self, legend, orig_handle, fontsize, handlebox):
        x0, y0 = handlebox.xdescent, handlebox.ydescent
        # Can handle lines or polygons
        try:
            face_col = orig_handle.get_markerfacecolor()
        except:
            face_col = orig_handle.get_facecolor()
        width, height = handlebox.width, handlebox.height
        # Ugly shape, can make your own!
        xy = [[0,0],
              [1,0.2],
              [0.7,1],
              [0.05,0.4]]
        xy2 = []
        for x,y in xy:
            xt = x*width
            yt = y*height
            xy2.append([xt,yt])
        patch = patches.Polygon(xy2, fc=face_col)
        handlebox.add_artist(patch)
        return patch

leg_args2 = leg_args.copy()
leg_args2['handler_map'] = {matplotlib.lines.Line2D: MapHandler()}
leg_args2['labels'] = ['[0.0 to 0.2)','[0.2 to 0.4)','[0.4 to 0.6)','[0.6 to 0.8]']

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args2,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)
plt.savefig('Leg06.png', dpi=1000, bbox_inches='tight')

This will be sufficient for many people’s choropleth map needs, but oftentimes in maps you superimpose additional things, such as roads or other types of boundaries. So it may be necessary to go back into the legend and rebuild it entirely.

Here is an example of adding in something totally different, an ellipse, into the legend:

# Taken from that same matplotlib doc linked earlier
class HandlerEllipse(HandlerPatch):
    def create_artists(self, legend, orig_handle,
                       xdescent, ydescent, width, height, fontsize, trans):
        center = 0.5 * width - 0.5 * xdescent, 0.5 * height - 0.5 * ydescent
        p = patches.Ellipse(xy=center, width=width + xdescent,
                             height=height + ydescent)
        self.update_prop(p, orig_handle, legend)
        p.set_transform(trans)
        return [p]

fig, ax = plt.subplots(figsize=(8,4))
nc.plot(column='wat_prop',
        scheme="User_Defined",
        classification_kwds=dict(bins=bin_edge),
        legend_kwds=leg_args2,
        edgecolor='grey',
        linewidth=0.2,
        ax=ax,
        legend=True)

# ax.get_legend_handles_labels() does not work here
leg = ax.get_legend()
handlers = leg.legendHandles
p = patches.Ellipse(xy=(0,0),width=2,height=1,facecolor='red')
handlers += [p]
new_labs = ['a','b','c','d','Whatever']
new_map = leg.get_legend_handler_map()
new_map[matplotlib.patches.Polygon] = MapHandler()
new_map[matplotlib.patches.Ellipse] = HandlerEllipse()
ax.legend(handlers, new_labs, handler_map=new_map, loc='lower left')
plt.savefig('Leg07.png', dpi=1000, bbox_inches='tight')

Checking for Stale RSS feeds in Python

So while I like RSS feeds, one issue I have noted over the years is that people/companies change the url for the feed. This ends up being a silent error in most feed readers (I swear Google Reader had some metrics for this, but that was so long ago).

I have attempted to write code to check for broken feeds at least a few different times and given up. RSS feeds are quite heterogeneous if you look closely. Aliquote had posted his list though, which I started going through to update my own, and I figured it would be worth another shot to filter out old/dormant feeds when checking out new blogs.

So first we will get the blog feeds and Christophe’s tags for each blog:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import ssl
import warnings # get warnings for html parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# aliquotes feed list
feeds = r'https://aliquote.org/pub/urls~'
gcont = ssl.SSLContext()
head = {'User-Agent': 'Chrome/104.0.0.0'}
response = urlopen(Request(feeds,headers=head),context=gcont).read()
list_feeds = response.splitlines()

I do not really know anything about SSL and headers (have had some small nightmares recently on work machines with ZScaler and getting windows subsystem for linux working with poetry/pip). So caveat emptor.

Next we have this ugly function I wrote. It could no doubt be cleaned up, but RSS feeds can fail (either the website is down, or the url is old/bad), so I wrap everything in a try. Also most feeds have items/pubdate tags, but a few have a different structure, so all the if/elses are just there to capture some of these other formats as best I can.

# Ugly Function to grab RSS last update
def getrss_stats(feed,context=gcont,header=head):
    try:
        req = Request(feed,headers=header)
        response = urlopen(req,context=context)
        soup = BeautifulSoup(response,'html.parser')
        all_items = soup.find_all("item")
        all_pub = soup.find_all("published")
        all_upd = soup.find_all("updated")
        if len(all_items) > 0:
            totn = len(all_items)
            if all_items[0].pubdate:
                last_dt = all_items[0].pubdate.text
            elif all_items[0].find("dc:date"):
                last_dt = all_items[0].find("dc:date").text
            else:
                last_dt = '1/1/1900'
        elif len(all_pub) > 0:
            totn = len(all_pub)
            last_dt = all_pub[0].text
        elif len(all_upd) > 0:
            totn = len(all_upd)
            last_dt = all_upd[0].text
        else:
            totn = 0 # means able to get response
            last_dt = '1/1/1900'
        return [feed,totn,last_dt]
    except:
        return [feed,None,None]

This returns a list of the feed url you put in, the total number of posts in the feed, and the (hopefully) most recent date of a posting. The total number is partially a red herring, as people may only publish a limited number of recent blog posts to the feed. I fill in missing data from a parsed feed as 0 and ‘1/1/1900’. If the response overall was bad you get back None values.

So now I can loop over the list of our feeds. Here I only care about those with stats/python/sql (not worrying about Julia yet!):

# Looping over the list
# only care about certain tags
tg = set(['stats','rstats','python','sql'])
fin_stats = []

# Takes a few minutes
for fd in list_feeds:
    fdec = fd.decode().split(" ")
    rss, tags = fdec[0], fdec[1:]
    if (tg & set(tags)):
        rss_stats = getrss_stats(rss)
        rss_stats.append(tags)
        fin_stats.append(rss_stats.copy())

These dates are a bit of a mess. But this is the quickest way to clean them up I know of via some pandas magic:

# Convert data to dataframe
rss_dat = pd.DataFrame(fin_stats,columns=['RSS','TotItems','LastDate','Tags'])

# Coercing everything to a nicer formatted day
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].fillna('1/1/1900'),errors='coerce',utc=False)
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].astype(str).str[:11])

rss_dat.sort_values(by='LastDate',ascending=False,inplace=True,ignore_index=True)
print(rss_dat.head(10))

And you can see (that links to a csv file) that most of Christophe’s feeds are up to date. But some are dormant (Peter Norvig’s blog RSS is dormant, but he posts snippets/updates on his github), and some have moved to different locations (such as Frank Harrell’s).

So I think it would work for many feeds if a feed reader did something like flag feeds that have not been updated for, say, 3 months. Some people have blogs that are not as regular (which is fine), but for many sites, such as journal paper feeds, something is probably wrong if there are no updates for several months.
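
As a rough sketch of that idea using the rss_dat table built above (the 90 day cutoff is arbitrary):

# Flag feeds with no posts in the last 90 days
cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
stale = rss_dat[rss_dat['LastDate'] < cutoff]
print(f'{stale.shape[0]} of {rss_dat.shape[0]} feeds look stale')
print(stale[['RSS','LastDate','Tags']].head(10))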

Assessing Categorical Effects in Machine Learning Models

One model diagnostic for machine learning models I like is accumulated local effects (ALE); see Wheeler & Steenbeek (2021) for an example and Molnar (2020) for a canonical mathematical reference. With these we get some post-hoc interpretability of models – I mostly use this for EDA of the final fitted model. Here is an example of seeing the diffusion effect of DART stations on robberies in Dallas from my cited paper:

So the model is behaving as expected – nearby DART stations cause an uptick, and that slowly diffuses away. And the way the ML model is set up it can estimate that diffusion effect; I did not a priori specify what it should look like.

These are essentially average/marginal effects (or approximate derivatives) for complicated machine learning models. In short pseudoish/python code, pretend we have a set of data D and a model, the local effect of variable x at the value 5 is something like:

D['x'] = 5 # set all the data for x to value 5
pred5 = mod.predict(D)
D['x'] = 5 + 0.001 # change value x by just a little
predc = mod.predict(D)
loc_eff = (pred5 - predc)/0.001
print(loc_eff.mean())

So in shorthand this is [p(y|x) - p(y|x+s)]/s, where s is some small change (approximating the continuous derivative via finite differences). Then you generate these effects (averaged over your sample) for various values of x, and make a plot.

Many people say that this only applies to numerical features. So say we have a categorical feature with three levels, a/b/c. We could calculate p(y|a) and p(y|b), but because a - b is not defined (what is the difference between two categories?), we cannot have anything like a derivative in the ALE for categorical features.

This seems short sighted to me though. While we cannot approximate a derivative, the value p(y|a) - p(y|b) is pretty interpretable without the division – this is the predicted difference if we switch from category a to category b. Here I think a decent default is to simply do p(y|cat) - mean(p(y|other cat)), and then you can generate average categorical effects for each category (with very similar interpretation to ALEs). For those who know about regression contrasts, this would be like saying we have dummy variables for A/B/C, and the effect of A is contrast coded via 1, -1/2, -1/2.

Here is a simple example in python. For data, see my prior post on the NIJ recidivism challenge or mine and Gio’s working paper (Circo & Wheeler, 2021). The front end cleaning of the data is very similar. I use a catboost model here.

import catboost
import numpy as np
import pandas as pd

# Omitted code, see 
# https://andrewpwheeler.com/2021/07/24/variance-of-leaderboard-metrics-for-competitions/
# for how pdata is generated
pdata = prep_data(full_data)

# Original train/test split
train = pdata[pdata['Training_Sample'] == 1].copy()
test = pdata[pdata['Training_Sample'] == 0].copy()

# estimate model, treat all variables as categorical
y_var = 'Recidivism_Arrest_Year1'
x_vars = list(pdata)
x_vars.remove(y_var)
x_vars.remove('Training_Sample')
cat_vars = list( set(x_vars) - set(more_clip) ) # more_clip comes from the omitted prep code

cb = catboost.CatBoostClassifier(cat_features=cat_vars)
cb.fit(train[x_vars],train[y_var])

Now we can do the hypothetical of changing the category and see how it impacts the predicted probabilities (you may prefer to do this on the logit scale, but since it is conditional on all other covariates it should be OK IMO). Here I calculate the probabilities over each of the individual PUMAs in the sample.

# Get the differences in probabilities swapping
# out each county, conditional on other factors
pc = train.copy()
counties = pd.unique(train['Residence_PUMA']).tolist()
res_vals = []
for c in counties:
    pc['Residence_PUMA'] = c
    pp = cb.predict_proba(pc[x_vars])[:,1]
    res_vals.append(pd.Series(pp))

res_pd = pd.concat(res_vals,axis=1)
res_pd.columns = counties
res_pd

So you can see for the person in the first row, if they were in PUMA 16, they would have a predicted probability of recidivism of 0.140. If you switched them to PUMA 24, it changes to 0.136. So you can see the PUMA overall doesn’t appear to have much of an impact on the recidivism prediction in this catboost model.

Now here is the leave-one-out centering I described before. So we compare PUMA 16 to the average of all other PUMAs, within each row.

# Now mean center
n = res_pd.shape[1]
row_sum = res_pd.sum(axis=1)
row_adj = (-1*res_pd).add(row_sum,axis=0)/(n-1)
ycent = res_pd - row_adj
ycent

And now we can do various column aggregations to get the average categorical effects per each category. You can do whatever aggregation you want (means/medians/percentiles). (I’ve debated on making my own library to make ALEs a bit more general and return variance estimates as well.)

# Now can get mean/sd/ptils
mn = ycent.mean(axis=0)
sd = ycent.std(axis=0)
low = ycent.quantile(0.025,axis=0)
hig = ycent.quantile(0.975,axis=0)
fin_stats = pd.concat([mn,sd,low,hig],axis=1)
# Cleaning up the data
fin_stats.columns = ['Mean','Std','Low','High']
fin_stats.reset_index(inplace=True)
fin_stats.rename(columns={"index":"PUMA"}, inplace=True)
fin_stats.sort_values(by='Mean',ascending=False,
                      inplace=True,ignore_index=True)
fin_stats

And we can see that while PUMA 25 is sorted to the top with a mean effect of 0.03, its standard deviation is quite high. The only PUMA whose percentiles do not cover 0 is PUMA 13, with a negative effect of -0.06.

Like I said, I like these more for model checking/EDA/face validity. Here I would dig further into PUMA 25/13, and make sure nothing funny is going on with the data (and whether I should try to tease out more features from these in real life if I had access to the source data, e.g. smaller aggregations). The other PUMAs though are quite unremarkable and have spreads of +/-2 percentage points pretty consistently.

References

  • Circo, G., & Wheeler, A.P. (2021). National Institute of Justice Recidivism Forecasting Challenge Team “MCHawks” Performance Analysis. CrimRxiv.
  • Molnar, C. (2020). Interpretable machine learning. Ebook.
  • Wheeler, A. P., & Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37(2), 445-480.

Using linear programming to assess spatial access

So one of the problems I have been thinking about at work is assessing spatial access to providers. Some common metrics are ‘distance to nearest’, or combining distance as well as total provider capacity into one metric, the two-step floating catchment method (2SFCA).

So imagine we are trying to evaluate the coverage of opioid treatment facilities relative to the spatial distribution of people abusing opioids. Say for example we have two areas, A and B. Area A has a facility that can treat 50 people, but has a total of 100 people who need service. Now Area B has no treatment provider, but only has 10 people that need service.

If you looked at distance to nearest provider, location B would look worse than A. So you may think to yourself – we should open up a facility in Area B. But if you look at the total number of people, as well as the capacity of the treatment provider in Area A, it would probably be a better investment to expand the capacity of treatment center A. It has more potential demand.

The 2SFCA partially captures this, but I am going to show a linear programming approach that can decompose the added benefit of adding capacity to a particular provider, or of adding in potential new providers. I have posted these functions on github, and will walk through the linear programming model and how to assess potential policy interventions in that framework.

So first let’s make a simple set of data: we have 10 people with XY coordinates, and 2 service providers. The two service providers have capacities of 3 and 5, so we cannot cover 2 of the people.

# Python example code
import numpy as np
import pandas as pd
import pulp

# locations of people
id = range(10)
px = [1,1,3,3,4,5,6,6,9,9]
py = [2,8,1,8,4,6,1,2,5,8]
peop = pd.DataFrame(zip(id,px,py),columns=['id','px','py'])

# locations of 2 providers & capacity 3/5
hid = [1,2]
hx = [1,8]
hy = [1,8]
hc = [3,5] # so cant cover 2 people
prov = pd.DataFrame(zip(hid,hx,hy,hc),columns=['hid','hx','hy','hc'])

Now for the subsequent model I show, it is going to assign the people to the nearest possible provider, given the distance constraints. To make the model feasible, we need to add in a slack provider that can soak up the remaining people not covered. We set this location to somewhere very far away, so the model will only assign people to this slack provider in the case no other options are available.

# add in a slack location to the providers
# very out of the way and large capacity
prov_slack = prov.iloc[[0],:].copy()
prov_slack['hid'] = 999
prov_slack['hx'] = 999
prov_slack['hy'] = 999
prov_slack['hc'] = peop.shape[0]
prov_add = pd.concat([prov,prov_slack],axis=0)

Now the decision model looks at all pairwise combinations between people and providers (potentially eliminating combinations that are too far away to be reasonable). So I do the cross join between the people/providers, and then calculate the Euclidean distance between them. I set the distance for the slack provider to a very high value (here 9999).

# distance between people/providers
peop['const'] = 1
prov_add['const'] = 1
cross = pd.merge(peop,prov_add,on='const')
cross['dist'] = np.sqrt( (cross['px']-cross['hx'])**2 + 
                         (cross['py']-cross['hy'])**2 )
cross.set_index(['id','hid'],inplace=True)
# setting max distance for outside provider to 9999
cross.loc[cross.xs(999,level="hid",drop_level=False).index,"dist"] = 9999

Now we are ready to fit our linear programming model. First our objective is to minimize distances. (If all inputs are integers, you can use continuous decision variables and it will still return integer solutions.)

# now setting up linear program
# each person assigned to a location
# locations have capacity
# minimize distance traveled

# Minimize distances
P = pulp.LpProblem("MinDist",pulp.LpMinimize)

# each pair gets a decision variable
D = pulp.LpVariable.dicts("DA",cross.index.tolist(),
                          lowBound=0, upBound=1, cat=pulp.LpContinuous)

# Objective function based on distance
P += pulp.lpSum(D[i]*cross.loc[i,'dist'] for i in cross.index)

And we have two types of constraints, one is that each person is assigned a single provider in the end:

# Each person assigned to a single provider
for p in peop['id']:
    provl = cross.xs(p,0,drop_level=False).index
    P += pulp.lpSum(D[i] for i in provl) == 1, f"pers_{p}"

As a note, later on I will expand this model to include multiple people from a single source (e.g. counts of people in a zipcode). For that expanded model, this constraint turns into pulp.lpSum(...) == tot, where tot is the total number of people in a single area (see the sketch below).
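
In sketch form (this is not the exact code in my helper functions, and it assumes a tot column of per-area counts in the people dataframe), that expanded constraint would look something like:

# Sketch of the multi-person variant
# the decision variables would also need upBound=None (or tot) instead of 1
for p, tot in zip(peop['id'], peop['tot']):
    provl = cross.xs(p, 0, drop_level=False).index
    P += pulp.lpSum(D[i] for i in provl) == tot, f"pers_{p}"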

The second constraint is that providers have a capacity limit.

# Each provider capacity constraint
for h in prov_add['hid']:
    peopl = cross.xs(h,level=1,drop_level=False)
    pid = peopl.index
    cap = peopl['hc'].iloc[0] # should be a constant
    P += pulp.lpSum(D[i] for i in pid) <= cap, f"prov_{h}"

Now we can solve the model, and look at who was assigned to where:

# Solve the model
P.solve(pulp.PULP_CBC_CMD()) 
# CPLEX or CBC only ones I know of that return shadow
pulp.value(P.objective) #print objective 20024.33502494309


# Get the person distances
res_pick = []
for ph in cross.index:
    res_pick.append(D[ph].varValue)

cross['picked'] = res_pick
cross[cross['picked'] > 0.99]

So we can see two people were assigned the slack provider 999. Note that some people are not even assigned to the closest provider – person 7 (6,2) is assigned to provider 2 at a distance of 6.3, even though it is only a distance of 5.1 away from provider 1. Because provider 1 has limited capacity, they are assigned to provider 2 in the end.

In this framework, we can get the shadow price for the constraints, which says how much the objective value will improve if we relax the constraint. So if we add 1 capacity to provider 1, we will improve our objective by -9994.

# Get the shadow constraints per provider
o = [{'name':name, 'shadow price':c.pi, 'slack': c.slack} 
     for name, c in P.constraints.items()]
sc = pd.DataFrame(o)
print(sc)

I have helper functions at the github link above, so I don’t need to go through all of these motions again. You input your people matrix and the field names for the id, x, y, totn values.

And then you input your provider matrix with the field names for the providerid, x, y, prov_capacity (note this is not the matrix with the additional slack provider, my code adds that in automatically). The final two arguments limit the potential assignments (e.g. here saying you can’t assign a person to a provider over 12 distance away) and set the super high distance penalty for people who are not assigned.

# load in model functions
from assign_funcs import ProvAssign

# Const=1 is the total people per area
m1 = ProvAssign(peop,
                ['id','px','py','const'],
                prov,
                ['hid','hx','hy','hc'],
                12,
                9999)

m1.solve(pulp.PULP_CBC_CMD())
# see the same objective as before

Now we can go ahead and up our capacity at provider 1 by 1, and see how the objective is reduced by 9994:

# if we up the provider capacity for
# prov by 1, the model objective goes 
# down by -9994
prov['hc'] = [4,5]

m2 = ProvAssign(peop,
                ['id','px','py','const'],
                prov,
                ['hid','hx','hy','hc'],
                12,
                9999)
m2.solve(pulp.PULP_CBC_CMD())
m1.obj - m2.obj # 9994!

Like I said, I extended this to the scenario where you don’t have individual people, but counts of people in a spatial area. What can happen in this scenario is that one source location can send people to multiple providers. Here you can see that for source location 4 (total of 12 people), 5 were sent to provider 1 and 7 were sent to provider 2.

# Can make these multiple people, 100 total
peop['tot'] = [10,15,5,10,12,11,20,6,9,2]
prov['cap'] = [40,50] # should assign 10 people to slack

m3 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov,
                ['hid','hx','hy','cap'],
                12,
                9999)
m3.solve(pulp.PULP_CBC_CMD())
# Can see the assignments from one
# source can spill over into multiple
# provider locations
m3.assign

So one of the things I like about this approach, as I already showed, is that we can do hypothetical ‘add capacity’ scenarios and see how they improve overall travel. Another potential intervention is to just place dummy 0 capacity providers over the study area, then look at the shadow constraints to see the best locations to open new facilities. Here I add in a potential provider at location (5,5).

# Can add in hypothetical providers with
# no capacity (make new providers)
# and check out the shadow
p2 = prov_slack.copy()
p2['hid'] = 10
p2['hx'] = 5
p2['hy'] = 5
p2['hc'] = 0
p2['cap'] = 0
prov_add = pd.concat([prov,p2],axis=0)


m4 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                10,
                9999)

# we now don't have source9 and prov1
m4.cross

Here in the code, I also have a function to limit potential assignments. Here if I set that limit to 10, it only just prevents id 9 (9,8) from being assigned to provider 1 (1,1), which is a distance of 10.6 away. This is useful with larger datasets, in which you may not be able to fit all of the pairwise distances into memory and model. (Although this model is pretty simple, you may be able to look at +1 million pairwise combos and solve in a reasonable time with open source solvers.)

Now we can go ahead and solve this model. I have a bunch of helper functions as well, so after solving we can check out the shadow price matrix:

# Solve m4 model
m4.solve(pulp.PULP_CBC_CMD())

# can see the new provider 10
# if we added capacity would
# decrease distance travelled by -9996.2426
m4.shadow

And based on this, it does look like we will have the best improvement by adding capacity at our hypothetical new provider 10, as opposed to adding capacity at provider 1 or 2. Let’s see what happens if you add 10 capacity to our hypothetical provider:

# Now lets add capacity for new provider
# by 10, should the objective go down
# by 10*-9996.2426 ?
prov_add['cap'] = [40,50,10]

m5 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                12,
                9999)
m5.solve(pulp.PULP_CBC_CMD())

# Not quite -9996.2 but close!
(m5.obj - m4.obj)/10

We can see that the objective was not quite reduced by the expected amount, but it is close. If it is not feasible to add a totally new provider, but simpler to give resources to expand current ones, we can see what will happen if we expand provider 1 by 10.

# we could add capacity
# to provider 1 by 10
# as well
prov_add['cap'] = [50,50,0]

m6 = ProvAssign(peop,
                ['id','px','py','tot'],
                prov_add,
                ['hid','hx','hy','cap'],
                10,
                9999)
m6.solve(pulp.PULP_CBC_CMD())

# Not quite -9993.4 but close!
(m6.obj - m4.obj)/10

In addition to doing different policy interventions in this approach, I provide different metrics to assess distance traveled and coverage. So going back to model 4, we can look at which areas have limited coverage. It only ends up that source area 1 has 10/15 people not covered. The rest of the areas are covered.

# Can look at sources and see stats
# for distance travelled as well
# as potential non-coverage
m4.source_stats

For the fields, Trav is the average distance travelled for people who are covered (so an area could have high coverage but those people have to go a ways). Tot is the total number of people within that particular source area, and picked/not-covered are those assigned a provider/not assigned a provider (i.e. sent to the slack) respectively.

I additionally have stats available rolled up to providers, mostly based on those not covered nearby. One way the shadow price can mislead: if you only have 10 people nearby, it doesn’t make sense to expand capacity by any more than 10 people. Here you can see the shadow price, as well as those not covered who are nearby to that provider.

# Can look at providers
# and see which ones would result in
# most reduced travel given increased coverage
# Want high NotCover and low shadow price
m4.prov_stats

For the other fields, hc is the listed capacity and Trav is the total travel for those not covered. The last field, bisqw, is a way to only partially count the not covered based on distance. It uses a bi-square kernel weight, based on the max distance field you provided in the model object, so people who are nearby get a higher weight.
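
For reference, a bi-square kernel weight is typically computed like the snippet below (a sketch of the standard formula, not necessarily the exact details in my helper functions):

import numpy as np

def bisquare(dist, max_dist):
    # weight of 1 at distance 0, tapering to 0 at max_dist and beyond
    d = np.minimum(np.asarray(dist, dtype=float)/max_dist, 1.0)
    return (1 - d**2)**2

# e.g. people at distances 0, 2, 5, and 12 from a provider, with max distance 10
print(bisquare([0, 2, 5, 12], 10))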

Here, good locations to add capacity are those with low shadow prices and high NotCover/bisqw values.

Note these won’t per se give the best potential places to add capacity. It may be the best new solution is to spread out new capacity – so here instead of adding 10 to a single provider, add a few to different providers.

You could technically figure that out by slicing out those leftover people in the m4.source_stats that have some not covered, adding in extra capacity to each provider, and then rerunning the algorithm and seeing where they end up.

Again I like this approach as it lets you articulate different policy proposals and see how that changes the coverage/distance travelled. Although no doubt there are other ways to formulate the problem that may make sense, such as maximizing coverage or making a coverage/distance bi-objective function.

Estimating Criminal Lifetime Value

At work I am currently working on estimating/forecasting healthcare spending. Similar to work I have done on forecasting person level crime risks (Wheeler et al., 2019), I build the predictive model dataset like this:

CrimeYear2020 PriorCrimeA PriorCrimeB
     0              2          3
     1              5          0
     0              0          0

etc. So I flatten people to a single row, and as covariates include prior cumulative crime histories. Most people do this similarly in the healthcare setting, so it looks like:

SpendingYear2020 PriorComorbidA PriorComorbidB
     3000              1          2
      500              3          0
    10000              0          0

Or sometimes people do a longitudinal dataset, where it is a spending*year*person panel (see Lauffenburger et al., 2020 for an example). I find this approach annoying for a few reasons though. One, it requires arbitrary temporal binning, which is somewhat problematic in these transaction level databases. For chronic offenders a few crimes per year is quite high, and ditto in the healthcare setting a few procedures a year can be very costly. So there is not much data to estimate the underlying function over time.

A second aspect I think is bad is that it doesn’t take into account the recency of the patterns. So the variables on the right hand side can be very old or very new. And with transaction level databases it is somewhat difficult to define how to estimate the lookback – do you normalize it by time? In the VOID paper I mentioned we evaluated the long term scores, but the PD that runs that chronic offender system has two scores – one a cumulative history and another a 90 day history – to attempt to deal with that issue (again ad-hoc).

One approach to this issue I have seen from marketing research, which uses very similar types of transaction databases, is models to estimate Customer Lifetime Value (Fader et al. 2005). These models in the end generate a dataset that looks like this:

Person    RecentMonths  TotalEvents AveragePurchase
  A            3             5            $50
  B            1             2           $100
  C            9             8            $25

TotalEvents should be straightforward, RecentMonths is just a measure of the time since the last purchase, and then you have the average value of the purchases. Using just this data, the model estimates the probability of any future purchases, as well as projects the total value of the future purchases. So here I show an example of this approach, using the Wolfgang Philly cohort public data. I am not going into the model more specifically (read some of the Bruce Hardie notes to get a flavor).

I have created some python code to follow along and apply these same customer lifetime value estimates to chronic offender data. Most examples of weighting crime harm apply it to spatial areas (Mitchell, 2019; Wheeler & Reuter, 2021), but you can apply it the same to chronic offender lists (Liggins et al., 2019).

Example Criminal Lifetime Value in Python

First, install the lifetimes python library – Cam’s documentation is excellent and makes the data manipulation/modelling quite simple.

Here I load in the transaction level crime data, e.g. it just has rows like person A, 1/5/1960, 1000, where the 1000 is a crime seriousness index created by Wolfgang. Then the lifetimes package has some simple functions to turn our data into the frequency/recency format.

Note that for these models, you drop the first event in the series. To build a model to do train/test, I also split the data into events before 1962, and use 1962 as the holdout test period.

import lifetimes as lt
import pandas as pd

# Just the columns from dataset II
# ID, SeriousScore, Date
df = pd.read_csv('PhilData.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Creating the cumulative data
# Having holdout for one year in future
sd = lt.utils.calibration_and_holdout_data(df,'ID','Date',
              calibration_period_end='12-31-1961',
              observation_period_end='12-31-1962',
              freq='M',
              monetary_value_col='SeriousScore')

# Only keeping people with 2+ events in prior period
sd = sd[sd['frequency_cal'] > 0].copy()
sd.head()

Recency_cal is how many months since a prior crime (starting in 1/1/1962), frequency is the total number of events (minus 1, so the number of repeat events technically), and monetary_value_cal here is the average of the crime seriousness across all the events. The way this function works, the variables with the subscript _cal are in the training period, and _holdout are events in the 1962 period. For subsequent models I subset to people with at least 2 total events in the modeling.

Now we can fit a model to estimate the predicted number of future crimes a person will commit – so this does not take into account the seriousness of those crimes. The final groupby statement shows the predicted number of crimes vs those actually committed, broken down by number of crimes in the training time period. You can see the model is quite well calibrated over the entire sample.

# fit BG model
bgf = lt.BetaGeoFitter(penalizer_coef=0)
bgf.fit(sd['frequency_cal'],sd['recency_cal'],sd['T_cal'])

# look at fit of BG model
t = 12
sd['pred_events'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, sd['frequency_cal'], sd['recency_cal'],sd['T_cal'])
sd.groupby('frequency_cal',as_index=False)[['frequency_holdout','pred_events']].sum() # reasonable
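As an aside, if you prefer a visual check, the lifetimes package also has plotting helpers (this assumes matplotlib is installed); a minimal sketch:

# Visual calibration check using the lifetimes plotting helpers
import matplotlib.pyplot as plt
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

plot_calibration_purchases_vs_holdout_purchases(bgf, sd)
plt.savefig('bg_calibration.png', dpi=300, bbox_inches='tight')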

Now we can fit a second model to estimate the average crime severity score for an individual as well. Then you can project a future cumulative score for an offender (here over a horizon of 1 year) by multiplying the predicted number of events times the estimate of the average severity of those events, what I label as pv here:

# Fit the Gamma-Gamma model for average seriousness
ggf = lt.GammaGammaFitter(penalizer_coef=0)
ggf.fit(sd['frequency_cal'], sd['monetary_value_cal'])

# See conditional seriousness
sd['pred_ser'] = ggf.conditional_expected_average_profit(
                              sd['frequency_cal'],
                              sd['monetary_value_cal'])

sd['pv'] = sd['pred_ser']*sd['pred_events']
sd['cal_tot_val'] = sd['monetary_value_holdout']*sd['frequency_holdout']
# Not great correlation, around 0.2
vc = ['frequency_holdout','monetary_value_holdout','cal_tot_val','pred_events','pv']
sd[vc].corr()

The correlation between pv and the holdout cumulative crime severity, cal_tot_val, is not great at 0.26. But let's look at this relative to the more typical approach analysts will take, simply ranking prior offenders based on either their total number of events or their cumulative crime seriousness:

# Lets look at this method via just ranking prior
# seriousness or frequency
sd['rank_freq'] = sd['frequency_cal'].rank(method='first',ascending=True)
sd['rank_seri'] = (sd['monetary_value_cal']*sd['frequency_cal']).rank(method='first',ascending=True)
vc += ['rank_freq','rank_seri']
sd[vc].corr()[vc[-3:]]

So we can see that pv outperforms ranking based on total prior crimes (rank_freq), or ranking based on the cumulative seriousness score for offenders (rank_seri), in terms of the correlation with either the total number of future events or the cumulative crime harm.

If we look at capture rates, e.g. pretend we highlight the top 50 chronic offenders for intervention, we can see the criminal lifetime value pv estimate outperforms either simple ranking scheme by quite a bit:

# Look at capture rates by ranking
topn = 50
res_summ = []
for v in vc[-3:]:
    rank = sd[v].rank(method='first',ascending=False)
    locv = sd[rank <= topn].copy()
    tot_crimes = locv['frequency_holdout'].sum()
    tot_ser = locv['cal_tot_val'].sum()
    res_summ.append( [v,tot_crimes,tot_ser,topn] )

res_df = pd.DataFrame(res_summ,columns=['Var','TotCrimes','TotSer','TotN'])
res_df

In terms of the seriousness projection, it is reasonably well calibrated over the entire sample, but it has a very tiny variance – it basically just predicts the average crime seriousness score over the sample and assigns that as the prediction going forward:

# Cumulative stats over sample reasonable
# variance much too small
sd[['cal_tot_val','pv']].describe()

So what this means is that if, say, Chicago READI wanted estimates to justify the maximum dollar cost of their program (averaged over a large number of individuals), this approach would be reasonable. And that is how most marketing people use this info, estimating the average benefit of retaining a customer.

For individual projections though, e.g. I think OffenderB will generate between [low, high] crime harm in the next year, this is not quite up to par. I am hoping to pursue these models further though, maybe in a machine learning/regression framework to estimate the parameters directly, or using mixture models in the same way marketers use “segmentation” to identify different types of customers. Knowing the different ways people have formulated these models is very helpful for building a machine learning framework into which you can incorporate covariates.


Some pandas notes (part 1, EDA)

Teaching my son python at the moment – when I gave him a final project to do on his own, he complained and said he did not remember how to do one step from the prior lessons.

In the end, I am guessing I really only have maybe 30 commands in any language memorized, and then maybe another 70 I know of (but often need to look at the docs for their less often used arguments).

Here I figured it would be good for learners for me to put down my notes on pandas dataframes in python. At least the fewer than 20 commands I regularly use.

So first off, in what I think will be a 3 part series, is using pandas for exploratory data analysis (EDA). Besides the tables I will show here, I often only use histograms and smooth conditional plots, and even those not very often.

I debated on writing a post that EDA is overrated (I spend way more time understanding business objectives, and after that, tuning a reasonably non-linear machine learning model and examining its results dominates). That does make writing a post on EDA quite easy though – I can show a dozen pandas methods I regularly use, and that covers over 90% of the typical EDA work I do.

Here I will show an example using the PPP loan data available to download, in particular I will be using the over 150k loan data. A nice aspect of pandas is you can read in a CSV file directly from a url.

# Python code examples for EDA
import pandas as pd

# PPP loans, see full dataset
# at https://data.sba.gov/dataset/ppp-foia
url = r'https://data.sba.gov/dataset/8aa276e2-6cab-4f86-aca4-a7dde42adf24/resource/501af711-1c91-477a-80ce-bf6428eb9253/download/public_150k_plus_220403.csv'

# Can read in from url directly
pp_loans = pd.read_csv(url)
pp_loans.shape # over 900k rows

At work I am often querying a database using SQLAlchemy instead of reading a csv file. But this is a nice example to show off – the data is too big to really work with effectively in Excel, but in pandas it should be no problem for most people's machines.
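For reference, the database version of that workflow is typically just a read_sql call with a SQLAlchemy engine; the connection string and table name below are made-up placeholders:

# Hypothetical database example, connection string and table are placeholders
from sqlalchemy import create_engine

eng = create_engine('postgresql://user:password@server:5432/dbname')
df = pd.read_sql('SELECT * FROM ppp_loans', con=eng)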

So you can see the first thing I look at is data.shape to get the number of rows and columns. Next I often just print the columns using list(data):

# Can use list to see all of the variables
list(pp_loans)

So you can access/modify the actual column names via data.columns, but this is a bit nicer printing.

Next I often summarize the data using the data.describe() method. I like transposing this result to be easier to read though.

# Transpose describe for easier reading
pp_loans.describe().T

Here we get a bit of a mess of scientific notation (also this only shows summaries of float data by default, not categorical or dates). If you want printing that is a bit easier on the eyes, I typically have this pd.set_option() handy. Also you can scale the data, here I divide everything by 1000.

# If you dont want scientific notation
pd.set_option('display.float_format', '{:,.0f}'.format)
pp_loans.describe().T/1000

One of the things I like about this the most is the count, so you can see which fields have missing values (here those should likely be set to 0 for the *_PROCEED variables). The second thing I look at are the min/max; sometimes you can spot rogue missing value codes or censoring in the data. Then looking at the quartiles is typically informative as to how skewed the data is. Only then do I bother with the mean, which for binary data is reasonably informative.
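Going back to those missing *_PROCEED values, if you decide they really are zeros, a quick way to fill them all at once (assuming the columns share that suffix) is:

# Fill missing *_PROCEED fields with 0
proceed_cols = [c for c in pp_loans.columns if c.endswith('_PROCEED')]
pp_loans[proceed_cols] = pp_loans[proceed_cols].fillna(0)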

You can see for this particular data it is the loans over $150k – here that apparently means the CurrentApprovalAmount, which ranges from $150k to $10 million (the apparent cap on PPP loans). So this is the over-$150k, conditional-on-approval dataset.

For the next step, I am typically looking at the distribution of various categorical variables. For that I am often just using data['variable'].value_counts():

# Value counts is maybe my favorite
pp_loans['BusinessType'].value_counts()

This prints out all of the business types in this example, but with a very large number of categories it will limit the printout to the head/tail. Here we can see Corps are the most common, with LLCs and S-Corps next.
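If you want proportions instead of raw counts, value_counts has a normalize argument:

# Proportions instead of raw counts
pp_loans['BusinessType'].value_counts(normalize=True)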

The next most common pandas method I use is data.groupby(), which produces summary aggregations conditional on different groups in the data. For example, here we can get the total sum of the loans (in millions of dollars) for each of the business types.

# Groupby with stats is next, per million
x = 'BusinessType'
y = 'CurrentApprovalAmount'
pp_loans.groupby(x)[y].sum().sort_values()/1e6

You can see that you can chain multiple pandas methods together, which is just a typical python object oriented pattern, so I tack a .sort_values() onto the end of the groupby. Note again these are in millions of dollars, so Co-ops got over a billion dollars of PPP loans in this data (it will be more overall; these again are only loans over $150k, and the link at the beginning has a bunch of additional datasets with the smaller loan amounts).

You can subsequently apply more complicated functions within groupby; just as an example, here I create my own normal approximation confidence interval of the mean for large samples.

# Can do more complicated functions
def ci95(x):
    mx = x.mean()
    se = x.sem()  # standard error of the mean
    res = pd.DataFrame([[mx - 2*se, mx + 2*se]])
    res.columns = ['low','high']
    return res

pp_loans.groupby(x)[y].apply(ci95)

I often use this for functions with binary data, but otherwise we can again chain methods together, so we can use data.groupby(x)[y].describe() to get those summaries for each of our groups. Also as an FYI you can pass in multiple y values to summarize via a list, although it gets a bit hairy in the output:

# And can have multiple outcome variables
ys = [y,'PAYROLL_PROCEED']
pp_loans.groupby(x)[ys].describe()
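To illustrate the binary data comment above, if you make a quick indicator variable the groupby mean is then the proportion within each group; here a flag for loans over $1 million, created just for illustration:

# Proportion of loans over $1 million within each business type
# (over_1mil is an indicator created just for this example)
pp_loans['over_1mil'] = 1*(pp_loans['CurrentApprovalAmount'] > 1e6)
pp_loans.groupby(x)['over_1mil'].mean()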

One of the common things I do at this point is to simply dump the results to a CSV file I can look at, e.g. something like:

# Can chain together and save result to csv
pp_loans.groupby(x)[ys].describe().to_csv('aggstats.csv')

Or assign the result to a new object, as some python editors (such as Spyder) have a data browser to view it.

The final EDA part I use most often is simply sorting the dataset and looking at the head/tail.

# Can sort values and look at head/tail
pp_loans.sort_values(by=ys,ascending=False,inplace=True,ignore_index=True)
pp_loans[['BorrowerName'] + ys].head()

I often will export the head of the data, e.g. something like data.head().to_csv('checkdata.csv') to view all of the fields and their content as well when first figuring out fields, how often they are filled in, etc.

In the next part of the series, I will describe the pandas data manipulation operations I regularly use, e.g. getting a single column vs multiple columns, filtering, string manipulation functions, etc.

Getting census data over time

A former student recently asked about getting census data over time, in particular for smaller geographies like block groups. In my GIS course I teach students the manual way of downloading the data year-by-year from the FTP site. That is partially for pedagogical reasons though; I want students to realize the number of variables (there are so many) and how the data is stored by the census for the American Community Survey.

But the Census now has a web API where you can query the data directly. So if you are familiar with R or python programming, you can get the data in an easier fashion. You just need to know the years + census geographies + variables. I have notes on variables I often use for crim research, but going to the FTP site you can find the big documentation files or the excel templates.

I have honestly avoided these APIs in my workflows for several years, as my experience with the Census geocoding API was quite flaky, but I have not had the same problems with the APIs for querying the data. Here are examples in R (tidycensus library) and python (census library) of downloading several variables over the 2014-2019 span.

#############################
# R code
library(tidycensus)

# sign up for census key#
# https://api.census.gov/data/key_signup.html
census_api_key(key='????yourkeyhere????')

# place to store results and combine them
years <- 2014:2019
res <- vector("list",length(years))
names(res) <- years

# variables that you want
#        Tot Pop     White non-Hisp  FemHeadHouse  FamPoverty
vars <- c('B03001_001','B03002_003','B11003_016','B17010_002')

# loop over years, save data
# could also apply county filter, see help(get_acs)
# using smaller Delaware just for example
for (y in years){
    # download data
    ld <- as.data.frame(get_acs(year = y,
                                geography='cbg',
                                survey='acs5',
                                variables = vars,
                                state="DE"))
    # reshape long to wide
    ld2 <- reshape(ld,
                   idvar="GEOID",
                   timevar="variable",
                   direction="wide",
                   drop=c("NAME","moe"))
    # insert into list (index by name, not position) and add in year
    yc <- as.character(y)
    res[[yc]] <- ld2
    res[[yc]]$year <- y
}

# Combining the data frames together for final analysis
combo <- do.call("rbind",res)
head(combo) # can see B03001_001 is missing for block groups
summary(combo)
#############################

So in R you can ask for a variable, but if it is not available you will just get missing. So you need to make sure the variables you ask for are available over the time span.

The python census library will just straight up give you an error if the variable is not available. Also you need to specify the E/M suffix (estimate or margin of error), not just the base variable.

#############################
# Python code

from census import Census
import pandas as pd

key = '????yourkeyhere????'
c = Census(key)
# will get error with unknown variable
# need to specify E/M for estimate or margin of error
vars = ['B03002_003E','B11003_016E','B17010_002E']
res = []

for y in range(2014,2019+1):
    # '10' is Delaware, first '*' is county, second '*' is specific
    # geoid for a block group
    lk = c.acs5.state_county_blockgroup(vars, '10', "*", "*",year=y)
    ld = pd.DataFrame(lk)
    ld['year'] = y
    res.append(ld)

combo = pd.concat(res,axis=0)
combo.head()
#############################

(Initial post had an error not passing in year into the download function, now the two results are the same.)

For making reproducible scripts, instead of putting your API key into the code, a common way is to create a config file with the API key (don't upload the config file to github), and then read the config file into your script. (Another way is to use environment variables as secrets; I think the config file is easier for people to grok though.)
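As a sketch of that config file approach (the file name and key name here are just examples), using a simple JSON file:

# keys.json lives locally and is not checked into git, e.g.
# {"census_key": "????yourkeyhere????"}
import json

with open('keys.json') as f:
    config = json.load(f)

c = Census(config['census_key'])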

Another friend recently referred me to the requests-cache library. It is a good idea to only download the data once and then work with that local copy; there is no need to requery the API every time you update your code. The easiest approach is to have a special script that downloads the data and saves it (a database or csv files would work here), and then later scripts work with that local data.
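A minimal sketch of that download-once pattern (the file names here are arbitrary) is to split the work into two scripts:

# download_census.py - run once, hits the API and saves a local copy
# (assumes combo from the loop above)
combo.to_csv('acs_blockgroups_2014_2019.csv', index=False)

# analysis.py - later scripts just read the local file, no API calls
import pandas as pd
census_dat = pd.read_csv('acs_blockgroups_2014_2019.csv')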