Job advice for entry crime analysts

I post occasionally on the Crime Analysis Reddit, and a few recent posts I mentioned about expanding the net to private sector gigs for those interested in crime analysis. And got a question from a recent student as well, so figured a blog post on my advice is in order.

For students interested in crime analysis, it is standard advice to do an internship (while a student), and that gets you a good start on networking. But if that ship has sailed and you are now finished with school and need to get a job that does not help. Also standard to join the IACA (and if you have a local org, like TXLEAN for Texas, you can join that local org and get IACA membership at the same time). They have job boards for openings, and for local it is a good place to network as well for entry level folks. IACA has training material available as well.

Because there are not that many crime analysis jobs, I tell students to widen their net and apply to any job that lists “analyst” in the title. We hire many “business analysts” at Gainwell, and while having a background in healthcare is nice it is not necessary. They mostly do things in Excel, Powerpoint, and maybe some SQL. Probably more have a background in business than healthcare specifically. Feel free to take any background experience in the job description not as requirements but as “nice to have”.

These are pretty much the same data skills people use in crime analysis. So if you can do one you can do the other.

This advice is also true for individuals who are currently crime analysts and wish to pursue other jobs. Unfortunately because crime analysis is more niche in departments, there is not much upward mobility. Other larger organizations that have analysts will just by their nature have more senior positions to work towards over your career. Simultaneously you are likely to have a larger salary in the private sector than public sector for even the same entry level positions.

Don’t get the wrong impression on the technical skills needed for these jobs if you read my blog. Even more advanced data science jobs I am mostly writing python + SQL. I am not writing bespoke optimization functions very often. So in terms of skills for analyst positions I just suggest focusing on Excel. My crime analysis course materials I intentionally did in a way to get you a broad background that is relevant for other analyst positions as well (some SQL/Powerpoint, but mostly Excel).

Sometimes people like to think doing crime analysis is a public service, so look down on going to private sector. Plenty of analysts in banks/healthcare do fraud/waste/abuse that have just as large an impact on the public as do crime analysts, so I think this opinion is misguided in general.

Many jobs at Gainwell get less than 10 applicants. Even if these jobs have listed healthcare background requirements, if they don’t have options among the pool those doing the hiring will lower their expectations. I imagine it is the same for many companies. Just keep applying to analyst jobs and you will land something eventually.

I wish undergrad programs did a better job preparing social science students with tech skills. It is really just minor modifications – courses teaching Excel/SQL (maybe some coding for real go-getters). Better job at making stats relevant to the real world business applications (calculating expected values/variance and trends in those is a common task, doing null hypothesis significance testing is very rare). But you can level up on Excel with various online resources, my course included.

Using linear programming to assess spatial access

So one of the problems I have been thinking about at work is assessing spatial access to providers. Some common metrics are ‘distance to nearest’, or combining distance as well as total provider capacity into one metric, the two-step floating catchment method (2SFCA).

So imagine we are trying to evaluate the coverage of opioid treatment facilities relative to the spatial distribution of people abusing opioids. Say for example we have two areas, A and B. Area A has a facility that can treat 50 people, but has a total of 100 people who need service. Now Area B has no treatment provider, but only has 10 people that need service.

If you looked at distance to nearest provider, location B would look worse than A. So you may think to yourself – we should open up a facility in Area B. But if you look at the total number of people, as well as the capacity of the treatment provider in Area A, it would probably be a better investment to expand the capacity of treatment center A. It has more potential demand.

The 2SFCA partially captures this, but I am going to show a linear programming approach that can decompose the added benefit of adding capacity to a particular provider, or adding in potential new providers. I have the posted these functions on github, but can walk through the linear programming model, and how to assess potential policy interventions in that framework.

So first lets make just a simple set of data, we have 10 people with XY coordinates, and 2 service providers. The two service providers have capacity 3 and 5, so we cannot cover 2 of the people.

# Python example code
import numpy as np
import pandas as pd
import pulp

# locations of people
id = range(10)
px = [1,1,3,3,4,5,6,6,9,9]
py = [2,8,1,8,4,6,1,2,5,8]
peop = pd.DataFrame(zip(id,px,py),columns=['id','px','py'])

# locations of 2 providers & capacity 3/5
hid = [1,2]
hx = [1,8]
hy = [1,8]
hc = [3,5] # so cant cover 2 people
prov = pd.DataFrame(zip(hid,hx,hy,hc),columns=['hid','hx','hy','hc'])

Now for the subsequent model I show, it is going to assign the people to the nearest possible provider, given the distance constraints. To make the model feasible, we need to add in a slack provider that can soak up the remaining people not covered. We set this location to somewhere very far away, so the model will only assign people to this slack provider in the case no other options are available.

# add in a slack location to the providers
# very out of the way and large capacity
prov_slack = prov.iloc[[0],:].copy()
prov_slack['hid'] = 999
prov_slack['hx'] = 999
prov_slack['hy'] = 999
prov_slack['hc'] = peop.shape[0]
prov_add = pd.concat([prov,prov_slack],axis=0)

Now the decision model looks at all pairwise combinations between people and providers (potentially eliminating combinations that are too far away to be reasonable). So I do the cross join between the people/providers, and then calculate the Euclidean distance between them. I set the distance for the slack provider to a very high value (here 9999).

# distance between people/providers
peop['const'] = 1
prov_add['const'] = 1
cross = pd.merge(peop,prov_add,on='const')
cross['dist'] = np.sqrt( (cross['px']-cross['hx'])**2 + 
                         (cross['py']-cross['hy'])**2 )
# setting max distance for outside provider to 9999
cross.loc[cross.xs(999,level="hid",drop_level=False).index,"dist"] = 9999

Now we are ready to fit our linear programming model. First our objective is to minimize distances. (If all inputs are integers, you can use continuous decision variables and it will still return integer solutions.)

# now setting up linear program
# each person assigned to a location
# locations have capacity
# minimize distance traveled

# Minimize distances
P = pulp.LpProblem("MinDist",pulp.LpMinimize)

# each pair gets a decision variable
D = pulp.LpVariable.dicts("DA",cross.index.tolist(),
                          lowBound=0, upBound=1, cat=pulp.LpContinuous)

# Objective function based on distance
P += pulp.lpSum(D[i]*cross.loc[i,'dist'] for i in cross.index)

And we have two types of constraints, one is that each person is assigned a single provider in the end:

# Each person assigned to a single provider
for p in peop['id']:
    provl = cross.xs(p,0,drop_level=False).index
    P += pulp.lpSum(D[i] for i in provl) == 1, f"pers_{p}"

As a note later on, I will expand this model to include multiple people from a single source (e.g. count of people in a zipcode). For that expanded model, this constraint turns into pulp.lpSum(...) == tot where tot is the total people in a single area.

The second constraint is that providers have a capacity limit.

# Each provider capacity constraint
for h in prov_add['hid']:
    peopl = cross.xs(h,level=1,drop_level=False)
    pid = peopl.index
    cap = peopl['hc'].iloc[0] # should be a constant
    P += pulp.lpSum(D[i] for i in pid) <= cap, f"prov_{h}"

Now we can solve the model, and look at who was assigned to where:

# Solve the model
# CPLEX or CBC only ones I know of that return shadow
pulp.value(P.objective) #print objective 20024.33502494309

# Get the person distances
res_pick = []
for ph in cross.index:

cross['picked'] = res_pick
cross[cross['picked'] > 0.99]

So we can see two people were assigned the slack provider 999. Note that some people are not even assigned the closest – person 7 (6,2) is assigned to provider 2 at a distance of 6.3, it is only a distance of 5.1 away from provider 1. Because provider 1 has limited capacity though, they are assigned to provider 2 in the end.

In this framework, we can get the shadow price for the constraints, which says if we relax the constraint, how much it will improve our objective value.So if we add 1 capacity to provider 1, we will improve our objective by -9994.

# Get the shadow constraints per provider
o = [{'name':name, 'shadow price':c.pi, 'slack': c.slack} 
     for name, c in P.constraints.items()]
sc = pd.DataFrame(o)

I have helper functions at the github link above, so I don’t need to go through all of these motions again. You input your people matrix and the field names for the id, x, y, totn values.

And then you input your provider matrix with the field names for the providerid, x, y, prov_capacity (note this is not the matrix with the additional slack provider, my code adds that in automatically). The final two arguments limit the potential locations (e.g. here saying can’t assign a person to a provider over 12 distance away). And the last argument sets the super high distance penalty to people are not assigned.

# load in model functions
from assign_funcs import ProvAssign

# Const=1 is the total people per area
m1 = ProvAssign(peop,

# see the same objective as before

Now we can go ahead and up our capacity at provider 1 by 1, and see how the objective is reduced by -9994:

# if we up the provider capacity for
# prov by 1, the model objective goes 
# down by -9994
prov['hc'] = [4,5]

m2 = ProvAssign(peop,
m1.obj - m2.obj # 9994!

Like I said, I extended this to the scenario that you don’t have individual people, but have multiple counts of people in a spatial area. What can happen in this scenario is one source location can send people to multiple providers. Here you can see that source location 4 (total of 12 people), 5 were sent to provider 1, and 7 were sent to provider 2.

# Can make these multiple people, 100 total
peop['tot'] = [10,15,5,10,12,11,20,6,9,2]
prov['cap'] = [40,50] # should assign 10 people to slack

m3 = ProvAssign(peop,
# Can see the assignments from one
# source can spill over into multiple
# provider locations

So one of the things I like about this approach I already showed, we can do hypothetical scenarios ‘add capacity’ and see how it improves overall travel. Another potential intervention is to just place dummy 0 capacity providers over the study area, then look at the shadow constraints to see the best locations to open new facilities. Here I add in a potential provider at location (5,5).

# Can add in hypothetical providers with
# no capacity (make new providers)
# and check out the shadow
p2 = prov_slack.copy()
p2['hid'] = 10
p2['hx'] = 5
p2['hy'] = 5
p2['hc'] = 0
p2['cap'] = 0
prov_add = pd.concat([prov,p2],axis=0)

m4 = ProvAssign(peop,

# we now don't have source9 and prov1

Here in the code, I also have a function to limit potential assignments. Here if I set that limit to 10, it only just prevents id 9 (9,8) from being assigned to provider 1 (1,1), which is a distance of 10.6 away. This is useful with larger datasets, in which you may not be able to fit all of the pairwise distances into memory and model. (Although this model is pretty simple, you may be able to look at +1 million pairwise combos and solve in a reasonable time with open source solvers.)

Now we can go ahead and solve this model. I have a bunch of helper functions as well, so after solving we can check out the shadow price matrix:

# Solve m4 model

# can see the new provider 10
# if we added capacity would
# decrease distance travelled by -9996.2426

And based on this, it does look like we will have the best improvement by adding capacity at our hypothetical new provider 10, as oppossed to adding capacity at provider 1 or 2. Lets see what happens if you add 10 capacity to our hypothetical provider:

# Now lets add capacity for new provider
# by 10, should the objective go down
# by 10*-9996.2426 ?
prov_add['cap'] = [40,50,10]

m5 = ProvAssign(peop,

# Not quite -9996.2 but close!
(m5.obj - m4.obj)/10

We can see that the objective was not quite reduced by the expected, but is close. If it is not feasible to add a totally new provider, but simpler to give resources to expand current ones, we can see what will happen if we expand provider 1 by 10.

# we could add capacity
# to provider 1 by 10
# as well
prov_add['cap'] = [50,50,0]

m6 = ProvAssign(peop,

# Not quite -9993.4 but close!
(m6.obj - m4.obj)/10

In addition to doing different policy interventions in this approach, I provide different metrics to assess distance traveled and coverage. So going back to model 4, we can look at which areas have limited coverage. It only ends up that source area 1 has 10/15 people not covered. The rest of the areas are covered.

# Can look at sources and see stats
# for distance travelled as well
# as potential non-coverage

The fields are Trav is the average distance travelled for people who are covered (so could have high coverage but those people have to go a ways). Tot is the total number of people within that particular source area, and picked/not-covered are those assigned a provider/not assigned a provider (so go to the slack) respectively.

I additionally I have stats available rolled up to providers, mostly based on the not covered nearby. One way that the shadow doesn’t work, if you only have 10 people nearby, it doesn’t make sense to expand capacity to any more than 10 people. Here you can see the shadow price, as well as those not covered that are nearby to that provider.

# Can look at providers
# and see which ones would result in
# most reduced travel given increased coverage
# Want high NotCover and low shadow price

The other fields, hc is the listed capacity. Trav is the total travel for those not covered. The last field bisqw, is a way to only partially count not covered based on distance. It uses the bi-square kernel weight, based on the max distance field you provided in the model object. So if many people are nearby it will get a higher weight.

Here good add capacity locations are those with low shadow prices, and high notcovered/bisqw fields.

Note these won’t per se give the best potential places to add capacity. It may be the best new solution is to spread out new capacity – so here instead of adding 10 to a single provider, add a few to different providers.

You could technically figure that out by slicing out those leftover people in the m4.source_stats that have some not covered, adding in extra capacity to each provider, and then rerunning the algorithm and seeing where they end up.

Again I like this approach as it lets you articulate different policy proposals and see how that changes the coverage/distance travelled. Although no doubt there are other ways to formulate the problem that may make sense, such as maximizing coverage or making a coverage/distance bi-objective function.

Estimating Criminal Lifetime Value

At work I am currently working on estimating/forecasting healthcare spending. Similar to work I have done on forecasting person level crime risks (Wheeler et al., 2019), I build the predictive model dataset like this:

CrimeYear2020 PriorCrimeA PriorCrimeB
     0              2          3
     1              5          0
     0              0          0

etc. So I flatten people to a single row, and as covariates include prior cumulative crime histories. Most people do this similarly in the healthcare setting, so it looks like:

SpendingYear2020 PriorComorbidA PriorComorbidB
     3000              1          2
      500              3          0
    10000              0          0

Or sometimes people do a longitudinal dataset, where it is a spending*year*person panel (see Lauffenburger et al., 2020 for an example). I find this approach annoying for a few reasons though. One, it requires arbitrary temporal binning, which is somewhat problematic in these transaction level databases. We are talking for chronic offenders a few crimes per year is quite high, and ditto in the healthcare setting a few procedures a year can be very costly. So there is not much data to estimate the underlying function over time.

A second aspect I think is bad is that it doesn’t take into account the recency of the patterns. So the variables on the right hand side can be very old or very new. And with transaction level databases it is somewhat difficult to define how to estimate the lookback – do you consider it normalized by time? The VOID paper I mentioned we evaluated the long term scores, but the PD that does that chronic offender system has two scores – one a cumulative history and another a 90 day history to attempt to deal with that issue (again ad-hoc).

One approach to this issue from marketing research I have seen from very similar types of transactions databases are models to estimate Customer Lifetime Value (Fader et al. 2005). These models in the end generate a dataset that looks like this:

Person    RecentMonths  TotalEvents AveragePurchase
  A            3             5            $50
  B            1             2           $100
  C            9             8            $25

TotalEvents should be straightforward, RecentMonths just is a measure of the time since the last purchase, and then you have the average value of the purchases. And using just this data, estimates the probability of any future purchases, as well as projects the total value of the future average purchases. So here I use an example of this approach, using the Wolfgang Philly cohort public data. I am not going into the model more specifically (read some of the Bruce Hardie notes to get a flavor).

I have created some python code to follow along and apply these same customer lifetime value estimates to chronic offender data. Most examples of weighting crime harm apply it to spatial areas (Mitchell, 2019; Wheeler & Reuter, 2021), but you can apply it the same to chronic offender lists (Liggins et al., 2019).

Example Criminal Lifetime Value in Python

First, install the lifetimes python library – Cam’s documentation is excellent and makes the data manipulation/modelling quite simple.

Here I load in the transaction level crime data, e.g. it just have person A, 1/5/1960, 1000, where the 1000 is a crime seriousness index created by Wolfgang. Then the lifetimes package has some simple functions to turn our data into the frequency/recency format.

Note that for these models, you drop the first event in the series. To build a model to do train/test, I also split the data into evens before 1962, and use 1962 as the holdout test period.

import lifetimes as lt
import pandas as pd

# Just the columns from dataset II
# ID, SeriousScore, Date
df = pd.read_csv('PhilData.csv')
df['Date'] = pd.to_datetime(df['Date'])

# Creating the cumulative data
# Having holdout for one year in future
sd = lt.utils.calibration_and_holdout_data(df,'ID','Date',

# Only keeping people with 2+ events in prior period
sd = sd[sd['frequency_cal'] > 0].copy()

Recency_cal is how many months since a prior crime (starting in 1/1/1962), frequency is the total number of events (minus 1, so number of repeat events technically), and the monetary_value_cal here is the average of the crime seriousness across all the events. The way this function works, the variables with the subscript _cal are in the training period, and _holdout are events in the 1962 period. For subsequent models I subset out people with at least 2 events total in the modeling.

Now we can fit a model to estimate the predicted number of future crimes a person will commit – so this does not take into account the seriousness of those crimes. The final groupby statement shows the predicted number of crimes vs those actually committed, broken down by number of crimes in the training time period. You can see the model is quite well calibrated over the entire sample.

# fit BG model
bgf = lt.BetaGeoFitter(penalizer_coef=0)['frequency_cal'],sd['recency_cal'],sd['T_cal'])

# look at fit of BG model
t = 12
sd['pred_events'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, sd['frequency_cal'], sd['recency_cal'],sd['T_cal'])
sd.groupby('frequency_cal',as_index=False)[['frequency_holdout','pred_events']].sum() # reasonable

Now we can fit a model to estimate the average crime severity score for an individual as well. Then you can project a future cumulative score for an offender (here over a horizon of 1 year), by multiple the predicted number of events times the estimate of the average severity of the events, what I label as pv here:

# See conditional seriousness
sd['pred_ser'] = ggf.conditional_expected_average_profit(

sd['pv'] = sd['pred_ser']*sd['pred_events']
sd['cal_tot_val'] = sd['monetary_value_holdout']*sd['frequency_holdout']
# Not great correlation, around 0.2
vc = ['frequency_holdout','monetary_value_holdout','cal_tot_val','pred_events','pv']

The correlation between pv and the holdout cumulative crime severity cal_tot_val, is not great at 0.26. But lets look at this relative to the more typical approach analysts will do, simply rank prior offenders based on either total number of events or the crime seriousness:

# Lets look at this method via just ranking prior
# seriousness or frequency
sd['rank_freq'] = sd['frequency_cal'].rank(method='first',ascending=True)
sd['rank_seri'] = (sd['monetary_value_cal']*sd['frequency_cal']).rank(method='first',ascending=True)
vc += ['rank_freq','rank_seri']

So we can see that pv outperforms ranking based on total crimes (rank_freq), or ranking based on the cumulative serious score for offenders (rank_seri) in terms of the correlation for either the total number of future events or the cumulative crime harm.

If we look at capture rates, e.g. pretend we highlight the top 50 chronic offenders for intervention, we can see the criminal lifetime value pv estimate outperforms either simple ranking scheme by quite a bit:

# Look at capture rates by ranking
topn = 50
res_summ = []
for v in vc[-3:]:
    rank = sd[v].rank(method='first',ascending=False)
    locv = sd[rank <= topn].copy()
    tot_crimes = locv['frequency_holdout'].sum()
    tot_ser = locv['cal_tot_val'].sum()
    res_summ.append( [v,tot_crimes,tot_ser,topn] )

res_df = pd.DataFrame(res_summ,columns=['Var','TotCrimes','TotSer','TotN'])

In terms of the seriousness projection, it is reasonably well calibrated over the entire sample, but has a very tiny variance – it basically just predicts the average crime serious score over the sample and assigns that as the prediction going forward:

# Cumulative stats over sample reasonable
# variance much too small

So what this means is that if say Chicago READI wanted to do estimates to reasonably justify the max dollar cost for their program (over a large number of individuals) that would be reasonable. And this is how most marketing people use this info, average benefits of retaining a customer.

For individual projections though, e.g. I think OffenderB will generate between [low,high] crime harm in the next year, this is not quite up to par. I am hoping though to pursue these models further, maybe either in a machine learning/regression framework to estimate the parameters directly, or to use mixture models in an equivalent way that marketers use “segmentation” to identify different types of customers. Knowing the different way people have formulated models though is very helpful to be able to build a machine learning framework, which you can incorporate covariates.


Over 10 years of blogging

I just realized the other day that I have been blogging for over 10 years (I am old!) First hello world post post was back in December 2011.

I would recommend folks in academia/coding to at a minimum do a personal webpage. I use wordpress for my blog (did a free wordpress for quite a long time). WordPress is 0 code to make a personal page to host your CV.

I treat the blog as mostly my personal nerd journal, and blog about things I am working on or rants on occasion. I do not make revenue off of the blog directly, but in terms of getting me exposure it has given quite a few consulting leads over the years. As well as just given my academic work a much wider exposure.

So I always have a few things I want to blog about in the hopper. But always feel free to ask me anything (similar to how Andrew Gelman answers emails), and if I get a chance I will throw up a blog post in response.

Some pandas notes (part 1, EDA)

Teaching my son python at the moment – when I gave him a final project to do on his own, he complained and said he did not remember how to do one step from the prior lessons.

In the end, I am guessing I really only have maybe 30 commands in any language memorized, and then maybe another 70 I know of (but often need to look at the docs for its less often used arguments).

Here I figured it would be good for learners for me to put down my notes on pandas dataframes in python. At least the less than 20 commands I regularly use.

So first of I think a 3 part series, is using pandas for exploratory data analysis (EDA). Besides these tables I will show here, I often only use histograms and smooth conditional plots. Those I don’t even use very often.

Debated on writing a post that EDA is overrated, (I spend way more time understanding business objectives, then tuning a reasonably non-linear machine learning model and seeing its results dominates after that point). So it makes writing a post on EDA quite easy though – I can show a dozen pandas methods I regularly use and that is over 90% of the typical EDA work I do.

Here I will show an example using the PPP loan data available to download, in particular I will be using the over 150k loan data. A nice aspect of pandas is you can read in a CSV file directly from a url.

# Python code examples for EDA
import pandas as pd

# PPP loans, see full dataset
# at
url = r''

# Can read in from url directly
pp_loans = pd.read_csv(url)
pp_loans.shape # over 900k rows

At work I am often querying a database using SQLAlchemy instead of reading a csv file. But this is a nice example to show off – the data is too big to really work effectively in Excel, but pandas it should be no problem for most peoples machines.

So you can see the first thing I look at is data.shape to get the number of rows and columns. Next I often just print the columns using list(data):

# Can use list to see all of the variables

So you can access/modify the actual column names via data.columns, but this is a bit nicer printing.

Next I often summarize the data using the data.describe() method. I like transposing this result to be easier to read though.

# Transpose describe for easier reading

Here we get a bit of a mess of scientific notation (also this only shows summaries of float data by default, not categorical or dates). If you want to set a bit easier on the eyes printing, I typically have this pd.set_option() handy. Also you can scale the data, here I divide everything by 1000.

# If you dont want scientific notation
pd.set_option('display.float_format', '{:,.0f}'.format)

One of the things I like about this the most is the count, so you can see data that has missing values (that here should likely be set to 0 for the *_PROCEED variables). The second thing I look at are the min/max. Sometimes you can spot rogue missing values or censoring in the data. Then looking at the quartiles is typically informative as to how skewed the data is. Only then do I bother with the mean – which for binary data is reasonably informative.

You can see for this particular data, it is the loans over $150k – here this means apparently the CurrentApprovalAmount, as 150k to 10million (the apparent cap on the PPP loans). So this is $150k conditional on approval dataset.

The next step I am typically looking the distribution of various categorical data. For that I am often just using data['variable'].value_counts():

Value counts is maybe my favorite


This prints out all of the business types in this example, but with a very large number will limit to the head/tail. Here we can see Corps are the most common, with LLCs and s-corps as the next.

The next most common pandas method I use is data.groupby(), this produces summary aggregations conditional on different groups in the data. For example, here we can do the total sum of the loans (per million dollars) for each of the business types.

# Groupby with stats is next, per million
x = 'BusinessType'
y = 'CurrentApprovalAmount'

You can see that you can chain multiple pandas methods together, which is just a typical python object oriented pattern. So I tack on a .sort_values() to the end of the groupby. Note again these are in millions of dollars, so Coop’s got over a billion dollars of PPP loans in this data (it will be more overall, these again are only loans over $150k, the link at the beginning has a bunch of additional datasets with smaller loan amounts).

You can subsequently do additional functions in groupby, for just an example here I create my own normal distribution confidence interval of the mean for large samples.

# Can do more complicated functions
def ci95(x):
    mx = x.mean()
    sx = x.std()
    res = pd.DataFrame([[mx - 2*sx,mx + 2*sx]])
    res.columns = ['low','high']
    return res


I often use this for functions with binary data, but otherwise we can again chain methods together, so we can use data.groupby(x)[y].describe() to get those summaries for each of our groups. Also as an FYI you can pass in multiple y values to summarize via a list, although it gets a bit hairy in the output:

# And can have multiple outcome variables

One of the common things I do at this point is to simply dump the results to a CSV file I can lookat, e.g. something like:

# Can chain together and save result to csv

Or assign to a new object, and some python editors (such as Spyder) have a data browsers.

The final EDA part I use most often is simply sorting the dataset and looking at the head/tail.

# Can sort values and look at head/tail
pp_loans[['BorrowerName'] + ys].head()

I often will export the head of the data, e.g. something like data.head().to_csv('checkdata.csv') to view all of the fields and their content as well when first figuring out fields, how often they are filled in, etc.

Next part in the series, I will describe more regular pandas data manipulation operations I regularly use, e.g. getting a single column vs multiple columns, filtering, string manipulation functions, etc.

Staggered Treatment Effect DiD count models

So I have been dealing with various staggered treatments for difference-in-difference (DiD) designs for crime data analysis on how interventions reduce crime. I’ve written about in the past mine and Jerry’s WDD estimator (Wheeler & Ratcliffe, 2018), as well as David Wilson’s ORR estimator (Wilson, 2022).

There has been quite a bit of work in econometrics recently describing how the traditional way to apply this design to staggered treatments using two-way fixed effects can be misleading, see Baker et al. (2022) for human readable overview.

The main idea is that in the scenario where you have treatment heterogeneity (TH from here on) (either over time or over units), the two-way fixed effects estimator is a weird average that can misbehave. Here are just some notes of mine though on fitting the fully saturated model, and using post-hoc contrasts (in R) to look at that TH as well as to estimate more reasonable average treatment effects.

So first, we can trick R to use glm to get my WDD estimator (or of course Wilson’s ORR estimator) for the DiD effect with count data. Here is a simple example from my prior blog post:

# R code for DiD model of count data
count <- c(50,30,60,55)
post <- c(0,1,0,1)
treat <- c(1,1,0,0)

df <- data.frame(count,post,treat)

# Wilson ORR estimate
m1 <- glm(count ~ post + treat + post*treat,data=df,family="poisson")

And here is the WDD estimate using glm passing in family=poisson(link="identity"):

m2 <- glm(count ~ post + treat + post*treat,data=df,

And we can see this is the same as my WDD in the ptools package:

library(ptools) # via

Using glm will be more convenient than me scrubbing up all the correct weights, as I’ve done in the past examples (such as temporal weights and different area sizes). It is probably the case you can use different offsets in regression to accomplish similar things, but for this post just focusing on extending the WDD to varying treatment timing.

Varying Treatment Effects

So the above scenario is a simple pre/post with only one treated unit. But imagine we have two treated units and three time periods. This is very common in real life data where you roll out some intervention to more and more areas over time.

So imagine we have a set of crime data, G1 is rolled out first, so the treatment is turned on for periods One & Two, G2 is rolled out later, and so the treatment is only turned on for period Two.

Period    Control     G1     G2
Base          50      70     40
One           60      70     50
Two           70      80     50

I have intentionally created this example so the average treatment effect per period per unit is 10 crimes. So no TH. Here is the R code to show off the typical default two-way fixed effects model, where we just have a dummy variable for unit+timeperiods that are treated.

# Examples with Staggered Treatments
df <- read.table(header=TRUE,text = "
 Period    Control     G1     G2
 Base          50      70     40
 One           60      70     50
 Two           70      80     50

# reshape wide to long
nvars <- c("Control","G1","G2")
dfl <- reshape(df,direction="long",

dfl$Unit <- as.factor(dfl$Unit)
names(dfl)[3] <- 'Crimes'

# How to set up design matrix appropriately?
dfl$PostTreat <- c(0,0,0,0,1,1,0,0,1)

m1 <- glm(Crimes ~ PostTreat + Unit + Period,

summary(m1) # TWFE, correct point estimate

The PostTreat variable is the one we are interested in, and we can see that we have the correct -10 estimate as we expected.

OK, so lets create some treatment heterogeneity, here now G1 has no effects, and only G2 treatment works.

dfl[dfl$Unit == 2,'Crimes'] <- c(70,80,90)

m2 <- glm(Crimes ~ PostTreat + Unit + Period,

summary(m2) # TWFE, estimate -5.29, what?

So you may naively think that this should be something like -5 (average effect of G1 + G2), or -3.33 (G1 gets a higher weight since it is turned on for the 2 periods, whereas G2 is only turned on for 1). But nope rope, we get -5.529.

We can estimate the effects of G1 and G2 seperately though in the regression equation:

# Lets seperate out the two units effects
dfl$pt1 <- 1*(dfl$Unit == 2)*dfl$PostTreat
dfl$pt2 <- 1*(dfl$Unit == 3)*dfl$PostTreat

m3 <- glm(Crimes ~ pt1 + pt2 + Unit + Period,

summary(m3) # Now we get the correct estimates

And now we can see that as expected, the effect for G2 is the pt2 coefficient, which is -10. And the effect for G1, the pt1 coefficient, is only floating point error different than 0.

To then get a cumulative crime reduction effect for all of the areas, we can use the multcomp library and the glht function and construct the correct contrast matrix. Here the G1 effect gets turned on for 2 periods, and the G2 effect is only turned on for 1 period.

cont <- matrix(c(0,2,1,0,0,0,0),1)
cumtreat <- glht(m3,cont) # correct cumulative

And if we want an ‘average treatment effect per unit and per period’, we just change the weights in the contrast matrix:

atreat <- glht(m3,cont/3) # correct average over 3 periods

And this gets us our -3.33 that is a more reasonable average treatment effect. Although you would almost surely just focus on that the G2 area intervention worked and the G1 area did not.

You can also fit this model alittle bit easier using R’s style formula instead of rolling your own dummy variables via the formula Crimes ~ PostTreat:Unit + Unit + Period:

But, glht does not like it when you have dropped levels in these interactions, so I don’t do this approach directly later on, but construct the model matrix and drop non-varying columns.

Next lets redo the data again, and now have time varying treatments. Now only period 2 is effective, but it is effective across both the G1 and G2 locations. Here is how I construct the model matrix, and what the resulting sets of dummy variables looks like:

# Time Varying Effects
# only period 2 has an effect

dfl[dfl$Unit == 2,'Crimes'] <- c(70,80,80)

# Some bookkeeping to make the correct model matrix
mm <- -1 + PostTreat:Period + Unit + Period, dfl))
mm <- mm[,names(mm)[colSums(mm) > 0]] # dropping zero columns
names(mm) <- gsub(":","_",names(mm))  # replacing colon
mm$Crimes <- dfl$Crimes

Now we can go ahead and fit the model without the intercept.

# Now can fit the model
m6 <- glm(Crimes ~ . -1,


And you can see we estimate the correct effects here, PostTreat_PeriodOne has a zero estimate, and PostTreat_PeriodTwo has a -10 estimate. And now our cumulative crimes reduced estimate -20

cumtreat2 <- glht(m6,"1*PostTreat_PeriodOne + 2*PostTreat_PeriodTwo=0")

And if we did the average, it would be -6.66.

Now for the finale – we can estimate the saturated model with time-and-unit varying treatment effects. Here is what the design matrix looks like, just a bunch of columns with a single 1 turned on:

# Now for the whole shebang, unit and period effects
mm2 <- -1 + Unit:PostTreat:Period + Unit + Period, dfl))
mm2 <- mm2[,names(mm2)[colSums(mm2) > 0]] # dropping zero columns
names(mm2) <- gsub(":","_",names(mm2))  # replacing colon
mm2$Crimes <- dfl$Crimes

And then we can fit the model the same way:

m7 <- glm(Crimes ~ . -1,

summary(m7) # Now we get the correct estimates

And you can see our -10 estimate for Unit2_PostTreat_PeriodTwo and Unit3_PostTreat_PeriodTwo as expected. You can probably figure out how to get the cumulative or the average treatment effects at this point:

tstr <- "Unit2_PostTreat_PeriodOne + Unit2_PostTreat_PeriodTwo + Unit3_PostTreat_PeriodTwo = 0"
cumtreat3 <- glht(m7,tstr)

We can also use this same framework to get a unit and time varying estimate for Wilson’s ORR estimator, just using family=poisson with its default log link function:

m8 <- glm(Crimes ~ . -1,


It probably does not make sense to do a cumulative treatment effect in this framework, but I think an average is OK:

avtreatorr <- glht(m8,
  "1/3*Unit2_PostTreat_PeriodOne + 1/3*Unit2_PostTreat_PeriodTwo + 1/3*Unit3_PostTreat_PeriodTwo = 0")

So the average linear coefficient is -0.1386, and if we exponentiate that we have an IRR of 0.87, so on average when a treatment occurred in this data a 13% reduction. (But beware, I intentionally created this data so the parallel trends for the DiD analysis were linear, not logarithmic).

Note if you are wondering about robust estimators, Wilson suggests using quasipoisson, e.g. glm(Crimes ~ . -1,family="quasipoisson",data=mm2), which works just fine for this data. The quasipoisson or other robust estimators though return 0 standard errors for the saturated family=poisson(link="identity") or family=quasipoisson(link="identity").

E.g. doing

cumtreat_rob <- glht(m7,tstr,vcov=vcovHC,type="HC0")

Or just looking at robust coefficients in general:


Returns 0 standard errors. I am thinking with the saturated model and my WDD estimate, you get the issue with robust standard errors described in Mostly Harmless Econometrics (Angrist & Pischke, 2008), that they misbehave in small samples. So I am a bit hesitant to suggest them without more work to establish they behave the way they should in smaller samples.


  • Angrist, J.D., & Pischke, J.S. (2008). Mostly Harmless Econometrics. Princeton University Press.
  • Baker, A.C., Larcker, D.F., & Wang, C.C. (2022). How much should we trust staggered difference-in-differences estimates? Journal of Financial Economics, 144(2), 370-395.
  • Wheeler, A.P., & Ratcliffe, J.H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 1-9.
  • Wilson, D.B. (2022). The relative incident rate ratio effect size for count-based impact evaluations: When an odds ratio is not an odds ratio. Journal of Quantitative Criminology, 38(2), 323-341.

Getting census data over time

A former student recently asked about getting census data over time, in particular for smaller geographies like block groups. My GIS course I teach students the manual way of downloading data year-by-year from the FTP site. That is partially for pedagogical reasons though, I want students to realize the number of variables (there are so many) and how the data is stored by the census for the American Community Survey.

But Census now has a web api, where you can query the data. So if you are familiar with R or python programming, you can get the data in a bit easier fashion. You just need to know the years + census geographies + variables. I have notes on variables I often use for crim research, but going to the FTP site you can find the big documents or the excel templates.

I have honestly avoided these APIs in my workflows for several years, as my experience with the Census geocoding API was quite flaky, but I have not had the same problems with the APIs for querying the data. Here are examples in R (tidycensus library) and python (census library) of downloading several variables over the 2014-2019 span.

# R code

# sign up for census key#

# place to store results and combine them
years <- 2014:2019
res <- vector("list",length(years))
names(res) <- years

# variables that you want
#        Tot Pop     White non-Hisp  FemHeadHouse  FamPoverty
vars <- c('B03001_001','B03002_003','B11003_016','B17010_002')

# loop over years, save data
# could also apply county filter, see help(get_acs)
# using smaller Deleware just for example
for (y in years){
    # download data
    ld <- = y,
                                variables = vars,
    # reshape long to wide
    ld2 <- reshape(ld,
    # insert into list and add in year
    res[[y]] <- ld2
    res[[y]]$year <- y

# Combining the data frames together for final analysis
combo <-"rbind",res)
head(combo) # can see B03001_001 is missing for block groups

So in R you can ask for a variable, but if it is not available you will just get missing. So you need to make sure the variables you ask for are available over the time span.

The python census library will just straight up give you an error if the variable is not available. Also you need to specify E/M estimates, not just the base variable.

# Python code

from census import Census
import pandas as pd

key = '????yourkeyhere????'
c = Census(key)
# will get error with unknown variable
# need to specify E/M for estimate or margin of error
vars = ['B03002_003E','B11003_016E','B17010_002E']
res = []

for y in range(2014,2019+1):
    # '10' is Delaware, first '*' is county, second '*' is specific
    # geoid for a block group
    lk = c.acs5.state_county_blockgroup(vars, '10', "*", "*",year=y)
    ld = pd.DataFrame(lk)
    ld['year'] = y

combo = pd.concat(res,axis=0)

(Initial post had an error not passing in year into the download function, now the two results are the same.)

For making reproducible scripts, instead of putting your API key into the code, a common way is to create a config file with the API key (don’t upload the config file to github), and then read in the config file into your script. (Another way is to use environment variables as secrets, I think the config is easier for people to grok though.)

Another friend recently referred me to requests-cache library. It is a good idea to only download the data locally once, then use that local data. No need to requery the data every time you update your code. Easiest approach is to just have a special script to download the data and save it (in a database or csv files would work here), and then later scripts work with that local data.

The limit on the cost efficiency of gun violence interventions

Imagine a scenario where someone came out with technology that would 100% reduce traffic fatalities at a particular curve in a road. But, installation and maintenance of the tech would cost $36 million dollars per 100 feet per year. It is unlikely anyone would invest in such technology – perhaps if you had a very short stretch of road that resulted in a fatality on average once a month it would be worth it. In that case, the tech would result in $36/12 = $3 million dollars to ‘save a life’.

There are unlikely any stretches of roads that have this high of fatality rate though (and this does not consider potential opportunity costs of less effective but cheaper other interventions). So if we had a location that has a fatality once a year, we are then paying $36 million dollars to save one life. We ultimately have upper limits on what society will pay to save a life.

Working on gun violence prevention is very similar. While gun violence has potentially very large costs to society, see Everytown’s estimates of $50k to a nonfatal shooting and $270k for a fatality, preventing that gun violence is another matter.

The translation to gun violence interventions from the traffic scenario is ‘we don’t have people at super high risk of gun violence’ and ‘the interventions are not going to be 100% effective’.

My motivation to write this post is the READI intervention in Chicago, which has a price tag of around $60k per participant per 20 months. What makes this program then ‘worth it’ is the probability of entrants being involved with gun violence multiplied by the efficacy of the program.

Based on other work I have done on predicting gun violence (Wheeler et al., 2019b), I guesstimate that any gun violence predictive instrument spread over a large number of individuals will have at best positive predictive probabilities of 10% over a year. 10% risk of being involved in gun violence is incredibly high, a typical person will have something more on the order of 0.01% to 0.001% risk of being involved with gun violence. So what this means is if you have a group of 100 high risk people, I would expect ~10 of them to be involved in a shooting (either as a victim or offender).

This lines up almost perfectly with READI, which in the control group had 10% shot over 20 months. So I think READI actual did a very good job of referring high risk individuals to the program. I don’t think they could do any better of a job in referring even higher risk people.

This though implies that even with 100% efficacy (i.e. anyone who is in READI goes to 0% risk of involvement in gun violence), you need to treat ~10 people to prevent ~1 shooting victimization. 100% efficacy is not realistic, so lets go with 50% efficacy (which would still be really good for a crime prevention program, and is probably way optimistic given the null results). Subsequently this implies you need to treat ~20 people to prevent ~1 shooting. This results in a price tag of $1.2 million to prevent 1 shooting victimization. If we only count the price of proximal gun violence (as per the Everytown estimates earlier), READI is already cost-inefficient from the get go – a 100% efficacy you would still need around 10 people (so $600k) to reduce a single shooting.

The Chicago Crime Lab uses estimates from Cohen & Piquero (2009) to say that READI has a return on investment of 3:1, so per $60k saves around $180. These however count reductions over the life-course, including person lost productivity, not just state/victim costs, which I think are likely to be quite optimistic for ROI that people care about. (Productivity estimates always seem suspect to me, models I have put into production in my career have generated over 8 digits of revenue, but if I did not do that work someone else would have. I am replaceable.)

I think it is likely one can identify other, more cost effective programs to reduce gun violence compared to READI. READI has several components, part of which is a caseworker, cognitive behavior therapy (CBT), and a jobs program. I do not know cost breakdowns for each, but it may be some parts drive up the price without much benefit over the others.

I am not as much on the CBT bandwagon as others (I think it looks quite a bit like the other pysch research that has come into question more recently), but I think caseworkers are a good idea. The police department I worked with on the VOID paper had caseworkers as part of their intervention, as did focused deterrence programs I have been involved with (Wheeler et al., 2019a). Wes Skogan even discussed how caseworkers were part of Chicago CEASEFIRE/outreach workers on Jerry Ratcliffe’s podcast. For those not familiar, case workers are just social workers assigned to these high risk individuals, and they often help their charges with things like getting an ID/Drivers License and applying to jobs. So just an intervention of caseworkers assigned to high risk people I think is called for.

You may think many of these high risk individuals are not amenable to treatment, but my experience is a non-trivial number of them are willing to sit down and try to straighten their lives out, and they need help to do that it. Those are people case workers are a good potential solution.

Although I am a proponent of hot spots policing as well, if we are just talking about shootings, I don’t think hot spots will have a good return on investment either (Drake et al., 2022). Only if you widen the net to other crimes do a think hot spots makes sense (Wheeler & Reuter, 2021). And maybe here I am being too harsh, if you reduce other criminal behavior READIs cost-benefit ratio likely looks better. But just considering gun violence, I think dropping $60k per person is never going to be worth it in realistic high gun violence risk populations.


Building wheel files in github actions

At work we are using a new databricks environment (claims based pop health related models). Databricks is very nice as a data querying environment, but it is challenging building well vetted code libraries in python. See the blog post Please don’t make me use databricks notebooks for an overview of the issues. (Other environments that make you write in notebooks, such as Apache Zeppelin, have pretty much all the same limitations.)

So we are still working out the design pattern for how to best write well vetted code. It is looking a bit like this workflow by menziess, I have been able to get dbconnect (and databricks-sql), installed on local windows machines. From there I can do all the usual junk – linting pre-commits, writing unit tests, etc. on my local machine. Then I push, and can do some final checks (or run a real life pipeline), in the databricks GUI environment.

One difference though is instead of doing Azure pipelines to build the wheel files, I am using Github Actions. To share I use my retenmod package as an example. The github action is pretty straightforward, and uses the same trick to push inside the action as I wrote about previously.

So here is the action code in-situ, but I can copy-paste the workflow right here in the blog to illustrate the yaml:

# Github actions to build
# and push wheel files
      - main
      - master

    runs-on: ubuntu-latest
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          #pip install .
          find ./dist/*.whl | xargs pip install
      - name: Configure Git
        run: |
          git config --global ""
          git config --global "apwheele"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push

And then in your databricks notebooks, you can then have a locally scoped environment, so can have:

%pip install ./dist/libname.whl

At the front of your notebook. And then in a code cell, can then do:

import retenmod as rm
# do whatever rm functions from the library

Just like any normal python package. There are a few potential gotchas here. 1) I will need to write a python script to also edit libname.whl in the data pipelines whenever I update versions (my unix grep/sed fu is not up to task to grep out whl files). But that should be as simple as calling python inside the github action, and then amending the git add . to scoop up the edited files.

A second part is that with work repos, pushing inside the action is a bit trickier, so we need to work with personal access tokens/actions secrets and set the remote url for the push. It is tough for me to illustrate that with public repos though, so will have to wait until another blog post.

Home buying and collective efficacy

With the recent large appreciation in home values, around 20% in the prior year, there have been an increase in private investors purchasing homes to rent out. Recent stories on this by Tyler Dukes and colleagues have collated open parcel data to identify the scope of these companies across all of North Carolina.

For bit of background, I tried to purchase a home in Plano, TX early 2018. Homes in our price range at that time were going in a single day and typically a few thousand over asking price.

Fast forward to early 2021, I am full remote data scientist instead of a professor, and kiddo is in online school. Even with the pay bump, housing competition was even worse in Plano at this point, so we knew we were likely going to have to move school districts to be able to purchase a home. So we decided to strike out, and ended up looking around Raleigh. Ended up quite quickly deciding to purchase a new build home in the suburb of Clayton (totally recommend our realtor, Ellen Pitts, her crew did quite a bit of work for us remotely).

I was lucky to get in then it appears – many of the new developments in the area are being heavily scooped up by these equity firms (and rent would be ~$600 more for my home than the mortgage). So I downloaded the public data Dukes put together, and loaded it into Excel to make a quick map of the properties.

For a NC state view, we have big clusters in Charlotte, Greensboro and Raleigh:

We can zoom in, and here is an overview of triangle area:

So you can see that inside the loop in Raleigh is pretty sparse, but many of the newer developments on the east side have many more of the private firm purchased houses. Charlotte is much more infilled with these private firms purchasing properties.

Zooming in even further to my town of Clayton, there is quite a bit of variance in the proportion of private vs residential purchases across various developments. My development is less than 50% of these purchases, several developments though appear almost 100% private purchased though. (This is not my home/neighborhood FYI.)

So what does this have to do with collective efficacy? Traditionally areas with higher home ownership have been associated with lower rates of crime. For not criminologists reading my blog, one of the most prominent criminological theories is that state actions only move the needle slightly on increasing/decreasing crime, people enforcing social norms is a bigger factor that explains high crime vs low crime areas. Places with people churning out more frequently – which occurs in areas with more renters – tend to have fewer people effectively keeping the peace. Because social scientists love to make up words, we call this concept collective efficacy.

Downloading and looking at this data, while I was mostly just interested in zooming into my neighborhood and seeing the infill of renters, sparked a criminological hypothesis: I expect neighborhoods with higher rates of private equity purchased housing in the long run to have higher rates of criminal behavior.

This hypothesis will be difficult to test in the wild. It is partially confounded with capital – those who buy their homes accumulate more wealth over time (again mortgage is quite a bit cheaper than rent, so even ignoring home value appreciation this is true). But the variance in the number of homes purchased by private equity firms in different areas makes me wonder if there is enough variation to do a reasonable research design to test my hypothesis, especially in the Charlotte area in say two or three years post a development being finished.