Simulating data with numpy and scipy

I’ve answered a few different questions on forums recently about simulating data:

I figured it would be worth my time to put some of my notes down in a blog post. It is not just academic, not too infrequently I use simulations at work to generate estimated return on investment estimates for putting a machine learning model into production. For social scientists this is pretty much equivalent to doing cost/benefit policy simulations.

Simulations are also useful for testing the behavior of different statistical tests, see prior examples on my blog for mixture trajectories or random runs.

Here I will be showing how to mostly use the python libraries numpy and scipy to do various simulation tasks. For some upfront (note I set the seed in numpy, important for reproducing your simulations):

import itertools as it
import numpy as np
import pandas as pd
np.random.seed(10)

# total simulations for different examples
n = 10000

# helper function to pretty print values
def pu(x,ax=1):
  te1,te2 = np.unique(x.sum(axis=ax),return_counts=True)
  te3 = 100*te2/te2.sum()
  res = pd.DataFrame(zip(te1,te2,te3),columns=['Unique','Count','Percent'])
  return res

Sampling from discrete choice sets

First, in the linked data science post, the individual had a very idiosyncratic set of simulations. Simulate a binary vector of length 10, that had 2,3,4 or 5 ones in that vector, e.g. [1,1,0,0,0,0,0,0,0,0] is a valid solution, but [0,0,0,0,1,0,0,0,0,0] is not, since the latter only has 1 1. One way to approach problems like these is to realize that the valid outcomes are a finite number of discrete solutions. Here I use itertools to generate all of the possible permutations (which can easily fit into memory, only 627). Then I sample from that set of 627 possibilities:

# Lets create the whole sets of possible permutation lists
res = []
zr = np.zeros(10)
for i in range(2,6):
    for p in it.combinations(range(10),i):
        on = zr.copy()
        on[list(p)] = 1
        res.append(on.copy())

resnp = np.stack(res,axis=0)

# Now lets sample 1000 from this list
total_perms = resnp.shape[0]
samp = np.random.choice(total_perms,n)
res_samp = resnp[samp]

# Check to make sure this is OK
pu(res_samp)

Ultimately you want the simulation to represent reality. So pretend this scenario was we randomly pick a number out of the set {2,3,4,5}, and then randomly distribute the 1s in that length 10 vector. In that case, this sampling procedure does not reflect reality, because 2’s have fewer potential permutations than do 5’s. You can see this in the simulated proportions of rows with 2 (7.25%) vs rows with 5 (39.94%) in the above simulation.

We can fix that though by adjusting the sampling probabilities from the big set of 627 possibilities though. Pretty much all of the np.random methods an option to specify different marginal probabilities, where in choice it defaults to equal probability.

# If you want the different subsets to have equal proportions
sum_pop = resnp.sum(axis=1)
tot_pop = np.unique(sum_pop,return_counts=True)
equal_prop = 1/tot_pop[1].shape[0]
pop_prob = pd.Series(equal_prop/tot_pop[1],index=tot_pop[0])
long_prob = pop_prob[sum_pop]

samp_equal = np.random.choice(total_perms,n,p=long_prob)
res_samp_equal = resnp[samp_equal]
pu(res_samp_equal)

So now we can see that each of those sets of results have similar marginal proportions in the simulation.

You can often figure out exact distributions for your simulations, for an example of similar discrete statistics, see my work on small sample Benford tests. But I often use simulations to check my math even if I do know how to figure out the exact distribution.

Another trick that I commonly use in other applications that don’t have access to something equivalent to np.random.choice, but most applications have a random uniform generator. Basically you can generate random numbers on whatever interval, chunk those up into bits, and turn those bits into your resulting categories. This is what I did in that SPSS post at the beginning.

unif = np.floor(np.random.uniform(1,33,(32*n,1)))/2
pu(unif)

But again this is not strictly necessary in python because we can generate the set/list of items and sample from that directly.

# Same as using random choice from that set of values
half_vals = np.arange(1,33)/2
unif2 = np.random.choice(half_vals,(32*n,1))
pu(unif2)

But if limited in your libraries that is a good trick to know. Note that most random number generators operate over (0,1), so open intervals and will never generate an exact 0 or 1. To get a continuous distribution over whatever range, you can just multiply the 0/1 random number generator (and subtract if you need negative values) to match the range you want. But again most programs let you input min/max in a uniform number generator.

Sampling different stat distributions

So real data often can be reasonably approximated via continuous distributions. If you want to generate different continuous distribtions with approximate correlations, one approach is to:

  • generate multi-variate standard normal data (mean 0, variance 1, and correlations between those variables)
  • turn that matrix into a standard uniform via the CDF function
  • then for each column, use the inverse CDF function for the distribution of choice

Here is an example generating a uniform 0/1 and a Poisson with a mean of 3 variable using this approach.

from scipy.stats import norm, poisson

# Here generate multivariate standard normal with correlation 0.2
# then transform both to uniform
# Then transform 2nd column to Poisson mean 3

mu = [0,0]
cv = [[1,0.2],[0.2,1]] #needs to be symmetric
mv = np.random.multivariate_normal([0,0],cov=cv,size=n)
umv = pd.DataFrame(norm.cdf(mv),columns=['Uniform','Poisson'])
umv['Poisson'] = poisson.ppf(umv['Poisson'],3)
print(umv.describe())

This of course doesn’t guarantee the transformed data has the exact same correlation as the original specified multi-variate normal. (If interested in more complicated scenarios, it will probably make sense to check out copulas.)

Like I mentioned in the beginning, I am often dealing with processes of multiple continuous model predictions. E.g. one model that predicts a binary ‘this claim is bad’, and then a second model predicting ‘how bad is this claim in terms of $$$’. So chaining together simulations of complicated data (which can be mixtures of different things) can be useful to see overall behavior of a hypothetical system.

Here I chain together a predicted probability for 3 claims (20%,50%,90%), and mean/standard deviations of (1000,100),(100,20),(50,3). So pretend we can choose 1 of these three claims to audit. We have the predicted probability that the claim is wrong, as well as an estimate of how much money we would make if the claim is wrong (how wrong is the dollar value).

The higher dollar values have higher variances, so do you pick the safe one, or do you pick the more risky audit with higher upside. We can do a simulation to see the overall distribution:

pred_probs = np.array([0.2,0.5,0.9])
bin_out = np.random.binomial(n=1,p=pred_probs,size=(n,3))
print( bin_out.mean(axis=0) )

# Pretend have both predicted prob
# and distribution of values
# can do a simulation of both

val_pred = np.array([1000,100,50])
val_std = np.array([100,20,3])
val_sim = np.random.normal(val_pred,val_std,size=(n,3))
print(val_sim.mean(axis=0))
print(val_sim.std(axis=0))

revenue = val_sim*bin_out
print(revenue.mean(axis=0))
print(revenue.std(axis=0))

I could see a reasonably risk averse person picking the lowest dollar claim here to audit. Like I said in the discrete case, we can often figure out exact distributions. Here the expected value is easy, prob*val, the standard deviation is alittle more tricky to calculate in your head (it is a mixture of a spike at 0 and then the rest of the typical distribution):

# Theoretical mean/variance
expected = pred_probs*val_pred
low_var = (expected**2)*(1-pred_probs)
hig_var = ((val_pred - expected)**2)*pred_probs
std_exp = np.sqrt(low_var + hig_var)

But it still may be useful to do the simulation here anyway, as the distribution is lumpy (so just looking at mean/variance may be misleading).

Other common continuous distributions I use are beta, to simulate between particular endpoints but not have them uniform. And negative binomial is a good fit to not only many count distributions, but things that are more hypothetically continuous (e.g. for distances instead of gamma, or for money instead of log-normal).

Here is an example generating a beta distribution between 0/1, with more of the distribution towards 0 than 1:

# mean probability 0.2
a,b = 1,5
beta_prob = beta.rvs(a, b, size=n)
plt.hist(beta_prob, bins=100,density=True,label='Sim')
# theoretical PDF
x = np.arange(0,1.01,0.01)
plt.plot(x,beta.pdf(x,a,b),label=f'beta({a},{b}) PDF')
plt.legend()
plt.show()

For beta, the parameters a,b, the mean is a/b. To make the distribution more spread out, you have larger overall values of a/b (I don’t have the conversion to variance memorized offhand). But if you have real data, you can plop that into scipy to fit the parameters, here I fix the location and scale parameters.

# If you want to fit beta based on observed data
fitbeta = beta.fit(beta_prob,floc=0,fscale=1)

We can do similar for negative binomial, I often think of these in terms of regression dispersion parameters, and I have previously written about translating mean/dispersion to n/p notation:

# Ditto for negative binomial
mean_nb = 2
disp_nb = 4

def trans_np(mu,disp):
    x = mu**2/(1 - mu + disp*mu)
    p = x/(x + mu)
    n = mu*p/(1-p)
    return n,p

nb_n,nb_p = trans_np(mean_nb,disp_nb)
nb_sim = nbinom.rvs(nb_n,nb_p,size=(n,1))
nb_bars = pu(nb_sim)
plt.bar(nb_bars['Unique'],nb_bars['Percent'],label='Sim')

x = np.arange(0,nb_sim.max()+1)
plt.plot(x,nbinom.pmf(x,nb_n,nb_p)*100,linestyle='None',
         marker='o',markerfacecolor='k',ms=7,
         label=f'PMF Negative Binomial')
plt.legend()
plt.show()

Downloading geo files from Census FTP using python

I was working with some health data that only has MSA identifiers the other day. Not many people seem to know about the US Census’s FTP data site. Over the years they have had various terrible GUI’s to download data, but I almost always just go to the FTP site directly.

For geo data, check out https://www2.census.gov/geo/tiger/TIGER2019/ for example. Python for pandas/geopandas also has the nicety that you can point to a url (even a url of a zip file), and load in the data in memory. So to get the MSA areas was very simple:

# Example download MSA
import geopandas as gpd
from matplotlib import pyplot as plt

url_msa = r'https://www2.census.gov/geo/tiger/TIGER2019/CBSA/tl_2019_us_cbsa.zip'
msa = gpd.read_file(url_msa)
msa.plot()
plt.show()

Sometimes the census has files spread across multiple states. So here is an example of doing some simple scraping to get all of the census tracts in the US. You can combine the geopandas dataframes the same as pandas dataframes using pd.concat:

# Example scraping all of the zip urls on a page
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests

def get_zip(url):
    front_page = requests.get(url,verify=False)
    soup = BeautifulSoup(front_page.content,'html.parser')
    zf = soup.find_all("a",href=re.compile(r"zip"))
    # Maybe should use href 
    zl = [os.path.join(url,i['href']) for i in zf]
    return zl

base_url = r'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/'
res = get_zip(base_url)

geo_tract = []
for surl in res:
    geo_tract.append(gpd.read_file(surl))

geo_full = pd.concat(geo_tract)

# See State FIPS codes
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

geo_full[geo_full['STATEFP'] == '01'].plot()
plt.show()

Unfortunately for the census data tables, such as https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/, those zip files contain two files (an estimate file and a margin of error file), so you cannot just do pd.read_csv(url) for those tables. But for the shapefile zip files this appears to work just fine and dandy.

I am currently working on a project at work (but Gainwell has given me the thumbs up to open source it) to build tables to create the CDC’s Social Vulnerability Index, which I can build for multiple geographies in combo with the census data. So hopefully in the next few weeks will be able to share that work.

Cointegration analysis of Ethereum and BitCoin

So a friend recently has heavily encouraged investment into Ethereum and NFTs. Part of the motivation of these cryptocurrencies is to be independent of fiat currency. So that lends itself to a hypothesis – are cryptocurrency prices and more typical securities independent? Or are we simply seeing similar trends in these different securities over time? This is a job for cointegration analysis. The python code is simple enough to follow along in a blog post.

So first I import the libraries I am using – it leverages the Yahoo finance API to download ticker data (here I analyze closing prices), and statsmodels to conduct the various analyses in python.

from datetime import datetime
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import adfuller, ccf
from statsmodels.tsa.api import VAR
from statsmodels.tsa.vector_ar import vecm

Now we can download the ticker data, here I will analyze BitCoin and Ethereum, along with Gold prices and the S&P 500 index fund.

# BTC-USD : Bitcoin
# ETH-USD : Ethereum
# ^GSPC ; S&P 500
# GC=F : Gold

end_date = datetime.now().strftime("%Y-%m-%d")
print(end_date) #running as of 2/9/2022

tick_str = 'BTC-USD ETH-USD ^GSPC GC=F'
dat = yf.download(tick_str,start='2017-01-01',end=end_date)

Now for data prep – I am going to interpolate missing data (for when the market was closed). Then I only subset out Fridays at close to conduct a weekly analysis. Even weekly is too short for me to bother with rebalancing if I do decide to invest.

# Fill in missing data before sub-sampling to once a week
dat2 = dat.interpolate()

# Only Fridays close post 11/9/2017
dat2.reset_index(inplace=True)
after = pd.to_datetime('2017-11-09')
sel = (dat2.Date >= after) & (dat2.Date.dt.weekday == 4) #Friday
sdat = dat2.loc[sel,['Date','Close']]
sdat.columns = ['Date'] + ['BitCoin','Eth','S&P 500','Gold']
sdat.set_index('Date',inplace=True)

Now lets look at the overall trends by superimposing these four stocks on the same graph. Just min-max normalizing to range from 0 to 1.

# Time Series Graphs of Each
# Normalized to be 0/1
snorm = sdat.copy()
for v in sdat:
    mi,ma = sdat[v].min(),sdat[v].max()
    snorm[v] = (sdat[v] - mi)/(ma-mi)

# All four series on the same graph
snorm.plot(linewidth=2)
plt.show()

So you can see these all appear to follow a similar upward trajectory after Covid hit in 2020, although crypto has way more volatility recently. If we subset out just the crypto’s, we can see how they trend with each other more easily.

s2 = sdat.copy()
s2['BitCoin/10'] = s2['BitCoin']/10
s2[['BitCoin/10','Eth']].plot(linewidth=2)
plt.show()

So based on this, I would say that maybe Bitcoin is a leading indicator of Ethereum (increases in BitCoin precede increases in Eth with maybe just a week lag).

Typically with any time series analysis like this, we are concerned with whether the series are stationary. Just going off of my Ender’s Applied Econometric Time Series book, we typically look at the Adjusted Dickey-Fuller test for the levels:

# Integration analysis
adfuller(sdat['BitCoin'], regression='ct', maxlag=5, autolag='t-stat', regresults=True)

And we can see this fails to reject the null, so we would conclude the series is integrated. Since we have a fairly large sample here (over 200 weeks), the test should be reasonably powered. If we then take the first differences and conduct the same test, we then reject the null of an integrated series (here for Bitcoin).

# Create differenced data
sdiff = sdat.diff().dropna()

# All appear 1st order integrated!
adfuller(sdiff['BitCoin'], regression='ct', maxlag=5, autolag='t-stat', regresults=True)

So this reasonably suggests Bitcoin is an I(1) process. Doing the same for all of the other securities in this example you come to the same inference, all integrated of order 1 (which is very typical for stock data).

Using the differenced data, we can see the cross-correlations between different securities. In this example, it appears BitCoin/Ethereum just have a large 1 positive lag, and close to 0 after that.

# Only 1 lag positive in differenced data
pd.Series(ccf(sdiff['BitCoin'],sdiff['Eth'])[:10]).plot(kind='bar',grid=False)

So based on this, I subsequent only look at 1 lag in subsequent models. (Prior week impacts current week, since we are analyzing weekly data.)

So you need to be careful here – typically we want to avoid doing regression analysis of integrated time series, as that can lead to spurious correlations. But in the case a series is co-integrated, it is ok to conduct analysis on the levels. So here we do the analysis of the levels for each of the securities to assess our hypothesis. (Including temporal trends results in different coefficients, but similar overall inferences.)

mod = VAR(sdat)
res = mod.fit(1) #trend='ctt'
res.summary()

So we can see here that contrary to the graphs, Ethereum has a negative relationship with BitCoin – when Ethereum goes up a dollar, the following week BitCoin goes down $1.7. For BitCoin the relationships with S&P is negative (but weaker), and Gold it is positive.

# Ethereum causes BitCoin to go down
irf = res.irf(4)
irf.plot(orth=False,impulse='Eth',response='BitCoin')

For Ethereum the converse is not true though – BitCoin + increases Ethereum (although given that BitCoin is currently 10x the value of Eth the magnitude is smaller).

# Ethereum causes BitCoin to go down
irf = res.irf(4)
irf.plot(orth=False,impulse='Eth',response='BitCoin')

There are more formal tests to look at Granger causality and cointegration with error correction models, but looking at the VAR of the levels I think is the easiest to Grok here.

Do not take this as investment advice, looking at the volatility of these securities makes me very hesistant to invest even a small sum.

# Granger causality test
gc = res.test_causality('Eth', 'BitCoin', kind='f').summary()
print(gc)

# Cointegration test
ecm = vecm.coint_johansen(sdat[['BitCoin','Eth']], 1, 1)
print(ecm.max_eig_stat)
print(ecm.max_eig_stat_crit_vals)

ecm = vecm.VECM(sdat[['BitCoin','Eth']],deterministic='co')
est = ecm.fit()

est.plot_forecast(4,n_last_obs=10)
plt.show()

Based on this analysis it might make sense to include BitCoin as a portfolio diversification relative to traditional stocks – if willing to assume quite a bit of risk. If you are a gambler it may make sense to do some type of pairs trading strategy between Eth/Bitcoin on a short term basis. (If I had some real magic low risk money making strategy I would not put it in a blog post!)

Gambling is fun (and it is fun to think damn if I invested in Eth in 2019 I would be up 10x) – but I don’t think I am going onto the crypto roller-coaster at the moment.

Prediction Intervals for Random Forests

I previously knew about generating prediction intervals via random forests by calculating the quantiles over the forest. (See this prior python post of mine for getting the individual trees). A recent set of answers on StackExchange show a different approach – apparently the individual tree approach tends to be too conservative (coverage rates higher than you would expect). Those Cross Validated posts have R code, figured it would be good to illustrate in python code how to generate these prediction intervals using random forests.

So first what is a prediction interval? I imagine folks are more familiar with confidence intervals, say we have a regression equation y = B1*x + e, you often generate a confidence interval around B1. Imagine we use that equation to make a prediction though, y_hat = B1*(x=10), here prediction intervals are errors around y_hat, the predicted value. They are actually easier to interpret than confidence intervals, you expect the prediction interval to cover the observations a set percentage of the time (whereas for confidence intervals you have to define some hypothetical population of multiple measures).

Prediction intervals are often of more interest for predictive modeling, say I am predicting future home sale value for flipping houses. I may want to generate prediction intervals that cover the value 90% of the time, and only base my decisions to buy based on the much lower value (if you are more risk averse). Imagine I give you the choice of buy a home valuated at 150k - 300k after flipped vs a home valuated at 230k-250k, the upside for the first is higher, but it is more risky.

In short, this approach to generate prediction intervals from random forests relies on out of bag error metrics (it is sort of like a for free hold out sample based on the bootstrapping approach random forest uses). And based on the residual distribution, one can generate forecast intervals (very similar to Duan’s smearing).

To illustrate, I will use a dataset of emergency room visits and time it took to see a MD/RN/PA, the NHAMCS data. I have code to follow along here, but I will walk through it in this post (that code has some nice functions for data definitions for the NHAMCS data).

At work I am working on a project related to unnecessary emergency room visits, and I actually went to the emergency room in December (for a Kidney stone). So I am interested here in generating prediction intervals for the typical time it takes to be served in an ER to see if my visit was normal or outlying.

Example Python Code

First for some set up, I import the libraries I am using, and read in the emergency room use data:

import numpy as np
import pandas as pd
from nhanes_vardef import * #variable definitions
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Reading in fixed width data
# Can download this data from 
# https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHAMCS/
nh2019 = pd.read_fwf('ED2019',colspecs=csp,header=None)
nh2019.columns = list(fw.keys())

Here I am only going to work with a small set of the potential variables. Much of the information wouldn’t make sense to use as predictors of time to first being seen (such as subsequent tests run). One thing I was curious about though was if I changed my pain scale estimate would I have been seen sooner!

# WAITTIME
# PAINSCALE [- missing]
# VDAYR [Day of Week]
# VMONTH [Month of Visit]
# ARRTIME [Arrival time of day]
# AGE [top coded at 95]
# SEX [1 female, 2 male]
# IMMEDR [triage]
#  9 = Blank
#  -8 = Unknown
#  0 = ‘No triage’ reported for this visit but ESA does conduct nursing triage
#  1 = Immediate
#  2 = Emergent
#  3 = Urgent
#  4 = Semi-urgent
#  5 = Nonurgent
#  7 = Visit occurred in ESA that does not conduct nursing triage 

keep_vars = ['WAITTIME','PAINSCALE','VDAYR','VMONTH','ARRTIME',
             'AGE','SEX','IMMEDR']
nh2019 = nh2019[keep_vars].copy()

Many of the variables encode negative values as missing data, so here I throw out visits with a missing waittime. I am lazy though and the rest I keep as is, with enough data random forests should sort out all the non-linear effects no matter how you encode the data. I then create a test split to evaluate the coverage of my prediction intervals out of sample for 2k test samples (over 13k training samples).

# Only keep wait times that are positive
mw = nh2019['WAITTIME'] >= 0
print(nh2019.shape[0] - mw.sum()) #total number missing
nh2019 = nh2019[mw].copy()

# Test hold out sample to show
# If coverage is correct
train, test = train_test_split(nh2019, test_size=2000, random_state=10)
x = keep_vars[1:]
y = keep_vars[0]

Now we can fit our random forest model, telling python to keep the out of bag estimates.

# Fitting the model on training data
regr = RandomForestRegressor(n_estimators=1000,max_depth=7,
  random_state=10,oob_score=True,min_samples_leaf=50)
regr.fit(train[x], train[y])

Now we can use these out of bag estimates to generate error intervals around our predictions based on the test oob error distribution. Here I generate 50% prediction intervals.

# Generating the error distribution
resid = train[y] - regr.oob_prediction_
# 50% interval
lowq = resid.quantile(0.25)
higq = resid.quantile(0.75)
print((lowq,higq)) 
# negative much larger
# so tends to overpredict time

Even 50% here are quite wide (which could be a function of both the data has a wide variance as well as the model is not very good). But we can test whether our prediction intervals are working correctly by seeing the coverage on the out of sample test data:

# Generating predictions on out of sample data
test_y = regr.predict(test[x])
lowt = (test_y + lowq).clip(0) #cant have negative numbers
higt = (test_y + higq)

cover = (test[y] >= lowt) & (test[y] <= higt)
print(cover.mean())

Pretty much spot on. So lets see what the model predicts my referent 50% prediction interval would be (I code myself a 2 on the IMMEDR scale, as I was billed a CPT code 99284, which those should line up pretty well I would think):

# Seeing what my referent time would be
myt = np.array([[6,4,12,930,36,2,6]])
mp = regr.predict(myt)
print(mp)
print( (mp+lowq).clip(0), (mp+higq) )

So a predicted mean of 35 minutes, and a prediction interval of 4 to 38 minutes. (These intervals based on the residual quantiles are basically non-parametric, and don’t have any strong assumptions about the distribution of the underlying data.)

To first see the triage nurse it probably took me around 30 minutes, but to actually be treated it was several hours long. (I don’t think you can do that breakdown in this dataset though.)

We can do wider intervals, here is a screenshot for 80% intervals:

You can see that they are quite wide, so probably not very effective in identifying outlying cases. It is possible to make them thinner with a better model, but it may just be the variance is quite wide. For folks monitoring time it takes for things (whether time to respond to calls for service for police, or here be served in the ER), it probably makes sense to build models focusing on quantiles, e.g. look at median time served instead of mean.

Optimal and Fair Spatial Siting

A bit of a belated MLK day post. Much of the popular news on predictive or machine learning algorithms has a negative connotation, often that they are racially biased. I tend to think about algorithms though in almost the exact opposite way – we can adjust them to suit our objectives. We just need to articulate what exactly we mean by fair. This goes for predictive policing (Circo & Wheeler, 2021; Liberatore et al., 2021; Mohler et al., 2018; Wheeler, 2020) as much as it does for any application.

I have been reading a bit about spatial fairness in siting health resources recently, one example is the Urban Institutes Equity Data tool. For this tool, you put in where your resources are currently located, and it tells you whether those locations are located in areas that have demographic breakdowns like the overall city. So this uses the container approach (not distance to the resources), which distance traveled to resources is probably a more typical way to evaluate fair spatial access to resources (Hassler & Ceccato, 2021; Koschinsky et al., 2021).

Here what I am going to show is instead of ex-ante saying whether the siting of resources is fair, I construct an integer linear program to site resources in a way we define to be fair. So imagine that we are siting 3 different locations to do rapid Covid testing around a city. Well, we do the typical optimization and minimize the distance traveled for everyone in the city on average to those 3 locations – on average 2 miles. But then we see that white people on average travel 1.9 miles, and minorities travel 2.2 miles. So it that does not seem so fair does it.

I created an integer linear program to take this difference into account, so instead of minimizing average distance, it minimizes:

White_distance + Minority_distance + |White_distance - Minority_distance|

So in our example above, if we had a solution that was white travel 2.1 and minority 2.1, this would be a lower objective value than (4.2), than the original minimize overall travel (1.9 + 2.2 + 0.3 = 4.4). So this gives each minority groups equal weight, as well as penalizes if one group (either whites or minorities) has much larger differences.

I am not going to go into all the details. I have python code that has the functions (it is very similar to my P-median model, Wheeler, 2018). The codes shows an example of siting 5 locations in Dallas (and uses census block group centroids for the demographic data). Here is a map of the results (it has points outside of the city, since block groups don’t perfectly line up with the city boundaries).

In this example, if we choose 5 locations in the city to minimize the overall distance, the average travel is just shy of 3.5 miles. The average travel for white people (not including Hispanics) is 3.25 miles, and for minorities is 3.6 miles. When I use my fair algorithm, the white average distance is 3.5 miles, and the minority average distance is 3.6 miles (minority on average travels under 200 more feet on average than white).

So this is ultimately a trade off – it ends up pushing up the average distance a white person will travel, and only slightly pushes down the minority travel, to balance the overall distances between the two groups. It is often the case though that one can be somewhat more fair, but in only results in slight trade-offs though in the overall objective function (Rodolfa et al., 2021). So that trade off is probably worth it here.

References

Learning a fair loss function in pytorch

Most of the time when we are talking about deep learning, we are discussing really complicated architectures – essentially complicated sets of (mostly linear) equations. A second innovation in the application of deep learning though is the use of back propagation to fit under-determined systems. Basically you can feed these systems fairly complicated loss functions, and they will chug along with no problem. See a prior blog post of mine of creating simple regression weights for example.

Recently in the NIJ challenge, I used pytorch and the non-linear fairness function defined by NIJ. In the end me and Gio used an XGBoost model, because the data actually do not have very large differences in false positive rates between racial groups. I figured I would share my example here though for illustration of using pytorch to learn a complicated loss function that has fairness constraints. And here is a link to the data and code to follow along if you want.

First, in text math the NIJ fairness function was calculated as (1 - BS)*(1 - FP_diff), where BS is the Brier Score and FP_diff is the absolute difference in false positive rates between the two groups. My pytorch code to create this loss function looks like this (see the pytorch_mods.py file):

# brier score loss function with fairness constraint
def brier_fair(pred,obs,minority,thresh=0.5):
    bs = ((pred - obs)**2).mean()
    over = 1*(pred > thresh)
    majority = 1*(minority == 0)
    fp = 1*over*(obs == 0)
    min_tot = (over*minority).sum().clamp(1)
    maj_tot = (over*majority).sum().clamp(1)
    min_fp = (fp*minority).sum()
    maj_fp = (fp*majority).sum()
    min_rate = min_fp/min_tot
    maj_rate = maj_fp/maj_tot
    diff_rate = torch.abs(min_rate - maj_rate)
    fin_score = (1 - bs)*(1 - diff_rate)
    return -fin_score

I have my functions all stashed in different py files. But here is an example of loading up all my functions, and fitting my pytorch model to the training recidivism data. Here I set the threshold to 25% instead of 50% like the NIJ competition. Overall the model is very similar to a linear regression model.

import data_prep as dp
import fairness_funcs as ff
import pytorch_mods as pt

# Get the train/test data
train, test = dp.get_y1_data()
y_var = 'Recidivism_Arrest_Year1'
min_var = 'Race' # 0 is White, 1 is Black
x_vars = list(train)
x_vars.remove(y_var)

# Model learning fair loss function
m2 = pt.pytorchLogit(loss='brier_fair',activate='ident',
                     minority=min_var,threshold=0.25)
m2.fit(train[x_vars],train[y_var])

I have a burn in start to get good initial parameter estimates with a more normal loss function before going into the more complicated function. Another approach would be to initialize the weights to the solution for a linear regression equation though. After that burn in though it goes into the NIJ defined fairness loss function.

Now I have functions to see how the different model metrics in whatever sample. Here you can see the model is quite balanced in terms of false positive rates in the training sample:

# Seeing training sample fairness
m2pp = m2.predict_proba(train[x_vars])[:,1]
ff.fairness_metric(m2pp,train[y_var],train[min_var],thresh=0.25)

But of course in the out of sample test data it is not perfectly balanced. In general you won’t be able to ensure perfect balance in whatever fairness metrics out of sample.

# Seeing test sample fairness
m2pp = m2.predict_proba(test[x_vars])[:,1]
ff.fairness_metric(m2pp,test[y_var],test[min_var],thresh=0.25)

It actually ends up that the difference in false positive rates between the two racial groups, even in models that do not incorporate the fairness constraint in the loss function, are quite similar. Here is a model using the same architecture but just the usual Brier Score loss. (Code not shown, see the m1 model in the 01_AnalysisFair.py file in the shared dropbox link earlier.)

You can read mine and Gio’s paper (or George Mohler and Mike Porter’s paper) about why this particular fairness function is not the greatest. In general it probably makes more sense to use an additive fairness loss instead of multiplicative, but in this dataset it won’t matter very much no matter how you slice it in terms of false positive rates. (It appears in retrospect the Compas data that started the whole false positive thing is somewhat idiosyncratic.)

There are other potential oddities that can occur with such fair loss functions. For example if you had the choice between false positive rates for (white,black) of (0.62,0.60) vs (0.62,0.62), the latter is more fair, but the minority class is worse off. It may make more sense to have an error metric that sets the max false positive rate you want, and then just has weights for different groups to push them down to that set threshold.

These aren’t damning critiques of fair loss functions (these can be amended/changed to behave better), but in the end defining fair loss functions will be very tricky. Both for empirical reasons as well as for ethical ones – they will ultimately involve quite a few trade-offs.

genrules: using a genetic algo to find high risk cases

So I recently participated in the Decoding Maternal Morbidity Data Challenge. I did not win, but will share my work anyway. I created a genetic algorithm in python to identify sets of association rules that result in high relative risks, genrules. Here I intentionally used a genetic algorithm, with the idea I wanted not just one final model, but a host of different potential rules with the goal of exploratory data analysis.

You can go see how it works by checking out the notebook using the nuMoM2b data (which you have to request, I cannot upload that to github). Here are rules I found to predict high relative risk for infections:

And I have other code to summarize these, it happens to govt insurance is very commonly in the set of rules found.

I actually like it better just as a convenient tool (that can take weights), and go through every possible permutation of a set of categorical data. I uploaded a second notebook showing off NIBRS and predicting officer assaults (see my prior blog post on this).

So here one of the rules is stolen property, and the assailant using their motor vehicle as a weapon. So perhaps people running with stolen property in a vehicle? (While I built quite a bit of machinery to look through higher level rules, why bother with higher sets than 3 variables in the vast majority of circumstances.)

This shows the risk of competing in challenges, so a few weekends lost with no reward. This was a fuzzy competition, and something that appears I misread was in the various notes the challenge asked for the creation of novel algorithms. It appears from the titles of the winners people just used typical machine learning approaches and did a deep dive into the data. I did the opposite, created a general tool that could be used by many (who are better experts on the topical material than me).

Also note this approach is very data mining. While I use p-values as a mechanism to penalize too complicated of outcomes in the genetic algorithm, I wouldn’t take those at face value. With large datasets you will always mine some sets that are ultimately confounded.

Regression Discontinuity Designs

Regression Discontinuity Designs (RDD) are one way to evaluate predictive model systems with causal outcomes associated with what you do with that information. For a hypothetical example, imagine you have a predictive model assessing the probability that you will be diagnosed with diabetes in the next two years. Those that score above 30% probability get assigned a caseworker, to try to prevent that individual from contracting diabetes. How do you know how effective that caseworker is in reducing diabetes in those high risk individuals?

The RDD design works like this – you have your running variable (here the predicted probability), and the threshold (over 30%) that gets assigned a treatment. You estimate the probability of the actual outcome (it may be other outcomes besides just future diabetes, such as the caseworker may simply reduce overall medical costs even if the person still ends up being diagnosed with diabetes). You then estimate the dip in the predicted outcome just before and just after the threshold. The difference in those two curves is the estimated effect.

Here is an example graph illustrating (with fake data I made, see the end of the post). The bump in the line (going from the blue to the red) is then the average treatment effect of being assigned a caseworker, taking into account the underlying trend that higher risk people here are more likely to have higher medical bills.

A few things to note about RDD – so there is a tension between estimating the underlying curve and the counterfactual bump at the threshold. Theoretically values closer to the threshold should be more relevant, so some (see the Wikipedia article linked earlier) try to estimate non-linear weighted curves, giving cases closer to the threshold higher weights. This often produces strange artifacts (that Andrew Gelman likes to point out on his blog) that can miss-estimate the RDD effect. This is clearly the case in noisy data if the line dips just before the threshold and curves up right after the threshold.

So you can see in my code at the end I prefer to estimate this using splines, as opposed to weighted estimators that have a bit of noise. (Maybe someday I will code up a solution to do out of sample predictive evaluations for this as an additional check.) And with this approach it is easy to incorporate other covariates (and look at treatment heterogeneity if you want). Note that while wikipedia says the weighted estimator is non-parametric this is laughably wrong (it is two straight lines in their formulation, a quite restrictive for actually) – while I can see some theoretical justification for the weighted estimators, in practice these are quite noisy and not very robust to minor changes in the weights (and you always need to make some parametric assumptions in this design, both for the underlying curve as well as the standard error around the threshold bump).

There are additional design details you need to be worried about, such as fuzzy treatments or self selection. E.g. if a person can fudge the numbers a bit and choose which side of the line they are on, it can cause issues with this design. In those cases there may be a reasonable workaround (e.g. an instrumental variable design), but the jist of the research design will be the same.

Last but not least, to actually conduct this analysis you need to cache the original continuous prediction. In trying to find real data for this blog post, many criminal justice examples of risk assessment people only end up saving the final categories (low, medium, high) and not the original continuous risk instrument.

If folks have a good public data example that I could show with real data please let me know! Scouting for a bit (either parole/probabation risk assessment, or spatial predictive policing) has not turned up any very good examples for me to construct examples with (also health examples would be fine as well).


This is the simulated data (in python), the RDD graph, and the statsmodel code for how I estimate the RDD bump effect. You could of course do more fancy things (such as penalize the derivatives for the splines), but this I think would be a good way to estimate the RDD effect in many designs where appropriate.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

########################################
# Simulating data
np.random.seed(10)
n_cases = 3000 # total number of cases
# pretend this is predicted prob
prob = np.random.beta(3,10,size=n_cases)
# pretend this is med costs over a year
med_cost = 3000 + 5000*prob + -500*(prob > 0.3) + np.random.normal(0,500,n_cases)
df = pd.DataFrame(zip(prob,med_cost), columns=['Prob','MedCost'])
# could do something fancier with non-linear effects for prob
########################################

########################################
# Fitting regression model

# Knots are small distance from threshold
# (Could also do a knot right on threshold)
mod = smf.ols(formula='MedCost ~ bs(Prob,knots=[0.2,0.25,0.35,0.4]) + I(Prob > 0.3)', data=df)
res = mod.fit()
print(res.summary())
########################################

########################################
# Plotting fit

# Getting standard errors
prob_se = res.get_prediction().summary_frame()
prob_se['Prob'] = prob
prob_se.sort_values(by='Prob',inplace=True,ignore_index=True)
low = prob_se[prob_se['Prob'] <= 0.3].copy()
high = prob_se[prob_se['Prob'] > 0.3].copy()

# Getting effect for threshold bump
coef = res.summary2().tables[1]
ci = coef.iloc[1,4:6].astype(int).to_list()

fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(df['Prob'], df['MedCost'], c='grey',
           edgecolor='k', alpha=0.15, s=5, zorder=1)
ax.axvline(0.3, linestyle='solid', alpha=1.0, 
           color='k',linewidth=1, zorder=2)
ax.fill_between(low['Prob'],low['mean_ci_lower'],
                low['mean_ci_upper'],alpha=0.6,
                zorder=3, color='darkblue')
ax.fill_between(high['Prob'],high['mean_ci_lower'],
                high['mean_ci_upper'],alpha=0.6,
                zorder=3, color='red')
ax.set_xlabel('Predicted Prob Diabetes')
ax.set_ylabel('Medical Costs')
ax.set_title(f'RDD Reduced Cost Estimate {ci[0]} to {ci[1]} (95% CI)')
ax.text(0.3,6500,'Threshold',rotation=90, size=9,
         ha="center", va="center",
         bbox=dict(boxstyle="round", ec='k',fc='grey'))
plt.savefig('RDD.png', dpi=500, bbox_inches='tight')
########################################

pre-commit hooks and github actions

In keeping up with learning about code development in python and R, two things I have added to my retenmod python package recently are pre-commit hooks and github actions. I am not going to give a code example here, you can google pre-commit python black flake and get like a dozen different blog posts to describe the process. Ditto for github actions. I think it is good to do some googling on the processes, and then you can see the final yaml files I have made to do either process:

The idea behind pre-commit, is that it runs a set of commands to check your py files before you commit your changes to github on your local system. So here the pre-commit I created for this package does three things:

  • runs black code formatter for python (formats whitespace nicely where it can)
  • runs flake8 checks (checks whether py files meet pep standards)
  • updates my readme document

Note one thing I want to be able to do but cannot currently with pre-commit is to update all jupyter notebooks in place as well. Unfortunately this does not work, as notebooks generate some time meta-data under the hood (so the file is modified, and fails the test). There is probably a way to fix this (maybe some smart config via nbdime), but it is not a big deal for me at the moment. But it works fine executing notebooks and saving to different files, so the readme hook in that example works just fine. (And won’t be as painful as say for Rmarkdown as for Jupyter with that time meta-data.)

Pre-commit hooks run on your local system, so you might not want to have it do pytests. My retenmod package though is tiny (intentionally did it as a very simple example to learn). At work I do development on both a windows machine and Red Hat virtual machines, and fortunately so far have not had issues with needing different yaml files, although I could potentially seeing that happen in more complicated set ups. Although if that were the case, I would like just not install on windows (I use local windows laptop to edit powerpoint presentations, and use Red Hat for pretty much everything else).

Github actions does not run on your personal machine, but runs after you have pushed your changes to github. And then github basically spins up virtual environments and does whatever tests. This makes sense for package development, to make sure your package can be installed on multiple operating systems. And you can run other unit tests at this stage on those systems as well. But for certain tests that only make sense on your local system (say functions to generate database connections) github actions will not make sense.

Next up I will need to learn whether it makes sense to generate artifacts via github actions (for that Jupyter notebook example where pre-commit is not working so well?). Also I have never really figured out sphinx docs for python. So I will see if I can get that up and running as well for this retenmod package.

Generating multiple solutions for linear programs

I had a question the other day on my TURF analysis about generating not just the single best solution for a linear programming problem, but multiple solutions. This is one arena I like the way people typically do genetic algorithms, in that they have a leaderboard with multiple top solutions. Which is nice for exploratory data analysis.

This example though it is pretty easy to amend the linear program to return exact solutions by generating lazy constraints. (And because the program is fast, you might as well do it this way as well to get an exact answer.) The way it works here, we have decision variables for products selected. You run the linear program, and then collect the k products selected, then add a constraint to the model in the form of:

Sum(Product_k) <= k-1 

Then rerun the model with this constraint, get the solution, and then repeat. This constraint makes it so the group of decision variables selected cannot be replicated (and in the forbidden set). Python’s pulp makes this quite easy, and here is a function to estimate the TURF model I showed in that prior blog post, but generate this multiple leaderboard:

import pandas as pd
import pulp

# Function to get solution from turf model
def get_solution(prod_dec,prod_ind):
    '''
    prod_dec - decision variables for products
    prod_ind - indices for products
    '''
    picked = []
    for j in prod_ind:
        if prod_dec[j].varValue >= 0.99:
            picked.append(j)
    return picked

# Larger function to set up turf model 
# And generate multiple solutions
def turf(data,k,solutions):
    '''
    data - dataframe with 0/1 selected items
    k - integer for products selected
    solutions - integer for number of potential solutions to return
    '''
    # Indices for Dec variables
    Cust = data.index
    Prod = list(data)
    # Problem and Decision Variables
    P = pulp.LpProblem("TURF", pulp.LpMaximize)
    Cu = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Pr = pulp.LpVariable.dicts("Products Selected", [j for j in Prod], lowBound=0, upBound=1, cat=pulp.LpInteger)
    # Objective Function (assumes weight = 1 for all rows)
    P += pulp.lpSum(Cu[i] for i in Cust)
    # Reached Constraint
    for i in Cust:
        #Get the products selected
        p_sel = data.loc[i] == 1
        sel_prod = list(p_sel.index[p_sel])
        #Set the constraint
        P += Cu[i] <= pulp.lpSum(Pr[j] for j in sel_prod)
    # Total number of products selected constraint
    P += pulp.lpSum(Pr[j] for j in Prod) == k
    # Solve the equation multiple times, add constraints
    res = []
    km1 = k - 1
    for _ in range(solutions):
        P.solve()
        pi = get_solution(Pr,Prod)
        # Lazy constraint
        P += pulp.lpSum(Pr[j] for j in pi) <= km1
        # Add in objective
        pi.append(pulp.value(P.objective))
        res.append(pi)
    # Turn results into nice dataframe
    cols = ['Prod' + str(i+1) for i in range(k)]
    res_df = pd.DataFrame(res, columns=cols+['Reach'])
    res_df['Reach'] = res_df['Reach'].astype(int)
    return res_df

And now we can simply read in our data, turn it into binary 0/1, and then generate the multiple solutions.

#This is simulated data from XLStat
#https://help.xlstat.com/s/article/turf-analysis-in-excel-tutorial?language=en_US
prod_data = pd.read_csv('demoTURF.csv',index_col=0)

#Replacing the likert scale data with 0/1
prod_bin = 1*(prod_data > 3)

# Get Top 10 solutions
top10 = turf(data=prod_bin,k=5,solutions=10)
print(top10)

Which generates the results:

>>> print(top10)
        Prod1       Prod2       Prod3       Prod4       Prod5  Reach
0   Product 4  Product 10  Product 15  Product 21  Product 25    129
1   Product 3  Product 15  Product 16  Product 25  Product 26    129
2   Product 4  Product 14  Product 15  Product 16  Product 26    129
3  Product 14  Product 15  Product 16  Product 23  Product 26    129
4   Product 3  Product 15  Product 21  Product 25  Product 26    129
5  Product 10  Product 15  Product 21  Product 23  Product 25    129
6   Product 4  Product 10  Product 15  Product 16  Product 27    129
7  Product 10  Product 15  Product 16  Product 23  Product 27    129
8   Product 3  Product 10  Product 15  Product 21  Product 25    128
9   Product 4  Product 10  Product 15  Product 21  Product 27    128

I haven’t checked too closely, but this looks to me offhand like the same top 8 solutions (with a reach of 129) as the XLStat website. The last two rows are different, likely because there are further ties.

For TURF analysis, you may actually consider a lazy constraint in the form of:

Product_k = 0 for each k

This is more restrictive, but if you have a very large pool of products, will ensure that each set of products selected are entirely different (so a unique set of 5 products for every row). Or you may consider a constraint in terms of customers reached instead of on products, so if solution 1 reaches 129, you filter those rows out and only count newly reached people in subsequent solutions. This ensures each row of the leaderboard above reaches different people than the prior rows.