State dependence and trajectory models

I am currently reviewing a paper that uses group-based trajectory models (GBTM) – and to start, this isn’t a critique of the paper. GBTM I think is a very useful descriptive tool (which is how the paper I am reading mostly uses it), and it can be helpful in some predictive contexts as well.

It is much more difficult though to attribute a causal framework to those trajectories. My favorite paper on this topic is Distinguishing facts and artifacts in group-based modeling (Skardhamar, 2010). Torbjørn in that paper simulates random data (not dissimilar to what I do here, but with a few more complicating factors), and shows that purely random data will still result in GBTM identifying trajectories. You can go the other way as well: I have a blog post where I simulate actual latent trajectories and GBTM recovers them, and another example where fit stats clearly show a random effects continuous model is better for a different simulation. In real data though we don’t know the true model like we do in these simulations, so we can only be reasonably skeptical that the trajectories we uncover really represent latent classes.

In particular, the paper I was reading is looking at a binary outcome, so you just observe a bunch of 0s and 1s over the time period. Given the limited domain, it is difficult to uncover really wild looking curves. They ended up finding a set of curves that, although they meet all the good fit stats, pretty much cover the domain of possibilities – one starting high and linearly sloping down, one starting low and sloping up, one flat high, one flat low, and a single curved upward slope.

So often in criminology we interpret these latent trajectories as population heterogeneity – people on different curves are fundamentally different (e.g. Moffitt’s taxonomy for offending trajectories). But there are other underlying data generating processes that can result in similar trajectories – especially over a limited domain of 0/1 data.

Here I figure the underlying data in the paper I am reviewing are subject to very strong state dependence – your value at t-1 is very strongly correlated with your value at t. So here I simulate data in R, and use the flexmix package to fit the latent trajectories.

First, I simulate 1500 people over 15 time points. I assign each an original probability uniformly at random, then generate 15 0/1 observations, updating that probability over time with an auto-correlation of 0.9. (Simulations are done on the logit scale, then backed out into 0/1s.)

# R Code simulating state dependence 0/1
# data
library("flexmix")
set.seed(10)

# logit and inverse function
logistic <- function(x){1/(1+exp(-x))}
logit <- function(x){log(x/(1-x))}

# generate uniform probabilities
n <- 1500
orig_prob <- runif(n)

# translate to logits
ol <- logit(orig_prob)
df <- data.frame(id=1:n,op=orig_prob,ol)

# generate auto-correlated data over the 15 time points
auto_corr <- 0.90
tp <- 15
vl <- paste0('v',1:tp)
vc <- var(ol) #baseline variance, keep equal

for (v in vl){
   # updated logit
   rsd <- sqrt(vc - vc*(auto_corr^2))
   ol <- ol*auto_corr + rnorm(n,0,rsd)
   # observed outcome
   df[,v] <- rbinom(n,1,logistic(ol))
}

This generates the data in wide format, so I reshape to the long format needed to fit the models using flexmix, and choose 5 trajectories (the same number as chosen in the paper I am reviewing).

# reshape wide to long
ld <- reshape(df, idvar="id", direction="long",
        varying = list(vl))

# fit traj model for binary outcomes
mod <- flexmix(v1 ~ time + I(time^2) | id,
               model = FLXMRmultinom(),
               data=ld, k=5)

rm <- refit(mod)
summary(rm)

Now I create smooth curves over the period to plot. I am being lazy here – the X axis in the plot runs over the 100 interpolated points, when it should actually be labeled 1-15 (I simulated 15 discrete time points).

tc <- summary(rm)@components[[1]]
pd <- data.frame(c=1,t=seq(1,tp,length.out=100))
pd$tsq <- pd$t^2

co <- matrix(-999,nrow=3,ncol=5)

for (i in 1:5){
  co[,i] <- tc[[i]][,1]
}

pred <- as.matrix(pd) %*% co

# plot on probability scale
matplot(logistic(pred))

These are quite similar to the curves in the paper I am reviewing: a consistent low probability (5), a consistent high (1), a downward mostly linear slope (3), an upward linear slope (2), and one concave-down parabola (4) (in the paper they had one concave up).

I figured the initial probability I assigned would strongly determine which curve the model assigned a person to in this simulation. It ends up being more spread out than I expected though.

# distribution of classes vs original probability
ld$clus <- clusters(mod)
r1 <- ld[ld$time == 1,]
clustjit <- r1$clus + runif(n,-0.2,0.2)
plot(clustjit,r1$op) # more spread out than I thought

So there is some tendency for trajectory membership to be correlated with the original probability, but it isn’t that strong.

If we look at the average max posterior probabilities, they are OK aside from the parabola group (4).

# average posterior probability
pp <- data.frame(posterior(mod))
ld$pp <- pp[cbind(1:(n*tp),ld$clus)]
r1 <- ld[ld$time == 1,]
aggregate(pp ~ clus, data = r1, mean)
#   clus        pp
# 1    1 0.8923801
# 2    2 0.7903938
# 3    3 0.7535281
# 4    4 0.6380946
# 5    5 0.8419221

The paper I am reviewing has much higher APPs for each group, so maybe they really are representing population heterogeneity instead of continuous state dependence – it is just really hard to tell the difference with observational data like this.

Some peer review ideas

I recently did two more reviews for Crime Solutions. I actually have two other reviews due, but I jumped Crime Solutions up in my queue. This of course likely says nothing about anyone but myself and my priorities, but I think I can attribute this behavior to two things:

  1. CrimeSolutions pays me to do a review (not much, $250 – IMO I should get double this, but DSG said it was pre-negotiated with NIJ).
  2. CrimeSolutions has a pre-set template. I just have to fill in the blanks, and write a few sentences to point to the article to support my score for that item.

Number 2 in particular was why I did the 2nd review CrimeSolutions forwarded to me in very short order. After doing the 1st, I had the template items fresh in my mind, and knew I could do the second with less mental overhead.

I think these can, on the margins, improve some of the current issues with peer reviews. #1 will encourage more people to do reviews, and #2 will improve the reliability of peer reviews (as well as make it easier for reviewers by limiting the scope). (CrimeSolutions has the reviewers hash it out if we disagree about something, but that has only happened once to me so far, because the template to fill in is laid out quite nicely.)

Another problem with peer reviews is not just getting people to agree to review, but also getting them to do the review in a timely manner. For this, I suggest a time-graded pay scale – if you do the review faster, you get paid more. Here are some potential curves if you set the pay scale to either drop linearly with the number of days or drop off logarithmically:

So here, using the linear scale with a base rate of $300, if you do the review in two weeks you would make $170, but if you take the full 30 days you make $10. I imagine people may not like the clock running down so fast, so I also devised a logarithmic pay scale that doesn’t ding you so much for taking a week or two, but after that penalizes you quite heavily. At two weeks that is still just under $250.
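For the curious, here is a minimal python sketch of the linear schedule those numbers imply ($300 base, minus $10 per day after the first day); the exact logarithmic curve from my plot is not reproduced here.

# Sketch of the linear reviewer pay schedule
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(1, 31)
lin_pay = 300 - 10*(days - 1)

print(lin_pay[days == 14])  # $170 at two weeks
print(lin_pay[days == 30])  # $10 at the 30 day deadline

plt.plot(days, lin_pay)
plt.xlabel('Days to complete review')
plt.ylabel('Pay ($)')
plt.show()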

I realize pay is unlikely to happen (although it is not crazy unreasonable – publishers extract quite a bit of rent from university libraries for subscriptions). But standardized forms are something journals could do right now.

Managing R environments using conda

DataColada have a recent blog post about their groundhog package, intended to aid in reproducible science. That is more from the perspective of “I have this historical code, how can I try to replicate that researcher’s environment to get the same results” – so more of a forensic task. What I am going to talk about in this post is creating an environment from the get-go that has the info necessary for others to replicate it.

First, before I get to that, I have come across people critiquing open science using essentially ‘the perfect is the enemy of the good’ arguments. Sharing code is good, period. Even if there are different standards of replicability, some code is quite a bit better than no code. And scientists are not professional programmers – understanding all of this stuff takes time and training that is often in short supply in academia (hence me blogging about boring stuff like creating environments and using github). If this stuff is over your head, please feel free to email/ask a question and I can try to help.

At work I have to solve a very similar problem to scientific reproducibility; I need to write code in one environment (a dev environment, or sometimes my laptop), and then have that code run in a production environment. The way we do this at work is either via conda environments (for persistent environments) or docker images (for ephemeral environments). We currently are 100% python for machine learning, but you can also use the same workflow for R environments (or have a mashup of R/python).

Groundhog doesn’t really solve this all by itself – it doesn’t specify the version of R for example. (And there are issues with even using dates to try to forensically recreate environments, see the Hackernews thread.) But you can use conda directly to set up a reproducible environment from the get-go. Again, what is good for reproducible science is good for reproducing my work in different environments at my workplace.

I have a github folder showing the steps, but they are quite simple, so here they are. To start, in the root of your project directory, have two files. One is a requirements.txt file that specifies the R libraries you want. This file may look like:

# This is the requirements.txt file
r-spatstat
r-leaflet
r-devtools
r-markdown

Conda has the annoying convention of adding r- at the front to distinguish R packages from python ones. If there happen to be libraries you are using that are not on conda-forge (e.g. just added to CRAN, or more likely only on github), we can solve that as well. Make a second script, here I name it packs.R, and within this R script you can install those additional packages. Here is an example installing groundhog, and my ptools package that is only on github. Each has a way to point to a very specific version:

# This is the packs.R script
library(devtools) # for installing github packages

# Install specific commit/version from github
install_github("apwheele/ptools",ref="9826241c93e9975804430cb3d838329b86f27fd3")

# Install a specific library version from CRAN
# Specifying specific version url for cran package (not on conda-forge)
gh_url <- "https://cran.r-project.org/src/contrib/groundhog_1.5.0.tar.gz"
install.packages(gh_url,repos=NULL,type="source")

OK, so now we are ready to set up our conda environment. From the command line (or more specifically the anaconda prompt), if you are in the root of your project, you can run something like:

conda create --name rnew
conda activate rnew
conda install -c conda-forge r-base=4.0.5 --file requirements.txt

And this installs a specific version of R, as well as those libraries in the text file. If you have additional libraries in packs.R to install, you can then run:

Rscript packs.R

And conda is smart – install.packages defaults to putting all the R junk in the right folder (you can print out .libPaths() in an R session to see where your conda environment lives). (I am more familiar with conda, so cannot comment in detail, but this is likely exchangeable with RStudio’s renv – horses for courses.)

You may notice my requirements.txt file does not have specific versions. Often you want to be generic when you are first setting up your project, and let conda figure out the mess of version dependencies. If you want to be uber vigilant, you can then save the exact versions of packages by overwriting your initial requirements file, something like:

conda list --export > requirements.txt

And this updated file will have everything in it – R version, conda-forge ID, etc. (although it does not include the packages you installed outside of conda, so you still need to keep the packs.R file to be able to replicate).

I will put on the slate an example of using docker to create a totally independent environment to replicate code in. I think that is a bit of overkill for most academic projects (although it is even more isolated than this workflow). Even all this work is not 100% foolproof – conda or CRAN or the github package you installed could go away tomorrow, no guarantees in life. But again, don’t let the perfect be the enemy of the good – share your scientific code, warts and all!

Some more github action tricks

Hackernews shared the other day a project using github actions to generate a nice readme for your base Github profile. That workflow uses rust to query the github API and get some stats to then insert into the README.

Two things I did not realize you could previously do with actions: 1) you can schedule actions to run on a regular basis via a cron job, 2) you can push to the repo inside of the action. (And this does not cause some infinite recursion of actions.) So I have updated my profile to run some python code, generating an image of the number of potholes filled in Raleigh per week.

And you can see that this was updated on 4/7, and that was the automated job that was re-run.

It is pretty simple python code. You just have to have a step in your actions to build the python environment, then you can run your code.

With the regular cron job, you could offload different pieces of work to github – say, automate scraping a site or sending out emails once a week. You just need a python (or whatever language) script to automate that process, and the source data needs to be downloadable via the internet. Or you could do more fancy analysis for a project and post that in the readme via a Jupyter notebook script. A minimal sketch of the kind of script a scheduled action could call is below.
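Just to make that concrete, here is a hypothetical sketch – the URL and output file are placeholders, not the actual pothole code in my profile repo:

# Hypothetical weekly scrape script a scheduled action could run
import pandas as pd

DATA_URL = 'https://example.com/potholes.csv'  # placeholder source
OUT_FILE = 'potholes_weekly.csv'               # placeholder output

def main():
    # grab the current data and append it to the saved history
    new = pd.read_csv(DATA_URL)
    try:
        old = pd.read_csv(OUT_FILE)
        combined = pd.concat([old, new]).drop_duplicates()
    except FileNotFoundError:
        combined = new
    # the action can then commit this file back to the repo
    combined.to_csv(OUT_FILE, index=False)

if __name__ == '__main__':
    main()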

Downloading Social Vulnerability Index data

So Gainwell has let me open source one of the projects I have been working on at work – a python package to download SVI data. The SVI is an index created by the CDC to identify areas of high health risk in four domains based on census data (from the American Community Survey).

For my criminologist friends, these are very similar to the variables we typically use to measure social disorganization (see Wheeler et al., 2018 for one criminology use case). It is a simple python install, pip install svi-data, and then you can get to work. Here is a simple example downloading zip code data for the entire US.

import numpy as np
import pandas as pd
import svi_data

# Need to sign up for your own key
key = svi_data.get_key('census_api.txt')

# Download the data from census API
svi_zips = svi_data.get_svi(key,'zip',2019)
svi_zips['zipcode'] = svi_zips['GEO_ID'].str[-5:]

Note I deviate from the CDC definition in a few ways. One is that when I create the themes, instead of using percentile rankings, I z-score the variables. It will likely result in very similar correlations, but it is somewhat more generalizable across different samples; a quick sketch of the two approaches is below. (I also change the denominator for single-parent heads of households to the number of families instead of the number of households, which I think is likely just an error on the CDC’s part.)
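To make the distinction concrete, here is a rough sketch of the two approaches, re-using the svi_zips frame and the poverty theme inputs from above (this is not the internal code of either the CDC or my svi_data package):

# Rough sketch: percentile-rank sum vs z-score sum
pov_inputs = ['EP_POV','EP_UNEMP','EP_PCI','EP_NOHSDP']
dat = svi_zips[pov_inputs].copy()
dat['EP_PCI'] = -1*dat['EP_PCI']  # income runs opposite the other items

# percentile-rank each variable, then sum (roughly the CDC approach)
pctile_theme = dat.rank(pct=True).sum(axis=1)

# z-score each variable, then sum (my approach)
zscore_theme = ((dat - dat.mean())/dat.std()).sum(axis=1)

# the two versions end up highly correlated
print(pd.concat([pctile_theme, zscore_theme], axis=1).corr())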

Summed Index vs PCA

So here, quickly, let’s check out my z-score approach versus a factor analytic approach via PCA. Here I just focus on the poverty theme:

pov_vars = ['EP_POV','EP_UNEMP','EP_PCI','EP_NOHSDP','RPL_THEME1']
svi_pov = svi_zips[['zipcode'] + pov_vars ].copy()

from sklearn import decomposition
from sklearn.preprocessing import scale

svi_pov.corr()

Note the per capita income has a negative correlation, but you can see the index works as expected – lower correlations for each individual item, but fairly high correlation with the summed index.

Let’s see what the index would look like if we used PCA instead:

pca = decomposition.PCA()
sd = scale(svi_pov[pov_vars[:-1]])
pc = pca.fit_transform(sd)
svi_pov['PC1'] = pc[:,0]
svi_pov.corr() #almost perfect correlation

You can see that besides the negative value, we have an almost perfect correlation between the first principal component and the simpler sum score.

One benefit of PCA though is a somewhat more structured approach to understanding the resulting indices. For example, we can see via the eigenvalues that the first PC only explains about 50% of the variance.

print(pca.explained_variance_ratio_)

And if we look at the loadings, we can see a more complicated pattern of residual loadings for each successive component.

comps = pca.components_.T
cols = ['PC' + str(i+1) for i in range(comps.shape[0])]
load_dat = pd.DataFrame(comps,columns=cols,index=pov_vars[:-1])
print(load_dat)

So PC3, for example, picks up areas with high proportions of no high school diploma as well as high per capita income. So higher-order components can potentially identify more unusual scenarios, which healthcare providers probably don’t care about so much, but which is a useful thing to know for exploratory data analysis.

Mapping

Since these are via census geographies, we can of course map them. (Here I grab zipcodes, but the code can download counties or census tracts as well.)

We can download the census geo data directly into a geopandas dataframe. Here I download the zip code tabulation areas, grab the outline of Raleigh, and then only plot zips that intersect with Raleigh.

import geopandas as gpd
import matplotlib.pyplot as plt

# Getting the spatial zipcode tabulation areas
zip_url = r'https://www2.census.gov/geo/tiger/TIGER2019/ZCTA5/tl_2019_us_zcta510.zip'
zip_geo = gpd.read_file(zip_url)
zip_geo.rename(columns={'GEOID10':'zipcode'},inplace=True)

# Merging in the SVI data
zg = zip_geo.merge(svi_pov,on='zipcode')

# Getting outline for Raleigh
ncp_url = r'https://www2.census.gov/geo/tiger/TIGER2019/PLACE/tl_2019_37_place.zip'
ncp_geo = gpd.read_file(ncp_url)
ral = ncp_geo[ncp_geo['NAME'] == 'Raleigh'].copy()
ral_proj = 'EPSG:2278'
ral_bord = ral.to_crs(ral_proj)

ral_zips = gpd.sjoin(zg,ral,how='left')
ral_zips = ral_zips[~ral_zips['index_right'].isna()].copy()
ral_zipprof = ral_zips.to_crs(ral_proj)

# Making a nice geopandas static map, zoomed into Raleigh

fig, ax = plt.subplots(figsize=(6,6), dpi=100)

# Raleighs boundary is crazy
#ral_bord.boundary.plot(color='k', linewidth=1, edgecolor='k', ax=ax, label='Raleigh')
ral_zipprof.plot(column='RPL_THEME1', cmap='PRGn',
                 legend=True,
                 edgecolor='grey',
                 ax=ax)

# via https://stackoverflow.com/a/42214156/604456
ral_zipprof.apply(lambda x: ax.annotate(text=x['zipcode'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)

ax.get_xaxis().set_ticks([])
ax.get_yaxis().set_ticks([])

plt.show()

I prefer to use smaller geographies when possible, so I think zipcodes are about the largest areas that are reasonable to use this for (although I do have the ability to download this for counties). Zipcodes, since they don’t nicely overlap city boundaries, can cause particular issues in data analysis as well (Grubesic, 2008).

Other Stuff

I have a notebook in the github repo showing how to grab census tracts, as well as how to modify the exact variables you can download.

It does allow you to specify a year as well (in the notebook I show you can do the 2018 SVI for the 16/17/18/19 data at least). Offhand for these small geographies I would only expect small changes over time (see Miles et al., 2016 for an example looking at SES).

One of the things I think has more value added (and hopefully I can get some time to do more on this at Gainwell) is to peg these metrics to actual health outcomes – so instead of making an index for SES, look at micro level demographics for health outcomes, and then post-stratify based on census data to get estimates across the US. That being said, the SVI often does have reasonable correlations to actual geospatial health outcomes; see Learnihan et al. (2022) for one example, where for medication adherence the SVI is a better predictor than distance to the pharmacy.


I have no clue how to interview for data scientists

At work we have been expanding quite a bit, and this comes with many interviews for data science candidates (as well as MLOps and a few business analysts). For a bit, while we were between directors, I did the initial screening interviews (soft questions, get some background, just filter out those really out of their depth). I filtered very few people in the end, and I have no clue if that was good or bad. One of the hard things about interviews is that it is a very lossy feedback loop. You only really know if it worked out well 6+ months after it is over. And you only get on-policy feedback for those you hired – you don’t know if you filtered out someone who would have worked really well.

I have done more of the technical interviews recently, which are intended to be more discriminatory. I don’t believe I do a good job at this either, or at least everyone seems OK (no clear uber bad and no clear uber good). It has been particularly hard to hire people who can come in and be seniors/independent from the get-go – even for people with many years of experience, it isn’t clear to me they have the right experience to be really independent after on-boarding for a month.

For a while we did the homework thing – here is a dataset, do some data manipulation and fit a model. (You can see the homework I made on GitHub, longer version, shorter version.) We have stopped doing these though, partly because everyone’s homework looks the same – what I call copy-pasta code. So I feel asking people to spend 4-8 hours (or likely more) on homework is not worth the very little benefit it provides. The nature of simple homework assignments I believe is so superficial that I don’t think you can do them in a way that is amenable to anything but copy-pasta.

So now during the technical interview we do a grilling of questions. We have some regulars, but they do not appear to me to be really discriminatory either.

So I typically start by asking people to pick a project they think had the most value, and we do a deep dive into that. Many people who are good programmers (and some even with math degrees) don’t have what I would consider real fundamental knowledge of the models they are building. How many parameters do you have in your deep learning model? Because you oversampled, how did you recalibrate the predictions to correctly estimate false positives? How did you evaluate the return on investment of your model? I feel these are quite fair – I let you pick the best work you have done in your career! You should be quite familiar with it.

So now that I am unsure whether you should be left alone to build models, we migrate to some specific technical questions. These include: explain the difference between random forests and xgboost models, explain an ROC curve to a business person, what is the difference between a list and a tuple in python, if you had to query a million claims once a week and apply a predictive model how would you do it, etc. (I tend to focus more on math/stats than programming.)

Most people answer the examples above just fine (so they are maybe worthless – they don’t discriminate anyone). But a few questions pretty much everyone we interview fails. One is this question about calculating expected utility for auditing claims with different dollar values, which I shared on LinkedIn the other day:

Pretend you have a model that predicts the probability an insurance claim is fraudulent. You have two claims: one is $2,000 with a 50% probability, and one is $10,000 with a 20% probability. If you had to choose a single claim to audit, which one would you pick and why?
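The core arithmetic here is just an expected value comparison (risk preferences aside) – spelled out in python just to be concrete:

# expected recovery if you audit each claim
claim_value = [2000, 10000]
prob_fraud = [0.5, 0.2]

expected = [p*v for p, v in zip(prob_fraud, claim_value)]
print(expected)  # [1000.0, 2000.0], the $10,000 claim has the higher expected payoff

# the same logic gives a threshold rule:
# audit a claim when prob*value exceeds the cost of the audit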

Am I crazy to expect someone with a data science degree to be able to answer this question coherently? (Pretty close to every model we have focusing on health insurance claims at Gainwell looks like this in real life! It is not a weird, unrealistic scenario.) I have more generic variants as well, such as how do you take a predicted probability and a claim value and know whether it is worth it to audit, or how do you know where to set the threshold to audit claims given a model prediction? These questions appear too discriminatory, in that they filter out even very experienced individuals.

This and the threshold question kill me a little inside every time a senior person mangles the logic – it really signals to me that you don’t understand how to translate mathematical models into relevant business decisions. To be an independent data scientist this is a critical skill – you need it to know how to structure the model, as well as how to feed it back into whatever human or automated decision making process uses that model’s predictions. It is what distinguishes data scientists from software engineers – I view my role as an applied mathematician who knows how to code. (It is one of the reasons I think a PhD is valuable – being able to think mathematically in broader strokes like that is a typical step in the dissertation process.)

So I am stuck between questions everyone can answer and questions no one can answer. I feel like I might as well flip a coin after the initial entry level interview and not waste everyone’s time.

Use circles instead of choropleth for MSAs

We are homeschooling the kiddo at the moment (the plunge was prompted by reading Bryan Caplan’s approach, and by seeing with online schooling just how poor middle school education was). My wife is going through AP biology at the moment, and we looked up various job info on biomedical careers. I subsequently came across this gem of a map of MSA estimates from the Bureau of Labor Statistics (BLS) Occupational Employment and Wage Statistics (OES) series.

I was actually mapping some metropolitan statistical areas (MSAs) at work the other day, and these are just terrifically bad geographic areas to show via a choropleth map. All choropleth maps have the issue of varying-size areas, but I never realized that having somewhat regular borders (more straight lines) makes the state and county maps not so bad – these MSA areas though are tough to look at. (My wife says it scintillates for her if she looks too closely.)

There are various incredibly tiny MSAs next to giant ones that you will just never see in these maps (no matter what color scheme you use). Nevada confused me for quite a bit, until I zoomed in and saw that there are 4 areas – Reno is just a tiny squib.

Another example is Boulder above Denver. (Look closely at the BLS map I linked – you can just make out Boulder if you squint, but I cannot tell what color it corresponds to in the legend.) The outline-heavy OES maps, which are mostly missing data, are just hopeless to display like this effectively. Reno could be the hottest market for whatever job, and it will always be lost in this map with the choropleth approach. So of course I spent the weekend hacking together some maps in python and folium.

The BLS has a public API, but I was not able to find the OES stats in that. But if you go through the motions of querying the data and muck around in the source code for those queries, you can see they have an undocumented API call to generate json to fill the tables. Then using this tool to convert the json calls to python (thank you Hacker News), I was able to get those tables into python.

I have these functions saved on github, so check out that source for the nitty gritty. But just quickly, here is a replicated choropleth map showing the total employees for bio jobs (you can go here to look up the codes, or run my function bls_maps.ocodes() to get a pandas dataframe of those fields).

# Creating example bls maps
from bls_geo import *
import matplotlib.pyplot as plt

# can check out https://www.bls.gov/oes/current/oes_stru.htm
bio = '172031'
bio_stats = oes_geo(bio)
areas = get_areas() # this takes a few minutes
state = state_albers()
geo_bio = merge_occgeo(bio_stats,areas)

ax = geo_bio.plot(column='Employment',cmap='inferno',legend=True,zorder=2)
state.boundary.plot(ax=ax,color='grey',linewidth=0.5,zorder=1)
ax.set_ylim(0.1*1e6,3.3*1e6)
ax.set_xlim(-0.3*1e7,0.3*1e7)   # lower 48 focus (for Albers proj)
ax.set_axis_off()
plt.show()

And that is not much better than the BLS version. For this data, if you are just interested in looking up or seeing the top metro areas, a table, e.g. geo_bio.to_excel('biojobs.xlsx'), works just as well as a map.

So I was surprised to see Minneapolis pop up at the top of that list (and also surprised Raleigh doesn’t make the list at all, though Durham has a few jobs). But if you insist on seeing spatial trends, I prefer the approach of mapping proportional or graduated circles, placing the points at the centroid of the MSA:

att = ['areaName','Employment','Location Quotient','Employment per 1,000 jobs','Annual mean wage']
form = ['',',.0f','.2f','.2f',',.0f']

map_bio = fol_map(geo_bio,'Employment',['lat', 'lon'],att,form)
#map_bio.save('biomap.html')
map_bio #if in jupyter can render like this

I am too lazy to make a legend, but you can check out nbviewer to see an interactive Folium map, where I have tooltips (similar to the hover for the BLS maps).

Forgive my CSS/HTML skills, I am not sure how to make nicer popups. You do lose the exact areas these MSAs cover with this approach, but I really only expect a general sense from these maps anyway.

These functions are general enough for whatever wage series you want (although they will likely break when the 2021 data comes out). So here is the OES table for data science jobs:

I feel going for the 90th percentile (mapping that to the 10x programmer) is a bit over the top, but I can see myself reasonably justifying the 75th percentile. (Unfortunately these aggregate tables don’t have a way to adjust for years of experience – if you know of a BLS micro product I could do that with, let me know!) So you can see here the somewhat inflated salaries for the San Francisco Bay area, but not as inflated as many might have you think (and to be clear, these are for 2020 survey estimates).

If we look at a map of data science jobs, varying the circles by that 75th percentile annual wage, it looks quite uniform. What happens is we have some real low outliers (wages under 70k), resulting in tiny circles (such as Athens, GA). Most of the other metro regions though are well over 100k.

In more somber news, those interactive maps are built using Leaflet as the backend, which was created by a Ukrainian citizen, Vladimir Agafonkin. We can do amazing things with open source code, but we should always remember it is on the backs of someone’s labor that we are able to do those things.

Simulating data with numpy and scipy

I’ve answered a few different questions on forums recently about simulating data.

I figured it would be worth my time to put some of my notes down in a blog post. It is not just academic – not too infrequently I use simulations at work to generate estimated return on investment for putting a machine learning model into production. For social scientists this is pretty much equivalent to doing cost/benefit policy simulations.

Simulations are also useful for testing the behavior of different statistical tests, see prior examples on my blog for mixture trajectories or random runs.

Here I will be showing how to use mostly the python libraries numpy and scipy to do various simulation tasks. Some upfront setup (note I set the seed in numpy, which is important for reproducing your simulations):

import itertools as it
import numpy as np
import pandas as pd
np.random.seed(10)

# total simulations for different examples
n = 10000

# helper function to pretty print values
def pu(x,ax=1):
  te1,te2 = np.unique(x.sum(axis=ax),return_counts=True)
  te3 = 100*te2/te2.sum()
  res = pd.DataFrame(zip(te1,te2,te3),columns=['Unique','Count','Percent'])
  return res

Sampling from discrete choice sets

First, in the linked data science post, the individual had a very idiosyncratic set of simulations: simulate a binary vector of length 10 that has 2, 3, 4, or 5 ones in it. E.g. [1,1,0,0,0,0,0,0,0,0] is a valid solution, but [0,0,0,0,1,0,0,0,0,0] is not, since the latter only has one 1. One way to approach problems like these is to realize that the valid outcomes are a finite set of discrete solutions. Here I use itertools to generate all of the possible valid vectors (which easily fit into memory – only 627). Then I sample from that set of 627 possibilities:

# Lets create the whole sets of possible permutation lists
res = []
zr = np.zeros(10)
for i in range(2,6):
    for p in it.combinations(range(10),i):
        on = zr.copy()
        on[list(p)] = 1
        res.append(on.copy())

resnp = np.stack(res,axis=0)

# Now lets sample 1000 from this list
total_perms = resnp.shape[0]
samp = np.random.choice(total_perms,n)
res_samp = resnp[samp]

# Check to make sure this is OK
pu(res_samp)

Ultimately you want the simulation to represent reality. So pretend this scenario was: we randomly pick a number out of the set {2,3,4,5}, and then randomly distribute that many 1s in the length-10 vector. In that case, this sampling procedure does not reflect reality, because 2’s have fewer potential permutations than do 5’s. You can see this in the simulated proportions of rows with 2 ones (7.25%) vs rows with 5 ones (39.94%) in the above simulation.

We can fix that though by adjusting the sampling probabilities over the big set of 627 possibilities. Pretty much all of the np.random methods have an option to specify different marginal probabilities; in choice it defaults to equal probability.

# If you want the different subsets to have equal proportions
sum_pop = resnp.sum(axis=1)
tot_pop = np.unique(sum_pop,return_counts=True)
equal_prop = 1/tot_pop[1].shape[0]
pop_prob = pd.Series(equal_prop/tot_pop[1],index=tot_pop[0])
long_prob = pop_prob[sum_pop]

samp_equal = np.random.choice(total_perms,n,p=long_prob)
res_samp_equal = resnp[samp_equal]
pu(res_samp_equal)

So now we can see that each of those sets of results have similar marginal proportions in the simulation.

You can often figure out exact distributions for your simulations; for an example with similar discrete statistics, see my work on small sample Benford tests. But I often use simulations to check my math even when I do know how to figure out the exact distribution.

Another trick I commonly use is for applications that don’t have access to something equivalent to np.random.choice, but do have a random uniform generator (most applications do). Basically you can generate random numbers on whatever interval, chunk those up into bins, and turn those bins into your resulting categories. This is what I did in that SPSS post at the beginning.

unif = np.floor(np.random.uniform(1,33,(32*n,1)))/2
pu(unif)

But again this is not strictly necessary in python because we can generate the set/list of items and sample from that directly.

# Same as using random choice from that set of values
half_vals = np.arange(1,33)/2
unif2 = np.random.choice(half_vals,(32*n,1))
pu(unif2)

But if you are limited in your libraries that is a good trick to know. Note that most random number generators operate over (0,1) – open intervals, so they will never generate an exact 0 or 1. To get a continuous uniform distribution over whatever range, you can just rescale the 0/1 random number (shifting and multiplying, including into negative values) to match the range you want; a quick sketch is below. But again most programs let you input the min/max in a uniform number generator directly.
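Just to spell that rescaling out (a trivial sketch, with an arbitrary range picked for illustration):

# rescale a (0,1) uniform draw to an arbitrary range, here (-5, 10)
lo, hi = -5, 10
u01 = np.random.uniform(0, 1, 5)
u_scaled = lo + (hi - lo)*u01
print(u_scaled)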

Sampling different stat distributions

So real data can often be reasonably approximated via continuous distributions. If you want to generate different continuous distributions with approximate correlations, one approach is to:

  • generate multi-variate standard normal data (mean 0, variance 1, and correlations between those variables)
  • turn that matrix into a standard uniform via the CDF function
  • then for each column, use the inverse CDF function for the distribution of choice

Here is an example generating a uniform 0/1 variable and a Poisson variable with a mean of 3 using this approach.

from scipy.stats import norm, poisson

# Here generate multivariate standard normal with correlation 0.2
# then transform both to uniform
# Then transform 2nd column to Poisson mean 3

mu = [0,0]
cv = [[1,0.2],[0.2,1]] # covariance matrix, needs to be symmetric
mv = np.random.multivariate_normal(mu,cov=cv,size=n)
umv = pd.DataFrame(norm.cdf(mv),columns=['Uniform','Poisson'])
umv['Poisson'] = poisson.ppf(umv['Poisson'],3)
print(umv.describe())

This of course doesn’t guarantee the transformed data has the exact same correlation as the original specified multi-variate normal – you can check how much it drifts, as below. (If interested in more complicated scenarios, it will probably make sense to check out copulas.)
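A quick check of the drift, re-using the mv and umv objects from the block above:

# correlation after the transforms, vs the 0.2 specified on the normal scale
print(np.corrcoef(mv.T))  # normal scale, ~0.2 by construction
print(umv.corr())         # after the uniform/Poisson transforms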

Like I mentioned in the beginning, I am often dealing with processes of multiple continuous model predictions. E.g. one model that predicts a binary ‘this claim is bad’, and then a second model predicting ‘how bad is this claim in terms of $$$’. So chaining together simulations of complicated data (which can be mixtures of different things) can be useful to see overall behavior of a hypothetical system.

Here I chain together predicted probabilities for 3 claims (20%, 50%, 90%), and mean/standard deviations of the claim values of (1000,100), (100,20), (50,3). So pretend we can choose 1 of these three claims to audit. We have the predicted probability that the claim is wrong, as well as an estimate of how much money we would make if the claim is wrong (how wrong it is in dollar value).

The higher dollar values have higher variances, so do you pick the safe one, or do you pick the riskier audit with the higher upside? We can do a simulation to see the overall distribution:

pred_probs = np.array([0.2,0.5,0.9])
bin_out = np.random.binomial(n=1,p=pred_probs,size=(n,3))
print( bin_out.mean(axis=0) )

# Pretend have both predicted prob
# and distribution of values
# can do a simulation of both

val_pred = np.array([1000,100,50])
val_std = np.array([100,20,3])
val_sim = np.random.normal(val_pred,val_std,size=(n,3))
print(val_sim.mean(axis=0))
print(val_sim.std(axis=0))

revenue = val_sim*bin_out
print(revenue.mean(axis=0))
print(revenue.std(axis=0))

I could see a reasonably risk averse person picking the lowest dollar claim here to audit. Like I said in the discrete case, we can often figure out exact distributions. Here the expected value is easy, prob*val; the standard deviation is a little more tricky to calculate in your head (it is a mixture of a spike at 0 and then the rest of the typical distribution):

# Theoretical mean/variance
expected = pred_probs*val_pred
low_var = (expected**2)*(1-pred_probs)
hig_var = ((val_pred - expected)**2)*pred_probs
std_exp = np.sqrt(low_var + hig_var)
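# (this treats the claim value as fixed at val_pred; the simulated value
#  noise adds pred_probs*val_std**2 to these variances)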

But it still may be useful to do the simulation here anyway, as the distribution is lumpy (so just looking at mean/variance may be misleading).

Other common continuous distributions I use are the beta, to simulate between particular endpoints without the values being uniform, and the negative binomial, which is a good fit not only to many count distributions, but to things that are more hypothetically continuous (e.g. for distances instead of gamma, or for money instead of log-normal).

Here is an example generating a beta distribution between 0/1, with more of the distribution towards 0 than 1:

from scipy.stats import beta
import matplotlib.pyplot as plt

# mean probability a/(a+b) = 1/6
a,b = 1,5
beta_prob = beta.rvs(a, b, size=n)
plt.hist(beta_prob, bins=100,density=True,label='Sim')
# theoretical PDF
x = np.arange(0,1.01,0.01)
plt.plot(x,beta.pdf(x,a,b),label=f'beta({a},{b}) PDF')
plt.legend()
plt.show()

For the beta, with parameters a and b, the mean is a/(a+b). To make the distribution more spread out, you use smaller overall values of a and b (keeping their ratio fixed); larger values concentrate the distribution (I don’t have the conversion to variance memorized offhand). But if you have real data, you can plop that into scipy to fit the parameters; here I fix the location and scale parameters.

# If you want to fit beta based on observed data
fitbeta = beta.fit(beta_prob,floc=0,fscale=1)

We can do similar for the negative binomial. I often think of these in terms of regression dispersion parameters, and I have previously written about translating mean/dispersion to the n/p notation:

# Ditto for negative binomial
from scipy.stats import nbinom

mean_nb = 2
disp_nb = 4

def trans_np(mu,disp):
    x = mu**2/(1 - mu + disp*mu)
    p = x/(x + mu)
    n = mu*p/(1-p)
    return n,p

nb_n,nb_p = trans_np(mean_nb,disp_nb)
nb_sim = nbinom.rvs(nb_n,nb_p,size=(n,1))
nb_bars = pu(nb_sim)
plt.bar(nb_bars['Unique'],nb_bars['Percent'],label='Sim')

x = np.arange(0,nb_sim.max()+1)
plt.plot(x,nbinom.pmf(x,nb_n,nb_p)*100,linestyle='None',
         marker='o',markerfacecolor='k',ms=7,
         label='PMF Negative Binomial')
plt.legend()
plt.show()

Downloading geo files from Census FTP using python

I was working with some health data the other day that only has MSA identifiers. Not many people seem to know about the US Census’s FTP data site. Over the years they have had various terrible GUIs to download data, but I almost always just go to the FTP site directly.

For geo data, check out https://www2.census.gov/geo/tiger/TIGER2019/ for example. Pandas/geopandas in python also have the nicety that you can point to a url (even a url of a zip file), and load the data into memory. So getting the MSA areas was very simple:

# Example download MSA
import geopandas as gpd
from matplotlib import pyplot as plt

url_msa = r'https://www2.census.gov/geo/tiger/TIGER2019/CBSA/tl_2019_us_cbsa.zip'
msa = gpd.read_file(url_msa)
msa.plot()
plt.show()

Sometimes the census has files spread across multiple states. So here is an example of doing some simple scraping to get all of the census tracts in the US. You can combine geopandas dataframes the same as pandas dataframes using pd.concat:

# Example scraping all of the zip urls on a page
from bs4 import BeautifulSoup
import os
import pandas as pd
import re
import requests

def get_zip(url):
    front_page = requests.get(url,verify=False)
    soup = BeautifulSoup(front_page.content,'html.parser')
    zf = soup.find_all("a",href=re.compile(r"zip"))
    # Maybe should use href 
    zl = [os.path.join(url,i['href']) for i in zf]
    return zl

base_url = r'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/'
res = get_zip(base_url)

geo_tract = []
for surl in res:
    geo_tract.append(gpd.read_file(surl))

geo_full = pd.concat(geo_tract)

# See State FIPS codes
# https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

geo_full[geo_full['STATEFP'] == '01'].plot()
plt.show()

Unfortunately for the census data tables, such as https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/5_year_seq_by_state/Alabama/Tracts_Block_Groups_Only/, those zip files contain two files (an estimate file and a margin of error file), so you cannot just do pd.read_csv(url) for those tables – you would need to open the zip and pick the member file yourself, as sketched below. But for the shapefile zip files this appears to work just fine and dandy.
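One sketch of a workaround for those table zips (not something from my package – just the generic requests/zipfile approach, with the member file left for you to pick from the listing):

# Peek inside a zip with multiple member files and read one as a csv
import io
import zipfile
import pandas as pd
import requests

def read_zip_member(zip_url, member=None):
    # download the zip into memory and list what is inside
    r = requests.get(zip_url)
    zf = zipfile.ZipFile(io.BytesIO(r.content))
    print(zf.namelist())  # e.g. shows the estimate and margin of error files
    if member is not None:
        # read whichever member you want as a csv
        return pd.read_csv(zf.open(member))
    return None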

I am currently working on a project at work (but Gainwell has given me the thumbs up to open source it) to build tables to create the CDC’s Social Vulnerability Index, which I can build for multiple geographies in combo with the census data. So hopefully in the next few weeks I will be able to share that work.

Running files locally in SPSS

Say I made a little python script for a friend to scrape data from a website whenever they wanted updates. I write my python script, say scrape.py, and a run_scrape.bat file for my friend on their windows machine (or run_scrape.sh on Mac/Unix). And inside the bat file is the command:

python scrape.py

I tell my friend to save those two files in whatever folder they want – you just need to double click the bat file and it will save the scraped data into the same folder. Here the bat file is run locally, and it sets the current directory to wherever the bat file is located.

This is how the majority of code is packaged – you can clone from github or email a zip file, and it will just work no matter where the local user saves those scripts. I do need my friend to have their python environment set up correctly, but for most of the stuff I do, I can say download Anaconda and click yes to setting python on the path, and they are golden.

SPSS makes things more painful. Say I added SPSS to my path environment variable on my windows machine, and I run an SPSS production job from the command prompt:

cd "C:\Users\Andrew"
spss "print_dir.spj" -production silent

And say all the spj file does is call a syntax file show.sps, which has SHOW DIR as its only command. This still prints out wherever SPSS is installed as the current working directory inside of the SPSS session – on my machine currently C:\Program Files\IBM\SPSS Statistics. So SPSS takes over the location of the current directory. We can also open up the spj file (it is just a plain text xml file). Here is what a current spj file looks like for me (note it is all on one line as well!):

And that file also has several hard coded file locations. So to get the same behavior as python scrape.py earlier, we need to dynamically set the paths in the production job as well, not just alter the command line scripts. This can be done with a little command line magic in windows, dynamically replacing the right text in the spj file. So in a bat file, you can do something like:

@echo on
set "base=%cd%"
:: code to define SPJ (SPSS production file)
echo ^<?xml version=^"1.0^" encoding=^"UTF-8^" standalone=^"no^"?^>^<job xmlns=^"http://www.ibm.com/software/analytics/spss/xml/production^" codepageSyntaxFiles=^"false^" print=^"false^" syntaxErrorHandling=^"continue^" syntaxFormat=^"interactive^" unicode=^"true^" xmlns:xsi=^"http://www.w3.org/2001/XMLSchema-instance^" xsi:schemaLocation=^"http://www.ibm.com/software/analytics/spss/xml/production http://www.ibm.com/software/analytics/spss/xml/production/production-1.4.xsd^"^>^<locale charset=^"UTF-8^" country=^"US^" language=^"en^"/^>^<output imageFormat=^"jpg^" imageSize=^"100^" outputFormat=^"text-codepage^" outputPath=^"%base%\job_output.txt^" tableColumnAutofit=^"true^" tableColumnBorder=^"^|^" tableColumnSeparator=^"space^" tableRowBorder=^"-^"/^>^<syntax syntaxPath=^"%base%\show.sps^"/^>^<symbol name=^"setdir^" quote=^"true^"/^>^</job^> > transfer_job.spj
"C:\Program Files\IBM\SPSS Statistics\stats.exe" "%base%\transfer_job.spj" -production silent -symbol @setdir "%base%"

It would be easier to use sed to find/replace the text for the spj file instead of the super long one-liner with echo, but I don’t know if Windows always has sed installed. Also note the escape characters (it is crazy how windows parses this single long line – apparently the max length is around 32k characters though).

You can see in the call to the production job, I pass a parameter, @setdir, and expand it out in the shell using %base%. In show.sps, I now have this line:

CD @setdir.

And now SPSS has set the current directory to wherever you have the .bat file and .sps syntax file saved. So now everything is dynamic, and runs wherever you have all the files saved. The only thing that is not dynamic in this setup is the location of the SPSS executable, stats.exe. So if you are sharing SPSS code like this, you will need to either tell your friend to add C:\Program Files\IBM\SPSS Statistics to their environment path, or edit the .bat file to the correct path; otherwise this runs dynamically in the local folder with all the materials.