Use circles instead of choropleth for MSAs

We are homeschooling the kiddo at the moment (the plunge was reading by Bryan Caplan’s approach, and seeing with online schooling just how poor middle school education was). Wife is going through AP biology at the moment, and we looked up various job info on biomedical careers. Subsequently came across this gem of a map of MSA estimates from the Bureau of Labor Stats (BLS) Occupational Employment and Wage Stats series (OES).

I was actually mapping some metro stat areas (MSAs) at work the other day, and these are just terrifically bad geo areas to show via a choropleth map. All choropleth maps have the issue of varying size areas, but I never realized having somewhat regular borders (more straight lines) makes the state and county maps not so bad – these MSA areas though are tough to look at. (Wife says it scintillates for her if she looks too closely.)

There are various incredibly tiny MSAs next to giant ones that you will just never see in these maps (no matter what color scheme you use). Nevada confused for me quite a bit, until I zoomed in to see that there are 4 areas, Reno is just a tiny squib.

Another example is Boulder above Denver. (Look closely at the BLS map I linked, you can just make out Boulder if you squint, but I cannot tell what color it corresponds to in the legend.) The outline heavy OES maps, which are mostly missing data, are just hopeless to display like this effectively. Reno could be the hottest market for whatever job, and it will always be lost in this map if you show employment via the choropleth approach. So of course I spent the weekend hacking together some maps in python and folium.

The BLS has a public API, but I was not able to find the OES stats in that. But if you go through the motions of querying the data and muck around in the source code for those queries, you can see they have an undocumented API call to generate json to fill the tables. Then using this tool to convert the json calls to python (thank you Hacker News), I was able to get those tables into python.

I have these functions saved on github, so check out that source for the nitty gritty. But just here quickly, here is a replicated choropleth map, showing the total employees for bio jobs (you can go to here to look up the codes, or run my function bls_maps.ocodes() to get a pandas dataframe of those fields).

# Creating example bls maps
from bls_geo import *

# can check out https://www.bls.gov/oes/current/oes_stru.htm
bio = '172031'
bio_stats = oes_geo(bio)
areas = get_areas() # this takes a few minutes
state = state_albers()
geo_bio = merge_occgeo(bio_stats,areas)

ax = geo_bio.plot(column='Employment',cmap='inferno',legend=True,zorder=2)
state.boundary.plot(ax=ax,color='grey',linewidth=0.5,zorder=1)
ax.set_ylim(0.1*1e6,3.3*1e6)
ax.set_xlim(-0.3*1e7,0.3*1e7)   # lower 48 focus (for Albers proj)
ax.set_axis_off()
plt.show()

And that is not much better than BLSs version. For this data, if you are just interested in looking up or seeing the top metro areas, just doing a table, e.g. above geo_bio.to_excel('biojobs.xlsx'), works just as well as a map.

So I was surprised to see Minneapolis pop up at the top of that list (and also surprised Raleigh doesn’t make the list at all, but Durham has a few jobs). But if you insist on seeing spatial trends, I prefer to go the approach of mapping proportion or graduate circles, placing the points at the centroid of the MSA:

att = ['areaName','Employment','Location Quotient','Employment per 1,000 jobs','Annual mean wage']
form = ['',',.0f','.2f','.2f',',.0f']

map_bio = fol_map(geo_bio,'Employment',['lat', 'lon'],att,form)
#map_bio.save('biomap.html')
map_bio #if in jupyter can render like this

I am too lazy to make a legend, you can check out nbviewer to see an interactive Folium map, which I have tool tips (similar to the hover for the BLS maps).

Forgive my CSS/HTML skills, not sure how to make nicer popups. So you lose the exact areas these MSA cover in this approach, but I really only expect a general sense from these maps anyway.

These functions are general enough for whatever wage series you want (although these functions will likely break when the 2021 data comes out). So here is the OES table for data science jobs:

I feel going for the 90th percentile (mapping that to the 10 times programmer) is a bit too over the top. But I can see myself reasonably justifying 75th percentile. (Unfortunately these agg tables don’t have a way to adjust for years of experience, if you know of a BLS micro product I could do that with let me know!). So you can see here the somewhat inflated salaries for the SanFran Bay area, but not as inflated as many might have you think (and to be clear, these are for 2020 survey estimates).

If we look at map of data science jobs, varying the circles by that 75th annual wage percentile, it looks quite uniform. What happens is we have some real low outliers (wages under 70k), resulting in tiny circles (such as Athen’s GA). Most of the other metro regions though are well over 100k.

In more somber news, those interactive maps are built using Leaflet as the backend, which was create by a Ukranian citizen, Vladimir Agafonkin. We can do amazing things with open source code, but we should always remember it is on the backs of someones labor we are able to do those things.

Using Natural Frequencies to Reason about Tests

Hackernews shared an article about visualizing Bayes Theorem using Venn diagrams, with the common example of going from sensitivity/specificity of a test to switch the probability of ‘do I have cancer’. I actually do not find Bayes Theorem to be really helpful to think through this problem. I much prefer Gerd Gigerenzer’s approach using natural frequencies.

So here is an example, say we have a Covid rapid test that has a sensitivity of 90%. This metric is “if you have Covid, how often does the test say you have Covid”. In conditional probability terms we would write this P( Test Says + | Actually Have Covid ).

Often times we want to know the opposite conditional though, “if we test positive, what is the actual probability we have Covid?”. So we want to know P( Actually Have Covid | Test Says + ). This is switching the conditional from the earlier metric. To figure that out though, we need some more information though. One thing we need is an estimate of “Actually Have Covid” in the overall population. Pretend this is 5% for our scenario.

Another is the false positive rate of the test, P( Test Says + | Do Not Have Covid ). Lets say for our example this is 10%. Now how do we figure out P( Actually Have Covid | Test Says + )? Lets use this nice little tree diagram, where we consider a hypothetical scenario of 1000 people getting Covid rapid tests:

So first thing to note is that the first split in the tree is something that doesn’t have anything to do with the tests at all – it is the overall proportion of Covid cases in the population. Here 5% means out of our 1000 people that take rapid tests, we expect only 50 of them to actually have Covid.

The second level of the splits are the two data based pieces of information, the sensitivity is the split on the left side. The test captures 90% of the true positives, so 45 out of the 50 get a positive result on the rapid test. The right split is the false negative rate of 10%, of those 950 without Covid, 95 will get a false positive result.

So in the end, we have 45 true positives out of 45 + 95 total positive tests (the colored balls in the tree diagram). That is our personal estimate of I tested positive for Covid, do I actually have Covid, P( Actually Have Covid | Test Says + ). This estimate is 32%, or a metric we sometimes like to call the Positive Predictive Value of the test. I have an Excel spreadsheet where you can insert different overall proportions in the underlying population (the prevalence) as well as different estimates of a tests sensitivity/false-positive rate, and it spits out the nice diagram.

In the end even if you have a very sensitive test with a low false positive rate, if the prevalence is very low you will have a low positive predictive value. For example in my work, in predicting chronic offenders involved with shootings, among the top 1000 predictions the PPV was only 3%. Because shooting someone or being shot is very rare. If you compare those top 1000 they have around 200 times higher probability than a random person, but still overall that probability is quite low.

Note for data science folks that these different metrics all relate to the ROC curve. In my head I often translate sensitivity to “capture rate” – the proportion of true cases the test captures in the wild, I can never remember sensitivity. Sometimes in ROC curves I label this as the True Positive Rate, although software often uses 1 – Specificity vs Sensitivity.

Based on the above scenario you may be thinking to yourself ‘I can’t fill out all of the unknowns’, especially such as knowing the underlying prevalence of the outcome in the population. For overall prevalence’s you can often make reasonable guesses. For error rates for tests, that often takes some digging or guess work as to typical error rates though. So even if we don’t have uber precise estimates, we can still make reasonable decisions based on the range of likely scenarios filling in the info we need.

Knowing the Right Conditional Probability?

While in general I like the tree diagram approach, it doesn’t help with the more general issue of knowing the correct probability you care about to make decisions. Totally agree with Gigerenzer’s overall approach (he is basically glass half full compared to Kahneman/Tversky, if you present info the right way we don’t make so biased as decisions), but even scientific experts often make conditional probability mistakes.

Much of the public discussion about Covid in particular is making a specious set of probability estimates to justify positions. Just copy-pasting my Hackernews comment giving examples from the Rogan/Gupta interview, but think that is very symbolic overall of the news coverage:

I think a more basic problem is people don’t even know how to formulate meaningful probability hypothetical/counterfactuals to begin with (let alone update them). For Covid stuff pretty much all an individual should care about is:

P(Bad Stuff | I take action A) > P(Bad Stuff | I take action B)

So you take action B in this scenario (simplifying to ignore costs, many of these decisions are costless). We get a bunch of meaningless drivel in the news though that doesn’t help anyone make any meaningful probability estimates to help them make decisions. I think the Rogan/Gupta interview is a good example. We get various non-sense comparisons such as:

P(Bad Stuff | Covid, Person 5, No Vaccine) < P(Bad Stuff | Covid, Person 50, Vaccine)

[Rogan to Gupta why is it OK for 50 year old to feel safe and not a young kid without a vaccine? Irrelevant counter-factual unless someone invents a reverse aging treatment.]

P(Heart Inflammation | Covid Vaccine, Young Male) > P(Death | Covid, Young Male)

[Rogan saying side effects for young people outweigh benefits. This is true but death is quite a bit worse than the side effects, and this does not consider other Bad Stuff from Covid like long haulers.]

Knowing Bayes Theorem doesn’t help someone figure out the right probability statement they should be interested in to begin with.

The news overall is very frustrating, especially scientific coverage. Often times news entirely avoids any specific numeric estimate, but is often not even clear what the estimate is. A common one from news coverage is P(Bad Thing | ?????), I often can’t tell what composes the conditional on the right hand side. Sometimes it is those tested positive, sometimes it is the overall population in an area, sometimes it is those in the hospital, etc. Those different conditionals can greatly change how you interpret that information.

References

  • Gigerenzer, G. (2011). What are natural frequencies?. BMJ, 343.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The accuracy of the violent offender identification directive tool to predict future gun violence. Criminal Justice and Behavior, 46(5), 770-788.

Regression Discontinuity Designs

Regression Discontinuity Designs (RDD) are one way to evaluate predictive model systems with causal outcomes associated with what you do with that information. For a hypothetical example, imagine you have a predictive model assessing the probability that you will be diagnosed with diabetes in the next two years. Those that score above 30% probability get assigned a caseworker, to try to prevent that individual from contracting diabetes. How do you know how effective that caseworker is in reducing diabetes in those high risk individuals?

The RDD design works like this – you have your running variable (here the predicted probability), and the threshold (over 30%) that gets assigned a treatment. You estimate the probability of the actual outcome (it may be other outcomes besides just future diabetes, such as the caseworker may simply reduce overall medical costs even if the person still ends up being diagnosed with diabetes). You then estimate the dip in the predicted outcome just before and just after the threshold. The difference in those two curves is the estimated effect.

Here is an example graph illustrating (with fake data I made, see the end of the post). The bump in the line (going from the blue to the red) is then the average treatment effect of being assigned a caseworker, taking into account the underlying trend that higher risk people here are more likely to have higher medical bills.

A few things to note about RDD – so there is a tension between estimating the underlying curve and the counterfactual bump at the threshold. Theoretically values closer to the threshold should be more relevant, so some (see the Wikipedia article linked earlier) try to estimate non-linear weighted curves, giving cases closer to the threshold higher weights. This often produces strange artifacts (that Andrew Gelman likes to point out on his blog) that can miss-estimate the RDD effect. This is clearly the case in noisy data if the line dips just before the threshold and curves up right after the threshold.

So you can see in my code at the end I prefer to estimate this using splines, as opposed to weighted estimators that have a bit of noise. (Maybe someday I will code up a solution to do out of sample predictive evaluations for this as an additional check.) And with this approach it is easy to incorporate other covariates (and look at treatment heterogeneity if you want). Note that while wikipedia says the weighted estimator is non-parametric this is laughably wrong (it is two straight lines in their formulation, a quite restrictive for actually) – while I can see some theoretical justification for the weighted estimators, in practice these are quite noisy and not very robust to minor changes in the weights (and you always need to make some parametric assumptions in this design, both for the underlying curve as well as the standard error around the threshold bump).

There are additional design details you need to be worried about, such as fuzzy treatments or self selection. E.g. if a person can fudge the numbers a bit and choose which side of the line they are on, it can cause issues with this design. In those cases there may be a reasonable workaround (e.g. an instrumental variable design), but the jist of the research design will be the same.

Last but not least, to actually conduct this analysis you need to cache the original continuous prediction. In trying to find real data for this blog post, many criminal justice examples of risk assessment people only end up saving the final categories (low, medium, high) and not the original continuous risk instrument.

If folks have a good public data example that I could show with real data please let me know! Scouting for a bit (either parole/probabation risk assessment, or spatial predictive policing) has not turned up any very good examples for me to construct examples with (also health examples would be fine as well).


This is the simulated data (in python), the RDD graph, and the statsmodel code for how I estimate the RDD bump effect. You could of course do more fancy things (such as penalize the derivatives for the splines), but this I think would be a good way to estimate the RDD effect in many designs where appropriate.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

########################################
# Simulating data
np.random.seed(10)
n_cases = 3000 # total number of cases
# pretend this is predicted prob
prob = np.random.beta(3,10,size=n_cases)
# pretend this is med costs over a year
med_cost = 3000 + 5000*prob + -500*(prob > 0.3) + np.random.normal(0,500,n_cases)
df = pd.DataFrame(zip(prob,med_cost), columns=['Prob','MedCost'])
# could do something fancier with non-linear effects for prob
########################################

########################################
# Fitting regression model

# Knots are small distance from threshold
# (Could also do a knot right on threshold)
mod = smf.ols(formula='MedCost ~ bs(Prob,knots=[0.2,0.25,0.35,0.4]) + I(Prob > 0.3)', data=df)
res = mod.fit()
print(res.summary())
########################################

########################################
# Plotting fit

# Getting standard errors
prob_se = res.get_prediction().summary_frame()
prob_se['Prob'] = prob
prob_se.sort_values(by='Prob',inplace=True,ignore_index=True)
low = prob_se[prob_se['Prob'] <= 0.3].copy()
high = prob_se[prob_se['Prob'] > 0.3].copy()

# Getting effect for threshold bump
coef = res.summary2().tables[1]
ci = coef.iloc[1,4:6].astype(int).to_list()

fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(df['Prob'], df['MedCost'], c='grey',
           edgecolor='k', alpha=0.15, s=5, zorder=1)
ax.axvline(0.3, linestyle='solid', alpha=1.0, 
           color='k',linewidth=1, zorder=2)
ax.fill_between(low['Prob'],low['mean_ci_lower'],
                low['mean_ci_upper'],alpha=0.6,
                zorder=3, color='darkblue')
ax.fill_between(high['Prob'],high['mean_ci_lower'],
                high['mean_ci_upper'],alpha=0.6,
                zorder=3, color='red')
ax.set_xlabel('Predicted Prob Diabetes')
ax.set_ylabel('Medical Costs')
ax.set_title(f'RDD Reduced Cost Estimate {ci[0]} to {ci[1]} (95% CI)')
ax.text(0.3,6500,'Threshold',rotation=90, size=9,
         ha="center", va="center",
         bbox=dict(boxstyle="round", ec='k',fc='grey'))
plt.savefig('RDD.png', dpi=500, bbox_inches='tight')
########################################

Using precision to plan experiments (SPSS)

SPSS Version 28 has some new nice power analysis facilities. Here is some example code to test the difference in proportions, with sample sizes equal between the two groups, and the groups have an underlying proportion of 0.1 and 0.2.

POWER PROPORTIONS INDEPENDENT
  /PARAMETERS TEST=NONDIRECTIONAL SIGNIFICANCE=0.05 POWER=0.6 0.7 0.8 0.9 NRATIO=1 
    PROPORTIONS=0.1 0.2 METHOD=CHISQ ESTIMATE=NORMAL CONTINUITY=TRUE POOLED=TRUE
  /PLOT TOTAL_N.

So this tells you the sample size to get a statistically significant difference (at the 0.05 level) between two groups for a proportion difference test (here just a chi square test). And you can specify the solution for different power estimates (here a grid of 0.6 to 0.9 by 0.1), as well as get a nice plot.

So this is nice for academics planning factorial experiments, but I often don’t personally plan experiments this way. I typically get sample size estimates via thinking about precision of my estimates. In particular I think to myself, I want subgroup X to have a confidence interval width of no more than 5%. Even if you want to extend this to testing the differences between groups this is OK (but will be conservative), because even if two confidence intervals overlap the two groups can still be statistically different (see this Geoff Cumming reference).

So for example a friend the other day was asking about how to sample cases for an audit, and they wanted to do subgroup analysis in errors between different racial categories. But it was paper records, so they needed to just get an estimate of the total records to sample upfront (can’t really control the subgroup sizes). I suggested to estimate the precision they wanted in the smallest subgroup of interest for the errors, and base the total sample size of the paper audit on that. This is much easier to plan than worrying about a true power analysis, in which you also need to specify the expected differences in the groups.

So here is a macro I made in SPSS to generate precision estimates for binomial proportions (Clopper Pearson exact intervals). See my prior blog post for different ways to generate these intervals.

* Single proportion, precision.
DEFINE !PrecisionProp (Prop = !TOKENS(1)
                      /MinN = !TOKENS(1)
                      /MaxN = !TOKENS(1)
                      /Steps = !DEFAULT(100) !TOKENS(1) 
                      /DName = !DEFAULT("Prec") !TOKENS(1) )
INPUT PROGRAM.
COMPUTE #Lmi = LN(!MinN).
COMPUTE #Step = (LN(!MaxN) - #Lmi)/!Steps.
LOOP #i = 1 TO (!Steps + 1).
  COMPUTE N = EXP(#Lmi + (#i-1)*#Step).
  COMPUTE #Suc = N*!Prop.
  COMPUTE Prop = !Prop.
  COMPUTE Low90 = IDF.BETA(0.05,#Suc,N-#Suc+1).
  COMPUTE High90 = IDF.BETA(0.95,#Suc+1,N-#Suc).
  COMPUTE Low95 = IDF.BETA(0.025,#Suc,N-#Suc+1).
  COMPUTE High95 = IDF.BETA(0.975,#Suc+1,N-#Suc).
  COMPUTE Low99 = IDF.BETA(0.005,#Suc,N-#Suc+1).
  COMPUTE High99 = IDF.BETA(0.995,#Suc+1,N-#Suc).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME !DName.
FORMATS N (F10.0) Prop (F3.2).
EXECUTE.
!ENDDEFINE.

This generates a new dataset, with a baseline proportion of 50%, and shows how the exact confidence intervals change with increasing the sample size. Here is an example generating confidence intervals (90%,95%, and 99%) for sample sizes from 50 to 3000 with a baseline proportion of 50%.

!PrecisionProp Prop=0.5 MinN=50 MaxN=3000.

And we can go and look at the generated dataset. For sample size of 50 cases, the 90% CI is 38%-62%, the 95% CI is 36%-64%, and the 99% CI is 32%-68%.

For sample sizes of closer to 3000, you can then see how these confidence intervals decrease in width to changes in 4 to 5%.

But I really just generated the data this way to do a nice error bar chart to visualize:

*Precision chart.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Prop N Low90 High90 Low95 High95 Low99 High99
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: N=col(source(s), name("N"))
  DATA: Prop=col(source(s), name("Prop"))
  DATA: Low90=col(source(s), name("Low90"))
  DATA: High90=col(source(s), name("High90"))
  DATA: Low95=col(source(s), name("Low95"))
  DATA: High95=col(source(s), name("High95"))
  DATA: Low99=col(source(s), name("Low99"))
  DATA: High99=col(source(s), name("High99"))
  DATA: N=col(source(s), name("N"))
  GUIDE: text.title(label("Base Rate 50%"))
  GUIDE: axis(dim(1), label("Sample Size"))
  GUIDE: axis(dim(2), delta(0.05))
  SCALE: linear(dim(1), min(50), max(3000))
  ELEMENT: area.difference(position(region.spread.range(N*(Low99+High99))), color.interior(color.grey), 
           transparency.interior(transparency."0.6"))
  ELEMENT: area.difference(position(region.spread.range(N*(Low95+High95))), color.interior(color.grey), 
           transparency.interior(transparency."0.6"))
  ELEMENT: area.difference(position(region.spread.range(N*(Low90+High90))), color.interior(color.grey), 
           transparency.interior(transparency."0.6"))
  ELEMENT: line(position(N*Prop))
END GPL.

So here the lightest gray area is the 99% CI, the second lightest is the 95%, and the darkest area is the 90% CI. So here you can make value trade offs, is it worth it to get an extra precision of under 10% by going from 500 samples to 1000 samples? Going from 1500 to 3000 you don’t gain as much precision as going from 500 to 1000 (diminishing returns).

It is easier to see the progression of the CI’s when plotting the sample size on a log scale (or sometimes square root), although often times costs of sampling are on the linear scale.

You can then change the macro arguments if you know your baseline is likely to not be 50%, but say here 15%:

!PrecisionProp Prop=0.15 MinN=50 MaxN=3000.

So you can see in this graph that the confidence intervals are smaller in width than the 50% – a baseline of 50% results in the largest variance estimates for binary 0/1 data, so anything close to 0% or 100% will result in smaller confidence intervals. So this means if you know you want a precision of 5%, but only have a rough guess as to the overall proportion, planning for 50% baseline is the worst case scenario.

Also note that when going away from 50%, the confidence intervals are asymmetric. The interval above 15% is larger than the interval below.

A few times at work I have had pilots that look something like “historical hit rates for these audits are 10%, I want the new machine learning model to increase that to 15%”.

So here I can say, if we can only easily use a sample of 500 cases for the new machine learning model, the precision I expect to be around 11% to 19%. So cutting a bit close to be sure it is actually above the historical baseline rate of 10%. To get a more comfortable margin we need sample sizes of 1000+ in this scenario.

The Big 2020 Homicide Increase in Context

Jeff Asher recently wrote about the likely 2020 increase in Homicides, stating this is an unprecedented increase. (To be clear, this is 2020 data! Homicide reporting data in the US is just a few months shy of a full year behind.)

In the past folks have found me obnoxious, as I often point to how homicide rates (even for fairly large cities), are volatile (Wheeler & Kovandzic, 2018). Here is an example of how I thought the media coverage of the 2015/16 homicide increase was overblown.

I actually later quantified this more formally with then students Haneul Yim and Jordan Riddell (Yim et al., 2020). We found the 2015 increase was akin to when folks on the news say a 1 in 100 year flood. So I was wrong in terms of it was a fairly substantive increase relative to typical year to year changes. Using the same methods, I updated the charts to 2020 data, and 2020 is obviously a much larger increase than the ARIMA model we fit based on historical data would expect:

Looking at historical data, people often argue “it isn’t as high as the early 90’s” – this is not the point though I really intended to make (it is kind of silly to make a normative argument about the right or acceptable number of homicides) – but I can see how I conflated those arguments. Looking at the past is also about understanding the historical volatility (what is the typical year to year change). Here this is clearly a much larger swing (up or down) than we would expect based on the series where we have decent US coverage (going back to 1960).


For thinking about crime spikes, I often come from a place in my crime analyst days where I commonly was posed with ‘crime is on the rise’ fear in the news, and needed to debunk it (so I could get back to doing analysis on actual crime problems, not imaginary ones). One example was a convenience store was robbed twice in the span of 3 days, and of course the local paper runs a story crime is on the rise. So I go and show the historical crime trends to the Chief and the Mayor that no, commercial robberies are flat. And even for that scenario there were other gas stations that had more robberies in toto when looking at the data in the past few years. So when the community police officer went to talk to that convenience store owner to simply lock up his cash in more regular increments, I told that officer to also go to other stores and give similar target hardening advice.

Another aspect of crime trends is not only whether a spike is abnormal (or whether we actually have an upward trend), but what causes it. I am going to punt on that – in short it is basically impossible in normal times to know what caused short term spikes absent identifying specific criminal groups (which is not so relevant for nationwide spikes, but even in large cities one active criminal group can cause observable spikes). We have quite a bit of crazy going on at the moment – Covid, BLM riots, depolicing – I don’t know what caused the increase and I doubt we will ever have a real firm answer. We cannot run an experiment to see why these increases occurred – it is mostly political punditry pinning it on one theory versus another.

For the minor bit it is worth – the time series methods I use here signal that the homicide series is ARIMA(1,1,0) – which means both an integrated random walk component and a auto-regressive component. Random walks will occur in macro level data in which the micro level data are a bunch of AR components. So this suggests a potential causal attribution to increased homicides is homicides itself (crime begets more crime). And this can cause run away effects of long upwards/downwards trends. I don’t know of a clear way though to validate that theory, nor any obvious utility in terms of saying what we should do to prevent increases in homicides or stop any current trends. Even if we have national trends, any intervention I would give a thumbs up to is likely to be local to a particular municipality. (Thomas Abt’s Bleeding Out is about the best overview of potential interventions that I mostly agree with.)

References

  • Wheeler, A. P., & Kovandzic, T. V. (2018). Monitoring volatile homicide trends across US cities. Homicide Studies, 22(2), 119-144.
  • Yim, H. N., Riddell, J. R., & Wheeler, A. P. (2020). Is the recent increase in national homicide abnormal? Testing the application of fan charts in monitoring national homicide trends over time. Journal of Criminal Justice, 66, 101656.

Notes for the 2019/2020 updated homicide data. 2019 data is available from the FBI page, 2020 homicide data I have taken from estimates at USA Facts and total USA pop is taken from Google search results.

R code and data to replicate the chart can be downloaded here.

KDE plots for predicted probabilities in python

So I have previously written about two plots post binary prediction models – calibration plots and ROC curves. One addition to these I am going to show are kernel density estimate plots, broken down by the observed value vs predicted value. One thing in particular I wanted to make these for is to showcase the distribution of the predicted probabilities themselves, which can be read off of the calibration chart, but is not as easy.

I have written about this some before – transforming KDE estimates from logistic to probability scale in R. I will be showing some of these plots in python using the seaborn library. It will be easier instead of transforming the KDE to use edge weighting statistics to get unbiased estimates near the borders for the way the seaborn library is set up.

To follow along, you can download the data I will be using here. It is the predicted probabilities from the test set in the calibration plot blog post, predicting recidivism using several different models.

First to start, I load my python libraries and set my matplotlib theme (which is also inherited by seaborn charts).

Then I load in my data. To make it easier I am just working with the test set and several predicted probabilities from different models.

import pandas as pd
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#####################
# My theme

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
#####################

And here I am reading in the data (just have the CSV file in my directory where I started python).

################################################################
# Reading in the data with predicted probabilites
# Test from https://andrewpwheeler.com/2021/05/12/roc-and-calibration-plots-for-binary-predictions-in-python/
# https://www.dropbox.com/s/h9de3xxy1vy6xlk/PredProbs_TestCompas.csv?dl=0

pp_data = pd.read_csv(r'PredProbs_TestCompas.csv',index_col=0)
print(pp_data.head())

print(pp_data.describe())
################################################################

So you can see this data has the observed outcome Recid30 – recidivism after 30 days (although again this is the test dataset). And then it also has the predicted probability for three different models (XGBoost, RandomForest, and Logit), and then demographic breakdowns for sex and race.

The plot I am interested in seeing is a KDE estimate for the probabilities, broken down by the observed 0/1 for recidivism. Here is the default graph using seaborn:

# Original KDE plot by 0/1
sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
            common_norm=False, bw_method=0.15)

One problem you can see with this plot though is that the KDE estimates are smoothed beyond the data. You cannot have a predicted probability below 0 or above 1. Because we are using a gaussian kernel, we can just reweight observations that are close to the edge, and then clip the KDE estimate. So a predicted probability of 0 would get a weight of 1/0.5 – so it gets double the weight. Note to do this correctly, you need to set the bandwidth the same for the seaborn kdeplot as well as the weights calculation – here 0.15.

# Weighting and clipping
# Amount of density below 0 & above 1
below0 = norm.cdf(x=0,loc=pp_data['Logit'],scale=0.15)
above1 = 1- norm.cdf(x=1,loc=pp_data['Logit'],scale=0.15)
pp_data['edgeweight'] = 1/ (1 - below0 - above1)

sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
            common_norm=False, bw_method=0.15,
            clip=(0,1), weights='edgeweight')

This results in quite a dramatic difference, showing the model does a bit better than the original graph. The 0’s were well discriminated, so have many very low probabilities that were smoothed outside the legitimate range.

Another cool plot you can do that again shows calibration is to use seaborn’s fill option:

cum_plot = sns.kdeplot(data=pp_data, x="Logit", hue="Recid30", 
                       common_norm=False, bw_method=0.15,
                       clip=(0,1), weights='edgeweight', 
                       multiple="fill", legend=True)
cum_plot.legend_._set_loc(4) #via https://stackoverflow.com/a/64687202/604456

As expected this shows an approximate straight line in the graph, e.g. 0.2 on the X axis should be around 0.2 for the orange area in the chart.

Next seaborn has another good function here, violin plots. Unfortunately you cannot pass a weight function here. But another option is to simply resample your data a large number of times, using the weights you provided earlier.

n = 1000000 #larger n will result in more accurate KDE
resamp_pp = pp_data.sample(n=n,replace=True, weights='edgeweight',random_state=10)

viol_sex = sns.violinplot(x="Sex", y="XGB", hue="Recid30",
                          data=resamp_pp, split=True, cut=0, 
                          bw=0.15, inner=None,
                          scale='count', scale_hue=False)
viol_sex.legend_.set_bbox_to_anchor((0.65, 0.95))

So here you can see we have more males in the sample, and they have a larger high risk blob that was correctly identified. Females have a risk profile more spread out, although there is a small clump of basically 0 risk that the model identifies.

You can also generate the graph so the areas for the violin KDE’s are normalized, so in both the original and resampled data we have fewer females, and more black individuals.

# Values for Sex for orig/resampled
print(pp_data['Sex'].value_counts(normalize=True))
print(resamp_pp['Sex'].value_counts(normalize=True))

# Values for Race orig/resampled
print(pp_data['Race'].value_counts(normalize=True))
print(resamp_pp['Race'].value_counts(normalize=True))

But if we set scale='area' in the chart the violins are the same size:

viol_race = sns.violinplot(x="Race", y="XGB", hue="Recid30",
                           data=resamp_pp, split=True, cut=0, 
                           bw=0.15, inner=None,
                           scale='area', scale_hue=True)
viol_race.legend_.set_bbox_to_anchor((0.81, 0.95))

I will have to see if I can make some time to contribute to seaborn to make it so you can pass in weights to the violinplot function.

ROC and calibration plots for binary predictions in python

When doing binary prediction models, there are really two plots I want to see. One is the ROC curve (and associated area under the curve stat), and the other is a calibration plot. I have written a few helper functions to make these plots for multiple models and multiple subgroups, so figured I would share, binary plots python code. To illustrate their use, I will use the same Compas recidivism data I have used in the past, (CSV file here). So once you have downloaded those two files you can follow along with my subsequent code.

Front Prep

First, I have downloaded the binary_plots.py file and the PreppedCompas.csv file to a particular folder on my machine, D:\Dropbox\Dropbox\Documents\BLOG\binary_plots. To import these functions, I append that path using sys, and change the working directory using os. The other packages are what I will be using the fit the models.

###############################################
# Front end prep

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import os
import sys

my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\binary_plots'
os.chdir(my_dir)

# Can append to path
sys.path.append(my_dir)
import binary_plots

np.random.seed(10) #setting the seed for the random
# split for train/test
###############################################

Next up I prepare the data, this is just boring stuff turning categorical variables into various dummies and making a train/test split for the data (which can be done in a variety of ways).

###############################################
# Prepping Compas Data

#For notes on data source, check out 
#https://github.com/apwheele/ResearchDesign/tree/master/Week11_MachineLearning
recid = pd.read_csv('PreppedCompas.csv')

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy()
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")

print( recid['race'].value_counts() )
dum_race = pd.get_dummies(recid['race'])
# White for reference category
for d in list(dum_race):
    if d != 'Caucasion':
        recid_prep[d] = dum_race[d]

print( recid['marital_status'].value_counts() )
dum_mar = pd.get_dummies(recid['marital_status'])
recid_prep['Single'] = dum_mar['Single']
recid_prep['Married'] = dum_mar['Married'] + dum_mar['Significant Other']
# reference category is separated/unknown/widowed

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Independant variables
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis',
            'African-American','Asian','Hispanic','Native American','Other',
            'Single','Married']

# Dependent variable
y_var = 'Recid30'
###############################################

Next, the sklearn library makes it quite easy to fit a set of multiple models. Most of the time I start with XGBoost, random forests, and a normal logistic model with no coefficient penalty. I just stuff the base model object in a dictionary, pipe in the same training data, and fit the models. Then I can add in the predicted probabilities from each model into the test dataset. (These plots I show you should only show on the test dataset, of course the data will be calibrated on the training dataset.)

###############################################
# Training three different models, Logit,
# Random Forest, and XGBoost

final_models = {}
final_models['XGB'] = XGBClassifier(n_estimators=100, max_depth=5)
final_models['RF'] = RandomForestClassifier(n_estimators=1000, max_depth=10, min_samples_split=50)
final_models['Logit'] = LogisticRegression(penalty='none', solver='newton-cg')

# Iterating over each model and fitting on train
for nm, mod in final_models.items():
    mod.fit(recid_train[ind_vars], recid_train[y_var])

# Adding predicted probabilities back into test dataset
for nm, mod in final_models.items():
    # Predicted probs out of sample
    recid_test[nm] =  mod.predict_proba(recid_test[ind_vars])[:,1]
###############################################

This is fairly tiny data, so I don’t need to worry about how long this takes or run out of memory. I’d note you can do the same model, but different hyperparameters in this approach. Such as tinkering with the depth for tree based models is one I by default limit quite a bit.

AUC Plots

First, my goto metric to see the utility of a particular binary prediction model is the AUC stat. This has one interpretation in terms of the concordance stat, an AUC of 0.7 means if you randomly picked a 0 case and a 1 case, the 1 case would have a higher value 70% of the time. So AUC is all about how well your prediction discriminates between the two classes.

So with my binary_plots function, you can generate an ROC curve for the test data for a single column of predictions as so:

# A single column
binary_plots.auc_plot(recid_test, y_var, ['Logit'], save_plot='AUC1.png')

As I have generated predictions for multiple models, I have also generated a similar graph, but stuff the AUC stats in the matplotlib legend:

# Multiple columns to show different models
pred_prob_cols = list(final_models.keys()) #variable names
binary_plots.auc_plot(recid_test, y_var, pred_prob_cols, save_plot='AUC2.png')

It is also the case you want to do these plots for different subgroups of data. In recidivism research, we are often interested in sex and racial breakdowns. Here is the Logit model AUC broken down by Males (1) and Females (0).

# By subgroups in the data
binary_plots.auc_plot_long(recid_test, y_var, 'Logit', group='Male', save_plot='AUC3.png')

So this pulls the labels from the data, but you can pass in strings to get nicer labels. And finally, I show how to put both of these together, both by models and by subgroups in the data. Subgroups are different panels, and you can pass in a fontsize moniker to make the legends smaller for each subplot, and a size for each subplot (they are squares).

# Lets make nicer variable names for Male/Females and Racial Groups
recid_test['Sex'] = recid_test['Male'].replace({0: 'Female', 1:'Male'})
recid_test['Race'] = recid[recid_prep['Train'] == 0]['race']
recid_test['Race'] = recid_test['Race'].replace({'Hispanic': 'Other', 'Asian':'Other', 'Native American':'Other', 'African-American':'Black', 'Caucasian':'White'})

# Now can do AUC plot by gender and model type
binary_plots.auc_plot_wide_group(recid_test, y_var, pred_prob_cols, 'Sex', size=4, leg_size='x-small', save_plot='AUC4.png')

The plots have a wrap function (default wrap at 3 columns), so you can plot as many subgroups as you want. Here is an example combing the sex and race categories:

One limitation to note in these plots, ROC plots are normalized in a way that the thresholds for each subgroup may not be at the same area of the plot (e.g. a FPR of 0.1 for one subgroup implies a predicted probability of 30%, whereas for another subgroup it implies a predicted probability of 40%).

ROC/AUC is definitely not a perfect stat, most of the time we are only interested in the far left hand side of the ROC curve (how well we can identify high risk cases without a ton of false positives). That is why I think drawing the curves are important – one model may have a higher AUC, but it is in an area of the curve not relevant for how you will use the predictions in practice. (For tree based models with limited depth and limited variables, it can produce flat areas in the ROC curve for example.)

But I find the ROC curve/AUC metric the most useful default for both absolute comparisons (how well is this model doing overall), as well as relative model comparisons (is Model A better than Model B).

Most models I work with I can get an AUC of 0.7 without much work, and once I get an AUC of 0.9 I am in the clearly diminishing returns category to tinkering with my model (this is true for both criminology related models I work with, as well as healthcare related models in my new job).

This is of course data dependent, and even an AUC of 0.9 is not necessarily good enough to use in practice (you need to do a cost-benefit type analysis given how you will use the predictions to figure that out).

Calibration Charts

For those with a stat background, these calibration charts I show are a graphical equivalent of the Hosmer-Lemeshow test. I don’t bother conducting the Chi-square test, but visually I find them informative to not only see if an individual model is calibrated, but also to see the range of the predictions (my experience XGBoost will be more aggressive in the range of predicted probabilities, but is not always well calibrated).

So we have the same three types of set ups as with the ROC plots, a single predicted model:

# For single model
binary_plots.cal_data('XGB', y_var, recid_test, bins=60, plot=True, save_plot='Cal1.png')

For multiple models, I always do these on separate subplots, they would be too busy to superimpose. And because it is a single legend, I just build the data and use seaborn to do a nice small multiple. (All of these functions return the dataframe I use to build the final plot in long format.) The original plot was slightly noisy with 60 bins, so I reduce it to 30 bins here, but it is still slightly noisy (but each model is pretty well calibrated). XGBoost has a wider range of probabilities, random forests lowest bin is around 0.1 and max is below 0.8. Logit has lower probabilities but none above 0.8.

# For multiple models
binary_plots.cal_data_wide(pred_prob_cols, y_var, recid_test, bins=30, plot=True, save_plot='Cal2.png')

For a single model, but by subgroups in the data. The smaller other race group is more noisy, but again each model appears to be approximately calibrated.

# For subgroups and one model
binary_plots.cal_data_group('XGB', y_var, 'Race', recid_test, bins=20, plot=True, save_plot='Cal3.png')

And a combo of subgroup data and multiple models. Again the smaller subgroup Females appear more noisy, but all three models appear to be doing OK in this quick example.

# For subgroups and multiple models
binary_plots.cal_data_wide_group(pred_prob_cols, y_var, 'Sex', recid_test, bins=20, plot=True, save_plot='Cal4.png')

Sometimes people don’t bin the data (Peter Austin likes to do a smoothed plot), but I find the binned data easier to follow and identify deviations above/below predicted probabilities. In real life you often have some fallout/dropoff if there is feedback between the model and how other people respond to the model (e.g. the observed is always 5% below the predicted).

Down the rabbit hole with R functions

I had a friend the other day ask me about modifying the plot that goes with R’s boxCox function. In particular they had multiple plots, and wanted to make the Y axes consistent between the different dependent variables. So for a typical R base plot call, you can specify ylim = c(whatever_low, whatever_high), but if you look at function in the end it does not let you do this yourself (it fixes ylim based on the log-likelihood range.

library(car)
data(trees)
# Making a second Y variable for illustration later
trees$V2 <- trees$Volume*2 + 3*rnorm(nrow(trees))

# Original function, https://rdrr.io/rforge/car/man/boxCox.html
orig_output <- with(trees, boxCox(Volume ~ log(Height) + log(Girth), data = trees))

So if we look at the orig_output object, it gives us the x and y values for the above plot, but it does not give us the dashed line locations in the plot.

Typically here I would type out boxCox without the parenthesis at the prompt to get the function definition. That does not quite work here, as it is unhelpful and just gets us the message useMethod(boxCox). From here we can do the function method(boxCox) to help slightly more – we can see that the boxCox function really has 3 different functions, that depend on the original input.

Here we are specifying the formula interface to the function call, so lets look at getAnywhere(boxCox.formula):

Well, that is not very helpful, lets look at getAnywhere(boxCox.default) instead:

Ok, that is what we are going for. If you look into the function, at the very end you will see how it draws those dashed reference lines (anything drawn with lty = 2 in the code).

So what is happening here is that the different boxCox function calls are all daisy chained together, and it goes from formula -> lm object -> the original boxCox function. Now that we can see the function, we can make some small changes to have it return the locations of the vertical/horizontal reference lines that we want (or we could change it to accept a ylim argument directly). I name this new function boxCox.new.

# Modifying the function to return all the info you need
boxCox.new <- function(object, lambda = seq(-2, 2, 1/10), plotit = TRUE, interp = plotit, 
    eps = 1/50, xlab = NULL, ylab = NULL, family = "bcPower", 
    param = c("lambda", "gamma"), gamma = NULL, grid = TRUE, 
    ...) 
{
    if (class(object)[1] == "mlm") 
        stop("This function is for univariate response only")
    param <- match.arg(param)
    ylab <- if (is.null(ylab)) {
        if (family != "bcnPower") 
            "log-likelihood"
        else {
            if (param == "gamma") {
                expression(max(logL[gamma](lambda, gamma)))
            }
            else {
                expression(max[lambda](logL(lambda, gamma)))
            }
        }
    }
    else ylab
    xlab <- if (is.null(xlab)) {
        if (param == "lambda") 
            expression(lambda)
        else expression(gamma)
    }
    else xlab
    #fam <- matchFun(family) #Needed to change this to base function
    fam <- match.fun(family)
    if (is.null(object$y) || is.null(object$qr)) 
        stop(paste(deparse(substitute(object)), "does not have both 'qr' and 'y' components"))
    y <- object$y
    n <- length(y)
    xqr <- object$qr
    xl <- loglik <- if (family != "bcnPower") 
        as.vector(lambda)
    else {
        if (param == "lambda") 
            as.vector(lambda)
        else {
            if (!is.null(gamma)) 
                as.vector(gamma)
            else {
                p1 <- powerTransform(object, family = "bcnPower")
                gam <- p1$gamma
                se <- sd(y)
                seq(max(0.01, gam - 3 * se), gam + 3 * se, length = 100)
            }
        }
    }
    m <- length(xl)
    if (family != "bcnPower") {
        for (i in 1L:m) {
            yt <- fam(y, xl[i], j = TRUE)
            loglik[i] <- -n/2 * log(sum(qr.resid(xqr, yt)^2))
        }
    }
    else {
        lambda.1d <- function(gamma) {
            fn <- function(lam) bcnPowerllik(NULL, y, NULL, lambda = lam, 
                gamma = gamma, xqr = xqr)$llik
            f <- optimize(f = fn, interval = c(-3, 3), maximum = TRUE)
            f$objective
        }
        gamma.1d <- function(lambda) {
            fn <- function(gam) bcnPowerllik(NULL, y, NULL, lambda = lambda, 
                gamma = gam, xqr = xqr)$llik
            f <- optimize(f = fn, interval = c(0.01, max(y)), 
                maximum = TRUE)
            f$objective
        }
        for (i in 1L:m) {
            loglik[i] <- if (param == "lambda") 
                gamma.1d(loglik[i])
            else lambda.1d(loglik[i])
        }
    }
    if (interp) {
        sp <- spline(xl, loglik, n = 100)
        xl <- sp$x
        loglik <- sp$y
        m <- length(xl)
    }
    if (plotit) {
        mx <- (1L:m)[loglik == max(loglik)][1L]
        Lmax <- loglik[mx]
        lim <- Lmax - qchisq(19/20, 1)/2
        # Adding in vector to contain x functions location and top line
        xF <- c()
        xT <- c()
        plot(xl, loglik, xlab = xlab, ylab = ylab, type = "n", 
            ylim = range(loglik, lim))
        if (grid) {
            grid(lty = 1, equilogs = FALSE)
            box()
        }
        lines(xl, loglik)
        plims <- par("usr")
        abline(h = lim, lty = 2)
        y0 <- plims[3L]
        scal <- (1/10 * (plims[4L] - y0))/par("pin")[2L]
        scx <- (1/10 * (plims[2L] - plims[1L]))/par("pin")[1L]
        text(xl[1L] + scx, lim + scal, " 95%")
        la <- xl[mx]
        if (mx > 1 && mx < m) 
            segments(la, y0, la, Lmax, lty = 2)
            xF <- c(xF, la)
            xT <- c(xT, Lmax)
        ind <- range((1L:m)[loglik > lim])
        if (loglik[1L] < lim) {
            i <- ind[1L]
            x <- xl[i - 1] + ((lim - loglik[i - 1]) * (xl[i] - 
                xl[i - 1]))/(loglik[i] - loglik[i - 1])
            segments(x, y0, x, lim, lty = 2)
            xF <- c(xF, x)
            xT <- c(xT, lim)
        }
        if (loglik[m] < lim) {
            i <- ind[2L] + 1
            x <- xl[i - 1] + ((lim - loglik[i - 1]) * (xl[i] - 
                xl[i - 1]))/(loglik[i] - loglik[i - 1])
            segments(x, y0, x, lim, lty = 2)
            xF <- c(xF, x)
            xT <- c(xT, lim)
        }
    # See definitions of hline, vlines, vtop, ybase, just returning that info
    return(list(x = xl, y = loglik, hline = lim, vlines = xF, vtop = xT, ybase = y0))
    }
    list(x = xl, y = loglik)
}

But this won’t work offhand with just calling boxCox.new with our same prior function calls, so we need to just entirely replace the original boxCox.default function for our daisy chain of function references to work. Here can use the assignInNamespace function to effectively overwrite the original.

# Need to do this to get it to work with lm objects
assignInNamespace("boxCox.default",boxCox.new,ns="car")

r1 <- with(trees, boxCox(Volume ~ log(Height) + log(Girth), data = trees))
r2 <- with(trees, boxCox(V2 ~ log(Height) + log(Girth), data = trees))

And now if we inspect either r1 or r2 you can see it returns the info we want.

And now we build own our set of plots. I don’t have the nice text annotations (or the default grid lines), but leave that to the reader to do that extra work.

par(mfrow=c(2,1), mai = c(1, 1, 0.2, 1))
plot(r1$x,r1$y,ylim=c(-160,-70), type='l', xaxp = c(-160,-70, 8),
     xlab=expression(lambda),ylab='log-Likelihood')
# You need to specify the bottom of the segment to match your limit
abline(h = r1$hline, lty = 2)
segments(r1$vlines, -160, r1$vlines, r1$vtop, lty = 2)
plot(r2$x, r2$y,ylim=c(-160,-70), type='l', xaxp = c(-160,-70, 8),
     xlab=expression(lambda),ylab='log-Likelihood')
segments(r2$vlines, -160, r2$vlines, r2$vtop, lty = 2)
abline(h = r2$hline, lty = 2)

I have done this previously for default plots in base R that I wanted to make myself in ggplot, which you could do here as well and do a facetted plot instead of the par deal with multiple rows (ggplot takes care of the spacing a bit nicer). But that is too much work for this quick tip to cajole those different data frames to do the facets for ggplot.

Updated SPSS Chart Template (V26) and Chart Notes

So I was helping someone out the other day with SPSS chart templates, and figured it would be a good opportunity to update mine. I have a new template version for V26, V26_ChartStyle.sgt. I also have some code to illustrate the template, plus a few random SPSS charting tips along the way.

For notes about chart templates, I have written about it previously, and Ruben has a nice tutorial on his site as well. For those not familiar, SPSS chart templates specify the default looks of the chart, very similar to CSS for HTML. So for example, if you want your labels to be a particular font, or you want the background for the chart to be light grey, or you want the gridlines to be dashed, are all examples you can specify in a chart template.

It is plain text XML file under the hood – unfortunately there is not any official documentation about what is valid from IBM SPSS that I am aware of, so to amend it just takes trial and error. So in case anyone from SPSS pays attention here, if you have other docs to note let me know!

Below I will walk through my updated template and various charts to illustrate the components of it.

Walkthrough Template Example

So first to start out, if you want to specify a new template, you can either do it via the GUI (Edit -> Options -> Charts Tab), or via syntax such as SET CTEMPLATE='data\template.sgt'. Here I make some fake data to illustrate various chart types.

SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 20.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Test.
COMPUTE Group = TRUNC(RV.UNIFORM(1,7)).
COMPUTE Pair = RV.BERNOULLI(0.5).
COMPUTE X = RV.UNIFORM(0,1).
COMPUTE Y = RV.UNIFORM(0,1).
COMPUTE Time = MOD(Id-1,10) + 1.
COMPUTE TGroup = TRUNC((Id-1)/10).
FORMATS Id Group Pair TGroup Time (F2.0) X Y (F2.1).
VALUE LABELS Group 
  1 'A' 
  2 'B'
  3 'C'
  4 'D'
  5 'E'
  6 'F'
.
VALUE LABELS Pair
  0 'Group One'
  1 'Group Two'
.
VALUE LABELS TGroup
 0 'G1'
 1 'G2'
.
EXECUTE.

Now to start out, I show off my new color palette using a bar chart. I use a color palette derived from a Van Gogh’s bedroom as my default set. An idea I got from Sidonie Christophe (and I used this palette generator website). I also use a monospace font, Consolas, for all of the text (SPSS pulls from the system fonts, and I am a windows guy). The default font sizes are too small IMO, especially for presentations, so the X/Y axes tick labels are 14pt.

*Bar graph to show colors, title, legend.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Group MEAN(X)[name="MEAN"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Group=col(source(s), name("Group"), unit.category())
  DATA: X=col(source(s), name("MEAN"))
  GUIDE: axis(dim(1), label("Group"))
  GUIDE: axis(dim(2), label("Mean X"))
  GUIDE: legend(aesthetic(aesthetic.color.interior), label("Group"))
  GUIDE: text.title(label("Main Title"))
  GUIDE: text.subtitle(label("Subtitle"))
  SCALE: cat(dim(1), include("1", "2", "3", "4", "5", "6"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(Group*X), shape.interior(shape.square), color.interior(Group))
END GPL.

The legend I would actually prefer to have an outline box, but it doesn’t behave so nicely when I use a continuous aesthetic and pads too much. I will end the blog post with things I wish I could do with templates and/or GPL. (I wished for documentation to help me with the template for legends for example, but wish I could specify the location of the legend in inline GPL code.)

The next chart shows a scatterplot. One thing I make sure in the template is to not prevent you specifying what you want in inline GPL that is possible. So for example you could specify a default scatterplot of size 12 and the inside is grey, but as far as I can tell that prevents you from changing the color later on. Also I show a trick with using the subtitle to make a Y axis label at the top of the chart, a trick I saw originally from Naomi Robbins. (She actually prefers the text orientation on the side running north/south if you do that approach, but I prevented that in this template, I really dislike having to turn my head to read it!)

* Scatterplot, showing Y axis trick with subtitle.
* Would prefer the setStyle subtype="simple" type="scatter" works as default.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  GUIDE: axis(dim(1), label("X Label"))
  GUIDE: axis(dim(2))
  GUIDE: text.subtitle(label("  Y Label"))
  ELEMENT: point(position(X*Y), size(size."12"), color.interior(color."bebebe"))
END GPL.

The next chart shows off my default data labels. I have the labels positioned at the center point, and also inherit the color from the data element. So one trick I like to use is to use the polygon element to explicitly set the locations of labels (and you can draw the polygons transparent, so you only see the labels in the end). So if you want to put labels at the top of bars (or above a line graph) like Excel does, you can just shift them up a smidge.

*Label trick for bar graphs.
GGRAPH 
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Group COUNT()[name="COUNT"]
  /GRAPHSPEC SOURCE=INLINE. 
BEGIN GPL 
  SOURCE: s=userSource(id("graphdataset")) 
  DATA: Group=col(source(s), name("Group"), unit.category()) 
  DATA: COUNT=col(source(s), name("COUNT")) 
  TRANS: ext = eval(COUNT + 0.2)
  GUIDE: axis(dim(1)) 
  GUIDE: axis(dim(2)) 
  GUIDE: text.subtitle(label("Count")) 
  SCALE: cat(dim(1), include("1", "2", "3", "4", "5", "6")) 
  SCALE: linear(dim(2), include(0)) 
  ELEMENT: interval(position(Group*COUNT), color.interior(color."bebebe")) 
  ELEMENT: polygon(position(Group*ext), label(COUNT), color.interior(color.red),
                    transparency.interior(transparency."1"), transparency.exterior(transparency."1")) 
END GPL.

Next up is a histogram. SPSS has an annoying habit of publishing a statistics summary frame with mean/standard deviation when you make a histogram. Unfortunately the only solution I have found to this is to still generate the frame, but it is invisible so doesn’t show anything. It still takes up space in the chart area though, so the histogram will be shrunk to be somewhat smaller in width. Also I was unable to find a solution to not print the Frequency numbers with decimals here. (You can use setTickLabelFormat to say no decimals, but then that applies to all charts by default. SPSS typically chooses the numeric format from the data, but here since it calculates it inline GPL, there is no way to specify it in advance that I know of.)

* Histogram, no summary frame.
* Still not happy with Frequency numbers, stat summary is there but is empty.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Y=col(source(s), name("Y"))
  GUIDE: axis(dim(1), label("Y"))
  GUIDE: axis(dim(2), label("Frequency"))
  GUIDE: text.title(label("Simple Histogram of Y"))
  ELEMENT: interval(position(summary.count(bin.rect(Y, binWidth(0.1), binStart(-0.05) ))),
                    color.interior(color."bebebe"), color.exterior(color.white), 
                    transparency.interior(transparency."0.35"), transparency.exterior(transparency."0.35"))
END GPL.

The next few charts I will illustrate how SPSS pulls the axes label formats from the data itself. So first I create a few new variables for percentages, dollars, and long values with comma groupings. Additionally a new part of this template is I was able to figure out how to set the text style for small multiple panels (they are ticked like tick marks for X/Y axes). So this illustrates my favorite way to mark panels.

* Small multiple columns, illustrating different variable formats.
COMPUTE XPct = X*100.
COMPUTE YDollar = Y * 5000.
COMPUTE YComma = Y * 50000.
*Can also do PCT3.1, DOLLAR5.2, COMMA 4.1, etc.
FORMATS XPct (PCT3) YDollar (DOLLAR4) YComma (COMMA7).
EXECUTE. 

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=XPct YDollar Pair
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("XPct"))
  DATA: Y=col(source(s), name("YDollar"))
  DATA: Pair=col(source(s), name("Pair"), unit.category())
  GUIDE: axis(dim(1), label("X Label"))
  GUIDE: axis(dim(2), label("Y Label"))
  GUIDE: axis(dim(3), opposite())
  ELEMENT: point(position(X*Y*Pair), size(size."12"), color.interior(color."bebebe"))
END GPL.

That is to make panels in columns, but you can also do it in rows. And again I prevent all the text I can from turning vertical up/down, and force it to write the text horizontally.

* Small multiple rows.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X YComma Pair
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("YComma"))
  DATA: Pair=col(source(s), name("Pair"), unit.category())
  GUIDE: axis(dim(1), label("X Label"))
  GUIDE: axis(dim(2), label("Y Label"))
  GUIDE: axis(dim(4), opposite())
  ELEMENT: point(position(X*Y*1*Pair))
END GPL.

And finally if you do wrapped panels and set the facet label to opposite, it puts the grid label on the top of the panel.

* Small multiple wrap.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y Group
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: Group=col(source(s), name("Group"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2))
  GUIDE: axis(dim(3), opposite())
  ELEMENT: point(position(X*Y*Group))
END GPL.

Next up I show how my default colors do with line charts, I typically tell people to avoid yellow for lines, but this tan color does alright I think. (And Van Gogh was probably color blind, so it works out well for that.) I have not tested out printing in grey-scale, I don’t think it will work out well for that beyond just the first two colors.

* Multiple line chart (long format).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Time Y TGroup
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Time=col(source(s), name("Time"))
  DATA: Y=col(source(s), name("Y"))
  DATA: TGroup=col(source(s), name("TGroup"), unit.category())
  SCALE: linear(dim(1), min(1))
  GUIDE: axis(dim(1), delta(1))
  GUIDE: axis(dim(2))
  GUIDE: legend(aesthetic(aesthetic.color.interior))
  GUIDE: legend(aesthetic(aesthetic.size), label("Size Y"))
  GUIDE: text.subtitle(label("   Y"))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("0", "1"))
  ELEMENT: line(position(Time*Y), color.interior(TGroup))
  ELEMENT: point(position(Time*Y), shape(TGroup), color.interior(TGroup),
                    color.exterior(color.white), size(size."10")) )
END GPL.

A nice approach that forgoes the legend though with line charts is to label the ends of the lines, and I show that below. Also the data above is in long format, and when superimposing points does not quite work out perfectly (G2 should always be above G1, but sometimes the G1 point is above the G2 line). Drawing each individual line in wide format though you can prevent that from happening (but results in more work to write the GPL, need several ELEMENT statements for a single line). I also show how to use splines here, which can sometimes help disentangle the spaghetti lines, but user beware, the interpolated spline values can be misleading (one of the reasons I like superimposing the point markers as well).

* Multiple line chart with end labels (wide format).
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK
  /LastTime = MAX(Id).
IF Id = LastTime IdLabel = Id.
FORMATS IdLabel (F2.0).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Id X Y IdLabel
   MISSING=VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Id=col(source(s), name("Id"))
  DATA: IdLabel=col(source(s), name("IdLabel")) 
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: TGroup=col(source(s), name("TGroup"), unit.category())
  SCALE: linear(dim(1), min(1))
  SCALE: linear(dim(2), max(1.08))
  GUIDE: axis(dim(1), delta(1))
  GUIDE: axis(dim(2))
  SCALE: cat(aesthetic(aesthetic.color.interior), include("0", "1"))
  ELEMENT: line(position(smooth.spline(Id*Y)), color(color.red))
  ELEMENT: line(position(smooth.spline(Id*X)), color(color.blue))
  ELEMENT: point(position(Id*Y), color.interior(color.red), color.exterior(color.white),
                    size(size."9"), shape(shape.circle))
  ELEMENT: point(position(Id*X), color.interior(color.blue), color.exterior(color.white),
                    size(size."8"), shape(shape.square))
  ELEMENT: point(position(IdLabel*Y), color(color.red), label("Y"), 
                    transparency.interior(transparency."1"), transparency.exterior(transparency."1"))
  ELEMENT: point(position(IdLabel*X), color(color.blue), label("X"), 
                    transparency.interior(transparency."1"), transparency.exterior(transparency."1"))
END GPL.

A final example I illustrate is an error bar chart, but also include a few different notes about line breaks. You can use a special \n character in labels to break lines where you want. Also had a request for labeling the ends of the chart as well. Here I fudge that look by adding in a bunch of white space. This takes trial-and-error to figure out the right number of spaces to include, and can change if the Y axes labels change length, but is the least worst way I can think to do such a task. For error bars and bar graphs, it is also often easier to generate them going vertical, and just use COORD: transpose() to make them horizontal if you want.

* Error Bar chart.
VALUE LABELS Pair
  0 'Group\nOne'
  1 'Group\nTwo'
.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Pair MEANCI(Y, 95)[name="MEAN_Y" LOW="MEAN_Y_LOW" 
    HIGH="MEAN_Y_HIGH"] MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Pair=col(source(s), name("Pair"), unit.category())
  DATA: MEAN_Y=col(source(s), name("MEAN_Y"))
  DATA: LOW=col(source(s), name("MEAN_Y_LOW"))
  DATA: HIGH=col(source(s), name("MEAN_Y_HIGH"))
  COORD: transpose()
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Low Anchor                                                    High Anchor", "\nLine 2"))
  GUIDE: text.title(label("Simple Error Bar Mean of Y by Pair\n[Transposed]"))
  GUIDE: text.footnote(label("Error Bars: 95% CI"))
  SCALE: cat(dim(1), include("0", "1"), reverse() )
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(region.spread.range(Pair*(LOW+HIGH))), shape.interior(shape.ibeam))
  ELEMENT: point(position(Pair*MEAN_Y), size(size."12"))
END GPL.

Going to back to some scatterplots, I will illustrate a continuous legend example using size of points in a scatterplot. For size elements it typically makes sense to use a square root scale instead of a linear one (SPSS default power scale is to x^0.5, so a square root). If you go to edit the chart and click on the legend, you will see here what I mean by the excessive padding of white space at the bottom. (Also wish you could control the breaks that are shown in inline GPL, breaks for non-linear scales though are no doubt tricky.)

* Continuous legend example.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  GUIDE: axis(dim(1), label("X Label"))
  GUIDE: axis(dim(2))
  GUIDE: text.subtitle(label("  Y Label"))
  GUIDE: legend(aesthetic(aesthetic.size), label("SizeLab"))
  SCALE: linear(dim(2), max(1.05))
  SCALE: pow(aesthetic(aesthetic.size), aestheticMinimum(size."6px"), 
               aestheticMaximum(size."30px"), min(0), max(1))
  ELEMENT: point(position(X*Y), size(Y), color.interior(color."bebebe"))
END GPL.

And you can also do multiple legends as well. SPSS does a good job blending them together in this example, but to label the group you need to figure out which hierarchy wins first I guess.

* Multiple legend example.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y Pair
  /GRAPHSPEC SOURCE=INLINE
  /FITLINE TOTAL=NO.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: Pair=col(source(s), name("Pair"), unit.category())
  GUIDE: axis(dim(1), label("X Label"))
  GUIDE: axis(dim(2))
  GUIDE: text.subtitle(label("  Y Label"))
  GUIDE: legend(aesthetic(aesthetic.color.interior), label("Color/Shape Lab"))
  GUIDE: legend(aesthetic(aesthetic.size), label("Size & Trans. Lab"))
  SCALE: linear(dim(2), max(1.05))
  SCALE: linear(aesthetic(aesthetic.size), aestheticMinimum(size."6px"), 
               aestheticMaximum(size."30px"), min(0), max(1))
  SCALE: linear(aesthetic(aesthetic.transparency.interior), aestheticMinimum(transparency."0"), 
               aestheticMaximum(transparency."0.6"), min(0), max(1))
  ELEMENT: point(position(X*Y), transparency.interior(Y), 
                            shape(Pair), size(Y), color.interior(Pair))
END GPL.

But that is not too shabby for just out of the box and no post-hoc editing.

Wishes for GPL and Templates

So I have put in my comments on things I wish I could do via chart templates. For a recap some documentation on what is possible, turning off the summary statistics for histograms entirely (not just making it invisible), padding for continuous legends, and making the setStyle defaults for the various charts styleOnly are a few examples. And then there are some parts I wish you could do in inline GPL, such as setting the location of the legend. Also I would not mind having access to any of these elements in inline GPL as well, such as say setting the number format in a GUIDE statement I think would make sense. (I have no idea how hard/easy these things though are under the hood.) But no documentation for templates is a real hassle when trying to edit your own.

Do y’all have any other suggestions for a chart template? Other default charts I should check out to see how my template does? Other suggested color scales? Let me know your thoughts!

Crime analysis dashboards in Tableau

So previously I have rewritten a few of my Crime Analysis tutorials (in Excel) to show how to use Tableau.

It takes too much work to do a nice tutorial like that with no clear end user, so I will just post some further examples I have been constructing to self-teach myself Tableau. To see my current workbook, you can download the files here.

The real benefit of Tableau over static charts in Excel (or whatever statistical program), is you can do interactive filtering and brushing/linking. So here is an example GIF showing how you can superimpose the weekly & seasonal chart I showed earlier, along with additional charts. Here instead of a dropdown to filter by different crime types, I show how you can use a Treemap as a filter. You can also select either one element or multiple elements, so first I show selecting different types of larceny (orange), then I show selecting all of the Part 2 nuisance crimes.

The Treemap idea is courtesy of Jerry Ratcliffe and Grant Drawve, and one of my co-workers used it like this in a Tableau dashboard to give me this idea. Here the different colors represent Part 2 disorder crimes (Blue), Property Crimes (orange), and Violent Crimes (Red). While you cannot see labels for each one, it does has tooltips, so in the end you can see what individual cells contain when you also consider the interactivity component.

You can mash-up additional tables, graphs, and maps as well. Here is another example using Compstat like tables for crime totals by year, a table of counts of crime per street (would prefer to do individual addresses, but the Burlington CAD data I used to illustrate does not have individual addresses) filtered to the top 30, and a point map. You can select any one graphic and it subsets the others.

While Tableau has maps I am not real bemused by them offhand. Points maps are no big deal, but with many points they become inscrutable. You can do a kernel density map, but it is very difficult to make it look reasonable depending on the filtering/zoom. If Tableau implements something like Leaflets cluster marker for point maps I think that would be a bit more friendly.

Dashboards no doubt are a trade-off with space. You can only reasonably put so much in a limited space. But brushing/linking between graphics is a really big different between Tableau and other traditional static graphics. It may not always be necessary, but it can sometimes be useful.

Next up I have a few ideas to make a predictive model monitoring dashboard in Tableau.