Conformal Sets Part 2, Estimating Precision

After publishing my post yesterday, in which I said you could not estimate the false positive rate using conformal sets, I realized there is a way: if you know some parts of the contingency table, you can estimate the other parts. In short, if you know the proportion of the outcome in your sample, which you can estimate from the same calibration sample (and which should be reasonable given the same exchangeability assumption needed for conformal sets to work to begin with), you can estimate the false positive rate (or equivalently the precision of your predictions at a particular threshold).

Here we know, given a particular threshold, the percentage estimates for the cells below:

           True
          0     1 
       -----------
Pred 0 | tn  | fn |
       ------------
     1 | fp  | tp |
       ------------
         X0    X1

I added two details to the table, X0 and X1, which are the column totals. So imagine a result where you have 90% coverage for the positive class and 60% coverage for the negative class. Also assume the proportion of 1's is 20%. So we have this percentage table:

           True
          0     1 
       -----------
Pred 0 | 60%  | 10% |
       ------------
     1 | 40%  | 90% |
       ------------

Now translate that to counts, where we have 80 0's and 20 1's (the 20% specified above):

           True
          0     1 
       -----------
Pred 0 | 48  |  2 |
       ------------
     1 | 32  | 18 |
       ------------
         80    20

So for our false positive estimate, we have 32/(32 + 18) = 0.64. Treating fp, tp, etc. as the original percent metrics (the column-conditional coverage values), we can then write the table as:

            True
          0         1 
       -----------
Pred 0 | tn*X0  | fn*X1 |
       ------------
     1 | fp*X0  | tp*X1 |
       ------------
         X0        X1

And so our false positive equation is then:

    fp*X0
--------------
(fp*X0 + tp*X1)

We can make this dimensionless by setting X0 = 1 and writing X1 = m*X0 = m. Then we can update the above equation to be:

    fp*X0                    fp
--------------       =>  ----------
  fp*X0 + tp*X0*m         fp + tp*m

The factor m is prop/(1 - prop), where prop is the proportion of 1's, here 0.2, so m is 0.2/0.8 = 0.25. Plugging in, 0.4/(0.4 + 0.9*0.25) = 0.64. That is our false positive estimate for that threshold across the sample, and our precision estimate is the complement, 36%.
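
To make the arithmetic concrete, here is a minimal sketch plugging in the numbers from the example above:

# plugging in the example numbers from above
cov1 = 0.9   # coverage (recall) for the positive class
cov0 = 0.6   # coverage for the negative class
prop = 0.2   # proportion of 1's

fp = 1 - cov0        # percent of true 0's above the threshold
m = prop/(1 - prop)  # 0.25

fp_rate = fp/(fp + cov1*m)
print(fp_rate, 1 - fp_rate)  # 0.64 and 0.36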

I have updated my code on github, but here is an example class to help calculate all of these metrics. It is short enough to put right on the blog:

import numpy as np
from scipy.stats import ecdf

class ConfSet:
    def __init__(self,y_true,y_score):
        self.p1 = y_score[y_true == 1]
        self.p0 = y_score[y_true == 0]
        self.prop = y_true.mean()
        self.m = self.prop/(1 - self.prop)
        self.ecdf1 = ecdf(self.p1)
        self.ecdf0 = ecdf(self.p0)
    def Cover1(self,k):
        # threshold so that k% of the positive class scores are above it
        res = np.percentile(self.p1,100-k)
        return res
    def Cover0(self,k):
        # threshold so that k% of the negative class scores are at or below it
        res = np.percentile(self.p0,k)
        return res
    def PCover(self,p):
        cov1 = self.ecdf1.sf.evaluate(p)
        cov0 = self.ecdf0.cdf.evaluate(p)
        # false positive estimate
        fp = 1 - cov0
        fprate = fp/(fp + cov1*self.m)
        return cov0, cov1, fprate

So based on your calibration sample, you pass in the predicted probabilities and the true outcomes. After that you can calculate recall (or zero class recall) using the Cover methods, or, given a particular threshold, calculate the different metrics with PCover. Here is an example using this method (with the same data from yesterday) to replicate the metrics in that post (for the 0 class coverage, I am always working with the predicted probability of the positive class, not its complement):
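
A minimal sketch of that usage, where y_cal and p_cal are simulated stand-ins for the calibration outcomes and predicted probabilities (hypothetical data, not the data from yesterday's post):

# simulated calibration data for illustration
import numpy as np
rng = np.random.default_rng(0)
y_cal = rng.binomial(1, 0.2, 5000)
p_cal = np.clip(rng.normal(0.3 + 0.3*y_cal, 0.15), 0.001, 0.999)

cs = ConfSet(y_cal, p_cal)

# threshold that captures 95% of the positive class
t95 = cs.Cover1(95)

# class coverages and the false positive estimate at that threshold
cov0, cov1, fprate = cs.PCover(t95)
print(t95, cov0, cov1, fprate)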

And this provides a much better out of sample estimator of false positive rates than using the PR curve directly.

The method makes it easy to do a grid search over different thresholds and calculate the metrics:
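
A sketch of that grid search, given a ConfSet object cs built from a calibration sample as above (hypothetical data):

import numpy as np
import pandas as pd

# coverage and false positive estimates over a grid of thresholds
grid = np.arange(0.05, 0.96, 0.01)
res = [cs.PCover(t) for t in grid]
metrics = pd.DataFrame(res, columns=['Cover0','Cover1','FPRate'])
metrics['Threshold'] = grid
print(metrics.head())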

And here is an ugly graph to show that the out of sample results match up very closely to the estimates:

I debated adding a method to solve for the threshold given a target false positive rate, but I think just doing a grid search may be the better approach. I am partially concerned that in small samples the false positive estimate will not be monotonic (a similar problem happens in AUC calculations with censored data). But here, in this sample at this grid level, it is monotonic – and in most realistic cases I have personally dealt with, worrying about the third digit in predicted probabilities is just noise at that point.

Conformal Sets and Setting Recall

I had a friend the other day interested in a hypothesis along the lines of “I think the mix of crime at a location is different” – in particular they think it will be pushed toward more lower-level property crime (and less violent crime) based on some local characteristics. I had a few ideas on this – Brantingham (2016) and Lentz (2018) have examples of creating a permutation type test. And I think I could build a regression multinomial type model (similar to Wheeler et al. 2018) to generate a surface of crime category prediction types over a geographic area (e.g. area A has a mix of 50% property and 50% violent, and area B has a mix of 10% violent and 90% property).

Another approach though is pure machine learning and using conformal sets. I have always been confused about them – after reading some more, my comments on Andrew Gelman’s post are mostly confused but partly right. In short, you can set recall on a particular class using conformal sets, but you cannot set precision (or equivalently the false positive rate). So here are my notes on that.

For a CJ application of conformal sets, check out Kuchibhotla & Berk (2023). The idea is that you are predicting categorical classes, in the Berk paper it is recidivism classification with three categories {violent,non-violent,no recidivism}. Say we had a prediction for an individual for the three categories as {0.1,0.5,0.4} – you may say that this person has the highest predicted category of non-violent. Conformal sets are different, in that they can return multiple categories based on a decision threshold, e.g. predict {non-violent,no-recidivism} in this example.

My comment on Gelman’s blog is confused in that I always thought “why not just take the probabilities to get the conformal set” – so if I wanted a conformal set of 90%, and the non-violent and no recidivism probabilities add up to 90%, wouldn’t they count? But that is not what conformal sets give you; conformal sets only make sense in the repeated frequentist sense of “if I do this thing over and over again, what happens”. So with conformal sets you get Prob(Predict > Threshold | True) = 0.95, or whatever conformal proportion you want. (I like to call this “coverage” – out of those True outcomes, what threshold will cover 95% of them.)

This blog post with the code examples really helped my understanding, and how conformal sets can be applied to individual categories (which I think makes more sense than the return multiple labels scenario).

I have code to replicate on github, using data from the NIJ recidivism competition (Circo & Wheeler, 2022) as an example. See some of my prior posts for the feature engineering example, but I use the out of bag trick for random forests in lieu of having a separate calibration sample.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# NIJ Recidivism data with some feature engineering
pdata = pd.read_csv('NIJRecid.csv') # NIJ recidivism data

# Train/test split and fit model
train = pdata[pdata['Training_Sample'] == 1]
test = pdata[pdata['Training_Sample'] == 0]

yvar = 'Recidivism_Arrest_Year1'
xvar = list(pdata)[2:]

# Random forest, need to set OOB to True
# for conformal (otherwise need to use a separate calibration sample)
rf = RandomForestClassifier(max_depth=5,min_samples_leaf=100,random_state=10,n_estimators=1000,oob_score=True)
rf.fit(train[xvar],train[yvar])

# Out of bag predictions
probs = rf.oob_decision_function_

Now I have intentionally made this as simple as possible (the towards data science post has a small sample quantile correction, plus has a habit of going back and forth between P(y=1) and 1 - P(y=1)). But getting a conformal threshold to set the recall at 95% in this scenario is quite simple:

# conditional predictions for actual 1's
p1 = probs[train[yvar]==1,1]

# recall 95% coverage
k = 95
cover95 = np.percentile(p1,100-k)
print(f'Threshold to have conformal set of {k}% for capturing recidivism')
print(f'{cover95:,.3f}')

# Now can check out of sample
ptest = rf.predict_proba(test[xvar])
out_cover = (ptest[test[yvar]==1,1] > cover95).mean()
print(f'\nOut of sample coverage at {k}%')
print(f'{out_cover:,.3f}')

And this results in for this data:

So this is again recall, or what I like to call the capture rate of the true positives. It is true_positives / (true_positives + false_negatives). This threshold value is estimated purely from the calibration sample (or here the OOB estimates). The model I show here is not very good, but with the conformal sets you still get good coverage performance. So this is quite helpful – a good estimator, based on exchangeability, so assuming no drift over time. I think in practice that assumption will not be bad (I by default auto-retrain models I put into production on a regular schedule, e.g. retrain once a month), so I don’t bother monitoring drift.

You can technically do this for each class, so you can have a recall set for the true negatives as well:

# can also set coverage for the 0 class in much the same way
p0 = probs[train[yvar]==0,0]

# 0 class coverage set to 95% (so only 5% of true negatives get flagged)
k = 95
cover95 = np.percentile(p0,100-k)
print(f'Threshold (for 0 class) to have conformal set of {k}% for low risk')
print(f'{cover95:,.3f}')

# Now can check out of sample
out_cover = (ptest[test[yvar]==0,0] > cover95).mean()
print(f'\nOut of sample coverage at {k}%')
print(f'{out_cover:,.3f}')

Note that the threshold here is for P(y=0):

Threshold (for 0 class) to have conformal set of 95% for low risk
0.566

Out of sample coverage at 95%
0.953

Going back to the return multiple labels idea, in this example for predictions (for the positive class) we would have this breakdown (where 1 – 0.566 = 0.434):

P < 0.19 = {no recidivism}
0.19 < P < 0.434 = {no recidivism,recidivism}
0.434 < P = {recidivism}
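
As a sketch, the prediction set for an individual case just checks the predicted probability against the two thresholds (0.19 and 0.434 are the values from the breakdown above):

# thresholds from the breakdown above
t1 = 0.19    # include 'recidivism' when P(y=1) >= 0.19
t0 = 0.434   # include 'no recidivism' when P(y=1) <= 1 - 0.566 = 0.434

def pred_set(p):
    labels = []
    if p <= t0:
        labels.append('no recidivism')
    if p >= t1:
        labels.append('recidivism')
    return labels

print(pred_set(0.1), pred_set(0.3), pred_set(0.6))
# ['no recidivism'] ['no recidivism', 'recidivism'] ['recidivism']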

I don’t think returning multiple labels is helpful offhand, but it would not be crazy for someone to want to set the recall (for either class on its own) as a requirement in practice. So say we have a model that predicts some very high risk event (say whether to open an investigation into potential domestic terrorist activity). We may want the recall for the positive class to be very high, so even if many are nothing burgers, we at least have an FBI agent give some investigation into the predicted individuals.

For the opposite scenario, say we are doing release on recognizance for pre-trial in lieu of bail. We want to say, of those who would not go onto recidivate, we only want to “falsely hold pretrial” 5%, so this is a 95% conformal set of True Negative/(True Negative + False Positive) = 0.95. This is what you get in the second example above for no-recidivism.

Note that this is not the false positive rate, which (as I am using the term) is False Positive/(True Positive + False Positive), and which as far as I can tell you cannot determine via conformal sets. If I draw my contingency table (using fp for false positive, tn for true negative, etc.), conformal sets condition on the columns, whereas this false positive rate conditions on the second row.

          True
          0     1 
       -----------
Pred 0 | tn  | fn |
       ------------
     1 | fp  | tp |
       ------------

So what if you want to set the false positive rate? In batch I know how to set the false positive rate, but this random forest model happens to not be very well calibrated:

# This model's calibration is not very good, it is overfit
dfp = pd.DataFrame(probs,columns=['Pred0','Pred1'],index=train.index)
dfp['y'] = train[yvar]
dfp['bins'] = pd.qcut(dfp['Pred1'],10)
dfp.groupby('bins')[['y','Pred1']].sum()

So I go with a more reliable logistic model, which does result in better calibrated predictions in this example:

# So lets do a logit model to try to set the false positive rate
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Making a second calibration set
train1, cal1 = train_test_split(train,train_size=10000)
logitm = LogisticRegression(random_state=10,penalty=None,max_iter=100000)
logitm.fit(train1[xvar],train1[yvar])
probsl = logitm.predict_proba(cal1[xvar])

# Can see here that the calibration is much better
dflp = pd.DataFrame(probsl,columns=['Pred0','Pred1'],index=cal1.index)
dflp['y'] = cal1[yvar]
dflp['bins'] = pd.qcut(dflp['Pred1'],10)
dflp.groupby('bins')[['y','Pred1']].sum()

Now the batch way to set the false positive rate, given you have a well calibrated model, is as follows. Sort your batch according to the predicted probability of the positive class in descending order. Pretend we have a simple set of four cases:

Prob
 0.9
 0.8
 0.5
 0.1

Now if we set the threshold to be 0.6, we would then have {0.9, 0.8} as our two predictions, and we estimate the false positive rate as (0.1 + 0.2)/2 = 0.15. If we set the threshold to be 0.4, we would have a false positive rate estimate of (0.1 + 0.2 + 0.5)/3 = 0.8/3 ≈ 0.27. So this relies on having a batch of cases to predict, and is not determined beforehand (this is the idea I use in this post on prioritizing audits). Below is a quick toy check of that arithmetic, followed by the code on the real data:
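
Here is that toy check as a minimal numpy sketch (just the four made-up probabilities, not the NIJ data):

import numpy as np

p = np.array([0.9, 0.8, 0.5, 0.1])   # sorted descending
pred_fp = (1 - p).cumsum()           # expected false positives so far
n = np.arange(p.shape[0]) + 1
print(pred_fp/n)                     # [0.1, 0.15, 0.267, 0.425]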

# The batch way to set the false positive rate
ptestl = logitm.predict_proba(test[xvar])
dftp = pd.DataFrame(ptestl,columns=['Pred0','Pred1'],index=test.index)
dftp['y'] = test[yvar]

dftp.sort_values(by='Pred1',ascending=False,inplace=True)
dftp['PredictedFP'] = (1 - dftp['Pred1']).cumsum()
dftp['ActualFP'] = (dftp['y'] == 0).cumsum()
dftp['CumN'] = np.arange(dftp.shape[0]) + 1
dftp['PredRate'] = dftp['PredictedFP']/dftp['CumN']
dftp['ActualRate'] = dftp['ActualFP']/dftp['CumN']
dftp.iloc[range(1000,7001,1000)]

What happens if we try to estimate where to set the threshold in the training/calibration data though? NOTE: I have a new blog post showing how to construct a more appropriate estimate of the false positive rate. ENDNOTE. In practice we often need to make decisions one at a time – in the parole case it is not like we hold all parolees in a queue for a month and batch process them. So let’s use the precision in the calibration sample to get a threshold:

# Using precision to set the threshold (based on calibration set)
fp_set = 0.45
pr_data = precision_recall_curve(cal1[yvar], probsl[:,1])
loc = np.arange(pr_data[0].shape[0])[pr_data[0] > fp_set].min()
thresh_fp = pr_data[2][loc]

print(f'Threshold estimate for FP rate at {fp_set}')
print(f'{thresh_fp:,.3f}')

print(f'\nActual FP rate in test set at threshold {thresh_fp:,.3f}')
test_fprate = 1 - test[yvar][ptest[:,1] > thresh_fp].mean()
print(f'{test_fprate:,.3f}') # this is not a very good estimate!

Which gives us very poor out of sample estimates – we set the false positive rate to 45%, but it ends up being 55%:

Threshold estimate for FP rate at 0.45
0.333

Actual FP rate in test set at threshold 0.333
0.549

So I am not sure what the takeaway from that is – whether we need to be doing something else to estimate the false positive rate (like the online learning approach that Chohlas-Wood et al. (2021) discuss). One takeaway I have from the NIJ competition is that false positives tend to be a noisy measure (and FP comparisons for fairness between groups just exacerbate the problem), so maybe we shouldn’t be worried about false positives at all. In many CJ scenarios we do not get any on-policy feedback on false positives – think the bail case where you have ROR vs. held pre-trial; you don’t observe false positives in that scenario in practice.

Conformal sets, though, are the way to go if you want to set recall for particular classes. You can also do them for subsets of data, e.g. different conformal thresholds for male/female or minority/white, so you have an easy way to accomplish a fairness ideal with post processing. And I may do a machine learning approach to help that friend out with the crime mix in places idea as well (Wheeler & Steenbeek, 2021).


Notes on MMc queues

Recently had a project related to queues at work, so wanted to put some of my notes in a blog post. For a bit of up-front, the notation MMc refers to a queuing system with multiple servers (c), where arrivals are Poisson distributed (the first M) and service times are exponential (the second M; the two rates can be different). That is a mouthful, but it basically says events arrive independently and have a right-skewed distribution of times it takes to resolve them. (That may seem like a lot of assumptions; they are often reasonable for many systems, and if not, deviations may not be that big of a deal to the estimates in practice.)

The main reason for this blog post is that the vast majority of stuff online is about MM1 queue systems, so systems that only have 1 server. I basically never deal with this situation. The formulas for multiple servers are much more complicated, so it took me a bit to gather code examples and verify correctness. These are notes based on that work.

So up-front, the group I was dealing with at work had a fundamental problem: their throughput was waaay too small relative to the arrivals. In this notation, we have:

  • Number of arrivals per time period, N
  • Mean time it takes to exit the queue, S
  • Number of servers, c

So first, you need to have N*S < c. This is simple accounting: say we are talking about police calls for service, and you have on average 5 calls per hour that take on average 0.5 hours (30 minutes) to handle. You then need more than 5*0.5 = 2.5 officers to handle this, so a minimum of 3 officers. If you don’t have 3 officers, the queue will grow and you won’t be able to handle all of the calls.

At work I was advising on a situation where they were chronically understaffed for a particular project, and it ballooned over an extended period of time into an unacceptable backlog. So think S is really tiny and N is very large – at first the too-few servers could cycle through the tickets, but the backlog just slowly grew, and after months they had unacceptable wait times. This is a total mess – there is no accounting trick to solve it, you need c > N*S. It makes no sense to talk about anything else, like average wait time in the queue, unless that condition is met.

OK, so we know you need c > N*S. A common rule of thumb is that utilization should not be over 80%, so that is c > (N*S)/0.8. (This is not specific to policing, but more common for call centers; see also posts on Erlang-C formulas.) The idea behind 80% is that it is around the point where wait times (being held in the queue) start to grow quickly.
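
A minimal sketch of that accounting, using the call numbers from the example above:

import math

N = 5    # calls per hour
S = 0.5  # hours per call

min_officers = math.ceil(N*S)     # bare minimum for the queue not to grow
rule80 = math.ceil((N*S)/0.8)     # keep utilization under 80%
print(min_officers, rule80)       # 3 and 4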

If you want to get more into the nitty gritty though, such as calculating the actual probability of waiting in the queue, average wait time, etc., then you will want to dig into the MMc queue literature. Here I have posted some python notes (which are themselves derivative of work others have posted). Hoping just posting and giving my thumbs up makes it easier for others.

So first here is an example of using those functions on a queue example similar to the above (here with 6 calls per hour and 7 officers). Note you need to give the inverse of the mean service time for this function.

# queuing functions in python (from my local queue module posted in these notes,
# not the python standard library queue)
from queue import MMc, nc

N = 6    # 6 calls per hour
S = 0.5  # calls take 30 minutes to resolve
c = 7    # officers taking calls

# This function expects inverse service average
qS = MMc(N,1/S,c)

# Now can get stats of interest

# This is the probability that when a call comes
# in, it needs to wait for an officer
qS.getQueueProb()

And this prints out 0.0376.... So when a call comes in, we have about a 3.8% probability of having to wait in the queue for an officer to respond. How about how long on average a call will wait in the queue?

# This is how long a call on average needs
# to wait in the queue in minutes
qS.getAvgQueueTime()*60

And this gives 0.28.... The multiplication by 60 goes from hours to minutes, and we are waiting less than 1 minute on average. This seems good, but somewhat counter-intuitively, this is an average of a bunch of calls answered immediately, plus the 3.8% of calls that are held for some time. We can estimate, if a call is held, how long it will be held on average:

# If a call is queued however, how long to wait?
qS.getAvgQueueTime_Given()*60

And this is a less rosy 17.5 minutes! Queues are tricky. Unless you have a lot of extra capacity, there are going to be wait times. We can also calculate how often all officers will be idle in this setup.

# Idle time, no one taking any calls
qS.getIdleProb()

And this gives rounded 0.05, so all 7 officers are simultaneously idle only 5% of the time in this setup. That particular number is not that helpful for police planning though – the probability every officer is idle at once does not really matter. You want individual officers to have capacity to do proactive work, that is, you want officers to only spend 40-60% of their time responding to calls for service, which suggests c > (N*S)/0.5 is where you want to be. That is where we are at in this scenario with 7 officers.

Now you can technically just run this through multiple values of c, but Rosetti (2021) lists an approximate square root staffing formula that, given an input probability of waiting in the queue, tells you approximately how many servers you need. So here is that function:

# If you want probability of holding in the queue to only be 3%
est_serv = nc(N,S,0.03)
print(est_serv)

Which prints out 6.387..., so since you need to take the ceiling of this, you will need 7 officers to keep to that probability (agreeing with the MMc object above).

In terms of values, the nc function will work with very large/small N and S inputs just fine. The MMc class also looks fine, except one submethod, .getPk, uses a factorial (so it cannot take very large inputs), but the rest is OK. So if you wanted to do nc(very_big,very_small,0.1) that is fine and there should be no floating point issues.

The nc function relies on scipy, but the MMc class is all base python (just the math library). So the MMc functions can really be embedded in any particular python application you want with no real problem.

Rough Estimates for Spatial Police Planning

I have prior work on spatial allocation of patrol units with workload equality constraints (Wheeler, 2018). But generally, you need to first estimate how many units you will have, and after that you can worry about optimally distributing them. The reason is that the number of units is much more important – too few and you will have more queuing, in which case the spatial arrangement does not matter at all. Larson & Stevenson (1972) estimate optimal spatial allocation only beats random allocation by 25%.

So for police response times you can think about time waiting in queue, time spent driving to the event, and time spent resolving the event (time to dispatch tends to be quite trivial, but is sometimes included in the wait in the queue part, Verlaan & Ruiter, 2023).

There is somewhat of a relationship with the above “service” time – with fewer units, each has to drive farther, and so service time goes up. But there happen to be some simple rules of thumb: if you have N patrol units, you can calculate (2/3)*sqrt(Square Miles)/sqrt(N) = average distance traveled in miles for your jurisdiction (Stenzel, 1993, see page 135 in the PDF). Then you can translate those miles driven to time, by say taking an average of 45 miles per hour. Given a fixed N, you can then add this into the service time estimate for your jurisdiction to get a rough estimate of how much more officers will reduce response times.
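
A quick sketch of that travel time arithmetic – the square mileage and number of units here are made-up numbers just for illustration:

import math

area = 100        # hypothetical square miles
N = 10            # hypothetical number of patrol units
speed_mph = 45    # assumed average travel speed

avg_miles = (2/3)*math.sqrt(area)/math.sqrt(N)
avg_minutes = avg_miles/speed_mph*60
print(avg_miles, avg_minutes)  # around 2.1 miles, under 3 minutes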

It ends up, though, that this travel time tends to be trivial relative to the waiting-in-queue time (or the typical 30 minutes it takes to resolve police incidents on average). So it is often more important to get rough estimates for those if you want to reduce wait times for calls for service. And this does not even take into account priority levels in calls – to start simpler, folks should figure out a minimum to handle the call stack (whether in policing or in other areas) and then go on to more complicated scenarios.


Power for analyzing Likert items

First, some other updates of interest to folks on the blog. On CRIME De-Coder, a blog post: You should be geocoding crime data locally. I give python code to create a local geocoding engine using the arcpy library.

This is a bit more techy, so I would typically post it on this blog instead of the CRIME De-Coder one. But currently the web is sorely lacking in good advice for local geocoding solutions. The vast majority of sites discuss online geocoding APIs (e.g. google or the census geocoder), which I guess are common for web-apps, but they do not make sense for crime analysis. For the few webpages that actually describe local solutions (that do not involve calling an online web API), all the examples I am aware of use PostGIS. PostGIS is both very difficult to set up and has worse results compared to ESRI. So I know ESRI is a paid-for service, but they have reasonable academic and small business pricing (and most police departments already have access), so to me this is a reasonable use case. If you need to geocode 100k cases, the license fee for ESRI is worth it at that point relative to using the web engines.

Definitely do not spend thousands of dollars if you need to batch geocode a few million records. That is something that is worth getting in touch with me about. And so hopefully that gets picked up by search engines and drives a bit more traffic to my consulting website.

For a second example, I posted some python code to help construct network experiments. The idea here is you want to spread out the treated nodes so you have a specific allocation of treated, connected to treated (what I call spillover here), and those not connected to treated (the leftover control group). This python code constructs linear programs to accomplish certain treated/not-touched proportions. So this graph shows if you choose to treat 1 person, but have constraints of 1, 2, 3 leftover.

And then you can apply this to bigger networks – here the network is 311 nodes, 90 are treated, and I want a total of 150 not treated.

The idea is derivative of some work Bruce Desmarais discussed on Twitter, but I have also thought about this in some discussions with Barak Ariel on focused deterrence style interventions. So hopefully that comes in handy.

My linear programming formulation is not as svelte as the optimal treatment assignment with spillovers formulation – it is 3*N + 2*E decision variables and 5*N + 2*E constraints (where N is the number of nodes and E is the number of undirected edges). I have a feeling my formulation is redundant, so if I write my constraints smarter it can be more like 2*N decision variables and 2*N + E constraints.

But for my examples I show it solves quite fast as is (and maybe solvers get rid of the cruft in pre-solve), so not worried about that at the moment. Don’t know the typical size networks people use, but I suspect it will work just fine and dandy on typical machines for networks even as large as 10k. (Generally if I keep the model to under 100k decision variables it is only a few minutes to solve the types of problems I show on this blog.)

Power with Likert items

The other day for a grant application we needed to conduct a power analysis. Our design was a t-test of mean differences for a simple treated/control randomized experiment with the outcome being a Likert score survey response. (I know, most of the time people create latent scores with Likert items; analyzing the individual items is fine IMO and simpler to specify for a pre-registration analysis.) I am sure others have needed to do similar things, but I could not find code online to help out with this. So I scripted up a simulation in R to do this, and figured sharing would be useful.

So the rub with Likert data, and why you can’t use typical power calculations, is that they have ceiling effects. If most people answer on average 4.5 on your scale that goes up to 5, it is difficult to go much higher. Here I simulate data that has that skew (so ceiling effects come into play), and then go through the motions of doing the t-test. So first for some setup, I have a function that rounds and clips data to a limited set of integers, plus some plotting functions.

# power analysis of Likert data
library(ggplot2)

# custom theme
theme_cdc <- function(){
  theme_bw() %+replace% theme(
     text = element_text(size = 16),
     panel.grid.major= element_line(linetype = "longdash"),
     panel.grid.minor= element_blank()
) }

set.seed(10) # setting the random seed

# rounding/clipping data to integers 
clipint <- function(x,min=1,max=5){
    rx <- round(x)
    rx <- ifelse(rx < min,min,rx)
    rx <- ifelse(rx > max,max,rx)
    return(rx)
}

The following function generates the p-values and standard errors, which I will use later in my simulation. Here I use a t-test of mean differences, but it would be fairly easy to swap in, say, an ordinal logistic regression if you prefer that. Probably the bigger deal is that the simulation generates data using a normal distribution, and then rounds/clips the data afterwards. There is probably a smarter way to generate the data according to a logistic model and ordinal intercepts (Frank Harrell discusses such things a bit on his blog). But this at least takes into account that the data will be skewed toward more positive outcomes, even in the control group, and thus accounts for the ceiling.

# this uses OLS to do t-test of mean differences
# generates normal data, but then rounds/clips
sim_ols <- function(n,eff=0.5,int=4,sd=1){
    df <- data.frame(1:n)
    df$treat <- sample(c(0,1),n,replace=TRUE)
    df$latent <- int + eff*df$treat + rnorm(n,sd=sd)
    df$likert <- clipint(df$latent)
    m1 <- lm(likert ~ treat,data=df)
    cd <- coef(summary(m1))
    pval <- cd[2,4]
    se <- cd[2,2]
    return(c(pval,se))
}

Now we can move on to the simulations. This evaluates sample sizes from 100 to 2000 (in increments of 50) and effect sizes of 0.1, 0.3, and 0.5, and repeats the simulation 10k times for each combination. I then look at the power (how often the two-tailed p-value is less than 0.05), as well as the standard error (precision) of the estimates. Effect sizes are in terms of the original Likert scale values, which I take to be much easier to reason about. (I have seen power analyses here use Cohen’s D, but you really can’t get a very large Cohen’s D value due to ceiling effects with this data.)

# running power estimates over different
# sample sizes and effect sizes
samp_sizes <- seq(100,2000,50)
eff_sizes <- c(0.1,0.3,0.5)
rep_size <- 10000
df <- expand.grid(samps_sizes=samp_sizes,eff_sizes=eff_sizes,pow=c(NA),se=c(NA))
for (i in 1:nrow(df)){
    ss <- df[i,1]
    es <- df[i,2]
    stats <- replicate(rep_size,sim_ols(n=ss,eff=es))
    smean <- rowMeans(stats)
    df[i,3] <- mean(stats[1,] < 0.05) # alpha 0.05
    df[i,4] <- smean[2]
}

df$eff_sizes <- as.factor(df$eff_sizes)

The graph of the power shows what you would expect – with a few hundred samples you can detect an effect size of 0.5, but with a smaller effect size (on the order of 0.1) you will need more than 2k samples.

# Graph of power
powg <- ggplot(data=df,aes(x=samps_sizes,y=pow)) + 
        geom_line(aes(color=eff_sizes)) + 
        geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
        labs(x='Sample Sizes',y=NULL,title='Power Estimates') +
        scale_y_continuous(breaks=seq(0,1,0.1)) + 
        scale_x_continuous(breaks=seq(100,2000,250)) + 
        scale_color_discrete(name="Effect Sizes") +
        scale_fill_discrete(name="Effect Sizes") + 
        theme_cdc()

Unfortunately, in reality with most survey measures of police data, e.g. rate your officer 1 to 5, a 0.5 effect is a really large increase. In my mapping attitudes paper, some demographics shift global attitudes at max by 0.2, and I doubt most interventions will be that dramatic. So I like plotting the precision of the estimator, which the effect size doesn’t really make a dent in here (it could with more severe ceiling effects).

# Graph of Standard Errors (for Precision)
precg <- ggplot(data=df,aes(x=samps_sizes,y=se,color=eff_sizes)) + 
         geom_line(aes(color=eff_sizes)) + 
         geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
         labs(x='Sample Sizes',y=NULL,title='Precision Estimates') +
         scale_x_continuous(breaks=seq(100,2000,250)) + 
         scale_color_discrete(name="Effect Sizes") +
         scale_fill_discrete(name="Effect Sizes") + 
         theme_cdc()

With field experiments surveying people after police contacts (general attitude surveys give you more wiggle room, but they still cost money to field), you really can’t control the sample size. You are at the whims of whatever events happen in the police department’s daily duties. So the best you can do is make approximate plans for “how long am I going to collect measures before I analyze the data” and “how reasonably precise will my estimates be”.

For this particular grant I make the argument that we care more about a non-inferiority type test (e.g. we can be sure attitudes are not worse, within a particular error level, given our treatment is more cost-effective than business as usual). But if we did an intervention specifically intended to improve attitudes, you may be talking more like 5,000+ contacts to detect an effect of 0.1 (given likely survey non-response), which is still a big effect.

You would gain power by using a scale (e.g. summing together multiple items or conducting a factor analysis) rather than piecemeal individual items, assuming the intervention affects the underlying latent trait. But simulating data to do that end-to-end analysis will have to wait for another day.

Despite not being an academic anymore, I have in the hopper more grant collaborations than I did when I was a professor. The NIJ season is winding down, so probably too late to collaborate on any more of those. But if you have other ideas and need quant help with your projects, always feel free to reach out. I enjoy doing these side projects with academics when reasonable funding is available.

Grabbing the NHAMCS emergency room data in python

So Katelyn Jetelina on her blog, The Local Epidemiologist, had a blog post (with Heidi Moseson) on how several papers examining mifepristone-related emergency room (ER) visits were retracted. (Highly recommend Katelyn’s blog, I really enjoy the mix of virology and empirical data discussions you can’t get from other outlets.)

This reminded me I had on the todo list examining the CDC’s NHAMCS (National Hospital Ambulatory Medical Care Survey) data. This is a sample of ER visit data collated by the CDC. I previously putzed with this data to illustrate predictive models for wait times, and I was interested in examining gun violence survival rates in this dataset.

I had the idea of checking out gun violence in this data after seeing Jeff Brantingham’s paper showing gunshot survival rates in California have been decreasing, and ditto for Chicago via work by Jens Ludwig and Jacob Miller. It is not uber clear though if this is a national pattern – Jeff Asher does not think so, for example. So I figured the NHAMCS would be a good way to check national survival rates, and maybe see if any metro areas were diverging over time.

Long story short, the NHAMCS data is waaay too small of a sample to look at rare outcomes like gun violence. (So replicating the bad studies Katelyn mentions in her blog is probably not worth the effort – those outcomes will be similarly rare.) But the code is concise enough to share in a quick blog post for others if interested. Looking at the data the other day, I realized you can download SPSS/SAS/Stata files instead of the fixed width files from the CDC website. This is easier than my prior post, as you can read those files into python directly without having to code all of the variable fields from the fixed width file.

So for some upfront, the main library you need is pandas (with pyreadstat installed as well). The rest is just stuff that comes with python’s standard library. The NHAMCS files are zipped SPSS files, so a bit more painful to download, but not that much of an issue. (Unfortunately you cannot just read them in memory like I did with Excel/csv here – I have to save the file to disk and then read it back.)

import pandas as pd
import zipfile
from io import BytesIO
import requests
from os import path, remove

# This downloads zip file for SPSS
def get_spss(url,save_loc='.',convert_cat=False):
    ext = url[-3:]
    res = requests.get(url)
    if ext == 'zip':
        zf = zipfile.ZipFile(BytesIO(res.content))
        spssf = zf.filelist[0].filename
        zz = zf.open(spssf)
        zs = zz.read()
    else:
        # res.content is already bytes, so it can be written directly to disk
        zs = res.content
        spssf = path.basename(url)
    sl = path.join('.',spssf)
    with open(sl, "wb") as sav:
        sav.write(zs)
    df = pd.read_spss(sl,convert_categoricals=convert_cat)
    remove(sl)
    return df

Now that we have our setup, we can just read in each year. Note the irregular file name for 2005!

# creating urls
base_url = 'https://ftp.cdc.gov/pub/health_statistics/nchs/dataset_documentation/NHAMCS/spss/'
files = ['ed02-spss.zip',
         'ed03-spss.zip',
         'ed04-spss.zip',
         'ed05-sps.zip',
         'ed06-spss.zip',
         'ed07-spss.zip',
         'ed08-spss.zip',
         'ed09-spss.zip',
         'ed2010-spss.zip',
         'ed2011-spss.zip',
         'ed2012-spss.zip',
         'ed2013-spss.zip',
         'ed2014-spss.zip',
         'ed2015-spss.zip',
         'ed2016-spss.zip',
         'ed2017-spss.zip',
         'ed2018-spss.zip',
         'ed2019-spss.zip',
         'ed2020-spss.zip',
         'ed2021-spss.zip']
urls = [base_url + f for f in files]

def get_data():
    res_data = []
    for u in urls:
        res_data.append(get_spss(u))
    for r in res_data:
        r.columns = [v.upper() for v in list(r)]
    vars = []
    for i,d in enumerate(res_data):
        year = i + 2002  # file list starts at 2002
        vars += list(d)
    vars = list(set(vars))
    vars.sort()
    vars = pd.DataFrame(vars,columns=['V'])
    for i,d in enumerate(res_data):
        year = i + 2002
        uc = [v.upper() for v in list(d)]
        vars[str(year)] = 1*vars['V'].isin(uc)
    return res_data, vars

rd, va = get_data()
all_data = pd.concat(rd,axis=0,ignore_index=True)

Note that the same links with the zipped up sav files also have .sps files, so you can see how the numeric variables are encoded. Or pass in the argument convert_cat=True to the get_spss function and it will turn the data into strings based on the labels.

You can check out which variables are available for which years via the va dataframe. They are quite consistent though. The bigger pain is that for older years we have ICD9 codes, and for more recent years ICD10. So it takes a bit of work to normalize between the two (for ICD10, just looking at the first 3 characters is OK; for ICD9 you need to look at the full code). It is similar to NIBRS crime data – a single event can have multiple codes associated with it, so you need to look across all of them to identify whether any are associated with gun assaults.

# Assaulting Gun Violence for ICD9/ICD10
# ICD9, https://www.aapc.com/codes/icd9-codes-range/165/
# ICD9, https://www.cdc.gov/nchs/injury/ice/amsterdam1998/amsterdam1998_guncodes.htm
# ICD10, https://www.icd10data.com/ICD10CM/Codes/V00-Y99/X92-Y09
gv = {'Handgun': ['X93','9650'],
      'Longgun': ['X94','9651','9652','9653'],
      'Othergun': ['X95','9654']}

any_gtype = gv['Handgun'] + gv['Longgun'] + gv['Othergun']
gv['Anygun'] = any_gtype

fields = ['CAUSE1','CAUSE2','CAUSE3']

all_data['Handgun'] = 0
all_data['Longgun'] = 0
all_data['Othergun'] = 0
all_data['Anygun'] = 0


for f in fields:
    for gt, gl in gv.items():
        all_data[gt] = all_data[gt] + 1*all_data[f].isin(gl) + 1*all_data[f].str[:3].isin(gl)

for gt in gv.keys():
    all_data[gt] = all_data[gt].clip(0,1)

There are between 10k and 40k rows in each year, but overall there are very few observations of assaultive gun violence. So even with over 500k rows across the 19 years, there are fewer than 200 incidents of people going to the ER because of a gun assault.

# Not very many, only a handful each
all_data[gv.keys()].sum(axis=0)

# This produces
# Handgun      20
# Longgun      11
# Othergun    139
# Anygun      170

These are far too few in number to estimate survival rate changes over time. So the Brantingham or Ludwig analyses that look at larger register data of healthcare claims, or folks looking at reported crime incident data, are likely to be much more reasonable for estimating those trends. If you do a groupby per year this becomes even more stark:

# Per year it is quite tiny
all_data.groupby('YEAR')[list(gv.keys())].sum()

#         Handgun  Longgun  Othergun  Anygun
# YEAR
# 2002        1        0        12      13
# 2003        5        2         4      11
# 2004        1        3        12      16
# 2005        2        1         7      10
# 2007        1        0        14      15
# 2008        2        2        10      14
# 2009        0        0        12      12
# 2010        1        0        10      11
# 2011        2        1         9      12
# 2012        1        0         6       7
# 2013        0        0         0       0
# 2014        1        0         2       3
# 2015        0        0         6       6
# 2016        0        0         5       5
# 2017        0        0         0       0
# 2018        0        1         4       5
# 2019        0        0         6       6
# 2020        0        0         9       9
# 2021        1        0         4       5

While you can use the survey weights to get national level estimates, the standard errors are based on the observed number of cases. Which in retrospect I should have realized – gun violence is pretty rare, so rates of 1 to 100 per 100,000 would be the usual range. If anything these counts are maybe a tinge higher than I would have guessed (likely due to how the CDC does the sampling).

To be able to do the analysis of survival rates I want, the sample sizes here would need to be 100 times larger than they are. Which would require something more akin to NIBRS reporting by hospitals directly, instead of having CDC do boots on the ground samples. Which is feasible of course (no harder for medical providers to do this than police departments), see SPARCS with New York data for example.

But perhaps others can find this useful. It may be easier to analyze events that occur at rates more like 1 in 100. The data has quite a few variables, like readmission due to other events, public/private insurance, different drugs, and of course all the stuff recorded via ICD10 codes (which covers behavioral as well as health events). So it is probably not a large enough sample to analyze other criminal justice related health care incidents, but those do add up to big victim costs to the state that are easy to quantify, as the medicaid population is a large chunk of that.

Matching mentors to mentees using OR-tools

So the other day on Hackernews, a startup had a question in their entry to interview:

  • You have m mentees
  • You have M mentors
  • You have N match scores between mentees and mentors, based on how good of a match they are for mentorship
  • m x M > N, because not every mentee is eligible to match with every mentor. It depends on seniority constraints.
  • Each mentee has a maximum mentor limit of 1
  • Each mentor has a maximum mentee limit of l, where 0 < l < 4

How would you find the ideal set of matches to maximize match scores across all mentees and mentors?

I am not interested in applying to this start-up (it is in Canada), but it is a fun little distraction. I would use linear programming to solve this problem – it is akin to an assignment/matching problem.

So to make some simple data, I have two mentors and four mentees. Each mentee has a match score with its eligible mentors, and not every mentee has a match score with every mentor.

# using googles or-tools
from ortools.linear_solver import pywraplp

# Create data
menA = {'a':0.2, 'b':0.4, 'c':0.5}
menB = {'b':0.3, 'c':0.4, 'd':0.5}

So pretend each mentor can be assigned two mentees – what is the optimal pairing? If you do a greedy approach, you might say, for menA let’s take the highest scores, b then c. Then for mentor B we only have mentee d left. This is a total score of 0.4 + 0.5 + 0.5 = 1.4. If we do greedy the other way, we get mentor B assigned c and d, and then mentor A can have a and b, with a score of 0.4 + 0.5 + 0.4 + 0.2 = 1.5. And this ends up being the optimal assignment in this scenario.

Here I am going to walk through creating this model in google’s OR-tools package. First we create a few different data structures: one is a set of mentor/mentee pairs, which will end up being our decision variables; the other is a set of the unique mentees (and we will need a set of the unique mentors as well) for the constraints later.

# a few different data structures
# for our linear programming problem
dat = {'A':menA,
       'B':menB}

mentees = []

pairs = {}
for m,vals in dat.items():
    for k,v in vals.items():
        mentees.append(k)
        pairs[f'{m},{k}'] = v

mentees = set(mentees)

Now we can create our model, and here we are maximizing the matches. Edit: Note that GLOP is not an MIP solver; I am leaving the blog post as-is since in my experiments it always returns 0/1’s (this formulation is a bipartite matching, whose constraint matrix is totally unimodular, so the LP optimum should be integral). If you want to be sure to use integer variables though, use pywraplp.Solver.CreateSolver("SAT") instead of GLOP.

# Create model, variables
solver = pywraplp.Solver.CreateSolver("GLOP")
matchV = [solver.IntVar(0, 1, p) for p in pairs.keys()]

# set objective, maximize matches
objective = solver.Objective()
for m in matchV:
    objective.SetCoefficient(m, pairs[m.name()])

objective.SetMaximization()

Now we have two sets of constraints. One set is that for each mentor, they are only assigned two individuals. And then for each mentee, they are only assigned one mentor.

# constraints, mentors only get 2
Mentor_constraints = {d:solver.RowConstraint(0, 2, d) for d in dat.keys()}
# mentees only get 1
Mentee_constraints = {m:solver.RowConstraint(0, 1, m) for m in mentees}
for m in matchV:
    mentor,mentee = m.name().split(",")
    Mentor_constraints[mentor].SetCoefficient(m, 1)
    Mentee_constraints[mentee].SetCoefficient(m, 1)

Now we are ready to solve the problem:

# solving the model
status = solver.Solve()
print(objective.Value())

# Printing the results
for m in matchV:
    print(m,m.solution_value())

Which then prints out:

>>> print(objective.Value())
1.5
>>>
>>>
>>> # Printing the results
>>> for m in matchV:
...     print(m,m.solution_value())
...
A,a 1.0
A,b 0.0
A,c 1.0
B,b 1.0
B,c 0.0
B,d 1.0

So here it actually chose a different solution than I listed above, but is also maximizing the match scores at 1.5 total. Mentor A with {a,c} and Mentor B with {b,d}.

Sometimes people are under the impression that linear programming is only for tiny data problems. Here the problem size grows depending on how filled in the mentor/mentee pairs are. So if you have 1000 mentors, and they each have 50 potential mentees, the total number of decision variables will be 50,000, which can still be solved quite fast. Note the problem does not per se grow with more mentees, since it can be a sparse problem.

To show this a bit easier, here is a function to return the nice matched list:

def match(menDict,num_assign=2):
    # prepping the data
    mentees,mentors,pairs = [],[],{}
    for m,vals in menDict.items():
        for k,v in vals.items():
            mentees.append(str(k))
            mentors.append(str(m))
            pairs[f'{m},{k}'] = v
    mentees,mentors = list(set(mentees)),list(set(mentors))
    print(f'Total decision variables {len(pairs.keys())}')
    # creating the problem
    solver = pywraplp.Solver.CreateSolver("GLOP")
    matchV = [solver.IntVar(0, 1, p) for p in pairs.keys()]
    # set objective, maximize matches
    objective = solver.Objective()
    for m in matchV:
        objective.SetCoefficient(m, pairs[m.name()])
    objective.SetMaximization()
    # constraints, mentors only get num_assign
    Mentor_constraints = {d:solver.RowConstraint(0, num_assign, f'Mentor_{d}') for d in mentors}
    # mentees only get 1 mentor
    Mentee_constraints = {m:solver.RowConstraint(0, 1, f'Mentee_{m}') for m in mentees}
    for m in matchV:
        mentor,mentee = m.name().split(",")
        #print(mentor,mentee)
        Mentor_constraints[mentor].SetCoefficient(m, 1)
        Mentee_constraints[mentee].SetCoefficient(m, 1)
    # solving the model
    status = solver.Solve()
    # figuring out whom is matched
    matches = {m:[] for m in mentors}
    for m in matchV:
        if m.solution_value() > 0.001:
            mentor, mentee = m.name().split(",")
            matches[mentor].append(mentee)
    return matches

I have num_assign as a constant for all mentors, but it could be variable as well. So here is an example with around 50k decision variables:

# Now lets do a much larger problem
from random import seed, random, sample
seed(10)
mentors = list(range(1000))
mentees = list(range(4000))

# choose random 50 mentees
# and give random weights
datR = {}
for m in mentors:
    sa = sample(mentees,k=50)
    datR[m] = {s:random() for s in sa}

# This still takes just a few seconds on my machine
# 50k decision variables
res = match(datR,num_assign=4)

And this takes less than 2 seconds on my desktop, which much of that is probably setting up the problem.

For very big problems, you can separate out the network into connected components, and then run the linear program on each separate component. Only in the case of dense and totally connected networks will you need to worry much about running out of memory, I suspect.


So I have written in the past how I stopped doing homework assignments for interviews. More recently at work we give basic pair coding sessions – something like a python hello world problem, and then a SQL question that requires knowing GROUP BY. If they get that far (many people who say they have years of data science experience fail at those two), we then go on to making environments, git, and describing some machine learning methods.

You would be surprised how many data scientists I interview only know how to code in Jupyter notebooks and cannot do a hello world program from the command line. It is part of the reason I am writing my intro python book. Check out the table of contents and first few chapters below. Only one final chapter on an end-to-end project remains before the book is released on Amazon. (So this is close to the last opportunity to purchase at the lower price before then.)

Harmweighted hotspots, using ESRI python API, and Crime De-Coder Updates

Haven’t gotten the time to publish a blog post here in a while. There has been a ton of stuff I have put out on my Crime De-Coder website recently. For some samples since I last posted here, I have published four blog posts:

  • on what AI regulation in policing would look like
  • high level advice on creating dashboards
  • overview of early warning systems for police
  • types of surveys for police departments

For surveys, a few different groups have reached out to me in regards to the NIJ measuring attitudes solicitation (which is essentially a follow up to the competition Gio and I won). So get in touch if interested (whether a PD or a research group) – I may try to coordinate everyone to have one submission instead of several competing ones.

To keep up with everything, my suggestion is to sign up for the RSS feed on the site. If you want an email, use the If This Then That service. (I may have to stop doing my AltAc newsletter emails – it is so painful to send 200 emails and I really don’t care to sign up for another paid service to do that.)

I also have continued the AltAc newsletter. Getting started with LLMs, using secrets, advice on HTML, all sorts of little pieces of advice every other week.

I have created a new page for presentations, including my recent presentation at the Carolina Crime Analysis Association Conference. (Pic courtesy of Joel Caplan, who was repping his Simsi product – thank you Joel!)

If other regional IACA groups are interested in a speaker always feel free to reach out.

And finally a new demo on creating a static report using quarto/python. It uses a word template I created (I often like generating word documents since they are easier to post-hoc edit – it is ok to automate 90% and still need a few manual tweaks).

Harmweighted Hotspots

If you like this blog, also check out Iain Agar’s posts, GIS/SQL/crime analysis – the good stuff. Here I wanted to make a quick note about his post on weighting Crime Harm spots.

So the idea is that when mapping harm spots, you could have two different areas with the same high harm, but say one location had 1 murder and the other had 100 thefts. So if the murder harm weight = 100 and the theft harm weight = 1, they would be equal in weight. Iain talks about different transformations of harm, but another way to think about it is in terms of variance. So here, assuming Poisson variance (although in practice that is not necessary – you could estimate the variance given enough historical time series data), you would have for your two hotspots:

Hotspot1: mean 1 homicide, variance 1
Hotspot2: mean 100 thefts, variance 100

Weight of 100 for homicides, 1 for theft

Hotspot1: Harmweight = 1*100 = 100
          Variance = 100^2*1 = 10,000
          SD = sqrt(10,000) = 100

Hotspot2: Harmweight = 100*1 = 100
          Variance = 1^2*100 = 100
          SD = sqrt(100) = 10

When you multiply by a constant, which is what you are doing when multiplying by harm weights, the relationship with the variance is Var(const*x) = const^2*Var(x). The harm weights inflate the variance, so you may simply add a penalty term, or rank by something like Harmweight - 2*SD (the lower end of the harm CI). So in this example, the low end of the CI for Hotspot1 is 0 (truncating at zero), but the low end of the CI for Hotspot2 is 80. So you would rank Hotspot2 higher, even though they have the same point estimate of harm.

The rank by low CI is a trick I learned from Evan Miller’s blog.

You could fancy this up more with estimating actual models, having multiple harm counts, etc. But this is a quick way to do it in a spreadsheet with just simple counts (assuming Poisson variance). Which I think is often quite reasonable in practice.
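
Here is a minimal pandas sketch of that ranking, using the two hypothetical hotspots from above:

import pandas as pd

# hypothetical counts for the two hotspots
dat = pd.DataFrame({'Homicide': [1, 0], 'Theft': [0, 100]},
                   index=['Hotspot1','Hotspot2'])
weights = {'Homicide': 100, 'Theft': 1}

harm = sum(dat[c]*w for c, w in weights.items())
# Var(w*x) = w**2 * Var(x), with Poisson variance Var(x) = count
var = sum(dat[c]*(w**2) for c, w in weights.items())
low_ci = (harm - 2*var**0.5).clip(lower=0)

res = pd.DataFrame({'Harm': harm, 'LowCI': low_ci})
print(res.sort_values('LowCI', ascending=False))  # Hotspot2 ranks higher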

Using ESRI Python API

So I knew you could use python in ESRI – they have a notebook interface now. What I did not realize is that now, with Pro, you can simply do pip install arcgis and then just interact with your org. So for a quick example:

from arcgis.gis import GIS

# Your ESRI url
gis = GIS("https://modelpd.maps.arcgis.com/", username="user_email", password="???yourpassword???")
# For batch geocoding, probably need to do GIS(api_key=<your api key>)

This can be in whatever environment you want, so you don’t even need ArcGIS installed on the system to use this – it is all web APIs with Pro. To geocode, for example, you would then do:

from arcgis.geocoding import geocode, Geocoder, get_geocoders, batch_geocode

# Can search to see if any nice soul has published a geocoding server

arcgis_online = GIS()
items = arcgis_online.content.search('geocoder north carolina', 'geocoding service', max_items=30)

# And we have four
#[<Item title:"North Carolina Address Locator" type:Geocoding Layer owner:ecw31_dukeuniv>,
# <Item title:"Southeast North Carolina Geocoding Service" type:Geocoding Layer owner:RaleighGIS>, 
# <Item title:"Geocoding Service - AddressNC " type:Geocoding Layer owner:nconemap>, 
# <Item title:"ArcGIS World Geocoding Service - NC Extent" type:Geocoding Layer owner:NCDOT.GOV>]

geoNC = Geocoder.fromitem(items[0]) # lets try Duke
#geoNC = Geocoder.fromitem(items[-1]) # NCDOT.GOV
# can also do directly from URL
# via items[0].url
# url = 'https://utility.arcgis.com/usrsvcs/servers/8caecdf6384144cbafc9d56944af1ccf/rest/services/World/GeocodeServer'
# geoNC = Geocoder(url,gis)

# DPAC
res = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1)
print(res[0])
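In case it is helpful, here is a sketch of pulling out the matched address and coordinates – this assumes the usual geocode response structure (a list of dicts with address, score, and location keys), so double check against what print(res[0]) shows on your end:

# assuming the standard response structure, check print(res[0]) to verify
hit = res[0]
print(hit['address'], hit['score'])
print(hit['location']['x'], hit['location']['y'])  # longitude, latitude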

Note you cannot store (cache) the geocoding results this way. To do that, you need to use credits, and probably sign in via a token rather than a username/password.

# To cache, need a token
r2 = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1,for_storage=True)

# If you have multiple addresses, use batch_geocode, again need a token
#dc_res = batch_geocode(FullAddressList, geocoder=geoNC) 

Geocoding to this day is still such a pain. I will need to figure out if you can make a local geocoding engine with ESRI and then call that through Pro (I mean I know you can, I am just not sure about the pricing for all that).

Overall, being able to work directly in python makes my life so much easier. I will need to dig more into making some standard dashboards and ETL processes using ESRI’s tools.

I have another post that has been half finished about using the ESRI web APIs, hopefully will have time to put that together before another 6 months passes me by!

Sentence embeddings and caching huggingface models in github actions

For updates on Crime De-Coder:

This is actually born more out of demand from interacting with crime analysis groups – it doesn’t really make sense for me to swoop in and say “do X/Y/Z, here is my report”. My experience doing that with PDs is not good – my recommendations often do not get implemented. So it is important to upskill your current analysts to conduct more rigorous analysis over time.

For some things it is better to have an external person do the report (such as program evaluation, cost-benefit analysis, or racial bias metrics). But things like generating hotspots and chronic offender lists should be something your crime analysis unit learns how to do.

You can of course hire me to do those things, but if you don’t train your local analysts to take over the job, you have to perpetually hire me to do that. Which may make sense for small departments without an internal crime analysis unit. But for large agencies you should be using your crime analysts to do reports like that.


In this post, I also wanted to share some work on NLP. I have not been immune to the craze at work. I am more impressed with vector embeddings and semantic search, though, than I am with GenAI applications. GenAI I think will be very useful, like a more advanced auto-complete, but I don’t think it will put us all out of work anytime soon.

Vector embeddings, for those not familiar, are taking a text input and turning it into numbers. You can then do nearness calculations with the numbers. So say you had plain text crime narrative notes on modus operandi, and wanted to search “breaking car window and stealing purse” – you don’t want to do an exact search on that phrase, but a search for text that is similar. So the vector similarity search may return a narrative that is “break truck back-window with pipe, take backpack”. Not exactly the same, but semantically very similar.

The sentence-transformers library makes this super simple (no need to use external APIs for this).

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Base query
query = ['breaking car window and stealing purse']

# Example MO
mo = ['break truck back window with pipe, steal backpack',
      'pushing in a window AC unit, steal TV and sneakers',
      'steal car from cloned fob, joyride']

#Compute embedding for both lists
qemb = model.encode(query, convert_to_tensor=True)
moemb = model.encode(mo, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(moemb, qemb)
print(cosine_scores)

So you have your query, and often people just return the top-k results, since to know “how close is reasonably close” is difficult in practice (google pretty much always returns some results, even if they are off the mark!).
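For example, grabbing the top 2 from the toy scores above (util.cos_sim returns a torch tensor, so torch.topk works directly on it):

import torch

# top 2 most similar MO narratives for the query
topk = torch.topk(cosine_scores.squeeze(), k=2)
for score, idx in zip(topk.values, topk.indices):
    print(round(float(score), 3), mo[int(idx)])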

I am quite impressed with the general models for this (even with very idiosyncratic documents and jargon they tend to do quite well). But if you want embeddings from models specifically trained on different domains, I often turn to simpletransformers. Here is a model trained on clinical case notes.

from simpletransformers.language_representation import RepresentationModel

model = RepresentationModel("bert", "medicalai/ClinicalBERT")
sent1 = ["Procedure X-ray of wrist"] # needs to be a list
wv1 = model.encode_sentences(sent1, combine_strategy="mean")

For another example, I shared a github repo, hugging_cache, where I did some work on how to cache huggingface models in a github actions script. This is useful for unit tests, so you don’t need to redownload the model every time.

Github cache size for free accounts is 10 gigs (and you need the environment itself, which is a few gigs). So huggingface models up to say around 4 (maybe 5) gigs will work out OK in this workflow. So not quite up to par with the big llama models, but plenty large enough for these smaller embedding models.

It is not very difficult to build a local, in-memory vector search database, which will be sufficient for many use cases. So a crime analyst could build a local search engine: rebuild the vector DB in a batch job, then have a tool/server to do the local lookup. (Or just use a local sqlite database – that is probably sufficient when only a half dozen analysts/detectives may use the tool every now and then.)
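As a rough sketch of what I mean, a few lines of numpy are enough for the lookup piece (this assumes the embeddings were already computed in the batch job, e.g. the sentence-transformers tensors from above moved to numpy arrays; the top_k helper here is just for illustration):

import numpy as np

def top_k(query_emb, doc_embs, k=3):
    # cosine similarity = dot product of unit-normalized vectors
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# e.g. with the CPU tensors from the sentence-transformers example
# idx, scores = top_k(qemb[0].numpy(), moemb.numpy())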

Forecasts need to have error bars

Richard Rosenfeld, in the most recent Criminologist, published a piece about forecasting national level crime rates. People complain about the FBI releasing crime stats a year late; academics are worse – Richard provided “forecasts” for 2021 through 2025 in an article published in late 2023.

Even ignoring the stalecasts that Richard provided, these forecasts had/have no chance of being correct. Point forecasts will always be wrong – a more reasonable approach is to provide prediction intervals for the forecasts. Showing error intervals around the forecasts makes clear why Richard’s interpretation of minor trends is likely to be misleading.

Here I provide some analysis using ARIMA models (in python) to illustrate what reasonable forecast error looks like in this scenario; code and data are on github.

You can get the dataset on github, but here is the upfront work of loading the libraries I need and getting the data in the right format:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# via https://www.disastercenter.com/crime/uscrime.htm
ucr = pd.read_csv('UCR_1960_2019.csv')
ucr['VRate'] = (ucr['Violent']/ucr['Population'])*100000
ucr['PRate'] = (ucr['Property']/ucr['Population'])*100000
ucr = ucr[['Year','VRate','PRate']]

# adding in more recent years via https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/docApi
# I should use original from counts/pop, I don't know where to find those though
y = [2020,2021,2022]
v = [398.5,387,380.7]
p = [1958.2,1832.3,1954.4]
ucr_new = pd.DataFrame(zip(y,v,p),columns = list(ucr))
ucr = pd.concat([ucr,ucr_new],axis=0)
ucr.index = pd.period_range(start='1960',end='2022',freq='A')

# Richard fits the model for 1960 through 2015
train = ucr.loc[ucr['Year'] <= 2015,'VRate']

Now we are ready to fit our models. To make it as close to apples-to-apples with Richard’s paper as I can, I just fit an ARIMA(1,1,2) model – I do not do a grid search for the best fitting model (also, Richard states he has exogenous factors for inflation in the model, which I do not include here). Note Richard says he fits an ARIMA(1,0,2) for the violent crime rates in the paper, but he also says he differenced the data, which makes it an ARIMA(1,1,2) model:

# Not sure if Richard's model had a trend term, here no trend
violent = ARIMA(train,order=(1,1,2),trend='n').fit()
violent.summary()

This produces the output:

                               SARIMAX Results
==============================================================================
Dep. Variable:                  VRate   No. Observations:                   56
Model:                 ARIMA(1, 1, 2)   Log Likelihood                -242.947
Date:                Sun, 19 Nov 2023   AIC                            493.893
Time:                        19:33:53   BIC                            501.923
Sample:                    12-31-1960   HQIC                           496.998
                         - 12-31-2015
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4545      0.169     -2.688      0.007      -0.786      -0.123
ma.L1          1.1969      0.131      9.132      0.000       0.940       1.454
ma.L2          0.7136      0.100      7.162      0.000       0.518       0.909
sigma2       392.5640    104.764      3.747      0.000     187.230     597.898
===================================================================================
Ljung-Box (L1) (Q):                   0.13   Jarque-Bera (JB):                 0.82
Prob(Q):                              0.72   Prob(JB):                         0.67
Heteroskedasticity (H):               0.56   Skew:                            -0.06
Prob(H) (two-sided):                  0.23   Kurtosis:                         2.42
===================================================================================

So there is some potential evidence of over-differencing (the negative AR(1) coefficient). Looking at violent.test_serial_correlation('ljungbox'), there is no significant serial auto-correlation in the residuals. One could use some sort of auto-arima approach to pick a “better” model (it clearly needs to be differenced at least once, and maybe should be modeling the logged rate). But there is not much to squeeze out of this – pretty much all of the ARIMA models will produce very similar forecasts (and error intervals).
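For reference, that residual check is just a one-liner on the fitted results object:

# Ljung-Box test on the residuals, returns the test statistics and p-values
print(violent.test_serial_correlation('ljungbox'))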

So in the statsmodels package, you can append new data and do one step ahead forecasts, so this is comparable to Richard’s out of sample one step ahead forecasts in the paper for 2016 through 2020:

# To make it apples to apples, only appending through 2020
av = (ucr['Year'] > 2015) & (ucr['Year'] <= 2020)
violent = violent.append(ucr.loc[av,'VRate'], refit=False)

# Now can show insample predictions and forecasts
forecast = violent.get_prediction('2016','2025').summary_frame(alpha=0.05)

If you print(forecast), below are the results. One thing I want to note is that for the one-step-ahead forecasts, here the years 2016 through 2020, the standard error is under 20 (well within Richard’s guesstimate that, to be useful, forecasts need to be under 10% absolute error). When you start forecasting multiple years ahead though, the error compounds over time. So to forecast 2022, you need a forecast of 2021. To forecast 2023, you need to forecast 21, 22, and then 23, etc.

VRate        mean    mean_se  mean_ci_lower  mean_ci_upper
2016   397.743461  19.813228     358.910247     436.576675
2017   402.850827  19.813228     364.017613     441.684041
2018   386.346157  19.813228     347.512943     425.179371
2019   379.315712  19.813228     340.482498     418.148926
2020   379.210158  19.813228     340.376944     418.043372
2021   412.990860  19.813228     374.157646     451.824074
2022   420.169314  39.803285     342.156309     498.182318
2023   416.906654  57.846105     303.530373     530.282936
2024   418.389557  69.535174     282.103120     554.675994
2025   417.715567  80.282625     260.364513     575.066620

The standard error grows with the forecast horizon – each additional step ahead adds more variance (for a pure random walk it would scale like sqrt(steps*se^2); the ARMA terms here make it grow a bit faster). Richard’s forecasts do better than mine for some of the point estimates, but they are similar overall:

# Richard's estimates
forecast['Rosenfeld'] = [399.0,406.8,388.0,377.0,394.9] + [404.1,409.3,410.2,411.0,412.4]
forecast['Observed'] = ucr['VRate']

forecast['MAPE_Andy'] = 100*(forecast['mean'] - forecast['Observed'])/forecast['Observed']
forecast['MAPE_Rick'] = 100*(forecast['Rosenfeld'] - forecast['Observed'])/forecast['Observed']

And this now shows for each of the models:

VRate        mean  mean_ci_lower  mean_ci_upper  Rosenfeld    Observed  MAPE_Andy  MAPE_Rick
2016   397.743461     358.910247     436.576675      399.0  397.520843   0.056002   0.372095
2017   402.850827     364.017613     441.684041      406.8  394.859716   2.023785   3.023931
2018   386.346157     347.512943     425.179371      388.0  383.362999   0.778155   1.209559
2019   379.315712     340.482498     418.148926      377.0  379.421097  -0.027775  -0.638103
2020   379.210158     340.376944     418.043372      394.9  398.500000  -4.840613  -0.903388
2021   412.990860     374.157646     451.824074      404.1  387.000000   6.715985   4.418605
2022   420.169314     342.156309     498.182318      409.3  380.700000  10.367563   7.512477
2023   416.906654     303.530373     530.282936      410.2         NaN        NaN        NaN
2024   418.389557     282.103120     554.675994      411.0         NaN        NaN        NaN
2025   417.715567     260.364513     575.066620      412.4         NaN        NaN        NaN

So my MAPE in the held-out sample is worse than Rick’s for the point estimates, but look at my prediction intervals – the observed values are still totally consistent with the model I have estimated here. Since this is a blog and I don’t need to wait for peer review, I can update my forecasts given more recent data.

# Given updated data until end of series, lets do 23/24/25
violent = violent.append(ucr.loc[ucr['Year'] > 2020,'VRate'], refit=False)
updated_forecast = violent.get_forecast(3).summary_frame(alpha=0.05)

And here are my predictions:

VRate        mean    mean_se  mean_ci_lower  mean_ci_upper
2023   371.977798  19.813228     333.144584     410.811012
2024   380.092102  39.803285     302.079097     458.105106
2025   376.404091  57.846105     263.027810     489.780373

You really need to graph these out to get a sense of the magnitude of the errors:
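A minimal matplotlib sketch for that kind of graph, using the objects from above (converting the period index to timestamps so matplotlib is happy; not the exact figure, just the idea):

# plot the observed series plus the updated forecast and its 95% interval
fc = updated_forecast.copy()
fc.index = fc.index.to_timestamp()
obs = ucr['VRate'].copy()
obs.index = obs.index.to_timestamp()

fig, ax = plt.subplots(figsize=(8,4))
ax.plot(obs.index, obs, color='k', label='Observed')
ax.plot(fc.index, fc['mean'], color='b', label='Forecast')
ax.fill_between(fc.index, fc['mean_ci_lower'], fc['mean_ci_upper'],
                color='b', alpha=0.2, label='95% PI')
ax.legend()
plt.show()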

Note how Richard’s 2021 and 2022 forecasts, and his general increasing trend, have already been proven wrong. But it really doesn’t matter – any model that admitted uncertainty would never let one reasonably interpret minor trends over time the way Richard did in the Criminologist article to begin with (forecasts for ARIMA models are essentially mean-reverting; they trend back to a mean term within a short number of steps). Richard including exogenous factors actually makes this worse, as you then need to forecast inflation and take that forecast error into account for any multiple-year-out forecast.

Richard has consistently over his career overfit models and subsequently read the tea leaves in various macro level correlations (Rosenfeld, 2018). His current theory of inflation and crime is no different. I agree that forecasting is the way to validate criminological theories – but picking up a new pet theory every time you are proven wrong will not, I believe, result in any substantive progress in criminology. Most of the short term trends criminologists interpret are simply due to normal volatility in the models over time (Yim et al., 2020). David McDowall has a recent article that is much more measured about our cumulative knowledge of macro level crime rate trends, and how they can potentially be related to different criminological theories (McDowall, 2023). Matt Ashby has a paper that compares typical errors for city level forecasts – forecasting several years out tends to produce quite inaccurate estimates, quite a bit larger than Richard’s 10%-to-be-useful threshold (Ashby, 2023).

The final point I want to make is that honestly it doesn’t even matter. Richard can keep making dramatic errors in macro level forecasts – it doesn’t matter if he publishes estimates that are two-plus years old and already wrong before they go into print. Because unlike what Richard says, national, macro level violent crime forecasts do not help policy response – why would Pittsburgh care about the national level crime forecast? They should not. It does not matter if we fit models that are more accurate than 5% (or 1%, or whatever); they are not helpful to folks on the hill. No one is sitting in the COPS office thinking “hmm, two years from now violent crime rates are going up by 10, let’s fund 1342 more officers to help with that”.

Richard can’t have skin in the game for his perpetually wrong macro level crime forecasts – there is no skin to have. I am a nerd, so I like looking at numbers and fitting models (or here it is more like that XKCD comic of yelling at people on the internet). I don’t need to make up fairy-tale hypothetical “policy” applications for the forecasts though.

If you want a real application of crime forecasts, I have estimated for cities that adding an additional home or apartment unit increases the number of calls for service by about 1 per year. So for growing cities, that is how I suggest making longer term allocation plans to increase police staffing to meet the increased demand.

References

Fitting a gamma distribution (with standard errors) in python

Over on Crime De-Coder, I have up a pre-release of my book, Data Science for Crime Analysis with Python.

I have gotten asked by so many people “where to learn python” over the past year that I decided to write a book. Go check out my preface on why I am not happy with the current resources on the market, especially for beginners.

I have the preface + first two chapters posted. If you want to sign up for early releases, send me an email and I will let you know how to get an early copy of the first half of the book (and I will send updates as chapters are finished).

Also I have up the third installment of the AltAc newsletter. Notes about the different types of healthcare jobs, and StatQuest youtube videos are a great resource. Email me if you want to sign up for the newsletter.


So the other day I showed how to fit a beta-binomial model in python. Today, in a quick post, I am going to show how to estimate standard errors for such fitted models.

So in scipy, you have distribution.fit(data) to fit the distribution and return the estimates, but it does not give standard errors around those estimates. I recently had need for gamma fits to data; in R I would do something like:

# this is R code
library(fitdistrplus)
x <- c(24,13,11,11,11,13,9,10,8,9,6)
gamma_fit <- fitdist(x,"gamma")
summary(gamma_fit)

And this pipes out:

> Fitting of the distribution ' gamma ' by maximum likelihood
> Parameters :
>        estimate Std. Error
> shape 8.4364984  3.5284240
> rate  0.7423908  0.3199124
> Loglikelihood:  -30.16567   AIC:  64.33134   BIC:  65.12713
> Correlation matrix:
>           shape      rate
> shape 1.0000000 0.9705518
> rate  0.9705518 1.0000000

Now, for some equivalent in python.

# python code
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

x = [24,13,11,11,11,13,9,10,8,9,6]

res = gamma.fit(x,floc=0)
print(res)

And this gives (8.437256958682587, 0, 1.3468401423927603). Note that python uses the scale version, so 1/scale = rate, e.g. 1/res[-1] results in 0.7424786123640676, which agrees pretty closely with R.

Now unfortunately, there is no simple way to get the standard errors in this framework. You can however minimize the negative log-likelihood yourself and get the pieces you need – here the (inverse) Hessian – to calculate the standard errors.

Here to make it easier apples to apples with R, I estimate the rate parameter version.

# minimize negative log likelihood
def gll(parms,data):
    alpha, rate = parms
    # gamma log density using the rate parameterization (scale = 1/rate)
    ll = gamma.logpdf(data,alpha,loc=0,scale=1/rate)
    return -ll.sum()

result = minimize(gll,[8.,0.7],args=(x,),method='BFGS')
# BFGS keeps a running approximation of the inverse Hessian;
# the square root of its diagonal gives the standard errors
se = np.sqrt(np.diag(result.hess_inv))
# printing estimates and standard errors side by side
np.stack([result.x,se],axis=1)

And this gives:

array([[8.43729465, 3.32538824],
       [0.74248193, 0.30255433]])

Pretty close to R, but not an exact match. Here the standard errors are a bit too low (the BFGS inverse Hessian is only an approximation, and maybe there is a small-sample correction factor as well?).

In some quick tests with this, the scale version tends to be a bit more robust in fitting than the rate version.

For context, a gamma distribution with a shape parameter greater than 1 has a hump, and one with a shape parameter less than 1 is monotonically decreasing. This corresponds to the “buffer” hypothesis in criminology for journey-to-crime distributions. So hopefully some work related to that is coming up soonish!
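A quick numeric illustration of that shape behavior, just evaluating the density on a grid:

import numpy as np
from scipy.stats import gamma

xs = np.linspace(0.01, 10, 500)
# shape < 1: density is monotonically decreasing, the max is at the left edge
print(xs[gamma.pdf(xs, 0.5).argmax()])  # ~0.01
# shape > 1: density has an interior hump, mode = (shape - 1)*scale
print(xs[gamma.pdf(xs, 8.4).argmax()])  # ~7.4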


And doing a little more digging, you can get “closed form” estimates of the standard errors based on the Fisher Information matrix (via Wikipedia).

# Calculating closed form standard
# errors for gamma via Fisher Info
from scipy.special import polygamma
from numpy.linalg import inv
import numpy as np

def trigamma(x):
    return float(polygamma(1,x))

# These are the R estimates
shape = 8.4364984
rate = 0.7423908
n = 11 # 11 observations

fi = np.array([[trigamma(shape), -1/rate],
               [-1/rate, shape/rate**2]])

inv_fi = inv(n*fi)  # this is variance/covariance
np.sqrt(np.diagonal(inv_fi))

# R reports      3.5284240 0.3199124
# python reports 3.5284936 0.3199193

I do scare quotes around closed form, as it relies on the trigamma function, which itself needs to be calculated numerically I believe.

If you want to go through the annoying math (instead of using the matrix inverse above), you can get a direct one-liner for either parameter.

# Closed form for standard error of shape
np.sqrt(shape / (n*(trigamma(shape)*shape - 1)))
# 3.528493591892887

# This works for the scale or the rate parameterization
np.sqrt((trigamma(shape))/ (n*rate**-2*(trigamma(shape)*shape - 1)))
# 0.31995664847081356

# This is the standard error for the scale version
scale = 1/rate
np.sqrt((trigamma(shape))/ (n*scale**-2*(trigamma(shape)*shape - 1)))
# 0.5803962127677769