All posts in category Python

Fitting a pytorch model

Out of the box when fitting pytorch models we typically run through a manual loop. So typically something like this:

# Example fitting a pytorch model
# mod is the pytorch model object
opt = torch.optim.Adam(mod.parameters(), lr=1e-4)
crit = torch.nn.MSELoss(reduction='mean')
for t in range(20000):
    opt.zero_grad()
    y_pred = mod(x)   #x is tensor of independent vars
    loss = crit(y_pred,y) #y is tensor of outcomes
    loss.backward()
    opt.step()

And this would use backpropogation to adjust our model parameters to minimize the loss function, here just the mean square error, over 20,000 iterations. Best practices are to both evaluate the loss in-sample and wait for it to flatten out, as well as evaluate out of sample.

I recently wrote some example code to make this process somewhat more like the sklearn approach, where you instantiate an initial model object, and then use a mod.fit(X, y) function call to fit the pytorch model. For an example use case I will just use a prior Compas recidivism data I have used for past examples on the blog (see ROC/Calibration plots, and Balancing False Positives). Here is the prepped CSV file to download to follow along.

So first, I load the libraries and then prep the recidivism data before I fit my predictive models.

###############################################
# Front end libraries/data prep

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch

# Setting seeds
torch.manual_seed(10)
np.random.seed(10)

# Prepping the Compas data and making train/test
recid = pd.read_csv('PreppedCompas.csv')

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy()
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")
dum_race = pd.get_dummies(recid['race'])

# White for reference category
for d in list(dum_race):
    if d != 'Caucasion':
        recid_prep[d] = dum_race[d]

# reference category is separated/unknown/widowed
dum_mar = pd.get_dummies(recid['marital_status'])
recid_prep['Single'] = dum_mar['Single']
recid_prep['Married'] = dum_mar['Married'] + dum_mar['Significant Other']

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Independant variables
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis',
            'African-American','Asian','Hispanic','Native American','Other',
            'Single','Married']

# Dependent variable
y_var = 'Recid30'
###############################################

Now next part is more detailed, but it is the main point of the post. Typically we will make a pytorch model object something like this. Here I have various switches, such as the activation function (tanh or relu or pass in your own function), or the final function to limit predictions to 0/1 (either sigmoid or clamp or again pass in your own function).

# Initial pytorch model class
class logit_pytorch(torch.nn.Module):
    def __init__(self, nvars, device, activate='relu', bias=True,
                 final='sigmoid'):
        """
        Construct parameters for the coefficients 
        activate - either string ('relu' or 'tanh', 
                   or pass in your own torch function
        bias - whether to include bias (intercept) in model
        final - use either 'sigmoid' to squash to probs, or 'clamp'
                or pass in your own torch function
        device - torch device to construct the tensors
                 default cuda:0 if available
        """
        super(logit_pytorch, self).__init__()
        # Creating the coefficient parameters
        self.coef = torch.nn.Parameter(torch.rand((nvars,1),
                    device=device)/10)
        # If no bias it is 0
        if bias:
            self.bias = torch.nn.Parameter(torch.zeros(1,
                    device=device))
        else:
            self.bias = torch.zeros(1, device=device)
        # Various activation functions
        if activate == 'relu':
            self.trans = torch.nn.ReLU()
        elif activate == 'tanh':
            self.trans = torch.nn.Tanh()
        else:
            self.trans = activate
        if final == 'sigmoid':
            self.final = torch.nn.Sigmoid()
        elif final == 'clamp':
            # Defining my own clamp function
            def tclamp(input):
                return torch.clamp(input,min=0,max=1)
            self.final = tclamp
        else: 
            # Can pass in your own function
            self.final = final
    def forward(self, x):
        """
        predicted probability
        """
        output = self.bias + torch.mm(x, self.trans(self.coef))
        return self.final(output)

To use this though again we need to specify the number of coefficients to create, and then do a bunch of extras like the optimizer, and stepping through the function (like described at the beginning of the post). So here I have created a second class that behaves more like sklearn objects. I create the empty object, and only when I pass in data to the .fit() method it spins up the actual pytorch model with all its tensors of the correct dimensions.

# Creating a class to instantiate model to data and then fit
class pytorchLogit():
    def __init__(self, loss='logit', iters=25001, 
                 activate='relu', bias=True, 
                 final='sigmoid', device='gpu',
                 printn=1000):
        """
        loss - either string 'logit' or 'brier' or own pytorch function
        iters - number of iterations to fit (default 25000)
        activate - either string ('relu' or 'tanh', 
                   or pass in your own torch function
        bias - whether to include bias (intercept) in model
        final - use either 'sigmoid' to squash to probs, or 'clamp'
                or pass in your own torch function. Should not use clamp
                with default logit loss
        opt - ?optimizer? should add an option for this
        device - torch device to construct the tensors
                 default cuda:0 if available
        printn - how often to check the fit (default 1000 iters)
        """
        super(pytorchLogit, self).__init__()
        if loss == 'logit':
            self.loss = torch.nn.BCELoss()
            self.loss_name = 'logit'
        elif loss == 'brier':
            self.loss = torch.nn.MSELoss(reduction='mean')
            self.loss_name = 'brier'
        else:
            self.loss = loss
            self.loss_name = 'user defined function'
        # Setting the torch device
        if device == 'gpu':
            try:
                self.device = torch.device("cuda:0")
                print(f'Torch device GPU defaults to cuda:0')
            except:
                print('Unsuccessful setting to GPU, defaulting to CPU')
                self.device = torch.device("cpu")
        elif device == 'cpu':
            self.device = torch.device("cpu")
        else:
            self.device = device #can pass in whatever
        self.iters = iters
        self.mod = None
        self.activate = activate
        self.bias = bias
        self.final = final
        self.printn = printn
        # Other stats to carry forward
        self.loss_metrics = []
        self.epoch = 0
    def fit(self, X, y, outX=None, outY=None):
        x_ten = torch.tensor(X.to_numpy(), dtype=torch.float,
                             device=self.device)
        y_ten = torch.tensor(pd.DataFrame(y).to_numpy(), dtype=torch.float,
                             device=self.device)
        # Only needed if you pass in an out of sample to check as well
        if outX is not None:
            x_out_ten = torch.tensor(outX.to_numpy(), dtype=torch.float,
                             device=self.device)
            y_out_ten = torch.tensor(pd.DataFrame(outY).to_numpy(), dtype=torch.float,
                             device=self.device)
        self.epoch += 1
        # If mod is not already created, create a new one, else update prior
        if self.mod is None:
            loc_mod = logit_pytorch(nvars=X.shape[1], activate=self.activate, 
                                    bias=self.bias, final=self.final, 
                                    device=self.device)
            self.mod = loc_mod
        else:
            loc_mod = self.mod
        opt = torch.optim.Adam(loc_mod.parameters(), lr=1e-4)
        crit = self.loss
        for t in range(self.iters):
            opt.zero_grad()
            y_pred = loc_mod(x_ten)
            loss = crit(y_pred,y_ten)
            if t % self.printn == 0:
                if outX is not None:
                    pred_os = loc_mod(x_out_ten)
                    loss_os = crit(pred_os,y_out_ten)
                    res_tup = (self.epoch, t, loss.item(), loss_os.item())
                    print(f'{t}: insample {res_tup[2]:.4f}, outsample {res_tup[3]:.4f}')
                else:
                    res_tup = (self.epoch, t, loss.item(), None)
                    print(f'{t}: insample {res_tup[2]:.5f}')
                self.loss_metrics.append(res_tup)
            loss.backward()
            opt.step()
    def predict_proba(self, X):
        x_ten = torch.tensor(X.to_numpy(), dtype=torch.float,
                             device=self.device)
        res = self.mod(x_ten)
        pp = res.cpu().detach().numpy()
        return np.concatenate((1-pp,pp), axis=1)
    def loss_stats(self, plot=True, select=0):
        pd_stats = pd.DataFrame(self.loss_metrics, columns=['epoch','iteration',
                                                            'insamploss','outsamploss'])
        if plot:
            pd_stats2 = pd_stats.rename(columns={'insamploss':'In Sample Loss', 'outsamploss':'Out of Sample Loss'})
            pd_stats2 = pd_stats2[pd_stats2['iteration'] > select].copy()
            ax = pd_stats2[['iteration','In Sample Loss','Out of Sample Loss']].plot.line(x='iteration', 
                            ylabel=f'{self.loss_name} loss')
            plt.show()
        return pd_stats

Again it allows you to pass in various extras, which here are just illustrations for binary predictions (like the loss function as the Brier score or the more typical log-loss). It also allows you to evaluate the fit for just in-sample, or for out of sample data as well. It also allows you to specify the number of iterations to fit.

So now that we have all that work done, here as some simple examples of its use.

# Creating a model and fitting
mod = pytorchLogit()
mod.fit(recid_train[ind_vars], recid_train[y_var])

So you can see that this is very similar now to sklearn functions. It will print at the console fit statistics over the iterations:

So it defaults to 25k iterations, and you can see that it settles down much before that. I created a predict_proba function, same as most sklearn model objects for binary predictions:

# Predictions out of sample
predprobs = mod.predict_proba(recid_test[ind_vars])
predprobs # 1st column is probability 0, 2nd prob 1

And this returns a numpy array (not a pytorch tensor). Although you could modify to return a pytorch tensor if you wanted it to (or give an option to specify which).

Here is an example of evaluating out of sample fit as well, in addition to specifying a few more of the options.

# Evaluating predictions out of sample, more iterations
mod2 = pytorchLogit(activate='tanh', iters=40001, printn=100)
mod2.fit(recid_train[ind_vars], recid_train[y_var], recid_test[ind_vars], recid_test[y_var])

I also have an object function, .loss_stats(), which gives a nice graph of in-sample vs out-of-sample loss metrics.

# Making a nice graph
dp = mod2.loss_stats()

We can also select the loss function to only show later iterations, so it is easier to zoom into the behavior.

# Checking out further along
mod2.loss_stats(select=10000)

And finally like I said you could modify some of your own functions here. So instead of any activation function I pass in the identity function – so this turns the model into something very similar to a vanilla logistic regression.

# Inserting in your own activation (here identity function)
def ident(input):
    return input

mod3 = pytorchLogit(activate=ident, iters=40001, printn=2000)
mod3.fit(recid_train[ind_vars], recid_train[y_var], recid_test[ind_vars], recid_test[y_var])

And then if you want to access the coefficients weights, it is just going down the rabbit hole to the pytorch object:

# Can get the coefficients/intercept
print( mod3.mod.coef )
print( mod3.mod.bias )

This type of model can of course be extended however you want, but modifying the pytorchLogit() and logit_pytorch class objects to specify however detailed switches you want. E.g. you could specify adding in hidden layers.

One thing I am not 100% sure the best way to accomplish is loss functions that take more parameters, as well as the best way to set up the optimizer. Maybe use *kwargs for the loss function. So for my use cases I have stuffed extra objects into the initial class, so they are there later if I need them.

Also here I would need to think more about how to save the model to disk. The model is simple enough I could dump the tensors to numpy, and on loading re-do them as pytorch tensors.

1 Comment

by Andy Wheeler on May 24, 2021 • Permalink

Posted in data science, Python

Tagged logistic-regression, pytorch

Posted by Andy Wheeler on May 24, 2021

https://andrewpwheeler.com/2021/05/24/fitting-a-pytorch-model/

ROC and calibration plots for binary predictions in python

When doing binary prediction models, there are really two plots I want to see. One is the ROC curve (and associated area under the curve stat), and the other is a calibration plot. I have written a few helper functions to make these plots for multiple models and multiple subgroups, so figured I would share, binary plots python code. To illustrate their use, I will use the same Compas recidivism data I have used in the past, (CSV file here). So once you have downloaded those two files you can follow along with my subsequent code.

Front Prep

First, I have downloaded the binary_plots.py file and the PreppedCompas.csv file to a particular folder on my machine, D:\Dropbox\Dropbox\Documents\BLOG\binary_plots. To import these functions, I append that path using sys, and change the working directory using os. The other packages are what I will be using the fit the models.

###############################################
# Front end prep

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import os
import sys

my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\binary_plots'
os.chdir(my_dir)

# Can append to path
sys.path.append(my_dir)
import binary_plots

np.random.seed(10) #setting the seed for the random
# split for train/test
###############################################

Next up I prepare the data, this is just boring stuff turning categorical variables into various dummies and making a train/test split for the data (which can be done in a variety of ways).

###############################################
# Prepping Compas Data

#For notes on data source, check out 
#https://github.com/apwheele/ResearchDesign/tree/master/Week11_MachineLearning
recid = pd.read_csv('PreppedCompas.csv')

#Preparing the variables I want
recid_prep = recid[['Recid30','CompScore.1','CompScore.2','CompScore.3',
                    'juv_fel_count','YearsScreening']].copy()
recid_prep['Male'] = 1*(recid['sex'] == "Male")
recid_prep['Fel'] = 1*(recid['c_charge_degree'] == "F")
recid_prep['Mis'] = 1*(recid['c_charge_degree'] == "M")

print( recid['race'].value_counts() )
dum_race = pd.get_dummies(recid['race'])
# White for reference category
for d in list(dum_race):
    if d != 'Caucasion':
        recid_prep[d] = dum_race[d]

print( recid['marital_status'].value_counts() )
dum_mar = pd.get_dummies(recid['marital_status'])
recid_prep['Single'] = dum_mar['Single']
recid_prep['Married'] = dum_mar['Married'] + dum_mar['Significant Other']
# reference category is separated/unknown/widowed

#Now generating train and test set
recid_prep['Train'] = np.random.binomial(1,0.75,len(recid_prep))
recid_train = recid_prep[recid_prep['Train'] == 1].copy()
recid_test = recid_prep[recid_prep['Train'] == 0].copy()

#Independant variables
ind_vars = ['CompScore.1','CompScore.2','CompScore.3',
            'juv_fel_count','YearsScreening','Male','Fel','Mis',
            'African-American','Asian','Hispanic','Native American','Other',
            'Single','Married']

# Dependent variable
y_var = 'Recid30'
###############################################

Next, the sklearn library makes it quite easy to fit a set of multiple models. Most of the time I start with XGBoost, random forests, and a normal logistic model with no coefficient penalty. I just stuff the base model object in a dictionary, pipe in the same training data, and fit the models. Then I can add in the predicted probabilities from each model into the test dataset. (These plots I show you should only show on the test dataset, of course the data will be calibrated on the training dataset.)

###############################################
# Training three different models, Logit,
# Random Forest, and XGBoost

final_models = {}
final_models['XGB'] = XGBClassifier(n_estimators=100, max_depth=5)
final_models['RF'] = RandomForestClassifier(n_estimators=1000, max_depth=10, min_samples_split=50)
final_models['Logit'] = LogisticRegression(penalty='none', solver='newton-cg')

# Iterating over each model and fitting on train
for nm, mod in final_models.items():
    mod.fit(recid_train[ind_vars], recid_train[y_var])

# Adding predicted probabilities back into test dataset
for nm, mod in final_models.items():
    # Predicted probs out of sample
    recid_test[nm] =  mod.predict_proba(recid_test[ind_vars])[:,1]
###############################################

This is fairly tiny data, so I don’t need to worry about how long this takes or run out of memory. I’d note you can do the same model, but different hyperparameters in this approach. Such as tinkering with the depth for tree based models is one I by default limit quite a bit.

AUC Plots

First, my goto metric to see the utility of a particular binary prediction model is the AUC stat. This has one interpretation in terms of the concordance stat, an AUC of 0.7 means if you randomly picked a 0 case and a 1 case, the 1 case would have a higher value 70% of the time. So AUC is all about how well your prediction discriminates between the two classes.

So with my binary_plots function, you can generate an ROC curve for the test data for a single column of predictions as so:

# A single column
binary_plots.auc_plot(recid_test, y_var, ['Logit'], save_plot='AUC1.png')

As I have generated predictions for multiple models, I have also generated a similar graph, but stuff the AUC stats in the matplotlib legend:

# Multiple columns to show different models
pred_prob_cols = list(final_models.keys()) #variable names
binary_plots.auc_plot(recid_test, y_var, pred_prob_cols, save_plot='AUC2.png')

It is also the case you want to do these plots for different subgroups of data. In recidivism research, we are often interested in sex and racial breakdowns. Here is the Logit model AUC broken down by Males (1) and Females (0).

# By subgroups in the data
binary_plots.auc_plot_long(recid_test, y_var, 'Logit', group='Male', save_plot='AUC3.png')

So this pulls the labels from the data, but you can pass in strings to get nicer labels. And finally, I show how to put both of these together, both by models and by subgroups in the data. Subgroups are different panels, and you can pass in a fontsize moniker to make the legends smaller for each subplot, and a size for each subplot (they are squares).

# Lets make nicer variable names for Male/Females and Racial Groups
recid_test['Sex'] = recid_test['Male'].replace({0: 'Female', 1:'Male'})
recid_test['Race'] = recid[recid_prep['Train'] == 0]['race']
recid_test['Race'] = recid_test['Race'].replace({'Hispanic': 'Other', 'Asian':'Other', 'Native American':'Other', 'African-American':'Black', 'Caucasian':'White'})

# Now can do AUC plot by gender and model type
binary_plots.auc_plot_wide_group(recid_test, y_var, pred_prob_cols, 'Sex', size=4, leg_size='x-small', save_plot='AUC4.png')

The plots have a wrap function (default wrap at 3 columns), so you can plot as many subgroups as you want. Here is an example combing the sex and race categories:

One limitation to note in these plots, ROC plots are normalized in a way that the thresholds for each subgroup may not be at the same area of the plot (e.g. a FPR of 0.1 for one subgroup implies a predicted probability of 30%, whereas for another subgroup it implies a predicted probability of 40%).

ROC/AUC is definitely not a perfect stat, most of the time we are only interested in the far left hand side of the ROC curve (how well we can identify high risk cases without a ton of false positives). That is why I think drawing the curves are important – one model may have a higher AUC, but it is in an area of the curve not relevant for how you will use the predictions in practice. (For tree based models with limited depth and limited variables, it can produce flat areas in the ROC curve for example.)

But I find the ROC curve/AUC metric the most useful default for both absolute comparisons (how well is this model doing overall), as well as relative model comparisons (is Model A better than Model B).

Most models I work with I can get an AUC of 0.7 without much work, and once I get an AUC of 0.9 I am in the clearly diminishing returns category to tinkering with my model (this is true for both criminology related models I work with, as well as healthcare related models in my new job).

This is of course data dependent, and even an AUC of 0.9 is not necessarily good enough to use in practice (you need to do a cost-benefit type analysis given how you will use the predictions to figure that out).

Calibration Charts

For those with a stat background, these calibration charts I show are a graphical equivalent of the Hosmer-Lemeshow test. I don’t bother conducting the Chi-square test, but visually I find them informative to not only see if an individual model is calibrated, but also to see the range of the predictions (my experience XGBoost will be more aggressive in the range of predicted probabilities, but is not always well calibrated).

So we have the same three types of set ups as with the ROC plots, a single predicted model:

# For single model
binary_plots.cal_data('XGB', y_var, recid_test, bins=60, plot=True, save_plot='Cal1.png')

For multiple models, I always do these on separate subplots, they would be too busy to superimpose. And because it is a single legend, I just build the data and use seaborn to do a nice small multiple. (All of these functions return the dataframe I use to build the final plot in long format.) The original plot was slightly noisy with 60 bins, so I reduce it to 30 bins here, but it is still slightly noisy (but each model is pretty well calibrated). XGBoost has a wider range of probabilities, random forests lowest bin is around 0.1 and max is below 0.8. Logit has lower probabilities but none above 0.8.

# For multiple models
binary_plots.cal_data_wide(pred_prob_cols, y_var, recid_test, bins=30, plot=True, save_plot='Cal2.png')

For a single model, but by subgroups in the data. The smaller other race group is more noisy, but again each model appears to be approximately calibrated.

# For subgroups and one model
binary_plots.cal_data_group('XGB', y_var, 'Race', recid_test, bins=20, plot=True, save_plot='Cal3.png')

And a combo of subgroup data and multiple models. Again the smaller subgroup Females appear more noisy, but all three models appear to be doing OK in this quick example.

# For subgroups and multiple models
binary_plots.cal_data_wide_group(pred_prob_cols, y_var, 'Sex', recid_test, bins=20, plot=True, save_plot='Cal4.png')

Sometimes people don’t bin the data (Peter Austin likes to do a smoothed plot), but I find the binned data easier to follow and identify deviations above/below predicted probabilities. In real life you often have some fallout/dropoff if there is feedback between the model and how other people respond to the model (e.g. the observed is always 5% below the predicted).

4 Comments

by Andy Wheeler on May 12, 2021 • Permalink

Posted in data science, Data Visualization, Python

Tagged binary-predictions, machine-learning

Posted by Andy Wheeler on May 12, 2021

https://andrewpwheeler.com/2021/05/12/roc-and-calibration-plots-for-binary-predictions-in-python/

Python f string number formatting and SPSS break long labels

Another quick blog post, as moving is not 100% crazy all the time now, but I need a vacation after all that work. So two things in this blog post: formatting numeric f strings in python, and breaking long labels in SPSS meta-data.

Python f-string numeric formatting

This is super simple, but I can never remember it (so making a quick blog post for my own reference). As of python 3.6, you can use f-strings to do simple text substitution. So if you do:

x = 2/3
sub_str = f'This proportion is {x}'
print(sub_str)

Then we will get printed out This proportion is 0.6666666666666666. So packing global items inside of {} expands within the f string. While for more serious string subsitution (like creating parameterized SQL queries), I like to use string templates, these f-strings are very nice to print short messages to the console or make annotations in graphs.

Part of this note is that I never remember how to format these strings. If you are working with integers it is not a big deal, but as you can see above I often do not want to print out all those decimals inside my particular message. A simple way to format the strings are:

f'This proportion is {x:.2f}'

And this prints out to two decimal places 'This proportion is 0.67'. If you have very big numbers (say revenue), you can do something like:

f'This value is ${x*10000:,.0f}'

Which prints out 'This value is $6,667' (so you can modify objects in place, to say change a proportion to a percentage).

Note also to folks that you can have multi-line f-strings by using triple quotes, e.g.:

f'''This is a super
long f-string for {x:.2f}
on multiple lines!'''

But one annoying this is that you need to keep the whitespace correct inside of functions even inside the triple string. So those are cases I like using string templates. But another option is to break up the string and use line breaks via \n.

long_str = (f'This is line 1\n'
            f'Proportion is {x:.1f}\n'
            f'This is line 3')
print(long_str)

Which prints out:

This is line 1
Proportion is 0.7
This is line 3

You could do the line breaks however, either at the beginning of each line or at the end of each line.

SPSS break long labels

This was in reference to a project where I was working with survey data, and for various graphs I needed to break up long labels. So here is an example to illustrate the problem.

* Creating simple data to illustrate.
DATA LIST FREE / X Y(2F1.0).
BEGIN DATA
1 1
2 2
3 3
4 4
END DATA.
DATASET NAME LongLab.
VALUE LABELS X
  1 'This is a reallllllllly long label'
  2 'short label'
  3 'Super long unnecessary label that is long'
  4 'Again another long label what is up with this'
.
VARIABLE LABELS
  X 'Short variable label'
  Y 'This is also a super long variable label that is excessive!'
.
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="g" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))
END GPL.

So you can see, SPSS shrinks the data to accommodate the long labels. (I don’t know how to control the behavior in the graph or the chart template itself, so not sure why only this gets wrapped for the first label.) So we can use the \n line break trick again in SPSS to get these to split where we prefer. Here are some python functions to do the splitting (which I am sure can be improved upon), as well as to apply the splits to the current SPSS dataset. You can decide the split where you want the line to be broken, and so if a word goes above that split level it wraps to the next line.

* Now some python to wrap long labels.
BEGIN PROGRAM PYTHON3.
import spss, spssaux

# Splits a long string with line breaks
def long_str(x,split):
    split_str = x.split(" ")
    cum = len(split_str[0])
    cum_str = split_str[0]
    for s in split_str[1:]:
        cum += len(s) + 1
        if cum <= split:
            cum_str += " " + s
        else:
            cum_str += r"\n" + s
            cum = len(s)
    return cum_str

# This grabs all of the variables in the current SPSS dataset
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]

# This looks at the VALUE LABELS and splits them up on multiple lines
def split_vallab(vList, lsplit):
    vardict = spssaux.VariableDict()
    for v in vardict:
        if v in vList:
            vls= v.ValueLabels.keys()
            if vls:
                for k in vls:
                    ss = long_str(v.ValueLabels[k], lsplit)
                    if ss != v.ValueLabels[k]:
                        vn = v.VariableName
                        cmd = '''ADD VALUE LABELS %(vn)s %(k)s \'%(ss)s\'.''' % ( locals() )
                        spss.Submit(cmd)

# I run this to split up the value labels
split_vallab(varList, 20)

# This function is for VARIABLE LABELS
def split_varlab(vList,lsplit):
    for i,v in enumerate(vList):
        vlab = spss.GetVariableLabel(i)
        if len(vlab) > 0:
            slab = long_str(vlab, lsplit)
            if slab != vlab:
                cmd = '''VARIABLE LABELS %(v)s \'%(slab)s\'.''' % ( locals() )
                spss.Submit(cmd)

# I don't run this right now, as I don't need it
split_varlab(varList, 30)
END PROGRAM.

And now we can re-run our same graph command, and it is alittle nicer:

GGRAPH
  /GRAPHDATASET NAME="g" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))
END GPL.

And you can also go to the variable view to see my inserted line breaks:

SPSS still does some auto-intelligence when to wrap lines in tables/graphs (so if you do DISPLAY DICTIONARY. it will still wrap the X variable label in my default tables, even though I have no line break). But this gives you at least a slight bit of more control over charts/tables.

Leave a comment

by Andy Wheeler on May 5, 2021 • Permalink

Posted in Python, SPSS

Tagged string

Posted by Andy Wheeler on May 5, 2021

https://andrewpwheeler.com/2021/05/05/python-f-string-number-formatting-and-spss-break-long-labels/

Some ACS download helpers and Research Software Papers

The blog has been a bit sparse recently, as moving has been kicking my butt (hanging up curtains and recycling 100 boxes today!). So just a few quick notes.

Downloading ACS Data

First, I have posted some helper functions to work with American Community Survey data (ACS) in python. For a quick overview, if you import/define those functions, here is a quick example of downloading the 2019 Texas micro level files (for census tracts and block groups) from the census FTP site. Can pipe in another year (if available) and and whatever state into the function.

# Python code to download American Community Survey data
base = r'??????' #put your path here where you want to download data
temp = os.path.join(base,'2019_5yr_Summary_FileTemplates')
data = os.path.join(base,'tables')

get_acs5yr(2019,'Texas',base)

Some locations have census tract data to download, I think the FTP site is the only place to download block group data though. And then based on those files you downloaded, you can then grab the variables you want, and here I show selecting out the block groups from those fields:

interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B11003_013','B14006_002','B01001_003','B23025_005',
            'B22010_002','B16002_004','GEOID','NAME']
labs, comp_tabs = merge_tabs(interest,temp,data)
bg = comp_tabs['NAME'].str.find('Block Group') == 0

Then based on that data, I have an additional helper function to calculate proportions given two lists of the numerators and denominators that you want:

top = ['B17010_002',['B11003_016','B11003_013'],'B08141_002']
bot = ['B17010_001',        'B11002_001'       ,'B08141_001']
nam = ['PovertyFamily','SingleHeadwithKids','NoCarWorkers']
prep_sdh = prop_prep(bg, top, bot, nam)

So here to do Single Headed Households with kids, you need to add in two fields for the numerator ['B11003_016','B11003_013']. I actually initially did this example with census tract data, so not sure if all of these fields are available at the block group level.

I have been doing some work on demographics looking at the social determinants of health (see SVI data download, definitions), hence the work with census data. I have posted my prior example fields I use from the census, but criminologists may just use the social-vulnerability-index from the CDC – it is essentially the same as how people typically define social disorganization.

Peer Review for Criminology Software

Second, jumping the gun a bit on this, but in the works is an overlay journal for CrimRxiv. Part of the contributions we will accept are software contributions, e.g. if you write an R package to do some type of analysis function common in criminology.

It is still in the works, but we have some details up currently and a template for submission (I need to work on a markdown template, currently just a word doc). High level I wanted something like the Journal of Statistical Software or the Journal of Open Source Software (I do not think the level of detail of JSS is necessary, but wanted an example use case, which JoSS does not have).

Just get in touch if you have questions whether your work is on topic. Aim is to be more open to contributions at first. Really excited about this, as publicly sharing code is currently a thankless prospect. Having a peer reviewed venue for such code contributions for criminologists fills a very important role that traditional journals do not.

Future Posts?

Hopefully can steal some time to continue writing posts here and there, but will definitely be busy getting the house in order in the next month. Hoping to do some work on mapping grids and KDE in python/geopandas, and writing about the relationship between healthcare data and police incident report data are two topics I hope to get some time to work on in the near future for the blog.

If folks have requests for particular topics on the blog though feel free to let me know in the comments or via email!

Leave a comment

by Andy Wheeler on April 22, 2021 • Permalink

Posted in Criminal Justice, data science, Python, scholarly

Tagged census, research-software

Posted by Andy Wheeler on April 22, 2021

https://andrewpwheeler.com/2021/04/22/some-acs-download-helpers-and-research-software-papers/

Minimum detectable effect sizes for place based designs

So I was reading Blattman et al.’s (2018) work on a hot spot intervention in Bogotá the other day. It is an excellent piece, but in a supplement to the paper Blattman makes the point that while his study is very high powered to detect spillovers, most other studies are not. I am going to detail here why I disagree with his assessment on that front.

In appendix A he has two figures, one for the direct effect comparing the historical hot spot policing studies (technically he uses the older 2014 Braga study, but here is the cite for the update Braga et al., 2020).

The line signifies a Cohen’s D of 0.17, and here is the same graph for the spillover estimates:

So you can see Blattman’s study in total number of spatial units of analysis breaks the chart so to speak. You can see however there are plenty of hot spot studies in either chart that reported statistically significant differences, but do not meet the 0.17 threshold in Chris’s chart. How can this be? Well, Chris is goal switching a bit here, he is saying using his estimator the studies appear underpowered. The original studies on the graph though did not necessarily use his particular estimator.

The best but not quite perfect analogy I can think of is this. Imagine I build a car that gets better gas mileage compared to the current car in production. Then someone critiques this as saying the materials that go into production of the car have worse carbon footprints, so my car is actually worse for the environment. It would be fine to argue a different estimate of total carbon footprint is reasonable (here Chris could argue his estimator is better than the ones the originally papers used). It is wrong though to say you don’t actually improve gas mileage. So it is wrong for Chris to say the original articles are underpowered using his estimator, they may be well powered using a different estimator.

Indeed, using either my WDD estimator (Wheeler & Ratcliffe, 2018) or Wilson’s log IRR estimator (Wilson, 2021), I will show how power does not grow with more experimental units, but with a larger baseline number of crimes for those estimators. They both only have two spatial units of analysis, so in Chris’s chart will never gain more power.

One way I think about the issue for spatial designs is this – you could always split up a spatial lattice into ever finer and finer spatial units of analysis. For example Chris could change his original design to use addresses instead of street segments, and split up the spillover buffers into finer slices as well. Do you gain something for doing nothing though? I doubt it.

I describe in my dissertation how finer spatial units of analysis allow you to check for finer levels of spatial spillovers, e.g. can check if crime spills over from the back porch to the front stoop (Wheeler, 2015). But when you do finer spatial units, you get more cold floor effects as well due to the limited nature of crime counts – they cannot go below 0. So designs with lower baseline crime rates tend to show lower power (Hinkle et al., 2013).

MDE for the WDD and log IRR

For minimum detectable effect (MDE) sizes for OLS type estimators, you need to specify the variance you expect the underlying treated/control groups to have. With the count type estimators I will show here, the variance is fixed according to the count. So all I need to specify is the alpha level of the test. Here I will do a default of 0.05 alpha level (with different lines for one-tailed vs two-tailed). The other assumption is the distribution of crime counts between treated/control areas. Here I assume they are all equal, so 4 units (pre/post and treated/control). For my WDD estimator this actually does not matter, for the later IRR estimator though it does (so the lines won’t really be exact for his scenario).

So here is the MDE for mine and Jerry’s WDD estimator:

What this means is that if you have an average of 20 crimes in the treated/control areas for each time period separately, you would need to find a reduction of 15 crimes to meet this threshold MDE for a one-tailed. It is pretty hard when starting with low baselines! For an example close to this, if the treated area went from 24 to 9, and the control area was 24 to 24, this would meet the minimal treated reduction of 15 crimes in this example.

And here is the MDE for the log IRR estimator. The left hand Y axis has the logged effect, and the right hand side has the exponentiated IRR (incident rate ratio).

Since the IRR is commonly thought of as a percent reduction, this suggests even with baselines of 200 crimes, for Wilson’s IRR estimator you need percent reductions of over 20% relative to control areas.

So I have not gone through the more recent Braga et al. (2020) meta-analysis. I do not know if they have the data readily available to draw the points on this plot the same as in the Blattman article. To be clear, it may be Blattman is right and these studies are underpowered using either his or my estimator, I am not sure. (I think they probably are quite underpowered to detect spillover, since this presumably will be an even smaller amount than the direct effect. But that would not explain estimates of diffusion of benefits commonly found in these studies!)

I also do not know if one estimator is clearly better or not – for example Blattman could use my estimator if he simply pools all treated/control areas. This is not obviously better than his approach though, and foregoes any potential estimates of treatment effect variance (I will be damned if I can spell that word starting with het even close enough for autocorrect). But maybe the pooled estimate is OK, Blattman does note that he has cold floor effects in his linear estimator – places with higher baselines have larger effects. This suggests Wilson’s log IRR estimator with the pooled data may be just fine and dandy for example.

Python code

Here is the python code in its entirety to generate the above two graphs. You can see the two functions to calculate the MDE given an alpha level and average crime counts in each area if you are planning your own study, the wdd_mde and lirr_mde functions.

'''
Estimating minimum detectable effect sizes
for place based crime interventions

Andy Wheeler
'''

import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\min_det_effect'
os.chdir(my_dir)

#########################################################
#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
#########################################################

#########################################################
# Functions for MDE for WDD and logIRR estimator


def wdd_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( avg_counts*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

def lirr_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( (1/avg_counts)*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

# Generating regular grid from 10 to 200
cnts = np.arange(10,201)
wmde1 = wdd_mde(cnts, tails='one')
wmde2 = wdd_mde(cnts)

imde1 = lirr_mde(cnts, tails='one')
imde2 = lirr_mde(cnts)

# Plot for WDD MDE
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(cnts, wmde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts, wmde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_axisbelow(True)
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('Crime Count Reduction')
ax.legend(loc='upper left')
plt.xticks(np.arange(0,201,20))
plt.yticks(np.arange(10,61,5))
plt.title("WDD MDE alpha level 0.05")
plt.savefig('WDD_MDE.png', dpi=500, bbox_inches='tight')

# Plot for IRR MDE
fig, ax = plt.subplots(figsize=(8,6))
ax2 = ax.secondary_yaxis("right", functions=(np.exp, np.log))
ax.plot(cnts,-1*imde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts,-1*imde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_axisbelow(True)
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('log IRR')
ax.set_ylim(-0.16, -1.34)
ax.legend(loc='upper right')
ax.set_yticks(-1*np.arange(0.2,1.31,0.1))
ax2.set_ylabel('IRR')
ax2.grid(False)
plt.xticks(np.arange(0,201,20))
plt.title("IRR MDE alpha level 0.05")
plt.savefig('IRR_MDE.png', dpi=500, bbox_inches='tight')

#########################################################

References

Blattman, C., Green, D., Ortega, D., & Tobón, S. (2018). Place-based interventions at scale: The direct and spillover effects of policing and city services on crime (No. w23941). National Bureau of Economic Research.
Braga, A. A., & Weisburd, D. L. (2020). Does Hot Spots Policing Have Meaningful Impacts on Crime? Findings from An Alternative Approach to Estimating Effect Sizes from Place-Based Program Evaluations. Journal of Quantitative Criminology, Online First.
Hinkle, J. C., Weisburd, D., Famega, C., & Ready, J. (2013). The problem is not just sample size: The consequences of low base rates in policing experiments in smaller cities. Evaluation Review, 37(3-4), 213-238.
Wheeler, A.P. (2015). What we can learn from small units of analysis. Dissertation, SUNY Albany.
Wheeler, A.P., & Ratcliffe, J.H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 11.
Wilson, D. B. (2021). The relative incident rate ratio effect size for count-based impact evaluations: When an odds ratio is not an odds ratio. Journal of Quantitative Criminology, Online First.

Leave a comment

by Andy Wheeler on April 2, 2021 • Permalink

Posted in Crime Analysis, Python

Tagged hot spots, irr, quasi-experiment, wdd

Posted by Andy Wheeler on April 2, 2021

https://andrewpwheeler.com/2021/04/02/minimum-detectable-effect-sizes-for-place-based-designs/

Comparing the WDD vs the Wilson log IRR estimator

So this is maybe my final post on the WDD estimator for the time being (Wheeler & Ratcliffe, 2018). Recently David Wilson had an article in JQC that proposes a different estimator using the same basic information, just pre-post crime counts for treated and control areas (Wilson, 2021). So say we had the table:

         Pre   Post
Treated   50     30
Control   60     55

So in this scenario, my WDD estimate is -20 in the treated area, and -5 in the control area, so the overall estimate is -20 – -5 = -15.

30 - 50 - (55 - 60) = -15

So an estimated reduction of -15 crimes overall. David’s estimator is the logged incident rate ratio (IRR), and so is just like above, except logs all of the values:

log(30) - log(50) - ( log(55) - log(60) ) = -0.4238142

This is a logged incident rate adjustment, so most of the time people exponentiate this value, which is exp(-0.4238142) = 0.6545455. So this suggests crime is reduced by approximately 35% in the treated area relative to the control area in this hypothetical. Or another way to write it is (30/50)/(55/60) = 0.6545455.

So instead of a linear estimate of the total numbers of crimes reduced, this is an estimate of the overall rate reduction. So this begs the question when would you prefer my WDD vs the IRR? I will try to answer that below – in short I think David’s estimator makes sense for meta-analyses (as I have said before in reference to the work in Braga & Weisburd, 2020). But for an individual agency doing an experimental evaluation I much prefer my estimator. The skinny of this logic is that we only really care about the overall crime reduction estimate from a cost-benefit analysis perspective. Backing out this total crime reduction count estimate from David’s IRR estimate can result in some funny business for an individual study.

Identifying Assumptions

So there are really two different assumptions my WDD estimator and David’s IRR estimator make. To generate a standard error estimate around the point estimate for either estimator, both require the data are Poisson distributed. So that makes no difference between the two. The assumption that really distinguishes between the WDD and the IRR estimate is the parallel trends assumption. The WDD assumes parallel trends are on the linear scale, whereas the IRR assumes parallel trends are on the ratio scale.

What exactly does this mean? Imagine we have a treated and control area, but look at the crime trends per time period before the treatment occurred. This set of areas has a set of parallel trends on the linear scale:

Time Treated Control
 0     50      60
 1     40      50
 2     45      55
 3     50      60

When the treated area goes down by 10 crimes, the control area goes down by 10 crimes. That is a parallel on the linear scale. Whereas this scenario is parallel on the ratio scale:

Time Treated Control
 0     50      60
 1     40      48
 2     45      54
 3     50      60

When crime goes down by 20% in the treated area, it goes down by 20% in the control area.

So while this gives a potential way to say you should use the WDD (parallel on the linear scale), or the IRR (parallel on the ratio scale), in practice it is not so simple. For one thing, if you only has the pre/post counts of crime, you cannot distinguish between these two scenarios. You can only tell in the case you have historical data to examine.

For a second part of this, you typically can choose your own control area (see for example the synthetic control estimator). So in most scenarios you could choose a control area to obey the linear or the ratio parallel trends assumption if you wanted to. However it may be in many scenarios there is a natural/easy control area, and you may see what is a better fit in that case for linear/ratio.

A final wee bit of a perverse aspect about this I will mention – pretend we have a treated/control area have approximately the same baseline crime counts/rates:

Time Treated Control
 0      30     30
 1      25     25
 2      20     20
 3      25     25

You actually cannot tell in this scenario whether the parallel trends are on the linear scale for my WDD or the ratio scale for the IRR estimate. They are consistent with either! In practice I think in many cases it will be like this – with noisy data, if you choose a control area that has approximately the same baseline crime counts, it will be quite hard to tell whether the linear parallel trends makes more sense or the ratio parallel trends makes more sense.

There are situations where the linear changes do not make sense, but they tend to be scenarios such as the control area has very little crime (so cannot go below 0 to match larger ups/downs in the treated area). So in that case sure the IRR is plausible and the WDD is not, but those are cases where the control area itself is quite questionable. Also note the IRR is not defined for any cells with 0 crimes – but again it is not good for either of our estimators in that case (although mine won’t fail to spit out a number, the power is so low the number it spits out won’t be worth much).

Bias/Coverage

So I have adapted the same simulation code I used in prior studies/blog posts to evaluate the null distribution and the coverage of David’s IRR estimator. I partly did not pursue it initially back when me and Jerry were discussing this idea, because I thought it would be biased. Generalized linear models are based on maximum likelihood estimators, which are only asymptotically valid. In short it appears I was wrong here and David’s IRR estimator is fine even with just four observations, at least for the handful of scenarios I have tried it (have not looked at very tiny counts of crime, it is undefined if any cell has 0 crimes, as you cannot take the log of 0).

Python code pasted at the very end of the blog post, but for example if we generate a set of null no changes pre/post with a baseline of 50 crimes, the logged irr estimate (converted into a z-score here) is just fine and dandy and has a very close to standard normal distribution based on 10k simulations.

So lets look at the scenario where the control area doesn’t change, but the treated area goes from 50 to 30. We can see again the point estimate in this scenario is spot on the money.

And then we can see the coverage of the logged irr estimator is spot on as well:

So if you are interested in slightly different baseline scenarios, you can use that same simulation code to check out the behavior of David’s estimator and conduct simulated power analysis the same way I have shown for the WDD estimator in prior blog posts.

So if both are unbiased and have good coverage again, why would you prefer the WDD estimator over the IRR estimator (or vice-versa)? Well, lets take the 35% reduction I talked about at the beginning of the post, and the department needs to spend $250k on extra officers to conduct whatever hot spot policing intervention. A 35% reduction may be worth it if we start with a baseline of 200 crimes (so would expect to go down to 130, for a reduction of 70 crimes). If the baseline is 20 crimes, it goes down to 13 crimes (so only a reduction of 7 crimes). The actual benefit of the IRR estimate is entirely dependent on the baseline count of crimes it is applied to.

Even if the IRR estimate is itself unbiased and has proper coverage, for even an individual study backing out the estimated reduction in total crimes from the IRR is biased. So here in this same simulated data (50 to 30 in treated, and 50 to 50 in control areas). The true count reduction is -20, and here is the point estimate on the X axis and the length of the confidence interval for each simulation on the Y axis for my WDD test. You can see they are nicely centered on -20, and the length of the confidence intervals has a very tiny variance – they are mostly just a smidge over 50 in total length. So that is probably tough to wrap your head around, but the variance of the variance estimates for the WDD are small.

Now lets do the same graph for the IRR estimate, but translated back out to a count crime reduction based on the simulated values:

We either have a ton of bias in this estimate (if the estimate of the count reduction is too large, the confidence interval is too small). Or the opposite, the estimate of the count reduction is too small, and the confidence interval is crazy wide. In Andrew Gelman’s terminology, it can result in pretty large type M (magnitude) errors in this simulated example (Gelman & Carlin, 2014). So the variance of the variance estimates in this scenario are quite large.

To be clear – if you are interested in estimating a percent reduction, by all means use David’s IRR estimator. If you however want to translate that percent reduction into an estimate of the total crimes reduced though you should use my WDD estimator in that case. You should not back out a total crimes reduced estimate from the IRR.

Final Thoughts

So I have said a few times I think the IRR estimator makes more sense for meta-analyses. Why do I think that? Well, imagine we have an underlying causal process through which a hot spots policing experiment can randomly deter/prevent a particular proportion of crimes. That underlying causal process suggests an IRR effect. And also the problem I mention with translating back to crime counts I believe should get smaller with tighter estimates.

For a causal process that is more akin to my WDD estimator, imagine some crimes will always be deterred/prevented from a hot spots policing experiment, and some will never be. And we don’t know up-front which is which, so the observed reduction is based on whatever mixture of the two we have at that particular location.

The proportion reduction seems to make more sense to me for active patrol type interventions (which are ephemeral) vs permanent CPTED like interventions which should prevent certain criminal acts in perpetuity. But of course any situation in the real world could have both occurring at the same time.

When you go and look at the meta-analysis of hot spots policing, those interventions are all over the place (Hinkle et al., 2020). I think my WDD estimate would not make sense to mash up into a final meta-analytic estimate. The IRR may not make sense either in the end, but it is plausibly more relevant to compare the IRRs from a study with a baseline of 200 crimes vs one with 40 crimes at baseline. I am not sure it makes sense to compare WDDs in that scenario. But that being said, a few of my blog posts have discussed the WDD normalized per unit area or per unit time. Those normalized estimates are probably more apples to apples in the 200 vs 40 scenario.

A final note I have not discussed here is that David discusses a correction for overdispersion, so that is a potential feather in the cap for his estimator vs the WDD. I’d be a bit hesitant though with that – only four observations to estimate the dispersion term is slicing it a bit thin IMO. But I was wrong about the original estimator, so I may be wrong about that as well. It will take simulation evidence to determine that though – David’s paper just provides the correction term, he doesn’t provide evidence for its utility with small sample data.

And to be fair I have not done simulations to see how my estimator behaves in the presence of overdispersion either. I believe it will simply just cause the standard errors to be too small, so like in Wheeler (2016), I imagine it will just require upping the interval (e.g. use a z-score of 3 instead of 2) to get proper coverage for real crime data.

References

Braga, A. A., & Weisburd, D. L. (2020). Does Hot Spots Policing Have Meaningful Impacts on Crime? Findings from An Alternative Approach to Estimating Effect Sizes from Place-Based Program Evaluations. Journal of Quantitative Criminology, Online First.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.
Hinkle, J. C., Weisburd, D., Telep, C. W., & Petersen, K. (2020). Problem-oriented policing for reducing crime and disorder: An updated systematic review and meta-analysis. Campbell Systematic Reviews, 16(2), e1089.
Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
Wheeler, A.P., & Ratcliffe, J.H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 11.
Wilson, D. B. (2021). The relative incident rate ratio effect size for count-based impact evaluations: When an odds ratio is not an odds ratio. Journal of Quantitative Criminology, 1-19.

Python simulation code

Here is a copy-pasted chunk of the entire python simulation code.

'''
Comparing WDD to log(IRR) from Wilson's
recent paper, https://link.springer.com/article/10.1007/s10940-021-09494-w

Andy Wheeler
'''

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\wdd_vs_irr'
os.chdir(my_dir)

#########################################################
#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
#########################################################


#This works for the scipy functions as well
np.random.seed(seed=10)

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    # WDD estimates
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    # Wilson log IRR estimates
    true_logirr = np.log( (treat1*cont0) / (cont1*treat0) )
    est_logirr = np.log( ((tr_cr_1/post)*(co_cr_0/pre)) / ( (co_cr_1/post)*(tr_cr_0/pre) ) )
    se_logirr = np.sqrt( 1/tr_cr_1 + 1/co_cr_0 + 1/co_cr_1 + 1/tr_cr_0 )
    z_logirr = est_logirr / se_logirr
    return (tr_cr_0, co_cr_0, tr_cr_1, co_cr_0, est, var_est, true_val, z_score, true_logirr, est_logirr, se_logirr, z_logirr)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
    else:
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
    else:
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
    else:
        base['cont1'] = base['cont0']
    base.drop(columns='index',inplace=True)
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['sim_t0','sim_c0','sim_t1','sim_c1','est','var_est','true_val','z_score',
                        'true_logirr','est_logirr','se_logirr','z_logirr']
    return pd.concat([base,sim_vals], axis=1)

# Coverage of the log irr estimate
# Lets look at the coverage rate for a decline from 40 to 20
def cover_logirr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    cover = ( data['true_logirr'] > low) & ( data['true_logirr'] < high )
    return cover

# Length of ci for WDD
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return low, high, high - low

# Length of ci for IRR estimate on count scale
# This depends on the baseline estimate to multiply
# The IRR by, using the baseline average of the 
# Treatment area

def len_irr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    baseline = data['sim_t0']/data['pre']
    # Even if you use hypothetical, the variance is quite high
    #baseline = data['treat0']
    est_count = baseline*np.exp(data['est_logirr']) - baseline
    c1 = baseline*np.exp(low) - baseline
    c2 = baseline*np.exp(high) - baseline
    return est_count, c1, c2, np.abs(c2 - c1)

##########################
# Example with no change, lets look at the null distribution
sim_n = 10000
no_diff = make_data(sim_n, 50, 50, 50, 50, 1, 1)
no_diff['z_logirr'].describe()
##########################

##########################
# Example with equal time periods, a reduction from 50 to 30 and 50 to 50 in control area
sim_dat = make_data(sim_n, 50, 30, 50, 50, 1, 1)
sim_dat[['true_logirr','est_logirr','se_logirr']].describe()

cl = cover_logirr(sim_dat)
cl.mean()

# Compare length of CI for IRR vs WDD

# WDD length
lowdd, highwdd, lwdd = len_ci(sim_dat)
lwdd.describe()

# IRR length on the count scale
est_cnt_irr, lo_irr, hi_irr, ln_irr = len_irr(sim_dat)
ln_irr.describe()

# Scatterplot of estimated count reduction vs
# Length of CI
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(est_cnt_irr, ln_irr, c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Estimated Count Reduction [IRR]')
ax.set_ylabel('Length of CI on count scale [IRR]')
plt.savefig('IRR_Len_Est.png', dpi=500, bbox_inches='tight')
plt.show()

# Lets compare to the WDD estimate
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(sim_dat['est'], lwdd, c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Estimated Count Reduction [WDD]')
ax.set_ylabel('Length of CI on count scale [WDD]')
plt.savefig('WDD_Len_Est.png', dpi=500, bbox_inches='tight')
plt.show()
##########################

2 Comments

by Andy Wheeler on March 18, 2021 • Permalink

Posted in Crime Analysis, Python, scholarly

Tagged glm, Poisson, simulation, wdd

Posted by Andy Wheeler on March 18, 2021

https://andrewpwheeler.com/2021/03/18/comparing-the-wdd-vs-the-wilson-log-irr-estimator/

Clumpy hotspots

Read an article by Tim Hart the other day (part of a special issue I will have an article in as well here soon). In it he evaluated hot spot methods not only by how well they forecast crime, but also by the clumpiness of the hot spot method. Some hot spot methods, such as risk terrain modeling (Caplan et al., 2020; Fox et al., 2021), machine learning models (Wheeler & Steenbeek, 2020), or self-exciting point process models (Mohler et al., 2018) can by their nature produce discontinuous hot spots. Here is an example of a RTM map I made in Yoo & Wheeler (2019) for homeless related crime in Los Angeles, and you can see this is quite spotty in the ups/downs in the high risk areas:

Other hot spot methods, like hierarchical clustering (Wheeler & Reuter, 2020) or kernel density maps however this is not as big an issue. Here is an example kernel density map also from Yoo & Wheeler (2019) based on the same data:

So you can see how the hot spots in the kernel density map are spatially contiguous, whereas the RTM example can be little hot spots all over the jurisdiction. So it is obviously easier to patrol a single contiguous area than many islands over the entire jurisdiction. So it may make sense to trade off a contiguous area that captures somewhat fewer crimes than speckled areas that are all over the map.

Adepeju et al. (2016) was the first to use a particular statistic, the clumpiness index, to evaluate different hot spot methods. Their figure below is a pretty good depiction of the idea – count up the number of internal edges to a hot spot (when a hot spot grid-cell neighbors another hot spot), and the number of external edges. Then it is just a particular formula to make the index range from -1 to 1 given different sized hot spots.

So here I flip this idea on its head abit – instead of using a particular hot spot technique and see its clumpiness, I formulate a linear program given a prediction to trade off a smaller number of predicted crimes in the hot spot vs making the hot spot areas more clumpy. I illustrate my clumpy hot spots using just prior data to predict future data, in particular thefts from motor vehicles in Raleigh North Carolina.

I have posted the data/code on github here. It is a bit too long to embed the code directly in the blog post, but just see the 00_PrepData.py file. The crime data and Raleigh border I downloaded from the Raleigh open data website.

A Linear Program to Clump Hot Spots

So for some quick and dirty math in text, the linear program I formulate is:

Maximize { Sum[ theta*S_i*Crime_i + (1 - theta)*E_i ] }
Subject To:
    1) Sum( S_i ) = k
    2) E_i <= Sum(S_n for n in neighbors(i) ) for each i
    3) E_i <= S_i for each i
    4) S_i element of {0,1}, E_i >= 0 (and can be continuous)

The idea behind this is that if theta=1, this is the same as just taking whatever your input areas are and ranking them to pick the top k areas. So if you have 10000 500 by 500 foot grid cells as your spatial units of analysis, and you wanted the top 1% of the city, that is 100 grid cells. So you would choose k=100 in that scenario. Crime_i here I use as prior counts of crime in the grid cell, but it could be the predicted value from whatever model as well. That is the first constraint in this model approach – you need to choose the total area you want. S_i are the decision variables for the final selected hot spot areas.

The second and third constraints determine the values for the second set of decision variables, E_i. These are the decision variables that encode the interconnected links when a selected grid cell touches another grid cell. Constraint 2 sets E_i to the total number of neighbors of i that are selected, except constraint 3 says if S_i is 0 E_i needs to be 0 as well.

In this formulation, S_i need to be integer variables, but the E_i are defined by the sum of S_i, so they can be continuous. In this formulation if you have N grid cells (or whatever spatial units of analysis), this results in 2*N decision variables, and 2*N + 1 constraints. You could maybe save a few constraints here by working with an undirected graph instead of a directed one (in essence this double counts, a-b and b-a would count as two links). But this will just make it 1.5*N constraints instead of 2*N. So not a big deal probably. I did have some issues solving this using pulps default coin/GLPK solver. But CPLEX solved it no problem. (My example is a total of 20,986 500 by 500 foot grid cells, and I use rook contiguity like the Adepeju article as well. And using CPLEX it solves the models in just a few seconds.)

In this formulation you can think of theta as trading off crimes in the hot spot vs interior edges. So imagine you had theta=0.9, and you had a solution with 200 crimes and 100 interior edges. The objective function in that scenario would be 0.9*200 + 0.1*100 = 190. Now imagine you had an alternative scenario with 190 crimes, but 200 internal edges, the objective function would be 0.9*190 + 0.1*200 = 191. So you are saying, it is ok to have hot spots capture a smaller number of crimes, if they are more connected.

Normal Hotspots vs Clumpy Ones in Raleigh

The open data I use for Raleigh, North Carolina for the NIBRS dataset goes back to June 2014, and has data updated in the beginning of March 2021. I pull out larcenies from motor vehicles, and for the historical train dataset use car larcenies from 2014 through 2019 (n = 17,681). For the test dataset I use car larcenies in 2020 and what is available so far in 2021 (n = 3,376). Again these are grid cells generated over the city boundaries at 500 by 500 foot intervals. For illustration I grab out the top 1% of the city (209 grid cells). I use a train/test dataset as out of sample test data will typically result in reduced predictions. Here are the PAI stats for train vs test when selecting the top 1%.

For all subsequent selections I always use the historical training data to select the hot spots, and the test dataset to evaluate the PAI.

If we do the typical approach of just taking the highest crime grid cells based on the historical data, here are the results both for the PAI and the CI (clumpy index). For those not familiar, PAI is % Crime Capture/% Area, so if the denominator is 1%, and the PAI (for the test data) is 17, that means the hot spots capture 17% of the total thefts from vehicles. The CI ranges from -1 (spread apart) to 1 (entirely clustered). Here it is just over 0, suggesting these are basically randomly distributed in terms of clustering.

You may think that almost spatial randomness in terms of clumping seems at odds with that crime clusters! But it is not really – a consistent relationship with crime hot spots is that they are intensely localized, and often you can go down the street and be in a low crime area (Harries, 2006). The same idea when people say high crime neighborhoods often are spotty interior – they tend to have mostly low crime areas and just a few specific hot spots.

OK, so now to show off my linear program. So what happens if we use theta=0.9?

The total crime numbers are here for the historical data, and it ends up capturing the exact same number of crimes as the select top 1% does (3,664). But, it switches the selection of one of the areas. So what happens here is that we have ties – even with basically little weight assigned to the interior connections, it will prioritize tied crime areas to be connected to other chosen hot spots (whereas before the ties are just random in the way I chose the top 1%). So if you have many ties at the threshold for your hot spot, this is a great way to prioritize particular tied areas.

What happens if we turn down theta to 0.5? So this is saying you would trade off one for one – one interior edge is equal to one crime.

You can see that it changed the selections slightly more here, traded off 24 areas compared to the original just rank solution. Lets check out the map and the CI:

The CI value is now 0.17 (up from 0.08). You can see some larger blobs, but it is still pretty spread apart. But the reduction in the total number of crimes captured is pretty small, going from a PAI of 17 to now a PAI of 16. How about if we crank down theta even more to 0.2?

This trades off a much larger number of areas and total amount of crime – over half of the chosen grid cells are flipped in this scenario. In the subsequent map you can see the hot spots are much more clumpy now, and have a CI of 0.64.

The PAI of 12.6 is a bit of a hit as well, but is not too shabby still. I typically take a PAI of 10 to be the ballpark of what is reasonable based on Weisburd’s Law of Crime Concentration – 5% of the areas contain 50% of the crime (which is a PAI of 10).

So this shows one linear programming approach to trade off clumpy chosen areas vs disconnected speckles over the map. It may be the case though that other approaches are more reasonable, such as using some type of clustering to begin with. E.g. I could use DBSCAN on the gridded predicted values (Wheeler & Reuter, 2020) as see how clumpy those hot spots are. This approach is nice though if you have a fixed area you want to cover though.

Why Raleigh?

For a bit of personal news, I will be moving to the Raleigh area here shortly. I recently negotiated to be 100% remote at my job – so I will still be at HMS (or since we were recently purchased I might be employed by Gainwell I guess by the time I move). So looking forward to the new adventure back on the east coast but still in more temperate climates than PA or NY!

References

Adepeju, M., Rosser, G., & Cheng, T. (2016). Novel evaluation metrics for sparse spatio-temporal point process hotspot predictions-a crime case study. International Journal of Geographical Information Science, 30(11), 2133-2154.
Caplan, J. M., Neudecker, C. H., Kennedy, L. W., Barnum, J. D., & Drawve, G. (2020). Tracking Risk for Crime Throughout the Day: An Examination of Jersey City Robberies. Criminal Justice Review, Online First.
Fox, B., Trolard, A., Simmons, M., Meyers, J. E., & Vogel, M. (2021). Assessing the Differential Impact of Vacancy on Criminal Violence in the City of St. Louis, MO. Criminal Justice Review, Online First.
Harries, K. (2006). Extreme spatial variations in crime density in Baltimore County, MD. Geoforum, 37(3), 404-416.
Hart, T. C. (2021). Investigating Crime Pattern Stability at Micro-Temporal Intervals: Implications for Crime Analysis and Hotspot Policing Strategies. Criminal Justice Review, Online First.
Mohler, G., Raje, R., Carter, J., Valasik, M., & Brantingham, J. (2018). A penalized likelihood method for balancing accuracy and fairness in predictive policing. In 2018 IEEE international conference on systems, man, and cybernetics (SMC) (pp. 2454-2459). IEEE.
Wheeler, A. P. (2020). Allocating police resources while limiting racial inequality. Justice Quarterly, 37(5), 842-868.
Wheeler, A. P., & Reuter, S. (2020). Redrawing hot spots of crime in Dallas, Texas. Police Quarterly, Online First.
Wheeler, A. P., & Steenbeek, W. (2020). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, Online First.
Yoo, Y., & Wheeler, A. P. (2019). Using risk terrain modeling to predict homeless related crime in Los Angeles, California. Applied Geography, 109, 102039.

1 Comment

by Andy Wheeler on March 9, 2021 • Permalink

Posted in Crime Analysis, Crime Mapping, Python

Tagged hot spots, linear programming

Posted by Andy Wheeler on March 9, 2021

https://andrewpwheeler.com/2021/03/09/clumpy-hotspots/

Transforming predicted variables in regression

The other day on LinkedIn I made a point about how I think scikits TransformedTargetRegressor is very likely to mislead folks. In fact, the example use case in the docs for this function is a common mistake, fitting a model for log(y), then getting predictions phat, and then simply exponentiating those predictions exp(phat).

On LinkedIn I gave an example of how this is problematic for random forests, but here is a similar example for linear regression. For simplicity pretend we only have 3 potential residuals (all equally likely), either a residual of -1, 0, or 1.

Now pretend our logged prediction is 5, so if we simply do exp(5) we get about 148. Now what are our predictions is we consider those 3 potential residuals?

Resid  Pred-Resid Modified_Pred LinPred
  -1     5 - -1        exp(6)     403
   0     5 -  0        exp(5)     148
   1     5 -  1        exp(4)      55

So if we take the mean of our LinPred column, we then get a prediction of about 202. The prediction using this approach is much higher than the naive approach of simply exponentiating 5. The difference is that the exp(5) estimate is the median, and the above estimate taking into account residuals is the mean estimate.

While there are some cases you may want the median estimate, in that case it probably makes more sense to use a quantile estimator of the median from the get go, as opposed to doing the linear regression on log(y). I think for many (probably most) use cases in which you are predicting dollar values, this underestimate can be very problematic. If you are using these estimates for revenue, you will be way under for example. If you are using these estimates for expenses, holy moly you will probably get fired.

This problem will happen for any non-linear transformation. So while some transformations are ok, in scikit for example minmax or standardnormal scalars are ok, things like logs, square roots, or box-cox transformations are not. (To know if it is a linear transformation, if you do a scatterplot of original vs transformed, if it is a straight line it is ok, if it is a curved line it is not!)

I had a friend go back and forth with me for a bit after I posted this. I want to be clear this is not me saying the model of log(y) is the wrong model, it is just to get the estimates for the mean predictions, you need to take a few steps. In particular, one approach to get the mean estimates is to use Duan’s Smearing estimator. I will show how to do that in python below using simulated data.

Example Duan’s Smearing in python

So first, we import the libraries we will be using. And since this is simulated data, will be setting the seed as well.

######################################################
import pandas as pd
import numpy as np
np.random.seed(10)

from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
######################################################

Next I will create a simple linear model on the log scale. So the regression of the logged values is the correct one.

######################################################
# Make a fake dataset, say these are housing prices
n = (10000,1)
error = np.random.normal(0,1,n)
x1 = np.random.normal(10,3,n)
x2 = np.random.normal(5,1,n)
log_y = 10 + 0.2*x1 + 0.6*x2 + error
y = np.exp(log_y)

dat = pd.DataFrame(np.concatenate([y,x1,x2,log_y,error], axis=1),
                   columns=['y','x1','x2','log_y','error'])
x_vars = ['x1','x2']

# Lets look at a histogram of y vs log y
dat['y'].hist(bins=100)
dat['log_y'].hist(bins=100)
######################################################

Here is the histogram of the original values:

And here is the histogram of the logged values:

So although the regression is the conditional relationship, if you see histograms like this I would also by default use a regression to predict log(y).

Now here I do the same thing as in the original function docs, I fit a linear regression using the log as the function and exponential as the inverse function.

######################################################
# Now lets see what happens with the usual approach
tt = TransformedTargetRegressor(regressor=LinearRegression(),
                                func=np.log, inverse_func=np.exp)
tt.fit(dat[x_vars], dat['y'])
print( (tt.regressor_.intercept_, tt.regressor_.coef_) ) #Estimates the correct values

dat['WrongTrans'] = tt.predict(dat[x_vars])

dat[['y','WrongTrans']].describe()
######################################################

So here we estimate the correct simulated values for the regression equation:

But as we will see in a second, the exponentiated predictions are not so well behaved. To illustrate how the WrongTrans variable behaves, I show its distribution compared to the original y value. You can see that on average it is a much smaller estimate. Our sample values have a mean of 7.5 million, and the naive estimate here only has a mean of 4.6 million.

Now here is a way to get an estimate of the mean value. In a nutshell, what you do is take the observed residuals, pretty much like that little table I did in the intro of this blog post, generate predictions given those residuals, and then back transform them and take the mean.

Although this example is using logged regression, I’ve made it pretty general. So if you used any box cox transformation instead of the logged (e.g. sklearns power_transform, it will work.

######################################################
# Duan's smearing, non-parametric approach via residuals

# Can make this general for any function inside of 
# TransformedTargetRegressor
f = tt.get_params()['func']              #function
inv_f = tt.get_params()['inverse_func']  #and inverse function

# Non-parametric approach, approximate via residuals
# Using numpy broadcasting
log_pred = f(dat['WrongTrans'])
resids = f(dat['y']) - log_pred
resids = resids.values.reshape(1,n[0])
dp = inv_f(log_pred.values.reshape(n[0],1) + resids)
dat['DuanPreds'] = dp.mean(axis=1)

dat[['y','WrongTrans','DuanPreds']].describe()
######################################################

So you can see that the Duan Smeared predictions are looking better, at least the mean of the predictions is much closer to the original.

I’ve intentionally done this example without using train/test, as we know the true answers. But in that case, you will want to use the residuals from the training dataset to apply this transformation to the test dataset.

So the residuals and the Duan smearing estimator do not need to be the same dimension. So for example if you have a big data application, you may want to do something like resids = resids.sample(1000) above.

Also another nice perk of this is you can use dp above to give you prediction intervals, so np.quantile(dp,[0.025,0.975], axis=1).T would give you a 95% prediction interval of the mean on the linear scale as well.

Extra, Parametric Estimation

Another approach, which may make sense given the application, is instead of using the observed residuals to give a non-parametric estimate, you can estimate the distribution of the residuals, and then use that to make either an integral estimate of the Smeared estimate back on the original scale. Or in the case of the logged regression there is a closed form solution.

I show how to construct the integral estimator below, again trying to be more general. The integral approach will work for say any box-cox transformation.

######################################################
# Parametric approach, approximating residuals via normal

from scipy.stats import norm
from scipy.integrate import quad

# Look at the residuals again
resids = f(dat['y']) - f(tt.predict(dat[x_vars]))

# Check to make sure that the residuals are really close to normal
# Before doing this
resids.hist(bins=100)

# Fit to a normal distribution 
loc, scale = norm.fit(resids)

# Define integral
def integrand(x,pred):
    return norm.pdf(x, loc, scale)*inv_f(pred - x)

# Pred should be the logged prediction
# -50,50 should be changed if the residuals are scaled differently
def duan_param(pred):
    return quad(integrand, -50, 50, args=(pred))[0]

# This takes awhile to apply to the whole data frame!
dat['log_pred'] = f(tt.predict(dat[x_vars]))
sub_dat = dat.head(100).copy()
sub_dat['DuanParam'] = sub_dat['log_pred'].apply(duan_param)

# Can see that these are very similar to the non-parametric
print( sub_dat[['DuanPreds','DuanParam']].head(10) )

And you can see that this normal based approximation works just fine here, since by construction the model residuals are pretty well behaved in my simulation.

It happens to be the case that there is a simpler estimate than the integral approach (which you can see in my notes takes awhile to estimate).

###########
# Easier way, but only applicable to log transform
# https://en.wikipedia.org/wiki/Smearing_retransformation
test_val = np.log(5000000)

# Integral approach
print( duan_param(test_val) ) 

# Approach for just log transformed
mult = np.exp(0.5*resids.var())
print( np.exp(test_val)*mult )
##########

So you can see the integral vs the closed form function are very close:

The differences could be due to the the integral is simply an estimate (and you can see I did not do negative to positive infinity, but chopped it off, I do not know if there is a better function to estimate the integral or general approach here).

It wouldn’t surprise me if there are closed form solutions for box-cox transforms as well, but I am not familiar with them offhand. Again the integral approach (or the non-parametric approach) will work for whatever function you want. The function itself could be whatever crazy/discontinuous function you want. But this parametric Duan’s Smearing approach relies on the residuals being normally distributed. (I suppose you could use some other types of continuous distribution estimate if you have reason to, I have only seen normal distribution estimates though in practice.)

Other Notes

While this focuses on regression, I do not think this will perform all that badly for other types of models (such as random forests or xgboost). But for forests it may make sense to simply pull out the individual tree estimates, back transform them, and get the mean of that backtransformed estimate. I have a different blog post that has a function showing how to scoop up the individual predictions from a random forest model.

It should also apply the same to any regression model with regularization. But if you want to do this, there are of course other alternative models you may consider that may be better suited towards your end goals of predictions on the linear/original scale.

For example, if you really want prediction intervals, it may make sense to not transform the data, and estimate a quantile regression model at the 5% and 95% quantiles. This would give you a 90% prediction interval.

Another approach is that it may make sense to use a different model, such as Poisson regression or negative binomial regression (or another generalized linear model in general). Even if your data are not integer counts, you can still use these models! (They just need to be 0 and above, no negative values.)

That Stata blog suggests to use Poisson and then robust standard errors, but that is a bad idea if you are really interested in predictions as well (see Gary Kings comment and linked paper). But you can just do negative binomial models in most cases then, and that is a better default than Poisson for many real world datasets.

1 Comment

by Andy Wheeler on February 19, 2021 • Permalink

Posted in data science, Python

Tagged linear-regression, machine-learning, regression

Posted by Andy Wheeler on February 19, 2021

https://andrewpwheeler.com/2021/02/19/transforming-predicted-variables-in-regression/

Geocoding the CMS NPI Registry (python)

So previously I wrote out creating service deserts. I have since found a nicer data source to use for this, the NPI CMS registry. This data file has over 6 million service providers across the US.

Here I will provide an example of using that data to geocode all the pharmacy’s in Texas, again using the census geocoding API and python.

Chunking up the NPI database

So first, you can again download the entire NPI database from here. So I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. So as some upfront, here are the libraries I will be using, and I also set the directory to where my data is located.

###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################

The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in powershell to get the line counts, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.

So what I do here is create a function to read in the csv file in chunks, only select the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the Taxonomy codes to pull out the type of service provider you want. So for community pharmacies, the code is 3336C0003X, but it is not always in the first Taxonomy slot (some CVS’s have it in the second slot for example). You can see the big list of taxonomy codes here, so my criminology friends may say be interested in mental health or substance abuse service providers for other examples.

In addition to the taxonomy code, I always select organizations, not individuals (Entity Type = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?) Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.

######################################################################
# Prepping the input data in chunks

keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
            
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file,chunk_size,keep_cols,row_sub):
    # First lets get the header and figure out the column indices
    header_fields = list(pd.read_csv(npi_csv, nrows=1))
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip, 
                     nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat


# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )

# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################

So this ends up returning not quite 6800 pharmacies in all of Texas.

Geocoding using the census API

So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.

######################################################################
# Now prepping the data for geocoding

ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)

#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']

# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING',',', '#']

 
def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = address.find(su)
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new

# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')

ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################

Next is my function to use the batch geocoding in the census api. Note the census api is a bit finicky – technically the census api says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks it up into tinier batch portions and submits to the API. (A better function would cache intermediate results and wrap all that jazz in a try function.)

 ######################################################################
 #This function breaks up the input data frame into chunks
 #For the census geocoding api
 def split_geo(df, add, city, state, zipcode, chunk_size=500):
     df_new = df.copy()
     df_new.reset_index(inplace=True)
     splits = np.ceil( df.shape[0]/chunk_size)
     chunk_li = np.array_split(df_new['index'], splits)
     res_li = []
     pick_fi = []
     for i,c in enumerate(chunk_li):
         # Grab data, export to csv
         sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
         sub_data.to_csv('temp_geo.csv',header=False,index=False)
         # Geo the results and turn back into df
         print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
         result = cg.addressbatch('temp_geo.csv') #should try/except?
         # May want to dump the intermediate results
         #pi_str = f'pickres_{int(i)}.p'
         #pickle.dump( favorite_color, open( pi_str, "wb" ) )
         #pick_fi.append(pi_str.copy())
         names = list(result[0].keys())
         res_zl = []
         for r in result:
             res_zl.append( list(r.values()) )
         res_df = pd.DataFrame(res_zl, columns=names)
         res_li.append( res_df.copy() )
         time.sleep(10) #sleep 10 seconds to not get cutoff from request
     final_df = pd.concat(res_li)
     final_df.rename(columns={'id':'row'}, inplace=True)
     final_df.reset_index(inplace=True, drop=True)
     # Clean up csv file
     os.remove('temp_geo.csv')
     return final_df
 ######################################################################

And now we are onto the final stage, actually running the geocoding function, and piping the end results to a csv file. (Which you can see the final version here.)

######################################################################
# Geocoding the data in chunks

# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)

#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%

# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)

# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
      'Provider Organization Name (Legal Business Name)',
      'Provider First Line Business Practice Location Address',
      'Address','City','State2','Zip5']

final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )

final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################

Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a round of geocoding using the Google API (which is a bit more unforgiving for varied addresses), or perhaps build my own openstreet map geocoder for the US. (Or in general if you don’t have too many to review, doing it interactively in ArcGIS is very nice as well if you have access to Arc.)

2 Comments

by Andy Wheeler on February 9, 2021 • Permalink

Posted in geocoding, healthcare, Mapping, Python

Tagged geocoding, healthcare

Posted by Andy Wheeler on February 9, 2021

https://andrewpwheeler.com/2021/02/09/geocoding-the-cms-npi-registry-python/

Filled contour plot in python

I’ve been making a chart that looks similar to this for a few different projects at work, so figured a quick blog post to show the notes of it would be useful.

So people often talk about setting a decision threshold to turn a predicted probability into a binary yes/no decision. E.g. do I do some process to this observation if the probability is 20%, 30%, 60%, etc. If you can identify the costs and benefits of making particular decisions, you can set a simple threshold to make that decision. For example, say you are sending adverts in the mail for a product. If the person buys the product, your company makes $50, and the advert only costs $1 to send. In this framework, if you have a predictive model for the probability the advert will be successful, then your decision threshold will look like this:

$50*probability - $1

So in this case you need the predicted probability to be above 2% to have an expected positive return on the investment of sending the advert. So if you have a probability of 10% for 2000 customers, you would expect to make 2000 * (50*0.1 - 1) = 8000. The probabilities you get from your predictive model can be thought of as in the long run averages. Any single advert may be a bust, but if your model is right and you send out a bunch, you should make this much money in the end. (If you have a vector of varying probabilities, in R code the estimated revenue will then look like prob <- runif(2000,0,0.1); pover <- prob > 0.02; sum( (50*prob - 1)*pover ).)

But many of the decisions I work with are not a single number in the benefits column. I am working with medical insurance claims data at HMS, and often determining models to audit those claims in some way. In this framework, it is more important to audit a high dollar claim than a lower dollar claim, even if the higher dollar value claim has a lower probability. So I have been making the subsequent filled contour plot I am going to show in the next section to illustrate this.

python contour plot

The code snippet is small enough to just copy-paste entirely. First, I generate data over a regular grid to illustrate different claim amounts and then probabilities. Then I use np.meshgrid to get the data in the right shape for the contour plot. The revenue estimates are then simply the probability times the claims amount, minus some fixed (often labor to audit the claim) cost. After that is is just idiosyncratic matplotlib code to make a nice filled contour.

# Example of making a revenue contour plot
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np

n = 500 #how small grid cells are
prob = np.linspace(0,0.5,n)
dollar = np.linspace(0,10000,n)
#np.logspace(0,np.log10(10000),n) #if you want to do logged

# Generate grid
X, Y = np.meshgrid(prob, dollar)

# Example generating revenue
fixed = 200
Rev = (Y*X) - fixed

fig, ax = plt.subplots()
CS = ax.contourf(X, Y, Rev, cmap='RdPu')
clb = fig.colorbar(CS)
#clb.ax.set_xlabel('Revenue') #Abit too wide
clb.ax.set_title('dollar') #html does not like the dollar sign
ax.set_xlabel('Probability')
ax.set_ylabel('Claim Amount')
ax.yaxis.set_major_formatter(StrMethodFormatter('${x:,.0f}'))
plt.title('Revenue Contours')
plt.xticks(np.arange(0,0.6,0.1))
plt.yticks(np.arange(0,11000,1000))
plt.annotate('Revenue subtracts $200 of fixed labor costs',
(0,0), (0, -50),
xycoords='axes fraction',
textcoords='offset points', va='top')
#plt.savefig('RevContour.png',dpi=500,bbox_inches='tight')
plt.show()

The color bar does nice here out of the box. Next up in my personal learning will be how to manipulate color bars a bit more. Here I may want to either use a mask to not show negative expected returns, or a diverging color scheme (e.g. blue for negative returns).

Leave a comment

by Andy Wheeler on January 29, 2021 • Permalink

Posted in data science, Data Visualization, Python

Tagged contour, decision-threshold, matplotlib

Posted by Andy Wheeler on January 29, 2021

https://andrewpwheeler.com/2021/01/29/filled-contour-plot-in-python/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 392 other subscribers
aoristic big-data cartography census choropleth citeulike consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming LLM logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category Python

Front Prep

AUC Plots

Calibration Charts

Python f-string numeric formatting

SPSS break long labels

Downloading ACS Data

Peer Review for Criminology Software

Future Posts?

MDE for the WDD and log IRR

Python code

References

Identifying Assumptions

Bias/Coverage

Final Thoughts

References

Python simulation code

A Linear Program to Clump Hot Spots

Normal Hotspots vs Clumpy Ones in Raleigh

Why Raleigh?

References

Example Duan’s Smearing in python

Extra, Parametric Estimation

Other Notes

Chunking up the NPI database

Geocoding using the census API

python contour plot

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange