Weighting machine learning models

Here is a trick I have seen on several occasions people could take advantage of to make fitting models for big data more convenient. If you are using data that is all categorical and a fairly small number of variables, you can aggregate your data and fit models using weights, instead of fitting models on the original micro level data.

For a recent example at my work, was working on a model that originally has around 6 million observations and around a dozen categorical inputs (each with fairly high cardinality, e.g. around 1000 different categories). When aggregating to unique cases, this model is well under half a million rows though. It is much easier for me to iterate and fit models on the half a million row dataset than the 6 million row one.

Here I will show an example using NIBRS data. See my prior blog post on Association Rules for background on the data, I will be using the 2012 NIBRS dataset, which has over 5 million observations. Below is python code to illustrate, but pretty much all statistical packages allow you weight observations.

So first I load the libraries I will be using, and make a nice function to print out the logit coefficients for a sklearn model.

import pandas as pd
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Function to pretty print logit coefficients
def nice_coef(mod,x_vars):
    coef = list(mod.coef_[0]) + list(mod.intercept_)
    vlist = x_vars + ['Intercept']
    return pd.DataFrame(zip(vlist,coef),columns=['Var','Coef'])

Next we can read in my NIRBS data directly from the dropbox link (replace www with dl to do this in general for dropbox links).

# See https://github.com/apwheele/apwheele.github.io/tree/master/MathPosts/association_rules
# For explanation behind NIBRS data
ndum = pd.read_csv('https://dl.dropbox.com/sh/puws33uebzt9ckd/AADVM86qPJVqP4RHWkWfGBzpa/NIBRS_DummyDat.csv?dl=0')
# If you want to read original NIRBS data
# use https://dl.dropbox.com/sh/puws33uebzt9ckd/AACL3wBhZDr3P_ZbsbUxltERa/NIBRS_2012.csv?dl=0

This data is already prepped with the repeated dummy variables for different categories. It is over 5 million observations, but a simple way to work with this data is to use a groupby and create a weight variable:

group_vars = list(ndum)
ndum['weight'] = 1
ndum_agg = ndum.groupby(group_vars, as_index=False).sum() # sums the weight variable

print(ndum.shape)
print(ndum_agg.shape)

So you can see we went from over 5 million observations to only a few over 7000.

A few notes here. One, if querying the data from a DB, it may be better to do the counts on the DB side and only load in the tinier data into memory, e.g. SELECT COUNT(ID) AS weight, A1,A2... FROM Table GROUP BY A1,A2,.....

Second, I do not have any missing data here, but pandas groupby will by default drop missing rows. So you may want to do something like data.fillna(-1).groupby, or the option to not drop NA values.

Now, lets go onto to fitting a model. Here I am using logit regression, as it is easier to compare the inputs for the weighted/non-weighted model, but you can do this for whatever machine learning model you want. I am predicting the probability an officer is assaulted.

logit_mod = LogisticRegression(penalty='none', solver='newton-cg')
y_var = 'ass_LEO_Assault'
x_vars = group_vars.copy() #[0:7]
x_vars.remove(y_var)

# Takes a few minutes on my machine!
logit_mod.fit(ndum[x_vars],ndum[y_var])

# Coefficients for the model
bigres = nice_coef(logit_mod,x_vars)
bigres

I was not sure if my computer would actually fit this model without running out of memory. But it did crunch it out in a few minutes. Now lets look at the results when we estimate the model with the weighted data. In all the sklearn models you can just pass in a sample_weight into the fit function.

logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=ndum_agg['weight'])

weight_res = nice_coef(logit_mod,x_vars)
weight_res

You can see that these are basically the same model predictions. For a few of the coefficients you can find discrepancies starting at the second decimal, but the majority are within floating point errors.

bigres['Coef'] - weight_res['Coef']

This was fit instantly instead of waiting several minutes. For more intense ML models (like random forests), this can dramatically improve the time it takes to fit models.

If you are interested in doing a train/test split, this is quite easy with the weights as well. Basically you just need to allocate some of the weight to the train and some to the test. Here I show how to do that using a binomial random variable. Then you feed the train weights to the fit function:

train_prop = 0.5
train_weight = binom.rvs(ndum_agg['weight'].astype(int), train_prop, random_state=10)
test_weight = ndum_agg['weight'] - train_weight

logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=train_weight)

And in sklearn pretty much all of the evaluation functions also take a sample weight function.

pred_probs = logit_mod.predict_proba(ndum_agg[x_vars])

# Eval metrics for the test data
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=test_weight)
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=train_weight)

So this shows that the AUC metric decreases in the test dataset (as you would expect it to). Note do not take this model seriously, I would need to think more thoroughly about the selection of rows here, as well as how to effectively interpret these particular categorical inputs in a more reasonable way than eyeballing the coefficients.

I am wondering if weighting the data is actually a more convenient way to do train/test CV splits, just change the weights instead of splitting up datasets. (You could also do fractional weights, e.g. train_weight = ndum_agg['weight']/2, which works nice for stratifying the sample, but may cause some issues generalizing.)

So note this does not always work – but works best with sparse/categorical data. If you have numeric data, you can try to discretize the data to a reasonable amount to still model it as a continuous input (e.g. round age to one decimal, e.g. 20.1). But if you have more than a few numeric inputs you will have a much harder time compressing the data into a smaller number of weighted rows.

It also only works if you have a limited number of inputs. If you have 100 variables you will be able to aggregate less than if you are working with 10.

But despite those limitations, I have come across several different examples in my career where aggregating and weighting the data was clearly the easiest approach, and NIBRS is a great example.

Big data problems for Criminal Justice

I am on the job market this year, and I have noticed a few academic jobs focused on big data (see this Penn State posting for one example). Because example data sets in criminal justice are not typical fodder for big data conversations, I figured I would talk abit about my experiences and illustrate the need for the types of skills needed to manipulate and analyze these big datasets.

As opposed to trying to further define the big data buzzword, I will simply talk about the actual size of data I have dealt with. Depending on the definition used, most large criminal justice datasets may be called medium sized data. That is you can load it in a database or statistical program (particularly those that do not load everything into RAM, like SPSS and SAS) and calculate different summary statistics and fit simple models. Were not talking about datasets that need custom big data solutions like Hadoop. The biggest single table I’ve personally worked with is a set of 25 million arrest histories (with around 150 variables). Using SPSS server to sort this dataset took less than a minute, using my local machine it took about 10 minutes. Nothing much to complain about there, and it is where the statistical programs that don’t load everything into memory shine.

To talk specifics, the police agency where I was an analyst at (Troy, NY) is a fairly small city with a population of around 50,000 people. They generated around 60,000 calls for service per year (this includes anytime someone calls 911, or police initiated interactions like a traffic stop). Every single one of these incidents generates a one to many relationship for multiple tables, and here is a sampling of those relationships; multiple free text description of the event and follow up investigations, people involved in the incident, offences committed, property stolen or damaged, persons arrested, property recovered or confiscated, drug and weapon contraband, vehicles involved, etc. Over the time period of 04-13 the incident narratives themselves are around 1 gigabyte, and the number of unique individuals and institutions in the "names" table was around 100,000. None of these tables alone would be considered big data, but when taking multiple years and having to conduct multiple table merges it turns into complicated medium size data pretty quickly.

I’m sure I’m not alone here working with police departments. In the past month I’ve had conversations with two individuals about corrections datasets that result in millions of records. Criminal justice organizations have been collecting data for along time, and given say 50,000 records per year it only takes 10 years to turn that into 500,000. When considering larger agencies (like statewide corrections or courts) the per year becomes even larger.

Most of the time summary statistics and fairly simple regression models are all researchers and analysts are interested in in criminal justice. The field is not heavily devoted to prediction, and certainly not to fitting complicated machine learning models. Many regression tasks can be estimated with data as large as 25 million records (given that the number of predictor variables tends to be small) and even if it didn’t sampling (or reducing the data to unique observations and weighting) is an obvious option. So for these types of simple needs just learning effective practices at manipulating datasets — such as SQL and best practices for conducting data manipulations in statistical packages is most of the education one needs. But these are still definitely needs that are not met in any social science curricula that I am aware. By fire is my only experience.

Two particular areas that turn little data into big data are spatial and network analysis, as one not only needs to consider the number of nodes but also the number of edges (or potential edges) in the system to calculate various measures. For example, in my dissertation I needed to conduct spatial lags of several variables (and this is needed in calculating measures such as Moran’s I). In matrix notation this typically involves calculating Wx, where W is an n by n spatial weights matrix. In my dissertation, n was 21,506, so not a large dataset, but W is then a 21,506^2 matrix. It can be held in memory, but good luck trying to calculate anything with it. Most of the spatial econometrics literature discusses how calculating W^-1 is problematic, let alone the simpler operation of Wx. So to do those calculations I needed to create custom code. I hope to be able to write a blog post on how it can be done at some point – but these blog posts aren’t earning me any brownie points to getting a job (let alone getting tenure in the future).

The other area that I believe needs to be developed in the social science related to medium data problems are custom visualization solutions. Data in social science typically has lots of noise to signal, and adding in 100,000 observations rarely makes things clearer. This is why I think visualization within the social sciences has potential to expand, as the majority of historical discussions are not extensible to our particular use applications in the social sciences.

So I’m excited by academia recognizing that big data is a problem and takes custom solutions in the social sciences. An environment where I can be reworded for taking on those big data tasks and partly focus on publishing software, as opposed to solely publish or perish, would help develop the field and have a more lasting impact on practical applications than journal articles. At least a place that acknowledges the need to develop curricula related to these data management tasks would be a good start. But I’m not sure I like the types of applications currently being pitched in the social sciences as big data problems, particularly the trivial applications of examining social networks like facebook or twitter, nor emphasis on big data tools like Hadoop that I don’t think are applicable to the social scientists toolset. But I’m certainly biased to think that applications in criminal justice have more practical implications than alot of contemporary social science research.