Column storage for wide datasets

The notion of big data is quite a bit overhyped. I do get some exposure to it at work, but many tools like Hadoop are not needed when you can do things in chunks on more typical machines. One thing I have learned more generally about databases, though, is that many relational databases (such as Postgres) store data under the hood row by row, like this:

Index1|1|'a'|Index2|2|'b'.....|Index9999|1|'z'

So the rows are stacked one after another in the underlying file format. Then to access a piece of information the system has to figure out "OK, I need to find Index2 and column 2", and it calculates offsets for where those pieces of information should be located in the file.

This however is not so nice for databases that have many columns. In particular, for administrative record databases that have few rows but many thousands of columns (often with many missing fields and low dimensional categories), this default format is not very friendly. In fact many databases have limits on the number of columns a table can have. ("Few rows" is relative, but really I am contrasting with sensor data that often has billions/trillions of rows but very few columns.)

In these situations, a columnar database (or data format) makes more sense. Instead of having to calculate a large number of offsets, the database represents the key-column pairs in an easier base format and then only worries about grabbing the specific columns you ask for. Representing the underlying data in a more efficient manner both lowers the space taken up on disk (e.g. it only takes up 1 gig instead of many gigabytes) and improves input/output operations on the data (e.g. it is faster to read the data/write new data).
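
As a toy illustration (purely conceptual, not how any particular engine lays out its bytes), compare how the same three records can be held row-wise versus column-wise in plain Python:

# Toy sketch of row vs column layouts (not an actual storage format)
records = [(1, 1, 'a'), (2, 2, 'b'), (9999, 1, 'z')]

# Row oriented: whole records stacked one after another
row_store = [val for rec in records for val in rec]
print(row_store)  # [1, 1, 'a', 2, 2, 'b', 9999, 1, 'z']

# Column oriented: each column stored contiguously
col_store = {'index': [r[0] for r in records],
             'col1': [r[1] for r in records],
             'col2': [r[2] for r in records]}
print(col_store['col2'])  # grabbing one column only touches that list

Grabbing a single column out of the columnar layout does not require skipping over all the other fields, which is the main advantage when a table has thousands of columns.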

I will show some examples via the American Community Survey data for micro areas. I have saved the functions/code here on github to follow along.

So first, I load in pandas and DuckDB. DuckDB is an open source database that uses a columnar storage format by default, but you can consider it a very similar drop-in replacement for SQLite for persistently storing data.

import pandas as pd
import duckdb
import os

# my local census functions
# grabbing small area 2019 data
# here defaults to just Texas/Delaware
# still takes a few minutes
import census_funcs
cen2019_small, fields2019 = census_funcs.get_data(year=2019)
print(cen2019_small.shape) # (21868, 17145)

Here I grab the small area data for just Texas/Delaware (this includes mostly census tracts and census block groups in the census FTP data). You can see not so many rows, almost 22k, but quite a few columns, over 17k. This is even after some data munging to drop entirely missing/duplicated columns. The census data just has very many slightly different aggregate categories.

Next I save these data files to disk. For data that does not change very often, you just want to do this process a single time and save the data somewhere you can more easily access it. No need to re-download the data from the internet/census site every time you want to do a new analysis.

I don't have timing data here, but you can just experiment to see that the parquet format is quite a bit faster to save (and CSV formats are the slowest). DuckDB is smart and just knows to look at your local namespace to find the referenced pandas dataframe in the execute string.

# Save locally to CSV file
cen2019_small.to_csv('census_test.csv')

# Trying zip compression
cen2019_small.to_csv('census_test.csv.zip',compression="zip")

# Save to parquet files
cen2019_small.to_parquet('census.parquet.gzip',compression='gzip',engine='pyarrow')

# Save to DuckDB, should make the DB if does not exist
con = duckdb.connect(database='census.duckdb',read_only=False)
con.execute('CREATE TABLE census_2019 AS SELECT * FROM cen2019_small')

If we then go and check out the data sizes on disk, the CSV for just these two states is 0.8 gigs (the full set of data for all 50 states is closer to 20 gigs). Using zip compression reduces this to around 1/4th of the size of the CSV file. Using the parquet format (which can be considered an alternative fixed file format to CSV, although column oriented) with gzip compression is pretty much the same as zip compression for this data (not many missing values or repeat categories of numbers), but if you have repeated categorical data with a bit of missing data it should give slightly better compression (I am thinking NIBRS here). The entire DuckDB database is actually smaller than the CSV file (I haven't checked closely; I try to coerce the data to smarter float/int formats before saving, but there is probably even more space to squeeze out of my functions).

# Lets check out the file sizes
files = ['census_test.csv','census_test.csv.zip',
         'census.parquet.gzip','census.duckdb']

for fi in files:
    file_size = os.path.getsize(fi)
    print(f'File size for {fi} is {file_size/1024**3:,.1f} gigs')

# File size for census_test.csv is 0.8 gigs
# File size for census_test.csv.zip is 0.2 gigs
# File size for census.parquet.gzip is 0.2 gigs
# File size for census.duckdb is 0.7 gigs

The benefit of having a database like DuckDB here is for later SQL querying, as well as the ability to save additional tables. If I scoop up the 2018 data (which has slightly different columns), I can save it to an additional table. Then later on, downstream applications can select out the limited columns/years as needed (it is unlikely any real analysis workflow needs all 17,000 columns).

# Can add in another table into duckdb
cen2018_small, fields2018 = census_funcs.get_data(year=2018)
con.execute('CREATE TABLE census_2018 AS SELECT * FROM cen2018_small')
con.close()

ducksize = os.path.getsize(files[-1])
print(f'File size for {files[-1]} with two tables is {ducksize/1024**3:,.1f} gigs')
# File size for census.duckdb with two tables is 1.5 gigs
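
As a quick sketch of that downstream pattern (the column names below are hypothetical placeholders, not actual ACS field names), you can reopen the database read-only and pull just the handful of columns an analysis needs:

# Hypothetical downstream query grabbing only a few columns
# (GEOID/B01001_001 are placeholder names, swap in real fields)
con = duckdb.connect(database='census.duckdb', read_only=True)
sub_dat = con.execute('SELECT GEOID, B01001_001 FROM census_2019').df()
con.close()
print(sub_dat.shape)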

SQLite is nice, as you can basically share a SQLite file and know your friend will be able to open it. I have not worked with DuckDB that much, but hopefully it has similar functionality and files can be shared without too much headache.

I have not worried about doing timing – I don't really care about write timings of these data formats compared to CSV (slow read timing is annoying, but 10 seconds vs 1 minute to read data is not a big deal). But it is good practice to not be gluttonous with disk space, and saving a bunch of inefficient CSV files can be a bit wasteful.

Weighting machine learning models

Here is a trick I have seen on several occasions that people could take advantage of to make fitting models on big data more convenient. If your data is all categorical and has a fairly small number of variables, you can aggregate the data and fit models using weights, instead of fitting models on the original micro level data.

For a recent example at my work, I was working on a model that originally had around 6 million observations and around a dozen categorical inputs (each with fairly high cardinality, e.g. around 1000 different categories). When aggregated to unique cases, this is well under half a million rows though. It is much easier for me to iterate and fit models on the half a million row dataset than the 6 million row one.

Here I will show an example using NIBRS data. See my prior blog post on Association Rules for background on the data; I will be using the 2012 NIBRS dataset, which has over 5 million observations. Below is python code to illustrate, but pretty much all statistical packages allow you to weight observations.

So first I load the libraries I will be using, and make a nice function to print out the logit coefficients for a sklearn model.

import pandas as pd
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Function to pretty print logit coefficients
def nice_coef(mod,x_vars):
    coef = list(mod.coef_[0]) + list(mod.intercept_)
    vlist = x_vars + ['Intercept']
    return pd.DataFrame(zip(vlist,coef),columns=['Var','Coef'])

Next we can read in my NIBRS data directly from the dropbox link (replace www with dl to do this in general for dropbox links).

# See https://github.com/apwheele/apwheele.github.io/tree/master/MathPosts/association_rules
# For explanation behind NIBRS data
ndum = pd.read_csv('https://dl.dropbox.com/sh/puws33uebzt9ckd/AADVM86qPJVqP4RHWkWfGBzpa/NIBRS_DummyDat.csv?dl=0')
# If you want to read the original NIBRS data
# use https://dl.dropbox.com/sh/puws33uebzt9ckd/AACL3wBhZDr3P_ZbsbUxltERa/NIBRS_2012.csv?dl=0

This data is already prepped with the repeated dummy variables for different categories. It is over 5 million observations, but a simple way to work with this data is to use a groupby and create a weight variable:

group_vars = list(ndum)
ndum['weight'] = 1
ndum_agg = ndum.groupby(group_vars, as_index=False).sum() # sums the weight variable

print(ndum.shape)
print(ndum_agg.shape)

So you can see we went from over 5 million observations to only a little over 7,000.

A few notes here. One, if querying the data from a DB, it may be better to do the counts on the DB side and only load the much smaller aggregated data into memory, e.g. SELECT COUNT(ID) AS weight, A1,A2... FROM Table GROUP BY A1,A2,.....
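
As a minimal sketch of that idea, here I just point DuckDB at the in-memory dataframe (using the namespace lookup mentioned in the earlier section); in a real workflow the FROM clause would reference a table stored in the database:

import duckdb

# Let the database engine do the counting instead of pandas
col_sql = ', '.join(f'"{c}"' for c in group_vars)
agg_sql = f'SELECT {col_sql}, COUNT(*) AS weight FROM ndum GROUP BY {col_sql}'
ndum_agg_db = duckdb.query(agg_sql).df()
print(ndum_agg_db.shape)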

Second, I do not have any missing data here, but pandas groupby will by default drop missing rows. So you may want to do something like data.fillna(-1).groupby, or use the option to not drop NA values.
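
For example, a recent enough pandas (1.1+) has a dropna argument on groupby that keeps the missing categories; a small sketch of both options:

# Keep rows with missing categories in the aggregation (pandas >= 1.1)
ndum_agg = ndum.groupby(group_vars, as_index=False, dropna=False).sum()

# Or the sentinel value approach mentioned above
# ndum_agg = ndum.fillna(-1).groupby(group_vars, as_index=False).sum()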

Now, let's go on to fitting a model. Here I am using logit regression, as it is easier to compare the inputs for the weighted/non-weighted models, but you can do this for whatever machine learning model you want. I am predicting the probability an officer is assaulted.

logit_mod = LogisticRegression(penalty='none', solver='newton-cg')
y_var = 'ass_LEO_Assault'
x_vars = group_vars.copy() #[0:7]
x_vars.remove(y_var)

# Takes a few minutes on my machine!
logit_mod.fit(ndum[x_vars],ndum[y_var])

# Coefficients for the model
bigres = nice_coef(logit_mod,x_vars)
bigres

I was not sure if my computer would actually fit this model without running out of memory. But it did crunch it out in a few minutes. Now let's look at the results when we estimate the model with the weighted data. In most of the sklearn models you can just pass a sample_weight into the fit function.

logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=ndum_agg['weight'])

weight_res = nice_coef(logit_mod,x_vars)
weight_res

You can see that these are basically the same model estimates. For a few of the coefficients you can find discrepancies starting at the second decimal, but the majority are within floating point error.

bigres['Coef'] - weight_res['Coef']

This was fit instantly instead of waiting several minutes. For more intense ML models (like random forests), this can dramatically improve the time it takes to fit models.

If you are interested in doing a train/test split, this is quite easy with the weights as well. Basically you just need to allocate some of the weight to the train and some to the test. Here I show how to do that using a binomial random variable. Then you feed the train weights to the fit function:

train_prop = 0.5
train_weight = binom.rvs(ndum_agg['weight'].astype(int), train_prop, random_state=10)
test_weight = ndum_agg['weight'] - train_weight

logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=train_weight)

And in sklearn pretty much all of the evaluation functions also take a sample_weight argument.

pred_probs = logit_mod.predict_proba(ndum_agg[x_vars])

# Eval metrics for the test data
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=test_weight)
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=train_weight)

So this shows that the AUC metric decreases in the test dataset (as you would expect it to). Note, do not take this model seriously; I would need to think more thoroughly about the selection of rows here, as well as how to effectively interpret these particular categorical inputs in a more reasonable way than eyeballing the coefficients.

I am wondering if weighting the data is actually a more convenient way to do train/test CV splits, just changing the weights instead of splitting up datasets. (You could also do fractional weights, e.g. train_weight = ndum_agg['weight']/2, which works nicely for stratifying the sample, but may cause some issues generalizing.)

So note this does not always work – it works best with sparse/categorical data. If you have numeric data, you can try to discretize it to a reasonable degree and still model it as a continuous input (e.g. round age to one decimal, such as 20.1). But if you have more than a few numeric inputs you will have a much harder time compressing the data into a smaller number of weighted rows.
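
As a tiny sketch of that discretizing idea (the age variable here is made up for illustration, it is not in the NIBRS dummy data), rounding or binning before the groupby is all it takes:

import pandas as pd

# Hypothetical numeric input coarsened before aggregating to weighted rows
dat = pd.DataFrame({'age': [20.13, 20.08, 35.61, 35.6]})
dat['age_round'] = dat['age'].round(1)  # one decimal, e.g. 20.1
dat['age_bin'] = (dat['age'] // 5)*5    # or coarser 5 year bins
print(dat)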

It also only works if you have a limited number of inputs. If you have 100 variables you will be able to aggregate less than if you are working with 10.

But despite those limitations, I have come across several different examples in my career where aggregating and weighting the data was clearly the easiest approach, and NIBRS is a great example.

Big data problems for Criminal Justice

I am on the job market this year, and I have noticed a few academic jobs focused on big data (see this Penn State posting for one example). Because example data sets in criminal justice are not typical fodder for big data conversations, I figured I would talk a bit about my experiences and illustrate the types of skills needed to manipulate and analyze these big datasets.

As opposed to trying to further define the big data buzzword, I will simply talk about the actual size of data I have dealt with. Depending on the definition used, most large criminal justice datasets may be called medium sized data. That is, you can load them into a database or statistical program (particularly those that do not load everything into RAM, like SPSS and SAS) and calculate different summary statistics and fit simple models. We're not talking about datasets that need custom big data solutions like Hadoop. The biggest single table I've personally worked with is a set of 25 million arrest histories (with around 150 variables). Using SPSS server to sort this dataset took less than a minute; using my local machine it took about 10 minutes. Nothing much to complain about there, and it is where the statistical programs that don't load everything into memory shine.

To talk specifics, the police agency where I was an analyst (Troy, NY) is in a fairly small city with a population of around 50,000 people. They generated around 60,000 calls for service per year (this includes any time someone calls 911, as well as police initiated interactions like traffic stops). Every single one of these incidents generates one-to-many relationships across multiple tables, and here is a sampling of those relationships: multiple free text descriptions of the event and follow up investigations, people involved in the incident, offences committed, property stolen or damaged, persons arrested, property recovered or confiscated, drug and weapon contraband, vehicles involved, etc. Over the time period of 04-13 the incident narratives themselves are around 1 gigabyte, and the number of unique individuals and institutions in the "names" table was around 100,000. None of these tables alone would be considered big data, but when taking multiple years and having to conduct multiple table merges it turns into complicated medium sized data pretty quickly.

I'm sure I'm not alone here in working with police departments. In the past month I've had conversations with two individuals about corrections datasets that run to millions of records. Criminal justice organizations have been collecting data for a long time, and given say 50,000 records per year it only takes 10 years to turn that into 500,000. For larger agencies (like statewide corrections or courts) the per-year count becomes even larger.

Most of the time, summary statistics and fairly simple regression models are all that researchers and analysts in criminal justice are interested in. The field is not heavily devoted to prediction, and certainly not to fitting complicated machine learning models. Many regression tasks can be estimated with data as large as 25 million records (given that the number of predictor variables tends to be small), and even if they could not, sampling (or reducing the data to unique observations and weighting) is an obvious option. So for these types of simple needs, just learning effective practices for manipulating datasets, such as SQL and best practices for conducting data manipulations in statistical packages, is most of the education one needs. But these are still definitely needs that are not met in any social science curricula that I am aware of. Learning by fire is my only experience.

Two particular areas that turn little data into big data are spatial and network analysis, as one not only needs to consider the number of nodes but also the number of edges (or potential edges) in the system to calculate various measures. For example, in my dissertation I needed to compute spatial lags of several variables (and this is needed for calculating measures such as Moran's I). In matrix notation this typically involves calculating Wx, where W is an n by n spatial weights matrix. In my dissertation n was 21,506, so not a large dataset, but W is then a 21,506 by 21,506 matrix. It can be held in memory, but good luck trying to calculate anything with it. Most of the spatial econometrics literature discusses how calculating W^-1 is problematic, let alone the simpler operation of Wx. So to do those calculations I needed to create custom code. I hope to be able to write a blog post on how it can be done at some point – but these blog posts aren't earning me any brownie points towards getting a job (let alone getting tenure in the future).
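
For those curious, here is a general sparse-matrix sketch (not the custom routine from my dissertation) of why Wx is tractable if you never materialize the dense n by n matrix, since most spatial weights matrices are overwhelmingly zeros. The neighbor structure below is faked purely for illustration:

import numpy as np
from scipy import sparse

n = 21506
rng = np.random.default_rng(10)

# Fake a contiguity-style weights matrix with roughly 6 neighbors per unit
# (a real W would be built from the actual spatial neighbors)
rows = np.repeat(np.arange(n), 6)
cols = rng.integers(0, n, size=n*6)
vals = np.ones(n*6) / 6.0                  # approximately row-standardized
W = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))

x = rng.normal(size=n)
wx = W @ x                                 # the spatial lag, effectively instant
print(wx.shape)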

The other area that I believe needs to be developed in the social sciences related to medium data problems is custom visualization solutions. Data in social science typically has a lot of noise relative to signal, and adding in 100,000 observations rarely makes things clearer. This is why I think visualization within the social sciences has potential to expand, as the majority of historical discussions of visualization are not extensible to our particular applications in the social sciences.

So I'm excited by academia recognizing that big data is a problem that takes custom solutions in the social sciences. An environment where I can be rewarded for taking on those big data tasks and partly focus on publishing software, as opposed to solely publish or perish, would help develop the field and have a more lasting impact on practical applications than journal articles. At least a place that acknowledges the need to develop curricula related to these data management tasks would be a good start. But I'm not sure I like the types of applications currently being pitched in the social sciences as big data problems, particularly the trivial applications of examining social networks like Facebook or Twitter, nor the emphasis on big data tools like Hadoop that I don't think are applicable to the social scientist's toolset. But I'm certainly biased to think that applications in criminal justice have more practical implications than a lot of contemporary social science research.