GPU go brrr: Estimating OLS (with standard errors) via deep learning

So a bunch of my criminologist friends have methods envy. To help them out, I made some python functions to estimate OLS models using pytorch (a deep learning python library). If you use my functions, you can just append something like “an analysis with GPU accelerated deep learning” to the title of your paper and you are good to go. For example, if your paper is “An analysis of the effects of self control on asking non-questions at an ASC conference”, you can simply change it to “A GPU accelerated deep learning analysis of the effects of self control on asking non-questions at an ASC conference”. See how that works? Instantly better.

Code and data saved here. There are quite a few tutorials on doing OLS in deep learning libraries; the only thing special here is that I also calculate the standard errors for the coefficients in the code.

python code walkthrough

So first I just import the libraries I need. Then change the directory to wherever you saved the code on your local machine, and then import my deep_ols script. The deep_ols script also relies on the scipy.stats library, but besides pytorch it is the usual scientific python stack.

import os
import sys
import torch
import statsmodels.api as sm
import pandas as pd
import numpy as np

###########################################
#Setting the directory and importing
#My deep learning OLS function

my_dir = r'C:\Users\andre\OneDrive\Desktop\DeepLearning_OLS'
os.chdir(my_dir)
sys.path.append(my_dir)
import deep_ols
############################################

The dataset I am using records the number of doctor visits. I took it from the Stata docs.

#Data from Stata, https://www.stata-press.com/data/r16/gsem_mixture.dta
#see pg 501 https://www.stata.com/manuals/sem.pdf

visit_dat = pd.read_stata('gsem_mixture.dta')
visit_dat['intercept'] = 1
y_dat = visit_dat['drvisits']
x_dat = visit_dat[['intercept','private','medicaid','age','educ','actlim','chronic']]

print( y_dat.describe() )
print( x_dat.describe() )

Estimating the model is then a single call. The function returns the torch model object, the variance/covariance matrix of the coefficients (as a torch tensor), and a nice pandas data frame of the results.

mod, vc, res = deep_ols.ols_deep(y=y_dat,x=x_dat)

As you can see, it prints out GPU accelerated loss results so fast for your pleasure.

Then, like a champ you can see the OLS results. P-values < 0.05 get a strong arm, those greater than this get a shruggie. No betas to be found in my model results.

Just to confirm, I show you get the same results using statsmodels.

stats_mod = sm.OLS(y_dat,x_dat)
sm_results = stats_mod.fit()
print(sm_results.summary())

So get on my level bro and start estimating your OLS models with a GPU.

Some Notes

First, don’t actually use this in real life. I don’t do any pre-processing of the data, so if you have some data that is not on a reasonable scale (e.g. if you had a variable income going from 0 to $500,000) all sorts of bad things can happen. Second, it is not really accelerated – I mean it is on the GPU and the GPU goes brr, but I’m pretty sure this will not ever be faster than typical OLS libraries. Third, this isn’t really “deep learning” – I don’t have any latent layers in the model.
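The scale warning is easy to reproduce without any deep learning library. Below is a minimal sketch in plain numpy (not the deep_ols code, and not pytorch): the function name `gd_ols`, the learning rate, and the simulated `income` variable are all made up for illustration. Plain gradient descent on mean squared error blows up with the raw income column, but recovers the least squares answer once the column is standardized.

```python
import numpy as np

def gd_ols(X, y, lr=0.1, steps=2000):
    """Plain gradient descent on mean squared error (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        resid = X @ beta - y
        grad = 2.0 * X.T @ resid / n   # gradient of the mean squared error
        beta = beta - lr * grad
    return beta

# Simulated data, assumed purely for illustration
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(0, 500_000, n)
y = 2.0 + 0.00001 * income + rng.normal(0, 1, n)

# Raw scale: the gradient steps are enormous, so the estimates overflow
X_raw = np.column_stack([np.ones(n), income])
with np.errstate(all='ignore'):
    beta_raw = gd_ols(X_raw, y)

# Standardized scale: the same learning rate converges without trouble
z = (income - income.mean()) / income.std()
X_std = np.column_stack([np.ones(n), z])
beta_std = gd_ols(X_std, y)

# Closed form least squares on the standardized data, for comparison
beta_exact = np.linalg.lstsq(X_std, y, rcond=None)[0]

print('raw scale estimates finite?', np.isfinite(beta_raw).all())
print('standardized estimates:    ', beta_std)
print('least squares estimates:   ', beta_exact)
```

Fancier optimizers (such as Adam) are more forgiving about step size than this bare-bones loop, but centering and scaling the inputs first is still the safer habit.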

Also note the standard errors for the coefficients are just calculated using a closed form solution that I pilfered from this reddit post. (If anyone has a textbook reference for that, I would appreciate it! I Marie Kondo’d most of my text books when I moved out of my UTD office.) If I figure out the right gradient calculations, I could probably also add ‘robust’ somewhere to my suggested title (by using the outer product gradient approach to get a robust variance/covariance matrix for the coefficients).
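For reference, here is a sketch of what such a closed form looks like. I am assuming the formula in question is the classical one, Var(b) = s^2 (X'X)^-1 with s^2 equal to the residual sum of squares divided by n - k; I have not checked it against the deep_ols code. The function names are hypothetical, and the second function is White's HC0 sandwich estimator, which is one common reading of a 'robust' variance/covariance matrix rather than anything taken from the post.

```python
import numpy as np

def ols_with_se(X, y):
    """Classical OLS coefficients, standard errors, and covariance matrix."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)       # residual variance estimate
    vcov = sigma2 * xtx_inv                # Var(b) = s^2 (X'X)^-1
    return beta, np.sqrt(np.diag(vcov)), vcov

def hc0_vcov(X, y):
    """White's HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1."""
    xtx_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (xtx_inv @ X.T @ y)
    meat = (X * resid[:, None] ** 2).T @ X  # sum of e_i^2 * x_i x_i'
    return xtx_inv @ meat @ xtx_inv

# Tiny check: an intercept-only model gives the standard error of the mean
X_demo = np.ones((3, 1))
y_demo = np.array([1.0, 2.0, 3.0])
beta_demo, se_demo, vcov_demo = ols_with_se(X_demo, y_demo)
robust_se_demo = np.sqrt(np.diag(hc0_vcov(X_demo, y_demo)))
print(beta_demo, se_demo, robust_se_demo)
```

With an intercept-only design the classical standard error reduces to the sample standard deviation over the square root of n, which makes for an easy sanity check by hand.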

As a bonus in the data file, I have a python script that shows how layers in a vanilla deep learning model (with two layers) are synonymous with a particular partial least squares model. The ‘relu activation function’ is equivalent to a constraint that restricts the coefficients to be positive, but otherwise will produce equivalent predictions.

You could probably do some nonsense of saving the Jacobian matrix to get standard errors for the latent layers in the neural network if you really wanted to.

To end, for those who are really interested in deep learning models, I think a better analogy is that they are just a way to specify and estimate a set of simultaneous equations. Understanding the nature of those equations will help you relate how deep learning is similar to regression equations you are more familiar with. The neuron analogy is just nonsense IMO and should be retired.
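One small way to see the simultaneous equations view, sketched in numpy with made-up numbers (the matrices below are arbitrary, not from any fitted model): a two layer network with no activation function is two sets of linear equations, and substituting one into the other leaves a single regression equation. Putting a ReLU between the layers is what stops that substitution from working.

```python
import numpy as np

# Arbitrary illustrative values
X = np.array([[1.0, 2.0],
              [-1.0, 0.5],
              [0.0, -3.0]])        # 3 observations, 2 inputs
W1 = np.array([[1.0, -1.0],
               [0.5, 2.0]])        # equations 1: inputs -> 2 latent variables
W2 = np.array([[2.0],
               [-1.0]])            # equations 2: latent variables -> outcome

# No activation: the two sets of equations collapse into one
latent = X @ W1
pred_two_layer = latent @ W2
pred_one_equation = X @ (W1 @ W2)

# ReLU in between: negative latent values are zeroed out first
pred_relu = np.maximum(latent, 0.0) @ W2

print(np.allclose(pred_two_layer, pred_one_equation))
print(pred_two_layer.ravel(), pred_relu.ravel())
```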

Using pytorch to estimate group based traj models

Deep learning, tensors, pytorch. Now that I have that SEO junk out of the way 🙂, I’ve been trying to teach myself some “Deep Learning”, as it is what all of the cool kids are doing these days.

I was having a hard time though with many of the different examples. Many are for image data, so it was hard for me to translate them to the applications I am actually interested in. Many do talk about dimension reduction and reducing to hidden layers, so I thought that was similar in nature to latent class analysis, such as group-based trajectory modelling (GBTM).

If you aren’t familiar with GBTM, imagine a scenario in which you cluster data, and then you estimate a different regression model to predict some outcome for each subset of the clustered data. This is just a way to do that whole set-up in one go, instead of doing each part separately. It has quite a few different names – latent class analysis and mixture modelling are two common ones. The only thing really different about GBTM is that you have repeated observations – so if you follow the same person over time, they should always be assigned to the same cluster/mixture.

In short you totally can do GBTM models in deep learning libraries (as I will show), but actually most examples that I have walked through are more akin to dimension reduction of columns (so like PCA/Canonical Correlation). But the deep learning libraries are flexible enough to do the latent class analysis I want here. As far as I can tell they are basically just a nice way to estimate systems of equations (with a ton of potential parameters, and do it on the GPU if you want).

So I took it as a challenge to estimate GBTM models in a deep learning library – here pytorch. In terms of the different architectures/libraries (e.g. pytorch, tensorflow, Vowpal Wabbit) I just chose pytorch because one of my co-workers suggested pytorch was easier to learn!

I’ve posted a more detailed notebook of the code, but it worked out quite well. So first I simulated two groups of data (50 observations in each group and 11 time periods). I added a tiny bit of random noise, so this (I was hoping) should be a pretty tame problem for the machine to learn.
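A simulation along those lines can be sketched as below. Only the design (two groups, 50 people each, 11 time periods, a little noise) comes from the description above; the seed, the linear trajectories in `true_coefs`, and the noise level are assumptions I picked for illustration, and the notebook may well use different ones.

```python
import numpy as np

rng = np.random.default_rng(10)
n_per_group, n_periods = 50, 11
n_people = 2 * n_per_group

# Hypothetical trajectories: one (intercept, slope) row per latent group
true_coefs = np.array([[1.0, 0.5],
                       [6.0, -0.3]])

# Long format: one row per person-period
person_id = np.repeat(np.arange(n_people), n_periods)
time = np.tile(np.arange(n_periods), n_people)
group = (person_id >= n_per_group).astype(int)   # first 50 people are group 0

noise = rng.normal(0, 0.1, person_id.size)       # tiny bit of random noise
y = true_coefs[group, 0] + true_coefs[group, 1] * time + noise

long_dat = np.column_stack([person_id, time, y])
print(long_dat.shape)
print(long_dat[:3])
```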

The code to generate a pytorch module and have the machine churn out the gradients is pretty slick (less than 30 lines total of non-comments). Many GBTM code bases make you do the analysis in wide format (so one row per person), but here I was able to figure out how to set it up in long data format, which makes it real easy to generalize to unbalanced data.
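To make the long format point concrete, here is a numpy sketch of the log-likelihood a mixture model like this could use. It is not the pytorch module from the notebook: I am assuming normal outcomes with one residual standard deviation shared across groups, and `gbtm_loglik` and all of its argument names are hypothetical. The key step is summing the row-level log densities within each person before mixing over groups, which is what keeps a person in the same group across time and works no matter how many rows each person has.

```python
import numpy as np

def gbtm_loglik(person_id, X, y, betas, log_pi, sigma):
    """Mixture log-likelihood for long format data (hypothetical sketch).

    person_id : (n_rows,) integer codes 0..n_people-1
    X, y      : (n_rows, p) design matrix and (n_rows,) outcome
    betas     : (p, K) one column of coefficients per latent group
    log_pi    : (K,) log of the group proportions
    sigma     : residual standard deviation, assumed shared across groups
    """
    mu = X @ betas                                            # (n_rows, K)
    row_ll = (-0.5 * np.log(2 * np.pi * sigma ** 2)
              - 0.5 * ((y[:, None] - mu) / sigma) ** 2)       # normal log density
    n_people = int(person_id.max()) + 1
    person_ll = np.zeros((n_people, betas.shape[1]))
    np.add.at(person_ll, person_id, row_ll)                   # sum rows within person
    joint = person_ll + log_pi
    top = joint.max(axis=1, keepdims=True)                    # stable log-sum-exp
    per_person = top[:, 0] + np.log(np.exp(joint - top).sum(axis=1))
    post = np.exp(joint - per_person[:, None])                # posterior group probs
    return per_person.sum(), post

# Unbalanced toy data: people have 3, 2, 3 and 1 observations
person_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])
time = np.array([0, 1, 2, 0, 1, 0, 1, 2, 0], dtype=float)
y = np.array([1.1, 1.4, 2.0, 0.9, 1.6, 6.1, 5.6, 5.4, 5.9])
X = np.column_stack([np.ones(time.size), time])

betas = np.array([[1.0, 6.0],      # intercepts for group 0 and group 1
                  [0.5, -0.3]])    # slopes for group 0 and group 1
log_pi = np.log(np.array([0.5, 0.5]))

loglik, post = gbtm_loglik(person_id, X, y, betas, log_pi, sigma=0.5)
loglik_bad, _ = gbtm_loglik(person_id, X, y, np.zeros((2, 2)), log_pi, sigma=0.5)
print(loglik, loglik_bad)
print(post.round(3))
```

An optimizer would maximize this quantity (or minimize its negative) over `betas`, `log_pi`, and `sigma`; the function here only evaluates it at fixed values.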

It took quite a few iterations to converge though. (Iterations were super fast, but it is a tiny problem, so I am not sure how the timing will generalize.) It also only converged when using the Adam optimizer; stochastic gradient descent converged to an answer with a similar mean square error, but nowhere near the right answer. These models are notorious for converging to sub-optimal locations, so that may just be an intrinsic part of the problem, and a good library needs to do better with starting conditions.

I have a few notes about potential updates to the code at the end of my Jupyter notebook. For count or binomial 0/1 data, that should be a pretty easy update. I also need to write code to do out of sample predictions (which I think I can figure out as well). A harder problem I am not sure how to figure out is to do an equation for the latent groups inside of the function. And I don’t know how to get standard errors for the coefficient estimates. Hopefully I can figure that out while trying to teach myself some more deep learning. I have a few convolution ideas I want to try out for spatial-temporal crime forecasting and include proactive police feedback, but I won’t get around to them for quite a while I imagine.