p-values with large samples (AMA)

Vishnu K, a doctoral student in Finance writes in a question:

Dear Professor Andrew Wheeler

Hope you are fine. I am big follower of your blog and have used it heavily to train myself. Since you welcome open questions, I thought of asking one here and I hope you don’t mind.

I was reading the blog Dave Giles and one of his blogs https://davegiles.blogspot.com/2019/10/everythings-significant-when-you-have.html assert that one must adjust for p values when working with large samples. In a related but old post, he says the same

“So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this really bad news. Much, much, smaller p-values are needed before we get all excited about ‘statistically significant’ results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned” https://davegiles.blogspot.com/2011/04/drawing-inferences-from-very-large-data.html#more

In one of the posts of Andrew Gelman, he said the same

“When the sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values.

Large n, likely to find something. Huge n, almost certain to find lots of small p-values” https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/

As Leamer (1978) points if the level of significance should be set as a decreasing function of sample size, is there a formula through which we can check the needed level of significance for rejecting a null?

Context 1: Sample Size is 30, number of explanatory variables are 5

Context 2: Sample Size is 1000, number of explanatory variables are 5

In both contexts cant, we use p-value <.05 or should we fix a very small p-value for context 2 even though both contexts relates to same data set with difference in context 2 being a lot more data points!

Worrying about p-values here is in my opinion the wrong way to think about it. You can focus on the effect size, and even if an effect is significant, it may be substantively too small to influence how you use that information.

Finance I see, so I will try to make a relevant example. Lets say a large university randomizes students to take a financial literacy course, and then 10 years later follows up to see their overall retirement savings accumulated. Say the sample is very large, and we have results of:

Taken Class: N=100,000 Mean=5,000 SD=2,000
   No Class: N=100,000 Mean=4,980 SD=2,000

SE of Difference ~= 9
Mean Difference = 20
T-Stat ~= 2.24
p-value ~= 0.025

So we can see that the treated class saves more! But it is only 20 dollars more over ten years. We have quite a precise estimate. Even though those who took the class save more, do you really think taking the class is worth it? Probably not based on these stats, it is such a trivial effect size given the sample and overall variance of savings.

And then as a follow up from Vishnu:

Thanks a lot Prof Andrew. One final question is, Can we use the Cohen’s d or any other stats for effect size estimation in these cases?

Cohen’s d = (4980 – 5000) ⁄ 2000 = 0.01.

I don’t personally really worry about Cohen’s D to be honest. I like to try to work out the cost-benefits on the scales that are meaningful (although this makes it difficult to compare across different studies). So since I am a criminologist, I will give a crime example:

Treated Areas: 40 crimes
Non-Treated Areas: 50 crimes

Ignore the standard error for this at the moment. Whether a drop in 10 crimes “is worth it” depends on the nature of the treatment and the type of crime. If the drop is simply stealing small items from the store, but the intervention was hire 10 security guards, it is likely not worth it (the 10 guards salary is likely much higher than the 10 items they prevented theft of).

But pretend now that the intervention was nudging police officers to patrol more in hot spots (so no marginal labor cost) and the crimes we examined were shootings. Preventing 10 shootings is a pretty big deal, because they have such large societal costs.

In this scenario the costs-benefits are always on the count scale (how many crimes did you prevent). Doing another scale (like Cohen’s D or incident rate ratios or whatever) just obfuscates how to calculate the costs/benefits in this scenario.

Generating multiple solutions for linear programs

I had a question the other day on my TURF analysis about generating not just the single best solution for a linear programming problem, but multiple solutions. This is one arena I like the way people typically do genetic algorithms, in that they have a leaderboard with multiple top solutions. Which is nice for exploratory data analysis.

This example though it is pretty easy to amend the linear program to return exact solutions by generating lazy constraints. (And because the program is fast, you might as well do it this way as well to get an exact answer.) The way it works here, we have decision variables for products selected. You run the linear program, and then collect the k products selected, then add a constraint to the model in the form of:

Sum(Product_k) <= k-1 

Then rerun the model with this constraint, get the solution, and then repeat. This constraint makes it so the group of decision variables selected cannot be replicated (and in the forbidden set). Python’s pulp makes this quite easy, and here is a function to estimate the TURF model I showed in that prior blog post, but generate this multiple leaderboard:

import pandas as pd
import pulp

# Function to get solution from turf model
def get_solution(prod_dec,prod_ind):
    '''
    prod_dec - decision variables for products
    prod_ind - indices for products
    '''
    picked = []
    for j in prod_ind:
        if prod_dec[j].varValue >= 0.99:
            picked.append(j)
    return picked

# Larger function to set up turf model 
# And generate multiple solutions
def turf(data,k,solutions):
    '''
    data - dataframe with 0/1 selected items
    k - integer for products selected
    solutions - integer for number of potential solutions to return
    '''
    # Indices for Dec variables
    Cust = data.index
    Prod = list(data)
    # Problem and Decision Variables
    P = pulp.LpProblem("TURF", pulp.LpMaximize)
    Cu = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Pr = pulp.LpVariable.dicts("Products Selected", [j for j in Prod], lowBound=0, upBound=1, cat=pulp.LpInteger)
    # Objective Function (assumes weight = 1 for all rows)
    P += pulp.lpSum(Cu[i] for i in Cust)
    # Reached Constraint
    for i in Cust:
        #Get the products selected
        p_sel = data.loc[i] == 1
        sel_prod = list(p_sel.index[p_sel])
        #Set the constraint
        P += Cu[i] <= pulp.lpSum(Pr[j] for j in sel_prod)
    # Total number of products selected constraint
    P += pulp.lpSum(Pr[j] for j in Prod) == k
    # Solve the equation multiple times, add constraints
    res = []
    km1 = k - 1
    for _ in range(solutions):
        P.solve()
        pi = get_solution(Pr,Prod)
        # Lazy constraint
        P += pulp.lpSum(Pr[j] for j in pi) <= km1
        # Add in objective
        pi.append(pulp.value(P.objective))
        res.append(pi)
    # Turn results into nice dataframe
    cols = ['Prod' + str(i+1) for i in range(k)]
    res_df = pd.DataFrame(res, columns=cols+['Reach'])
    res_df['Reach'] = res_df['Reach'].astype(int)
    return res_df

And now we can simply read in our data, turn it into binary 0/1, and then generate the multiple solutions.

#This is simulated data from XLStat
#https://help.xlstat.com/s/article/turf-analysis-in-excel-tutorial?language=en_US
prod_data = pd.read_csv('demoTURF.csv',index_col=0)

#Replacing the likert scale data with 0/1
prod_bin = 1*(prod_data > 3)

# Get Top 10 solutions
top10 = turf(data=prod_bin,k=5,solutions=10)
print(top10)

Which generates the results:

>>> print(top10)
        Prod1       Prod2       Prod3       Prod4       Prod5  Reach
0   Product 4  Product 10  Product 15  Product 21  Product 25    129
1   Product 3  Product 15  Product 16  Product 25  Product 26    129
2   Product 4  Product 14  Product 15  Product 16  Product 26    129
3  Product 14  Product 15  Product 16  Product 23  Product 26    129
4   Product 3  Product 15  Product 21  Product 25  Product 26    129
5  Product 10  Product 15  Product 21  Product 23  Product 25    129
6   Product 4  Product 10  Product 15  Product 16  Product 27    129
7  Product 10  Product 15  Product 16  Product 23  Product 27    129
8   Product 3  Product 10  Product 15  Product 21  Product 25    128
9   Product 4  Product 10  Product 15  Product 21  Product 27    128

I haven’t checked too closely, but this looks to me offhand like the same top 8 solutions (with a reach of 129) as the XLStat website. The last two rows are different, likely because there are further ties.

For TURF analysis, you may actually consider a lazy constraint in the form of:

Product_k = 0 for each k

This is more restrictive, but if you have a very large pool of products, will ensure that each set of products selected are entirely different (so a unique set of 5 products for every row). Or you may consider a constraint in terms of customers reached instead of on products, so if solution 1 reaches 129, you filter those rows out and only count newly reached people in subsequent solutions. This ensures each row of the leaderboard above reaches different people than the prior rows.

Forum posts on Stackoverflow sites

When the initial cross validated stack exchange site (Stackoverflow but for statistics) was formed, I participated a ton. My participation waned though about the time I got a job as a professor. When starting I could skim almost every question, and I learned a ton from that participation. But when the site got more popular that approach was not sustainable. And combined with less time as a professor I just stopped checking entirely.

More recently I have started to simply browse the front page in the morning and only click on questions that look interesting (or I think I could answer reasonably quickly). Most of those answers recently have been Poisson related stuff:

Related I have lost time for the past two weeks, but before that made some good manic progress for my ptools R package. Next step is to make some vignettes for the more complicated spatial feature engineering functions (and maybe either a pre-commit hook to remind me to build the ReadMe, or generate a ReadMe as an artifact using Github actions). The package currently has good documentation, unit tests, and CICD using Github actions.

I also skim the Operations Research site and the Data Science sites. See some recent questions I answered:

The OR site has a really amazing set of people answering questions. I doubt I will ever see a simple enough question fast enough to answer before the multiple guru’s on that site. But I love perusing the answers, similar to when I first started cross validated I have learned quite a bit about formulating linear programming problems.

The data science exchange is at the other end of the spectrum – it is partly due to ill-specified questions, but the level of commentary is very poor (it may in fact be a net negative to the world/internet overall – quite a bit of bad advice). It is lower quality than skimming data science articles on Medium for example (there is some bad stuff on Medium, but overall it is more good than bad that I have seen at least). There is quite a bit of bad data science advice on the internet, and I can see it in the people I am interviewing for DS jobs. This is mainly because quite a bit of DS is statistics, and people seem to rely on copy-pasta solutions without understanding the underlying statistical/decision analysis problems they are solving.

Extending sklearns OrdinalEncoder

I’ve used a variant of this for a few different projects, so figured it was worth sharing. Sklearn’s OrdinalEncoder is close, but not quite what I want for a few different scenarios. Those are:

  • mixed input data types
  • missing data support (which can vary across the mixed input types)
  • the ability to limit encoding of rare categories (useful for regression models)

So I have scripted up a simple new class, what I call SimpleOrdEnc(), and can share it here in the blog post.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd

class SimpleOrdEnc():
    def __init__(self, dtype=int, unknown_value=-1, lim_k=None,
                 lim_count=None):
        self.unknown_value = unknown_value
        self.dtype = dtype
        self.lim_k = lim_k
        self.lim_count = lim_count
        self.vars = None
        self.soe = None
    def fit(self, X):
        self.vars = list(X)
        # Now creating fit for each variable
        res_oe = {}
        for v in list(X):
            res_oe[v] = OrdinalEncoder(dtype=self.dtype,
                handle_unknown='use_encoded_value',
                        unknown_value=self.unknown_value)
            # Get unique values minus missing
            xc = X[v].value_counts().reset_index()
            # If lim_k, only taking top K value
            if self.lim_k:
                top_k = self.lim_k - 1
                un_vals = xc.loc[0:top_k,:]
            # If count, using that to filter
            elif self.lim_count:
                un_vals = xc[xc[v] >= self.lim_count].copy()
            # If neither
            else:
                un_vals = xc
            # Now fitting the encoder for one variable
            res_oe[v].fit( un_vals[['index']] )
        # Appending back to the big class
        self.soe = res_oe
    # Defining transform/inverse_transform classes
    def transform(self, X):
        xcop = X[self.vars].copy()
        for v in self.vars:
            xcop[v] = self.soe[v].transform( X[[v]].fillna(self.unknown_value) )
        return xcop
    def inverse_transform(self, X):
        xcop = X[self.vars].copy()
        for v in self.vars:
            xcop[v] = self.soe[v].inverse_transform( X[[v]].fillna(self.unknown_value) )
        return xcop

This works mostly the same way that other sklearn objects do. You instantiate the object, then call fit, transform, inverse_transform, etc. Under the hood it just turns the data into a collection of ordinal encoders, but does a few things. One is that it strips missing values from the original fit data – so out of the box you do not need to do anything like x.fillna(-1), it just works. It is on you though to choose a missing unknown value that does not collide with potential encoded or decoded values. (Also for fit it should be a pandas dataframe, weird stuff will happen if you pass other numpy objects or lists.)

The second are the lim_k or the lim_count arguments. This is useful if you want to encode rare categories as another value. lim_k is if you want to say keep the top 20 categories in the fitted dataset. lim_count sets the threshold at how many cases are in the data, e.g. if you have at least 100 keep it. lim_k takes precedence over lim_count, so if you specify both lim_count is ignored.

These args also confound the missing values, so missing (even if it is common in the data), gets assiged to the ‘other’ category in this encoding. If that is not the behavior you want, I don’t see any way around not explicitly using fillna() before all this.

So here is a simple example use case:

x1 = [1,2,3]
x2 = ['a','b','c']
x3 = ['z','z',None]
x4 = [4,np.nan,5]

x = pd.DataFrame(zip(x1,x2,x3,x4),columns=['x1','x2','x3','x4'])
print(x)

oe = SimpleOrdEnc()
oe.fit(x)

# Transform to the same data
tx = oe.transform(x)
print(tx)

# Inverse transform gives you None
ix = oe.inverse_transform(tx)
print(ix)

So you can see this handles missing input data, but the inverse transform always returns None values for missing. The fit method though returns numeric encoded columns with the same variable names. I default to missing values of -1 as light boost (and I think catboost as well), have those as the default missing data values for categorical data.

Here is an example limiting the output to only categories that have at least 10 observations, and setting the missing data value to 99 instead of -1.

size = 1000
x1 = np.random.choice(['a','b','c'], size).tolist()
x2 = np.random.choice([1, np.nan, 2], size, p=[0.8,0.18,0.02]).tolist()
x3 = np.random.choice(['z','y','x','w',np.nan], size, p=[0.8,0.15,0.04, 0.005, 0.005]).tolist()
x = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe = SimpleOrdEnc(lim_count=10, unknown_value=99)
oe.fit(x)

# Checking with a simpler data frame

x1 = ['a','b','c','d',None,]
x2 = [1,2,-1,4,np.nan]
x3 = ['z','y','x','w','v']
sx = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe.transform(sx)

Because the class is all wrapped up in one object, you can then use pickle to save the object/use later in pipelines. If I know I want to test out specific regression models with my data, I often use the lim_count to make sure I am not keeping a bunch of small dangles of high cardinality data. (Often in my data I use missing data is rare enough I don’t even want to worry about imputing, I rather just treat it as a rare category.)

One use case this does not work out so well though for is ICD codes in wide format. Will need to write another blog post about that. But often will just reshape wide to long to fit these encoders.

Musings on Project Organization, Books and Courses

Is there a type of procrastination via which people write lists of things? I have that condition.

I have been recently thinking about project organization. At work we have been using the Cookie Cutter Data Science project set up – and I really hate it. I have been thinking about this more recently, as I have taken over several other data scientists models at work. The Cookie Cutter Template is waaaay too complicated, and mixes logic of building python packages (e.g. setup.py, a LICENSE folder) with data science in production code (who makes their functions pip installable for a production pipeline?). Here is the Cookie Cutter directory structure (even slightly cut off):

Cookie cutter has way too many folders (data folder in source, and data folder itself), multiple nested folders (what is the difference between external data, interim, and raw data?, what is the difference between features and data in the src folder?) I can see cases for individual parts of these needed sometimes (e.g. an external data file defining lookups for ICD codes), but why start with 100 extra folders that you don’t need. I find this very difficult taking over other peoples projects in that I don’t know where there are things and where there are not (most of these folders are empty).

So I’ve reorganized some of my projects at work, and they now look like this:

├── README.md           <- High level overview of project + any special notes
├── requirements.txt    <- Default python libraries we often use (eg sklearn, sqlalchemy)
├                         + special instructions for conda environments in our VMs
├── .gitignore          <- ignore `models/*.pkl`, `*.csv`, etc.
├── /models             <- place to store trained and serialized models
├── /notebooks          <- I don't even use notebooks very often, more like a scratch/EDA folder
├── /reports            <- Powerpoint reports to business (using HMS template)
├── /src                <- Place to store functions

And then depending on the project, we either use secret environment variables, or have a YAML file that has database connection strings etc. (And that YAML is specified in .gitignore.)

And then over time in the root folder it will typically have shell scripts call whatever production pipeline or API we are building. All the function files in source is fine, although it can grow to more modules if you really want it to.

And this got me thinking about how to teach this program management stuff to new data scientists we are hiring, and if I was still a professor how I would structure a course to teach this type of stuff in a social science program.

Courses

So in my procrastination I made a generic syllabi for what this software developement course would look like, Software & Project Development For Social Scientists. It would have a class/week on using the command prompt, then a week on github, then a few weeks building a python library, then ditto for an R package. And along the way sprinkle in literate programming (notebooks and markdown and Latex), unit testing, and docker.

And here we could discuss how projects are organized. And social science students get exposed to way more stuff that is relevant in a typical data science role. I have over the years also dreamt up other data science related courses as well.

Stats Programming for CJ. This goes through the basics of data manipulation using statistical programming. I would likely have tutorials for R, python, SPSS, and Stata for this. My experience with students is that even if they have had multiple stats classes in grad school, if you ask them “take this incident dataset with dates, and prepare a weekly level file with counts of crimes per week” they don’t know how to do even that simple task (an aggregation). So students need an entry level data manipulation course.

Optimization for Criminal Justice (or alt title Operations Research and Machine Learning for CJ). This one is not as developed as some of my other courses, but I think I could make it work for a semester. I think learning linear programming is a really great skill not taught at all in any CJ program I am aware of. I have some small notes on machine learning in my Research Design class for PhD students, but that could be expanded out (week for decision trees/forests, week for boosting, week for neural networks, etc.).

And last, I have made syllabi for the one credit entry level course for undergrad students, and the equivalent course for the new PhD students, College Prep. These classes I had I don’t think did a very good job. My intro one at Bloomsburg for undergrad had a textbook lol! The only thing I remember about my PhD one was fear mongering over publications (which at that point I had no idea what was going on), and spending the last class with Julie Horney and David McDowell at whatever the place next to the Washington Tavern in Albany was called (?Gingerbread?).

These are of course just in my head at the moment. I have posted my course materials over the years that I have delivered.

I have pitched to a few programs to hire me as a semi teaching professor (and still keep my private sector gig). This set up is not that uncommon in comp sci departments, but no CJ ones I think are interested. Even though I like musing about courses, adjunct pay is way too low to justify this investment, and should be paid to both develop the material as well as deliver the class.

Books

I have similarly made outlines for books over the years as well. One is Data Science for Crime Analysis with Python. I think there is an opening in the crime analysis market to advance to more professional coding, and so a python book would be good. But the market is overall tiny, my high end guesstimates are only around 800, so hard to justify the effort. (It would be mainly just a collection of my blog posts, but all in a nicer format for everyone to walk through/replicate.)

Another is a reader book, Handbook of Advanced Crime Analysis. That may not be needed though, as Cory Haberman and Liz Groff did a recent book that has quite a bit of overlap (can’t find it at the moment, maybe it is not out yet). Many current advanced techniques are scattered and sometimes difficult to replicate, I figured a reader that also includes code walkthroughs would help quite a few PhD students.

And again if I was still in the publishing game I would like to turn my Poisson course notes into a little Sage green book.

If I was still a professor, this would go hand in hand with developing courses. I know Uni’s do sometimes have grants to develop open source teaching materials, and these would probably best fit those molds. These aren’t going to generate revenue directly from sales.

So complaints and snippets on blog posts are all you are going to get for now from me.

Using Jupyter Notebooks to make nice README’s for GitHub

Working on my R package ptools, the devtools folks have you make a readme R markdown file to compile to a nice readme markdown file for github. I thought to myself that you could functionally do the same thing with juypter notebooks for python. So here is a quick example of that for my retenmod python package.

So first, here is the old readme in its entirety, rendered on Github:

You can see I have an example code snippet, but it does not actually output the results. Here is the update, where I use jupyter to render the markdown nicely:

So we have nice syntax highlighting even. (Note that the pip install code is not run in a %sh cell, it is just code formatting inside of a markdown cell.) I do like the way R markdown renders the output a bit nicer than here (I also haven’t tried with pretty pandas tables). But you can also include matplotlib images as well:

You might typically want to add in README.ipynb to your gitignore file, but here I included it in the github package so you can see what this notebook looks like. To compile the notebook to markdown is quite simple:

jupyter nbconvert --execute --to markdown README.ipynb

If you have matplotlib images, it saves them in a folder named README_files (not sure if there is a flag to change this option). To get the images to render you then also need to push that image folder into Github.

Spatial consistency of shootings and NIJ recid working papers

I have two recent working papers out:

The NIJ forecasting paper is the required submission by NIJ. Gio and I will likely try to turn this into a real paper in the near future. I’d note George Mohler and Michael Porter did the same thing as us, clip the probabilities to under 0.5 to win the fairness competition.

NIJ was interested in “what variables are the most important” – I will need to slate a longer blog post about this in the future, but this generally is not the right way to frame predictive challenges. You do not need real in depth understanding of the underlying system, and many times different effects can be swapped out for one another (e.g. Dawes, 1979).

The paper on shootings in Buffalo is consistent with my blog posts on shootings in NYC (precincts, grid cells). Even though shootings have gone up by quite a bit in Buffalo overall, the spatial distribution is very consistent over time. Appears similar to a recent paper by Jeff Brantingham and company as well.

It is a good use case for the differences in SPPT results when adjusting for multiple comparisons, we get a S index of 0.88 without adjustments, (see below distribution of p-values). These are consistent with random data though, so when doing a false discovery rate correction we have 0 areas below 0.05.

If you look at the maps there are some fuzzy evidence of shifts, but it is quite weak overall. Also one thing I mention here is that even though we have hot spots of shootings, even the hottest grid cells only have 1 shooting a month. Not clear to me if that is sufficient density (if only considering shootings) to really justify a hot spots approach.

References

  • Brantingham, P. J., Carter, J., MacDonald, J., Melde, C., & Mohler, G. (2021). Is the recent surge in violence in American cities due to contagion?. Journal of Criminal Justice, 76, 101848.
  • Circo, G., & Wheeler, A. (2021). National Institute of Justice Recidivism Forecasting Challenge Team “MCHawks” Performance Analysis. CrimRxiv. https://doi.org/10.21428/cb6ab371.9aa2c75a
  • Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7), 571.
  • Drake, G., Wheeler, A., Kim, D.-Y., Phillips, S. W., & Mendolera, K. (2021). The Impact of COVID-19 on the Spatial Distribution of Shooting Violence in Buffalo, NY. CrimRxiv. https://doi.org/10.21428/cb6ab371.e187aede
  • Mohler, G., & Porter, M. D. (2021). A note on the multiplicative fairness score in the NIJ recidivism forecasting challenge. Crime Science, 10(1), 1-5.

Reversion in the tech stack and why DS models fail

Hackernews recently shared a story about not using an IDE, and I feel mostly the same way. Hence the title in the post – so my current workflow for when I steal some time to work on my R package ptools my workflow looks like this, using Rterm from the shell:

I don’t have anything against RStudio, I just only have so much room in my brain. Sometimes conversations at work are like a foreign language, “How are we going to test the NiFi script from Hadoop to our Kubernetes environment” or “I can pull the docker image from JFrog, but I run out of room when extracting the image on our sandbox machine. But df says we have plenty of room on all the partitions?”.

If you notice at the top the (base) in front of the shell, that is because this is within the anaconda shell as well. So if you look at many of my past blog posts (see here for one example), I am just using the snipping tool in windows to take screenshots of the shell output in interactive mode.

I am typically just writing the code in Notepad++ (as well as this blog post) – and it is quite simple to switch between interactive copy this function/code and compiling entire scripts. Here is a screenshot of R unit tests for example.

So Notepad++ has some text highlight (for both R and python), and that is nice, but honestly not that necessary. Main thing I use is the selection of brackets to make sure they are balanced. I am sure I am missing out on some nice autocomplete features that would make me more productive, and function hints in Spyder are nice for pandas functions (I mostly use google still though for that when I need it).

I do use VS Code for development work on our headerless virtual machines at work. But that is more to replicate essentially the workflow on Windows with file explorer + Notepad++ + Shell (I am not a vim ninja – what is it esc + wq:, need to look that up everytime). I fucked up one of my git repos the other day using VS Code stuff tools, and I am just using git directly anymore. (Again this means I am the problem, not VS Code!)

Why data science projects fail?

Some more random musings, but the more I get involved and see what projects work and what don’t at work, pretty much all of the failures I have come across are due to what I will call “not modeling the right thing”. That potentially covers a bit, but quite a few are simply not understanding counterfactual reasoning and selection bias.

The modeling part (in terms of actually fitting models) is typically quite easy – you do some simple but slightly theoretically informed feature engineering and feed that info into a machine learning model that is very flexible. But maybe that is the problem, people can easily fool themselves into thinking a model looks good, but because they are modeling the wrong thing it does not result in better decision making.

Even in most of the failures I have seen, selection bias is surmountable (often just requires multiple models or models on different samples of data – reduced form for the win!). So learning how to train/test split the data, and feed your data into XGboost only takes a few classes to learn. How to know the right thing to model though takes a bit more thought.

A secondary part of the failure is not learning how to translate the model outputs into actionable decisions. But the not modeling the right thing is at the start, so makes any downstream decision not work out how you want.

Chunking it up in pandas

In the python pandas library, you can read a table (or a query) from a SQL database like this:

data = pandas.read_sql_table('tablename',db_connection)

Pandas also has an inbuilt function to return an iterator of chunks of the dataset, instead of the whole dataframe.

data_chunks = pandas.read_sql_table('tablename',db_connection,chunksize=2000)

I thought for awhile this was somewhat worthless, as I thought it still read the whole thing into memory. I have been using it though to pull new records, generate predictions, and then feed the predictions back into another table though (easier to write back to a new DB in smaller chunks).

But I actually did some tests recently and I was wrong. When using chunksize, it does not read the whole table into memory, so works like you would want. In particular, I was testing out sqlalchemy and execution_options=dict(stream_results=True) when creating the engine. Which works great for postgres – it is not needed though for every other DB I have tried so far (I will show postgres and sqlite here, at work have also tested teradata and sql server).

Here I use the python library memory_profiler and show the difference in memory usage for a database of crime incidents from Dallas PD. First in my script I load in the libraries I will be using, and then created my different engine connection strings

''' 
Tests for memory and chunking
up dataframes
'''
import sqlalchemy
import pandas as pd
from datetime import datetime
from memory_profiler import profile

# Check out https://github.com/apwheele/Blog_Code/tree/master/Python/jupyter_reports
# For where this data came from
# sqlalchemy string for SQLlite
sqli = r'sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite'
lite_eng = sqlalchemy.create_engine(sqli)

# sqlalchemy string for postgres
pg = f'postgresql://id:pwd@localhost/dbname' #need to fill these in with your own info
pg_eng = sqlalchemy.create_engine(pg)
pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True))

Next, we can make a function to either read a table all at once, or to return an iterator that chunks the data up into a tinier number of rows. We also decorate this function with the profile function to get some nice statistics later. Here I just do some random stats and accumulate them in each iteration.

@profile
def read_table(sql_con, chnk=None):
    begin_time = datetime.now()
    inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    # If chunk, do the calcs in iterations
    if chnk:
        tot_inc, tot_arr = 0,0
        for c in inc:
            tot_inc += c.shape[0]
            tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    else:
        tot_inc = inc.shape[0]
        tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

Most of the time I am using something like this for predictions or generating follow up metrics to see how well my model is doing (and rewriting those intermediate results to a table). So inside the iteration loop looks something like pred_val = predict_function(c) and pred_val.to_sql('new_table',sql_con,if_exists='append').

Now just to finish off the script, you can do run the tests if called directly:

if __name__ == "__main__":
    read_table(sqli,None)       #sqlite big table all at once
    read_table(sqli,2000)       #sqlite in chunks
    read_table(pg_eng,None)     #pg big table
    read_table(pg_eng,2000)     #pg in chunks
    read_table(pg_eng_str,2000) #pg stream in chunks

Now when running this script, you would run it as python -m memory_profiler chunk_tests.py > test_results.txt, and this will give you a nice set of print outs for each function call. So lets start with sqlite reading the entire database into memory:

sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:22:41.652316 -- 2021-08-11 19:23:53.743702

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     82.4 MiB     82.4 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     82.4 MiB      0.0 MiB           1       begin_time = datetime.now()
    31   4260.7 MiB   4178.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33   4260.7 MiB      0.0 MiB           1       if chnk:
    34                                                 tot_inc, tot_arr = 0,0
    35                                                 for c in inc:
    36                                                     tot_inc += c.shape[0]
    37                                                     tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39   4260.7 MiB      0.0 MiB           1           tot_inc = inc.shape[0]
    40   4261.5 MiB      0.8 MiB           1           tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41   4261.5 MiB      0.0 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

You can see that the incident database is 785k rows (it is 100 columns as well). It takes only alittle over a minute to read in the table, but it takes up over 4 gigabytes of memory. So I can fit this in memory on my machine, but if I say added a free text comments field I might not be able to fit this in memory on my personal machine.

Now lets see what these results look like when I chunk the data up into smaller sets of only 2000 rows at a time:

sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:23:56.475680 -- 2021-08-11 19:25:33.896884

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     91.6 MiB     91.6 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     91.6 MiB      0.0 MiB           1       begin_time = datetime.now()
    31     91.9 MiB      0.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33     91.9 MiB      0.0 MiB           1       if chnk:
    34     91.9 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35    117.3 MiB    -69.6 MiB         394           for c in inc:
    36    117.3 MiB    -89.6 MiB         393               tot_inc += c.shape[0]
    37    117.3 MiB    -89.6 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41    112.0 MiB     -5.3 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

We only end up maxing out the memory at under 120 megabytes. You can see the total function also only takes around 1.5 minutes. So it is only ~30 seconds longer to run this result in chunks than doing it all at once.

Now onto postgres, I will forego showing the results when reading in the whole table at once – it is pretty much the same as sqlite when reading the whole table at once. But here are the results for the chunks with the usual sqlalchemy connection string:

sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:26:28.454807 -- 2021-08-11 19:27:41.917612

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     51.3 MiB     51.3 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     51.3 MiB      0.0 MiB           1       begin_time = datetime.now()
    31   2017.2 MiB   1966.0 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33   2017.2 MiB      0.0 MiB           1       if chnk:
    34   2017.2 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35   2056.8 MiB  -1927.3 MiB         394           for c in inc:
    36   2056.8 MiB      0.0 MiB         393               tot_inc += c.shape[0]
    37   2056.8 MiB      0.0 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41     89.9 MiB  -1966.9 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

So here we get some savings, but not nearly as good as sqlite – it ends up reading in around 2 gig for the iterator. But no fear, there is a specific option when setting up the sqlalchemy string for postgres. If you pay close attention to above where I build the connection strings, the engine for this trial is pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True)). And when I use the streaming results engine, here is the memory profile I get:

sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:27:41.922613 -- 2021-08-11 19:28:54.435834

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     88.8 MiB     88.8 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     88.8 MiB      0.0 MiB           1       begin_time = datetime.now()
    31     89.6 MiB      0.8 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33     89.6 MiB      0.0 MiB           1       if chnk:
    34     89.6 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35    100.5 MiB   -322.6 MiB         394           for c in inc:
    36    100.5 MiB   -325.4 MiB         393               tot_inc += c.shape[0]
    37    100.5 MiB   -325.4 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41     92.3 MiB     -8.2 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

Which is very similar to sqlite results.

Some workflows unfortunately are not so reducable into tiny chunks like this. E.g. say you were doing multiple time series for different groups, you would likely need to use substitution into the SQL to get one group at a time. I have done this to grab chunks of claims data one month at a time, but in many of my workflows I could just replace that with chunksize and do one read.

Some folks reading this may be thinking “this is a great use case for hadoop”. For the most part I am personally not working with that big of data, just mildly annoying at the edge of fitting into memory data. So this just do stuff in chunks works great for most of my workflows, and I don’t have to deal with the hassles of hadoop/spark.

My team is hiring

My company, HMS, is currently hiring in my data science team:

These positions can be remote, and HMS will sponsor visas as well.

Data scientist is a really broad brush. Typically what I am looking for is prior experience in data management (e.g. you should know some SQL), and the ability to explain the models you are using in plain English to business partners. You do not need prior X years of data science experience — on our team we have hired folks who were business analysts who had coding/machine learning experience, people pretty much straight from PhD, and I was a professor of criminology.

The director position is a bit different, but even there if you are say a project manager in data science and want to move up to a director role, a senior/principal data scientist with prior supervisory experience, or direct a different data related role (and have more than a passing knowledge of machine learning), those experiences would all be considered seriously for the role.

If you want to get a flavor for the types of projects we are working on at HMS (these are focused on health insurance claims and payment integrity, so no population health nor subrogation examples, but will be OK to get a high level), check out these blog posts of mine:

If you have questions always feel free to email!