Forum posts on Stack Overflow sites

When the Cross Validated Stack Exchange site (Stack Overflow but for statistics) was first formed, I participated a ton. My participation waned though around the time I got a job as a professor. When the site started I could skim almost every question, and I learned a ton from that participation. But once the site got more popular that approach was not sustainable, and combined with less free time as a professor I just stopped checking entirely.

More recently I have started to simply browse the front page in the morning and only click on questions that look interesting (or that I think I could answer reasonably quickly). Most of my recent answers have been Poisson related stuff:

Relatedly, I have lost time for the past two weeks, but before that I made some good manic progress on my ptools R package. The next step is to make some vignettes for the more complicated spatial feature engineering functions (and maybe either a pre-commit hook to remind me to build the ReadMe, or generate the ReadMe as an artifact using Github actions). The package currently has good documentation, unit tests, and CI/CD using Github actions.

I also skim the Operations Research and Data Science sites. See some recent questions I answered:

The OR site has a really amazing set of people answering questions. I doubt I will ever see a simple enough question fast enough to answer it before the multiple gurus on that site. But I love perusing the answers – similar to when I first started on Cross Validated, I have learned quite a bit about formulating linear programming problems.

The data science exchange is at the other end of the spectrum. Part of it is ill-specified questions, but the level of commentary is also very poor (it may in fact be a net negative to the world/internet overall – there is quite a bit of bad advice). It is lower quality than skimming data science articles on Medium for example (there is some bad stuff on Medium, but overall what I have seen is more good than bad). There is quite a bit of bad data science advice on the internet, and I can see it in the people I am interviewing for DS jobs. This is mainly because quite a bit of DS is statistics, and people seem to rely on copy-pasta solutions without understanding the underlying statistical/decision analysis problems they are solving.

Extending sklearn's OrdinalEncoder

I’ve used a variant of this for a few different projects, so figured it was worth sharing. Sklearn’s OrdinalEncoder is close, but not quite what I want for a few different scenarios. Those are:

  • mixed input data types
  • missing data support (which can vary across the mixed input types)
  • the ability to limit encoding of rare categories (useful for regression models)

So I have scripted up a simple new class, which I call SimpleOrdEnc(), and share it here in the blog post.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd

class SimpleOrdEnc():
    def __init__(self, dtype=int, unknown_value=-1, lim_k=None,
                 lim_count=None):
        self.unknown_value = unknown_value
        self.dtype = dtype
        self.lim_k = lim_k
        self.lim_count = lim_count
        self.vars = None
        self.soe = None
    def fit(self, X):
        self.vars = list(X)
        # Now creating fit for each variable
        res_oe = {}
        for v in list(X):
            res_oe[v] = OrdinalEncoder(dtype=self.dtype,
                                       handle_unknown='use_encoded_value',
                                       unknown_value=self.unknown_value)
            # Get unique values minus missing
            xc = X[v].value_counts().reset_index()
            # If lim_k, only taking top K value
            if self.lim_k:
                top_k = self.lim_k - 1
                un_vals = xc.loc[0:top_k,:]
            # If count, using that to filter
            elif self.lim_count:
                un_vals = xc[xc[v] >= self.lim_count].copy()
            # If neither
            else:
                un_vals = xc
            # Now fitting the encoder for one variable
            res_oe[v].fit( un_vals[['index']] )
        # Appending back to the big class
        self.soe = res_oe
    # Defining transform/inverse_transform methods
    def transform(self, X):
        xcop = X[self.vars].copy()
        for v in self.vars:
            xcop[v] = self.soe[v].transform( X[[v]].fillna(self.unknown_value) )
        return xcop
    def inverse_transform(self, X):
        xcop = X[self.vars].copy()
        for v in self.vars:
            xcop[v] = self.soe[v].inverse_transform( X[[v]].fillna(self.unknown_value) )
        return xcop

This works mostly the same way that other sklearn objects do. You instantiate the object, then call fit, transform, inverse_transform, etc. Under the hood it just turns the data into a collection of ordinal encoders, but it does a few extra things. One is that it strips missing values from the original fit data – so out of the box you do not need to do anything like x.fillna(-1), it just works. It is on you though to choose a missing unknown value that does not collide with potential encoded or decoded values. (Also, fit expects a pandas dataframe; weird stuff will happen if you pass other numpy objects or lists.)

The second is the lim_k and lim_count arguments. These are useful if you want to encode rare categories as another value. lim_k is if you want to, say, keep only the top 20 categories in the fitted dataset. lim_count sets a threshold on how many cases are in the data, e.g. keep a category if it has at least 100 observations. lim_k takes precedence over lim_count, so if you specify both, lim_count is ignored.

These args also confound the missing values, so missing (even if it is common in the data) gets assigned to the 'other' category in this encoding. If that is not the behavior you want, I don't see any way around explicitly using fillna() before all this.
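If you do want missing to keep its own category when using the limit arguments, here is a minimal sketch of that fillna() workaround (the column name and data are made up for illustration):

# Hypothetical single string column; give missing its own explicit level before
# fitting, so it is not lumped in with the rare/unknown categories at transform time
xf = pd.DataFrame({'cat_col': ['a', 'a', None, 'b']})
xf['cat_col'] = xf['cat_col'].fillna('MISSING')

oe_fill = SimpleOrdEnc(lim_count=1)
oe_fill.fit(xf)
tx_fill = oe_fill.transform(xf)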

So here is a simple example use case:

x1 = [1,2,3]
x2 = ['a','b','c']
x3 = ['z','z',None]
x4 = [4,np.nan,5]

x = pd.DataFrame(zip(x1,x2,x3,x4),columns=['x1','x2','x3','x4'])
print(x)

oe = SimpleOrdEnc()
oe.fit(x)

# Transform to the same data
tx = oe.transform(x)
print(tx)

# Inverse transform gives you None
ix = oe.inverse_transform(tx)
print(ix)

So you can see this handles missing input data, but the inverse transform always returns None values for missing. The transform method returns numeric encoded columns with the same variable names. I default to a missing value of -1, as lightgbm (and I think catboost as well) uses that as the default missing data value for categorical data.

Here is an example limiting the output to only categories that have at least 10 observations, and setting the missing data value to 99 instead of -1.

size = 1000
x1 = np.random.choice(['a','b','c'], size).tolist()
x2 = np.random.choice([1, np.nan, 2], size, p=[0.8,0.18,0.02]).tolist()
x3 = np.random.choice(['z','y','x','w',np.nan], size, p=[0.8,0.15,0.04, 0.005, 0.005]).tolist()
x = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe = SimpleOrdEnc(lim_count=10, unknown_value=99)
oe.fit(x)

# Checking with a simpler data frame

x1 = ['a','b','c','d',None,]
x2 = [1,2,-1,4,np.nan]
x3 = ['z','y','x','w','v']
sx = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe.transform(sx)

Because the class is all wrapped up in one object, you can then use pickle to save the object and use it later in pipelines. If I know I want to test out specific regression models with my data, I often use lim_count to make sure I am not keeping a bunch of small dangles of high cardinality data. (Often in the data I use, missing data is rare enough that I don't even want to worry about imputing; I'd rather just treat it as a rare category.)
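For the pickle part, a minimal sketch of saving the fitted encoder and reloading it in a later scoring step (the file name here is just for illustration):

import pickle

# Save the fitted encoder to disk
with open('ord_encoder.pkl', 'wb') as f:
    pickle.dump(oe, f)

# Later, in the scoring pipeline, reload and reuse the same encoding
with open('ord_encoder.pkl', 'rb') as f:
    oe_loaded = pickle.load(f)

new_encoded = oe_loaded.transform(sx)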

One use case this does not work out so well for though is ICD codes in wide format. I will need to write another blog post about that, but often I will just reshape wide to long to fit these encoders.
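To give a flavor of that reshape idea, here is a rough sketch with made-up diagnosis columns – stack the wide columns into one long column, fit a single encoder on it, and then every wide column shares the same mapping:

# Hypothetical wide ICD data
icd_wide = pd.DataFrame({'diag1': ['A01','B02',None],
                         'diag2': ['B02','C03','A01']})

# Reshape wide to long so all of the codes end up in one column
icd_long = icd_wide.melt(value_vars=['diag1','diag2'], value_name='icd')

oe_icd = SimpleOrdEnc()
oe_icd.fit(icd_long[['icd']])

# Each wide column can then be transformed with the same fitted encoder
d1_enc = oe_icd.transform(icd_wide[['diag1']].rename(columns={'diag1':'icd'}))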

Musings on Project Organization, Books and Courses

Is there a type of procrastination via which people write lists of things? I have that condition.

I have been thinking about project organization recently. At work we have been using the Cookie Cutter Data Science project set up – and I really hate it. I have been thinking about this more as I have taken over several other data scientists' models at work. The Cookie Cutter template is waaaay too complicated, and mixes the logic of building python packages (e.g. setup.py, a LICENSE file) with data science in production code (who makes their functions pip installable for a production pipeline?). Here is the Cookie Cutter directory structure (even slightly cut off):

Cookie cutter has way too many folders (a data folder inside src, and a data folder itself), and multiple nested folders (what is the difference between external, interim, and raw data? what is the difference between features and data in the src folder?). I can see cases for individual parts of these being needed sometimes (e.g. an external data file defining lookups for ICD codes), but why start with 100 extra folders that you don't need? I find this makes taking over other people's projects very difficult, in that I don't know where things are and where they are not (most of these folders are empty).

So I’ve reorganized some of my projects at work, and they now look like this:

├── README.md           <- High level overview of project + any special notes
├── requirements.txt    <- Default python libraries we often use (eg sklearn, sqlalchemy)
│                          + special instructions for conda environments in our VMs
├── .gitignore          <- ignore `models/*.pkl`, `*.csv`, etc.
├── /models             <- place to store trained and serialized models
├── /notebooks          <- I don't even use notebooks very often, more like a scratch/EDA folder
├── /reports            <- Powerpoint reports to business (using HMS template)
├── /src                <- Place to store functions

And then depending on the project, we either use secret environment variables, or have a YAML file that has database connection strings etc. (And that YAML is specified in .gitignore.)
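As a minimal sketch of those two approaches (the variable and key names here are hypothetical, and the YAML route assumes PyYAML is installed):

import os
import yaml

# Prefer a secret environment variable, fall back to a local YAML config
# that is listed in .gitignore
db_str = os.environ.get('DB_CONN')
if db_str is None:
    with open('config.yml') as f:
        db_str = yaml.safe_load(f)['db_conn']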

And then over time the root folder will typically have shell scripts that call whatever production pipeline or API we are building. Keeping all the function files in src is fine, although it can grow to more modules if you really want it to.

And this got me thinking about how to teach this program management stuff to new data scientists we are hiring, and if I was still a professor how I would structure a course to teach this type of stuff in a social science program.

Courses

So in my procrastination I made a generic syllabus for what this software development course would look like, Software & Project Development For Social Scientists. It would have a class/week on using the command prompt, then a week on github, then a few weeks on building a python library, then ditto for an R package. And along the way sprinkle in literate programming (notebooks and markdown and LaTeX), unit testing, and docker.

And here we could discuss how projects are organized. This way social science students would get exposed to way more stuff that is relevant in a typical data science role. I have over the years also dreamt up other data science related courses as well.

Stats Programming for CJ. This goes through the basics of data manipulation using statistical programming. I would likely have tutorials for R, python, SPSS, and Stata for this. My experience with students is that even if they have had multiple stats classes in grad school, if you ask them to "take this incident dataset with dates, and prepare a weekly level file with counts of crimes per week" they don't know how to do even that simple task (an aggregation, see the sketch below). So students need an entry level data manipulation course.
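For a flavor of that aggregation task, here is a minimal pandas sketch (with toy data and hypothetical column names):

import pandas as pd

# Incident level records with a date field
incidents = pd.DataFrame({'date': pd.to_datetime(['2021-01-01','2021-01-02','2021-01-09'])})

# Roll incidents up to weekly counts of crimes
weekly = (incidents.set_index('date')
                   .resample('W')
                   .size()
                   .rename('crime_count')
                   .reset_index())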

Optimization for Criminal Justice (or alt title Operations Research and Machine Learning for CJ). This one is not as developed as some of my other courses, but I think I could make it work for a semester. I think learning linear programming is a really great skill not taught at all in any CJ program I am aware of. I have some small notes on machine learning in my Research Design class for PhD students, but that could be expanded out (week for decision trees/forests, week for boosting, week for neural networks, etc.).

And last, I have made syllabi for the one credit entry level course for undergrad students, and the equivalent course for new PhD students, College Prep. The versions of these classes I had I don't think did a very good job. My intro one at Bloomsburg for undergrad had a textbook lol! The only thing I remember about my PhD one was fear mongering over publications (which at that point I had no idea what was going on with), and spending the last class with Julie Horney and David McDowell at whatever the place next to the Washington Tavern in Albany was called (?Gingerbread?).

These are of course just in my head at the moment. I have posted the course materials I have delivered over the years though.

I have pitched to a few programs to hire me as a semi teaching professor (and still keep my private sector gig). This set up is not that uncommon in comp sci departments, but no CJ ones I think are interested. Even though I like musing about courses, adjunct pay is way too low to justify the investment – one should be paid both to develop the material and to deliver the class.

Books

I have similarly made outlines for books over the years as well. One is Data Science for Crime Analysis with Python. I think there is an opening in the crime analysis market to advance to more professional coding, and so a python book would be good. But the market is overall tiny – my high end guesstimate is only around 800 – so it is hard to justify the effort. (It would be mainly just a collection of my blog posts, but all in a nicer format for everyone to walk through/replicate.)

Another is a reader book, Handbook of Advanced Crime Analysis. That may not be needed though, as Cory Haberman and Liz Groff did a recent book that has quite a bit of overlap (can't find it at the moment, maybe it is not out yet). Many current advanced techniques are scattered and sometimes difficult to replicate, so I figured a reader that also includes code walkthroughs would help quite a few PhD students.

And again if I was still in the publishing game I would like to turn my Poisson course notes into a little Sage green book.

If I was still a professor, this would go hand in hand with developing courses. I know Uni’s do sometimes have grants to develop open source teaching materials, and these would probably best fit those molds. These aren’t going to generate revenue directly from sales.

So complaints and snippets on blog posts are all you are going to get for now from me.

Using Jupyter Notebooks to make nice READMEs for GitHub

Working on my R package ptools, the devtools folks have you make a README R markdown file to compile to a nice README markdown file for github. I thought to myself that you could do functionally the same thing with jupyter notebooks for python. So here is a quick example of that for my retenmod python package.

So first, here is the old readme in its entirety, rendered on Github:

You can see I have an example code snippet, but it does not actually output the results. Here is the update, where I use jupyter to render the markdown nicely:

So we have nice syntax highlighting even. (Note that the pip install code is not run in a %sh cell, it is just code formatting inside of a markdown cell.) I do like the way R markdown renders the output a bit nicer than here (I also haven’t tried with pretty pandas tables). But you can also include matplotlib images as well:

You might typically want to add README.ipynb to your .gitignore file, but here I included it in the github package so you can see what this notebook looks like. Compiling the notebook to markdown is quite simple:

jupyter nbconvert --execute --to markdown README.ipynb

If you have matplotlib images, it saves them in a folder named README_files (not sure if there is a flag to change this option). To get the images to render you then also need to push that image folder to Github.
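If you wanted to automate that rebuild (say in a release script or a pre-commit style check), a minimal sketch is just shelling out to the same nbconvert command:

import subprocess

# Rebuild README.md (and the README_files images) from the notebook;
# check=True fails loudly if the notebook errors out
subprocess.run(['jupyter', 'nbconvert', '--execute',
                '--to', 'markdown', 'README.ipynb'], check=True)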

Spatial consistency of shootings and NIJ recid working papers

I have two recent working papers out:

The NIJ forecasting paper is the required submission by NIJ. Gio and I will likely try to turn this into a real paper in the near future. I'd note George Mohler and Michael Porter did the same thing as us, clipping the probabilities to under 0.5 to win the fairness competition.

NIJ was interested in "what variables are the most important" – I will need to slate a longer blog post about this in the future, but this is generally not the right way to frame predictive challenges. You do not need a real in depth understanding of the underlying system, and many times different effects can be swapped out for one another (e.g. Dawes, 1979).

The paper on shootings in Buffalo is consistent with my blog posts on shootings in NYC (precincts, grid cells). Even though shootings have gone up by quite a bit in Buffalo overall, the spatial distribution is very consistent over time. Appears similar to a recent paper by Jeff Brantingham and company as well.

It is a good use case for the differences in SPPT results when adjusting for multiple comparisons – we get an S index of 0.88 without adjustments (see the below distribution of p-values). These p-values are consistent with random data though, so when doing a false discovery rate correction we have 0 areas below 0.05.
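For those not familiar, here is a generic sketch of that type of correction (Benjamini-Hochberg false discovery rate via statsmodels) – the p-values here are just made-up uniform draws to show the mechanics, not the Buffalo results:

import numpy as np
from statsmodels.stats.multitest import multipletests

p_vals = np.random.uniform(size=100)   # stand-in for the per-area SPPT p-values
reject, p_adj, _, _ = multipletests(p_vals, alpha=0.05, method='fdr_bh')
print(reject.sum())                    # number of areas still under 0.05 after correction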

If you look at the maps there is some fuzzy evidence of shifts, but it is quite weak overall. Also one thing I mention here is that even though we have hot spots of shootings, even the hottest grid cells only have 1 shooting a month. It is not clear to me if that is sufficient density (if only considering shootings) to really justify a hot spots approach.

References

  • Brantingham, P. J., Carter, J., MacDonald, J., Melde, C., & Mohler, G. (2021). Is the recent surge in violence in American cities due to contagion?. Journal of Criminal Justice, 76, 101848.
  • Circo, G., & Wheeler, A. (2021). National Institute of Justice Recidivism Forecasting Challenge Team “MCHawks” Performance Analysis. CrimRxiv. https://doi.org/10.21428/cb6ab371.9aa2c75a
  • Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7), 571.
  • Drake, G., Wheeler, A., Kim, D.-Y., Phillips, S. W., & Mendolera, K. (2021). The Impact of COVID-19 on the Spatial Distribution of Shooting Violence in Buffalo, NY. CrimRxiv. https://doi.org/10.21428/cb6ab371.e187aede
  • Mohler, G., & Porter, M. D. (2021). A note on the multiplicative fairness score in the NIJ recidivism forecasting challenge. Crime Science, 10(1), 1-5.

Reversion in the tech stack and why DS models fail

Hackernews recently shared a story about not using an IDE, and I feel mostly the same way. Hence the title of the post – my current workflow, when I steal some time to work on my R package ptools, looks like this, using Rterm from the shell:

I don’t have anything against RStudio, I just only have so much room in my brain. Sometimes conversations at work are like a foreign language, “How are we going to test the NiFi script from Hadoop to our Kubernetes environment” or “I can pull the docker image from JFrog, but I run out of room when extracting the image on our sandbox machine. But df says we have plenty of room on all the partitions?”.

If you notice at the top the (base) in front of the shell, that is because this is within the anaconda shell as well. So if you look at many of my past blog posts (see here for one example), I am just using the snipping tool in windows to take screenshots of the shell output in interactive mode.

I am typically just writing the code in Notepad++ (as well as this blog post) – and it is quite simple to switch between interactively copying over a function/snippet and compiling entire scripts. Here is a screenshot of R unit tests for example.

So Notepad++ has some syntax highlighting (for both R and python), and that is nice, but honestly not that necessary. The main thing I use is the bracket matching to make sure they are balanced. I am sure I am missing out on some nice autocomplete features that would make me more productive, and function hints in Spyder are nice for pandas functions (I mostly still use google for that when I need it though).

I do use VS Code for development work on our headless virtual machines at work. But that is more to replicate essentially the Windows workflow of file explorer + Notepad++ + Shell (I am not a vim ninja – what is it, esc + wq:? I need to look that up every time). I fucked up one of my git repos the other day using VS Code's git tools, and I am just using git directly anymore. (Again this means I am the problem, not VS Code!)

Why do data science projects fail?

Some more random musings, but the more I get involved and see what projects work and what don't at work, pretty much all of the failures I have come across are due to what I will call "not modeling the right thing". That potentially covers a lot, but quite a few come down to simply not understanding counterfactual reasoning and selection bias.

The modeling part (in terms of actually fitting models) is typically quite easy – you do some simple but slightly theoretically informed feature engineering and feed that info into a machine learning model that is very flexible. But maybe that is the problem, people can easily fool themselves into thinking a model looks good, but because they are modeling the wrong thing it does not result in better decision making.

Even in most of the failures I have seen, selection bias is surmountable (often it just requires multiple models or models on different samples of data – reduced form for the win!). Learning how to train/test split the data and feed it into XGBoost only takes a few classes. Knowing the right thing to model though takes a bit more thought.

A secondary part of the failure is not learning how to translate the model outputs into actionable decisions. But not modeling the right thing happens at the start, so it makes any downstream decision not work out how you want.

Chunking it up in pandas

In the python pandas library, you can read a table (or a query) from a SQL database like this:

data = pandas.read_sql_table('tablename',db_connection)

Pandas also has an inbuilt option to return an iterator of chunks of the dataset, instead of the whole dataframe.

data_chunks = pandas.read_sql_table('tablename',db_connection,chunksize=2000)

I thought for a while this was somewhat worthless, as I assumed it still read the whole thing into memory. I have been using it though to pull new records, generate predictions, and then feed the predictions back into another table (easier to write back to a new DB in smaller chunks).

But I actually did some tests recently and I was wrong. When using chunksize, it does not read the whole table into memory, so it works like you would want. In particular, I was testing out sqlalchemy and execution_options=dict(stream_results=True) when creating the engine. That option matters for postgres – it is not needed though for the other DBs I have tried so far (I will show postgres and sqlite here; at work I have also tested teradata and sql server).

Here I use the python library memory_profiler and show the difference in memory usage for a database of crime incidents from Dallas PD. First in my script I load in the libraries I will be using, and then create my different engine connection strings:

''' 
Tests for memory and chunking
up dataframes
'''
import sqlalchemy
import pandas as pd
from datetime import datetime
from memory_profiler import profile

# Check out https://github.com/apwheele/Blog_Code/tree/master/Python/jupyter_reports
# For where this data came from
# sqlalchemy string for SQLlite
sqli = r'sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite'
lite_eng = sqlalchemy.create_engine(sqli)

# sqlalchemy string for postgres
pg = f'postgresql://id:pwd@localhost/dbname' #need to fill these in with your own info
pg_eng = sqlalchemy.create_engine(pg)
pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True))

Next, we can make a function to either read a table all at once, or to return an iterator that chunks the data up into a smaller number of rows at a time. We also decorate this function with the profile function to get some nice statistics later. Here I just do some simple stats and accumulate them over the iterations.

@profile
def read_table(sql_con, chnk=None):
    begin_time = datetime.now()
    inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    # If chunk, do the calcs in iterations
    if chnk:
        tot_inc, tot_arr = 0,0
        for c in inc:
            tot_inc += c.shape[0]
            tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    else:
        tot_inc = inc.shape[0]
        tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

Most of the time I am using something like this for predictions or for generating follow up metrics to see how well my model is doing (and rewriting those intermediate results to a table). So the inside of the iteration loop looks something like pred_val = predict_function(c) and pred_val.to_sql('new_table',sql_con,if_exists='append').
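Put together, a sketch of that scoring loop (predict_function and new_table are placeholders, the rest uses the objects defined above):

# Read new records in chunks, score them, and append results back to a table
data_chunks = pd.read_sql_table('incidents', pg_eng, chunksize=2000)
for c in data_chunks:
    pred_val = predict_function(c)   # hypothetical scoring function
    pred_val.to_sql('new_table', pg_eng, if_exists='append', index=False)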

Now just to finish off the script, you can run the tests if the script is called directly:

if __name__ == "__main__":
    read_table(sqli,None)       #sqlite big table all at once
    read_table(sqli,2000)       #sqlite in chunks
    read_table(pg_eng,None)     #pg big table
    read_table(pg_eng,2000)     #pg in chunks
    read_table(pg_eng_str,2000) #pg stream in chunks

Now when running this script, you would run it as python -m memory_profiler chunk_tests.py > test_results.txt, and this will give you a nice set of printouts for each function call. So let's start with sqlite reading the entire database into memory:

sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:22:41.652316 -- 2021-08-11 19:23:53.743702

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     82.4 MiB     82.4 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     82.4 MiB      0.0 MiB           1       begin_time = datetime.now()
    31   4260.7 MiB   4178.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33   4260.7 MiB      0.0 MiB           1       if chnk:
    34                                                 tot_inc, tot_arr = 0,0
    35                                                 for c in inc:
    36                                                     tot_inc += c.shape[0]
    37                                                     tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39   4260.7 MiB      0.0 MiB           1           tot_inc = inc.shape[0]
    40   4261.5 MiB      0.8 MiB           1           tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41   4261.5 MiB      0.0 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

You can see that the incident database is 785k rows (it has 100 columns as well). It takes only a little over a minute to read in the table, but it takes up over 4 gigabytes of memory. So I can fit this in memory on my machine, but if I, say, added a free text comments field I might not be able to fit it in memory on my personal machine.

Now lets see what these results look like when I chunk the data up into smaller sets of only 2000 rows at a time:

sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:23:56.475680 -- 2021-08-11 19:25:33.896884

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     91.6 MiB     91.6 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     91.6 MiB      0.0 MiB           1       begin_time = datetime.now()
    31     91.9 MiB      0.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33     91.9 MiB      0.0 MiB           1       if chnk:
    34     91.9 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35    117.3 MiB    -69.6 MiB         394           for c in inc:
    36    117.3 MiB    -89.6 MiB         393               tot_inc += c.shape[0]
    37    117.3 MiB    -89.6 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41    112.0 MiB     -5.3 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

We only end up maxing out the memory at under 120 megabytes. You can see the total function also only takes around 1.5 minutes. So it is only ~30 seconds longer to run this in chunks than doing it all at once.

Now onto postgres. I will forego showing the results when reading in the whole table at once – it is pretty much the same as sqlite. But here are the results for the chunks with the usual sqlalchemy connection string:

sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:26:28.454807 -- 2021-08-11 19:27:41.917612

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     51.3 MiB     51.3 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     51.3 MiB      0.0 MiB           1       begin_time = datetime.now()
    31   2017.2 MiB   1966.0 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33   2017.2 MiB      0.0 MiB           1       if chnk:
    34   2017.2 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35   2056.8 MiB  -1927.3 MiB         394           for c in inc:
    36   2056.8 MiB      0.0 MiB         393               tot_inc += c.shape[0]
    37   2056.8 MiB      0.0 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41     89.9 MiB  -1966.9 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

So here we get some savings, but not nearly as good as sqlite – it ends up reading in around 2 gigs for the iterator. But no fear, there is a specific option when setting up the sqlalchemy string for postgres. If you pay close attention above to where I build the connection strings, the engine for this trial is pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True)). And when I use the streaming results engine, here is the memory profile I get:

sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:27:41.922613 -- 2021-08-11 19:28:54.435834

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28     88.8 MiB     88.8 MiB           1   @profile
    29                                         def read_table(sql_con, chnk=None):
    30     88.8 MiB      0.0 MiB           1       begin_time = datetime.now()
    31     89.6 MiB      0.8 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
    32                                             # If chunk, do the calcs in iterations
    33     89.6 MiB      0.0 MiB           1       if chnk:
    34     89.6 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
    35    100.5 MiB   -322.6 MiB         394           for c in inc:
    36    100.5 MiB   -325.4 MiB         393               tot_inc += c.shape[0]
    37    100.5 MiB   -325.4 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
    38                                             else:
    39                                                 tot_inc = inc.shape[0]
    40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
    41     92.3 MiB     -8.2 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')

Which is very similar to the sqlite results.

Some workflows unfortunately are not so reducible into tiny chunks like this. E.g. say you were doing multiple time series for different groups – you would likely need to use substitution into the SQL to grab one group at a time. I have done this to grab chunks of claims data one month at a time, but in many of my workflows I could just replace that with chunksize and do one read.
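A rough sketch of that substitution approach (table and column names are hypothetical, and the parameter style depends on your DB driver):

# Loop over groups and parameterize the query, instead of relying on chunksize
months = pd.read_sql('SELECT DISTINCT bill_month FROM claims', pg_eng)['bill_month']
for m in months:
    sub = pd.read_sql('SELECT * FROM claims WHERE bill_month = %(m)s',
                      pg_eng, params={'m': m})
    # fit/score the per-month models on sub here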

Some folks reading this may be thinking "this is a great use case for hadoop". For the most part I am personally not working with that big of data, just data that is mildly annoying at the edge of fitting into memory. So this just-do-stuff-in-chunks approach works great for most of my workflows, and I don't have to deal with the hassles of hadoop/spark.

My team is hiring

My company, HMS, is currently hiring in my data science team:

These positions can be remote, and HMS will sponsor visas as well.

Data scientist is a really broad brush. Typically what I am looking for is prior experience in data management (e.g. you should know some SQL), and the ability to explain the models you are using in plain English to business partners. You do not need X prior years of data science experience — on our team we have hired folks who were business analysts with coding/machine learning experience, people pretty much straight from their PhD, and I was a professor of criminology.

The director position is a bit different, but even there, if you are say a project manager in data science and want to move up to a director role, are a senior/principal data scientist with prior supervisory experience, or have directed a different data related role (and have more than a passing knowledge of machine learning), those experiences would all be considered seriously for the role.

If you want to get a flavor for the types of projects we are working on at HMS (these are focused on health insurance claims and payment integrity, so no population health nor subrogation examples, but they will be OK to get a high level view), check out these blog posts of mine:

If you have questions always feel free to email!

Solving the P-Median model

Wendy Jang, my former student at UTD and now Data Scientist for the Stanislaus County Sheriff’s Dept. writes in:

Hello Andy,

I have some questions about your posting in GitHub. Honestly, I am not good at Python at all. I was playing with your python code and trying to understand the work flow. Then, I can pretty much mimic what you did; however, I encountered some errors along the way.

I was able to follow through lines but then I am stuck on PMed function. Do you know what possibly caused this error? See the screenshot below. Please advise. Thanks!

Specifically, Wendy is asking about my patrol redistricting example with workload inequality constraints. Wendy's problem here is likely that she does not have CPLEX installed. CPLEX is free for academics, but unfortunately costs a bit of money for anyone else (not sure if they have cheaper licensing for the public sector; it looks like currently about $200 a month).

This was a good opportunity to update some of my code, so now see the two .py files in the DataCreated folder, in particular 01_pmed_class.py has a nice class function. I have additionally added in functionality to create a map, and more importantly eliminate subtours (my solution can potentially return disconnected areas). I have also added the ability to use different solvers.

For this problem CPLEX just works much better (takes around 10 minutes), but I was able to get the SCIP solver to return the same solution in around 5 hours. GLPK did not return a solution even after letting it churn for over 12 hours. I tested CBC for shorter time periods, but that appears to be a no-go as well.

So just a brief overview, the libraries I will be using (check out the end of the post for setting up the conda environment):

import pickle
import pulp
import networkx
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import geopandas as gpd

class pmed():
    """
    Ar - list of areas in model
    Di - distance dictionary organized Di[a1][a2] gives the distance between areas a1 and a2
    Co - contiguity dictionary organized Co[a1] gives all the contiguous neighbors of a1 in a list
    Ca - dictionary of total number of calls, Ca[a1] gives calls in area a1
    Ta - integer number of areas to create
    In - float inequality constraint
    Th - float distance threshold to make a decision variables
    """
    def __init__(self,Ar,Di,Co,Ca,Ta,In,Th):

I do not copy-paste the entire pmed class in the blog post. It is long, but it is a rewrite of my model from the paper. Next, I load in the objects I need to fit the model (which were created/pickled in the 00_PrepareData.py file).

# Loading in data
data_loc = r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataCreated\fin_obj.pkl'
areas, dist_dict, cont_dict, call_dict, nid_invmap, MergeData = pickle.load(open(data_loc,"rb"))

# shapefile for reporting areas
carr_report = gpd.read_file(r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataOrig\ReportingAreas.shp')

And now we can create a pmed object, which has all the info necessary, and gives us a print out of the total number of decision variables and constraints.

# Creating pmed object
pmed12 = pmed(Ar=areas,Di=dist_dict,Co=cont_dict,Ca=call_dict,Ta=12,In=0.1,Th=10)

Writing the class this way allows the user to input their own solver, and it prints out the available solvers on your current machine. So to pass in the SCIP solver you could do pmed12.solve(solver=pulp.SCIP_CMD()), which again for this problem does find the solution, but takes 5 hours. So here I still stick with CPLEX just to illustrate the new class's functionality:

pmed12.solve(solver=pulp.CPLEX(timeLimit=30*60,msg=True))     # takes around 10 minutes

This prints out a ton of information at the command line that you do not get if you run from within Jupyter notebooks, so for debugging it is much more useful. timeLimit is in seconds, but I believe it does not include the presolve phase, so some of these solvers can just get stuck on the presolve forever.
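As an aside, if you are unsure which solvers PuLP can actually see on your machine, recent PuLP versions have a helper for that (a quick sketch):

import pulp

# Lists the solver backends PuLP detects as installed and usable
print(pulp.listSolvers(onlyAvailable=True))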

We can look at the results in a nice map if you do have geopandas installed and pass it a geopandas data frame and the key to match it on:

pmed12.map_plot(carr_report, 'PDGrid')

This map shows the new areas in different colors with a thick border, and the source area as a dot with a white outline. So here you can see that we end up having one disconnected area – a subtour in optimization parlance. I have added some methods to deal with this though:

stres = pmed12.collect_subtours()

And you can see that for one of our areas it identified a subtour. I haven't written this to automatically resolve subtours due to the potential length of the solving time. So here we can see the subtours have 0 calls in them. You can assign those areas to wherever and it does not change the objective function. That is what I did in the paper, and I showed how in the Jupyter notebook at the main page for the github project.

So if you are using a solver that takes 5 hours, I would suggest you manually fix these subtours if there is an easy solution that is obvious to the human eye. But here with the shorter solve time with CPLEX I can rerun the algorithm with a warm start, and it runs even faster (under 5 minutes). The .collect_subtours() method adds subtour constraints into the pulp model, so you just need to redo the solve() method to eliminate that particular subtour (I do not know if this is guaranteed to converge for all problems though).

So here in this data eliminating that one subtour results in a solution with no subtours:

pmed12.solve(solver=pulp.CPLEX_CMD(msg=False,warmStart=True))
stres = pmed12.collect_subtours()

And we can replot the result, which shows it chooses the areas that you would have assigned by eye anyway:

pmed12.map_plot(carr_report, 'PDGrid')

So again for this problem if you have CPLEX I would recommend it (I have not tried Gurobi). But at least for this particular dataset SCIP was able to solve the problem in 5 hours. So if you are a crime analyst or someone else without academic access to CPLEX, you can install the SCIP solver and give that a go with your actual data.

Note that this is based off of results for 325 subareas. So if you have more subareas it will take longer (and if you have fewer it may be quite a bit shorter).

Setting up the Conda environment

So once you have python installed, you can typically do something like:

pip install pulp

Or via conda:

conda install pyscipopt

I often have trouble though, especially when working with the python geospatial libraries, installing geopandas, fiona, etc. So here what I do is create a new conda environment:

conda create -n linprog
conda activate linprog
conda install -c conda-forge python=3 pip pandas numpy networkx scikit-learn dbfread geopandas glpk pyscipopt pulp

And then you can run the pmedian code above in this environment. I suppose I should turn this into a python package, and I see a bunch of folks anymore are doing docker images as well with their packages in complicated environments. This is actually not that bad, minus geopandas making things a bit tricky.

Variance of leaderboard metrics for competitions

In doing a post mortem on our results for the NIJ recidivism challenge, first I calculated the extent to which our predictions would have done better if we had not biased them to meet the fairness challenge. In the end, for Round 1 our team would have been in 3rd or 4th place in the small team rankings if we went with the unbiased predictions. It ended up only increasing our Brier score by around 0.001-0.002 for each though. (So I am glad we biased with a chance to win the fairness competition in the end.)

The leaderboards are so tight across the competition that you often need to go to the fourth decimal to determine the rankings. Here are the rankings for Round 1 Brier Scores for the small team category:

Ultimately these metrics used to determine the rankings are themselves statistics measured with error. So here I did a simulation to see the extent of that error.

These are not exactly the models we ended up using, but they are very close (they only performed slightly worse than the ones we went with). Here I will show an example in python comparing rankings between a logit regression with L1 penalties vs a light boosted model. So for some upfront, here are the python libraries I will be using, and I download the data directly:

import numpy as np
import pandas as pd
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

full_data = pd.read_csv('https://data.ojp.usdoj.gov/api/views/ynf5-u8nk/rows.csv?accessType=DOWNLOAD',index_col='ID')

The next part is just encoding the data. I am doing this for R1, so only using a certain set of information.

# Numeric Impute
num_imp = ['Gang_Affiliated','Supervision_Risk_Score_First',
           'Prior_Arrest_Episodes_DVCharges','Prior_Arrest_Episodes_GunCharges',
           'Prior_Conviction_Episodes_Viol','Prior_Conviction_Episodes_PPViolationCharges',
           'Residence_PUMA']

# Ordinal Encode (just keep puma as is)
ord_enc = {}
ord_enc['Gender'] = {'M':1, 'F':0}
ord_enc['Race'] = {'WHITE':0, 'BLACK':1}
ord_enc['Age_at_Release'] = {'18-22':6,'23-27':5,'28-32':4,
                  '33-37':3,'38-42':2,'43-47':1,
                  '48 or older':0}
ord_enc['Supervision_Level_First'] = {'Standard':0,'High':1,
                         'Specialized':2,'NA':-1}
ord_enc['Education_Level'] = {'Less than HS diploma':0,
                              'High School Diploma':1,
                              'At least some college':2,
                              'NA':-1}
ord_enc['Prison_Offense'] = {'NA':-1,'Drug':0,'Other':1,
                             'Property':2,'Violent/Non-Sex':3,
                             'Violent/Sex':4}
ord_enc['Prison_Years'] = {'Less than 1 year':0,'1-2 years':1,
                           'Greater than 2 to 3 years':2,'More than 3 years':3}

# _more clip 
more_clip = ['Dependents','Prior_Arrest_Episodes_Felony','Prior_Arrest_Episodes_Misd',
             'Prior_Arrest_Episodes_Violent','Prior_Arrest_Episodes_Property',
             'Prior_Arrest_Episodes_Drug',
             'Prior_Arrest_Episodes_PPViolationCharges',
             'Prior_Conviction_Episodes_Felony','Prior_Conviction_Episodes_Misd',
             'Prior_Conviction_Episodes_Prop','Prior_Conviction_Episodes_Drug']

# Function to prep data as I want, label encode
# And missing imputation
def prep_data(data,ext_vars=['Recidivism_Arrest_Year1','Training_Sample']):
    cop_dat = data.copy()
    # Numeric impute
    for n in num_imp:
        cop_dat[n] = data[n].fillna(-1).astype(int)
    # Ordinal Recodes
    for o in ord_enc.keys():
        cop_dat[o] = data[o].fillna('NA').replace(ord_enc[o]).astype(int)
    # _more clip
    for m in more_clip:
        cop_dat[m] = data[m].str.split(' ',n=1,expand=True)[0].astype(int)
    # Only keeping variables of interest
    kv = ext_vars + num_imp + list(ord_enc.keys()) + more_clip
    return cop_dat[kv].astype(int)

pdata = prep_data(full_data)

I did smart ordinal encoding, minus the missing data. So logit models are not super crazy with this data, although dummy variables + imputation are likely a better approach (I am just being lazy here). But those issues should not matter for the tree based boosted models. Here I estimate models using the original train/test split chosen by NIJ:

y_var = 'Recidivism_Arrest_Year1'
x_vars = list(pdata)
x_vars.remove(y_var)
x_vars.remove('Training_Sample')

cat_vars = list( set(x_vars) - set(more_clip) )

l1 = LogisticRegression(penalty='l1', solver='liblinear')
lb = LGBMClassifier(silent=True)

# Original train/test split
train = pdata[pdata['Training_Sample'] == 1].copy()
test = pdata[pdata['Training_Sample'] == 0].copy()

# Fit models, and then eval on out of sample
l1.fit(train[x_vars],train[y_var])
lb.fit(train[x_vars],train[y_var],feature_name=x_vars,categorical_feature=cat_vars)

l1pp = l1.predict_proba(test[x_vars])[:,1]
lbpp = lb.predict_proba(test[x_vars])[:,1]

And then we can see how our two models do in this scenario according to the AUC or the Brier score statistic.

# ROC for the models
aucl1 = roc_auc_score(test[y_var],l1pp)
auclb = roc_auc_score(test[y_var],lbpp)
print(f'AUC L1 {aucl1}, AUC LightBoosted {auclb}')

# Brier score for models
bsl1 = brier_score_loss(test[y_var],l1pp)
bslb = brier_score_loss(test[y_var],lbpp)
print(f'Brier L1 {bsl1}, Brier LightBoosted {bslb}')

So you can see that the L1 model wins over the light boosted model (despite the wonky encoding of the missing data) for both the AUC (+0.002) and the Brier Score (+0.001). (Note this is for the pooled sample of both males/females.)

But is this just luck of the draw for the particular train/test dataset? That is, if we chose another train/test split but fit the same models, would the light boosted model win some of the time? Here I do that, keeping the approximately 70% train/test split but making it random, and then estimate the test set Brier/AUC.

res = [] #list to stuff results into

for i in range(1000):
    print(f'Round {i}')
    rand_train = binom.rvs(1,0.7,size=pdata.shape[0])
    train = pdata[rand_train == 1].copy()
    test = pdata[rand_train == 0].copy()
    l1.fit(train[x_vars],train[y_var])
    lb.fit(train[x_vars],train[y_var],feature_name=x_vars,categorical_feature=cat_vars)
    l1pp = l1.predict_proba(test[x_vars])[:,1]
    lbpp = lb.predict_proba(test[x_vars])[:,1]
    aucl1 = roc_auc_score(test[y_var],l1pp)
    auclb = roc_auc_score(test[y_var],lbpp)
    bsl1 = brier_score_loss(test[y_var],l1pp)
    bslb = brier_score_loss(test[y_var],lbpp)
    loc_tup = (i,aucl1,auclb,bsl1,bslb)
    res.append(loc_tup)

fin_data = pd.DataFrame(res,columns=['Iter','AUCL1','AUCLB','BSL1','BSLB'])

fin_data.describe().T
# L1 wins for Brier score
(fin_data['BSL1'] < fin_data['BSLB']).mean()
# L1 wins for AUC
(fin_data['AUCL1'] > fin_data['AUCLB']).mean()

So you can see that the standard deviation for the AUC is around 0.005, and for the Brier score it is around 0.002. Also based on the means/min/max we can see that these two models have quite a bit of overlap in their distributions.

But the results are correlated – when L1 tends to do worse, the light boosted model also does worse. So when we look at the rankings, in this scenario L1 wins the majority of the time (but not always). This suggests to me that it was a good thing NIJ did not use AUC to judge; Brier scores seem much less volatile than AUC in this sample.

We can check out the correlations between the scores. AUC only has a correlation of around 0.8, whereas Brier has a correlation of 0.9. (If the correlations were 1, the train/test split wouldn't matter – the same person would always win in the rankings.)

# Results tend to be fairly correlated
fin_data.corr()
fin_data.cov()

But despite these models having a clear winner in this example, the margins between these two models are larger than the margins in the typical leaderboards. So I did a simulation using the observed leaderboard Brier scores for males for R1 as the means, and used the variance/covariance estimates above to make random draws.

This shows us, given the four observed leaderboard metrics, and my guesstimates for the typical error, how often will the leaders flip. Tighter scores and larger variances mean more flips.

# Simulation to see how often rankings flip
mu = np.array([0.1916, 0.1919, 0.1920, 0.1922])
tv = len(mu)
sd = 0.002 # sd and corr based on my earlier simulation
cor = 0.9
var = sd**2
cov = cor*(sd**2)

# filling the var/covariance matrix
r = np.ones((tv,tv)) * cov
np.fill_diagonal(r, var)

# Generating random multivariate normal
np.random.seed(10)
y = np.random.multivariate_normal(mu, r, size=1000)
y_ranks = y.argsort(axis=1)

# Making a nicer long dataset to see how often ranks swapped
persons = ['IdleSpeculation','SRLLC','Sevigny','TeamSmith']
y_rankdf = pd.DataFrame(y_ranks,columns=persons)
longy = y_rankdf.melt()

# How often are the ranks switched?
pd.crosstab(longy['variable'],longy['value'], normalize='columns')

How to read this table: in the observed data for small team Males Round 1, IdleSpeculation was ranked #1 with a Brier Score of 0.1916. My simulations show that, based on those prior estimates, IdleSpeculation takes the top prize most often (column 0), but only 43% of the time. You can see that even the bottom score here by TeamSmith takes #1 in 10% of the simulations.

This shows that there is some signal in the leaderboard – if it were totally random, each of the ranks would have ~25% in each outcome. But it is clearly far from certain either. This also only considers the people on the leaderboard whose results I know; it could easily be that someone ranked 5, 6, or 7 swapped into the #1 spot as well.

What can we learn from this? One, the leaderboard results are unlikely to signify substantively improved models between different competitors. Clearly IdleSpeculation did well across the entire competition, but it would be hard to argue they were clearly better than everyone else (e.g. IdleSpeculation's #3 rank for females in round 1 I suspect is just as likely due to bad luck as to their model being substantively worse than TeamKlus or TeamSherill).

Two, I think it would be better for competitions like this for people to submit algorithms, and then the algorithms can be judged on various train/test splits (or a grid search cross-validation, or whatever). Using a single train/test split like this will always come with some noise in the rankings.

This also solves the issue of transparency. Currently NIJ is simply asking us to submit a paper saying how we got the results. It would be more transparent to force people to submit code that replicates the results 100% (as well as prevent any obvious cheating).