# Generating multiple solutions for linear programs

I had a question the other day on my TURF analysis about generating not just the single best solution for a linear programming problem, but multiple solutions. This is one arena I like the way people typically do genetic algorithms, in that they have a leaderboard with multiple top solutions. Which is nice for exploratory data analysis.

This example though it is pretty easy to amend the linear program to return exact solutions by generating lazy constraints. (And because the program is fast, you might as well do it this way as well to get an exact answer.) The way it works here, we have decision variables for products selected. You run the linear program, and then collect the k products selected, then add a constraint to the model in the form of:

``Sum(Product_k) <= k-1 ``

Then rerun the model with this constraint, get the solution, and then repeat. This constraint makes it so the group of decision variables selected cannot be replicated (and in the forbidden set). Python’s pulp makes this quite easy, and here is a function to estimate the TURF model I showed in that prior blog post, but generate this multiple leaderboard:

``````import pandas as pd
import pulp

# Function to get solution from turf model
def get_solution(prod_dec,prod_ind):
'''
prod_dec - decision variables for products
prod_ind - indices for products
'''
picked = []
for j in prod_ind:
if prod_dec[j].varValue >= 0.99:
picked.append(j)
return picked

# Larger function to set up turf model
# And generate multiple solutions
def turf(data,k,solutions):
'''
data - dataframe with 0/1 selected items
k - integer for products selected
solutions - integer for number of potential solutions to return
'''
# Indices for Dec variables
Cust = data.index
Prod = list(data)
# Problem and Decision Variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cu = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust], lowBound=0, upBound=1, cat=pulp.LpInteger)
Pr = pulp.LpVariable.dicts("Products Selected", [j for j in Prod], lowBound=0, upBound=1, cat=pulp.LpInteger)
# Objective Function (assumes weight = 1 for all rows)
P += pulp.lpSum(Cu[i] for i in Cust)
# Reached Constraint
for i in Cust:
#Get the products selected
p_sel = data.loc[i] == 1
sel_prod = list(p_sel.index[p_sel])
#Set the constraint
P += Cu[i] <= pulp.lpSum(Pr[j] for j in sel_prod)
# Total number of products selected constraint
P += pulp.lpSum(Pr[j] for j in Prod) == k
# Solve the equation multiple times, add constraints
res = []
km1 = k - 1
for _ in range(solutions):
P.solve()
pi = get_solution(Pr,Prod)
# Lazy constraint
P += pulp.lpSum(Pr[j] for j in pi) <= km1
pi.append(pulp.value(P.objective))
res.append(pi)
# Turn results into nice dataframe
cols = ['Prod' + str(i+1) for i in range(k)]
res_df = pd.DataFrame(res, columns=cols+['Reach'])
res_df['Reach'] = res_df['Reach'].astype(int)
return res_df``````

And now we can simply read in our data, turn it into binary 0/1, and then generate the multiple solutions.

``````#This is simulated data from XLStat
#https://help.xlstat.com/s/article/turf-analysis-in-excel-tutorial?language=en_US

#Replacing the likert scale data with 0/1
prod_bin = 1*(prod_data > 3)

# Get Top 10 solutions
top10 = turf(data=prod_bin,k=5,solutions=10)
print(top10)``````

Which generates the results:

``````>>> print(top10)
Prod1       Prod2       Prod3       Prod4       Prod5  Reach
0   Product 4  Product 10  Product 15  Product 21  Product 25    129
1   Product 3  Product 15  Product 16  Product 25  Product 26    129
2   Product 4  Product 14  Product 15  Product 16  Product 26    129
3  Product 14  Product 15  Product 16  Product 23  Product 26    129
4   Product 3  Product 15  Product 21  Product 25  Product 26    129
5  Product 10  Product 15  Product 21  Product 23  Product 25    129
6   Product 4  Product 10  Product 15  Product 16  Product 27    129
7  Product 10  Product 15  Product 16  Product 23  Product 27    129
8   Product 3  Product 10  Product 15  Product 21  Product 25    128
9   Product 4  Product 10  Product 15  Product 21  Product 27    128``````

I haven’t checked too closely, but this looks to me offhand like the same top 8 solutions (with a reach of 129) as the XLStat website. The last two rows are different, likely because there are further ties.

For TURF analysis, you may actually consider a lazy constraint in the form of:

``Product_k = 0 for each k``

This is more restrictive, but if you have a very large pool of products, will ensure that each set of products selected are entirely different (so a unique set of 5 products for every row). Or you may consider a constraint in terms of customers reached instead of on products, so if solution 1 reaches 129, you filter those rows out and only count newly reached people in subsequent solutions. This ensures each row of the leaderboard above reaches different people than the prior rows.

# Extending sklearns OrdinalEncoder

I’ve used a variant of this for a few different projects, so figured it was worth sharing. Sklearn’s OrdinalEncoder is close, but not quite what I want for a few different scenarios. Those are:

• mixed input data types
• missing data support (which can vary across the mixed input types)
• the ability to limit encoding of rare categories (useful for regression models)

So I have scripted up a simple new class, what I call `SimpleOrdEnc()`, and can share it here in the blog post.

``````from sklearn.preprocessing import OrdinalEncoder
import numpy as np
import pandas as pd

class SimpleOrdEnc():
def __init__(self, dtype=int, unknown_value=-1, lim_k=None,
lim_count=None):
self.unknown_value = unknown_value
self.dtype = dtype
self.lim_k = lim_k
self.lim_count = lim_count
self.vars = None
self.soe = None
def fit(self, X):
self.vars = list(X)
# Now creating fit for each variable
res_oe = {}
for v in list(X):
res_oe[v] = OrdinalEncoder(dtype=self.dtype,
handle_unknown='use_encoded_value',
unknown_value=self.unknown_value)
# Get unique values minus missing
xc = X[v].value_counts().reset_index()
# If lim_k, only taking top K value
if self.lim_k:
top_k = self.lim_k - 1
un_vals = xc.loc[0:top_k,:]
# If count, using that to filter
elif self.lim_count:
un_vals = xc[xc[v] >= self.lim_count].copy()
# If neither
else:
un_vals = xc
# Now fitting the encoder for one variable
res_oe[v].fit( un_vals[['index']] )
# Appending back to the big class
self.soe = res_oe
# Defining transform/inverse_transform classes
def transform(self, X):
xcop = X[self.vars].copy()
for v in self.vars:
xcop[v] = self.soe[v].transform( X[[v]].fillna(self.unknown_value) )
return xcop
def inverse_transform(self, X):
xcop = X[self.vars].copy()
for v in self.vars:
xcop[v] = self.soe[v].inverse_transform( X[[v]].fillna(self.unknown_value) )
return xcop``````

This works mostly the same way that other sklearn objects do. You instantiate the object, then call fit, transform, inverse_transform, etc. Under the hood it just turns the data into a collection of ordinal encoders, but does a few things. One is that it strips missing values from the original fit data – so out of the box you do not need to do anything like `x.fillna(-1)`, it just works. It is on you though to choose a missing unknown value that does not collide with potential encoded or decoded values. (Also for fit it should be a pandas dataframe, weird stuff will happen if you pass other numpy objects or lists.)

The second are the `lim_k` or the `lim_count` arguments. This is useful if you want to encode rare categories as another value. lim_k is if you want to say keep the top 20 categories in the fitted dataset. lim_count sets the threshold at how many cases are in the data, e.g. if you have at least 100 keep it. lim_k takes precedence over lim_count, so if you specify both lim_count is ignored.

These args also confound the missing values, so missing (even if it is common in the data), gets assiged to the ‘other’ category in this encoding. If that is not the behavior you want, I don’t see any way around not explicitly using `fillna()` before all this.

So here is a simple example use case:

``````x1 = [1,2,3]
x2 = ['a','b','c']
x3 = ['z','z',None]
x4 = [4,np.nan,5]

x = pd.DataFrame(zip(x1,x2,x3,x4),columns=['x1','x2','x3','x4'])
print(x)

oe = SimpleOrdEnc()
oe.fit(x)

# Transform to the same data
tx = oe.transform(x)
print(tx)

# Inverse transform gives you None
ix = oe.inverse_transform(tx)
print(ix)`````` So you can see this handles missing input data, but the inverse transform always returns `None` values for missing. The fit method though returns numeric encoded columns with the same variable names. I default to missing values of -1 as light boost (and I think catboost as well), have those as the default missing data values for categorical data.

Here is an example limiting the output to only categories that have at least 10 observations, and setting the missing data value to 99 instead of -1.

``````size = 1000
x1 = np.random.choice(['a','b','c'], size).tolist()
x2 = np.random.choice([1, np.nan, 2], size, p=[0.8,0.18,0.02]).tolist()
x3 = np.random.choice(['z','y','x','w',np.nan], size, p=[0.8,0.15,0.04, 0.005, 0.005]).tolist()
x = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe = SimpleOrdEnc(lim_count=10, unknown_value=99)
oe.fit(x)

# Checking with a simpler data frame

x1 = ['a','b','c','d',None,]
x2 = [1,2,-1,4,np.nan]
x3 = ['z','y','x','w','v']
sx = pd.DataFrame(zip(x1,x2,x3),columns=['x1','x2','x3'])

oe.transform(sx)`````` Because the class is all wrapped up in one object, you can then use pickle to save the object/use later in pipelines. If I know I want to test out specific regression models with my data, I often use the lim_count to make sure I am not keeping a bunch of small dangles of high cardinality data. (Often in my data I use missing data is rare enough I don’t even want to worry about imputing, I rather just treat it as a rare category.)

One use case this does not work out so well though for is ICD codes in wide format. Will need to write another blog post about that. But often will just reshape wide to long to fit these encoders.

# Using Jupyter Notebooks to make nice README’s for GitHub

Working on my R package ptools, the devtools folks have you make a readme R markdown file to compile to a nice readme markdown file for github. I thought to myself that you could functionally do the same thing with juypter notebooks for python. So here is a quick example of that for my retenmod python package.

So first, here is the old readme in its entirety, rendered on Github: You can see I have an example code snippet, but it does not actually output the results. Here is the update, where I use jupyter to render the markdown nicely: So we have nice syntax highlighting even. (Note that the pip install code is not run in a `%sh` cell, it is just code formatting inside of a markdown cell.) I do like the way R markdown renders the output a bit nicer than here (I also haven’t tried with pretty pandas tables). But you can also include matplotlib images as well: You might typically want to add in `README.ipynb` to your gitignore file, but here I included it in the github package so you can see what this notebook looks like. To compile the notebook to markdown is quite simple:

``jupyter nbconvert --execute --to markdown README.ipynb``

If you have matplotlib images, it saves them in a folder named `README_files` (not sure if there is a flag to change this option). To get the images to render you then also need to push that image folder into Github.

# Reversion in the tech stack and why DS models fail

Hackernews recently shared a story about not using an IDE, and I feel mostly the same way. Hence the title in the post – so my current workflow for when I steal some time to work on my R package ptools my workflow looks like this, using `Rterm` from the shell: I don’t have anything against RStudio, I just only have so much room in my brain. Sometimes conversations at work are like a foreign language, “How are we going to test the NiFi script from Hadoop to our Kubernetes environment” or “I can pull the docker image from JFrog, but I run out of room when extracting the image on our sandbox machine. But `df` says we have plenty of room on all the partitions?”.

If you notice at the top the `(base)` in front of the shell, that is because this is within the anaconda shell as well. So if you look at many of my past blog posts (see here for one example), I am just using the snipping tool in windows to take screenshots of the shell output in interactive mode.

I am typically just writing the code in Notepad++ (as well as this blog post) – and it is quite simple to switch between interactive copy this function/code and compiling entire scripts. Here is a screenshot of R unit tests for example. So Notepad++ has some text highlight (for both R and python), and that is nice, but honestly not that necessary. Main thing I use is the selection of brackets to make sure they are balanced. I am sure I am missing out on some nice autocomplete features that would make me more productive, and function hints in Spyder are nice for pandas functions (I mostly use google still though for that when I need it).

I do use VS Code for development work on our headerless virtual machines at work. But that is more to replicate essentially the workflow on Windows with `file explorer + Notepad++ + Shell` (I am not a vim ninja – what is it `esc` + `wq:`, need to look that up everytime). I fucked up one of my git repos the other day using VS Code stuff tools, and I am just using `git` directly anymore. (Again this means I am the problem, not VS Code!)

# Why data science projects fail?

Some more random musings, but the more I get involved and see what projects work and what don’t at work, pretty much all of the failures I have come across are due to what I will call “not modeling the right thing”. That potentially covers a bit, but quite a few are simply not understanding counterfactual reasoning and selection bias.

The modeling part (in terms of actually fitting models) is typically quite easy – you do some simple but slightly theoretically informed feature engineering and feed that info into a machine learning model that is very flexible. But maybe that is the problem, people can easily fool themselves into thinking a model looks good, but because they are modeling the wrong thing it does not result in better decision making.

Even in most of the failures I have seen, selection bias is surmountable (often just requires multiple models or models on different samples of data – reduced form for the win!). So learning how to train/test split the data, and feed your data into XGboost only takes a few classes to learn. How to know the right thing to model though takes a bit more thought.

A secondary part of the failure is not learning how to translate the model outputs into actionable decisions. But the not modeling the right thing is at the start, so makes any downstream decision not work out how you want.

# Chunking it up in pandas

In the python pandas library, you can read a table (or a query) from a SQL database like this:

``data = pandas.read_sql_table('tablename',db_connection)``

Pandas also has an inbuilt function to return an iterator of chunks of the dataset, instead of the whole dataframe.

``data_chunks = pandas.read_sql_table('tablename',db_connection,chunksize=2000)``

I thought for awhile this was somewhat worthless, as I thought it still read the whole thing into memory. I have been using it though to pull new records, generate predictions, and then feed the predictions back into another table though (easier to write back to a new DB in smaller chunks).

But I actually did some tests recently and I was wrong. When using `chunksize`, it does not read the whole table into memory, so works like you would want. In particular, I was testing out sqlalchemy and `execution_options=dict(stream_results=True)` when creating the engine. Which works great for postgres – it is not needed though for every other DB I have tried so far (I will show postgres and sqlite here, at work have also tested teradata and sql server).

Here I use the python library memory_profiler and show the difference in memory usage for a database of crime incidents from Dallas PD. First in my script I load in the libraries I will be using, and then created my different engine connection strings

``````'''
Tests for memory and chunking
up dataframes
'''
import sqlalchemy
import pandas as pd
from datetime import datetime
from memory_profiler import profile

# Check out https://github.com/apwheele/Blog_Code/tree/master/Python/jupyter_reports
# For where this data came from
# sqlalchemy string for SQLlite
sqli = r'sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite'
lite_eng = sqlalchemy.create_engine(sqli)

# sqlalchemy string for postgres
pg = f'postgresql://id:pwd@localhost/dbname' #need to fill these in with your own info
pg_eng = sqlalchemy.create_engine(pg)
pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True))``````

Next, we can make a function to either read a table all at once, or to return an iterator that chunks the data up into a tinier number of rows. We also decorate this function with the profile function to get some nice statistics later. Here I just do some random stats and accumulate them in each iteration.

``````@profile
begin_time = datetime.now()
# If chunk, do the calcs in iterations
if chnk:
tot_inc, tot_arr = 0,0
for c in inc:
tot_inc += c.shape
tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
else:
tot_inc = inc.shape
tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')``````

Most of the time I am using something like this for predictions or generating follow up metrics to see how well my model is doing (and rewriting those intermediate results to a table). So inside the iteration loop looks something like `pred_val = predict_function(c)` and `pred_val.to_sql('new_table',sql_con,if_exists='append')`.

Now just to finish off the script, you can do run the tests if called directly:

``````if __name__ == "__main__":
read_table(sqli,None)       #sqlite big table all at once

Now when running this script, you would run it as `python -m memory_profiler chunk_tests.py > test_results.txt`, and this will give you a nice set of print outs for each function call. So lets start with sqlite reading the entire database into memory:

``````sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:22:41.652316 -- 2021-08-11 19:23:53.743702

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
28     82.4 MiB     82.4 MiB           1   @profile
30     82.4 MiB      0.0 MiB           1       begin_time = datetime.now()
31   4260.7 MiB   4178.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
32                                             # If chunk, do the calcs in iterations
33   4260.7 MiB      0.0 MiB           1       if chnk:
34                                                 tot_inc, tot_arr = 0,0
35                                                 for c in inc:
36                                                     tot_inc += c.shape
37                                                     tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
38                                             else:
39   4260.7 MiB      0.0 MiB           1           tot_inc = inc.shape
40   4261.5 MiB      0.8 MiB           1           tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
41   4261.5 MiB      0.0 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')``````

You can see that the incident database is 785k rows (it is 100 columns as well). It takes only alittle over a minute to read in the table, but it takes up over 4 gigabytes of memory. So I can fit this in memory on my machine, but if I say added a free text comments field I might not be able to fit this in memory on my personal machine.

Now lets see what these results look like when I chunk the data up into smaller sets of only 2000 rows at a time:

``````sql: sqlite:///D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports\DPD.sqlite
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:23:56.475680 -- 2021-08-11 19:25:33.896884

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
28     91.6 MiB     91.6 MiB           1   @profile
30     91.6 MiB      0.0 MiB           1       begin_time = datetime.now()
31     91.9 MiB      0.3 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
32                                             # If chunk, do the calcs in iterations
33     91.9 MiB      0.0 MiB           1       if chnk:
34     91.9 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
35    117.3 MiB    -69.6 MiB         394           for c in inc:
36    117.3 MiB    -89.6 MiB         393               tot_inc += c.shape
37    117.3 MiB    -89.6 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
38                                             else:
39                                                 tot_inc = inc.shape
40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
41    112.0 MiB     -5.3 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')``````

We only end up maxing out the memory at under 120 megabytes. You can see the total function also only takes around 1.5 minutes. So it is only ~30 seconds longer to run this result in chunks than doing it all at once.

Now onto postgres, I will forego showing the results when reading in the whole table at once – it is pretty much the same as sqlite when reading the whole table at once. But here are the results for the chunks with the usual sqlalchemy connection string:

``````sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:26:28.454807 -- 2021-08-11 19:27:41.917612

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
28     51.3 MiB     51.3 MiB           1   @profile
30     51.3 MiB      0.0 MiB           1       begin_time = datetime.now()
31   2017.2 MiB   1966.0 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
32                                             # If chunk, do the calcs in iterations
33   2017.2 MiB      0.0 MiB           1       if chnk:
34   2017.2 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
35   2056.8 MiB  -1927.3 MiB         394           for c in inc:
36   2056.8 MiB      0.0 MiB         393               tot_inc += c.shape
37   2056.8 MiB      0.0 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
38                                             else:
39                                                 tot_inc = inc.shape
40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
41     89.9 MiB  -1966.9 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')``````

So here we get some savings, but not nearly as good as sqlite – it ends up reading in around 2 gig for the iterator. But no fear, there is a specific option when setting up the sqlalchemy string for postgres. If you pay close attention to above where I build the connection strings, the engine for this trial is `pg_eng_str = sqlalchemy.create_engine(pg, execution_options=dict(stream_results=True))`. And when I use the streaming results engine, here is the memory profile I get:

``````sql: Engine(postgresql://postgres:***@localhost/postgres)
Total incidents 785064
Total arrests 89741
Runtime 2021-08-11 19:27:41.922613 -- 2021-08-11 19:28:54.435834

Filename: chunk_tests.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
28     88.8 MiB     88.8 MiB           1   @profile
30     88.8 MiB      0.0 MiB           1       begin_time = datetime.now()
31     89.6 MiB      0.8 MiB           1       inc = pd.read_sql_table('incidents',sql_con,chunksize=chnk)
32                                             # If chunk, do the calcs in iterations
33     89.6 MiB      0.0 MiB           1       if chnk:
34     89.6 MiB      0.0 MiB           1           tot_inc, tot_arr = 0,0
35    100.5 MiB   -322.6 MiB         394           for c in inc:
36    100.5 MiB   -325.4 MiB         393               tot_inc += c.shape
37    100.5 MiB   -325.4 MiB         393               tot_arr += (c['Offense_Status'] == 'Clear by Arrest').sum()
38                                             else:
39                                                 tot_inc = inc.shape
40                                                 tot_arr = (inc['Offense_Status'] == 'Clear by Arrest').sum()
41     92.3 MiB     -8.2 MiB           1       print(f'\nsql: {sql_con}\nTotal incidents {tot_inc}\nTotal arrests {tot_arr}\nRuntime {begin_time} -- {datetime.now()}\n')``````

Which is very similar to sqlite results.

Some workflows unfortunately are not so reducable into tiny chunks like this. E.g. say you were doing multiple time series for different groups, you would likely need to use substitution into the SQL to get one group at a time. I have done this to grab chunks of claims data one month at a time, but in many of my workflows I could just replace that with chunksize and do one read.

Some folks reading this may be thinking “this is a great use case for hadoop”. For the most part I am personally not working with that big of data, just mildly annoying at the edge of fitting into memory data. So this just do stuff in chunks works great for most of my workflows, and I don’t have to deal with the hassles of hadoop/spark.

# Solving the P-Median model

Wendy Jang, my former student at UTD and now Data Scientist for the Stanislaus County Sheriff’s Dept. writes in:

Hello Andy,

I have some questions about your posting in GitHub. Honestly, I am not good at Python at all. I was playing with your python code and trying to understand the work flow. Then, I can pretty much mimic what you did; however, I encountered some errors along the way.

I was able to follow through lines but then I am stuck on PMed function. Do you know what possibly caused this error? See the screenshot below. Please advise. Thanks!

Specifically, Wendy is asking about my patrol redistricting example with workload inequality constraints. Wendy’s problem here is specifically she likely does not have CPLEX installed. CPLEX is free for academics, but unfortunately costs a bit of money for anyone else (not sure if they have cheaper licensing for public sector, looks like currently about \$200 a month).

This was a good opportunity to update some of my code, so now see the two `.py` files in the DataCreated folder, in particular `01_pmed_class.py` has a nice class function. I have additionally added in functionality to create a map, and more importantly eliminate subtours (my solution can potentially return disconnected areas). I have also added the ability to use different solvers.

For this problem CPLEX just works much better (takes around 10 minutes), but I was able to get the SCIP solver to return the same solution in around 5 hours. GLPK did not return a solution even letting it churn out for over 12 hours. I tested CBC for shorter time periods, but that appears to be a no-go as well.

So just a brief overview, the libraries I will be using (check out the end of the post for setting up the conda environment):

``````import pickle
import pulp
import networkx
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import geopandas as gpd

class pmed():
"""
Ar - list of areas in model
Di - distance dictionary organized Di[a1][a2] gives the distance between areas a1 and a2
Co - contiguity dictionary organized Co[a1] gives all the contiguous neighbors of a1 in a list
Ca - dictionary of total number of calls, Ca[a1] gives calls in area a1
Ta - integer number of areas to create
In - float inequality constraint
Th - float distance threshold to make a decision variables
"""
def __init__(self,Ar,Di,Co,Ca,Ta,In,Th):``````

I do not copy-paste the entire `pmed` class in the blog post. It is long, but is the rewrite of my model from the paper. Next, to load in the objects I need to fit the model (which were created/pickled in the `00_PrepareData.py` file).

``````# Loading in data
data_loc = r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataCreated\fin_obj.pkl'
areas, dist_dict, cont_dict, call_dict, nid_invmap, MergeData = pickle.load(open(data_loc,"rb"))

# shapefile for reporting areas

And now we can create a `pmed` object, which has all the info necessary, and gives us a print out of the total number of decision variables and constraints.

``````# Creating pmed object
pmed12 = pmed(Ar=areas,Di=dist_dict,Co=cont_dict,Ca=call_dict,Ta=12,In=0.1,Th=10)`````` Writing the class this way allows the user to input their own solver, and you can see it prints out the available solvers on your current machine. So to pass in the SCIOPT solver you could do `pmed12.solve(solver=pulp.SCIP_CMD())`, which again for this problem does find the solution, but takes 5 hours. So here I still stick with CPLEX to just illustrate the new classes functionality:

``pmed12.solve(solver=pulp.CPLEX(timeLimit=30*60,msg=True))     # takes around 10 minutes``

This prints out a ton of information at the command line that you do not get if you run from within Jupyter notebooks. So for debugging that is much more useful. `timeLimit` is in seconds, but does not include the presolve phase I believe, so some of these solvers can just get stuck on the presolve forever.

We can look at the results in a nice map if you do have geopandas installed and pass it a geopandas data frame and the key to match it on:

``pmed12.map_plot(carr_report, 'PDGrid')`` This map shows the new areas and different colors with a thick border, and the source area as a dot with a white outline. So here you can see that we end up having one disconnected area – a subtour in optimization parlance. I have added some methods to deal with this though:

``stres = pmed12.collect_subtours()`` And you can see that for one of our areas it identified a subtour. I haven’t written this to automatically resolve for subtours due to the potential length of the solving time. So here we can see the subtours have 0 calls in them. You can assign those areas to wherever and it does not change the objective function. So that is what I did in the paper and showed how in the Jupyter notebook at the main page for the github project.

So if you are using a function that takes 5 hours, I would suggest you manually fix these subtours if there is an obvious to the human eye easy solution. But here with the shorter solve time with CPLEX I can rerun the algorithm with a warm start, and it runs even faster (under 5 minutes). The `.collect_subtours()` method adds subtour constraints into the pulp model, so you just need to redo the `solve()` method to eliminate that particular subtour (I do not know if this is guaranteed to converge though for all problems).

So here in this data eliminating that one subtour results in a solution with no subtours:

``````pmed12.solve(solver=pulp.CPLEX_CMD(msg=False,warmStart=True))
stres = pmed12.collect_subtours()`````` And we can replot the result, which shows it chooses the areas that you would have assigned by eye anyway:

``pmed12.map_plot(carr_report, 'PDGrid')`` So again for this problem if you have CPLEX I would recommend it (have not tried Gurobi). But at least for this particular dataset SCIOPT was able to solve the problem in 5 hours. So if you are a crime analyst or someone else without academic access to CPLEX you can install the sciopt solver and give that a go for your actual data.

Note that this is based off of results for 325 subareas. So if you have more subareas it will take a longer (and if you have fewer it may be quite a bit shorter).

# Setting up the Conda environment

So once you have python installed, you can typically do something like:

``pip install pulp``

Or via conda:

``conda install pyscipopt``

I often have trouble though, especially when working with the python geospatial libraries, to install geopandas, fiona, etc. So here what I do is create a new conda environment:

``````conda create -n linprog
conda activate linprog
conda install -c conda-forge python=3 pip pandas numpy networkx scikit-learn dbfread geopandas glpk pyscipopt pulp``````

And then you can run the pmedian code above in this environment. I suppose I should turn this into a python package, and I see a bunch of folks anymore are doing docker images as well with their packages in complicated environments. This is actually not that bad, minus geopandas makes things a bit tricky.

# Variance of leaderboard metrics for competitions

In doing a post mortem on our results for the NIJ recidivism challenge, first I calculated the extent to which our predictions would have done better if we did not bias our predictions to meet the fairness challenge. In the end, for Round 1 our team would have been in 3rd or 4th place for the small team rankings if we went with the unbiased predictions. It ended up being it only increased our Brier score by around ~0.001-0.002 though for each. (So I am glad we biased with a chance to win the fairness competition in the end.)

The leaderboards are so tight across the competition, often you need to go to the fourth decimal to determine the rankings. Here are the rankings for Round 1 Brier Scores for the small team: Ultimately these metrics used to determine the rankings are themselves statistics measured with error. So here I did a simulation to see the extent that these metrics had error.

These are not exactly the models we ended up using, but are very close (only performed slightly worse than the ones we ended up going with), but here I will show an example in python comparing rankings between a logit regression with L1 penalties vs a lightboosted model. So for some upfront on the python libraries I will be using, and I download the data directly:

``````import numpy as np
import pandas as pd
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

The next part is just encoding the data. I am doing this for R1, so only using a certain set of information.

``````# Numeric Impute
num_imp = ['Gang_Affiliated','Supervision_Risk_Score_First',
'Prior_Arrest_Episodes_DVCharges','Prior_Arrest_Episodes_GunCharges',
'Prior_Conviction_Episodes_Viol','Prior_Conviction_Episodes_PPViolationCharges',
'Residence_PUMA']

# Ordinal Encode (just keep puma as is)
ord_enc = {}
ord_enc['Gender'] = {'M':1, 'F':0}
ord_enc['Race'] = {'WHITE':0, 'BLACK':1}
ord_enc['Age_at_Release'] = {'18-22':6,'23-27':5,'28-32':4,
'33-37':3,'38-42':2,'43-47':1,
'48 or older':0}
ord_enc['Supervision_Level_First'] = {'Standard':0,'High':1,
'Specialized':2,'NA':-1}
ord_enc['Education_Level'] = {'Less than HS diploma':0,
'High School Diploma':1,
'At least some college':2,
'NA':-1}
ord_enc['Prison_Offense'] = {'NA':-1,'Drug':0,'Other':1,
'Property':2,'Violent/Non-Sex':3,
'Violent/Sex':4}
ord_enc['Prison_Years'] = {'Less than 1 year':0,'1-2 years':1,
'Greater than 2 to 3 years':2,'More than 3 years':3}

# _more clip
more_clip = ['Dependents','Prior_Arrest_Episodes_Felony','Prior_Arrest_Episodes_Misd',
'Prior_Arrest_Episodes_Violent','Prior_Arrest_Episodes_Property',
'Prior_Arrest_Episodes_Drug',
'Prior_Arrest_Episodes_PPViolationCharges',
'Prior_Conviction_Episodes_Felony','Prior_Conviction_Episodes_Misd',
'Prior_Conviction_Episodes_Prop','Prior_Conviction_Episodes_Drug']

# Function to prep data as I want, label encode
# And missing imputation
def prep_data(data,ext_vars=['Recidivism_Arrest_Year1','Training_Sample']):
cop_dat = data.copy()
# Numeric impute
for n in num_imp:
cop_dat[n] = data[n].fillna(-1).astype(int)
# Ordinal Recodes
for o in ord_enc.keys():
cop_dat[o] = data[o].fillna('NA').replace(ord_enc[o]).astype(int)
# _more clip
for m in more_clip:
cop_dat[m] = data[m].str.split(' ',n=1,expand=True).astype(int)
# Only keeping variables of interest
kv = ext_vars + num_imp + list(ord_enc.keys()) + more_clip
return cop_dat[kv].astype(int)

pdata = prep_data(full_data)``````

I did smart ordinal encoding, minus the missing data. So logit models are not super crazy with this data, although dummy variables + imputatation are likely a better approach (I am just being lazy here). But those should not be an issue for the tree based boosted models. Here I estimate models using the original train/test split chosen by NIJ:

``````y_var = 'Recidivism_Arrest_Year1'
x_vars = list(pdata)
x_vars.remove(y_var)
x_vars.remove('Training_Sample')

cat_vars = list( set(x_vars) - set(more_clip) )

l1 = LogisticRegression(penalty='l1', solver='liblinear')
lb = LGBMClassifier(silent=True)

# Original train/test split
train = pdata[pdata['Training_Sample'] == 1].copy()
test = pdata[pdata['Training_Sample'] == 0].copy()

# Fit models, and then eval on out of sample
l1.fit(train[x_vars],train[y_var])
lb.fit(train[x_vars],train[y_var],feature_name=x_vars,categorical_feature=cat_vars)

l1pp = l1.predict_proba(test[x_vars])[:,1]
lbpp = lb.predict_proba(test[x_vars])[:,1]``````

And then we can see how our two models do in this scenario according to the AUC or the Brier score statistic.

``````# ROC for the models
aucl1 = roc_auc_score(test[y_var],l1pp)
auclb = roc_auc_score(test[y_var],lbpp)
print(f'AUC L1 {aucl1}, AUC LightBoosted {auclb}')

# Brier score for models
bsl1 = brier_score_loss(test[y_var],l1pp)
bslb = brier_score_loss(test[y_var],lbpp)
print(f'Brier L1 {bsl1}, Brier LightBoosted {bslb}')`````` So you can see that the L1 model wins over the light boosted model (despite the wonky encoding with missing data) for both the AUC (+0.002) and the Brier Score (+0.001). (Note this is for the pooled sampled for both males/females.)

But is this just luck of the draw for the particular train/test dataset? That is, when we chose another train/test split, but fit the same models, would the light boosted model win some of the time? Here I do that, using the approximately 70% train/test split, but make it random and then estimate the test set Brier/AUC.

``````res = [] #list to stuff results into

for i in range(1000):
print(f'Round {i}')
rand_train = binom.rvs(1,0.7,size=pdata.shape)
train = pdata[rand_train == 1].copy()
test = pdata[rand_train == 0].copy()
l1.fit(train[x_vars],train[y_var])
lb.fit(train[x_vars],train[y_var],feature_name=x_vars,categorical_feature=cat_vars)
l1pp = l1.predict_proba(test[x_vars])[:,1]
lbpp = lb.predict_proba(test[x_vars])[:,1]
aucl1 = roc_auc_score(test[y_var],l1pp)
auclb = roc_auc_score(test[y_var],lbpp)
bsl1 = brier_score_loss(test[y_var],l1pp)
bslb = brier_score_loss(test[y_var],lbpp)
loc_tup = (i,aucl1,auclb,bsl1,bslb)
res.append(loc_tup)

fin_data = pd.DataFrame(res,columns=['Iter','AUCL1','AUCLB','BSL1','BSLB'])

fin_data.describe().T
# L1 wins for Brier score
(fin_data['BSL1'] < fin_data['BSLB']).mean()
# L1 wins for AUC
(fin_data['AUCL1'] > fin_data['AUCLB']).mean()`````` So you can see that the standard deviation for AUC is around 0.005, and the Brier Score is 0.002, also based on the means/min/max we can see that these two models have quite a bit of overlap in the distribution.

But, the results are correlated – when L1 tends to do worse, lightboosted also does worse. So when we look at the rankings, in this scenario L1 wins the majority of the time (but not always). This suggests to me that it was a good thing NIJ did not use AUC to judge, Brier scores seem much less volatile than AUC in this sample.

We can check out the correlations between the scores. AUC only has a correlation of around 0.8, whereas Brier has a correlation of 0.9. (If correlations were 1 the train/test split wouldn’t matter, the same person would always win in the rankings.)

``````# Results tend to be fairly correlated
fin_data.corr()
fin_data.cov()`````` But despite these models having a clear winner in this example, the margins between these two models are larger than the margins in the typical leaderboards. So I did a simulation using the observed leaderboard Brier scores for males for R1 as the means, and used the variance/covariance estimates above to make random draws.

This shows us, given the four observed leaderboard metrics, and my guesstimates for the typical error, how often will the leaders flip. Tighter scores and larger variances mean more flips.

``````# Simulation to see how often rankings flip
mu = np.array([0.1916, 0.1919, 0.1920, 0.1922])
tv = len(mu)
sd = 0.002 # sd and corr based on my earlier simulation
cor = 0.9
var = sd**2
cov = cor*(sd**2)

# filling the var/covariance matrix
r = np.ones((tv,tv)) * cov
np.fill_diagonal(r, var)

# Generating random multivariate normal
np.random.seed(10)
y = np.random.multivariate_normal(mu, r, size=1000)
y_ranks = y.argsort(axis=1)

# Making a nicer long dataset to see how often ranks swapped
persons = ['IdleSpeculation','SRLLC','Sevigny','TeamSmith']
y_rankdf = pd.DataFrame(y_ranks,columns=persons)
longy = y_rankdf.melt()

# How often are the ranks switched?
pd.crosstab(longy['variable'],longy['value'], normalize='columns')`````` How to read this table is that in the observed data for small team Males Round 1, IdleSpeculation was Ranked #1 with a Brier Score of 0.1916. My simulations show that based on those prior estimates, IdleSpeculation takes the top prize the most often (column 0), but only 43% of the time. You can see that even the bottom score here by TeamSmith takes #1 in 10% of the simulations.

This shows that there is some signal in the leaderboard, if it was totally random each of the ranks would have ~25% in each outcome. But it is clearly far from certain though either. This only considers people on the leaderboard who I know their results. It could also easily be someone in 5,6,7 could even have swapped to the #1 results.

What can we learn from this? One, the leaderboard results are unlikely to signify substantively improved models between different competitors. Clearly IdleSpeculation did well across the entire competition, but it would be hard to argue they were clearly better than everyone else (e.g. IdleSpeculations #3 rank in females in round 1 I suspect is just as likely due to bad luck as it is to their model being substantively worse than TeamKlus or TeamSherill).

Two, I think it would be better for competitions like this for people to submit algorithms, and then the algorithms can be judged on various train/tests (or a grid search cross-validation, or whatever). Using a simple train/test split like this will always come with some noise in the rankings.

This also solves the issue with transparency. Currently NIJ is simply asking us to submit a paper saying how we did the results. It would be more transparent to force people to submit code to replicate the results 100% (as well as prevent any obvious cheating).

# Weighting machine learning models

Here is a trick I have seen on several occasions people could take advantage of to make fitting models for big data more convenient. If you are using data that is all categorical and a fairly small number of variables, you can aggregate your data and fit models using weights, instead of fitting models on the original micro level data.

For a recent example at my work, was working on a model that originally has around 6 million observations and around a dozen categorical inputs (each with fairly high cardinality, e.g. around 1000 different categories). When aggregating to unique cases, this model is well under half a million rows though. It is much easier for me to iterate and fit models on the half a million row dataset than the 6 million row one.

Here I will show an example using NIBRS data. See my prior blog post on Association Rules for background on the data, I will be using the 2012 NIBRS dataset, which has over 5 million observations. Below is python code to illustrate, but pretty much all statistical packages allow you weight observations.

So first I load the libraries I will be using, and make a nice function to print out the logit coefficients for a sklearn model.

``````import pandas as pd
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Function to pretty print logit coefficients
def nice_coef(mod,x_vars):
coef = list(mod.coef_) + list(mod.intercept_)
vlist = x_vars + ['Intercept']
return pd.DataFrame(zip(vlist,coef),columns=['Var','Coef'])``````

Next we can read in my NIRBS data directly from the dropbox link (replace www with dl to do this in general for dropbox links).

``````# See https://github.com/apwheele/apwheele.github.io/tree/master/MathPosts/association_rules
# For explanation behind NIBRS data
# If you want to read original NIRBS data
# use https://dl.dropbox.com/sh/puws33uebzt9ckd/AACL3wBhZDr3P_ZbsbUxltERa/NIBRS_2012.csv?dl=0``````

This data is already prepped with the repeated dummy variables for different categories. It is over 5 million observations, but a simple way to work with this data is to use a `groupby` and create a weight variable:

``````group_vars = list(ndum)
ndum['weight'] = 1
ndum_agg = ndum.groupby(group_vars, as_index=False).sum() # sums the weight variable

print(ndum.shape)
print(ndum_agg.shape)`````` So you can see we went from over 5 million observations to only a few over 7000.

A few notes here. One, if querying the data from a DB, it may be better to do the counts on the DB side and only load in the tinier data into memory, e.g. `SELECT COUNT(ID) AS weight, A1,A2... FROM Table GROUP BY A1,A2,....`.

Second, I do not have any missing data here, but pandas groupby will by default drop missing rows. So you may want to do something like `data.fillna(-1).groupby`, or the option to not drop NA values.

Now, lets go onto to fitting a model. Here I am using logit regression, as it is easier to compare the inputs for the weighted/non-weighted model, but you can do this for whatever machine learning model you want. I am predicting the probability an officer is assaulted.

``````logit_mod = LogisticRegression(penalty='none', solver='newton-cg')
y_var = 'ass_LEO_Assault'
x_vars = group_vars.copy() #[0:7]
x_vars.remove(y_var)

# Takes a few minutes on my machine!
logit_mod.fit(ndum[x_vars],ndum[y_var])

# Coefficients for the model
bigres = nice_coef(logit_mod,x_vars)
bigres`````` I was not sure if my computer would actually fit this model without running out of memory. But it did crunch it out in a few minutes. Now lets look at the results when we estimate the model with the weighted data. In all the sklearn models you can just pass in a `sample_weight` into the fit function.

``````logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=ndum_agg['weight'])

weight_res = nice_coef(logit_mod,x_vars)
weight_res`````` You can see that these are basically the same model predictions. For a few of the coefficients you can find discrepancies starting at the second decimal, but the majority are within floating point errors.

``bigres['Coef'] - weight_res['Coef']`` This was fit instantly instead of waiting several minutes. For more intense ML models (like random forests), this can dramatically improve the time it takes to fit models.

If you are interested in doing a train/test split, this is quite easy with the weights as well. Basically you just need to allocate some of the weight to the train and some to the test. Here I show how to do that using a binomial random variable. Then you feed the train weights to the fit function:

``````train_prop = 0.5
train_weight = binom.rvs(ndum_agg['weight'].astype(int), train_prop, random_state=10)
test_weight = ndum_agg['weight'] - train_weight

logit_mod.fit(ndum_agg[x_vars],ndum_agg[y_var],sample_weight=train_weight)``````

And in sklearn pretty much all of the evaluation functions also take a sample weight function.

``````pred_probs = logit_mod.predict_proba(ndum_agg[x_vars])

# Eval metrics for the test data
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=test_weight)
roc_auc_score(ndum_agg[y_var],pred_probs[:,1],sample_weight=train_weight)`````` So this shows that the AUC metric decreases in the test dataset (as you would expect it to). Note do not take this model seriously, I would need to think more thoroughly about the selection of rows here, as well as how to effectively interpret these particular categorical inputs in a more reasonable way than eyeballing the coefficients.

I am wondering if weighting the data is actually a more convenient way to do train/test CV splits, just change the weights instead of splitting up datasets. (You could also do fractional weights, e.g. `train_weight = ndum_agg['weight']/2`, which works nice for stratifying the sample, but may cause some issues generalizing.)

So note this does not always work – but works best with sparse/categorical data. If you have numeric data, you can try to discretize the data to a reasonable amount to still model it as a continuous input (e.g. round age to one decimal, e.g. 20.1). But if you have more than a few numeric inputs you will have a much harder time compressing the data into a smaller number of weighted rows.

It also only works if you have a limited number of inputs. If you have 100 variables you will be able to aggregate less than if you are working with 10.

But despite those limitations, I have come across several different examples in my career where aggregating and weighting the data was clearly the easiest approach, and NIBRS is a great example.

# Creating automated reports using python and Jupyter notebooks

Deborah Osborne had a chat with Chris Bruce the other day about general crime analysis, and they discussed the regular reading of reports. I did this as well when I worked at Troy New York as the lone analyst. I would come in and skim the ~50 reports for the prior day.

The chief and mayor wanted a breakdown of particular noteworthy events, so I would place my own notes in a spreadsheet and then make a daily report. My set up was not fully automated but close – I had a pretty detailed HTML template, and once my daily data was inputted, I would run a SPSS script to fill in the HTML. I also did a simple pin map in batch geo (one thing that was not automated about it) and inserted into the report.

I had two other quite regular reports I would work on. One was a weekly command staff report about overall trends and recent upticks, the other was a monthly Compstat meeting going over similar stats. (I also had various other products to release – detective assignments/workload, sending aggregate stats to the Albany Crime Analysis Center.)

If I had to do these again knowing what I know now, I would automate nearly 100% of this work in python. For the reports, I would use jupyter notebooks (I actually do not like coding in these very much, I much prefer plain text IDEs, but they are good for generating nice looking reports I will show.)

## Making Reports in Jupyter Notebooks

I have provided the notes to fully automate a simple report here on Github. To replicate, first you need to download the Dallas PD open data and create a local sqlite database (can’t upload that large of file to github).

So first before you start, if you download the `.py` files, you can run at the command prompt something like:

``````cd D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports
python 00_CreateDB.py``````

Just replace the `cd` path to wherever you saved the files on your local machine (and this assumes you have Anaconda installed). Then once that is done, you can replicate the report locally, but it is really meant as a pedogological tool – you can see how I wrote the jupyter notebook to query the local database. For your own case you would write the SQL code and connection to whereever your local crime data is store.

Here is an example of how you can use string substitution in python to create a query that is date aware for when the code is run: Part of the hardest part of making standardized reports is you can typically make the data formatted how you want, but to get little pieces looking exactly how you want them is hard. So here is a default pivot table exported in a Jupyter notebook for some year to date statistics (note the limitations of this, why I prefer graphs and do not show a percent change). And here is code I wrote to change font sizes and insert a title. A bit of work, but once you figure it out once you are golden. You can go look at the notebook itself, but I also have an example of generating a weekly error bar chart (much preferred over the Compstat YTD tables): Final note, to compile the notebook without showing any code (the police chief does not want to see your python code!), it looks like this from the command line.

``jupyter nbconvert --execute example_report.ipynb --no-input --to html``

I have further notes in the github page on automating this fully via bat files for windows, renaming files to make them update to the current date, etc.

# Why Automate?

I know some analysts are reading this and thinking to themselves – I can generate these reports in 30 minutes using Excel and Powerpoint – why spend time to learn something new to make it 100% automated? There are a few reasons. One is pure time savings in the end. Say for the weekly report you spend 1 hour, and it takes you three work days (24 hours) to fully automate. You will recover your time in 24 weeks.

Time savings is not the only component though. Fully automating means the workflow is 100% reproducible, and makes it easier to transfer that workflow to other analysts in the future. This is an important consideration when scaling – if you need to spend a few hours once a week forever, you can only take on generating so many reports. It is better for your time to do a one time large sink into automating something, than to loan out a portion of your time forever.

A final part is the skills you develop when generating automated reports are more similar to data science roles in the private sector – so consider it an investment in your career as well. The types of machine learning pipelines I create at my current role would not be possible if I could not fully automate. I would only be able to do 2 or maybe 3 projects forever and just maintain them. I fully automate my data pipelines though, and then can hand off that job to a DevOps engineer, and only worry about fixing things when they break. (Or a more junior data scientist can take over that project entirely.)

# PCA does not make sense after one hot encoding

Here is a general data science snafu I have seen on multiple occasions. You have some categorical variable with a very high cardinality, say 1000 categories. Well, we generally represent categorical values as dummy values. Also for high cardinality data, we often use PCA to reduce the number of dimensions.

So why not one hot encode the data and then use PCA to solve the problem? As the title of the post says, this does not make sense (as I will show in a bit). Here on the data science stackexchange we have this advice, and I have gotten this response at a few data science interviews so far. So figured a blog post why this does not make sense is in order.

I will show an example using python. First here are the libraries we will be using, and some helper functions to work with sklearn’s pca models.

``````################################################
# Libraries we need and some functions to
# work with PCA model object

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

list_PC = ['PC' + str(i+1) for i in range(load.shape)]
columns=list_PC,
index=list(data))

# nice print function for variance explained
def var_exp(mod):
var_rat = mod.explained_variance_ratio_
list_PC = ['PC' + str(i+1) for i in range(var_rat.shape)]
ve = pd.DataFrame(var_rat,index=list_PC,columns=['VarExplained'])
print(ve.round(3))
return ve

# Returns a nicer dataframe after transform with named PC columns
def nice_trans(mod,data):
np_dat = mod.transform(data)
list_PC = ['PC' + str(i + 1) for i in range(np_dat.shape)]
pd_dat = pd.DataFrame(np_dat,columns=list_PC,index=data.index)
return pd_dat

################################################``````

Now lets simulate some simple data, where we have 5 categories, and each category has the same overall proportion in the data. Then we turn that into a one-hot encoded matrix (so 5 dummy variables). We will be using the covariance matrix to do the PCA decomposition (but does not really matter), and the covariance matrix has a nice closed form solution just based on the proportions.

``````################################################
# Prepping simulated data

np.random.seed(10) # set seed for reproducing

# number of categories
n_cats = 5

# generating equal probabilities
n_obs = 1000000
sim = np.random.choice(a=n_cats,size=n_obs)
# Or could fix exactly
#sim = np.repeat(range(n_cats),n_obs/n_cats)
sim_pd = pd.Series(sim)
print(sim_pd.value_counts(normalize=True))

# pandas one hot encoding
ohe = pd.get_dummies(sim_pd,prefix='X')

# covariance matrix
ohe.cov()
# covariance is -p(a)*p(b)
# variance is p(a) - p(a)^2
################################################`````` Now we are ready to fit the PCA model on the one hot encoded variables.

``````# Fitting the PCA and extracting loadings
pca_mod = PCA()
pca_mod.fit(ohe) #based on covariance matrix

# showing what the transform looks like on simple data
simple_ohe = pd.DataFrame(np.eye(n_cats,dtype=int),columns=list(ohe))
print( nice_trans(pca_mod,simple_ohe).round(3) )`````` And we can see all looks ok, minus the final principal component is a constant. This is because when we use all five dummy variables in a one-hot encoded matrix, one column is a linear transformation of the others. So the covariance matrix is just short of being full rank by 1 variable. (Hence why for various algorithms, such as regression, we typically drop one of the dummy variables.)

So typically we interpret PCA as an information compressor – we try to squeeze much of our variance into a smaller number of dimensions. You can interpret each principal component as contributing to the overall variance, and here is that breakdown for our example:

``````# Showing variance explained
var_equal = var_exp(pca_mod)`````` So you can see that our reduction in information is not a reduction at all. Typically if we had say 100 dimensions, we may only take the first 10 principal components. While we could do that here, it doesn’t gain us anything. It will essentially be no different than just using the dummy variables for the top k most frequent categories – which with essentially equal categories is just a random set of 10 possible dummy variables.

What about if we do unequal proportions?

``````#################################################
# What happens with variance explained for unequal
# proportions?

n_probs = [0.75,0.15,0.05,0.045,0.005]
sim = np.random.choice(a=n_cats,size=n_obs,p=n_probs)
sim_pd = pd.Series(sim)
print(sim_pd.value_counts(normalize=True))
ohe = pd.get_dummies(sim_pd,prefix='X')
print(ohe.cov())

pca_mod_unequal = PCA()
pca_mod_unequal.fit(ohe)

# Showing variance explained
var_unequal = var_exp(pca_mod_unequal)

#################################################`````` You can see that the variance explained is similar to the overall proportions in the data. Using the first k principal components will not be exactly the same as keeping the k most frequent categories in the data, as you can see the loading matrix has a particular format (signs can be flipped, PC1 loads opposite on top category and 2nd category, then PC2 loads high on 1,2 and opposite on 3, etc., so looks offhand similar to some type of contrast encoding). But there is no reason offhand to think it will result in a better fit for your particular data than taking the top K (or fitting a model for all dummy variables). (This is generally true for any data reduction via PCA though.)

I attempted to derive a general formula for the eigenvalues from the base proportions, but I was unable to (that determinant and characteristic polynomial form is tricky!). But it appears to me that they mostly just follow the overall proportions in the data. (See at the end of post showing the form of the covariance matrix, and failed attempt at using sympy to derive a general form for the eigenvalues.)

Here is another example with 1000 categories with random probabilities drawn from a uniform distribution. I can generate the eigenvalues without generating a big simulated dataset, since the covariance matrix has a particular form just based on the proportions.

``````#################################################
# Lets try with a very high cardinality

many_cats = 1000
n_probs = np.random.random(many_cats)
n_probs /= n_probs.sum()
n_probs = -np.sort(-n_probs)
# ?sorted makes the eigenvalues sorted as well?

# I can calculate the eigenvalues without
# generating the big dataset

def ohe_pca(probs):
# Creating the covariance matrix
p_len = len(probs)
pnp = np.array(probs).reshape((p_len,1)) #2d array
cov = -np.dot(pnp,pnp.T)
np.fill_diagonal(cov, pnp - pnp**2)
plabs = ['PC' + str(i+1) for i in range(p_len)]
cov_pd = pd.DataFrame(cov,columns=plabs,index=plabs)
# SVD of covariance
print('Variance Explained')
print( (eigenvals / eigenvals.sum()).round(3) )

#################################################`````` And here you can see again that the variance explained for each component essentially just follows the overall proportion for each distinct category in the data.

So you don’t gain anything by doing PCA on a one hot encoded matrix, besides creating a more complicated rotation of your binary data.

# Covariance of one-hot-encoded and sympy

So first, I state in the comments that the covariance matrix for one-hot encoded variables takes on the form `Cov(a,b) = -p(a)p(b)`. So the definition of the covariance between two values a and b is below, where `E[]` is the expected value operator.

``Cov(a,b) = E[ (a - E[a])*(b - E[b]) ]``

For binary variables 0/1, `E[a] = p(a)`, where `p(a)` is the proportion of 1’s in the column vector. So we can make that replacement above and then expand the inner multiplications:

``````E[ (a - p(a))*(b - p(b)) ] =

E[ a*b - a*p(b) - p(a)*b + p(a)*p(b) ]``````

Due to bilinearity of the expected value, we can then break this down into:

``E[a*b] - E[a*p(b)] - E[p(a)*b] + E[p(a)*p(b)]``

The value `E[a*b] = 0`, because the categories are mutually exclusive (if one is 1, the other is 0). The value `E[a*p(b)]` means multiply the vector of `a` values by the proportion of b and take the expected value. For a vector of 0’s and 1’s, this reduces to simply `p(a)*p(b)`. So by that logic we have:

``````0 - p(a)*p(b) - p(a)*p(b) + p(a)*p(b)

-p(a)*p(b)``````

This same logic works for the variance, except the term `E[a*b]` is now `E[a*a]`, which does not equal zero, but for a vector of 0’s and 1’s just equals itself, and so reduces to `p(a)`, hence for the variance we have:

``p(a) - p(a)^2 = p(a)*(1 - p(a))``

I have attempted to write the covariance matrix in this form, and then determine a closed form expression for the eigenvalues, but no luck so far. Here are my attempts to let the sympy python library help. You can see even for a covariance matrix of only three categories, the symbolic representation of the eigenvalues is pretty hairy. So not sure if any simplification is possible:

``````import sympy

# Returns a covariance matrix
# and proportion dictionary
# in sympy
def covM(n):
pdict = {}
for i in range(n):
pstr = 'p_' + str(i+1)
pdict[pstr] = sympy.Symbol(pstr,positive=True)
mlist = []
for i in range(n):
a = 'p_' + str(i + 1)
aex = pdict[a]
mrow = []
for j in range(n):
b = 'p_' + str(j + 1)
bex = pdict[b]
if a == b:
mrow.append(aex*(1 - aex))
else:
mrow.append(-aex*bex)
mlist.append(mrow)
M = sympy.Matrix(mlist)
return M, pdict

M, ps = covM(3) #cannot do 5 roots!
ev = M.eigenvals()
ev_list = [k for k in ev.keys()]
e1 = ev_list
print( e1 )

# Not sure how to simplify/collect/whatever
# Into a simpler expression`````` If you can simplify that I am all ears! (Maybe should be working with `M.charpoly()` instead?)