Crime De-Coder LLC Website

So I have created CRIME De-Coder LLC, a firm to do my consulting work with police departments. Check out my website, crimede-coder.com.

Feedback is welcome. In particular, check out the services pages, and my first blog post on what distinguishes my services from most firms. Providing computer code to generate the end product is "teaching a man to fish", whereas most firms just drop a final report and leave.

And of course feel free to reach out to consult@crimede-coder.com if you are interested in pursuing a project. Going forward I plan on making a new post around once a month, so sign up in your feed reader or using a service like IFTTT.


Setting up a stand-alone website is not that hard in the end. Currently it is a static site with some custom JavaScript (hosted on Hostinger). I should do a PHP server for the new blog posts and RSS feed eventually, but for now this is fine. For those interested in doing the same, I suggest the Jon Duckett books (HTML/JavaScript/PHP) for an overview of the tech, and then check out Dani Kross's youtube tutorials (for random things like editing the htaccess file).

I am not doing a newsletter for the blog-posts, as I am concerned it will get my email on random block lists. But if there is demand for it in the future I will figure out some other service I guess to do that.

I wanted a more bare-metal setup (not a hosted wordpress like this site), as in the future I will likely do demos of dashboards, host some pyscript, make a sign-in for paid content, etc. I just wanted flexibility from the start. So stay tuned for more content from CRIME De-Coder!

Won the algae bloom prediction competition

I recently was one of the winners in the DataDriven competition predicting algae blooms, under the username apwheele. So I have written in the past about alternative competition sites. To decide whether I will compete in a competition, I weigh:

  • number of competitors
  • overall prizes
  • my self-assessed skill level

And then whether I have sufficient time to devote to the competition. For an alternative example, the Astral Codex blog has a book review contest. The prior year's competition had 133 competitors, so given prizes of 4k in 2023, 4k/133 ~ $30 in expected return. If you have a burning desire to review a book go for it, but I don't think I have some secret that gives me a large enough competitive edge over other readers of Scott's blog to make this competition worth my time.

For the algae bloom competition, I saw DataDriven had previously around 1k competitors per competition, and the prizes for 1st/2nd/3rd in this competition were 12k/9k/6k. They have a few other consolation prizes, for a total of $30k. The expected return is basically the same as for the Codex blog, ~$30, but I figured I had a better competitive edge. (Although I knew it would be more work than writing a book review.)
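
Just to make the back-of-the-envelope math explicit (a tiny sketch, the competitor counts are my own rough estimates):

# Rough expected return per competition
# (prize totals from the posts above, competitor counts are rough estimates)
def expected_return(total_prizes, n_competitors):
    return total_prizes / n_competitors

print(expected_return(4000, 133))    # Astral Codex book review, ~$30
print(expected_return(30000, 1000))  # DataDriven algae bloom, ~$30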

I am typically hesitant to do Kaggle competitions, as some have over 100k competitors (I feel at that point you are close to the "monkeys typing on a keyboard will eventually produce Shakespeare" stage). I debated recently on doing the Kaggle competition on Neutrinos in Ice, but due to steeper competition and less time (prizes are similar to the Algae Bloom one) I will not be competing.

In terms of self-assessed skill, you may be thinking "Andy, you have no related skills in remote sensing/biology", which is true. In this specific competition, one of the things that prompted me to compete was the use of ancillary data – it is not just satellite imagery, you can fold in more data. This tends to favor tabular/tree based models, which I have more experience with. Additionally, the example getting-started blog post by DataDriven made what I consider a critical error – they used a multinomial model (predicting categories) instead of a regression model predicting a continuous outcome. An ordinal model may be defensible, but with the error metric being root mean squared error, the way they used the multinomial model did not make sense. E.g. if the multinomial model predicted severity 1 at 51% and severity 5 at 49%, your prediction should be close to 3, not 1. Since the majority of people competing in these competitions are clearly just copying code and not understanding the stat models under the hood, I knew this would send a decent number of competitors down the wrong track.
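
To make that concrete, here is a tiny hypothetical example (not the getting-started code itself) showing that under squared error the best point prediction from class probabilities is the expected value, not the most likely class:

import numpy as np

# hypothetical predicted probabilities for severity levels 1-5
sev = np.array([1, 2, 3, 4, 5])
probs = np.array([0.51, 0.0, 0.0, 0.0, 0.49])

argmax_pred = sev[np.argmax(probs)]  # multinomial "pick the top class" -> 1
ev_pred = np.sum(sev * probs)        # expected value minimizes squared error -> ~2.96
print(argmax_pred, ev_pred)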

Another aspect I look for in modeling competitions is weird loss functions – this is one of the reasons Gio and I won quite a bit in the NIJ recidivism competition. Essentially, things that require you to write custom code (or think about the math a little under the hood) I suspect give me a decent edge over quite a few competitors. Things where just your ability to fit and hypertune a deep-learning model on sensor data is the difference maker I am not going to compete in.

So that was my thinking at the start of the competition. An aspect I did not anticipate though was that it was quite a chore to download the data. Unlike many competitions in which the providers gift you data, DataDriven had you download your own satellite data. It was quite a lot of work to write code to do this – it wouldn't surprise me if I spent 40 hours writing code for the competition overall. I also ended up signing up for the planetary computer resource (I would get rate limited downloading data otherwise). I bet some individuals do not know they should just cache the feature data, not re-run the downloads every time – it takes over 2 days of loops in python to download all the data.
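
A minimal sketch of what I mean by caching the feature data (the download_satellite_features helper and file layout here are hypothetical placeholders, not my actual competition code):

import os
import pandas as pd

def get_features(sample_id, cache_dir="./feature_cache"):
    # if we already downloaded features for this sample, read them from disk
    cache_file = os.path.join(cache_dir, f"{sample_id}.parquet")
    if os.path.exists(cache_file):
        return pd.read_parquet(cache_file)
    # otherwise hit the satellite imagery source (hypothetical helper), then cache
    feats = download_satellite_features(sample_id)
    os.makedirs(cache_dir, exist_ok=True)
    feats.to_parquet(cache_file)
    return feats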

So in the end, the competition had listed over 1300 competitors, but many were not serious. Maybe halfway through the competition (which was around Christmas, I suspect that also reduced competition), there were only 300 or 400 competitors signed up. If you signed up in the last few weeks, you probably did not have enough time to write code, download data, and tinker with models. In the end, fewer than 100 people submitted 2+ times and beat the benchmark DataDriven results. So that makes me think maybe I should consider Kaggle competitions more seriously, if less than 10% of people competing even give a serious attempt.

So in terms of competitions, I have participated in 4 overall: the NIJ recidivism forecasting challenge, the Maternal Morbidity challenge, the toxic rating competition, and now this algae bloom competition.

Gio was the one who forwarded the recidivism competition. I should have competed in the prior hot spots NIJ competition, so I hopped on that opportunity and we did a good job.

I spent a decent chunk of time on the Maternal Morbidity challenge, with prizes of 50k for multiple teams. So although I did not win, I thought that was worth a shot (I am more hesitant about soft assessment competitions though). For the toxic rating competition I failed fast – I spent two days working on a few ideas/models. I was not that high on the leaderboard with my ideas (pulling in data from alternative sources and building a few different types of models and ensembling them together), so I stopped after a short period. I would have done the same for the algae competition, but just using a simple regression model (without even including the satellite data), I was #1 on the leaderboard. So I spent more time downloading the data and tinkering over time, which kept me in the #1 spot on the public leaderboard in the end.

ptools R package update

So as an update to my R package ptools, I have bumped a major version change to 2.0, which is now up on CRAN.

There is no new functionality, but I wanted to bump versions because I swapped out rgdal/rgeos for sf (rgdal and rgeos are being deprecated). All the functions currently still take sp objects as inputs/outputs. If I ever get around to it, I will convert the functions to take either. They are somewhat inefficient with the conversions, but if you are doing something where it matters you should likely switch your data-engineering to another system entirely (such as via SQL in postgis directly). Generating hexagons should actually be faster now, as the sf version I swapped in is vectorized (whereas how I was using sp prior was a loop).

I debate every now and then just letting this go. I can see on cranlogs I have a grand total of just over 1k downloads (as of 2/7/2023), and around 200 some downloads last month.

Time is finite, so I have debated on dropping this and just porting most of the functions over into python. Those cumulative downloads are partially bots (I may have racked up 100 of those downloads in my CICD actions). Let me know if you actually use this, as that gives me feedback whether to bother continuing to develop/update this.

Setting up pyspark to run sql tests

So hacker news the other day had a thread on how you test your SQL code. This is such a nerd/professional data science type of question, I love it (and many of the highly upvoted answers are good, but lack code examples/nitty gritty details, hence this blog post). I figured I would write up my notes on this and illustrate some real code. You get a few things in this post:

  • how to set up spark locally on a windows machine
  • how to generate SQL tests in python with pytest
  • how to set up a github action CICD pipeline to test pyspark/sql

Setting up spark on Windows

First, before we get started, getting spark up and running on a windows machine is non-trivial. Maybe I should use Docker, like I do for postgres, but here are my notes on setting it up to run locally on Windows (much of this is my condensed notes + a few extras based on this DataCamp post).

  • Step 1: Install Java. Make sure where you install it has NO SPACES. Then create an environment variable, JAVA_HOME, and point it to where you installed Java. Mine for example is C:\java_jdk (no spaces).
  • Step 2: Install spark. Set the environment variable SPARK_HOME to wherever you downloaded it, mine is set to C:\spark. Additionally set the environment variable PYSPARK_PYTHON to python.
  • Step 3: Set up your python environment, e.g. something like conda create --name spenv python=3.9 pip pyspark==3.2.1 pyarrow pandas pytest to follow along with this post. (Note to make pyspark and spark downloaded versions line up.)
  • Step 4: Then you need to download winutils.exe that matches your version. So I currently have spark=3.2.1, so I just do a google search for “winutils 3.2” and up pops this github repo with the version I need (minor versions probably don’t matter). (You do not need to double click to install, the exe file just needs to be available for other programs to find.) Then set the environment variable HADOOP_HOME to the folder where you downloaded winutils.exe. Mine is C:\winutils. (See the quick sanity check after this list.)
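
As a quick sanity check that the environment variables are visible (just a sketch using the variable names above):

# Quick check the environment variables are set (None means not set)
import os

for var in ["JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"]:
    print(var, "->", os.environ.get(var))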

I know that is a lot – but it is a one time thing. Now once this is done, from the command line try to run:

pyspark --version

You should be greeted with something that looks like:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 11.0.15

Plus a bunch of other messages (related to logging errors on windows). (You can set the spark-defaults.conf and log4j.properties file to make these less obnoxious.)

But now you are good to go running spark locally on your windows machine. Note you get none of the reasons you want to use spark, such as distributing computations over multiple machines (this is equivalent to running computations on the head node). But it is convenient to do what I am going to show – testing your SQL code locally.

Testing SQL code in python

Now that we have pyspark set up, we can run a test in python. Instead of typing python into the REPL, go ahead and run pyspark, so we are in a pyspark REPL session. You will again see the big Spark ascii art at the beginning. Here is an example of creating a dummy temp table inside of our spark database (this is just a local temp table).

# Inside of pyspark REPL
# so spark is available
import pandas as pd

x = [1,2,3,4]
y = ['a','b','c','c']
df = pd.DataFrame(zip(x,y),columns=['x','y'])

# Can create a temp spark table based on fake data
sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView('test_table')

Now we can run our SQL against the spark database.

# sql to test
query = '''
SELECT
  SUM(x) AS cum_x,
  COUNT(y) AS n,
  y
FROM test_table
GROUP BY y'''

spark.sql(query).show()

And voila, you can now test your spark queries. Although this shows spark, you can do something very similar for different databases as well, e.g. using pyodbc or sqlalchemy. You would want to have a test dummy database you can write to though to do that (a few different databases have volatile or temp tables equivalent to spark's here).
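
For example, here is a rough sketch of the same pattern against a throwaway in-memory SQLite database via sqlalchemy (SQLite's SQL dialect differs from spark, so this is only to show the idea, not a drop-in replacement):

import pandas as pd
import sqlalchemy

# throwaway in-memory sqlite database as the test target
eng = sqlalchemy.create_engine("sqlite://")
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "c"]})
df.to_sql("test_table", eng, index=False)

query = "SELECT SUM(x) AS cum_x, COUNT(y) AS n, y FROM test_table GROUP BY y"
print(pd.read_sql(query, eng))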

Setting up pytest + spark

So this shows how to set this up in just a REPL session, but we also want automated tests for multiple scripts via pytest. To illustrate, I am going to refer to my retenmod repo. Even though this library obviously has nothing directly to do with pyspark, I am using this repo as a way to test code/python packaging. (And note if using the same environment above, to replicate 100% of that repo you need to pip install black flake8 jupyter jupyter_contrib_nbextensions pre-commit matplotlib.)

So now for pytest, a special thing is you can create python objects (fixtures) that are shared across all of your tests. Here we will create the spark object one time, and then it can be passed on to your tests. So in ./tests make a file conftest.py, and in this file write:

# This is inside of conftest.py
import pytest
from pyspark.sql import SparkSession

# This sets the spark object for the entire session
@pytest.fixture(scope="session")
def spark():
    sp = SparkSession.builder.getOrCreate()
    # Then can do other stuff to set
    # the spark object
    sp.sparkContext.setLogLevel("ERROR")
    sp.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    return sp

In pytest lingo, fixtures are then passed around to the different test functions in your library (where appropriate). You can create the equivalent of spark in a pyspark session via SparkSession.builder. This is essentially what a pyspark session is doing under the hood to generate the spark object you can run queries against.

Now we can make a test function, here in a file named test_spark.py (in the same tests folder). That will then look something like:

# pseudo code to show off test
# in test_spark.py
import pandas as pd
from yourlibrary import func

# dummy data matching the REPL example earlier in this post
test_table = pd.DataFrame(zip([1, 2, 3, 4], ["a", "b", "c", "c"]), columns=["x", "y"])

def test_sql(spark):
    # Create the temporary table in spark
    sdf = spark.createDataFrame(test_table)
    sdf.createOrReplaceTempView("test_table")
    # Test out our sql query
    query = func(...your parameters to generate sql...)
    res = spark.sql(query).toPandas()
    # No guarantees on row order in this result
    # so I sort pandas dataframe to make 100% sure
    res.sort_values(by="y", inplace=True, ignore_index=True)
    # Then I do tests
    # Type like tests, shape and column names
    assert res.shape == (3, 3)
    assert list(res) == ["cum_x", "n", "y"]
    # Mathy type tests
    dif_x = res["cum_x"] != [1, 2, 7]
    assert dif_x.sum() == 0
    dif_n = res["n"] != [1, 1, 2]
    assert dif_n.sum() == 0

So note here that I pass in spark as an argument to the function. pytest then uses the fixture we defined earlier to pass into the function when the test is actually run. And then from the command line you can run pytest to check out if your tests pass! (I have placed in the retenmod package a function to show off generating a discrete time survival table in spark, will need to do another blog post though on survival analysis using SQL.)

So this shows off what most of my sql type tests look like. I test that the shape of the data is how I expected (given my fake dataframe I passed in), columns are correct, and that the actual values are what they should be. I find this useful especially to test edge cases (such as missing data) in my SQL queries. My functions in production are mostly parameterized SQL, but it would work the same if you had a series of text SQL files, just read them into python as strings.
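
For a concrete (hypothetical) example of what I mean by parameterized SQL, the func in the pseudo code above could be as simple as a function that fills in table and column names:

# hypothetical parameterized SQL function
def agg_query(table, group_col, sum_col):
    return f"""
    SELECT
      SUM({sum_col}) AS cum_{sum_col},
      COUNT({group_col}) AS n,
      {group_col}
    FROM {table}
    GROUP BY {group_col}"""

query = agg_query("test_table", "y", "x")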

Setting up github actions

As a bonus, I show how to set this up in an automatic CICD check via github actions. Check out the actions workflow file for the entire workflow in-situ. But the YAML file has a few extra steps compared to my prior post on generating a wheel file in github actions.

It is short enough I can just paste the whole YAML file here for you to check out.

on:
  push:
    branches:
      - main

jobs:
  build_wheel:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
            java-version: 11
      - uses: vemonet/setup-spark@v1
        with:
          spark-version: '3.2.1'
          hadoop-version: '3.2'
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          # installing spark libraries
          pip install pyspark pyarrow pandas pytest
          # now install the wheel file
          find ./dist/*.whl | xargs pip install
          # Now run pytest tests
          pytest --disable-warnings ./tests
      - name: Configure Git
        run: |
          git config --global user.email "apwheele@gmail.com"
          git config --global user.name "apwheele"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push
      - run: echo "🍏 This job's status is ${{ job.status }}."

So here in addition to python, this also sets up Java and Spark. (I need to learn how to build actions myself – I wish vemonet/setup-spark@v1 had an option to cache the spark file instead of re-downloading it every time.) Even with that though, it will only take a minute or two to run (the timeout is just a precaution, and a good idea to put in any github action).

You can then check this action out on github.

Happy SQL code testing for my fellow data scientists!

Web scraping police data using selenium and python

So I have a few posts in the past on scraping data. One shows downloading and parsing structured PDFs, almost all of the rest though use either JSON API backends, or just grab the HTML data directly. These are fairly straightforward to deal with in python. You generate the url directly, use requests, and then just parse the returned HTML however you want.

Came across a situation recently though where I needed to interact with the webpage. I figured a blog post to illustrate the process would be good. (For both myself and others!) So here I will illustrate entering data into San Antonio’s historical calls for service asp application (which I have seen several PDs use in the past).

It is tough for me to give general advice about scraping – it involves digging into the source code for a particular website. Here, if you click on the Historical Calls button, the url stays the same, but it presents you with a new form page to insert your search parameters:

This is a bit of a red-herring though – it ends up being that the entire page is embedded in what is called an iframe, so the host URL stays the same, but the window inside the webpage changes. On the prior opening page, if you hover over the link for Historical Calls you can see it points to https://webapp3.sanantonio.gov/policecalls/Reports.aspx, so that is the page we really need to pay attention to.

So for general advice, using Chrome to view a web-page's source html, you can right-click and select view-source:

And you can also go into the Developer tools to check out all the items in a page as well.

Typically before worrying about selenium, I study the network tab in here. You want to pay attention to the items that take the longest/have the most data. Typically I am looking for JSON or text files here if I can’t scrape the data directly from the HTML. (Example blog posts grabbing an entire dump of data here, and another finding a hidden/undocumented JSON api using this approach.) Here is an example network call when inputting the search into the San Antonio web-app.

The data is all being transmitted inside of the aspx application, not via JSON or other plain text files (don't take my terminology here as authoritative, I really know near 0% about servers). So we will need to use selenium here. Using python you can install the selenium library, but you also need to download a driver (here I use chrome), and then wherever you save that exe file, add that location to your PATH environment variable.

Now you are ready for the python part.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import pandas as pd

# Setting Chrome Options
chrome_options = Options()
#chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("log-level=3")

# Getting the base page
base_url = "https://webapp3.sanantonio.gov/policecalls/Reports.aspx"
driver = webdriver.Chrome(options=chrome_options)
driver.get(base_url)

Once you run this code, you will see a new browser pop up. This is great for debugging, but once you get your script finalized, you can see I commented out a line to run headless (so it doesn't bug you by flashing the browser up on your screen).

Now typically what I do is look at the HTML source (like I showed earlier), and then search for the input buttons in HTML. We are trying to figure out the elements we need to insert the data for us to submit a search. Here is the first input for an item we care about, the begin date of the search.

Now we can insert our own date by grabbing the element from the web-page. I grab it here by the “id” attribute in the HTML (many tutorials use xpath, which I am not as familiar with, but at least for these aspx apps what I show works fine). For dates that have a validation stage, you need to not only .send_keys, but to also submit to get past the date validation.

# Inserting date field for begin date
from_date = driver.find_element("id", "txtStart")
from_date.send_keys("10/01/2022")
from_date.submit()

Once you run that code you can actually view the web-page, and see that your date is entered! Now we need to do the same thing for the end date. Then we can put in a plain text zipcode. Since this does not have validation, we do not need to submit it.

# Now for end date
end_date = driver.find_element("id", "txtEndDate")
end_date.send_keys("10/02/2022")
end_date.submit()

# Now inserting text for zipcode
zip = driver.find_element("id", "txtZipcode")
zip.send_keys("78207")
# Sometimes need to clear, zip.clear()

I have a note there on clearing a text box. Sometimes websites have pre-filled options. Sometimes websites also do not like .clear(), and you can simulate backspace keystrokes directly instead. This website does not like it if you clear a date field, for example.
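
For the backspace approach, a sketch looks something like below (the loop count is just an arbitrary upper bound on characters to wipe out):

from selenium.webdriver.common.keys import Keys

# simulate backspace key presses instead of .clear()
for _ in range(10):
    zip.send_keys(Keys.BACK_SPACE)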

Now the last part, I am going to select a drop-down. If you go into the HTML source again, you can see the list of options.

And now we can use the Select function I imported at the beginning to select a particular element of that drop-down. Here I select the crimes against persons.

# Now selecting dropdown
crime_cat = driver.find_element("id", "ddlCategory")
crime_sel = Select(crime_cat)
crime_sel.select_by_visible_text("Crimes Against Person Calls")

Many of these applications have rate limits, so you need to limit the search to tiny windows and subsets, and then loop over the different sets to grab all of the data (being nice and using time.sleep() between calls to get the results).

Now we are ready to submit the query. The same way you can enter in text into input forms, buttons you can click are also labeled as inputs in the HTML. Here I find the submit button, and then .click() that button. (If there is a direct button to download CSV or some other format, it may make sense to click that button.)

# Now can find the View Data button and submit
view_data = driver.find_element("id", "btnSearch")
view_data.click()

Now that we have our web-page, we can get the HTML source directly and then parse that. Pandas has a nice method to grab tables, and this application is actually very nicely formatted. (I tend to not use this, as many webpages have some very bespoke tables that are hard to grab directly like this). This method grabs all the tables in the web-page by default, here I just want the calls for service table, which has an id of "gvCFS", which I can pass into the pandas .read_html function.

# Pandas has a nice option to read tables directly
html = driver.page_source
cfs = pd.read_html(html, attrs={"id":"gvCFS"})[0]

And that shows grabbing a single result. Of course to scrape, you will need to loop over many days (and here different search selections), depending on what data you want to grab. Most of these applications have search limits, so if you do too large a search, it will only return the first say 500 results. And San Antonio's is nice because it returns a single table in the web-page – for most you need to page through the results as well, which takes further scraping and interacting with the page. So it is more painful whenever you need to resort to selenium.
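
A sketch of that outer loop over days might look like the below, wrapping the find/send_keys/click steps shown above in a hypothetical run_search helper:

import time
import pandas as pd

results = []
days = pd.date_range("2022-10-01", "2022-10-31")
for day in days:
    # hypothetical helper wrapping the send_keys/click steps shown above
    begin = day.strftime("%m/%d/%Y")
    end = (day + pd.Timedelta(days=1)).strftime("%m/%d/%Y")
    results.append(run_search(driver, begin, end))
    time.sleep(5)  # be nice to the server between searches

all_cfs = pd.concat(results, axis=0)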

Sometimes pages will point to PDF files, and you can set Chrome’s options to download to a particular location in that scenario (and then use os.rename to name the PDF whatever you want after it is downloaded). You can basically do anything in selenium you can manually, it is often just a tricky set of steps to replicate in code.
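
For that PDF scenario, the Chrome options look something like this (the folder and file names are just examples, and the prefs need to be set before creating the webdriver):

# point Chrome's automatic downloads at a specific folder (example path)
chrome_options.add_experimental_option(
    "prefs", {"download.default_directory": r"C:\Users\me\Downloads\pdf_dump"}
)

# after clicking the download link and waiting for the file to land,
# rename it to something meaningful, e.g.
# os.rename(r"C:\Users\me\Downloads\pdf_dump\report.pdf",
#           r"C:\Users\me\Downloads\pdf_dump\report_2022_10_01.pdf")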

Using quantile regression to evaluate police response times

Jeff Asher recently had a post on analysis of response times across many agencies. One nitpick though (and ditto for prior analyses I have seen, such as Scott Mourtgos and company), is that you should not use linear models (or means in general) to describe response time distributions. They are very heavily right skewed, and so the mean tends to be not representative of the bulk of the data.

When evaluating changes in response time, imagine two simplistic scenarios. One, every single call increases by 5 minutes, so what used to be 5 is now 10, 20 is now 25, 60 is now 65, etc. That is probably not realistic for response times – more likely it is calls in the tail (ones that take a very long time to wait for an opening in the queue) that change. E.g. 5 is still 5, 20 is still 20, but 60 is now 120. In the latter scenario, the left tail of the distribution does not change, only the right tail. In both scenarios the mean shifts.
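
A quick numeric illustration of those two hypothetical scenarios (made-up response times, not the real data):

import numpy as np

base = np.array([5, 20, 60])         # made-up response times
shift_all = base + 5                 # scenario 1: every call 5 minutes longer
shift_tail = np.array([5, 20, 120])  # scenario 2: only the long call gets longer

for d in [base, shift_all, shift_tail]:
    # mean shifts in both scenarios, lower quantiles only in the first
    print(d.mean(), np.quantile(d, [0.1, 0.5, 0.9]))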

I think a natural way to model the problem, instead of focusing on means, is to use quantile regression. It allows you to characterize the entire distribution (look at the left and right tails) and still control for individual level circumstances. Additionally, emergency agencies often set goals along the lines of "I want to respond to 90% of emergency events within X minutes". Quantile regression is a great tool to describe that 90% mark. Here I am going to show an example using the New Orleans calls for service data and python.

First, we can download the data right inside of python without saving it directly to disk. I am going to be showing off estimating quantile regression with the statsmodel library. I do the analysis for 19 through 22, but NOLA has calls for service going back to the early 2010s if folks are interested.

import pandas as pd
import statsmodels.formula.api as smf

# Download data, combo 19/20/21/22
y19 = 'https://data.nola.gov/api/views/qf6q-pp4b/rows.csv?accessType=DOWNLOAD'
y20 = 'https://data.nola.gov/api/views/hp7u-i9hf/rows.csv?accessType=DOWNLOAD'
y21 = 'https://data.nola.gov/api/views/3pha-hum9/rows.csv?accessType=DOWNLOAD'
y22 = 'https://data.nola.gov/api/views/nci8-thrr/rows.csv?accessType=DOWNLOAD'
yr_url = [y19,y20,y21,y22]
res_pd = [pd.read_csv(url) for url in yr_url]
data = pd.concat(res_pd,axis=0) # a little over 1.7 million

Now we do some data munging. Here I eliminate self initiated events, as well as those with missing data. There then are just a handful of cases that have 0 minute arrivals, which to be consistent with Jeff’s post I also eliminate. I create a variable, minutes, that is the minutes between the time created and the time arrived on scene (not cleared).

# Prepping data
data = data[data['SelfInitiated'] == 'N'].copy() # no self init
data = data[~data['TimeArrive'].isna()].copy()   # some missing arrive
data['begin'] = pd.to_datetime(data['TimeCreate'])
data['end'] = pd.to_datetime(data['TimeArrive'])
dif = data['end'] - data['begin']
data['minutes'] = dif.dt.seconds/60
data = data[data['minutes'] > 0].copy() # just a few left over 0s

# Lets look at the distribution
data['minutes'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

For quantiles, for the entire sample the median time is around 20 minutes, the 10th percentile is under 3 minutes, and the 90th percentile is around 5 hours. Using the mean here (which in Jeff's post varies from 50 to 146 minutes over the same 4 year period) can be somewhat misleading.

An important component of response times is differentiating between different priority calls. In NOLA's data, higher numbers are higher priority. Zero priority calls are things NOLA says don't necessarily need an officer at all. So it could be those "0 priority" calls are really just dragging the overall average down over time, although they may have little to do with clearance rates or public safety overall. The priority field also has sub-categories, e.g. 1A is higher priority than 1B. To keep the post simple I just break down by the integer leading values, not the sub letter-categories.

# Priority just do 1/2/3
# 3 is highest priority
data['PriorCat'] = data['Priority'].str[0]
# Only 5 cases of 3s, will eliminate these as well
data.groupby('PriorCat')['minutes'].describe()

Here you can really see the right skewness – for priority 2 calls the mean is 25 minutes, but the median is under 10 minutes for the entire sample. A benefit of quantile regression I will use in a bit: the few outlying cases (beyond the quantiles of interest) really don't impact the analysis. So those cases that take almost 24 hours (I imagine they are just auto-filled in like that in the data) really don't impact estimates of the smaller quantiles. But they can have a noticeable influence on mean estimates.

Some final data munging, to further simplify I drop the 16 cases of priority 3s and 4s, and add in a few more categorical covariates for hour of the day, and look at months over time as categorical. (These decisions are more so to make the results easier to parse in a blog post in simpler tables, it would take more work to model a non-linear continuous over time variable, say via a spline, and make a reasonable ordinal encoding for the sub-priority categories.)

# only worry about 0/1/2s
data = data[data['PriorCat'].isin(['0','1','2'])].copy()
# Total in the end almost 600k cases

# Some factor date variables
def dummy_stats(vdate,begin_date):
    bd = pd.to_datetime(begin_date)
    year = vdate.dt.year
    month = vdate.dt.month
    week_day = vdate.dt.dayofweek
    hour = vdate.dt.hour
    diff_days = (vdate - bd).dt.days
    # if binary, turn week/month into dummy variables
    return diff_days, week_day, hour, month, year

dn, wd, hr, mo, yr = dummy_stats(data['begin'],'1/1/2022')
data['Hour'] = hr
data['Month'] = mo
data['Year'] = yr

# Lets just look at months over time
data['MoYr'] = data['Year'] + data['Month']/100

Now finally onto the modeling stuff. For those familiar with regression, quantile regression instead of predicting the mean predicts a quantile of the distribution. Here I show predicting the 50th quantile (the median). For those not familiar with regression, this is not all that different than doing a pivot table/group by, but aggregating by quantiles instead of means. Regression is somewhat different than the simpler pivot table, since you “condition on” other continuous factors (here I “control for” hour of day), but in broad strokes is similar.
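
For example, the pivot table analogue of a quantile model with no other covariates is just a group-by with quantiles:

# group-by analogue of a covariate-free quantile model
data.groupby('PriorCat')['minutes'].quantile([0.1, 0.5, 0.9])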

Here I use a patsy “R style” formula, and fit a categorical covariate for the 0/1/2 categories, hour of day, and the time varying months over time (to see the general trends). The subsequent regression table is big, so will show in parts:

# Quantile regression for median
mod = smf.quantreg("minutes ~ C(PriorCat, Treatment(reference='2')) + C(Hour) + C(MoYr)", data)
res50 = mod.fit(q=0.5)
res50.summary()

First, I use priority 2 events as the referent category, so you can see (in predicting the median of the distribution) that priority 1 events have a median 24 minutes longer than priority 2, and priority 0 events have a median around two hours longer. You can see some interesting patterns in the hour of the day effects (which are the overall effects, not broken down by priority). So there are likely shift changes at 06:00, 14:00, and 22:00 that result in longer wait times.

But of most interest are patterns over time, here is the latter half of that table, showing median estimates over the months in this sample.

You could of course make the model more complicated, e.g. look at spatial effects or incorporate other direct measures of capacity/people on duty. But here it is complicated enough for an illustrative blog post. January-2019 is the referent category month, and we can see slight decreases of a few minutes around the start of the pandemic, but median times have clearly been increasing fairly noticeably starting later in 2021.

As opposed to interpreting regression coefficients, I think it is easier to see model predictions. We can just make sample data points, here at noon over the different months, and do predictions over each different priority category:

# Predictions for different categories
hour = 12
prior_cat = [0,1,2]
oos = data.groupby(['PriorCat','MoYr'],as_index=False)['Hour'].size()
oos['Hour'] = 12
oos['Q50'] = res50.predict(oos)

print(oos[oos['PriorCat'] == '0'])
print(oos[oos['PriorCat'] == '1'])
print(oos[oos['PriorCat'] == '2'])

So here for priority 0, the median has crept up from 130 to 143 minutes.

And for priority 1, median times went from 35 to 49 minutes.

Note that the way I estimated the regression equation, the increase/decrease per month is forced to be the same across the different priority calls. So, the increase among priority 2 calls is again around 13 minutes according to the model.

But this assumption is clearly wrong. Remember my earlier “fast” and “slow” example, with only the slow calls increasing. That would suggest the distributions for the priority calls will likely have different changes over time. E.g. priority 0 may increase by alot, but priority 2 will be almost the same. You could model this in the formula via an interaction effect, e.g. something like "minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", but to make the computer spit out a solution a bit faster, I will subset the data to just priority 2 calls.

Here the power of quantile regression is we can look at different distributions. Estimating extreme quantiles is tough, but looking at the 10th/90th (as well as the median) is pretty typical. I do those three quantiles, and generate model predictions over the months (again assuming a call at 12).

# To save time, I am only going to analyze
# Priority 2 calls now
p2 = data[data['PriorCat'] == '2'].copy()
m2 = smf.quantreg("minutes ~ C(MoYr) + C(Hour)", p2)
oos2 = oos[oos['PriorCat'] == '2'].copy()

# loop over different quantiles
qlist = [0.1, 0.5, 0.9]
for q in qlist:
    res = m2.fit(q=q)
    oos2[f'Q_{q}'] = res.predict(oos2)

oos2

So you can see my story about fast and slow calls plays out, even when restricted to purportedly high risk calls. When looking at just priority 2 calls in New Orleans, the 10th percentile stays very similar over the period, although it does have a slight increase from under 4 to almost 5 minutes. The 50th percentile has slightly more growth, but goes from 10 minutes to 13 minutes. The 90th percentile though has more volatility – it grew from 30 to 60 minutes in small increases in 2022, and late 2022 has fairly dramatic further growth to 70/90 minutes. And you can see how the prior model that did not break out priority 0/1 calls changed the estimate for the priority 2 left tail as well. (So those groups likely also had large shifts across the entire set.)

So my earlier scenario is overly simplistic, we can see some increase in the left tails of the distribution as well in this analysis. But, the majority of the increase is due to changes in the long right tail – calls that used to take less than 30 minutes are now taking 90 minutes to arrive. Which still likely has implications for satisfaction with police and reporting behavior, maybe not so much though with clearance or direct public safety.

No easy answers here in terms of giving internet advice to New Orleans. If working with NOLA, I would like to get estimates of officer capacity per shift, so I could incorporate into the quantile regression model that factor directly. That would allow you to precisely quantify how officer capacity impacts the distribution of response times. So not just “response times are going up” but “the decrease in capacity from A to B resulted in X increase in the 90th percentile of response times”. So if NOLA had goals set they could precisely state where officer capacity needed to be to have a shot of obtaining that goal.

Using docker to play with postgres

No major end of year updates. Life is boring (in a good way). My data science gig at Gainwell is going well. Still on occasion do scholarly type things (see my series on the American Society of Evidence Based Policing). Blog is still going strong, topping over 130k views this year.

As always, feel free to send me questions – I will just continue to post random tips over time related to things I am working on.


Post today is about using docker to run a personal postgres database. In the past I have attempted to install postgres on my personal windows machine, and this caused issues with other tool sets (in particular GIS tools that I believe rely on PostGIS somewhere under the hood). So a way around that is to install postgres in its own entirely isolated part of your machine. Using docker you don't have to worry about messing things up – you can always just destroy the image you build and start fresh.

For background to follow along, you need to 1) install docker desktop on your machine, and 2) have a python installation with sqlalchemy and psycopg2 in addition to the typical scientific stack (sqlalchemy may be installed by default on Anaconda distributions, I don't think psycopg2 is though).

For those not familiar, docker is a technology that lets you create virtual machines with certain specifications, e.g. something like “build an Ubuntu image with python and postgres installed”. Then you can do more complicated things, in data science we often are either creating an application server, or running batch jobs in such isolated environments. Here we will persist an image with postgres to test it out. (Both understanding docker and learning to work with databases and SQL are skills I expect more advanced data scientists to know.)

So first, again after you have docker installed, you can run something like:

docker run --name test_pg -p 5433:5432 -e POSTGRES_PASSWORD=mypass -d postgres

This pulls the base postgres image and runs a container (named test_pg) with postgres installed. Now you can get inside of that container, and add a schema/tables if you want:

docker exec -it test_pg bash
psql -U postgres
CREATE SCHEMA ts;
CREATE TABLE ts.test_tab (id serial primary key, val int, task text);
INSERT INTO ts.test_tab (val,task) values (1,'abc'), (2, 'def');
SELECT * FROM ts.test_tab;
\q

And you can do more complicated things, like install python and the postgres python extension.

# while still inside the postgres container
apt-get update
apt-get install -y python3 python3-pip
apt-get install -y postgresql-plpython3-15
psql -U postgres
CREATE EXTENSION plpython3u;
\q
# head back out of image entirely
exit

(You could create a dockerfile to do all of this as well, but I am just showing it step by step interactively here as a start. I recommend that link as a getting-your-feet-wet intro to docker as well.)

Now, back on your local machine, we can also interact with that same postgres database. Using python code:

import pandas as pd
import sqlalchemy
import psycopg2
import pickle

uid = 'postgres' # in real life, read these in as secrets
pwd = 'mypass'   # or via config files
ip = 'localhost:5433' # port is set on docker command

# depending on sqlalchemy version, it may be
# 'postgres' instead of 'postgresql'
conn_str = f'postgresql://{uid}:{pwd}@{ip}/postgres'
eng = sqlalchemy.create_engine(conn_str)

res = pd.read_sql('SELECT * FROM ts.test_tab',eng)
print(res)

And you can save a table into this same database:

res['new'] = [True,False]
res.to_sql('new_tab',schema='ts',con=eng,index=False,if_exists='replace')
pd.read_sql('SELECT * FROM ts.new_tab',eng)

And like I said, I like to have a version I can test things with. One example, I have tested a deployment pattern that passes around binary blobs for model artifacts.

# Can save a python object as binary blob
eng.execute('''create table ts.mod (note VARCHAR(50) UNIQUE, mod bytea);''')

def save_mod(model, note, eng=eng):
    bm = pickle.dumps(model)
    md = psycopg2.Binary(bm)
    in_sql = f'''INSERT INTO ts.mod (note, mod) VALUES ('{note}',{md});'''
    info = eng.url # for some reason sqlalchemy doesn't work for this
    # but using psycopg2 directly does
    pcn = psycopg2.connect(user=info.username,password=info.password,
                          host=info.host,port=info.port,dbname=info.database)
    cur = pcn.cursor()
    cur.execute(in_sql)
    pcn.commit()
    pcn.close()

def load_mod(note, eng=eng):
    sel_sql = f'''SELECT mod FROM ts.mod WHERE note = '{note}';'''
    res = pd.read_sql(sel_sql,eng)
    mod = pickle.loads(res['mod'][0])
    return mod

This does not work out of the box for objects that have more complicated underlying code, but for pure python stuff in sklearn it seems to work OK:

# Create random forest
import numpy as np
from sklearn.ensemble import RandomForestRegressor

train = np.array([[1,2],[3,4],[1,3]])
y = np.array([1,2,3])
mod = RandomForestRegressor(n_estimators=5,max_depth=2,random_state=0)
mod.fit(train,y)

save_mod(mod,'ModelRF1')
m2 = load_mod('ModelRF1')

# Showing the outputs are the same
mod.predict(train)
m2.predict(train)

You may say to yourself this is a crazy deployment pattern. And I would say I do what I need to do given the constraints I have sometimes! (I know more startups with big valuations that do SQL insertion to deploy models than ones with more sane deployment strategies. So I am not the only one.)

But this is really just playing around, like I said. You can try out more advanced stuff as well, such as using python inside postgres functions, or something like this postgres extension for ML models.

Startup scripts on windows

Just a very quick post today. For folks working on Unix machines, there is a bash profile script that runs when you first log into your user, and can set various environment variables, symlinks, or just run scripts in general. On windows, you can do something similar by placing program.bat files in the shell startup directory. (That linked post by Microsoft has the directions to find where that is located on your machine. Hit Windows Logo + R and type shell:startup into the prompt. It ends up being C:\Users\andre\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup on my personal machine.)

One I have is to map regular folders I use to particular drive letters (similar in outcomes to symlinks on Unix machines) using the subst command. So in that startup directory I have a file, ghub_map.bat, that contains this single line:

subst g: "D:\Dropbox\Dropbox\PublicCode_Git"

And so when I log into my machine, if I want to go do work on github I can just type G: in the windows terminal (to switch drives you don't use cd in windows). This is in place of the more onerous D: (the default terminal starts in C:), and then cd ./Dropbox/Dropbox/PublicCode_Git.

Happy holidays everyone!

Handling errors in python and R

Just some quick notes on error handling in python and R. For most of my batch processes at work (in python), error handling is necessary. Most of my code has logic that if it fails, it sends an email message about the failure. This is only possible if you capture errors and have conditional logic; if the code just fails without capturing the error (for both R and python), it will just exit the script.

I had another example recently in R that used placebo tests that could fail to converge. So a more traditional stat project, but it should just keep running the loop to collect results, not fail entirely.

Python Example

In python, for most of my batch jobs I have code that looks like this:

import traceback

try:
    # whatever function you are running
    send_success_email()
except Exception:
    er = traceback.format_exc()
    print(f'Error message is \n\n{er}')
    send_error_email(er)

So it fails gracefully, and just gives me a message either in REPL for interactive debugging or stdout for regularly run batch jobs. I can see for people running servers why they want more specific error handling and using a more formal logger, but IMO that is overkill for running batch jobs.

R Example

R to do error handling looks something like this:

# trycatch for errors
my_func <- function(){
    # try/catch if error
    out <- tryCatch(
       { # part with the potential error
         #r <- ???? #whatever code steps you want
        r
       }, error=function(cond){ 
          print("Function Failed, Error message is \n\n")
          print(cond)
          return(-1)
          } )
    return(out)
}

So if you have inside the tryCatch something like "fit complicated model" inside your simulations (that could fail), this will still fail gracefully (and can return the error message if you need it).

Counting lines of code

Was asked recently about how many lines of python code was in my most recent project. A simple command line check, cd into your project directory and run:

find -type f -name "*.py" | xargs wc -l

(If on windows, you can download the GOW tools to use these same tools that are available by default on unix/mac.) This will include whitespace and non-functional lines (like docstrings), but I think that is ok. Doing this for my current main project at Gainwell, I have about 30k lines of python code. I (and now about 4 other people) have been working on that code base for nearly a year.

For my first production project at (then) HMS, the total lines of python code are 20k, and I developed the bulk of that in around 7 months of work. Assuming 20 work days in a month, that results in around 20000/140 ~ 143 lines of code per workday. I did other projects during that time span, but this was definitely my main focus (and I was the sole developer/data scientist). I think that is high (more code is not necessarily better, overall code might have decreased as future development of this project happened over time), but is ballpark reasonable expectations for working data scientists (I would have guessed closer to around 100 per day). In the grand scheme of things, this is like 2 functions or unit tests per work day (when considering white space and docstrings).

Doing this for all of my python code on my personal machine gives around 60k lines (I say around, as I am removing counts for projects that are just cloned). And for all the python code on my work machine it is around 140k (for 3 years of work). (I am only giving fuzzy numbers – some projects are joint work I am half counting, and some cloned code I am not counting at all.)

Doing this same exercise for R code, I only get around 40k lines of code on my personal machine. For instance, my ptools package has under 3k lines of "*.R" code total. I am guessing this is due to not only R code being more concise than python, but also that taking code into production takes more work. Maybe worth another blog post, but the gist of the difference from an academic project is that there you need the code to run one time, whereas for a production project the code needs to keep running on a regular schedule indefinitely.

I have written much more SPSS code over my career than R code, but I have most of it archived on Dropbox, so I cannot easily get a count of the lines. I have a total of 1846 sps files (note that this command does not use xargs).

find -type f -name "*.sps" | wc -l

It is possible that the average sps file on my machine is 200 lines per file (it is definitely over 100 lines). So I don't think my recent python migration has eclipsed my cumulative SPSS work going back over a decade (but maybe in two more years it will).