Setting up pyspark to run sql tests

So hacker news the other day had a thread on how do you test your SQL code. This is such a nerd/professional data science type of question I love it (and many of the high upvoted answers are good, but lack code example/nitty gritty details, hence this blog post). I figured I would make my notes on this and illustrate some real code. You get a few things in this post:

  • how to set up spark locally on a windows machine
  • how to generate SQL tests in python with pytest
  • how to set up a github action CICD pipeline to test pyspark/sql

Setting up spark on Windows

First, before we get start, getting spark up and running on a windows machine is non-trivial. Maybe I should use Docker, like I do for postgres, but here are my notes on setting it up to run locally on Windows (much of this is my condensed notes + a few extras based on this DataCamp post).

  • Step 1: Install Java. Make sure where you install it has NO SPACES. Then create an environment variable, JAVA_HOME, and point it to where you installed Java. Mine for example is C:\java_jdk (no spaces).
  • Step 2: Install spark. Set the environment variable Spark_home to wherever you downloaded it, mine is set to C:\spark. Additionally set the environment variables PYSPARK_HOME to python.
  • Step 3: Set up your python environment, e.g. something like conda create --name spenv python=3.9 pip pyspark==3.2.1 pyarrow pandas pytest to follow along with this post. (Note to make pyspark and spark downloaded versions line up.)
  • Step 4: Then you need to download winutils.exe that matches your version. So I currently have spark=3.2.1, so I just do a google search for “winutils 3.2” and up pops this github repo with the version I need (minor versions probably don’t matter). (You do not need to double click to install, the exe file just needs to be available for other programs to find.) Then set the environment variable hadoop_home to the folder where you downloaded winutils.exe. Mine is C:\winutils.

I know that is alot – but it is a one time thing. Now once this is done, from the command line try to run:

pyspark --version

You should be greeted with something that looks like:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 11.0.15

Plus a bunch of other messages (related to logging errors on windows). (You can set the spark-defaults.conf and log4j.properties file to make these less obnoxious.)

But now you are good to go running spark locally on your windows machine. Note you get none of the reasons you want to use spark, such as distributing computations over multiple machines (this is equivalent to running computations on the hnode). But is convenient to do what I am going to show – testing your SQL code locally.

Testing SQL code in python

Now that we have pyspark set up, we can run a test in python. Instead of type python into the REPL, go ahead and run pyspark, so we are in a pyspark REPL session. You will again see the big Spark ascii art at the beginning. Here is an example of now creating a dummy temp table inside of our spark database (this is just a local temp table).

# Inside of pyspark REPL
# so spark is available
import pandas as pd

x = [1,2,3,4]
y = ['a','b','c','c']
df = pd.DataFrame(zip(x,y),columns=['x','y'])

# Can create a temp spark table based on fake data
sdf = spark.createDataFrame(df)
sdf.createOrReplaceTempView('test_table')

Now we can run our SQL against the spark database.

# sql to test
query = '''
SELECT
  SUM(x) AS cum_x,
  COUNT(y) AS n,
  y
FROM test_table
GROUP BY y'''

spark.sql(query).show()

And voila, you can now test your spark queries. Although this is showing for spark, you can do something very similar for different databases as well, e.g. using pyodbc or sqlalchemy. You would want to have a test dummy database you can write to though to do that (a few different databases have volatile or temp tables equivalent to spark here).

Setting up pytest + spark

So this shows in just a REPL session how to set up this, but we also want automated tests for multiple scripts via pytest. To illustrate, I am going to refer to my retenmod repo. Even though this library obviously has nothing directly to do with pyspark, I am using this repo as a way to test code/python packaging. (And note if using the same environment above, to replicate 100% of that repo, need to install pip install black flake jupyter jupyter_contrib_nbextensions pre-commit matplotlib.)

So now for pytest, a special thing is you can create python objects that are inherited for all of your tests. Here we will create the spark object one time, and then it can be passed on to your tests. So in ./tests make a file conftest.py, and in this file write:

# This is inside of conftest.py
import pytest
from pyspark.sql import SparkSession

# This sets the spark object for the entire session
@pytest.fixture(scope="session")
def spark():
    sp = SparkSession.builder.getOrCreate()
    # Then can do other stuff to set
    # the spark object
    sp.sparkContext.setLogLevel("ERROR")
    sp.conf.set("spark.sql.execution.array.pyspark.enabled", "true")
    return sp

In pytest lingo, fixtures are then passed around to the different test functions in your library (where appropriate). You can create the equivalent of spark in a pyspark session via SparkSession.builder. This is essentially what a pyspark session is doing under the hood to generate the spark object you can run queries against.

Now we can make a test function, here in a file named test_spark.py (in the same tests folder). That will then look something like:

# pseudo code to show off test
# in test_spark.py
import pandas as pd
from yourlibrary import func

test_table = ...code to create dummy table...

def test_sql(spark):
    # Create the temporary table in spark
    sdf = spark.createDataFrame(test_table)
    sdf.createOrReplaceTempView("test_table")
    # Test out our sql query
    query = func(...your parameters to generate sql...)
    res = spark.sql(query).toPandas()
    # No guarantees on row order in this result
    # so I sort pandas dataframe to make 100% sure
    res.sort_values(by="y", inplace=True, ignore_index=True)
    # Then I do tests
    # Type like tests, shape and column names
    assert res.shape == (3, 3)
    assert list(res) == ["cum_x", "n", "y"]
    # Mathy type tests
    dif_x = res["x"] != [1, 2, 7]
    assert dif_x.sum() == 0
    dif_n = res["n"] != [1, 1, 2]
    assert dif_n.sum() == 0

So note here that I pass in spark as an argument to the function. pytest then uses the fixture we defined earlier to pass into the function when the test is actually run. And then from the command line you can run pytest to check out if your tests pass! (I have placed in the retenmod package a function to show off generating a discrete time survival table in spark, will need to do another blog post though on survival analysis using SQL.)

So this shows off what most of my sql type tests look like. I test that the shape of the data is how I expected (given my fake dataframe I passed in), columns are correct, and that the actual values are what they should be. I find this useful especially to test edge cases (such as missing data) in my SQL queries. My functions in production are mostly parameterized SQL, but it would work the same if you had a series of text SQL files, just read them into python as strings.

Setting up github actions

As a bonus, I show how to set this up in an automatic CICD check via github actions. Check out the actions workflow file for the entire workflow in-situ. But the YAML file has a few extra steps compared to my prior post on generating a wheel file in github actions.

It is short enough an can just paste the whole YAML file for you to check out.

on:
  push:
    branches:
      - main

jobs:
  build_wheel:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v1
        with:
            java-version: 11
      - uses: vemonet/setup-spark@v1
        with:
          spark-version: '3.2.1'
          hadoop-version: '3.2'
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          # installing spark libraries
          pip install pyspark pyarrow pandas pytest
          # now install the wheel file
          find ./dist/*.whl | xargs pip install
          # Now run pytest tests
          pytest --disable-warnings ./tests
      - name: Configure Git
        run: |
          git config --global user.email "apwheele@gmail.com"
          git config --global user.name "apwheele"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push
      - run: echo "🍏 This job's status is ${{ job.status }}."

So here in addition to python, this also sets up Java and Spark. (I need to learn how to build actions myself, I wish vemonet/setup-spark@v1 had an option to cache the spark file instead of re-downloading everytime). Even with that though, it will only take a minute or two to run (the timeout is just a precaution and good idea to put in any github action).

You can then check this action out on github.

Happy SQL code testing for my fellow data scientists!

Web scraping police data using selenium and python

So I have a few posts in the past on scraping data. One shows downloading and parsing structured PDFs, almost all of the rest though use either JSON API backends, or just grab the HTML data directly. These are fairly straightforward to deal with in python. You generate the url directly, use requests, and then just parse the returned HTML however you want.

Came across a situation recently though where I needed to interact with the webpage. I figured a blog post to illustrate the process would be good. (For both myself and others!) So here I will illustrate entering data into San Antonio’s historical calls for service asp application (which I have seen several PDs use in the past).

It is tough for me to give general advice about scraping, it involves digging into the source code for a website. Here if you click on the Historical Calls button, the url stays the same, but presents you with a new form page to insert your search parameters:

This is a bit of a red-herring though, it ends up being the entire page is embedded in what is called an i-frame, so the host URL stays the same, but the window inside the webpage changes. On the prior opening page, if you hover over the link for Historical Calls you can see it points to https://webapp3.sanantonio.gov/policecalls/Reports.aspx, so that is page we really need to pay attention to.

So for general advice, using Chrome to view a web-pages source html, you can right-click and select view-source:

And you can also go into the Developer tools to check out all the items in a page as well.

Typically before worrying about selenium, I study the network tab in here. You want to pay attention to the items that take the longest/have the most data. Typically I am looking for JSON or text files here if I can’t scrape the data directly from the HTML. (Example blog posts grabbing an entire dump of data here, and another finding a hidden/undocumented JSON api using this approach.) Here is an example network call when inputting the search into the San Antonio web-app.

The data is all being transmitted inside of aspx application, not via JSON or other plain text files (don’t take my terminology here as authoritative, I really know near 0% about servers). So we will need to use selenium here. Using python you can install the selenium library, but you also need to download a driver (here I use chrome), and then wherever you save that exe file, add that location to your PATH environment variable.

Now you are ready for the python part.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import pandas as pd

# Setting Chrome Options
chrome_options = Options()
#chrome_options.add_argument("-- headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("log-level=3")

# Getting the base page
driver = webdriver.Chrome(options=chrome_options)
base_url = "https://webapp3.sanantonio.gov/policecalls/Reports.aspx"
driver = webdriver.Chrome(options=chrome_options)
driver.get(base_url)

Once you run this code, you will see a new browser pop-up. This is great for debugging, but once you get your script finalized, you can see I commented out a line to run in headerless (so it doesn’t bug you by flashing up the browser on your screen).

Now typically what I do is look at the HTML source (like I showed earlier), and then search for the input buttons in HTML. We are trying to figure out the elements we need to insert the data for us to submit a search. Here is the first input for an item we care about, the begin date of the search.

Now we can insert our own date by grabbing the element from the web-page. I grab it here by the “id” attribute in the HTML (many tutorials use xpath, which I am not as familiar with, but at least for these aspx apps what I show works fine). For dates that have a validation stage, you need to not only .send_keys, but to also submit to get past the date validation.

# Inserting date field for begin date
from_date = driver.find_element("id", "txtStart")
from_date.send_keys("10/01/2022")
from_date.submit()

Once you run that code you can actually view the web-page, and see that your date is entered! Now we need to do the same thing for the end date. Then we can put in a plain text zipcode. Since this does not have validation, we do not need to submit it.

# Now for end date
end_date = driver.find_element("id", "txtEndDate")
end_date.send_keys("10/02/2022")
end_date.submit()

# Now inserting text for zipcode
zip = driver.find_element("id", "txtZipcode")
zip.send_keys("78207")
# Sometimes need to clear, zip.clear()

I have a note there on clearing a text box. Sometimes websites have pre-filled options. Sometimes web-sites also do not like .clear(), and you can simulate backspace keystrokes directly. This website does not like it if you clear a date-field for example.

Now the last part, I am going to select a drop-down. If you go into the HTML source again, you can see the list of options.

And now we can use the Select function I imported at the beginning to select a particular element of that drop-down. Here I select the crimes against persons.

# Now selecting dropdown
crime_cat = driver.find_element("id", "ddlCategory")
crime_sel = Select(crime_cat)
crime_sel.select_by_visible_text("Crimes Against Person Calls")

Many of these applications have rate limits, so you need to limit the search to tiny windows and subsets, and then loop over the different sets you want to grab all of the data. (Being nice and using time.sleep() between calls to get the results.

Now we are ready to submit the query. The same way you can enter in text into input forms, buttons you can click are also labeled as inputs in the HTML. Here I find the submit button, and then .click() that button. (If there is a direct button to download CSV or some other format, it may make sense to click that button.)

# Now can find the View Data button and submit
view_data = driver.find_element("id", "btnSearch")
view_data.click()

Now that we have our web-page, we can get the HTML source directly and then parse that. Pandas has a nice method to grab tables, and this application is actually very nicely formatted. (I tend to not use this, as many webpages have some very bespoke tables that are hard to grab directly like this). This method grabs all the tables in the web-page by default, here I just want the calls for service table, which has an id of "gvCFS", which I can pass into the pandas .read_html function.

# Pandas has a nice option to read tables directly
html = driver.page_source
cfs = pd.read_html(html, attrs={"id":"gvCFS"})[0]

And that shows grabbing a single result. Of course to scrape, you will need to loop over many days (and here different search selections), depending on what data you want to grab. Most of these applications have search limits, so if you do too large a search, will only return the first say 500 results. And San Antonio’s is nice because it returns as a single table in the web-page, most you need to page the results though as well. Which takes further scraping the data and interacting with the page. So it is more painful whenever you need to resort to selenium.

Sometimes pages will point to PDF files, and you can set Chrome’s options to download to a particular location in that scenario (and then use os.rename to name the PDF whatever you want after it is downloaded). You can basically do anything in selenium you can manually, it is often just a tricky set of steps to replicate in code.

Using quantile regression to evaluate police response times

Jeff Asher recently had a post on analysis of response times across many agencies. One nitpick though (and ditto for prior analyses I have seen, such as Scott Mourtgos and company), is that you should not use linear models (or means in general) to describe response time distributions. They are very heavily right skewed, and so the mean tends to be not representative of the bulk of the data.

When evaluating changes in response time, imagine two simplistic scenarios. One, every single call increases by 5 minutes, so what used to be 5 is now 10, 20 is now 25, 60 is now 65, etc. That is probably not realistic for response times, it is probably calls in the tail (ones that take a very long time to wait for an opening in the queue) are what changes. E.g. 5 is still 5, 20 is still 20, but 60 is now 120. In the latter scenario, the left tail of the distribution does not change, only the right tail. In both scenarios the mean shifts.

I think a natural way to model the problem is instead of focusing on means, is to use quantile regression. It allows you to generalize the entire distribution (look at the left and right tails) and still control for individual level circumstances. Additionally, often emergency agencies set goals along the lines of “I want to respond to 90% of emergency events with X minutes”. Quantile regression is a great tool to describe that 90% make. Here I am going to show an example using the New Orleans calls for service data and python.

First, we can download the data right inside of python without saving it directly to disk. I am going to be showing off estimating quantile regression with the statsmodel library. I do the analysis for 19 through 22, but NOLA has calls for service going back to the early 2010s if folks are interested.

import pandas as pd
import statsmodels.formula.api as smf

# Download data, combo 19/20/21/22
y19 = 'https://data.nola.gov/api/views/qf6q-pp4b/rows.csv?accessType=DOWNLOAD'
y20 = 'https://data.nola.gov/api/views/hp7u-i9hf/rows.csv?accessType=DOWNLOAD'
y21 = 'https://data.nola.gov/api/views/3pha-hum9/rows.csv?accessType=DOWNLOAD'
y22 = 'https://data.nola.gov/api/views/nci8-thrr/rows.csv?accessType=DOWNLOAD'
yr_url = [y19,y20,y21,y22]
res_pd = [pd.read_csv(url) for url in yr_url]
data = pd.concat(res_pd,axis=0) # alittle over 1.7 million

Now we do some data munging. Here I eliminate self initiated events, as well as those with missing data. There then are just a handful of cases that have 0 minute arrivals, which to be consistent with Jeff’s post I also eliminate. I create a variable, minutes, that is the minutes between the time created and the time arrived on scene (not cleared).

# Prepping data
data = data[data['SelfInitiated'] == 'N'].copy() # no self init
data = data[~data['TimeArrive'].isna()].copy()   # some missing arrive
data['begin'] = pd.to_datetime(data['TimeCreate'])
data['end'] = pd.to_datetime(data['TimeArrive'])
dif = data['end'] - data['begin']
data['minutes'] = dif.dt.seconds/60
data = data[data['minutes'] > 0].copy() # just a few left over 0s

# Lets look at the distribution
data['minutes'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

For quantiles, for the entire sample the median time is around 20 minutes, the 10th percentile is under 3 minutes and the 90th percentile is around 5 hours. Using the mean here (which in Jeff’s post varies from 50 to 146 minutes over the same 4 year period), can be somewhat misleading.

An important component of response times is differentiating between different priority calls. NOLA in their data, higher numbers are higher priority. Zero priority are things NOLA says don’t necessarily need an officer at all. So it could be those “0 priority” calls are really just dragging the overall average down over time, although they may have little to do with clearance rates or public safety overall. The priority category fields also has sub-categories, e.g. 1A is higher priority than 1B. To keep the post simple I just breakdown by integer leading values, not the sub letter-categories.

# Priority just do 1/2/3
# 3 is highest priority
data['PriorCat'] = data['Priority'].str[0]
# Only 5 cases of 3s, will eliminate these as well
data.groupby('PriorCat')['minutes'].describe()

Here you can really see the right skewness – priority 2 calls the mean is 25 minutes, but the median is under 10 minutes for the entire sample. A benefit of quantile regression I will use in a bit, the few outlying cases (beyond the quantiles of interest), really don’t impact the analysis. So those cases that take almost 24 hours (I imagine they are just auto-filled in like that in the data), really don’t impact estimates of smaller quantiles. But they can have noticeable influence on mean estimates.

Some final data munging, to further simplify I drop the 16 cases of priority 3s and 4s, and add in a few more categorical covariates for hour of the day, and look at months over time as categorical. (These decisions are more so to make the results easier to parse in a blog post in simpler tables, it would take more work to model a non-linear continuous over time variable, say via a spline, and make a reasonable ordinal encoding for the sub-priority categories.)

# only worry about 0/1/2s
data = data[data['PriorCat'].isin(['0','1','2'])].copy()
# Total in the end almost 600k cases

# Some factor date variables
def dummy_stats(vdate,begin_date):
    bd = pd.to_datetime(begin_date)
    year = vdate.dt.year
    month = vdate.dt.month
    week_day = vdate.dt.dayofweek
    hour = vdate.dt.hour
    diff_days = (vdate - bd).dt.days
    # if binary, turn week/month into dummy variables
    return diff_days, week_day, hour, month, year

dn, wd, hr, mo, yr = dummy_stats(data['begin'],'1/1/2022')
data['Hour'] = hr
data['Month'] = mo
data['Year'] = yr

# Lets just look at months over time
data['MoYr'] = data['Year'] + data['Month']/100

Now finally onto the modeling stuff. For those familiar with regression, quantile regression instead of predicting the mean predicts a quantile of the distribution. Here I show predicting the 50th quantile (the median). For those not familiar with regression, this is not all that different than doing a pivot table/group by, but aggregating by quantiles instead of means. Regression is somewhat different than the simpler pivot table, since you “condition on” other continuous factors (here I “control for” hour of day), but in broad strokes is similar.

Here I use a patsy “R style” formula, and fit a categorical covariate for the 0/1/2 categories, hour of day, and the time varying months over time (to see the general trends). The subsequent regression table is big, so will show in parts:

# Quantile regression for median
mod = smf.quantreg("minutes ~ C(PriorCat, Treatment(reference='2')) + C(Hour) + C(MoYr)", data)
res50 = mod.fit(q=0.5)
res50.summary()

First, I use 2 priority events as the referent category, so you can see (in predicting the median of the distribution), priority 1 events have a median 24 minutes longer than priority 2, and priority 0 have a median two hours later. You can see some interesting patterns in the hour of the day effects (which are for the overall effects, not broken down by priority). So there are likely shift changes at 06:00, 14:00, and 22:00 that result in longer wait times.

But of most interest are patterns over time, here is the latter half of that table, showing median estimates over the months in this sample.

You could of course make the model more complicated, e.g. look at spatial effects or incorporate other direct measures of capacity/people on duty. But here it is complicated enough for an illustrative blog post. January-2019 is the referent category month, and we can see some slight decreases in a few minutes around the start of the pandemic, but have been clearly been increasing at the median time fairly noticeably starting later in 2021.

As opposed to interpreting regression coefficients, I think it is easier to see model predictions. We can just make sample data points, here at noon over the different months, and do predictions over each different priority category:

# Predictions for different categories
hour = 12
prior_cat = [0,1,2]
oos = data.groupby(['PriorCat','MoYr'],as_index=False)['Hour'].size()
oos['Hour'] = 12
oos['Q50'] = res50.predict(oos)

print(oos[oos['PriorCat'] == '0'])
print(oos[oos['PriorCat'] == '1'])
print(oos[oos['PriorCat'] == '2'])

So here for priority 0, 130 has creeped up to 143.

And for priority 1, median times 35 to 49.

Note that the way I estimated the regression equation, the increase/decrease per month is forced to be the same across the different priority calls. So, the increase among priority 2 calls is again around 13 minutes according to the model.

But this assumption is clearly wrong. Remember my earlier “fast” and “slow” example, with only the slow calls increasing. That would suggest the distributions for the priority calls will likely have different changes over time. E.g. priority 0 may increase by alot, but priority 2 will be almost the same. You could model this in the formula via an interaction effect, e.g. something like "minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", but to make the computer spit out a solution a bit faster, I will subset the data to just priority 2 calls.

Here the power of quantile regression is we can look at different distributions. Estimating extreme quantiles is tough, but looking at the 10th/90th (as well as the median) is pretty typical. I do those three quantiles, and generate model predictions over the months (again assuming a call at 12).

# To save time, I am only going to analyze
# Priority 2 calls now
p2 = data[data['PriorCat'] == '2'].copy()
m2 = smf.quantreg("minutes ~ C(MoYr) + C(Hour)", p2)
oos2 = oos[oos['PriorCat'] == '2'].copy()

# loop over different quantiles
qlist = [0.1, 0.5, 0.9]
for q in qlist:
    res = m2.fit(q=q)
    oos2[f'Q_{q}'] = res.predict(oos2)

oos2

So you can see my story about fast and slow calls plays out, although even when restricted to purportedly high risk calls. When looking at just priority 2 calls in New Orleans, the 10th percentile stays very similar over the period, although does have a slight increase from under 4 to almost 5 minutes. The 50th percentile has slightly more growth, but is from 10 minutes to 13 minutes. The 90th percentile though has more volatility – grew from 30 to 60 in small increases in 2022, and late 2022 has fairly dramatic further growth to 70/90 minutes. And you can see how the prior model that did not break out priority 0/1 calls changed this estimate for the left tail for the priority 2 left tail as well. (So those groups likely also had large shifts across the entire set.)

So my earlier scenario is overly simplistic, we can see some increase in the left tails of the distribution as well in this analysis. But, the majority of the increase is due to changes in the long right tail – calls that used to take less than 30 minutes are now taking 90 minutes to arrive. Which still likely has implications for satisfaction with police and reporting behavior, maybe not so much though with clearance or direct public safety.

No easy answers here in terms of giving internet advice to New Orleans. If working with NOLA, I would like to get estimates of officer capacity per shift, so I could incorporate into the quantile regression model that factor directly. That would allow you to precisely quantify how officer capacity impacts the distribution of response times. So not just “response times are going up” but “the decrease in capacity from A to B resulted in X increase in the 90th percentile of response times”. So if NOLA had goals set they could precisely state where officer capacity needed to be to have a shot of obtaining that goal.

Where are they now? Job outcomes for recent SUNY crim Phds

The other day I noticed one of my PhD cohort mates, like me, took a private sector data science job. So of the 6 that finished their Phds in my cohort, 2 of us are now in private sector and the rest are professors. I was curious the overall rate for a larger sample.

There is probably some better official source, but I was able to do a search in Proquest dissertations (SUNY we needed to submit it there), for "State University of New York at Albany" AND "School of Criminal Justice" published between 2010 through 2020 and it scooped up a pretty good sample (with a few false positives I eliminated). I then added in a few people I noticed missing in that set, in the end 69 total over the 11 years (6 defenses per year actually seemed high to me). (Using the WayBack machine you can look at old Phd profiles or the old list of dissertations, but I am not sure of the completeness of either.) Then I filled in their current main job best I could into professor, private sector, university research center, think tank, government (and a few I did not even hazard a guess), based on LinkedIn/google searches/personal knowledge.

Here is the spreadsheet, let me know if you think I miscategorized you or your dissertation is missing altogether. Filtering based on the year of the dissertation is not the same as cohort (you could have started along time ago and more recently defended), but looks to me a pretty reasonable sample of “recent” Phd’s from SUNY Albany Criminal Justice program. Also missing at this Proquest search phase is likely to be missing at random (the few who were not scooped up in my search I see no reason to think are systematic based on Proquest’s idiosyncratic search). But missing in terms of me being able to look once you are in the sample is not (since if you are a professor you probably come up in a general google search for your university).

I tended to be liberal for who I listed as professor (this includes temp teaching jobs and postdocs, but not people who are adjuncts). Many people not in the professor list though were formerly professors (myself included), but tried to figure out the current main job for individuals.

The breakdown for the 69 dissertations is then:

Prof          34  49%
Gov           18  26%
Private        6   9%
Univ Research  3   4%
Think Tank     1   1%
Don't Know     7  10%

So private sector is lower overall than in my cohort, only 10% over the time period (and highest possible sample estimate is 19%, if all 7 don’t know are actually in private sector). Government jobs being at 26% I don’t find surprising, think tank and private is lower than I would have guessed though.

But from this I take away around 50% of recent PhDs in criminal justice from SUNY go on to be professors. For prospective PhDs, this estimate is also conditional on completing the PhD (they aren’t in the sample if they did not finish). If you include those individuals Gov/Private would go up in overall proportions.

Again if missing in the list or miscategorized let me know and I will update the post.

Using docker to play with postgres

No major end of year updates. Life is boring (in a good way). My data science gig at Gainwell is going well. Still on occasion do scholarly type things (see my series on the American Society of Evidence Based Policing). Blog is still going strong, topping over 130k views this year.

As always, feel free to send me question, will just continue to post random tips over time related to things I am working on.


Post today is about using docker to run a personal postgres database. In the past I have attempted to install postgres on my personal windows machine, and this caused issues with other tool sets (in particular GIS tools that I believe rely on PostGIS somewhere under the hood). So a way around that is to install postgres on its entirely own isolated part of your machine. Using docker you don’t have to worry about messing things up – you can always just destroy the image you build and start fresh.

For background to follow along, you need to 1) install docker desktop on your machine, 2) have a python installation with in addition to the typical scientific stack sqlalchemy and psycopg2 (sqlalchemy may by default be installed on Anaconda distributions, I don’t think pyscopg2 is though).

For those not familiar, docker is a technology that lets you create virtual machines with certain specifications, e.g. something like “build an Ubuntu image with python and postgres installed”. Then you can do more complicated things, in data science we often are either creating an application server, or running batch jobs in such isolated environments. Here we will persist an image with postgres to test it out. (Both understanding docker and learning to work with databases and SQL are skills I expect more advanced data scientists to know.)

So first, again after you have docker installed, you can run something like:

docker run --name test_pg -p 5433:5432 -e POSTGRES_PASSWORD=mypass -d postgres

To create a default docker image that has postgres installed. This just pulls the base image postgres. Now you can get inside of that image, and add a schema/tables if you want:

docker exec -it test_pg bash
psql -U postgres
CREATE SCHEMA ts;
CREATE TABLE ts.test_tab (id serial primary key, val int, task text);
INSERT INTO ts.test_tab (val,task) values (1,'abc'), (2, 'def');
SELECT * FROM ts.test_tab;
\q

And you can do more complicated things, like install python and the postgres python extension.

# while still inside the postgres docker image
apt-get update
apt install python3 pip
apt-get install postgresql-plpython3-15
psql -U postgres
CREATE EXTENSION plpython3u;
\q
# head back out of image entirely
exit

(You could create a dockerfile to do all of this as well, but just showing step by step interactive here as a start. I recommend that link as a getting feet wet with docker as well.)

Now, back on your local machine, we can also interact with that same postgres database. Using python code:

import pandas as pd
import sqlalchemy
import psycopg2
import pickle

uid = 'postgres' # in real life, read these in as secrets
pwd = 'mypass'   # or via config files
ip = 'localhost:5433' # port is set on docker command

# depending on sqlalchemy version, it may be
# 'postgres' insteal of 'postgresql'
conn_str = f'postgresql://{uid}:{pwd}@{ip}/postgres'
eng = sqlalchemy.create_engine(conn_str)

res = pd.read_sql('SELECT * FROM ts.test_tab',eng)
print(res)

And you can save a table into this same database:

res['new'] = [True,False]
res.to_sql('new_tab',schema='ts',con=eng,index=False,if_exists='replace')
pd.read_sql('SELECT * FROM ts.new_tab',eng)

And like I said, I like to have a version I can test things with. One example, I have tested a deployment pattern that passes around binary blobs for model artifacts.

# Can save a python object as binary blob
eng.execute('''create table ts.mod (note VARCHAR(50) UNIQUE, mod bytea);''')

def save_mod(model, note, eng=eng):
    bm = pickle.dumps(model)
    md = psycopg2.Binary(bm)
    in_sql = f'''INSERT INTO ts.mod (note, mod) VALUES ('{note}',{md});'''
    info = eng.url # for some reason sqlalchemy doesn't work for this
    # but using psycopg2 directly does
    pcn = psycopg2.connect(user=info.username,password=info.password,
                          host=info.host,port=info.port,dbname=info.database)
    cur = pcn.cursor()
    cur.execute(in_sql)
    pcn.commit()
    pcn.close()

def load_mod(note, eng=eng):
    sel_sql = f'''SELECT mod FROM ts.mod WHERE note = '{note}';'''
    res = pd.read_sql(sel_sql,eng)
    mod = pickle.loads(res['mod'][0])
    return mod

This does not work out of the box for objects that have more complicated underlying code, but pure python stuff in sklearn it seems to work OK:

# Create random forest
import numpy as np
from sklearn.ensemble import RandomForestRegressor

train = np.array([[1,2],[3,4],[1,3]])
y = np.array([1,2,3])
mod = RandomForestRegressor(n_estimators=5,max_depth=2,random_state=0)
mod.fit(train,y)

save_mod(mod,'ModelRF1')
m2 = load_mod('ModelRF1')

# Showing the outputs are the same
mod.predict(train)
m2.predict(train)

You may say to yourself this is a crazy deployment pattern. And I would say I do what I need to do given the constraints I have sometimes! (I know more startups with big valuations that do SQL insertion to deploy models than ones with more sane deployment strategies. So I am not the only one.)

But this is really just playing around like I said. But you can try out more advanced stuff as well, such as using python inside postgres functions, or something like this postgres extension for ML models.

Startup scripts on windows

Just a very quick post today. For folks working on Unix machines, there is a bash profile script that runs when you first log into your user, and can set various environment variables, symlinks, or just run scripts in general. On windows, you can do something similar by placing program.bat files in the shell startup directory. (That linked post by Microsoft has the directions to find where that is located on your machine. Hit Windows Logo + R and type shell:startup into the prompt. It ends up being C:\Users\andre\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup on my personal machine.)

One I have is to map regular folders I use to particular drive letters (similar in outcomes to symlinks on Unix machines) using the subst command. So in that startup directory I have a file, ghub_map.bat, that contains this single line:

subst g: "D:\Dropbox\Dropbox\PublicCode_Git"

And so when I log into my machine, if I want to go and do work on github I can just do type G: in the windows terminal (to switch drives you don’t use cd in windows). This is in place of the more onerous, D: (default terminal starts in C:), and then cd ./Dropbox/Dropbox/PublicCode_Git.

Happy holidays everyone!

Handling errors in python and R

Just some quick notes on error handling in python and R. For most of my batch processes at work (in python), error handling is necessary. Most of my code has logic that if it fails, it sends an email message about the failure. This is only possible if you capture errors and have conditional logic, if the code just fails without capturing the error (for both R and python) it will just exit the script.

I had another example recently in R that used placebo tests that could fail to converge. So a more traditional stat project, but it should just keep running the loop to collect results, not fail entirely.

Python Example

In python, for most of my batch jobs I have code that looks like this:

import traceback

try:
    # whatever function you are running
    send_success_email()
except Exception:
    er = traceback.format_exc()
    print('Error message is \n\n{er}')
    send_error_email(er)

So it fails gracefully, and just gives me a message either in REPL for interactive debugging or stdout for regularly run batch jobs. I can see for people running servers why they want more specific error handling and using a more formal logger, but IMO that is overkill for running batch jobs.

R Example

R to do error handling looks something like this:

# trycatch for errors
my_func <- function(){
    # try/catch if error
    out <- tryCatch(
       { # part with the potential error
         #r <- ???? #whatever code steps you want
        r
       }, error=function(cond){ 
          print("Function Failed, Error message is \n\n")
          print(Cond)
          return(-1)
          } )
    return(out)
}

So if you have inside the tryCatch something that is “fit complicated model” inside your simulations (that could fail), this will still fail gracefully (and can return the error message if you need to.

Counting lines of code

Was asked recently about how many lines of python code was in my most recent project. A simple command line check, cd into your project directory and run:

find -type f -name "*.py" | xargs wc -l

(If on windows, you can download the GOW tools to be able to use these same tools by default available on unix/mac.) This will include whitespace and non-functional lines (like docstrings), but that I think is ok. Doing this for my current main project at Gainwell, I have about 30k lines of python code. Myself (and now about 4 other people) have been working on that code base for nearly a year.

For my first production project at (then) HMS, the total lines of python code are 20k, and I developed the bulk of that in around 7 months of work. Assuming 20 work days in a month, that results in around 20000/140 ~ 143 lines of code per workday. I did other projects during that time span, but this was definitely my main focus (and I was the sole developer/data scientist). I think that is high (more code is not necessarily better, overall code might have decreased as future development of this project happened over time), but is ballpark reasonable expectations for working data scientists (I would have guessed closer to around 100 per day). In the grand scheme of things, this is like 2 functions or unit tests per work day (when considering white space and docstrings).

Doing this for all of my python code on my personal machine is around 60k (I do around, as I am removing counts for projects that are just cloned). And for all the python code on my work machine is around 140k (for 3 years of work). (I am only giving fuzzy numbers, I have some projects joint work I am half counting, and some cloned code I am not counting at all.)

Doing this same exercise for R code, I only get around 40k lines of code on my personal machine. For instance, my ptools package has under 3k lines of "*.R" code total. I am guessing this is due to not only R code being more precise than python, but to take code into production takes more work. Maybe worth another blog post, but the gist of the difference between an academic project is that you need the code to run one time, whereas for a production project the code needs to keep running on a regular schedule indefinitely.

I have written much more SPSS code over my career than R code, but I have most of it archived on Dropbox, so cannot easily get a count of the lines. I have a total of 1846 sps files (note that this does not use xargs).

find -type f -name "*.sps" | wc -l

It is possible that the average sps file on my machine is 200 lines per file (it definitely is over 100 lines). So my recent python migration I don’t think has eclipsed my cumulative SPSS work going back over a decade (but maybe in two more years will).

Random notes, digital art, and pairwise comparisons is polynomial

So not too much in the hopper for the blog at the moment. Have just a bunch of half-baked ideas (random python tips, maybe some crime analysis using osmnx, scraping javascript apps using selenium, normal nerd data science stuff).

Still continuing my blog series on the American Society of Evidence Based Policing, and will have a new post out next week on officer use of force.

If you have any suggestions for topics always feel free to ask me anything!


Working on some random digital art (somewhat focused on maps but not entirely). For other random suggestions I like OptArt and Rick Wicklin’s posts.

Dall-E is impressive, and since it has an explicit goal of creating artwork I think it is a neat idea. Chat bots I have nothing good to say. Computer scientists working on them seem to be under the impression that if you build a large/good enough language model out pops general intelligence. Wee bit skeptical of that.


At work a co-worker was working on timing applications for a particular graph-database/edge-detection project. Initial timings on fake data were not looking so good. Here we have number of nodes and timings for the application:

  Nodes    Minutes
   1000       0.16
  10000       0.25
 100000       1.5
1000000      51

Offhand people often speak about exponential functions (or growth), but here what I expect is we are really looking at is pairwise comparisons (not totally familiar with the tech the other data scientist is using, so I am guessing the algorithmic complexity). So this likely scales something like (where n is the number of nodes in the graph):

Time = Fixed + C1*(n) + C2*(n choose 2) + e

Fixed is just a small constant, C1 is setting up the initial node database, and C2 is the edge detection which I am guessing uses pairwise comparisons, (n choose 2). We can rewrite this to show that the binomial coefficient is really polynomial time (not exponential) in terms of just the number of nodes.

C2*[n choose 2] = C2*[{n*(n-1)}/2]
                  C2*[ (n^2 - n)/2 ]
                  C2/2*[n^2 - n]
                  C2/2*n^2 - C2/2*n

And so we can rewrite our original equation in terms of simply n:

Time = Fixed + (C1 - C2/2)*N + C2/2*N^2

Doing some simple R code, we can estimate our equation:

n <- 10^(3:6)
m <- c(0.16,0.25,1.5,51)
poly_mod <- lm(m ~ n + I(n^2))

Since this fits 3 parameters with only 4 observations, the fit is (not surprisingly) quite good. Which to be clear does not mean much, if really cared would do much more sampling (or read the docs more closely about the underlying tech involved):

> pred <- predict(poly_mod)
> cbind(n,m,pred)
      n     m       pred
1 1e+03  0.16  0.1608911
2 1e+04  0.25  0.2490109
3 1e+05  1.50  1.5000989
4 1e+06 51.00 50.9999991

And if you do instead poly_2 <- lm(m ~ n + choose(n,2)) you get a change in scale of the coefficients, but the same predictions.

We really need this to scale in our application at work to maybe over 100 million records, so what would we predict in terms of minutes based on these initial timings?

> nd = data.frame(n=10^(7:8))
> predict(poly_mod,nd)/60 # convert to hours
         1          2
  70.74835 6934.56850

So doing 10 million records will take a few days, and doing 100 million will be close to 300 days.

With only 4 observations not much to chew over (really it is too few to say it should be a different model). I am wondering though how to best handle errors for these types of extrapolations. Errors are probably not homoskedastic for such timing models (error will be larger for larger number of nodes). Maybe better to use quantile regression (and model the median?). I am not sure (and that advice I think will also apply to modeling exponential growth as well).

Preprint: Analysis of LED street light conversions on firearm crimes in Dallas, Texas

I have a new pre-print out, Analysis of LED street light conversions on firearm crimes in Dallas, Texas. This work was conducted in collaboration with the Child Poverty Action Lab, in reference to the Dallas Taskforce report. Instead of installing the new lights though at hotspots that CPAL suggested, Dallas stepped up conversion of street lamps to LED. Here is the temporal number of conversions over time:

And here is an aggregated quadrat map at quarter square mile grid cells (of the total number of LED conversions):

I use a diff-in-diff design (compare firearm crimes in daytime vs nighttime) to test whether the cumulative LED conversions led to reduced firearm crimes at nighttime. Overall I don’t find any compelling evidence that firearm crimes were reduced post LED installs (for a single effect or looking at spatial heterogeneity). This graph shows in the aggregate the DiD parallel trends assumption holds citywide (on the log scale), but the identification strategy really relies on the DiD assumption within each grid cell (any good advice for graphically showing that with noisy low count data for many units I am all ears!).

For now just wanted to share the pre-print. To publish in peer-review I would need to do a bunch more work to get the lit review where most CJ reviewers would want it. Also want to work on spatial covariance adjustments (similar to here, but for GLM models). Have some R code started for that, but needs much more work/testing before ready for primetime. (Although as I say in the pre-print, these should just make standard errors larger, they won’t impact the point estimates.)

So no guarantees that will be done in anytime in the near future. But no reason to not share the pre-print in the meantime.