Web scraping police data using selenium and python

So I have a few posts in the past on scraping data. One shows downloading and parsing structured PDFs, almost all of the rest though use either JSON API backends, or just grab the HTML data directly. These are fairly straightforward to deal with in python. You generate the url directly, use requests, and then just parse the returned HTML however you want.

Came across a situation recently though where I needed to interact with the webpage. I figured a blog post to illustrate the process would be good. (For both myself and others!) So here I will illustrate entering data into San Antonio’s historical calls for service asp application (which I have seen several PDs use in the past).

It is tough for me to give general advice about scraping, it involves digging into the source code for a website. Here if you click on the Historical Calls button, the url stays the same, but presents you with a new form page to insert your search parameters:

This is a bit of a red-herring though, it ends up being the entire page is embedded in what is called an i-frame, so the host URL stays the same, but the window inside the webpage changes. On the prior opening page, if you hover over the link for Historical Calls you can see it points to https://webapp3.sanantonio.gov/policecalls/Reports.aspx, so that is page we really need to pay attention to.

So for general advice, using Chrome to view a web-pages source html, you can right-click and select view-source:

And you can also go into the Developer tools to check out all the items in a page as well.

Typically before worrying about selenium, I study the network tab in here. You want to pay attention to the items that take the longest/have the most data. Typically I am looking for JSON or text files here if I can’t scrape the data directly from the HTML. (Example blog posts grabbing an entire dump of data here, and another finding a hidden/undocumented JSON api using this approach.) Here is an example network call when inputting the search into the San Antonio web-app.

The data is all being transmitted inside of aspx application, not via JSON or other plain text files (don’t take my terminology here as authoritative, I really know near 0% about servers). So we will need to use selenium here. Using python you can install the selenium library, but you also need to download a driver (here I use chrome), and then wherever you save that exe file, add that location to your PATH environment variable.

Now you are ready for the python part.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import pandas as pd

# Setting Chrome Options
chrome_options = Options()
#chrome_options.add_argument("-- headless")

# Getting the base page
driver = webdriver.Chrome(options=chrome_options)
base_url = "https://webapp3.sanantonio.gov/policecalls/Reports.aspx"
driver = webdriver.Chrome(options=chrome_options)

Once you run this code, you will see a new browser pop-up. This is great for debugging, but once you get your script finalized, you can see I commented out a line to run in headerless (so it doesn’t bug you by flashing up the browser on your screen).

Now typically what I do is look at the HTML source (like I showed earlier), and then search for the input buttons in HTML. We are trying to figure out the elements we need to insert the data for us to submit a search. Here is the first input for an item we care about, the begin date of the search.

Now we can insert our own date by grabbing the element from the web-page. I grab it here by the “id” attribute in the HTML (many tutorials use xpath, which I am not as familiar with, but at least for these aspx apps what I show works fine). For dates that have a validation stage, you need to not only .send_keys, but to also submit to get past the date validation.

# Inserting date field for begin date
from_date = driver.find_element("id", "txtStart")

Once you run that code you can actually view the web-page, and see that your date is entered! Now we need to do the same thing for the end date. Then we can put in a plain text zipcode. Since this does not have validation, we do not need to submit it.

# Now for end date
end_date = driver.find_element("id", "txtEndDate")

# Now inserting text for zipcode
zip = driver.find_element("id", "txtZipcode")
# Sometimes need to clear, zip.clear()

I have a note there on clearing a text box. Sometimes websites have pre-filled options. Sometimes web-sites also do not like .clear(), and you can simulate backspace keystrokes directly. This website does not like it if you clear a date-field for example.

Now the last part, I am going to select a drop-down. If you go into the HTML source again, you can see the list of options.

And now we can use the Select function I imported at the beginning to select a particular element of that drop-down. Here I select the crimes against persons.

# Now selecting dropdown
crime_cat = driver.find_element("id", "ddlCategory")
crime_sel = Select(crime_cat)
crime_sel.select_by_visible_text("Crimes Against Person Calls")

Many of these applications have rate limits, so you need to limit the search to tiny windows and subsets, and then loop over the different sets you want to grab all of the data. (Being nice and using time.sleep() between calls to get the results.

Now we are ready to submit the query. The same way you can enter in text into input forms, buttons you can click are also labeled as inputs in the HTML. Here I find the submit button, and then .click() that button. (If there is a direct button to download CSV or some other format, it may make sense to click that button.)

# Now can find the View Data button and submit
view_data = driver.find_element("id", "btnSearch")

Now that we have our web-page, we can get the HTML source directly and then parse that. Pandas has a nice method to grab tables, and this application is actually very nicely formatted. (I tend to not use this, as many webpages have some very bespoke tables that are hard to grab directly like this). This method grabs all the tables in the web-page by default, here I just want the calls for service table, which has an id of "gvCFS", which I can pass into the pandas .read_html function.

# Pandas has a nice option to read tables directly
html = driver.page_source
cfs = pd.read_html(html, attrs={"id":"gvCFS"})[0]

And that shows grabbing a single result. Of course to scrape, you will need to loop over many days (and here different search selections), depending on what data you want to grab. Most of these applications have search limits, so if you do too large a search, will only return the first say 500 results. And San Antonio’s is nice because it returns as a single table in the web-page, most you need to page the results though as well. Which takes further scraping the data and interacting with the page. So it is more painful whenever you need to resort to selenium.

Sometimes pages will point to PDF files, and you can set Chrome’s options to download to a particular location in that scenario (and then use os.rename to name the PDF whatever you want after it is downloaded). You can basically do anything in selenium you can manually, it is often just a tricky set of steps to replicate in code.

Using quantile regression to evaluate police response times

Jeff Asher recently had a post on analysis of response times across many agencies. One nitpick though (and ditto for prior analyses I have seen, such as Scott Mourtgos and company), is that you should not use linear models (or means in general) to describe response time distributions. They are very heavily right skewed, and so the mean tends to be not representative of the bulk of the data.

When evaluating changes in response time, imagine two simplistic scenarios. One, every single call increases by 5 minutes, so what used to be 5 is now 10, 20 is now 25, 60 is now 65, etc. That is probably not realistic for response times, it is probably calls in the tail (ones that take a very long time to wait for an opening in the queue) are what changes. E.g. 5 is still 5, 20 is still 20, but 60 is now 120. In the latter scenario, the left tail of the distribution does not change, only the right tail. In both scenarios the mean shifts.

I think a natural way to model the problem is instead of focusing on means, is to use quantile regression. It allows you to generalize the entire distribution (look at the left and right tails) and still control for individual level circumstances. Additionally, often emergency agencies set goals along the lines of “I want to respond to 90% of emergency events with X minutes”. Quantile regression is a great tool to describe that 90% make. Here I am going to show an example using the New Orleans calls for service data and python.

First, we can download the data right inside of python without saving it directly to disk. I am going to be showing off estimating quantile regression with the statsmodel library. I do the analysis for 19 through 22, but NOLA has calls for service going back to the early 2010s if folks are interested.

import pandas as pd
import statsmodels.formula.api as smf

# Download data, combo 19/20/21/22
y19 = 'https://data.nola.gov/api/views/qf6q-pp4b/rows.csv?accessType=DOWNLOAD'
y20 = 'https://data.nola.gov/api/views/hp7u-i9hf/rows.csv?accessType=DOWNLOAD'
y21 = 'https://data.nola.gov/api/views/3pha-hum9/rows.csv?accessType=DOWNLOAD'
y22 = 'https://data.nola.gov/api/views/nci8-thrr/rows.csv?accessType=DOWNLOAD'
yr_url = [y19,y20,y21,y22]
res_pd = [pd.read_csv(url) for url in yr_url]
data = pd.concat(res_pd,axis=0) # alittle over 1.7 million

Now we do some data munging. Here I eliminate self initiated events, as well as those with missing data. There then are just a handful of cases that have 0 minute arrivals, which to be consistent with Jeff’s post I also eliminate. I create a variable, minutes, that is the minutes between the time created and the time arrived on scene (not cleared).

# Prepping data
data = data[data['SelfInitiated'] == 'N'].copy() # no self init
data = data[~data['TimeArrive'].isna()].copy()   # some missing arrive
data['begin'] = pd.to_datetime(data['TimeCreate'])
data['end'] = pd.to_datetime(data['TimeArrive'])
dif = data['end'] - data['begin']
data['minutes'] = dif.dt.seconds/60
data = data[data['minutes'] > 0].copy() # just a few left over 0s

# Lets look at the distribution

For quantiles, for the entire sample the median time is around 20 minutes, the 10th percentile is under 3 minutes and the 90th percentile is around 5 hours. Using the mean here (which in Jeff’s post varies from 50 to 146 minutes over the same 4 year period), can be somewhat misleading.

An important component of response times is differentiating between different priority calls. NOLA in their data, higher numbers are higher priority. Zero priority are things NOLA says don’t necessarily need an officer at all. So it could be those “0 priority” calls are really just dragging the overall average down over time, although they may have little to do with clearance rates or public safety overall. The priority category fields also has sub-categories, e.g. 1A is higher priority than 1B. To keep the post simple I just breakdown by integer leading values, not the sub letter-categories.

# Priority just do 1/2/3
# 3 is highest priority
data['PriorCat'] = data['Priority'].str[0]
# Only 5 cases of 3s, will eliminate these as well

Here you can really see the right skewness – priority 2 calls the mean is 25 minutes, but the median is under 10 minutes for the entire sample. A benefit of quantile regression I will use in a bit, the few outlying cases (beyond the quantiles of interest), really don’t impact the analysis. So those cases that take almost 24 hours (I imagine they are just auto-filled in like that in the data), really don’t impact estimates of smaller quantiles. But they can have noticeable influence on mean estimates.

Some final data munging, to further simplify I drop the 16 cases of priority 3s and 4s, and add in a few more categorical covariates for hour of the day, and look at months over time as categorical. (These decisions are more so to make the results easier to parse in a blog post in simpler tables, it would take more work to model a non-linear continuous over time variable, say via a spline, and make a reasonable ordinal encoding for the sub-priority categories.)

# only worry about 0/1/2s
data = data[data['PriorCat'].isin(['0','1','2'])].copy()
# Total in the end almost 600k cases

# Some factor date variables
def dummy_stats(vdate,begin_date):
    bd = pd.to_datetime(begin_date)
    year = vdate.dt.year
    month = vdate.dt.month
    week_day = vdate.dt.dayofweek
    hour = vdate.dt.hour
    diff_days = (vdate - bd).dt.days
    # if binary, turn week/month into dummy variables
    return diff_days, week_day, hour, month, year

dn, wd, hr, mo, yr = dummy_stats(data['begin'],'1/1/2022')
data['Hour'] = hr
data['Month'] = mo
data['Year'] = yr

# Lets just look at months over time
data['MoYr'] = data['Year'] + data['Month']/100

Now finally onto the modeling stuff. For those familiar with regression, quantile regression instead of predicting the mean predicts a quantile of the distribution. Here I show predicting the 50th quantile (the median). For those not familiar with regression, this is not all that different than doing a pivot table/group by, but aggregating by quantiles instead of means. Regression is somewhat different than the simpler pivot table, since you “condition on” other continuous factors (here I “control for” hour of day), but in broad strokes is similar.

Here I use a patsy “R style” formula, and fit a categorical covariate for the 0/1/2 categories, hour of day, and the time varying months over time (to see the general trends). The subsequent regression table is big, so will show in parts:

# Quantile regression for median
mod = smf.quantreg("minutes ~ C(PriorCat, Treatment(reference='2')) + C(Hour) + C(MoYr)", data)
res50 = mod.fit(q=0.5)

First, I use 2 priority events as the referent category, so you can see (in predicting the median of the distribution), priority 1 events have a median 24 minutes longer than priority 2, and priority 0 have a median two hours later. You can see some interesting patterns in the hour of the day effects (which are for the overall effects, not broken down by priority). So there are likely shift changes at 06:00, 14:00, and 22:00 that result in longer wait times.

But of most interest are patterns over time, here is the latter half of that table, showing median estimates over the months in this sample.

You could of course make the model more complicated, e.g. look at spatial effects or incorporate other direct measures of capacity/people on duty. But here it is complicated enough for an illustrative blog post. January-2019 is the referent category month, and we can see some slight decreases in a few minutes around the start of the pandemic, but have been clearly been increasing at the median time fairly noticeably starting later in 2021.

As opposed to interpreting regression coefficients, I think it is easier to see model predictions. We can just make sample data points, here at noon over the different months, and do predictions over each different priority category:

# Predictions for different categories
hour = 12
prior_cat = [0,1,2]
oos = data.groupby(['PriorCat','MoYr'],as_index=False)['Hour'].size()
oos['Hour'] = 12
oos['Q50'] = res50.predict(oos)

print(oos[oos['PriorCat'] == '0'])
print(oos[oos['PriorCat'] == '1'])
print(oos[oos['PriorCat'] == '2'])

So here for priority 0, 130 has creeped up to 143.

And for priority 1, median times 35 to 49.

Note that the way I estimated the regression equation, the increase/decrease per month is forced to be the same across the different priority calls. So, the increase among priority 2 calls is again around 13 minutes according to the model.

But this assumption is clearly wrong. Remember my earlier “fast” and “slow” example, with only the slow calls increasing. That would suggest the distributions for the priority calls will likely have different changes over time. E.g. priority 0 may increase by alot, but priority 2 will be almost the same. You could model this in the formula via an interaction effect, e.g. something like "minutes ~ C(PriorCat)*C(MoYr) + C(Hour)", but to make the computer spit out a solution a bit faster, I will subset the data to just priority 2 calls.

Here the power of quantile regression is we can look at different distributions. Estimating extreme quantiles is tough, but looking at the 10th/90th (as well as the median) is pretty typical. I do those three quantiles, and generate model predictions over the months (again assuming a call at 12).

# To save time, I am only going to analyze
# Priority 2 calls now
p2 = data[data['PriorCat'] == '2'].copy()
m2 = smf.quantreg("minutes ~ C(MoYr) + C(Hour)", p2)
oos2 = oos[oos['PriorCat'] == '2'].copy()

# loop over different quantiles
qlist = [0.1, 0.5, 0.9]
for q in qlist:
    res = m2.fit(q=q)
    oos2[f'Q_{q}'] = res.predict(oos2)


So you can see my story about fast and slow calls plays out, although even when restricted to purportedly high risk calls. When looking at just priority 2 calls in New Orleans, the 10th percentile stays very similar over the period, although does have a slight increase from under 4 to almost 5 minutes. The 50th percentile has slightly more growth, but is from 10 minutes to 13 minutes. The 90th percentile though has more volatility – grew from 30 to 60 in small increases in 2022, and late 2022 has fairly dramatic further growth to 70/90 minutes. And you can see how the prior model that did not break out priority 0/1 calls changed this estimate for the left tail for the priority 2 left tail as well. (So those groups likely also had large shifts across the entire set.)

So my earlier scenario is overly simplistic, we can see some increase in the left tails of the distribution as well in this analysis. But, the majority of the increase is due to changes in the long right tail – calls that used to take less than 30 minutes are now taking 90 minutes to arrive. Which still likely has implications for satisfaction with police and reporting behavior, maybe not so much though with clearance or direct public safety.

No easy answers here in terms of giving internet advice to New Orleans. If working with NOLA, I would like to get estimates of officer capacity per shift, so I could incorporate into the quantile regression model that factor directly. That would allow you to precisely quantify how officer capacity impacts the distribution of response times. So not just “response times are going up” but “the decrease in capacity from A to B resulted in X increase in the 90th percentile of response times”. So if NOLA had goals set they could precisely state where officer capacity needed to be to have a shot of obtaining that goal.

Handling errors in python and R

Just some quick notes on error handling in python and R. For most of my batch processes at work (in python), error handling is necessary. Most of my code has logic that if it fails, it sends an email message about the failure. This is only possible if you capture errors and have conditional logic, if the code just fails without capturing the error (for both R and python) it will just exit the script.

I had another example recently in R that used placebo tests that could fail to converge. So a more traditional stat project, but it should just keep running the loop to collect results, not fail entirely.

Python Example

In python, for most of my batch jobs I have code that looks like this:

import traceback

    # whatever function you are running
except Exception:
    er = traceback.format_exc()
    print('Error message is \n\n{er}')

So it fails gracefully, and just gives me a message either in REPL for interactive debugging or stdout for regularly run batch jobs. I can see for people running servers why they want more specific error handling and using a more formal logger, but IMO that is overkill for running batch jobs.

R Example

R to do error handling looks something like this:

# trycatch for errors
my_func <- function(){
    # try/catch if error
    out <- tryCatch(
       { # part with the potential error
         #r <- ???? #whatever code steps you want
       }, error=function(cond){ 
          print("Function Failed, Error message is \n\n")
          } )

So if you have inside the tryCatch something that is “fit complicated model” inside your simulations (that could fail), this will still fail gracefully (and can return the error message if you need to.

Counting lines of code

Was asked recently about how many lines of python code was in my most recent project. A simple command line check, cd into your project directory and run:

find -type f -name "*.py" | xargs wc -l

(If on windows, you can download the GOW tools to be able to use these same tools by default available on unix/mac.) This will include whitespace and non-functional lines (like docstrings), but that I think is ok. Doing this for my current main project at Gainwell, I have about 30k lines of python code. Myself (and now about 4 other people) have been working on that code base for nearly a year.

For my first production project at (then) HMS, the total lines of python code are 20k, and I developed the bulk of that in around 7 months of work. Assuming 20 work days in a month, that results in around 20000/140 ~ 143 lines of code per workday. I did other projects during that time span, but this was definitely my main focus (and I was the sole developer/data scientist). I think that is high (more code is not necessarily better, overall code might have decreased as future development of this project happened over time), but is ballpark reasonable expectations for working data scientists (I would have guessed closer to around 100 per day). In the grand scheme of things, this is like 2 functions or unit tests per work day (when considering white space and docstrings).

Doing this for all of my python code on my personal machine is around 60k (I do around, as I am removing counts for projects that are just cloned). And for all the python code on my work machine is around 140k (for 3 years of work). (I am only giving fuzzy numbers, I have some projects joint work I am half counting, and some cloned code I am not counting at all.)

Doing this same exercise for R code, I only get around 40k lines of code on my personal machine. For instance, my ptools package has under 3k lines of "*.R" code total. I am guessing this is due to not only R code being more precise than python, but to take code into production takes more work. Maybe worth another blog post, but the gist of the difference between an academic project is that you need the code to run one time, whereas for a production project the code needs to keep running on a regular schedule indefinitely.

I have written much more SPSS code over my career than R code, but I have most of it archived on Dropbox, so cannot easily get a count of the lines. I have a total of 1846 sps files (note that this does not use xargs).

find -type f -name "*.sps" | wc -l

It is possible that the average sps file on my machine is 200 lines per file (it definitely is over 100 lines). So my recent python migration I don’t think has eclipsed my cumulative SPSS work going back over a decade (but maybe in two more years will).

NIJ grants funding gun violence research

Before I get into the nitty gritty of this post, a few notes. First, my next post in the Criminal Justician series on ASEBP is up, Violent Crime Interventions That are Worth it. I discuss more of the costs with implementing hot spots policing and focussed deterrence from the police departments perspective, and why they are clearly worthwhile investments for many police departments facing violence problems.

Second, I want to point folks to Jacob Kaplan’s blog, most recent post The Covid Kings of Salami. Some of Jacob’s thoughts I disagree with (I think smaller papers are OK, or that policing what is big enough is a waste of time). But if you like my posts on CJ topics, you should check out Jacob’s as well.

Now onto the title – a work in progress at the moment, but working with Scott Jacques on the openness of funded US criminology research. A short post in response to the oft mistaken idea that gun violence research is banned in the US. This is confused logic related to the Dickey act saying awards for gun control advocacy are banned as being federally funded by the CDC.

There are other agencies who fund gun violence research, in particular here I have scraped data from the National Institute of Justice (what I think is likely to be the largest funder in this area). Here is some python code showing some analyses of those awards.

So first, here you can download and see the size of the scraped dataset of NIJ awards:

import pandas as pd

# award data scraped, stay tuned for code for that!
award_url = 'https://dl.dropbox.com/s/eon4iokv0qpllgl/NIJ_Awards.csv?dl=0'
award_df = pd.read_csv(award_url)

So as a first blush check for awards related to gun violence, we can just search the text for the award narrative for relevant terms, here I just search for GUN VIOLENCE and FIREARM. A more thorough investigation would either code the 7k awards or the original solicitations, but I think this will likely be largely accurate (probably slightly more false positives than false negatives).

award_df['award_textU'] = award_df['award_text'].str.upper()

# Lets try to find any of these (other text?)
word_list = ['GUN VIOLENCE','FIREARM']

for w in word_list:
    award_df[w] = 1*(award_df['award_textU'].str.find(w) > -1)

award_df['AnyGun'] = 1*(award_df[word_list].sum(axis=1) > 0)

So we can see that we have 1,082 awards related to gun violence (out of 7,215 listed by the NIJ). Lets check out the total funding for these awards:

# Lets figure out the total allocated
award_df['AwardVal'] = award_df['field-award-amount'].str.strip()
award_df['AwardVal'] = award_df['AwardVal'].replace('[\$,]', '', regex=True)
award_df['AwardVal'] = pd.to_numeric(award_df['AwardVal'])
award_df['Tot'] = 1

cf = ['Tot','AwardVal']

So we have in the listed awards (that go back to 1998 but appear more consistently filled in starting in 2002), over 300 million in grant awards related to gun violence/firearm research. Here we can see the breakdown over time.

# See awards over time
gun_awards = award_df[award_df['AnyGun'] == 1].copy()

So the awards gifted by NIJ no doubt have a different flavor/orientation than if you had the same money from CDC. (There are other orgs though, like NSF, who I am sure have funded research projects relevant to gun violence over time as well.) Sometimes people distinguish between “public health” vs “criminal justice” approaches, but this is a pretty superficial dichotomy (plenty of people in public health have gotten NIJ awards).

So you certainly could argue the Dickey amendment changed the nature of gun violence research being conducted. And since the CDC budget is so massive, I suppose you could argue that it reduced the overall amounts of gun violence research being funded (although it is likely 0 sum, more for firearm research would have slashed some other area). You could use the same argument to say NIJ though is underfunded instead of advocating for the CDC to write the checks though.

But the stronger statement I often see stated, that firearm research is entirely banned in the US, is not even close to being correct.

Injury rates at Amazon warehouses

I follow several of the News & Observer (The Raleigh/Durham newspaper) newsletters, and Brian Gordon and Tyler Dukes had a story recently about fainting and ambulance runs at the Amazon warehouse in Raleigh, Open Source: Ambulances at Amazon. He did some great sleuthing, and showed that while the number on its face seemed high (an ambulance call around 1 out of 3 days) the rate of ambulance runs when accounting for the size of the workforce is pretty similar to other warehouses.

Here I will show an example of downloading the OSHA injury data to show a similar finding. Using python it is pretty quick work.

So first can import the libraries we need (the typical scientific stack). I download the OSHA data for 2021, and I calculate injury rates per person work year, so how to interpret these are at the workplace level. Per full time people per year, it is the expected number of injuries across the workforce.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta

inj_2021url = "https://www.osha.gov/sites/default/largefiles/ITA-data-cy2021.zip"
inj_dat = pd.read_csv(inj_2021url)
# Calculate injuries per person full year
inj_dat['InjPerYear'] = (inj_dat['total_injuries']/inj_dat['total_hours_worked'])*2080

We can filter out warehouse workers via NAICS code 493110. I also just limit to warehouses in North Carolina. Sorting by the injury rate, Amazon is not even in the top 10 in the state:

warehouses = inj_dat[inj_dat['naics_code'] == 493110].copy()
warehouses_nc = warehouses[warehouses['state'] == 'NC'].reset_index(drop=True)
warehouses_nc['AmazonFlag'] = 1*(warehouses_nc['company_name'].str.find('Amazon.com') >= 0)

# Rate per year of work per person, 2080 

But note that I don’t think Bonded Logistics is a terribly dangerous place. One thing you need to watch out for when evaluating rate data is that places with smaller denominators (here lower total hours worked) tend to be more volatile. So a useful plot is to plot the total hours work (cumulative for the entire warehouse) against the overall rate of injuries per hour worked.

fig, ax = plt.subplots(figsize=(12,6))
ax.scatter(nam_ware['total_hours_worked'], nam_ware['InjPerYear'], 
           c='grey', s=30, edgecolor='k', alpha=0.5, label='Other Warehouses')
ax.scatter(amz_ware['total_hours_worked'], amz_ware['InjPerYear'], 
           c='blue', s=80, edgecolor='k', alpha=0.9, label='Amazon Warehouses')
ax.set_xlabel('Total Warehouse Hours Worked')
ax.set_ylabel('Injury Rate per Person Work Year (2080 hours)')
ax.legend(loc='upper right')
plt.savefig('InjRate.png', dpi=500, bbox_inches='tight')

You can see by this plot the Amazon warehouses have the largest total number of hours worked (by quite a few) relative to many other warehouses in North Carolina. But their overall rate of injuries is right in line with the rest of the crowd. Looking at the overall rate, it is around 0.04 (so you would expect around 1/20 full time workers to have an injury per year at a warehouse according to this data).

tot_rate = warehouses_nc['total_injuries'].sum()/warehouses_nc['total_hours_worked'].sum()

If we do this plot again, but add funnel bound lines to show the typical volatility we would expect with estimating these rates:

# Binomial confidence interval
def binom_int(num,den,confint=0.95):
    quant = (1 - confint)/ 2.
    low = beta.ppf(quant, num, den - num + 1)
    high = beta.ppf(1 - quant, num + 1, den - num)
    return (np.nan_to_num(low), np.where(np.isnan(high), 1, high))

den = np.geomspace(1000,8700000,500)
num = den*tot_rate
low_int, high_int = binom_int(num,den,0.99)
high_int = high_int*2080

fig, ax = plt.subplots(figsize=(12,6))
ax.plot(den,high_int, c='k', linewidth=0.5)
ax.hlines(tot_rate*2080,1000,8700000,colors='k', linewidths=0.5)
ax.scatter(nam_ware['total_hours_worked'], nam_ware['InjPerYear'], 
           c='grey', s=30, edgecolor='k', alpha=0.5, label='Other Warehouses')
ax.scatter(amz_ware['total_hours_worked'], amz_ware['InjPerYear'], 
           c='blue', s=80, edgecolor='k', alpha=0.5, label='Amazon Warehouses')
ax.set_xlabel('Total Warehouse Hours Worked')
ax.set_ylabel('Injury Rate per Person Work Year (2080 hours)')
plt.xscale('log', basex=10)
ax.legend(loc='upper right')
ax.annotate('Straight line is average overall injury rate\nCurved line is Binomial 99% Interval', 
            xy = (0.00, -0.13), xycoords='axes fraction')
plt.savefig('InjRate_wBin.png', dpi=500, bbox_inches='tight')

So you can see even Bonded Logistics is well within the average rate you would expect to still be consistent with the average overall injury rate relative to all the other warehouses in North Carolina.

As a note, I imagine I saw someone using this data recently looking at police departments in a criminal justice paper (I have in my notes police departments are NAICS code 922120). (Maybe Justin Nix/Michael Sierra-Arévalo/Ian Adams?) But sorry do not remember the paper (so I owe credit to someone else for pointing out this data, but not sure who).

Another way to do the analysis is to calculate the lower/upper confidence intervals per the rates, and then sort by the lower confidence interval. This way you can filter out high rate variance locations.

# Can look at police departments
# NAICS code 922120
police = inj_dat[inj_dat['naics_code'] == 922120].copy()
low_police, high_police = binom_int(police['total_injuries'],police['total_hours_worked'])
police['low_rate'] = low_police*2080
check_fields = ['establishment_name','city','state','total_injuries','total_hours_worked','InjPerYear','low_rate']

So you can see we have some funny business going on with the LA data reporting (which OSHA mentions on the data webpage). Maybe it is just admin duty, so people are already injured and get assigned to those bureaus (not sure why LAPD reports seperate bureaus at all).

Using IO objects in python to read data

Just a quick post on a technique I’ve used a few times recently, in particular when reading web data.

First for a very quick example, in python when reading data with pandas, it often expects a filename on disk. For pandas, e.g. pd.read_csv('my_file.csv'). But if you happen to already have the contents of the csv in a text object in memory, you can use io.StringIO to just read that object.

import pandas as pd
from io import StringIO

# Example csv file inline
examp_csv = """a,b


Where this has come up for me recently is reading in different data from web servers. For example, here is Cary’s API for crime data, you can batch download the whole thing at the below url, but via this approach I currently get an SSL error:

# Town of Cary CSV for crimes
cary_url = 'https://data.townofcary.org/explore/dataset/cpd-incidents/download/?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true&csv_separator=%2C'

# Returns SSL Error for me
cary_df = pd.read_csv(cary_url)

Note I don’t know the distinction in web server tech that causes this (as sometimes you can just grab a CSV via url, here is an example I have grabbing PPP loan data or with the NIJ recidivism data).

But we can grab the data via requests, and use the same StringIO trick I just showed to get this data:

# Using string IO for reading text
import requests
res_cary = requests.get(cary_url)
cary_df = pd.read_csv(StringIO(res_cary.text))

Again I don’t know why some servers you need to go through this approach, but this works for me for Socrata and CartoDB api’s for different cities open data. I also used in recently for converting geojson in ESRI’s api.

The second example I want to show is downloading zipfiles. For this, we will use io.BytesIO instead of StringIO. The census stores various data in zipfiles on their FTP server:

# Example 2, grabbing zipped contents
import zipfile
from io import BytesIO

census_url = 'https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/2019_5yr_Summary_FileTemplates.zip'
req = requests.get(census_url)

# Can use BytesIO for this content
zf = zipfile.ZipFile(BytesIO(req.content))

The zipfile library would be equivalent to reading/extracting a zipfile already on disk. But when downloading there is no need to save to disk, then deal with that file. BytesIO here cuts out the middleman.

Then we gan either grab a specific file inside of our zf object, or extract all the contents one-by-one:

# Now can loop through the list
# or grab specific file
temp_geo = pd.read_excel(zf.open('2019_SFGeoFileTemplate.xlsx'))

Hot spots of crime in Raleigh and home buying

So my realtor, Ellen Pitts (who is highly recommended, helped us a ton remotely moving into Raleigh), has a YouTube channel where she talks about real estate trends. Her most recent video she discussed a bit about crime in Raleigh relative to other cities because of the most recent shooting.

My criminologist hot take is that generally most cities in the US are relatively low crime. So Ellen shows Dallas has quite a few more per-capita shootings than Raleigh, but Dallas is quite safe “overall”. Probably somewhat contra to what most people think, the cities that in my opinion really have the most crime problems tend to be smaller rust belt cities. I love Troy, NY (where I was a crime analyst for a few years), but Troy is quite a bit rougher around the edges than Raleigh or Dallas.

So this post is more about, you have already chosen to move to Raleigh – if I am comparing house 1 and house 2 (or looking at general neighborhoods), do I need to worry about crime in this specific location?

So for a few specific resources/strategies for the home hunter. Not just in Raleigh, but many cities now have an open data portal. You can often look at crime. Here is an example with the Raleigh open data:

So if you have a specific address in mind, you can go and see the recent crime around that location (cities often fuzz the address a bit, so the actual points are just nearby on that block of the street). Blue dots in that screenshot are recent crimes in 2022 against people (you can click on each dot and get a more specific breakdown). Be prepared when you do this – crime is everywhere. But that said the vast majority of minor crime incidents should not deter you from buying a house or renting at a particular location.

Note I recommend looking at actual crime data (points on a map) for this. Several vendors release crime stats aggregated to neighborhoods or zipcodes, but these are of very low quality. (Often they “make up” data when it doesn’t exist, and when data does exist they don’t have a real great way to rank areas of low or high crime.)

For the more high level, should I worry about this neighborhood, I made an interactive hotspot map.

For the methodology, I focused on crimes that I would personally be concerned with as a homeowner. If I pull larceny crimes, I am sure the Target in North Hills would be a hotspot (but I would totally buy a condo in North Hills). So this pulls the recent crime data from Raleigh open data starting in 2020, but scoops up aggravated assaults, interpersonal robberies, weapon violations, and residential burglaries. Folks may be concerned about drug incidents and breaking into cars as well, but my experience those also do not tend to be in residential areas. The python code to replicate the map is here.

Then I created DBScan clusters that had at least 34 crimes – so these areas average at least one of these crimes per month over the time period I sampled. Zooming in, even though I tried to filter for more potentially residential related crimes, you can see the majority of these hot spots of crime are commercial areas in Raleigh. So for example you can zoom in and check out the string of hot spots on Capital Blvd (and if you click a hot spot you can get breakdowns of specific crime stats I looked at):

Very few of these hot spots are in residential neighborhoods – most are in more commercial areas. So when considering looking at homes in Raleigh, there are very few spots I would worry about crime at all in the city when making a housing choice. If moving into a neighborhood with a higher proportion of renters I think is potentially more important long term signal than crime here in Raleigh.

Hyperparameter tuning for Random Forests

Motivated to write this post based on a few different examples at work. One, we have periodically tried different auto machine learning (automl) libraries at work (with quite mediocre success). They are OK for a baseline, not so much for production. Two, a fellow data scientist was trying some simple hyperparameter search in random forests, and was searching over all the wrong things.

For those not familiar, automl libraries, such as data robot or databricks automl, typically do a grid search over different models given a particular problem (e.g. random forest, logistic regression, XGBoost, etc.). And within this you can have variants of similar models, e.g. RFMod1(max-depth=5) vs RFMod2(max-depth=10). Then given a train/test split, see which model wins in the test set, and then suggests promoting that model into production.

My experience at work with these different libraries (mostly tabular medical records data), they choose XGBoost variants at a very high frequency. I think this is partially due to poor defaults for the hyperparameter search in several of these automl libraries. Here I will just be focused on random forests, and to follow along I have code posted on Github.

Here I am using the NIJ recidivism challenge data I have written about in the past, and for just some upfront work I am loading in various libraries/functions (I have the recid functions in the githup repo). And then making it easy to just load in the train/test data (with some smart feature engineering already completed).

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import recid # my custom NIJ recid funcs

# Read in data and feature engineering
train, test = recid.fe()

Now first, in terms of hyperparameter tuning you need to understand the nature of the model you are fitting, and how those hyperparameters interact with the model. In terms of random forests for example, I see several automl libraries (and my colleague) doing a search over the number of trees (or estimators in sklearn parlance). Random forests the trees are independent, and so having more trees is always better. It is sort of like saying would you rather me do 100 simulations or 1,000,000 simulations to estimate a parameter – the 1,000,000 will of course have a more accurate estimate (although may be wasteful, as the extra precision is not needed).

So here, using the NIJ defined train/test split, and a set of different fixed parameters (close to what I typically default to for random forests). I show how AUC or the Brier Score changes with the number of trees, over a grid from 10 to 1000 trees:

# Loop over tree sizes
ntrees = np.arange(10,1010,10)
auc, bs = [], []

# This is showing number of estimators is asymptotic
# should approach best value with higher numbers
# but has some variance in out of sample test
for n in ntrees:
    print(f'Fitting Model for ntrees:{n} @ {datetime.now()}')
    # Baseline model
    mod = RandomForestClassifier(n_estimators=n, max_depth=5, min_samples_split=50)
    # Fit Model
    mod.fit(train[recid.xvars], train[recid.y1])
    # Evaluate AUC/BrierScore out of sample
    pred_out = mod.predict_proba(test[recid.xvars])[:,1]
    auc.append(roc_auc_score(test[recid.y1], pred_out))
    bs.append(brier_score_loss(test[recid.y1], pred_out))

# Making a plot of the 
fig, (ax1, ax2) = plt.subplots(2, figsize=(6,8))
ax1.plot(ntrees, auc)
ax2.plot(ntrees, bs)
ax2.set_title('Brier Score')
plt.savefig('Ntree_grid.png', dpi=500, bbox_inches='tight')

We can see that the relationship is noisy, but the trend clearly decreases with tree size, and perhaps asymptotes post 200 some trees for both metrics in this particular set of data.

So of course it depends on the dataset, but when I see automl libraries choosing trees in the range of 10, 50, 100 for random forests I roll my eyes a bit. You always get more accurate (in a statistical sense), with more trees. You would only choose that few for convenience in fitting and time savings. The R library ranger has a better default of 500 trees IMO (sklearns 100 is often too small in doing tests). But there isn’t much point in trying to hyperparameter tune this – your hyperparameter library may choose a smaller number of trees in a particular run, but this is due to noise in the tuning process itself.

So what metrics do matter for random forests? All machine learning models you need to worry about over-fitting/under-fitting. This depends on the nature of the data (number of rows, can fit more parameters, fewer columns less of opportunity to overfit). Random forests this is dependent on the complexity of the trees – more complex trees can overfit the data. If you have more rows of data, you can fit more complex trees. So typically I am doing searches over (in sklearn parlance) max-depth (how deep a tree can grow), min-samples-split (can grow no more trees is samples are too tiny in a leaf node). Another parameter to search for is how many columns to subsample as well (more columns can find more complex trees).

It can potentially be a balancing act – if you have more samples per leaf, it by default will create less complex trees. Many of the other hyperparameters limiting the complexity of the trees are redundant with these as well (so you could really swap out with max-depth). Here is an example of using the Optuna library a friend recommended to figure out the best parameters for this particular data set, using a train/eval approach (so this splits up the training set even further):

# Consistent train/test split for all evaluations
tr1, tr2 = train_test_split(train, test_size=2000)

def objective(trial):
    param = {
        "n_estimators": 500,
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 10, 200, step=10),
        "max_features": trial.suggest_int("max_features", 3, 15),
    mod = RandomForestClassifier(**param)
    mod.fit(tr1[recid.xvars], tr1[recid.y1])
    pred_out = mod.predict_proba(tr2[recid.xvars])[:,1]
    auc = roc_auc_score(tr2[recid.y1], pred_out)
    return auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
trial = study.best_trial

print(f"Best AUC {trial.value}")
print("Best Params")

So here it chooses fairly deep trees, 9, but limited complexity via the sample size in a leaf (90). The number of features per tree is alsio slightly larger than default (here 25 features, so default is sqrt(25)). So it appears my personal defaults were slightly under-fitting the data.

My experience regression prefers smaller depth of trees but lower sample sizes (splitting to 1 is OK), but for binary classification limiting sample sizes in trees is preferable.

One thing to pay attention to in these automl libraries, a few do not retrain on the full dataset with the selected hyperparameters. For certain scenarios you don’t want to do this (e.g. if using the eval set to do calibrations/prediction intervals), but if you don’t care about those then you typically want to retrain on the full set. Another random personal observation, random forests really only start to outperform simpler regression strategies at 20k cases, with smaller sample sizes and further splitting train/eval/test, the hyperparameter search is very noisy and mostly a waste of time. If you only have a few hundred cases just fit a reasonable regression model and call it a day.

So here I retrain on the full test set, and then look at the AUC/Brier Score compared to my default model.

# Lets refit according to these params for the full
# training set
mod2 = RandomForestClassifier(n_estimators=500,**trial.params)
mod2.fit(train[recid.xvars], train[recid.y1])
pred_out = mod2.predict_proba(test[recid.xvars])[:,1]
auc_tuned = roc_auc_score(test[recid.y1], pred_out)
bs_tuned = brier_score_loss(test[recid.y1], pred_out)
print(f"AUC tuned {auc_tuned:.4f} vs AUC default {auc[-1]:.4f}")
print(f"Brier Score tuned {bs_tuned:.4f} vs default {bs[-1]:.4f}")

It is definately worth hyperparameter tuning once you have your feature set down, but lift of ~0.01 AUC (which is typical for hypertuned tabular models in my experience) is not a big deal with initial model building. Figuring out a smart way to encode some relevant feature (based on domain knowledge of the problem you are dealing with) typically has more lift than this in my experience.

Column storage for wide datasets

The notion of big data is quite a bit overhyped. I do get some exposure to it at work, but many tools like Hadoop are not needed over doing things in chunks on more typical machines. One thing though I have learned more generally about databases, many relational databases (such as postgres) store data under the hood like this:


So they are stacked cumulatively in an underlying file format. And then to access the information the system has to know “ok I need to find Index2 and column 2”, so it calculates offsets for where those pieces of information should be located in the file.

This however is not so nice for databases that have many columns. In particular administrative record databases that have few rows, but many thousands of columns (often with many missing fields and low dimensional categories), this default format is not very friendly. In fact many databases have limits on the number of columns a table can have. (“Few rows” is relative, but really I am contrasting with sensor data that often has billions/trillions of rows, but very few columns.)

In these situations, a columnar database (or data format) makes more sense. Instead of having to calculate many large number of offsets, the database often will represent the key-column pairs in some easier base format, and then will only worry about grabbing specific columns. Both representing the underlying data in a more efficient manner will lower on computer disk space (e.g. only takes up 1 gig instead of many gigabytes) as well as improve input/output operations on the data (e.g. it is faster to read the data/write new data).

I will show some examples via the American Community Survey data for micro areas. I have saved the functions/code here on github to follow along.

So first, I load in pandas and DuckDB. DuckDB is an open source database that uses by default a columnar storage format, but can consider it a very similar drop in replacement for sqllite for persistantly storing data.

import pandas as pd
import duckdb
import os

# my local census functions
# grabbing small area 2019 data
# here defaults to just Texas/Delaware
# still takes a few minutes
import census_funcs
cen2019_small, fields2019 = census_funcs.get_data(year=2019)
print(cen2019_small.shape) # (21868, 17145)

Here I grab the small area data for just Texas/Delaware (this includes mostly census tracts and census block groups in the census FTP data). You can see not so many rows, almost 22k, but quite a few columns, over 17k. This is after some data munging to drop entirely missing/duplicated columns even. The census data just has very many slightly different aggregate categories.

Next I save these data files to disk. For data that does not change very often, you just want to do this process a single time and save that data somewhere you can more easily access it. No need to re-download the data from the internet/census site everytime you want to do a new analysis.

I don’t have timing data here, but you can just experiment to see the parquet data format is quite a bit faster to save (and csv formats are the slowest). DuckDB is smart and just knows to look at your local namespace to find the referenced pandas dataframe in the execute string.

# Save locally to CSV file

# Trying zip compression

# Save to parquet files

# Save to DuckDB, should make the DB if does not exist
con = duckdb.connect(database='census.duckdb',read_only=False)
con.execute('CREATE TABLE census_2019 AS SELECT * FROM cen2019_small')

If we then go and check out the datasizes on disk, csv for just these two states is 0.8 gigs (the full set of data for all 50 states is closer to 20 gigs). Using zip compression reduces this to around 1/4th of the size for the csv file. Using parquet format (which can be considered an alternative fixed file format to CSV although columnar oriented) and gzip compression is pretty much the same as zip compression for this data (not many missing values or repeat categories of numbers), but if you have repeat categorical data with a bit of missing data should be slightly better compression (I am thinking NIBRS here). The entire DuckDB database is actually smaller than the CSV file (I haven’t checked closely, I try to coerce the data to smarter float/int formats before saving, but there are probably even more space to squeeze out of my functions).

# Lets check out the file sizes
files = ['census_test.csv','census_test.csv.zip',

for fi in files:
    file_size = os.path.getsize(fi)
    print(f'File size for {fi} is {file_size/1024**3:,.1f} gigs')

# File size for census_test.csv is 0.8 gigs
# File size for census_test.csv.zip is 0.2 gigs
# File size for census.parquet.gzip is 0.2 gigs
# File size for census.duckdb is 0.7 gigs

The benefit of having a database here like DuckDB is for later SQL querying, as well as the ability to save additional tables. If I scoop up the 2018 data (that has slightly different columns), I can save to an additional table. Then later on downstream applications can select out the limited columns/years as needed (unlikely any real analysis workflow needs all 17,000 columns).

# Can add in another table into duckdb
cen2018_small, fields2018 = census_funcs.get_data(year=2018)
con.execute('CREATE TABLE census_2018 AS SELECT * FROM cen2018_small')

ducksize = os.path.getsize(files[-1])
print(f'File size for {files[-1]} with two tables is {ducksize/1024**3:,.1f} gigs')
# File size for census.duckdb with two tables is 1.5 gigs

Sqllite is nice, as you can basically share a sqllite file and you know your friend will be able to open it. I have not worked with DuckDB that much, but hopefully it has similar functionality and ability to share without too much headache.

I have not worried about doing timing – I don’t really care about write timings of these data formats compared to CSV (slow read timing is annoying, but 10 seconds vs 1 minute to read data is not a big deal). But it is good practice to not be gluttonous with on disk space, which saving a bunch of inefficient csv files can be a bit wasteful.