Podcast and Video Shout Outs

So y’all know I really enjoy blogs, so much so that I think they often add more value than traditional peer-reviewed papers. There are other mediums I would like to recognize as well: podcasts and video tutorials. While I like to do lab tutorials (pretty much like my blog posts, in which I step through some code), I know many students would prefer videos and lectures. And I admit some of these I have seen done quite well, on Coursera for example.

Another source I have been consuming quite a bit lately is podcasts. These often take the form of an interview, so they are not technical in nature but more soft storytelling, such as talking about a particular topical area the interviewee is an expert in, or that person’s career path. So here is my list of these resources I have personally learned from and enjoyed.

For none of these have I listened to or watched 100% of the offerings, but I have gone through multiple episodes of each (and will continue to listen/watch more)! These are very criminal justice focused, so I would love to branch out to data science and health care resources if folks have suggestions!

Podcasts

Reducing Crime – Jerry Ratcliffe interviews a mix of academics and folks working in the criminal justice field. Quite a few of these episodes I found personally very informative. John Eck, Kim Rossmo, and Phil Goff were perhaps my favorites among the academics; Danny Murphy and Thomas Abt were really good as well (my favorite non-academics offhand).

Niro Knowledge – Nicholas Roy is a current crime analyst who interviews other crime analysts and academics. Favorite interviews so far are Cynthia Lum and Renee Mitchell. Similar to Reducing Crime, it is typically focused on a particular topic of interest to the person being interviewed (e.g. Renee talked about her work on crime harm indices).

Analyst Talk – This is a podcast hosted by Jason Elder where he interviews crime analysts from all over about their careers. Annie Thompson’s and my former colleague Shelagh Dorn’s episodes are my favorites so far, but I also need to listen sometime to Sean Bair’s series of talks as well.

Abt Podcasts – This one I only came across a week ago, but I have already listened to several on data science, CJ, and social determinants of health. These are a bit different than the other podcasts here: they are shorter, and have two individuals from different fields discuss social science relevant to the chosen topic.

Videos

Canadian Society of Evidence Based Policing – Has many interviews of academics in crim/cj. I have an interview with them (would not recommend, I need to work on sitting still!). The Peter Neyroud interview is my favorite though.

UARK CASDAL – These are instructional videos uploaded by Grant Drawve, mostly around doing crime analysis in Excel, but there are also a few on ArcGIS.

StatQuest with Josh Starmer – This is one of the few non crim/cj examples I watch regularly. In interviews for entry-level data scientists at my workplace, we often ask folks to explain machine learning models (such as random forests or XGBoost) in simple terms. These videos are excellent resources for understanding the basics of the mathematics behind those techniques.

Again, let me know in the comments of podcasts/video series I am missing out on!

The WDD test with different area sizes

So I have two prior examples of weighting the WDD test (a simple test for pre-post crime counts in an experimental setting).

And a friend recently asked about weighting for different areas, so the test is of crime reduction per area density instead of overall counts. Before I get into the example, note this isn’t per se necessary. All that matters in the end for this test to be valid is that 1) the crime data are Poisson distributed, and 2) the control areas follow parallel trends to the treated area. Based on this I’ve advocated that it is ok to have the control area be ‘the rest of the city’ for example.

Some of my work on long term crime trends at micro places shows that low-crime and high-crime areas all tend to follow the same overall temporal trends (and from Martin Andresen’s related work one would come to the same conclusion). So that would suggest you can aggregate up many low-crime areas to make a reasonable control comparison to a hot spot treated area.

So as I will show, weighting by area is possible, but it actually changes the identification strategy slightly (whereas the prior two weighting examples do not) – the parallel trends assumption needs to hold on the crime per area estimate, as opposed to the original count scale. Since the friend who asked about this is an Excel GURU (check out Grant’s very nice YouTube videos for crime analysis), I will show how to do the calculations in Excel, as well as how to do a simulation to show the estimator behaves as it should. (And the benefit of that is you can do power analysis based on the simulations.)

Example Calculations in Excel

I have posted an Excel spreadsheet to show the calculations here. For a quick overview, I made the spreadsheet very similar to the original WDD calculation; you just need to insert the sizes of the different treated/control/displacement areas.

And you can check out the formulas – again it is just weighting the estimator by the areas, and then carrying the appropriate transformations through to the variance estimates.

I have added an extra portion to this though – a simulation tab to show the estimator works as expected.

Only thing to note: a way to simulate Poisson data in Excel is to generate a random number on the unit interval (0,1), and then plug it into the inverse CDF function for the distribution of interest. There is no inverse Poisson function in Excel, but you can reasonably approximate it via the inverse binomial with a very large number of trials. I’ve tested it, and using a base of 10k binomial trials is good enough for my purposes.
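
If you want to sanity check that approximation outside of Excel, here is a quick scipy comparison (just a check I would run, not part of the spreadsheet):

######################################################
# Check that the inverse binomial with many trials approximates
# the inverse Poisson (the same trick used in the Excel spreadsheet)
from scipy.stats import binom, poisson

lam = 60          # example mean crime count
trials = 10000    # large number of binomial trials, same base as in Excel

for p in [0.025, 0.25, 0.5, 0.75, 0.975]:
    approx = binom.ppf(p, trials, lam/trials)
    exact = poisson.ppf(p, lam)
    print(p, approx, exact)  # the two quantile functions agree very closely
######################################################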

In the simulation tab of this spreadsheet you can input your own numbers for planning purposes as well. The idea is that if you think you can only reasonably reduce crimes by X amount in your targeted areas, this lets you do a power analysis. So in this example, going from 60 to 40 crimes results in a power estimate of only 0.44 (so you will fail to reject the null over 5 out of 10 times, even if your intervention actually works as well as you think). But if you think you can reduce crimes from 60 to 30, the power in this example gets close to 0.8 (what you typically shoot for in up-front experiments, although there is no harm in going for higher power!). So if you have low power you may want to expand the time periods under study or expand the number of treated areas.
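
The spreadsheet does this simulation natively, but if you prefer code, below is a rough sketch of the same idea in Python. The treated/control counts and areas here are made-up planning numbers (so the power you get will not match the spreadsheet example exactly), and the estimator shown is my understanding of the area-weighted WDD – the count changes divided by area, with the Poisson variances carried through as count/area^2.

######################################################
# Rough power simulation sketch for the area-weighted WDD
# (hypothetical inputs, plug in your own planning numbers)
import numpy as np
rng = np.random.default_rng(10)

sims = 10000
pre_t, post_t, area_t = 60, 40, 5      # treated area, hoping to go from 60 to 40
pre_c, post_c, area_c = 200, 200, 30   # control area, larger and unchanged

sim_pre_t = rng.poisson(pre_t, sims)
sim_post_t = rng.poisson(post_t, sims)
sim_pre_c = rng.poisson(pre_c, sims)
sim_post_c = rng.poisson(post_c, sims)

# area weighted WDD estimate (change in crime per area, treated minus control)
est = (sim_post_t - sim_pre_t)/area_t - (sim_post_c - sim_pre_c)/area_c
# counts are Poisson, so Var(count/area) = count/area**2, estimated via the observed counts
var = (sim_post_t + sim_pre_t)/area_t**2 + (sim_post_c + sim_pre_c)/area_c**2
z = est / np.sqrt(var)

# power for a one-tailed test of a crime decrease at alpha = 0.05
print('Power estimate:', (z < -1.645).mean())
######################################################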

Wrap Up

Between this and the prior WDD examples, I have about wrapped up all the potential permutations of weighting I can think of offhand. So you can mix/match all of these different weighting strategies together (e.g. you could do multiple time periods and area weighting). It is just algebra and carrying through the correct changes to the variance estimates.

I do have one additional blog post slated in the future. David Wilson has a recent JQC article using a different estimator, but essentially the same pre/post data I am using here. The identifying assumptions are again different for that approach (parallel trends on the ratio scale, not the linear scale), and I will have more to say about when I think you would prefer the WDD to David’s estimator. (In short I think David’s is good for meta-analysis, but I prefer my WDD for individual evaluations.)

Transforming predicted variables in regression

The other day on LinkedIn I made a point about how I think scikit-learn’s TransformedTargetRegressor is very likely to mislead folks. In fact, the example use case in the docs for this class illustrates a common mistake: fitting a model for log(y), getting predictions phat, and then simply exponentiating those predictions, exp(phat).

On LinkedIn I gave an example of how this is problematic for random forests, but here is a similar example for linear regression. For simplicity pretend we only have 3 potential residuals (all equally likely), either a residual of -1, 0, or 1.

Now pretend our logged prediction is 5, so if we simply do exp(5) we get about 148. But what are our predictions if we consider those 3 potential residuals?

Resid  Pred-Resid Modified_Pred LinPred
  -1     5 - -1        exp(6)     403
   0     5 -  0        exp(5)     148
   1     5 -  1        exp(4)      55

So if we take the mean of our LinPred column, we then get a prediction of about 202. The prediction using this approach is much higher than the naive approach of simply exponentiating 5. The difference is that the exp(5) estimate is the median, and the above estimate taking into account residuals is the mean estimate.
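
You can verify that little table with a few lines of numpy (here adding the residuals, which gives the same three values since the residuals are symmetric):

######################################################
import numpy as np

resids = np.array([-1, 0, 1])   # the three equally likely residuals
log_pred = 5                    # prediction on the log scale

back = np.exp(log_pred + resids)  # exp(4), exp(5), exp(6)
print(back.mean())      # ~202, the mean estimate
print(np.median(back))  # ~148, same as the naive exp(5)
######################################################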

While there are some cases where you may want the median estimate, in that case it probably makes more sense to use a quantile estimator of the median from the get-go, as opposed to doing the linear regression on log(y). For many (probably most) use cases in which you are predicting dollar values, this underestimate can be very problematic. If you are using these estimates for revenue, you will be way under, for example. If you are using these estimates for expenses, holy moly, you will probably get fired.

This problem will happen for any non-linear transformation. Some transformations are ok – in scikit-learn, for example, the min-max or standard normal scalers are fine – but things like logs, square roots, or Box-Cox transformations are not. (To know whether a transformation is linear, do a scatterplot of original vs transformed values: if it is a straight line it is ok, if it is a curved line it is not!)

I had a friend go back and forth with me for a bit after I posted this. To be clear, this is not me saying that modeling log(y) is wrong; it is just that to get estimates of the mean predictions, you need to take a few extra steps. In particular, one approach to get the mean estimates is to use Duan’s Smearing estimator. I will show how to do that in python below using simulated data.

Example Duan’s Smearing in python

So first, we import the libraries we will be using. And since this is simulated data, we set the seed as well.

######################################################
import pandas as pd
import numpy as np
np.random.seed(10)

from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
######################################################

Next I will create a simple linear model on the log scale. So the regression of the logged values is the correct one.

######################################################
# Make a fake dataset, say these are housing prices
n = (10000,1)
error = np.random.normal(0,1,n)
x1 = np.random.normal(10,3,n)
x2 = np.random.normal(5,1,n)
log_y = 10 + 0.2*x1 + 0.6*x2 + error
y = np.exp(log_y)

dat = pd.DataFrame(np.concatenate([y,x1,x2,log_y,error], axis=1),
                   columns=['y','x1','x2','log_y','error'])
x_vars = ['x1','x2']

# Lets look at a histogram of y vs log y
dat['y'].hist(bins=100)
dat['log_y'].hist(bins=100)
######################################################

Here is the histogram of the original values:

And here is the histogram of the logged values:

So although what matters for the regression is the conditional relationship, if you see histograms like this I would by default also fit the regression to predict log(y).

Now here I do the same thing as in the original docs: I fit a linear regression using the log as the transform function and the exponential as the inverse function.

######################################################
# Now lets see what happens with the usual approach
tt = TransformedTargetRegressor(regressor=LinearRegression(),
                                func=np.log, inverse_func=np.exp)
tt.fit(dat[x_vars], dat['y'])
print( (tt.regressor_.intercept_, tt.regressor_.coef_) ) #Estimates the correct values

dat['WrongTrans'] = tt.predict(dat[x_vars])

dat[['y','WrongTrans']].describe()
######################################################

So here we estimate the correct simulated values for the regression equation:

But as we will see in a second, the exponentiated predictions are not so well behaved. To illustrate how the WrongTrans variable behaves, I show its distribution compared to the original y value. You can see that on average it is a much smaller estimate. Our sample values have a mean of 7.5 million, and the naive estimate here only has a mean of 4.6 million.

Now here is a way to get an estimate of the mean value. In a nutshell, what you do is take the observed residuals, pretty much like that little table I did in the intro of this blog post, generate predictions given those residuals, and then back transform them and take the mean.

Although this example uses logged regression, I’ve made it pretty general. So if you used any Box-Cox transformation instead of the log (e.g. sklearn’s power_transform), it will work.

######################################################
# Duan's smearing, non-parametric approach via residuals

# Can make this general for any function inside of 
# TransformedTargetRegressor
f = tt.get_params()['func']              #function
inv_f = tt.get_params()['inverse_func']  #and inverse function

# Non-parametric approach, approximate via residuals
# Using numpy broadcasting
log_pred = f(dat['WrongTrans'])
resids = f(dat['y']) - log_pred
resids = resids.values.reshape(1,n[0])
dp = inv_f(log_pred.values.reshape(n[0],1) + resids)
dat['DuanPreds'] = dp.mean(axis=1)

dat[['y','WrongTrans','DuanPreds']].describe()
######################################################

So you can see that the Duan Smeared predictions are looking better, at least the mean of the predictions is much closer to the original.

I’ve intentionally done this example without using train/test, as we know the true answers. But in that case, you will want to use the residuals from the training dataset to apply this transformation to the test dataset.

Note the residual vector does not need to be the same dimension as the predictions for the Duan smearing estimator. So for example, if you have a big data application, you may want to do something like resids = resids.sample(1000) above.

Another nice perk of this is that you can use dp above to get prediction intervals – np.quantile(dp,[0.025,0.975], axis=1).T would give you a 95% prediction interval on the original (linear) scale as well.

Extra, Parametric Estimation

Another approach, which may make sense depending on the application, is instead of using the observed residuals to give a non-parametric estimate, to estimate the distribution of the residuals and then use that distribution to construct an integral estimate of the smeared prediction back on the original scale. In the case of logged regression there is also a closed form solution.

I show how to construct the integral estimator below, again trying to be more general. The integral approach will work for say any box-cox transformation.

######################################################
# Parametric approach, approximating residuals via normal

from scipy.stats import norm
from scipy.integrate import quad

# Look at the residuals again
resids = f(dat['y']) - f(tt.predict(dat[x_vars]))

# Check to make sure that the residuals are really close to normal
# Before doing this
resids.hist(bins=100)

# Fit to a normal distribution 
loc, scale = norm.fit(resids)

# Define integral
def integrand(x,pred):
    return norm.pdf(x, loc, scale)*inv_f(pred - x)

# Pred should be the logged prediction
# -50,50 should be changed if the residuals are scaled differently
def duan_param(pred):
    return quad(integrand, -50, 50, args=(pred))[0]

# This takes awhile to apply to the whole data frame!
dat['log_pred'] = f(tt.predict(dat[x_vars]))
sub_dat = dat.head(100).copy()
sub_dat['DuanParam'] = sub_dat['log_pred'].apply(duan_param)

# Can see that these are very similar to the non-parametric
print( sub_dat[['DuanPreds','DuanParam']].head(10) )
######################################################

And you can see that this normal based approximation works just fine here, since by construction the model residuals are pretty well behaved in my simulation.

It happens to be the case that there is a simpler estimate than the integral approach (which you can see in my notes takes awhile to estimate).

###########
# Easier way, but only applicable to log transform
# https://en.wikipedia.org/wiki/Smearing_retransformation
test_val = np.log(5000000)

# Integral approach
print( duan_param(test_val) ) 

# Approach for just log transformed
mult = np.exp(0.5*resids.var())
print( np.exp(test_val)*mult )
##########

So you can see the integral vs the closed form function are very close:

The small differences could be because the integral is only a numerical estimate (and you can see I did not integrate from negative to positive infinity, but chopped the tails off; I do not know if there is a better function to estimate the integral, or a better general approach here).

It wouldn’t surprise me if there are closed form solutions for Box-Cox transforms as well, but I am not familiar with them offhand. Again, the integral approach (or the non-parametric approach) will work for whatever function you want – the function itself could be whatever crazy/discontinuous function you want. But this parametric Duan’s Smearing approach relies on the residuals being normally distributed. (I suppose you could use some other type of continuous distribution estimate if you have reason to; I have only seen normal distribution estimates in practice though.)

Other Notes

While this focuses on linear regression, I do not think the approach will perform all that badly for other types of models (such as random forests or XGBoost). But for forests it may make sense to simply pull out the individual tree estimates, back-transform them, and take the mean of those back-transformed estimates. I have a different blog post with a function showing how to scoop up the individual predictions from a random forest model.
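
Not my exact prior function, but the gist of pulling out the individual tree predictions and back-transforming them looks something like this sketch (assuming a forest fit on the logged outcome, and reusing the simulated dat/x_vars from earlier):

######################################################
# Sketch: back-transform each tree's prediction, then average
# (assumes the forest was fit on log(y); rf.estimators_ holds the individual trees)
from sklearn.ensemble import RandomForestRegressor

X = dat[x_vars].values
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=50)
rf.fit(X, np.log(dat['y']))

tree_preds = np.stack([t.predict(X) for t in rf.estimators_], axis=1)
dat['ForestMean'] = np.exp(tree_preds).mean(axis=1)  # mean on the original scale
######################################################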

The same applies to any regression model with regularization. But if you want to do this, there are of course alternative models you may consider that are better suited to your end goal of predictions on the linear/original scale.

For example, if you really want prediction intervals, it may make sense to not transform the data, and estimate a quantile regression model at the 5% and 95% quantiles. This would give you a 90% prediction interval.
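
A sketch of that quantile regression approach using statsmodels (on the simulated data from earlier, with no transformation of y):

######################################################
# 90% prediction interval via quantile regression at the 5th/95th percentiles
import statsmodels.formula.api as smf

q05 = smf.quantreg('y ~ x1 + x2', dat).fit(q=0.05)
q95 = smf.quantreg('y ~ x1 + x2', dat).fit(q=0.95)

dat['low90'] = q05.predict(dat)
dat['high90'] = q95.predict(dat)
######################################################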

Another approach that may make sense is to use a different model, such as Poisson regression or negative binomial regression (or some other generalized linear model). Even if your data are not integer counts, you can still use these models! (The values just need to be 0 and above, no negative values.)
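
A sketch of those GLMs in statsmodels (again using the simulated data, which is not integer counts, just positive values):

######################################################
# Poisson and negative binomial GLMs model the mean on the original scale directly
import statsmodels.api as sm

Xc = sm.add_constant(dat[x_vars])
pois = sm.GLM(dat['y'], Xc, family=sm.families.Poisson()).fit()
negb = sm.GLM(dat['y'], Xc, family=sm.families.NegativeBinomial()).fit()

dat['PoisPred'] = pois.predict(Xc)  # no back-transform step needed
######################################################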

That Stata blog suggests using Poisson with robust standard errors, but that is a bad idea if you are really interested in predictions as well (see Gary King’s comment and linked paper). But you can just do negative binomial models in most cases then, and that is a better default than Poisson for many real world datasets.

Reproducible research and code review for journals

Recently I came across two different groups broaching the subject of code reviews, and reproducible research more broadly, for criminal justice. There are certainly aspects of either that make them difficult in the context of peer review. But I am not one to let the perfect be the enemy of the good, so I will lay out the difficulties and give some comments on potential good-enough solutions that still make marked improvements on the current state of affairs in crim/cj research.

Reproducible Research

So what do I mean by reproducible research? Jeromy Anglim on CrossValidated has a good breakdown of the different ways we may apply the term. To some it may mean: if you did a hot spots policing experiment, can I replicate the same crime reduction results in another city?

These replications are important to publish (simply because social science experiments will inevitably have quite a bit of variance), but they are often not what we are talking about when we talk about replication. We are often talking about a goal much smaller in scope – if I give you the exact same data, can you reproduce the tables/figures in the manuscript you used to make your inferences?

One problem that is often the case with CJ research is that we are working with sensitive data. If I do analysis on a survey of a sensitive topic, I often cannot share the data. But I do not believe that should entirely put a spike in the question of reproducible research. I have broken down different levels that are possible in making research more reproducible:

  A. Sharing data and code files to reproduce the paper results
  B. Sharing code files and simulated data that illustrate the results
  C. Sharing the plain-text log files showing the code and results of tables/figures

So I have not seen C proposed anywhere, but it is a dead simple solution that almost everyone should be able to accommodate. It simply involves typing log using "output.txt", text at the top of your Stata file, or OUTPUT EXPORT /PDF DOCUMENTFILE="output.pdf" at the end of your SPSS analysis (or it can be done via the GUI), etc. These are the log/output files used to generate the results you report in the paper, and they typically contain both the commands run as well as the resulting tables. These files can quite easily avoid containing privileged information (in fact they won’t by default most of the time, unless you printed out individual names in a table in intermediate results, for example).

Accomplishing C does take some modicum of wherewithal in terms of writing code, but it is a pretty low bar. So I see no reason why all quantitative analyses could not require at least this step right now. I realize it is not foolproof – a bad actor could go and edit the results (same as they could without this information). But it ups the level of effort needed to manipulate results by quite a bit, and more importantly has the potential to catch the more mundane transcription errors that occur quite frequently.

Sometimes I want more details on the code used, the nature of the data, etc. (Most quasi-experimental designs, for example, can be summed up as: shape your data in a special way and run a particular regression model.) For people like me who care about that, B helps, in that I can see the code front-to-back, can actually go and inspect the shape and values of a particular rectangular dataset, and can see how the code interacts with those objects. The only full-on example of this I am aware of is a recent paper in Nature Behavior that shares the code using simulated data.

B is also very similar to what people do when they release statistical packages alongside their papers. If you release an R package that implements your new fancy technique, even if you can’t share your data, it is really good for people to be able to view the underlying code by itself to understand the technique better, and in turn build on your work. If you publish a new technique without code, it is a crazy ton of work for someone else to replicate it on their own, so most people will not bother.

A is most of the way to the gold standard – sharing both the data and the code used to reproduce the analysis. Both A and B take a significant amount of statistical programming knowledge to accomplish. Most people in our field do not have the skills to write an analysis front-to-back that can run as a series of scripts though. To get to A/B, grad programs in crim/cj need to spend much more time teaching these skills, which is near zero now almost across the board.

One brief thing to mention about A is that the boundary is difficult to define. For example, I share code to reproduce the analysis in my 311 and crime at micro places in DC paper (paper link, code). But it starts from a dataset that has the street units in DC and all of the covariates already compiled. Where did that dataset come from? I created it by compiling many different sources, so the base dataset is itself very difficult to replicate. Again, not letting the perfect be the enemy of the good, I think just starting from your compiled dataset and replicating the tables/graphs in the manuscript is better than letting the fuzzy boundary prevent you from sharing anything.

Code Reviews for Journal Submissions

The hardest part of A is that even after you share your data, some journals want to be able to run the code locally to entirely reproduce your results. So while I have shared data and code (A above) for many papers (see this spreadsheet), they have not been externally vetted by any of those journals. This vetting is the standard in some economics journals now I believe, and I would not be surprised if it is in some poli-sci journals as well. It is a very hard problem though, and requires significant resources from both the journal and the researcher.

The biggest hurdle is that even if you share your data/code, your particular system may be idiosyncratic. You may have different R libraries installed than me. You may have different versions of python packages. I may have used a program on Windows to do some analysis you cannot do on a Mac. You may rely on some paid API I cannot access.

These are often solvable problems, but they take quite a bit of time to work out. A comparable example from my work is what data scientists call ‘going to production’. This often involves taking some analysis I did on my local machine and making it run autonomously on my company’s servers. There are some things that make it more or less difficult than the typical academic situation, but I think it is broadly comparable. Going to production for a project will typically take me 3-6 months at 50% of my time, so maybe something like 300 hours for a lowish-end estimate. And that is just the time from the researcher’s end; from the journal’s end it will also take a significant amount of time to compile everyone’s code and verify the results.

Because of this, I don’t think fully reproducible ‘re-run my code and generate the exact same tables’ vetting is feasible in the current way we do academic research and peer review. But again that is why I list C above – we shouldn’t let the perfect be the enemy of the good.

Validating New Empirical Techniques

The code review above is not really code review in the sense of someone looking at your code and saying it is correct; it is simply asking, can I get the same results as you? You may want peer review to accomplish not only saying whether the work is reproducible, but whether it is valid/correct. There are a few things towards this end I would like to see more often in crim/cj. I realize we are not statistics, so we cannot often ask for formal proofs. But there are simpler things we can do to verify the results. These are the responsibility of the researcher to provide, not something for the reviewer to script up on their own to validate someone else’s work.

One, illustrate the technique using a very simplified example. So for instance, in my p-median patrol areas paper, I show an example of constructing the linear program with only four areas. You should be able to calculate what the result should be by hand, so can verify the correctness of your algorithm. This has the added benefit of being a very good pedagogical way to describe your method.

Two, illustrate the technique on a larger sample of simulated data in which you again know the correct result. For one example of this, I showed how to estimate group based trajectory models using deep learning libraries. Again your model/method should be able to recover the correct result (which you know) given the simulated fake data.

Three, validate the result using real data compared to the current standard. For crime mapping papers, this means comparing forecasts to RTM, to simpler regression models, or simply to prior crime = future crime, on out-of-sample data. Amazingly, many machine learning papers in CJ do not do out-of-sample predictions. If it is an inferential procedure, comparing the results to some other status quo technique is similar – such as showing conformal prediction intervals have smaller widths (so more statistical power) than placebo results for synthetic control designs (at least for that example with state panel level crime data).

You may not have all three of these examples in any particular paper, but I think for very new techniques 1 or 2 is necessary. 3 is often a by-product of the analysis anyway. So I do not believe any of these asks are that onerous. If you have the skills to create some new technique, you should be able to accomplish 1 or 2.

I do not have any special advice in terms of the reviewer’s perspective. When I do code reviews at work, we go line by line and my co-workers give high level design advice – e.g. you should use a config file for this instead of defining it inline, you should turn this block into a function, you should make a class to open/close the database connections, etc. The code reviews do not validate technical correctness, so if I queried the wrong data they wouldn’t know it from the code review. The proof is in the pudding so to speak – if my results are performing really badly in the real world, I know I am doing something wrong. (And the obverse: if my results are on the mark and making money, I am pretty sure I did nothing terribly wrong.)

Because there are not these real world mechanisms to validate code in peer reviewed papers, my suggestions for 1/2/3 are the closest I think we can get in many circumstances. That and simply making your code available will dramatically improve the reproducibility and validity of your research compared to the current status quo in our field.

Geocoding the CMS NPI Registry (python)

So previously I wrote about creating service deserts. I have since found a nicer data source to use for this, the CMS NPI registry. This data file has over 6 million service providers across the US.

Here I will provide an example of using that data to geocode all the pharmacies in Texas, again using the census geocoding API and python.

Chunking up the NPI database

So first, you can download the entire NPI database from here. I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. As some upfront setup, here are the libraries I will be using, and I also set the directory to where my data is located.

###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################

The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in powershell to get the line counts, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.

So what I do here is create a function to read in the csv file in chunks, select only the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the Taxonomy codes to pull out the type of service provider you want. For community pharmacies the code is 3336C0003X, but it is not always in the first Taxonomy slot (some CVS’s have it in the second slot for example). You can see the big list of taxonomy codes here – my criminology friends may be interested in mental health or substance abuse service providers, for other examples.

In addition to the taxonomy code, I always select organizations, not individuals (Entity Type = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?) Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.

######################################################################
# Prepping the input data in chunks

keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
            
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file,chunk_size,keep_cols,row_sub):
    # First lets get the header and figure out the column indices
    header_fields = list(pd.read_csv(file, nrows=1)) # use the file passed in, not the global
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip, 
                     nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat


# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )

# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################

So this ends up returning not quite 6800 pharmacies in all of Texas.

Geocoding using the census API

So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.

######################################################################
# Now prepping the data for geocoding

ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)

#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']

# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING',',', '#']

 
def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = add_new.find(su)  # search the upper-cased string, not the original
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new

# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')

ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################

Next is my function to use the batch geocoding in the census API. Note the census API is a bit finicky – technically it says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks the data up into smaller batch portions and submits them to the API. (A better function would cache intermediate results and wrap all that jazz in a try/except block.)

######################################################################
# This function breaks up the input data frame into chunks
# for the census geocoding api
def split_geo(df, add, city, state, zipcode, chunk_size=500):
    df_new = df.copy()
    df_new.reset_index(inplace=True)
    splits = np.ceil( df.shape[0]/chunk_size)
    chunk_li = np.array_split(df_new['index'], splits)
    res_li = []
    pick_fi = []
    for i,c in enumerate(chunk_li):
        # Grab data, export to csv
        sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
        sub_data.to_csv('temp_geo.csv',header=False,index=False)
        # Geo the results and turn back into df
        print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
        result = cg.addressbatch('temp_geo.csv') #should try/except?
        # May want to dump the intermediate results, e.g.
        #pi_str = f'pickres_{int(i)}.p'
        #pickle.dump( result, open( pi_str, "wb" ) )
        #pick_fi.append(pi_str)
        names = list(result[0].keys())
        res_zl = []
        for r in result:
            res_zl.append( list(r.values()) )
        res_df = pd.DataFrame(res_zl, columns=names)
        res_li.append( res_df.copy() )
        time.sleep(10) #sleep 10 seconds to not get cutoff from request
    final_df = pd.concat(res_li)
    final_df.rename(columns={'id':'row'}, inplace=True)
    final_df.reset_index(inplace=True, drop=True)
    # Clean up csv file
    os.remove('temp_geo.csv')
    return final_df
######################################################################

And now we are on to the final stage: actually running the geocoding function and piping the end results to a csv file. (You can see the final version here.)

######################################################################
# Geocoding the data in chunks

# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)

#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%

# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)

# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
      'Provider Organization Name (Legal Business Name)',
      'Provider First Line Business Practice Location Address',
      'Address','City','State2','Zip5']

final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )

final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################

Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a second round of geocoding using the Google API (which is a bit more forgiving with varied addresses), or perhaps build my own OpenStreetMap geocoder for the US. (Or, if you don’t have too many to review, doing it interactively in ArcGIS is very nice as well if you have access to Arc.)
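
As a lower-effort stopgap before building something from scratch, one option I may try for a second pass on the misses is the public OpenStreetMap Nominatim service via geopy. A hypothetical sketch (mind their usage policy on request rates, and note I am assuming the match column is a boolean flag):

######################################################################
# Sketch: second pass geocoding of the census API misses via geopy/Nominatim
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="pharmacy_geocode_example")  # any descriptive id

misses = final_pharm[final_pharm['match'] == False].copy()
full_add = misses['Address'] + ', ' + misses['City'] + ', TX ' + misses['Zip5']

res = []
for addr in full_add:
    loc = geolocator.geocode(addr)
    res.append((addr, loc.latitude, loc.longitude) if loc else (addr, None, None))
    time.sleep(1)  # stay well under the public rate limit
######################################################################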

The spatial dispersion of NYC shootings in 2020

If you had asked me at the start of widespread Covid lockdown measures what the effect on crime would be, I am pretty sure I would have guessed it would make crime go down – fewer people out and about means fewer interactions that can lead to a crime. That isn’t how it has shaped up though; quite a few places have seen increases in serious violent crime. One of the most dramatic examples is that shootings in NYC doubled, from about 900 in 2019 to over 1800 in 2020. I am going to show how to generate this chart later via some R code, but it is easier to show than to say. NYPD’s open data on shootings (historical, current) goes back to 2006.

I know I am critical on this site of folks overinterpreting crime increases – for example, going from 20 to 35 is pretty weak evidence of an increase given the inherent variance of low-count Poisson data (a Poisson E-test has a p-value of 0.04 in that case). But going from 900 to 1800 is a much clearer signal.
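
(If you want to check numbers like that yourself, recent versions of scipy (1.10+) ship the E-test as poisson_means_test; a quick sketch in python:)

# E-test for a difference in two Poisson counts (scipy >= 1.10)
from scipy.stats import poisson_means_test
print( poisson_means_test(20, 1, 35, 1) )  # should give the ~0.04 p-value mentioned above
# for 900 vs 1800 the difference is so large the p-value is essentially zero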

Jerry Ratcliffe recently posted an R library to do his crime dispersion analysis, so I figured this would be an excellent example use case. The idea behind this analysis is spatial – we know there is a crime increase, but did the increase happen everywhere, or did it just happen in a few locations. Here I am going to use the NYPD shooting data aggregated at the precinct level to test this.

As another note, while I often use micro spatial units of analysis in my work, this method, along with others (such as the SPPT test), is just not going to work out for very low count, very tiny spatial units of analysis. I would suggest offhand to only do this analysis if the spatial units under study have an average of at least 10 crimes per area in the pre time period – which is right about on the mark for the precinct analysis in NYC.

Here is the data and R code to follow along, below I will give a walkthrough.

Crime increase dispersion analysis in R

So first, as some front matter, I load in my libraries (Jerry’s crimedispersion you can install from GitHub via devtools, see his page for an example), and define a function I’ve gone over in a prior blog post of mine as well.

###############################
library(ggplot2)
library(crimedispersion)

# Increase contours, see https://andrewpwheeler.com/2020/02/21/some-additional-plots-to-go-with-crime-increase-dispersion/
make_cont <- function(pre_crime,post_crime,levels=c(-3,0,3),lr=10,hr=max(pre_crime)*1.05,steps=1000){
    #calculating the overall crime increase
    ov_inc <- sum(post_crime)/sum(pre_crime)
    #Making the sequence on the square root scale
    gr <- seq(sqrt(lr),sqrt(hr),length.out=steps)^2
    cont_data <- expand.grid(gr,levels)
    names(cont_data) <- c('x','levels')
    cont_data$inc <- cont_data$x*ov_inc
    cont_data$lines <- cont_data$inc + cont_data$levels*sqrt(cont_data$inc)
    return(as.data.frame(cont_data))
}

my_dir <- 'D:\\Dropbox\\Dropbox\\Documents\\BLOG\\NYPD_ShootingIncrease\\Analysis'
setwd(my_dir)
###############################

Now we are ready to import our data and stack them into a new data frame. (These are individual incident level shootings, not aggregated. If I ever get around to it I will do an analysis of fatality and distance to emergency rooms like I did with the Philly data.)

###############################
# Get the NYPD data and stack it
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Year-To-Date-/5ucz-vwe8
# And https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8
# On 2/1/2021
old <- read.csv('NYPD_Shooting_Incident_Data__Historic_.csv', stringsAsFactors=FALSE)
new <- read.csv('NYPD_Shooting_Incident_Data__Year_To_Date_.csv', stringsAsFactors=FALSE)

# Just one column off
print( cbind(names(old), names(new)) )
names(new) <- names(old)
shooting <- rbind(old,new)
###############################

Now we just want to do aggregate counts of these shootings per year and per precinct. So first I substring out the year, then use table to get aggregate counts in R, then make my nice time series graph using ggplot.

###############################
# Create the current year and aggregate
shooting$Year <- substr(shooting$OCCUR_DATE, 7, 10)
year_stats <- as.data.frame(table(shooting$Year))
year_stats$Year <- as.numeric(as.character(year_stats$Var1))
year_plot <- ggplot(data=year_stats, aes(x=Year,y=Freq)) + 
             geom_line(size=1) + geom_point(shape=21, colour='white', fill='black', size=4) +
             scale_y_continuous(breaks=seq(900,2100,by=100)) +
             scale_x_continuous(breaks=2006:2020) +
             theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
                   panel.grid.minor = element_blank()) + 
             ggtitle("NYPD Shootings per Year")

year_plot
# Not quite the same as Petes, https://copinthehood.com/shooting-in-nyc-2020/
###############################

Part of the reason I do this is not that I don’t trust Pete’s analysis, but that I don’t want to embed pictures from someone else’s website! So I wanted to recreate the time series graph myself. Next up we need to do the same aggregating, but not for the whole city – by each precinct. You can use the same table method again, simply passing in additional columns. That gets you the data in long format, so then I reshape it to wide for later analysis (so each row is a single precinct and each column is a yearly count of shootings). (Note there have been some splits in precincts over the years IIRC; I don’t worry about that here, it will just cause those precincts to be 0,0 in the 2019/2020 data I look at.)

###############################
#Now aggregating to year and precinct
counts <- as.data.frame(table(shooting$Year, shooting$PRECINCT))
names(counts) <- c('Year','PCT','Count')
# Reshape long to wide
count_wide <-  reshape(counts, idvar = "PCT", timevar = "Year", direction = "wide")
###############################

And now we can give Jerry’s package a test run, where you just pass it your variable names.

# Jerrys function for crime increase dispersion
output <- crimedispersion(count_wide, 'PCT', 'Count.2019', 'Count.2020')
output

The way to understand this is that, in a hypothetical world in which we could reduce shootings one precinct at a time, we would need to reduce shootings in 57 of the 77 precincts to bring 2020 shootings back down to 2019 levels. So this suggests very widespread increases; it isn’t just concentrated among a few precincts.

Another graph I have suggested to explore this, while taking into account the typical variance with Poisson count data, is to plot the pre crime counts on the X axis, and the post crime counts on the Y axis.

###############################
# My example contour with labels
cont_lev <- make_cont(count_wide$Count.2019, count_wide$Count.2020, lr=5)

eq_plot <- ggplot() + 
           geom_line(data=cont_lev, color="darkgrey", linetype=2, 
                     aes(x=x,y=lines,group=levels)) +
           geom_point(data=count_wide, shape = 21, colour = "black", fill = "grey", size=2.5, 
                      alpha=0.8, aes(x=Count.2019,y=Count.2020)) +
           scale_y_continuous(breaks=seq(0,140,by=10)) +
           scale_x_continuous(breaks=seq(0,70,by=5)) +
           coord_cartesian(ylim = c(0, 140)) +
           xlab("2019 Shootings Per Precinct") + ylab("2020 Shootings")
eq_plot
###############################

The contour lines show the hypothesis that crime increased (by around 100% here). So if a point is near the middle line, it follows that doubling mark almost exactly. The upper/lower lines indicate the typical Poisson variance, and you can see they are a very good fit to the data here – very few points are outside the boundaries.

Both of these analyses point to the fact that shooting increases were widespread across NYC precincts. Pretty much everywhere doubled in the number of shootings; it is just that some places had a larger baseline to double than others (and the data have some noise – you can pick out some places that did not increase if you cherry-pick).

And as a final R note, if you want to save these graphs as a nice high resolution PNG, here is an example with Jerry’s dispersion object:

# Saving dispersion plot as a high res PNG
png(file = "ODI.png", bg = "transparent", height=5, width=9, units="in", res=1000, type="cairo")
output #this is the object from Jerrys crimedispersion() function earlier
dev.off()

Going forward I am wondering if there is a good way to do spatial monitoring for crime data like this – some sort of control chart that takes into account both space and time. So instead of a retrospective recap a year later, it would identify spatial increases in near real time.

Other References of Interest

  • Justin Nix & company have a few blog posts looking at NYC data as well. In the first they talk about the variance across cities – many are up, but several are down in violence as well. A later post updated the picture with the clear increase in shootings in NYC.
  • There are too many papers at this point for me to do a bibliography of all the Covid and crime updates, but for two open examples, Matt Ashby did a paper on several US cities, and Campedelli et al. have an analysis of Chicago. Each again shows variance – no universal up or down trend, but various examples of increases or decreases both between cities and between different crime types within a city.

Filled contour plot in python

I’ve been making a chart that looks similar to this for a few different projects at work, so I figured a quick blog post showing my notes on it would be useful.

So people often talk about setting a decision threshold to turn a predicted probability into a binary yes/no decision. E.g. do I do some process to this observation if the probability is 20%, 30%, 60%, etc. If you can identify the costs and benefits of making particular decisions, you can set a simple threshold to make that decision. For example, say you are sending adverts in the mail for a product. If the person buys the product, your company makes $50, and the advert only costs $1 to send. In this framework, if you have a predictive model for the probability the advert will be successful, then your decision threshold will look like this:

$50*probability - $1

So in this case you need the predicted probability to be above 2% to have an expected positive return on the investment of sending the advert. If you have a probability of 10% for 2000 customers, you would expect to make 2000 * (50*0.1 - 1) = $8000. The probabilities you get from your predictive model can be thought of as long run averages. Any single advert may be a bust, but if your model is right and you send out a bunch, you should make this much money in the end. (If you have a vector of varying probabilities, in R code the estimated revenue will then look like prob <- runif(2000,0,0.1); pover <- prob > 0.02; sum( (50*prob - 1)*pover ).)

But many of the decisions I work with do not have a single number in the benefits column. I am working with medical insurance claims data at HMS, often building models to audit those claims in some way. In this framework, it can be more important to audit a high dollar claim than a lower dollar claim, even if the higher dollar claim has a lower probability. So I have been making the filled contour plot I show in the next section to illustrate this.

python contour plot

The code snippet is small enough to just copy-paste entirely. First, I generate data over a regular grid to illustrate different probabilities and claim amounts. Then I use np.meshgrid to get the data in the right shape for the contour plot. The revenue estimates are then simply the probability times the claim amount, minus some fixed cost (often the labor to audit the claim). After that it is just idiosyncratic matplotlib code to make a nice filled contour.

# Example of making a revenue contour plot
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np

n = 500 #how small grid cells are
prob = np.linspace(0,0.5,n)
dollar = np.linspace(0,10000,n)
#np.logspace(0,np.log10(10000),n) #if you want to do logged

# Generate grid
X, Y = np.meshgrid(prob, dollar)

# Example generating revenue
fixed = 200
Rev = (Y*X) - fixed

fig, ax = plt.subplots()
CS = ax.contourf(X, Y, Rev, cmap='RdPu')
clb = fig.colorbar(CS)
#clb.ax.set_xlabel('Revenue') #Abit too wide
clb.ax.set_title('dollar') #html does not like the dollar sign
ax.set_xlabel('Probability')
ax.set_ylabel('Claim Amount')
ax.yaxis.set_major_formatter(StrMethodFormatter('${x:,.0f}'))
plt.title('Revenue Contours')
plt.xticks(np.arange(0,0.6,0.1))
plt.yticks(np.arange(0,11000,1000))
plt.annotate('Revenue subtracts $200 of fixed labor costs',
(0,0), (0, -50),
xycoords='axes fraction',
textcoords='offset points', va='top')
#plt.savefig('RevContour.png',dpi=500,bbox_inches='tight')
plt.show()

The color bar works nicely here out of the box. Next up in my personal learning will be how to manipulate color bars a bit more. Here I may want to either use a mask to not show negative expected returns, or a diverging color scheme (e.g. blue for negative returns).
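
For the diverging option, here is a sketch of what I mean, centering the colormap at zero revenue via TwoSlopeNorm (or you could mask out the negative region entirely with a masked array):

# Diverging colormap centered at zero revenue (negative returns show up blue)
from matplotlib.colors import TwoSlopeNorm

fig, ax = plt.subplots()
norm = TwoSlopeNorm(vmin=Rev.min(), vcenter=0, vmax=Rev.max())
CS2 = ax.contourf(X, Y, Rev, cmap='RdBu_r', norm=norm)
fig.colorbar(CS2)
plt.show()

# Or simply do not draw the negative region at all
RevPos = np.ma.masked_less(Rev, 0)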

Buffers and hospital deserts with geopandas

Just a quick blog post today. As a bit of a side project at work I have been looking into medical service provider deserts. Most people simply use a geographic cutoff of, say, 1 mile (see Wissah et al., 2020 for an example on pharmacy deserts). Also, for CJ folks, John Hipp has done some related work on parolees being nearby service providers (Hipp et al., 2009; 2011), measuring nearby as 2 miles.

So I wrote some code to calculate nice sequential buffer areas and dissolve them in geopandas. Files and code to showcase this are here on GitHub. First, as an example dataset, I geocode (using the census geocoding API) CMS certified Home Healthcare facilities, so these are hospice facilities. To see a map of those facilities across the US (you can click on a point to get info), go here: CMS HOME FACILITY MAP. Below is a screenshot:

Next I generate sequential buffers of 2, 4, 8, and 16 kilometers, and then the leftover area (just for Texas). So you can zoom in, and darker areas are at a higher risk of not having a hospice facility nearby. HOSPICE DESERT MAP
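
The gist of the buffer/dissolve code is only a few lines. Here is a stripped down sketch of the idea (the fac GeoDataFrame of facilities is hypothetical, and I am assuming it is projected to a CRS with units in meters):

# Sketch: sequential (donut) buffers around facilities, dissolved into single rings
import geopandas as gpd

# fac = gpd.read_file('facilities.shp').to_crs(epsg=...)  # projected, meters
dists_km = [2, 4, 8, 16]

rings = []
prev = None
for d in dists_km:
    buff = fac.buffer(d*1000).unary_union   # dissolve all facility buffers into one shape
    ring = buff if prev is None else buff.difference(prev)  # donut between distances
    rings.append({'dist_km': d, 'geometry': ring})
    prev = buff

ring_gdf = gpd.GeoDataFrame(rings, geometry='geometry', crs=fac.crs)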

Plotting some of these in Folium was giving me fits, so I will need to familiarize myself with that more in the future. The buffers for the full US were also giving me trouble (these just for Texas result in fairly large files; I am surprised GitHub doesn’t yell at me for them being too big).

Going forward, what I want to do, instead of relying on a fixed distance cutoff, is to fit a model to identify individuals’ probability of going to the doctor based on distance. So instead of just saying 1+ mile and you are at high risk, fit a function that defines that distance based on behavioral data (maybe using insurance claims). Also I think the distances matter quite a bit for urban/rural and car/no-car. For rural folks traveling a mile is not a big deal, since you need a car to really do anything in rural areas. But for folks in the city relying on public transportation, going a mile or two is a bigger deal.

The model then would be similar to the work I did with Gio on gunshot death risk (Circo & Wheeler, 2020), although I imagine the model would spatially vary (so maybe geographically weighted regression may work out well).

References

A Tableau walkthrough: Seasonal chart

So my workplace uses Tableau quite a bit, and I know it is becoming pretty popular in crime analysis units as well. So I was interested in trying to pick it up. It can be quite daunting though – I’ve tried to sit through a few general tutorials, but they make my head spin.

Students of mine, when I teach ArcGIS, have said there are so many buttons it can be overwhelming, and Tableau is much the same way. I can see the appeal of it though, in particular for analysts who exclusively use Excel. With the drag/drop interface you can somewhat intuitively build more detailed charts that are difficult to put together in Excel. And of course out of the box it produces interactive charts you can share, which is really the kicker that differentiates Tableau from other tools.

So instead of sitting through more tutorials I figured I would just jump in and make a few interactive graphics. And along the way I will do tutorials, same as for my other crime analysis labs, for others to follow along.

And I’ve finished/posted my first tutorial, making a seasonal chart. It is too big to fit into a blog post (over 30 screenshots!), but it shows how to make a monthly seasonal chart, which is a nice interactive to have for Compstat-like meetings.

Here is the final interactive version, and here is a screenshot of the end result:

And you can find the full walkthrough with screenshots here:

TABLEAU SEASONAL CHART TUTORIAL/WALKTHROUGH

Some Things Crime Analysts Should Consider When Using Tableau

So first, I built this using the free version of Tableau. I don’t think the free version will cut it though for most crime analysts.

One of the big ways I see Tableau being convenient is as a visualization layer on top of a database. It can connect to a live database, and so automatically update. You cannot do this though with the free version. (And you will likely need some SQL chops to create views for data in formats you can’t figure out how to coerce with Tableau functions.)

So if you go through the above tutorial and say that is a lot of work, well it is, but you can set it up once on a live data stream and it just works going forward.

The licensing isn’t crazy though, and if you are doing this for data that can be shared with the public, I think that can make sense for crime analysts. For detailed report info that cannot be shared with the public, it is a bit more tricky though (and I definitely cannot help with the details of running your own on-prem server).

There are other totally free interactive dashboard-like options as well, such as Shiny in R and the plotly libraries (in both R and Python); Python has a few other interactive options too. The hardest part really is the server portion for any of them (making it so others can see the interactive graphic). Tableau is nice and reactive though in my experience, even when hooked up to a live data stream (but not crazy big data).
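To give a flavor of one of those free options, here is a minimal plotly sketch of the same kind of monthly seasonal chart (the counts here are simulated, not data from the tutorial):

import numpy as np
import pandas as pd
import plotly.express as px

# simulated monthly crime counts over several years
rng = np.random.default_rng(1)
dates = pd.date_range('2017-01-01', '2020-12-01', freq='MS')
df = pd.DataFrame({'date': dates, 'crimes': rng.poisson(100, len(dates))})
df['year'] = df['date'].dt.year.astype(str)
df['month'] = df['date'].dt.month

# one line per year, months on the x axis
fig = px.line(df, x='month', y='crimes', color='year',
              title='Monthly seasonal chart (simulated counts)')
fig.show()  # or fig.write_html('seasonal.html') to share the interactive chart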

I hope to expand this to my example Poisson z-score charts with error bands, and then maybe see if I can build a dashboard with some good cross-linking between panes with geo data.
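For reference, one common way to set up such a z-score chart (I am not spelling out the exact formula I use here, so take this as an assumption) is z = (observed - expected)/sqrt(expected), with +/-2 reference bands. A quick sketch with simulated counts:

import numpy as np
import pandas as pd

# simulated monthly counts
rng = np.random.default_rng(2)
counts = pd.Series(rng.poisson(50, 36),
                   index=pd.date_range('2018-01-01', periods=36, freq='MS'))

# baseline expectation: prior 12-month average
expected = counts.rolling(12).mean().shift(1)
z = (counts - expected) / np.sqrt(expected)

chart = pd.DataFrame({'count': counts, 'expected': expected,
                      'z': z, 'lower': -2, 'upper': 2})
print(chart.dropna().head())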

For this example I am almost 100% happy with the end result. One thing I would like is for the hover behavior to select the entire line (but for the tooltips to still show individual months). I would also like the point at the very end to be larger and to not show the label. But these are very minor things in the end.

My online course lab materials and musings about online teaching

I often refer folks to the courses I have placed online. Just as an update for everyone, if you look at the top of my website, I have pages for each of my courses in the header. Several of these are just descriptions and syllabi, but for the few lab-based courses I have done over the years I have put my materials entirely online. So those are:

And each of those pages links to a GitHub page where all the lab goodies are stored.

The seminar in research focuses on popular quasi-experimental designs in CJ, and has code in R/Stata/SPSS for the weekly lessons. (I will need to update it with Python; I may need to write my own Python margins library though!)

Grad GIS is mostly old ArcGIS tutorials (I don’t think I will update them for ArcGIS Pro; I will see when Eric Piza’s new book comes out and probably just suggest that). Even though the screenshots are perhaps dated at this point, the ideas/workflow are not. (It also has some tutorials on other open source tools; CrimeStat, Jerry’s Near Repeat Calculator, GeoDa, spatial regression analysis in R, and Malleson & Andresen’s SPPT tool are examples I remember offhand.)

Undergrad Crime Analysis is mostly focused on number crunching relevant to crime analysts in Excel, although it has a few things in Access (making SQL queries) and making a BOLO in Publisher.

So for folks self-learning of course use those resources however you want. My suggestion is to skim through the syllabus, see if you want to learn about any particular lesson, and then jump right to that one. No need to slog through the whole course if you are just interested in one specific thing.

They are also freely available to any instructors who want to adapt those materials for their own courses.


One of the things that has disappointed me about the teaching response to Covid is instead of institutions taking the opportunity to really invest in online teaching, people are just running around with their heads cut off and offering poor last minute hybrid courses. (This is both for the kiddos as well as higher education.)

If you have ever taken a Coursera course, they are a real production! And the ones I have tried have all been really well done: nice videos, interactive quizzes with immediate feedback, etc. A professor on their own though cannot accomplish that; we would need investment from the University in filming and in scripting the webpage. But once it is finished, it can be delivered to the masses.

So instead of running courses with a tiny number of students, I think it makes more sense for Universities to actually pony up resources to help professors make professional looking online courses. Not the nonsense with a bad recorded lecture and a discussion board. It is IMO better to give someone a semester sabbatical to develop a really nice online course than make people develop them at the last minute. Once the course is set up, you really only need to administer the course, which takes much less work.

Another interested party may be professional organizations. For example, the American Society of Criminology could make an ad-hoc committee to develop a model curriculum for an intro criminology course. You can see in my course pages I taught this at one point – there is no real reason why every criminology teacher needs to strike out on their own. This is both more work for the individual teacher and introduces quite a bit of variation in the content that crim/cj students receive.

Even if ASC started smaller, say promoting individual lessons, that would be lovely. Part of the difficulty in teaching a broad course like Intro to Criminology is that I am not an expert on all of criminology. So for example if someone made a lesson plan/video for bio-social criminology, I would be more apt to use that. Think instead of a single textbook, leveraging multi-media.


It is a bit ironic, but one of the reasons I was hired at HMS was to internally deliver data science training. So even though I am in the private sector I am still teaching!

Like I said previously, you are on your own for developing teaching content at the University. There is very little oversight. I imagine many professors will cringe at my description, but one of the things I like at HMS is the collaboration in developing materials. So I initially sat down with my supervisor and project manager to develop the overall curricula. Then for individual lessons I submit my slides/lab portion to my supervisor for feedback, and also do a dry run in front of one of my peers on our data science team. Then in the end I do a recorded lecture – we limit it to something like 30 people on WebEx so it is not lagging, but ultimately everyone in the org can access the video recording at a later date.

So again I think this is a better approach. It takes more time, and I only do one lecture at a time (so it takes a month or two to develop one lecture). But I think that in the end this will be a better long-term investment than the typical way universities deliver courses.