ptools R package

It has been on my bucket list for a while, so I took the time to learn how to construct an R package (a python package is on the list as well). So I crafted a package with only a few functions in it so far, ptools, short for Poisson tools.

These are a handful of functions I have blogged about over the years, including the WDD test and the variants I have discussed (weighted harm scores, different time periods, and different area sizes).

There are also functions for small sample counts in bins (which can be used for Benford's test); my original application was checking whether a chronic offender had a propensity to commit crimes on certain days of the week.

The package also includes the Poisson e-test, a function to check whether a distribution is Poisson distributed, and two more Poisson related functions as well.

I think I will add quite a few more functions to the soup before I bother submitting to CRAN. (Installing via devtools from github is quite easy, so I do not feel too bad about that.) If you have functions you think I should add just let me know. (Or just make a pull request and add them yourself!) I also need to work on unit tests and getting github actions set up. I will probably crunch on this for a bit, and then shift my personal projects back to creating some python libraries for my other work.

I do not use RStudio, but the open R Packages book has been immensely helpful. On my Windows box I had to add R to my system path so I can start my R session in the appropriate directory, but besides that very minor hassle it has been quite easy to follow.


I probably have not put in the 10k total hours guesstimate for mastery in computer programming. I think maybe closer to 5,000, and that is spread out (maybe quite evenly at this point) between python, R, and SPSS (and just a little Stata). And I still learn new stuff all the time. Being in an environment where I need to work with more people has really hammered home getting environments right, and making them shareable with other teammates. And part and parcel with that is documenting code in a much more thorough manner than most of the code snippets I leave littered on this blog.

So it probably is worth me posting less, but spending more time making nicer packages to share with everyone.

I do not really know how folks manage environments when doing R programming for packages. I know a little at this point about creating separate conda environments for python to provide some isolation – is there something equivalent to conda environments for R? Do the R CMD checks make this level of isolation unnecessary? Should I just be working in an isolated docker image for all development work? I do not know. I do not have to worry about that at the moment though.


Part of this self-learning journey is because I am trying to start a journal aimed at criminologists where you can submit software packages. Similar to the Journal of Open Source Software or the Journal of Statistical Software, etc. For submissions there I want people to have documentation for functions, and really that necessitates having a nice package (whether in R or python or whatever). So I can't tell people they need to make a package if I don't do that myself!

The software papers are not a thing yet (I would call it a soft launch at this point), but I have been bugging folks about submitting papers to get a dry run of the process. If you have something you would like to submit, feel free to get in touch and we can get you set up.

Similarities between crime and health insurance data

One of the things I was mildly worried about when making the jump to the private sector was that the knowledge I had built up from my work in crime analysis over the years would not be transferable. I had basically 10+ years of experience working with crime data (directly as a crime analyst at Troy, as a research analyst at the Finn Institute, and in other collaborations with PDs).

PDs all basically have a similar records management set up. Typical tables are CAD, incident reports, arrests, charges, etc. PDs will have somewhat different fields – but the way they all relate to each other is very similar.

Because the company I work for now aggregates health insurance claims from multiple insurance agencies it is a bit more complicated, but the way people analyze health insurance claims is, in broad strokes, similar to how people analyze crime data. Below are my musings on that front.

Classifying Events: UCR vs DRG

Historically the predominant way people classify what type of crime occurs in a particular incident is via the Uniform Crime Report (UCR) hierarchy. Imagine a crime incident in which someone breaks into a house (burglary), and then also assaults the individual within the home (aggravated assault). When we count these crimes for reporting purposes, we typically take ‘the top charge’, and analyze the event strictly as an assault.

Inpatient health insurance claims (when someone goes to a hospital) have a somewhat unifying classification, Diagnosis Related Groups, DRG for short. Unlike UCR for general crime reporting though, these are used to bill insurance claims. The idea is that instead of itemizing your hospital bill, insurance companies broadly compensate according to the DRG. This purportedly discourages tacking on extra medical procedures, although it brings some other problems with it instead (see the later section in this post on discretion).

Unlike the UCR, DRGs have quite a few more categories; check out the APR DRG weights for New York State for example. The APR DRG also includes a severity category. This I think would be a neat idea for crime incidents – it is somewhat codified in penal laws, but not so much in typical crime reporting. It is somewhat accomplished by folks creating harm weights for crimes (e.g. Ratcliffe, 2015). (There is also a second major DRG system used by insurance agencies here in the states, the MS-DRG. Having multiple common ways to group events is not an idea worth taking from medical records!)

One major difference between crimes and health insurance claims is ICD codes. One insurance claim can have multiple ICD codes. For example a claim with an APR DRG of 161 could have ICD codes for:

  • I214: Heart Attack
  • E119: Diabetes
  • I2510: Heart Disease
  • E785: High Cholesterol

So there is a mix of chronic conditions that, for billing purposes, can modify the severity of the claim, but that are not directly related to the current claim/incident/hospital stay.

This could be a neat idea for crime records – say a domestic incident happens, and there is a field to record prior history of domestic incidents. I can see how that would be useful both in the immediate term for the officer handling the call, as well as for an analyst crunching numbers/trends. That being said, ICD codes are crazy in their specificity, so that level of detail is not something to emulate.

You could also maybe do some other crunching to create your own crime categories based on the individual crime types, see for example Kuang et al. (2017). This is sort of like creating your own DRG for crimes.

Aggregate vs Individual

The point of creating high level groupings is to aggregate multiple events together. In policing, UCR statistics are commonly used to evaluate crime trends over time. Health insurance claims are typically not used for monitoring disease outcomes – since there isn’t any standardized location where they are all collated, it would be pretty difficult to use them in that manner for the general population.

But overall aggregate statistics pooling claims from particular healthcare providers (e.g. hospitals) are sometimes used for different reimbursement policies. For example, MIPS is intended as a metric for healthcare providers to promote value based care (Liao & Navathe, 2021), as is the CaseMix system (Steinbusch et al., 2007). If you checked out the prior APR DRG list I linked to, you can see they have weights, and higher weights have higher standard billing. The idea behind CaseMix is that if a provider takes on many high weight cases, they get a modifier that ups the weights/billing by a certain percent.

You could maybe consider MIPS to be similar to agencies that give PDs scorecards, aggregating many different metrics together. I would rather look at individual metrics though, such as the funnel chart example I give for monitoring use of force. I don’t see much point in aggregating different metrics all together into one final score.

Currently in policing many agencies are migrating from the UCR system, which is just an aggregate tally of events, to NIBRS, which is a database that reports individual events (Kaplan, 2021a, 2021b).

Discretion

Police departments and health care providers (the ones creating the incidents/claims) both have discretion. PDs often want to downgrade the severity of crime incidents, see Thomas and Wolff (2021) for example. Health providers have incentives going the other way though: to upcode claims to increase insurance payouts (Farbmacher et al., 2020). Some claims are fuzzier than others; CPT codes that determine a doctor’s time on a particular office visit are one good example, since doctors can just claim they spent larger amounts of time on the visit (Brunt, 2011).

Like I said previously, health insurance claims are not typically used to monitor overall health outcomes, so non-reporting is not something people really worry about (although researchers should be cognizant of non-reporting if they are using insurance claims for, say, policy analysis). The dark figure of crime though is a perpetual threat to the validity of interpreting crime trends.

Health insurance claims have a somewhat opposite problem – submitting claims for events that did not actually happen. One example of this is ambulance ghost rides, ambulances billing for trips that appear to not have occurred at all (Sanghavi et al., 2021).

Similar to crime events, these reporting/claim errors can either be the result of unintentional accidents, or they can be malicious. Oftentimes, even if you know in retrospect that something was in error, it can be difficult or impossible to tell the difference between the two scenarios.

The big difference is $$

The scale of healthcare insurance in the US is massive. Because of this, there is a market to audit these health insurance claims. For example, Georgia is likely to recover nearly half a billion in medical overpayments for the past year. Some of the work I am doing at HMS is related to using machine learning to identify these overpaid Medicare claims. My work is spread across multiple states, but I have easily identified over 8 digits of medical overpayments based on that work in the past year.

There is nothing equivalent to this for policing. There is no monetary incentive for individuals to audit how crime complaints are handled/recorded/resolved.

I wonder, if there were such a market, how different criminal justice would look in the United States. For example, say you had victimization insurance, and detectives worked for the insurance agencies instead of the public sector. This could maybe improve clearance rates, but of course would place more economic burdens on individuals to be insured. That is pure speculation though.


Solving the P-Median model

Wendy Jang, my former student at UTD and now Data Scientist for the Stanislaus County Sheriff’s Dept., writes in:

Hello Andy,

I have some questions about your posting in GitHub. Honestly, I am not good at Python at all. I was playing with your python code and trying to understand the work flow. Then, I can pretty much mimic what you did; however, I encountered some errors along the way.

I was able to follow through lines but then I am stuck on PMed function. Do you know what possibly caused this error? See the screenshot below. Please advise. Thanks!

Specifically, Wendy is asking about my patrol redistricting example with workload inequality constraints. Wendy’s problem here is specifically that she likely does not have CPLEX installed. CPLEX is free for academics, but unfortunately costs a bit of money for anyone else (not sure if they have cheaper licensing for the public sector; it looks like currently it is about $200 a month).

This was a good opportunity to update some of my code, so now see the two .py files in the DataCreated folder; in particular 01_pmed_class.py now wraps the model in a nice class. I have additionally added in functionality to create a map, and more importantly eliminate subtours (my solution can potentially return disconnected areas). I have also added the ability to use different solvers.

For this problem CPLEX just works much better (it takes around 10 minutes), but I was able to get the SCIP solver to return the same solution in around 5 hours. GLPK did not return a solution even after letting it churn for over 12 hours. I tested CBC for shorter time periods, but that appears to be a no-go as well.

So just a brief overview, the libraries I will be using (check out the end of the post for setting up the conda environment):

import pickle
import pulp
import networkx
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import geopandas as gpd

class pmed():
    """
    Ar - list of areas in model
    Di - distance dictionary organized Di[a1][a2] gives the distance between areas a1 and a2
    Co - contiguity dictionary organized Co[a1] gives all the contiguous neighbors of a1 in a list
    Ca - dictionary of total number of calls, Ca[a1] gives calls in area a1
    Ta - integer number of areas to create
    In - float inequality constraint
    Th - float distance threshold used to create decision variables
    """
    def __init__(self,Ar,Di,Co,Ca,Ta,In,Th):

I do not copy-paste the entire pmed class in the blog post. It is long, but it is a rewrite of my model from the paper. Next, to load in the objects I need to fit the model (which were created/pickled in the 00_PrepareData.py file):

# Loading in data
data_loc = r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataCreated\fin_obj.pkl'
areas, dist_dict, cont_dict, call_dict, nid_invmap, MergeData = pickle.load(open(data_loc,"rb"))

# shapefile for reporting areas
carr_report = gpd.read_file(r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataOrig\ReportingAreas.shp')

And now we can create a pmed object, which has all the info necessary, and gives us a print out of the total number of decision variables and constraints.

# Creating pmed object
pmed12 = pmed(Ar=areas,Di=dist_dict,Co=cont_dict,Ca=call_dict,Ta=12,In=0.1,Th=10)

Writing the class this way allows the user to input their own solver, and you can see it prints out the available solvers on your current machine. So to pass in the SCIP solver you could do pmed12.solve(solver=pulp.SCIP_CMD()), which again for this problem does find the solution, but takes 5 hours. So here I still stick with CPLEX to just illustrate the new class's functionality:

pmed12.solve(solver=pulp.CPLEX(timeLimit=30*60,msg=True))     # takes around 10 minutes

This prints out a ton of information at the command line that you do not get if you run from within Jupyter notebooks. So for debugging that is much more useful. timeLimit is in seconds, but does not include the presolve phase I believe, so some of these solvers can just get stuck on the presolve forever.
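As an aside, if you want to check directly which solver interfaces pulp can see on your machine (the same information the class prints out), recent versions of pulp have a helper for that. A minimal sketch:

# Listing the solver interfaces pulp supports, and the ones actually available locally
import pulp
print(pulp.listSolvers())                    # all solver interfaces pulp knows about
print(pulp.listSolvers(onlyAvailable=True))  # only the ones installed on this machine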

We can look at the results in a nice map if you do have geopandas installed and pass it a geopandas data frame and the key to match it on:

pmed12.map_plot(carr_report, 'PDGrid')

This map shows the new areas in different colors with a thick border, and the source area as a dot with a white outline. So here you can see that we end up having one disconnected area – a subtour in optimization parlance. I have added some methods to deal with this though:

stres = pmed12.collect_subtours()

And you can see that for one of our areas it identified a subtour. I haven’t written this to automatically resolve subtours due to the potential length of the solving time. So here we can see the subtours have 0 calls in them. You can assign those areas wherever and it does not change the objective function. So that is what I did in the paper, and I show how in the Jupyter notebook on the main page of the github project.

So if you are using a solver that takes 5 hours, I would suggest you manually fix these subtours if there is a solution that is obvious to the human eye. But here with the shorter solve time with CPLEX I can rerun the algorithm with a warm start, and it runs even faster (under 5 minutes). The .collect_subtours() method adds subtour constraints into the pulp model, so you just need to redo the solve() method to eliminate that particular subtour (I do not know if this is guaranteed to converge for all problems though).

So here in this data eliminating that one subtour results in a solution with no subtours:

pmed12.solve(solver=pulp.CPLEX_CMD(msg=False,warmStart=True))
stres = pmed12.collect_subtours()

And we can replot the result, which shows it chooses the areas that you would have assigned by eye anyway:

pmed12.map_plot(carr_report, 'PDGrid')

So again for this problem, if you have CPLEX I would recommend it (I have not tried Gurobi). But at least for this particular dataset SCIP was able to solve the problem in 5 hours. So if you are a crime analyst or someone else without academic access to CPLEX, you can install the SCIP solver and give that a go for your actual data.

Note that this is based off of results for 325 subareas. So if you have more subareas it will take longer (and if you have fewer it may be quite a bit shorter).

Setting up the Conda environment

So once you have python installed, you can typically do something like:

pip install pulp

Or via conda:

conda install pyscipopt

I often have trouble though, especially when working with the python geospatial libraries, installing geopandas, fiona, etc. So here what I do is create a new conda environment:

conda create -n linprog
conda activate linprog
conda install -c conda-forge python=3 pip pandas numpy networkx scikit-learn dbfread geopandas glpk pyscipopt pulp

And then you can run the pmedian code above in this environment. I suppose I should turn this into a python package, and I see a bunch of folks these days are making docker images as well for their packages with complicated environments. This environment is actually not that bad to set up, although geopandas makes things a bit tricky.

Creating automated reports using python and Jupyter notebooks

Deborah Osborne had a chat with Chris Bruce the other day about general crime analysis, and they discussed the regular reading of reports. I did this as well when I worked at Troy, New York as the lone analyst. I would come in and skim the ~50 reports for the prior day.

The chief and mayor wanted a breakdown of particular noteworthy events, so I would place my own notes in a spreadsheet and then make a daily report. My set up was not fully automated but close – I had a pretty detailed HTML template, and once my daily data was inputted, I would run an SPSS script to fill in the HTML. I also did a simple pin map in BatchGeo (the one part that was not automated) and inserted it into the report.

I had two other quite regular reports I would work on. One was a weekly command staff report about overall trends and recent upticks, the other was a monthly Compstat meeting going over similar stats. (I also had various other products to release – detective assignments/workload, sending aggregate stats to the Albany Crime Analysis Center.)

If I had to do these again knowing what I know now, I would automate nearly 100% of this work in python. For the reports, I would use jupyter notebooks. (I actually do not like coding in these very much, I much prefer plain text IDEs, but they are good for generating nice looking reports, as I will show.)

Making Reports in Jupyter Notebooks

I have provided the notes to fully automate a simple report here on Github. To replicate, first you need to download the Dallas PD open data and create a local sqlite database (I can’t upload that large of a file to github).

So first before you start, if you download the .py files, you can run at the command prompt something like:

cd D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\jupyter_reports
python 00_CreateDB.py

Just replace the cd path with wherever you saved the files on your local machine (and this assumes you have Anaconda installed). Then once that is done, you can replicate the report locally, but it is really meant as a pedagogical tool – you can see how I wrote the jupyter notebook to query the local database. For your own case you would write the SQL code and connection to wherever your local crime data is stored.

Here is an example of how you can use string substitution in python to create a query that is date aware for when the code is run:
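As a rough sketch of what I mean (the table and field names here are hypothetical, not the exact ones in the notebook), you can build the date logic in python and substitute it into the SQL string:

# Sketch of a date aware query via string substitution
# (table/field names are hypothetical, not the exact Dallas PD schema)
from datetime import date, timedelta

today = date.today()              # whenever the report is run
begin = today - timedelta(days=7) # prior week

query = f"""
SELECT offense, COUNT(*) AS total
FROM incidents
WHERE report_date >= '{begin}' AND report_date < '{today}'
GROUP BY offense
"""
print(query)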

One of the hardest parts of making standardized reports is that you can typically get the data formatted how you want, but getting little pieces looking exactly how you want them is hard. So here is a default pivot table exported in a Jupyter notebook for some year to date statistics (note the limitations of this, which is why I prefer graphs and do not show a percent change).
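For a rough sketch of building that kind of default year to date table in pandas (made-up data and hypothetical field names, not the Dallas schema):

# Sketch of a default year-to-date count table in a notebook cell
# (made-up data, hypothetical field names)
import pandas as pd

crimes = pd.DataFrame({'year': [2020, 2020, 2021, 2021, 2021],
                       'offense': ['THEFT', 'ASSAULT', 'THEFT', 'THEFT', 'ASSAULT']})

# cross tabulation of counts, offense by year
ytd_table = pd.crosstab(crimes['offense'], crimes['year'])
ytd_table  # the last expression in a cell is what gets rendered in the report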

And here is code I wrote to change font sizes and insert a title. A bit of work, but once you figure it out you are golden.
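The exact styling code is in the notebook on Github; a minimal sketch of the general approach using the pandas Styler (the specific sizes here are just illustrative, not necessarily what I used) looks like this:

# Sketch of bumping font sizes and adding a title via the pandas Styler
# (continues from the ytd_table sketch above; sizes are illustrative)
styled = (ytd_table.style
            .set_caption('Year to Date Crime Counts')
            .set_table_styles([
                {'selector': 'caption', 'props': [('font-size', '150%'), ('font-weight', 'bold')]},
                {'selector': 'th', 'props': [('font-size', '130%')]},
                {'selector': 'td', 'props': [('font-size', '120%')]}]))
styled  # renders as a styled HTML table in the notebook output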

You can go look at the notebook itself, but I also have an example of generating a weekly error bar chart (much preferred over the Compstat YTD tables):
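For flavor, a bare-bones version of that kind of weekly error bar chart in matplotlib (with made-up weekly counts and simple plus/minus two square root error bars, not the exact chart in the notebook) might look like:

# Bare-bones weekly error bar chart (made-up counts, +/- 2*sqrt(count) error bars)
import numpy as np
import matplotlib.pyplot as plt

weeks = np.arange(1, 11)
counts = np.array([32, 41, 28, 35, 39, 30, 44, 38, 29, 36])
err = 2 * np.sqrt(counts)

fig, ax = plt.subplots(figsize=(8, 4))
ax.errorbar(weeks, counts, yerr=err, fmt='o-', capsize=3)
ax.set_xlabel('Week')
ax.set_ylabel('Weekly Crime Count')
plt.show()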

Final note, to compile the notebook without showing any code (the police chief does not want to see your python code!), it looks like this from the command line:

jupyter nbconvert --execute example_report.ipynb --no-input --to html

I have further notes in the github page on automating this fully via bat files for windows, renaming files to make them update to the current date, etc.

Why Automate?

I know some analysts are reading this and thinking to themselves – I can generate these reports in 30 minutes using Excel and Powerpoint, so why spend the time to learn something new to make it 100% automated? There are a few reasons. One is pure time savings in the end. Say for the weekly report you spend 1 hour each week, and it takes you three work days (24 hours) to fully automate it. You will recover your time in 24 weeks.

Time savings is not the only component though. Fully automating means the workflow is 100% reproducible, and makes it easier to transfer that workflow to other analysts in the future. This is an important consideration when scaling – if you need to spend a few hours once a week forever, you can only take on generating so many reports. It is a better use of your time to do a one-time large sink into automating something than to loan out a portion of your time forever.

A final part is that the skills you develop when generating automated reports are more similar to data science roles in the private sector – so consider it an investment in your career as well. The types of machine learning pipelines I create in my current role would not be possible if I could not fully automate. I would only be able to do 2 or maybe 3 projects and just maintain them forever. Because I fully automate my data pipelines, I can hand that job off to a DevOps engineer, and only worry about fixing things when they break. (Or a more junior data scientist can take over that project entirely.)

Spatial analysis of NYC Shootings using the SPPT

As a follow up to my prior post on spatial sample size recommendations for the SPPT test, I figured I would show an actual analysis of spatial changes in crime. I’ve previously written about how NYC shootings appear to be going up by a similar amount in each precinct. We can do a similar analysis, but at smaller geographic units, to see if that holds true everywhere.

The data and R code to follow along can be downloaded here. But I will copy-paste below to walk you through.

So first I load in the libraries I will be using and set my working directory:

###################################################
library(sppt)
library(sp)
library(raster)
library(rgdal)
library(rgeos)

my_dir <- 'C:\\Users\\andre\\OneDrive\\Desktop\\NYC_Shootings_SPPT'
setwd(my_dir)
###################################################

Now we just need to do a little data prep for the NYC data: concat the old and new files, convert the data fields for some of the info, and do some date manipulation. I choose the pre/post split date here as March 1st, 2020, but also note we had the Floyd protests not too long after (so calling these Covid vs protest increases is pretty much confounded).

###################################################
# Read in the shooting data

old_shoot <- read.csv('NYPD_Shooting_Incident_Data__Historic_.csv', stringsAsFactors=FALSE)
new_shoot <- read.csv('NYPD_Shooting_Incident_Data__Year_To_Date_.csv', stringsAsFactors=FALSE)

# Just one column off
print( cbind(names(old_shoot), names(new_shoot)) )
names(new_shoot) <- names(old_shoot)
shooting <- rbind(old_shoot,new_shoot)

# I need to convert the coordinates to numeric fields
# and the dates to a date field

coord_fields <- c('X_COORD_CD','Y_COORD_CD')
for (c in coord_fields){
  shooting[,c] <- as.numeric(gsub(",","",shooting[,c])) #replacing commas in 2018 data
}

# How many per year to check no funny business
table(substring(shooting$OCCUR_DATE,7,10))

# Making a datetime variable in R
shooting$OCCUR_DATE <- as.Date(shooting$OCCUR_DATE, format = "%m/%d/%Y", tz = "America/New_York")

# Making a post date to split after Covid started
begin_date <- as.Date('03/01/2020', format="%m/%d/%Y")
shooting$Pre <- ifelse(shooting$OCCUR_DATE < begin_date,1,0)

#There is no missing data
summary(shooting)
###################################################

Next I read in a shapefile of the census tracts for NYC. (Pro-tip for NYC GIS data, I like to use Bytes of the Big Apple where available.) The interior has a few dongles (for this I probably should have started with a borough outline file), so I do a tiny buffer to get rid of those interior dongles, and then simplify the polygon slightly. To check and make sure my crime data lines up, I superimpose a tiny dot map, which is also a great/simple way to see the overall shooting density without the hassle of other types of hot spot maps.

###################################################
# Read in the census tract data

nyc_ct <- readOGR(dsn="nyct2010.shp", layer="nyct2010") 
summary(nyc_ct)
plot(nyc_ct)
nrow(nyc_ct) #2165 tracts

# Dissolve to a citywide file
nyc_ct$const <- 1
nyc_outline <- gUnaryUnion(nyc_ct, id = nyc_ct$const)
plot(nyc_outline)

# Area in square feet
total_area <- area(nyc_outline)
# 8423930027

# Turning crimes into spatial point data frame
coordinates(shooting) <- coord_fields
crs(shooting) <- crs(nyc_ct)

# This gets rid of a few dongles in the interior
nyc_buff <- gBuffer(nyc_outline,1,byid=FALSE)
nyc_simpler <- gSimplify(nyc_buff, 500, topologyPreserve=FALSE)

# Checking to make sure everything lines up
png('NYC_Shootings.png',units='in',res=1000,height=6,width=6,type='cairo')
plot(nyc_simpler)
points(coordinates(shooting),pch='.')
dev.off()
###################################################

In the next part I create a function to generate a nice grid over an outline area of your choice to do the SPPT analysis. What this does is generate the regular grid, turn it from a raster into a vector polygon format, and then filter out polygons with 0 overlapping crimes (in the subsequent SPPT test these areas will all be 0% vs 0%, so there is not much point in checking them for differences over time!).

You can see the logic in the prior blog post: if I want the number of areas with power to detect big changes, I want N*0.85. Since I am comparing data over 10 years to a bit over 1 year, the sample sizes are quite different, so I treat N here as 1.5 times the newer dataset, which ends up suggesting around 3,141 spatial units. Given the overall area of NYC, this translates to grid cells that are about 1600 by 1600 feet. Once I select out all the 0 grid cells, there only ends up being a total of 1,655 grid cells for the final SPPT analysis.

###################################################
# Function to create sppt grid over areas with 
# Observed crimes

grid_crimes <- function(outline,crimes,size){
    # First creating a raster given the outline extent
    base_raster <- raster(ext = extent(outline), res=size)
    projection(base_raster) <- crs(outline)
    # Getting the coverage for a grid cell over the city area
    mask_raster <- rasterize(outline, base_raster, getCover=TRUE)
    # Turning into a polygon
    base_poly <- rasterToPolygons(base_raster,dissolve=FALSE)
    xy_df <- as.data.frame(base_raster,long=T,xy=T)
    base_poly$x <- xy_df$x
    base_poly$y <- xy_df$y
    base_poly$poly_id <- 1:nrow(base_poly)
    # May also want to select based on layer value
    # sel_poly <- base_poly[base_poly$layer > 0.05,]
    # means the grid cell has more than 5% in the outline area
    # Selecting only grid cells with an observed crime
    ov_crime <- over(crimes,base_poly)
    any_crime <- unique(ov_crime$poly_id)
    sub_poly <- base_poly[base_poly$poly_id %in% any_crime,]
    # Redo the id
    sub_poly$poly_id <- 1:nrow(sub_poly)
    return(sub_poly)
}


# Calculating suggested sample size
total_counts <- as.data.frame(table(shooting$Pre))
print(total_counts)

# Lets go with the newer (post period) total times 1.5
total_n <- total_counts$Freq[1]*1.5

# Figure out the total number of grid cells 
# Given the total area
side <- sqrt( total_area/total_n ) 
print(side)
# 1637, lets just round down to 1600

poly_cells <- grid_crimes(nyc_simpler,shooting,1600)
print(nrow(poly_cells)) #1655

png('NYC_GridCells.png',units='in',res=1000,height=6,width=6,type='cairo')
plot(nyc_simpler)
plot(poly_cells,add=TRUE,border='blue')
dev.off()
###################################################

The next part is to split the data into pre/post and do the SPPT analysis. Here I use all the defaults, the Chi-square test for proportional differences, along with a correction for multiple comparisons. Without the multiple comparison correction, we have a total of 174 grid cells that have a p-value < 0.05 for the differences in proportions, for an S index of around 89%. With the multiple comparison correction though, the majority of those p-values are adjusted to be above 0.05, and only 25 remain afterwards (98% S-index). You can see in the screenshot that all of those significant differences are increases in proportions from the pre to the post period. While a few are 0 shootings to a handful of shootings (suggesting diffusion), the majority are areas that had multiple shootings in the historical data; they are just at a higher intensity now.

###################################################
# Now lets do the sppt analysis

split_shoot <- split(shooting,shooting$Pre)
pre <- split_shoot$`1`
post <- split_shoot$`0`

library(dplyr)
sppt_diff <- sppt_diff(pre, post, poly_cells)
summary(sppt_diff)

# Unadjusted vs adjusted p-values
sum(sppt_diff$p.value < 0.05) #174, around 89% similarity
sum(sppt_diff$p.adjusted  < 0.05) #25, 98% similarity

# Lets select out the increases/decreases
# And just map those

sig <- sppt_diff$p.adjusted < 0.05
sppt_sig <- sppt_diff[sig,]
head(sppt_sig,25) # to check out all increases
###################################################

The table is not all that helpful for really digging into patterns though; we need to map out the differences. The first map below shows the significant grid cells. They are somewhat tiny though, so you have to look closely to see where they are. The second map uses circles proportional to the percent difference (so bigger circles show larger increases). I am too lazy to do a legend/scale, but see my prior post on a hexbin map, or the sp website linked in the code comments.

###################################################
# Making a map
png('NYC_SigCells.png',units='in',res=1000,height=6,width=6,type='cairo')
plot(nyc_simpler,lwd=1.5)
plot(sppt_sig,add=TRUE,col='red',border='white')
dev.off()

circ_sizes <- sqrt(-sppt_sig$diff_perc)*3

png('NYC_SigCircles.png',units='in',res=1000,height=6,width=6,type='cairo')
plot(nyc_simpler,lwd=1.5)
points(coordinates(sppt_sig),pch=21,cex=circ_sizes,bg='red')
dev.off() 

# check out https://edzer.github.io/sp/
# For nicer maps/legends/etc.
###################################################

So the increases appear pretty spread out. We have a few notable ones that made the news right in the thick of things in Manhattan, but there are examples of grid cells that increased scattered all over the boroughs. I am not going to the trouble here, but if I were a crime analyst working on this, I would export this to a format where I could zoom into the local areas and drill down into the specific incidents. You can do that either in ArcGIS, or more directly in R by creating a leaflet map.

So if folks have any better ideas for testing out crime increases I am all ears. At some point I will give the R package sparr a try. (Here you could treat pre as the controls and post as the cases.) I am not a real big fan of over interpreting changes in kernel density estimates though (they can be quite noisy, and heavily influenced by the bandwidth), so I do like the SPPT analysis by default (though it swaps in a different problem, choosing a reasonable grid cell size).

Spatial sample size suggestions for SPPT analysis

I’ve reviewed several papers recently that use Martin Andresen’s Spatial Point Pattern Test (Andresen, 2016). I have been critical of these papers, as I think they are using samples that are too small to be reasonable. So here in this blog post I will lay out spatial sample size recommendations. Or more specifically, if you have N crimes, advice about how many spatial units of analysis S you can reasonably use to conduct the SPPT test.

Long story short, if you have N crimes, I think you should use at most S = 0.85*N spatial units of analysis, and at that level you can only detect very large changes. To be well powered to detect smaller changes between the two distributions, use S = 0.45*N. That is, if you have two crime samples you want to compare, and the smaller sample has 1000 crimes, the largest spatial sample size I would recommend is 850 units, but I think 450 units is better.

For those not familiar with the SPPT technique, it compares the proportion of events falling inside a common set of areas (e.g. police beats, census block groups, etc.) between two patterns. So for example in my work I compared the proportion of violent crime and the proportion of SQF in New York City (Wheeler et al., 2018). I think it makes sense as a gross monitoring metric for PDs this way (say for those doing DDACTS, swap out pedestrian stops with traffic stops), so you can say things like ‘area A had a much lower proportion of crimes than stops, so we should emphasize doing fewer stops in A overall’.

If you are a PD, you may already know the spatial units you want to use for monitoring purposes (say for each police sector or precinct). In that case, you want the power analysis to help guide how large a sample you need, so you effectively know how often you can update the estimates (e.g. you may only have enough traffic stops and violent crimes to do the estimates on a quarterly basis, not a monthly one). Many academic papers though are just generally theory testing, so they don’t have an a priori spatial unit of analysis chosen. (But they do have two samples, e.g. a historical sample of 2000 shootings and a current sample of 1000 shootings.) See Martin’s site for a list of prior papers using the SPPT to see it in action.

I’ve reviewed several papers that examine these proportion changes using the SPPT at very tiny spatial units of analysis, such as street segments. They also happen to have very tiny numbers of overall crimes, which they then break into subsequent subsets. For example, I reviewed a paper that had around 100 crimes in each subset of interest, and around 20,000 street segments. I totally get wanting to examine micro place crime patterns – but the SPPT is not well suited for this I am afraid.

Ultimately if you chunk up the total number of crimes into smaller and smaller areas, you will have less statistical power to uncover differences. With very tiny total crime counts, you will basically only be identifying differences between areas that go from 0 to 1 or 0 to 2 etc. It also becomes much more important to control for multiple comparisons when using a large number of spatial units. In general this technique is not going to work out well for micro units of analysis; it will only really work out for larger spatial units IMO. But here I will give my best advice about how small you can reasonably go for the analysis.

Power analysis logic

There are quite a few different ways people have suggested to determine the spatial sample of areas when conducting quadrat analysis (e.g. when you make your own spatial areas). One rule of thumb is to use a quadrat size of 2*A/N, where A is the area of the study and N is the total number of events (Paez, 2021).

Using the SPPT test itself, Malleson et al. (2019) identify the area at which the spatial pattern exhibits the highest similarity index with itself using a resampling approach. Ramos et al. (2021) look at the smallest spatial unit at which the crime patterns within that unit show spatial randomness.

So those latter two take an error metric based approach (the spatial unit of analysis likely to result in the minimal amount of error, with error defined in different ways). I take a different approach here – power analysis. We want to compare two spatial point patterns for proportional differences; how can we construct the test to be reasonably powered to identify the differences we want to detect?

I do not have a perfect way to do this power analysis, but here is my logic. Crime patterns are often slightly overdispersed, so here I assume that if you split up say 1000 crimes into 600 areas, the counts will have an NB2 distribution with a mean of 1000/600 = 1.67 and an overdispersion parameter of 2. (I assume this parameter to be 2 for various reasons, based on prior analysis of crime patterns, and because 2 tends to be in the general ballpark for the amount of overdispersion.) So now we want to see what it would take to go from a hot spot of crime, say the 98th percentile of this distribution, to the median 50th percentile.

So in R code, translating the NB2 mean/dispersion to N & P notation results in N & P parameters of 1.0416667 and 0.3846154 respectively:

trans_np <- function(mu,disp){
    a <- disp
    x <- mu^2/(1 - mu + a*mu)
    p <- x/(x + mu)
    n <- (mu*p)/(1-p)
    return(c(n,p))
}

# Mean 1000/600 and dispersion of 2
nb_dis <- trans_np(1000/600,2)

Now we want to see what the counts are to go from the 98th to the 50th percentile of this distribution:

crime_counts <- qnbinom(c(0.98,0.5), size=nb_dis[1], prob=nb_dis[2])

And this gives us a result of [1] 8 1 in the crime_counts object. So a hot spot place in this scenario will have around 8 crimes, and the median will be around 1 crime in our hypothetical areas. So we can translate these to percentages, and then feed them into R’s power.prop.test function:

crime_prop <- crime_counts/1000
power.prop.test(n = 1000, p1 = crime_prop[1], p2 = crime_prop[2])

And this gives us a result of:

     Two-sample comparison of proportions power calculation 

              n = 1000
             p1 = 0.008
             p2 = 0.001
      sig.level = 0.05
          power = 0.6477139
    alternative = two.sided

NOTE: n is number in *each* group

Note that this is for one N estimate, and assumes that N will be the same for each proportion. In practice for the SPPT test this is not true; oftentimes we have two crime samples (or crime vs police actions like stops) that have very different total baseline N’s. (It is part of the reason the test is useful; in that case it doesn’t make as much sense to compare densities as it does proportions.) So subsequently when we do these estimates, we should either take the average of the total number of crimes we have in our two point patterns for the SPPT (if they are close to the same size), or the minimum number of events if they are very disparate. So if you have 1000 crimes in sample A and 2000 crimes in sample B, I think you should treat the N in this scenario as 1500. If you have 5000 crimes vs 1000000 crimes, you should treat N here as 5000.

So that estimate above is for one set of crimes (1000), and one set of areas (600). But what if we vary the number of areas? At what number of areas do we have the maximum power?

So I provide functions below to generate the power estimate curve, given these assumptions about the underlying crime distribution (which will generally be in the ballpark for many crime patterns, but not perfect), for varying numbers of spatial units. Typically we know the total number of crimes, so we are saying: given I have N crimes, how finely can I split them up to check for differences with the SPPT test?

Both the Malleson and Ramos articles place their recommendations in terms of area instead of total number of units. But it would not surprise me if our different procedures end up resulting in similar recommendations based on the observed outputs of each of the papers. (The 2A/N quadrat analysis suggestion translates to N*0.5 total number of areas, pretty close to my 0.45*N suggestion for example.)

R Code

Below I have a nicer function to do the analysis I walked through above; it gives a nice power curve and a dataframe over various potential spatial sample sizes:

# SPPT Power analysis example
library(ggplot2)

# See https://andrewpwheeler.com/2015/01/03/translating-between-the-dispersion-term-in-a-negative-binomial-regression-and-random-variables-in-spss/
trans_np <- function(mu,disp){
    a <- disp
    x <- mu^2/(1 - mu + a*mu)
    p <- x/(x + mu)
    n <- (mu*p)/(1-p)
    return(c(n,p))
}

diff_suggest <- function(total_crimes,areas=round(seq(2,total_crimes*10,length.out=500)),
                         change_quant=c(0.98,0.5), nb_disp=2, alpha = 0.05,
                         plot=TRUE){
         # Figure out mean
         areas <- unique(areas)
         mean_cr <- total_crimes/areas
         # Initialize some vectors to place the results
         n_areas <- length(areas)
         power <- vector("numeric",length=n_areas)
         hign <- power
         lown <- power
         higp <- power
         lowp <- power
         # loop over areas and calculate power
         for (i in 1:length(areas)){
             # Negative binomial parameters
             dp <- trans_np(mean_cr[i],nb_disp)
             hilo <- qnbinom(change_quant, size=dp[1], prob=dp[2])
             hilo_prop <- hilo/total_crimes
             # Power for test
             pow <- power.prop.test(n = total_crimes, p1 = hilo_prop[1], p2 = hilo_prop[2], 
                                    sig.level = alpha)
             # Stuffing results in vector
             power[i] <- pow$power
             hign[i] <- hilo[1]
             lown[i] <- hilo[2]
             higp[i] <- hilo_prop[1]
             lowp[i] <- hilo_prop[2]
         }
         Ncrimes <- rep(total_crimes,n_areas)
         res_df <- data.frame(Ncrimes,areas,mean_cr,power,hign,lown,higp,lowp)
         # replacing missing with 0
         res_df[is.na(res_df)] <- 0
         if (plot) {
            require(ggplot2)
            fmt_cr <- formatC(total_crimes, format="f", big.mark=",", digits=0)
            title_str <- paste0("Power per area for Total number of crimes: ",fmt_cr)
            cap_str <- paste0("NB Dispersion = ",nb_disp,", alpha = ",alpha,
                              ", change quantiles = ",change_quant[1]," to ",change_quant[2])
            p <- ggplot(data=res_df,aes(x=areas,y=power)) + geom_line(size=1.5) +
                 theme_bw() + theme(panel.grid.major = element_line(linetype="dashed")) +
                 labs(x='Number of Areas',y=NULL,title=title_str,caption=cap_str) +
                 scale_x_continuous(minor_breaks=NULL) + scale_y_continuous(minor_breaks=NULL) +
                 theme(text = element_text(size=16), axis.title.y=element_text(margin=margin(0,10,0,0)))
            print(p)
         }
         return(res_df)
}

Once that function is defined, you can make a simple call like below, and it gives you a nice graph of the power given different numbers of grid cells:

diff_suggest(100)

So you can see here we never have very high power over this set of parameters. It is also non-monotonic and volatile at very small numbers of spatial units of analysis (at which the overdispersion assumption likely does not hold, and which probably are not of much interest anyway). But once that volatility tamps down we have a stepped curve, which ends up stepping whenever the quantiles of the original NB distribution change between particular integer values.

So what happens with the power curve if we up the number of crimes to 3000?

diff_suggest(3000)

So those two patterns are quite similar. It happens that when breaking down to the smaller units, the highest power scenario is when the crimes are subdivided into around 0.85 times as many spatial units as total crimes. So if you have 1000 crimes, in this scenario I would suggest using 850 areas.

Also note the behavior when you break it down into a very large number of spatial units S, where S >> N: you get a progressive decline to around 0 power in this analysis. E.g. if you have 100 times more spatial units than observations, only a handful of locations have any crimes, and the rest are all 0’s. So you need to be able to tell the difference between 1/N and 0, which is tough (and any inferences you do make will pretty much be indistinguishable from noise).

What about if we change the quantiles we are examining, and instead of looking at the very high crime place to the median, look at the 80th percentile to the 20th percentile:

diff_suggest(3000, change_quant=c(0.8,0.2))

We have a similar step pattern, but here the power is never as high as before, and only maxes out slightly above 0.5. It happens that this 0.5 power is at around number of crimes*0.45. So this suggests that to uncover more middling transitions, one would need fewer than half as many spatial units as crimes observed. E.g. if you have 1000 crimes, I would not suggest any more than 450 spatial units of analysis.

So for the first scenario, crimes*0.85, you could say it is the highest power scenario to detect changes from very high crime locations (aka hot spots) to the middle of the distribution. For the second scenario (my preferred offhand), crimes*0.45 total spatial units results in the highest power scenario to detect more mild changes in the middle of the distribution (and detecting changes from hot spots to cold spots thus has even more power).

For now that is the best advice I can give for determining the spatial sample size for the SPPT test. Will have a follow up blog post on using R to make a grid to conduct the test.

Also I have been wondering about the best way to quantify changes in the overall ranking. I have not come upon a great solution I am happy with though, so will need to think about it some more.


CCTV and clearance rates paper published

My paper with Yeondae Jung, The effect of public surveillance cameras on crime clearance rates, has recently been published in the Journal of Experimental Criminology. Here is a link to the journal version to download the PDF if you have access, and here is a link to an open read access version.

The paper examines the increase in case clearances (almost always arrests in this sample) for incidents that occurred near 329 public CCTV cameras installed and monitored by the Dallas PD from 2014-2017. Quite a bit of the criminological research on CCTV has examined crime reductions after camera installations, and the outcome of that research is a consistent small decrease in crimes. Cameras are often argued to help solve cases though, e.g. catch the guy in the act. So we examined that in the Dallas data.

We did find evidence that CCTV increases case clearances on average. Here is the graph showing the estimated clearances before the cameras were installed (based on the distance between the crime location and the camera), and the line after. You can see the bump up for the post period, around 2% in this graph, tapering off to an estimate of no differences before 1000 feet.

When we break this down by different crimes though, we find that the increase in clearances is mostly limited to theft cases. We also estimate counterfactually how many extra clearances the cameras were likely to cause. So based on our model, we can say something like: a case would have an estimated probability of clearance of 10% without a camera, but 12% with a camera. We can then do that counterfactual for many of the events around cameras, e.g.:

Probability No Camera   Probability Camera   Difference
    0.10                      0.12             + 0.02
    0.05                      0.06             + 0.01
    0.04                      0.10             + 0.06

And in this example for the three events, we calculate that the cameras increased the total expected number of clearances by 0.02 + 0.01 + 0.06 = 0.09. This marginal benefit mostly depends on the distance between the crime and the camera, but can also change based on when the crime was reported and some other covariates.
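In code that tally is trivial; here is a tiny sketch using the hypothetical probabilities from the table above (this is just the arithmetic, not our actual model output):

# Counterfactual tally of extra expected clearances
# (hypothetical probabilities from the table above, not actual model output)
p_no_camera = [0.10, 0.05, 0.04]
p_camera    = [0.12, 0.06, 0.10]

extra = sum(c - n for c, n in zip(p_camera, p_no_camera))
print(round(extra, 2))  # 0.09 expected additional clearances across the three events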

We do this exercise for all thefts nearby cameras post installation (over 15,000 in the Dallas data), and then get this estimate of the cumulative number of extra theft clearances we attribute to CCTV:

So even with 329 cameras and over a year of post data, we only estimate that cameras resulted in fewer than 300 additional theft clearances. So there is unlikely any reasonable cost-benefit analysis that would suggest cameras are worthwhile for their benefit in clearing additional cases in Dallas.

For those without access to journals, we have the pre-print posted here. The analysis was not changed from pre-print to published; just the front end and discussion sections were lightly edited over the drafts. Not sure why, but this pre-print is likely my most downloaded paper (over 4k downloads at this point) – even in the good journals, when I publish a paper I typically do not get 1000 downloads.

To go on, complaint number 5631 about peer review – this took quite a while to publish because it was rejected on R&R from Justice Quarterly, and with me and Yeondae both having jobs outside of academia it took us a while to do revisions and resubmit. I am not sure of the overall prevalence of rejections on R&Rs; I have had quite a few of them in my career (4 that I can remember). The dreaded send-to-new-reviewers is pretty much guaranteed to result in a reject (it is pretty much asking to roll a Yahtzee to get it past so many people).

We then submitted to a lower ranked journal, The American Journal of Criminal Justice, where we had reviewers who were not familiar with what counterfactuals are. (An irony of trying to go to a lower journal for an easier time: they tend to have much worse reviewers, so it can sometimes be not easier at all.) I picked it up again a few months ago, and on re-reading it thought it was too good to drop, so I resubmitted to the Journal of Experimental Criminology, where the reviews were reasonable and quick, and Wesley Jennings made fast decisions as well.

Some microsynth notes

Nate Connealy, a criminologist colleague of mine heading to Tampa, asks:

My question is from our CPP project on business improvement districts (Piza, Wheeler, Connealy, Feng 2020). The article indicates that you ran three of the microsynth matching variables as an average over each instead of the cumulative sum (street length, percent new housing structures, percent occupied structures). How did you get R to read the variables as averages instead of the entire sum of the treatment period of interest? I have the microsynth code you used to generate our models, but cannot seem to determine how you got R to read the variables as averages.

So Nate is talking about this paper, Crime control effects of a police substation within a business improvement district: A quasi-experimental synthetic control evaluation (Piza et al., 2020), and here is the balance table in the paper:

To be clear to folks, I did not balance on the averages, but simply reported the table in terms of averages. So here is the original readout from R:

So I just divided those noted rows by 314 to make them easier to read. You could instead divide the values by the total number of treated units in the original data to have microsynth match on the averages if you wanted to. Example below (this is R code; see the microsynth library and the paper by Robbins et al., 2017):

library(microsynth)
#library(ggplot2) #not loading here, some issue
set.seed(10)

data(seattledmi) #just using data in the package
cs <- seattledmi
# calculating proportions
cs$BlackPerc <- (cs$BLACK/cs$TotalPop)*100
cs$FHHPerc <- (cs$FEMALE_HOU/cs$HOUSEHOLDS)*100
# replacing NAs (from zero population denominators) with 0
cs[is.na(cs)] <- 0

cov.var <- c("TotalPop","HISPANIC","Males_1521","FHHPerc","BlackPerc")
match.out <- c("i_felony", "i_misdemea")

sea_prop <- microsynth(cs, 
                       idvar="ID", timevar="time", intvar="Intervention", 
                       start.pre=1, end.pre=12, end.post=16, 
                       match.out.min=match.out,match.out=FALSE,
                       match.covar=FALSE,check.feas=FALSE,
                       match.covar.min=cov.var, 
                       result.var=match.out)

summary(sea_prop) # balance table

And here you can see that we are matching on the cumulative sums for each of the areas, but we can divide our covariates by the number of treated units, and we will match on the proportional values.

# Dividing covariates by the 39 treated units to match on averages instead
cs[,cov.var] <- cs[,cov.var]/39

sea_div <- microsynth(cs, 
                      idvar="ID", timevar="time", intvar="Intervention", 
                      start.pre=1, end.pre=12, end.post=16, 
                      match.out.min=match.out,match.out=FALSE,
                      match.covar=FALSE,check.feas=FALSE,
                      match.covar.min=cov.var, 
                      result.var=match.out)

summary(sea_div) # balance table

Note that these do not result in the same weights. If you look at the results you will see the treatment effects are slightly different. Also if you do:

# Showing weights are not equal
all.equal(sea_div$w$Weights,sea_prop$w$Weights)

It does not return TRUE. Honestly I am not familiar enough with the procedure that microsynth uses to do the matching (raking survey weights) to know if this is due to stochastic stuff or due to how the weighting algorithm works (I would have thought a linear change would not make a difference, but I was wrong).

On the bucket list is a matching algorithm that returns geographically contiguous areas and gives the weights all values of 1 (so it creates comparable neighborhoods), instead of estimating raking weights. It may be 5 years though before I get around to that. Gio has a nice map showing that the way the weights work now, they can be all over the place (Circo et al., 2021) – I am not sure that is a good thing though.

But I did want to share some functions I used for the paper I worked on with Nate. First, if you use the permutation approach, the function prep_synth returns some of the data in a nicer format to make graphs and calculate your own stats:

# Function to scoop up the data nicely
prep_synth <- function(mod){
    #Grab the plot data
    plotStats <- mod[['Plot.Stats']]
    #For the left graph
    Treat <- as.data.frame(t(plotStats$Treatment))
    Treat$Type <- "Treat"
    #This works for my data at years, will not 
    #Be right for data with more granular time though
    Treat$Year <- as.integer(rownames(Treat))
    Cont <- as.data.frame(t(plotStats$Control))
    Cont$Type <- "Control"
    Cont$Year <- as.integer(rownames(Cont))
    AllRes <- rbind(Treat,Cont)
    #For the right graph
    Perm <- as.data.frame(t(as.data.frame(plotStats$Difference)))
    SplitStr <- t(as.data.frame(strsplit(rownames(Perm),"[.]")))
    colnames(SplitStr) <- c("Type","Year")
    rownames(SplitStr) <- 1:nrow(SplitStr)
    SplitStr <- as.data.frame(SplitStr)
    Perm$Type <- as.character(SplitStr$Type)
    Perm$Year <- as.integer(as.character(SplitStr$Year))
    Perm$Group <- ifelse(Perm$Type == 'Main','Treatment Effect','Permutations') 
    #Reordering factor levels for plots
    AllRes$Type <- factor(AllRes$Type,levels=c('Treat','Control'))
    levels(AllRes$Type) <- c('Treated','Synthetic Control')
    Perm$Group <- factor(Perm$Group,levels=c('Treatment Effect','Permutations'))
    #Exporting result
    Res <- vector("list",length=2)
    Res[[1]] <- AllRes
    Res[[2]] <- Perm
    names(Res) <- c("AggOutcomes","DiffPerms")
    return(Res)
}

It works for the prior tables, but I really made these functions for the case where you use permutations to get the errors. (In the microsynth example it is easier to work with permutations than in the state-level examples for synth, where I think conformal prediction intervals make more sense; see De Biasi & Circo, 2021 for a recent real example with micro place-based data.)

# Takes like 1.5 minutes
sea_perm <- microsynth(seattledmi, 
                      idvar="ID", timevar="time", intvar="Intervention", 
                      start.pre=1, end.pre=12, end.post=16, 
                      match.out.min=match.out,match.out=FALSE,
                      match.covar=FALSE,check.feas=FALSE,
                      match.covar.min=cov.var, 
                      result.var=match.out, perm=99)

res_prop <- prep_synth(sea_perm)
print(res_prop)

So the dataframe in the first slot has the aggregated outcomes for the treated and synthetic control areas, and the second dataframe is a nice stacked version of the differences for the permutations. First, I really do not like percentage change (see Wheeler, 2016 for the most direct critique, but I have a bunch of posts on this site about it). So I wrote code to translate the treatment effects into crime count reductions instead of percent changes.

# Getting the observed treatment effect on count scale
# vs the permutations

agg_fun <- function(x){
    sdx <- sd(x)
    minval <- min(x)
    l_025 <- quantile(x, probs=0.025)
    u_975 <- quantile(x, probs=0.975)
    maxval <- max(x)
    totn <- length(x)
    res <- c(sdx,minval,l_025,u_975,maxval,totn)
    return(res)
}

treat_count <- function(rp){
    # Calculating the treatment effect based on permutations
    keep_vars <- !( names(rp[[2]]) %in% c("Year","Group") )
    out_names <- names(rp[[2]])[keep_vars][1:(sum(keep_vars)-1)]
    loc_dat <- rp[[2]][,keep_vars]
    agg_treat <- aggregate(. ~ Type, data = loc_dat, FUN=sum)
    n_cols <- 2:dim(agg_treat)[2]
    n_rows <- 2:nrow(agg_treat)
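    # Row 1 of agg_treat is the Main (observed) series in this data, so
    # dif below is the observed totals minus each permutation's totals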
    dif <- agg_treat[rep(1,max(n_rows)-1),n_cols] - agg_treat[n_rows,n_cols]
    dif$Const <- 1
    stats <- aggregate(. ~ Const, data = dif, FUN=agg_fun)
    v_names <- c("se","min","low025","up975","max","totperm")
    long_stats <- reshape(stats,direction='long',idvar = "Const", 
                      varying=list(2:ncol(stats)),
                      v.names=v_names, times=out_names)
    # Add back in the original stats
    long_stats <- long_stats[,v_names]
    rownames(long_stats) <- 1:nrow(long_stats)
    long_stats$observed <- t(agg_treat[1,n_cols])[,1]
    long_stats$outcome <- out_names
    ord_vars <- c('outcome','observed',v_names)
    return(long_stats[,ord_vars])
}

treat_count(res_prop)

So that is the cumulative total effect of the intervention. It is similar in spirit to the WDD test (Wheeler & Ratcliffe, 2018), but since the pre period is matched perfectly, it reduces to the difference in the post time periods. And here the error is estimated via the permutations, not any Poisson approximation.
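
If you also want a permutation based p-value for that cumulative effect, here is a minimal sketch using the DiffPerms dataframe returned by prep_synth (it assumes the intervention begins after time period 12, as in this Seattle example, and uses the i_felony outcome):

# Sketch: permutation p-value for the cumulative post-period felony effect
perm_df <- res_prop[['DiffPerms']]
post <- perm_df[perm_df$Year > 12,]
cum_eff <- aggregate(i_felony ~ Type, data=post, FUN=sum)
obs <- cum_eff$i_felony[cum_eff$Type == 'Main']
perms <- cum_eff$i_felony[cum_eff$Type != 'Main']
# proportion of permutations at least as extreme as the observed effect
p_val <- (sum(abs(perms) >= abs(obs)) + 1)/(length(perms) + 1)
p_val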

But I often see folks concerned about the effects further out in time for synthetic control studies. So here is a graph that just looks at the instant effects for each time period, showing the difference via the permutation lines:

# GGPLOT graphs, individual lines
library(ggplot2)
perm_data <- res_prop[[2]]
# Ordering factors to get the treated line on top
perm_data$Group <- factor(perm_data$Group, c("Permutations","Treatment Effect"))
perm_data$Type <- factor(perm_data$Type, rev(unique(perm_data$Type)))
pro_perm <- ggplot(data=perm_data,aes(x=Year,y=i_felony,group=Type,color=Group,size=Group)) + 
            geom_line() +
            scale_color_manual(values=c('grey','red')) + scale_size_manual(values=c(0.5,2)) +
            geom_vline(xintercept=12) + theme_bw() + 
            labs(x=NULL,y='Felony Difference from Control') + 
            scale_x_continuous(minor_breaks=NULL, breaks=1:16) + 
            scale_y_continuous(breaks=seq(-10,10,2), minor_breaks=NULL) +
            theme(panel.grid.major = element_line(linetype="dashed"), legend.title= element_blank(),
            legend.position = c(0.2,0.8), legend.background = element_rect(linetype="solid", color="black")) +
            theme(text = element_text(size=16), axis.title.y=element_text(margin=margin(0,10,0,0)))

And I also like looking at this for the cumulative effects as well, where you can see the permutation lines widen over time.

# Cumulative vs Pointwise
perm_data$csum_felony <- ave(perm_data$i_felony, perm_data$Type, FUN=cumsum)
pro_cum  <- ggplot(data=perm_data,aes(x=Year,y=csum_felony,group=Type,color=Group,size=Group)) + 
              geom_line() +
              scale_color_manual(values=c('grey','red')) + scale_size_manual(values=c(0.5,2)) +
              geom_vline(xintercept=12) + theme_bw() + 
              labs(x=NULL,y='Felony Difference from Control Cumulative') + 
              scale_x_continuous(minor_breaks=NULL, breaks=1:16) + 
              scale_y_continuous(breaks=seq(-20,20,5), minor_breaks=NULL) +
              theme(panel.grid.major = element_line(linetype="dashed"), legend.title= element_blank(),
              legend.position = c(0.2,0.8), legend.background = element_rect(linetype="solid", color="black")) +
              theme(text = element_text(size=16), axis.title.y=element_text(margin=margin(0,10,0,0)))
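
Note the plot objects are only assigned above, not drawn. To view them just print the objects, and to save them to file something like the below works (the file names here are placeholders I made up):

# Display the plots and save them to file (file names are placeholders)
pro_perm
pro_cum
ggsave("microsynth_pointwise.png", plot=pro_perm, dpi=300, height=5, width=9)
ggsave("microsynth_cumulative.png", plot=pro_cum, dpi=300, height=5, width=9)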

If you do a ton of permutations (say 999 instead of 99), it would likely make more sense to use fan chart style error bands showing different percentiles, instead of plotting each individual line (Yim et al., 2020).
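
Here is a rough sketch of what that type of fan chart could look like, just computing percentile bands of the permutation differences per year and layering them as ribbons under the observed treatment line:

# Sketch: fan chart, percentile ribbons from the permutations
perms_only <- perm_data[perm_data$Group == 'Permutations',]
obs_only <- perm_data[perm_data$Group == 'Treatment Effect',]
qs <- aggregate(i_felony ~ Year, data=perms_only, FUN=quantile,
                probs=c(0.025,0.10,0.90,0.975))
bands <- data.frame(Year=qs$Year, qs$i_felony)
names(bands) <- c("Year","lo95","lo80","hi80","hi95")
ggplot() + 
    geom_ribbon(data=bands, aes(x=Year, ymin=lo95, ymax=hi95), fill='grey85') +
    geom_ribbon(data=bands, aes(x=Year, ymin=lo80, ymax=hi80), fill='grey70') +
    geom_line(data=obs_only, aes(x=Year, y=i_felony), color='red', size=1.5) +
    geom_vline(xintercept=12) + theme_bw() +
    labs(x=NULL, y='Felony Difference from Control')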

I will need to slate a totally different blog post to discuss instant vs. cumulative effects for time series analysis. I have been peer-reviewing quite a few time series analyses of Covid and crime changes, and most everyone only focuses on instant changes and does not calculate cumulative changes. See for example estimating excess deaths for the Texas winter storm power outage (Aldhous et al., 2021); folks could do similar analyses for short term crime interventions. Jerry has a good example of using the CausalImpact package to estimate cumulative effects for a gang takedown intervention (Ratcliffe et al., 2017), for one criminal justice example I am familiar with.
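
For a bare-bones illustration, here is roughly what that looks like with the CausalImpact package on simulated data (this follows the package's standard usage pattern, not the Ratcliffe et al. analysis; the summary reports both average and cumulative effects):

# Toy CausalImpact example on simulated data
library(CausalImpact)
set.seed(10)
x <- 100 + arima.sim(model=list(ar=0.5), n=52)
y <- 1.2*x + rnorm(52)
y[41:52] <- y[41:52] - 10 # impose a drop in the post period
dat <- cbind(y, x)
impact <- CausalImpact(dat, pre.period=c(1,40), post.period=c(41,52))
summary(impact) # average (instant) and cumulative effect estimates
plot(impact)    # panels include pointwise and cumulative differences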

Again for folks feel free to ask me anything. I may not always be able to do as deep a dive as this, but always feel free to reach out.

References

Using the Google Places API in criminology research?

In my ask me anything series, Thom Snaphaan, a criminologist at Ghent University, writes in with this question (slightly edited by me):

I read your blog post on using the Google Places API for criminological research. I am interested in using these data in the context of my PhD research. Can I ask you some questions on this matter? We think Google Places might be a very rich data source, specifically the user ratings of places. (1) Is it allowed to use these data on a large scale (two large cities) for scientific research? (2) Is it possible to download a set without the limit of 1,000 requests per day? (3) Are there, in your experience, other (perhaps more interesting) data sources to conduct this study? Many thanks! Best, Thom

And for my responses to Thom,

For 1) I believe it is OK to use for research purposes. You are not allowed to download the data and resell it though.

For 2) The quotas for the Places API are much larger now: you get $200 of credit per month, which amounts to 100,000 API calls. So that should be sufficient even for a large city. (A sketch of what a single request looks like from R is below, after these responses.)

For 3) I do not know; I haven't paid much attention to the different online apps that do user reviews. Here in the States we have another service called Yelp (mostly for restaurants), but I am not sure whether it has more reviews or not.
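
Here is the request sketch mentioned above, hitting the classic Places text search endpoint with httr (swap in your own API key and query; the field names come from the JSON the API returns):

# Sketch: one Places text search request (substitute your own key/query)
library(httr)
key <- "YOUR_API_KEY"
resp <- GET("https://maps.googleapis.com/maps/api/place/textsearch/json",
            query=list(query="bars in Troy NY", key=key))
res <- content(resp, as="parsed")
# each result includes a name, formatted address, place types, and user rating
sapply(res$results, function(r) r$name)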

One additional piece of information not commonly used in place-based research (although I have seen it used in some studies, e.g. Hipp, 2016; Perezin-Askey et al., 2018) is the number of employees or sales volume at particular crime generators/attractors. This is not available via Google, but it is via Reference USA or Lexis Nexis. For Dallas, IIRC, Reference USA had much better coverage (almost twice as many businesses), but I recently reviewed a paper that did boots-on-the-ground validation of Google data in the Indian city of Chennai, and the validation for Google businesses was very high (Kuralarasan & Bernasco, 2021).

Answer in the comments if you think you have more helpful information on leveraging the place based user reviews in research projects.


In the past I have written about using various Google APIs, several of which I have used in my research for different projects.

Google has new pricing now, where you get $200 in credits per month per API. For the Places and Street View APIs that works out to a very large number of potential calls, so it will work for most research projects. Looking it over, I actually don't think I have used Google Places data in any projects; in Wheeler & Steenbeek (2021) I use Reference USA and some other sources.

Geocoding and Distance Matrix API limits are tougher; I ended up accidentally charging myself ~$150 for my work with Gio on gunshot fatalities (Circo & Wheeler, 2021), calculating network distances and approximate drive times. The Vision API quota is also quite low (1,000 per month), so you will need to budget/plan if you need those services for your project. For geocoding you should be able to find alternatives, like the census geocoder (available in both R and Python), and then only use Google for the leftovers.
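
As one example of that workflow, here is a small sketch using the tidygeocoder package's Census geocoder backend (the addresses are made up; anything that comes back without coordinates would get routed to Google afterwards):

# Sketch: free Census geocoding first, save Google for the leftovers
library(tidygeocoder)
addr <- data.frame(address=c("1600 Pennsylvania Ave NW, Washington, DC",
                             "123 Fake St, Nowhere, TX"))
geo_res <- geocode(addr, address=address, method="census")
# rows with missing lat/long are the leftovers to send to Google
leftovers <- geo_res[is.na(geo_res$lat),]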

References

  • Circo, G. M., & Wheeler, A. P. (2021). Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims. Applied Spatial Analysis and Policy, 14(2), 379-393.
  • Hipp, J. R. (2016). General theory of spatial crime patterns. Criminology, 54(4), 653-679.
  • Kuralarasan, K., & Bernasco, W. (2021). Location Choice of Snatching Offenders in Chennai City. Journal of Quantitative Criminology, Online First.
  • Perezin-Askey, A., Taylor, R., Groff, E., & Fingerhut, A. (2018). Fast food restaurants and convenience stores: Using sales volume to explain crime patterns in Seattle. Crime & Delinquency, 64(14), 1836-1857.
  • Wheeler, A. P., & Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37(2), 445-480.

Ask me anything

So I get cold emails probably a few times a month asking random coding questions (which is perfectly fine; that is the main point of this post!). I've suggested in the past that folks use a few different online forums, but like many forums I have participated in over the years, they died out quite quickly (so they are not viable alternatives currently).

I think going forward I will mimic what Andrew Gelman does on his blog, just turn my responses into blog posts for everyone (e.g. see this post for an example). I will of course ask people permission before I post, and omit names same as Gelman does.

I have debated over time doing a Patreon account, but I don't think that would work very well (imagine I would get 1.2 subscribers at $3 a month!). Ditto for writing books: I debate doing a Data Science for Crime Analysts in Python book or something along those lines, but then I write the outline and figure it is too much work for, at best, a few hundred people to purchase the book in the end. I will do consulting gigs for folks, but the majority of questions people ask do not take long enough to justify running a tab for the work (and I have no desire to rack up charges for grad students asking a few questions).

So feel free to ask me anything.