Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one has previously done before (including myself).

So here is a more general way to think about (from a police departments or criminal justice agencies perspective) whether it would be beneficial to reach out to me.

Should I do X?

So no one is going to be against different evidence based policing practices, but not all strategies make sense for all jurisdictions. For example, while focussed deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration the cost as well as the potential benefits of the program.

Should I do X may involve more open ended questions. I’ve previously conducted in person training for crime analysts that goes over various evidence based practices. It also may involve something more specific, such as should I redistrict my police beats? Or I have a theft-from-vehicle problem, what strategies should I implement to reduce them?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono — which you may ask what is in it for me then? As an academic I get evaluated mostly by my research productivity, which involves writing peer reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If it involves applying for a grant (on a project I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments. But my work with police departments is more meaningful. It is not zero-sum, I tend to get better ideas when understanding specific agencies problems.

So get in touch if you think I can help your agency!

CAN SEBP webcast on predictive policing

I was recently interviewed for a webcast by the Canadian Society of Evidence Based Policing on Predictive Policing.

I am not directly affiliated with any software vendor, so these are my opinions as an outsider, academic, and regular consultant for police departments on quantitative problems.

I do have some academic work on predictive policing applications that folks can peruse at the moment (listed below). The first is on evaluating the accuracy of a people predictions, the second is for addressing the problem of disproportionate minority contact in spatial predictive systems.

  • Wheeler, Andrew P., Robert E. Worden, and Jasmine R. Silver. (2018) The predictive accuracy of the Violent Offender Identification Directive (VOID) tool. Conditionally accepted at Criminal Justice and Behavior. Pre-print available here.
  • Wheeler, Andrew P. (2018) Allocating police resources while limiting racial inequality. Pre-print available here.

I have some more work on predictive policing applications in the pipeline, so just follow the blog or follow me on Twitter for updates about future work.

If police departments are interested in predictive policing applications and would like to ask me some questions, always feel free to get in contact. (My personal email is listed on my CV, my academic email is just Andrew.Wheeler at utdallas.edu.)

Most of my work consulting with police departments is ad-hoc (and much of it is pro bono), so if you think I can be of help always feel free to get in touch. Either for developing predictive applications or evaluating whether they are effective at achieving the outcomes you are interested in.

Monitoring Use of Force in New Jersey

Recently ProPublica published a map of uses-of-force across different jurisdictions in New Jersey. Such information can be used to monitor whether agencies are overall doing a good or bad job.

I’ve previously discussed the idea of using funnel charts to spot outliers, mostly around homicide rates but the idea is the same when examining any type of rate. For example in another post I illustrated its use for examining rates of officer involved shootings.

Here is another example applying it to lesser uses of force in New Jersey. Below is the rate of use of force reports per the total number of arrests. (Code to replicate at the end of the post.)

The average use of force per arrests in the state is around 3%. So the error bars show relative to the state average. Here is an interactive chart in which you can use tool tips to see the individual jurisdictions.

Now the original press release noted by Seth Stoughton on twitter noted that several towns have ratio’s of black to white use of force that are very high. Scott Wolfe suspected that was partly a function of smaller towns will have more variable rates. Basically as one is comparing the ratio between two rates with error, the error bars around the rate ratio will also be quite large.

Here is the chart showing the same type of funnel around the rate ratio of black to white use-of-force relative to the average over the whole sample (the black percent use of force is 3.2 percent of arrests, and the white percent use of force is 2.4, and the rate ratio between the two is 1.35). I show in the code how I constructed this, which I should write a blog post about itself, but in short there are decisions I could make to make the intervals wider. So the points that are just slightly above a ratio of 2 at around 10,000 arrests are arguably not outliers, those more to the top-right of the plot though are much better evidence. (I’d note that if one group is very small, you could always make these error bars really large, so to construct them you need to make reasonable assumptions about the size of the two groups you are comparing.)

And here is another interactive chart in which you can view the outliers again. The original press release, Millville, Lakewood, and South Orange are noted as outliers. Using arrests as the denominator instead of population, they each have a rate ratio of around 2. In this chart Millville and Lakewood are outside the bounds, but just barely. South Orange is within the bounds. So those aren’t the places I would have called out according to this chart.

That same twitter thread other folks noted the potential reliability/validity of such data (Pete Moskos and Kyle McLean). These charts cannot say why individual agencies are outliers — either high or low. It could be their officers are really using force at different rates, it could also be though they are using different definitions to reporting force. There are also potential other individual explanations that explain the use of force distribution as well as the ratio differences in black vs white — no doubt policing in Princeton vs Camden are substantively different. Also even if all individual agencies are doing well, it does not mean there are no potential problem officers (as noted by David Pyrooz, often a few officers contribute to most UoF).

Despite these limitations, I still think there is utility in this type of monitoring though. It is basically a flag to dig deeper when anomalous patterns are spotted. Those unaccounted for factors contribute to more points being pushed outside of my constructed limits (overdispersion), but more clearly indicate when a pattern is so far outside the norm of what is expected the public deserves some explanation of the pattern. Also it highlights when agencies are potentially doing good, and so can be promoted according to their current practices.

This is a terrific start to effectively monitoring police agencies by ProPublica — state criminal justice agencies should be doing this themselves though.

Here is the code to replicate the analysis.

Projecting spatial data in Python and R

I use my blog as sort of a scholarly notebook. I often repeatedly do a task, and then can’t find where I did it previously. One example is projecting crime data, so here are my notes on how to do that in python and R.

Commonly I want to take public crime data that is in spherical lat/lon coordinates and project it to some local projection. Most of the time so I can do simply euclidean geometry (like buffers within X feet, or distance to the nearest crime generator in meters). Sometimes you need to do the opposite — if I have the projected data and I want to plot the points on a webmap it is easier to work with the lat/lon coordinates. As a note, if you import your map data and then your points are not on the map (or in a way off location), there is some sort of problem with the projection.

I used to do this in ArcMap (toolbox -> Data Management -> Projections), but doing it these programs are faster. Here are examples of going back and forth for some Dallas coordinates. Here is the data and code to replicate the post.

Python

In python there is a library pyproj that does all the work you need. It isn’t part of the default python packages, so you will need to install it using pip or whatever. Basically you just need to define the to/from projections you want. Also it always returns the projected coordinates in meters, so if you want feet you need to do a conversions from meters to feet (or whatever unit you want). For below p1 is the definition you want for lat/lon in webmaps (which is not a projection at all). To figure out your local projection though takes a little more work.

To figure out your local projection I typically use this online tool, prj2epsg. You can upload a prj file, which is the locally defined projection file for shapefiles. (It is plain text as well, so you can just open in a text editor and paste into that site as well.) It will then tell you want EPSG code corresponds to your projection.

Below illustrates putting it all together and going back and forth for an example area in Dallas. I tend to write the functions to take one record at a time for use in various workflows, but I am sure someone can write a vectorized version though that will take whole lists that is a better approach.

import pyproj

#These functions convert to/from Dallas projection
#In feet to lat/lon
p1 = pyproj.Proj(proj='latlong',datum='WGS84')
p2 = pyproj.Proj(init='epsg:2276') #show how to figure this out, http://spatialreference.org/ref/epsg/ and http://prj2epsg.org/search 
met_to_feet = 3.280839895 #http://www.meters-to-feet.com/

#This converts Lat/Lon to projected coordinates
def DallConvProj(Lat,Lon):
    #always returns in meters
    if abs(Lat) > 180 or abs(Lon) > 180:
        return (None,None)
    else:
        x,y = pyproj.transform(p1, p2, Lon, Lat)
        return (x*met_to_feet, y*met_to_feet)

#This does the opposite, coverts projected to lat/lon
def DallConvSph(X,Y):
    if abs(X) < 2000000 or abs(Y) < 6000000:
        return (None,None)
    else:
        Lon,Lat = pyproj.transform(p2, p1, X/met_to_feet, Y/met_to_feet)
        return (Lon, Lat)

#check coordinates
x1 = -96.828295; y1 = 32.832521
print DallConvProj(Lat=y1,Lon=x1)

x2 = 2481939.934525765; y2 = 6989916.200679892
print DallConvSph(X=x2, Y=y2)

R

In R I use the library proj4 to do the projections for point data. R can read in the projection data from a file as well using the rgdal library.

library(proj4)
library(rgdal)

#read in projection from shapefile
MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\Projections_R_Python"
setwd(MyDir)
DalBound <- readOGR(dsn="DallasBoundary_Proj.shp",layer="DallasBoundary_Proj")
DalProj <- proj4string(DalBound)    

ProjData <- data.frame(x=c(2481939.934525765),
                       y=c(6989916.200679892),
                       lat=c(32.832521),
                       lon=c(-96.828295))
       
LatLon <- proj4::project(as.matrix(ProjData[,c('x','y')]), proj=DalProj, inverse=TRUE)
#check to see if true
cbind(ProjData[,c('lon','lat')],as.data.frame(LatLon))

XYFeet <- proj4::project(as.matrix(ProjData[,c('lon','lat')]), proj=DalProj)
cbind(ProjData[,c('x','y')],XYFeet)    

plot(DalBound)
points(ProjData$x,ProjData$y,col='red',pch=19,cex=2)

The last plot function shows that the XY point is within the Dallas basemap for the projected boundary. But if you want to project the boundary file as well, you can use the spTransform function. Here I have a simple example of tacking the projected boundary file and transforming to lat/lon, so can be superimposed on a leaflet map.

Additionally I show a trick I sometimes use for maps by transforming the boundary polygon to a polyline, as it provides easier styling options sometimes.

#transform boundary to lat/lon
DalLatLon <- spTransform(DalBound,CRS("+init=epsg:4326") )
plot(DalLatLon)
points(ProjData$lon,ProjData$lat,col='red',pch=19,cex=2)

#Leaflet useful for boundaries to be lines instead of areas
DallLine <- as(DalLatLon, 'SpatialLines')
library(leaflet)

BaseMapDallas <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "CartoDB Lite") %>%
  addPolylines(data=DallLine, color='black', weight=4, group="Dallas Boundary Lines") %>%
  addPolygons(data=DalLatLon,color = "#1717A1", weight = 1, smoothFactor = 0.5,
              opacity = 1.0, fillOpacity = 0.5, group="Dallas Boundary Area") %>%
  addLayersControl(baseGroups = c("Open Street Map","CartoDB Lite"),
                   overlayGroups = c("Dallas Boundary Area","Dallas Boundary Lines"),
                   options = layersControlOptions(collapsed = FALSE)) %>%
                   hideGroup("Dallas Boundary Lines")   
                      
BaseMapDallas

I have too much stuff in the blog queue at the moment, but hopefully I get some time to write up my notes on using leaflet maps in R soon.

New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). Y axis is the number of violent crimes over along time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though, you could use totally objective crime data (say swap out acoustic gun shot sensors with reported crime) and you still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade off was 3 crimes vs 30 that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the no constraint hot spots. So in the hypothetical it would go from 30 crimes to 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential pop. will typically result in similar large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That uses a totally different method, but I think will be pretty similar outputs in practice — can tune the penalty factor in a similar way to changing the linear program constraint I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.

The random distribution of near-repeat strings

One thing several studies that examine near-repeat patterns have looked at is the distribution of the string of near-repeats. So near-repeats sometimes result in only 2 cases connected, sometimes 3, sometimes 4, etc. Here is an example from a recent work on arsons (Turchan et al., 2018):

Cory Haberman and Jerry Ratcliffe were the first I noticed to do this in this paper (Jerry’s near-repeat calculator has the option to export the strings). It is also a similar idea to what Davies and Marchione did in this paper.

Looking at these strings of events has clear utility for crime analysts, as they have a high probability of being linked to the same offender(s). Building off of some prior work, I wrote some python code to see what the distribution of these strings would look like when you randomly permuted the times in the data (which is the same approach used to estimate the intervals in the near repeat calculator). Here is the data and code, which is an analysis of 14,184 thefts from motor vehicles in Dallas that occurred in 2015.

So first I breakdown the total number of near repeat strings according to within 1000 feet and 7 days of each other. I then conduct 99 random permutations to see how many strings might happen by chance even if there is no near-repeat phenomenon. Some near-repeats can simply happen by chance, especially in places where crime is more prevalent. A length of string 1 in the table means it is not a near repeat, and 10+ means the string has 10 or more events in it. The numbers are the number of chains (in the Turchan article parlance), so 1,384 2-length chains means it includes 2,768 crime events.

If you compare the observed to the bounds in the table, you can see there are fewer isolates (1 length) in the observed than permutation distribution, and more 2 and 3 string events. After that the higher level strings occur just as frequently in the observed data than in the random data, with the exception of 10+ are fewer, but not by much.

So this provides evidence of the boost hypothesis in this data, albeit many near-repeat strings are still likely to occur just by chance, and the differences are not uber large. A crime analyst may be more interested in the question though "if I have X events in a near-repeat string, should I look into the data more". The idea being that since 2-strings are not that rare it would probably be a waste of an analysts time to dig into all of the two-events. I don’t think this is the perfect way to make that decision, but here is a breakdown of the distribution of strings for the permutated data.

So isolates happen in the random data 86% of the time. 2-strings happen 8.7% of the time, 3-strings 2.6%, etc. Based on this I would recommend that there needs to be at least 3 strings of near-repeat events if you have a low threshold in terms of "should I bother to dig into these events". If you want a high threshold though you may do more like 6+ events in a string.

This again is alittle bit of a slippage, as this is actual if you randomly picked a crime, what is the probability it is in a string of near-repeats of length N. I’m not quite sure of a better way to pose it though. Maybe it is better to think in terms of forecasts (eg given N prior crimes, what is the prob. of an additional near-repeat crime, similar to Piza and Carter). Or maybe in terms of if there are N near-repeats, what is the probability they will be linked to a common person (ala Mike Porter and crime linkage).

Also I should mention some of the cool work Liz Groff and Travis Taniguchi are doing on near-repeat work. I should probably just use their near-repeat code instead of rolling my own.

New paper: A simple weighted displacement difference test to evaluate place based crime interventions

At the ECCA conference this past spring Jerry Ratcliffe asked if I could apply some of my prior work on evaluating changes in crime patterns over time to make a set of confidence intervals for the weighted displacement quotient statistic (WDQ). The answer to that is no, you can’t, but in its stead I created another statistic in which you can do that, the weighted displacement difference (WDD). The work is published in the open access journal Crime Science.

The main idea is we wanted a simple statistic folks can use to evaluate place based interventions to reduce crime. All you need is pre and post crime counts for you treated and control areas of interest. Here is an excel spreadsheet to calculate the statistic, and below is a screen shot. You just need to fill in the pre and post counts for the treated and control locations and the spreadsheet will spit out the statistic, along with a p-value and a 95% confidence interval of the number of crimes reduced.

What is different compared to the WDQ statistic is that you need a control area for the displacement area too in this statistic. But if you are not worry about displacement, you can actually just put in zero’s for the displacement area and still do the statistic for the local (and its control area). In this way you can actually do two estimates, one for the local effects and one for the displacement. Just put in zero’s for the other values.

While you don’t really need to read the paper to be able to use the statistic, we do have some discussion on choosing control areas. In general the control areas should have similar counts of crime, you shouldn’t have a treatment area that has 100 crimes and a control area that only has 10 crimes. We also have this graph, which is basically a way to conduct a simple power analysis — the idea that “could you reasonably detect whether the intervention reduced crime” before you actually conduct the analysis.

So the way to read this graph is if you have a set of treated and control areas that have an average of 100 crimes in each period (so the cumulative total crimes is around 800), the number of crimes you need to reduce due to the intervention to even have weak evidence of a crime reduction (a one-tailed p-value of less than 0.1), the intervention needs to have prevented around 30 crimes. Many interventions just aren’t set up to have strong evidence of crime reductions. For example if you have a baseline of 20 crimes, you need to prevent 15 of them to find weak evidence of effectiveness. Interventions in areas with fewer baseline crimes basically cannot be verified they are effective using this simple of a design.

For those more mathy, I created a test statistic based on the differences in the changes of the counts over time by making an assumption that the counts are Poisson distributed. This is then basically just a combination of two difference-in-difference estimates (for the local and the displacement areas) using counts instead of means. For researchers with the technical capabilities, it probably makes more sense to use a data based approach to identify control areas (such as the synthetic control method or propensity score matching). This is of course assuming an actual randomized experiment is not feasible. But this is too much a burden for many crime analysts, so if you can construct a reasonable control area by hand you can use this statistic.

Aoristic analysis for hour of day and day of week in Excel

I’ve previously written code to conduct Aoristic analysis in SPSS. Since this reaches about an N of three crime analysts (if that even), I created an Excel spreadsheet to do the calculations for both the hour of the day and the day of the week in one go.

Note if you simply want within day analysis, Joseph Glover has a nice spreadsheet with VBA functions to accomplish that. But here I provide analysis for both the hour of the day and the day of the week. Here is the spreadsheet and some notes, and I will walk through using the spreadsheet below.

First off, you need your data in Excel to be BeginDateTime and EndDateTime — you cannot have the dates and times in separate fields. If you do have them in separate fields, if they are formatting correctly you can simply add your date field to your hour field. If you have the times in three separate date, hour, and minute fields, you can do a formula like =DATE + HOUR/24 + MINUTE/(60*24) to create the combined datetime field in Excel (excel stores a single date as one integer).

Presumably at this stage you should fix your data if it has errors. Do you have missing begin/end times? Some police databases when there is an exact time treat the end date time as missing — you will want to fix that before using this spreadsheet. I constructed the spreadsheet so it will ignore missing cells, as well as begin datetimes that occur after the end datetime.

So once your begin and end times are correctly set up, you can copy paste your dates into my Aoristic_HourWeekday.xlsx excel spreadsheet to do the aoristic calculations. If following along with my data I posted, go ahead and open up the two excel files in the zip file. In the Arlington_Burgs.xlsx data select the B2 cell.

Then scroll down to the bottom of the sheet, hold Shift, and then select the D3269 cell. That should highlight all of the data you need. Right-click, and the select Copy (or simply Ctrl + C).

Now migrate over to the Aoristic_HourWeekday.xlsx spreadsheet, and paste the data into the first three columns of the OriginalData sheet.

Now go to the DataConstructed sheet. Basically we need to update the formulas to recognize the new rows of data we just copied in. So go ahead and select the A11 to MI11 row. (Note there are a bunch of columns hidden from view).

Now we have a few over 3,000 cases in the Arlington burglary data. Grab the little green square in the lower right hand part of the selected cells, and then drag down the formulas. With your own data, you simply want to do this for as many cases as you have. If you go past your total N it is ok, it just treats the extra rows like missing data. This example with 3,268 cases then takes about a minute to crunch all of the calculations.

If you navigate to the TimeIntervals sheet, this is where the intervals are actually referenced, but I also place several summary statistics you might want to check out. The Total N shows that I have 3,268 good rows of data (which is what I expected). I have 110 missing rows (because I went over), and zero rows that have the begin/end times switched. The total proportion should always equal 1 — if it doesn’t I’ve messed up something — so please let me know!

Now the good stuff, if you navigate to the NiceTables_Graphs sheet it does all the summaries that you might want. Considering it takes awhile to do all the calculations (even for a tinier dataset of 3,000 cases), if you want to edit things I would suggest copying and pasting the data values from this sheet into another one, to avoid redoing needless calculations.

Interpreting the graphs you can see that burglaries in this dataset have a higher proportion of events during the daytime, but only on weekdays. Basically what you would expect.

Personally I would always do this analysis in SPSS, as you can make much nicer small multiple graphs than Excel like below. Also my SPSS code can split the data between different subsets. This particular Excel code you would just need to repeat for whatever subset you are interested in. But a better Excel sleuth than me can likely address some of those critiques.

One minor additional note on this is that Jerry’s original recommendation rounded the results. My code does proportional allocation. So if you have an interval like 00:50 TO 01:30, it would assign the [0-1] hour as 10/40, and [1-2] as 30/40 (original Jerry’s would be 50% in each hour bin). Also if you have an interval that is longer than the entire week, I simply assign equal ignorance to each bin, I don’t further wrap it around.

American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky — such as figuring out the proportion of black residents for block groups you need to add non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of area.

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.


The ACS data has consistent variable names over the years, such as B03001_001 is the total population, B03002_003 is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that given a directory of the excel template files will produce a nice set of dictionaries to help find what table particular variables are in.

#This python code grabs ACS meta-data templates
#To easier search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #getting the variables in a nice dictionaries
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        vars = [i.value for i in sh.row(0)] #names on the first row
        labs = [i.value for i in sh.row(1)] #labels on the second
        #now add to the overall dictionary
        for v,l in zip(vars,labs):
            lab_dict[v] = l
            loc_dict[v] = file
    #returning the two dictionaries
    return lab_dict,loc_dict
    
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
            'B16002_004']
            
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)

Data sources for crime generators

Those interested in micro place based crime analysis often need to collect information on businesses or other facilities where many people gather (e.g. hospitals, schools, libraries, parks). To keep it short, businesses influence the comings-and-goings of people, and those people are those who commit offenses and are victimized. Those doing neighborhood level research census data is almost a one stop shop, but that is not the case when trying to collect businesses data of interest. Here are some tips and resources I have collected over the years of conducting this research.

Alcohol License Data

Most states have a state level board in which one needs to obtain a license to sell alcohol. Bars and liquor stores are one of the most common micro crime generator locations criminologists are interested in, but in most states places like grocery stores, gas stations, and pharmacies also sell alcohol (minus those Quakers in Pennsylvania) and so need a license. So such lists contain many different crime generators of interest. For example here is Texas’s list, which includes a form to search for and download various license data. Here is Washington’s, which just has spreadsheets of the current alcohol and cannabis licenses in the state. To find these you can generally just google something like “Texas alcohol license data”.

In my experience these also have additional fields to further distinguish between the different types of locations. Such as besides the difference between on-premise vs off-premise, you can often also tell the difference between a sit down restaurant vs a more traditional bar. (Often based on the percent of food-stuffs vs alcohol that make up total revenue.) So if you were interested in a dataset of gas stations to examine commercial robbery, I might go here first as opposed to the other sources (again PA is an exception to that advice though, as well as dry counties).

Open Data Websites

Many large cities anymore have open data websites. If you simply google “[Your City] open data” they will often come up. Every city is unique in what data they have available, so you will just have to take a look on the site to see if whatever crime generator you are interested in is available. (These sites almost always contain reported crimes as well, I daresay reported crimes are the most common open data on these websites.) For businesses, the city may have a directory (like Chicago). (That is not the norm though.) They often have other points/places of interest as well, such as parks, hospitals and schools.

Another example is googling “[your city] GIS data”. Often cities/counties have a GIS department, and I’ve found that many publicly release some data, such as parcels, zoning, streets, school districts, etc. that are not included on the open data website. For example here is the Dallas GIS page, which includes streets, parcels, and parks. (Another pro-tip is that many cities have an ArcGIS data server lurking in the background, often which you can use to geocode address data. See these blog posts of mine (python,R) for examples. ) If you have a county website and you need some data, it never hurts to send a quick email to see if some of those datasets are available (ditto for crime via the local crime analyst). You have nothing to lose by sending a quick email to ask.

I’d note that sometimes you can figure out a bit from the zoning/parcel dataset. For instance there may be a particular special code for public schools or apartment complexes. NYC’s PLUTO data is the most extensive I have ever seen for a parcel dataset. Most though have simpler codes, but you can still at least figure out apartments vs residential vs commercial vs mixed zoning.

You will notice that finding these sites involve using google effectively. Since every place is idiosyncratic it is hard to give general advice. But google searches are easy. Recently I needed public high schools in Dallas for a project, and it was not on any of the prior sources I noted. A google search however turned up a statewide database of the public and charter school locations. If you include things like “GIS” or “shapefile” or “data” in the search it helps whittle it down some to provide a source that can actually be downloaded/manipulated.

Scraping from public websites

The prior two sources are generally going to be better vetted. They of course will have errors, but are typically based on direct data sources maintained by either the state or local government. All of the other sources I will list though are secondary, and I can’t really say to what extent they are incorrect. The biggest thing I have noticed with these data sources is that they tend to be missing facilities in my ad-hoc checks. (Prior mentioned sources at worst I’ve noticed a rare address swap with a PO box that was incorrect.)

I’ve written previously about using the google places API to scrape data. I’ve updated to create a short python code snippet that all you need is a bounding box you are looking for and it will do a grid search over the area for the place type you are interested in. Joel Caplan has a post about using Google Earth in a similar nature, but unfortunately that has a quite severe limitation — it only returns 10 locations. My python code snippet has no such limitation.

I don’t really understand googles current pricing scheme, but the places API has a very large number of free requests. So I’m pretty sure you won’t run out even when scraping a large city. (Geocoding and distance APIs are much fewer unfortunately, and so are much more limited.)

Other sources I have heard people use before are Yelp and Yellow pages. I haven’t checked those sources extensively (and if they have API’s like Google). When looking closely at the Google data, it tends to be missing places (it is up to the business owner to sign up for a business listing). Despite it being free and seemingly madness to not take the step to have your business listed easily in map searches, it is easy to find businesses that do not come up. So user beware.

Also, scraping the data for academic articles is pretty murky whether it violates the terms of service for these sites. They say you can’t cache the original data, but if you just store the lat/lon and then turn into a “count of locations” or a “distance to nearest location” (ala risk terrain modelling), I believe that does not violate the TOS (not a lawyer though — so take with a grain of salt). Also for academic projects since you are not making money I would not worry too extensively about being sued, but it is not a totally crazy concern.

Finally, the nature of scraping the business data is no different than other researchers who have been criticized for scraping public sites like Facebook or dating websites (it is just a business instead of personal info). I personally don’t find it unethical (and I did not think those prior researchers were unethical), but others will surely disagree.

City Observatory Data

City observatory has a convenient set of data, that they named the StoreFront Index. They have individual data points you can download for many different metro areas, along with their SIC codes. See also here for a nice map and to see if your metro area of interest is included.

See here for the tech report on which stores are included. They do not include liquor stores and gas stations though in their index. (Since it is based on Jane Jacob’s work I presume they also do not include used car sale lots.)

Lexis Nexis Business Data (and other proprietary sources)

The store front data come from a private database, Custom Lists U.S. Business Database. I’m not sure exactly what vendor produces this (a google search brings up several), but here are a few additional proprietary sources researchers may be interested in.

My local library in Plano (as well as my University), have access to a database named reference USA. This allows you to search for businesses in a particular geo area (such as zip code), as well as by other characteristics (such as by the previously mentioned SIC code). Also this database includes additional info. about sales and number of employees, which may be of further interest to tell the difference between small and large stores. (Obviously Wal-Mart has more customers and more crime than a smaller department store.) It provides the street address, which you will then need to geocode.

Reference USA though only allows you to download 250 addresses at a time, so could be painful for crime generators that are more prevalent or for larger cities. Another source though my friendly UTD librarian pointed out to me is Lexis Nexis’s database of public businesses. It has all the same info. as reference USA and you can bulk download the files. See here for a screenshot walkthrough my librarian created for me.

Any good sources I am missing? Let me know in the comments. In particular these databases I mention are cross-sectional snapshots in time. It would be difficult to use these to measure changes over time with few exceptions.