My online course lab materials and musings about online teaching

I often refer folks to the courses I have placed online. Just as an update for everyone, I have pages for each of my courses in the header at the top of my website. Several of these are just descriptions and syllabi, but for the few lab based courses I have taught over the years I have put the materials entirely online. Those are the Seminar in Research, Grad GIS, and Undergrad Crime Analysis courses:

And each of those pages links to a GitHub page where all the lab goodies are stored.

The Seminar in Research focuses on popular quasi-experimental designs in CJ, and has code in R/Stata/SPSS for the weekly lessons. (I will need to update it with python at some point; I may need to write my own python margins library though!)

Grad GIS is mostly older ArcGIS tutorials (I don't think I will update them for ArcGIS Pro; I will see when Eric Piza's new book comes out and probably just suggest that). Even though the screenshots are perhaps old at this point, the ideas/workflow are not. (It also has some tutorials on other open source tools, such as CrimeStat, Jerry's Near Repeat Calculator, GeoDa, spatial regression analysis in R, and Malleson and Andresen's SPPT tool, to name the examples I remember offhand.)

Undergrad Crime Analysis is mostly focused on the number crunching relevant to crime analysts in Excel, although it has a few things in Access (making SQL queries) and making a BOLO in Publisher.

So for folks who are self-learning, of course use those resources however you want. My suggestion is to skim through the syllabus, see if you want to learn about any particular lesson, and then jump right to that one. There is no need to slog through the whole course if you are just interested in one specific thing.

They are also freely available to any instructors who want to adapt the materials for their own courses.


One of the things that has disappointed me about the teaching response to Covid is that instead of institutions taking the opportunity to really invest in online teaching, people are just running around like chickens with their heads cut off and offering poor, last minute hybrid courses. (This is true both for the kiddos as well as for higher education.)

If you have ever taken a Coursera course, they are a real production! And the ones I have tried have all been really well done; nice videos, interactive quizzes with immediate feedback, etc. A professor on their own cannot accomplish that though; we would need investment from the University in filming and in scripting the webpage. But once it is finished, it can be delivered to the masses.

So instead of running courses with a tiny number of students, I think it makes more sense for Universities to actually pony up resources to help professors make professional looking online courses. Not the nonsense with a bad recorded lecture and a discussion board. It is IMO better to give someone a semester sabbatical to develop a really nice online course than make people develop them at the last minute. Once the course is set up, you really only need to administer the course, which takes much less work.

Another interested party may be professional organizations. For example, the American Society of Criminology could make an ad-hoc committee to develop a model curriculum for an intro criminology course. You can see in my course pages that I taught this at one point – there is no real reason why every criminology teacher needs to strike out on their own. This is both more work for the individual teacher, and it introduces quite a bit of variation in the content that crim/CJ students receive.

Even if ASC started smaller, say promoting individual lessons, that would be lovely. Part of the difficulty in teaching a broad course like Intro to Criminology is that I am not an expert on all of criminology. So for example if someone made a lesson plan/video for bio-social criminology, I would be more apt to use that. Think instead of a single textbook, leveraging multi-media.


It is a bit ironic, but one of the reasons I was hired at HMS was to internally deliver data science training. So even though I am in the private sector I am still teaching!

Like I said previously, you are on your own for developing teaching content at the University. There is very little oversight. I imagine many professors will cringe at my description, but one of the things I like at HMS is the collaboration in developing materials. So I initially sat down with my supervisor and project manager to develop the overall curricula. Then for individual lessons I submit my slides/lab portion to my supervisor to get feedback, and also do a dry run in front of one of my peers on our data science team to get feedback. Then in the end I do a recorded lecture – we limit to something like 30 people on WebEx so it is not lagging, but ultimately everyone in the org can access the video recording at a later date.

So again I think this is a better approach. It takes more time, and I only do one lecture at a time (so it takes a month or two to develop one lecture). But I think in the end this will be a better long term investment than the typical way universities deliver courses.

New book: Micro geographic analysis of Chicago homicides, 1965-2017

In joint work with Chris Herrmann and Dick Block, we now have a book out – Understanding Micro-Place Homicide Patterns in Chicago (1965 – 2017). It is a Springer Brief, so for anyone who has a journal article that is too long, this is a potential venue for the work. (Really this is about the length of three journal articles.)

A few things occurred to prompt me to look into this. First, Chicago saw a big spike in homicides in 2016 and 2017. Here is a graph breaking them down between domestic related homicides and all other homicides. You can see all of the volatility is related to non-domestic homicides.

So this (at least to me) begs the question of whether those spiked homicides show similar characteristics compared to historical homicides. Here we focus on long term spatial patterns, using micro place grid cells in the city, 150 by 150 meter cells. Dick & Carolyn Block had collated data, including the address where the body was discovered, using detective case notes starting in 1965 (ending in 2000). The data from 2000 through 2017 is the public incident report data released by Chicago PD online. Although Dick and Carolyn's public dataset is likely well known at this point, Dick has more detailed data than is released publicly on ICPSR, plus a few more years (through 2000). Here is a map showing those homicide patterns aggregated over the entire long time period.

So we really have two different broad exploratory analyses we employed in the work. One was to examine homicide clustering, and the other was to examine temporal patterns in homicides. For clustering, we go through a ton of different metrics common in the field, and I introduce one more, Theil's decomposition for within/between neighborhood clustering. This shows Theil's clustering metric within neighborhoods in Chicago (based on the entire time period).

So areas around the Loop showed more clustering in homicides, but here it appears somewhat confounded with neighborhood size – smaller neighborhoods appear to have more clustering. This is sort of par for the course for these clustering metrics (we go through several different Gini variants as well), in that they are pretty fickle. If you take a different temporal slice of the data or treat empty grid cells differently, the clustering metrics can change quite a bit.
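For those curious what this looks like in practice, below is a minimal python sketch of a generic Theil within/between decomposition for grid cell counts grouped into neighborhoods. It is just the textbook formula applied to a toy example, not the code from the book.

import numpy as np
import pandas as pd

def theil_decomposition(counts, groups):
    #Theil's T for grid cell counts, split into within/between neighborhood parts
    df = pd.DataFrame({"y": counts, "g": groups})
    y = df["y"].to_numpy(dtype=float)
    ybar = y.mean()
    #cell level terms, using the 0*log(0) = 0 convention for empty grid cells
    r = y / ybar
    cell = np.zeros(len(y))
    pos = y > 0
    cell[pos] = r[pos] * np.log(r[pos])
    total = cell.mean()
    #between component, based on neighborhood means relative to the city wide mean
    gstat = df.groupby("g")["y"].agg(["mean", "size"])
    gstat = gstat[gstat["mean"] > 0]  #all-zero neighborhoods contribute nothing
    share = gstat["size"] * gstat["mean"] / (len(df) * ybar)
    between = (share * np.log(gstat["mean"] / ybar)).sum()
    return total, total - between, between  #total, within, between

#toy example, 6 grid cells in 2 neighborhoods
tot, wit, bet = theil_decomposition([0, 1, 9, 0, 0, 2], ["A", "A", "A", "B", "B", "B"])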

So I personally prefer to focus on long term temporal patterns. Here I estimated group based trajectory models using zero-inflated Poisson models, and here are the predicted outputs for those grid cells over the city. You can see that, unlike the prior work that David Weisburd (Seattle), myself (Albany), or Martin Andresen (Vancouver) have done, the patterns are much more wavy. This may be due to looking over a much longer time horizon than any of those prior works did.

The big wave, Group 9, ends up being clearly tied to former large public housing projects, and their demolition corresponds to the downturn.

I have an interactive map to explore the other trajectory groups here. Unfortunately the others don't show as clear a pattern as Group 9, so it is difficult to answer any hard questions about the uptick in 2016/2017; you could find evidence of homicides dispersing vs homicides staying in the same places at a higher intensity depending on how you slice the data.

Unfortunately the analysis is never ending. Chicago homicides have again spiked this year, so maybe we will need to redo some analysis to see if the more current trends still hold. I think I will migrate away from the clustering metrics though (Gini and Theil); they appear to be too volatile to say much of anything about short term patterns. I think there may be other point pattern analyses that are more diagnostic for really understanding emerging/changing spatial patterns.

The coffee next to the cover image is Chris Herrmann’s beans, so go get yourself some as well at Fellowship Coffee!

A failed attempt at optimal search paths

I recently saw that Kim Rossmo has a paper describing a Bayesian approach to prioritizing areas in a search for missing persons. So he illustrates an approach to produce a probability surface, but that still leaves implicit how individuals should traverse that probability space in the search itself.

For an example of where there can be potential ambiguity even with the probability surface, in the surface below we have three hot spots. If we have four people to search this area, and they can only search a finite connected area (so no hop-scotching around), should we have them split between each of the hot spots, or should they cover one of the hot spots in more detail? (It is hard to tell in my graph, but the hot spot in the central western part of the graph has a higher but steeper hump, whereas the top right has more mass but is more spread out.)

I have actually failed to come up with a decent algorithm to do this though. It is similar to this prior work of mine, but I actually discovered some errors in that work when trying to apply it to this situation (it can produce disconnected subtours that are complicated paths). So I attempted several other variants and have yet to come up with a decent solution.

I tried out a greedy algorithm to solve the problem (pick the highest hump, march around like an ant until you have covered your max tour, and then start again). But this was not good either. It did generate some interesting accidental art though. So here is my greedy approach picking four tours in which each can traverse 300 grid cells, and here it says it is better to ignore the bottom hotspot and spread your effort around the other two areas:

I know this is pretty sub-optimal though, as you can continue to generate more tours through this approach and eventually find better ones.
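To make the greedy idea concrete, here is a minimal python sketch (not my actual code) of a single greedy tour over a toy probability surface. It starts at the highest cell and keeps stepping to the best unvisited neighbor until the budget runs out, which is exactly how it can paint itself into a corner and miss better tours.

import numpy as np

def greedy_tour(prob, budget):
    #one greedy connected tour: start at the max cell, then repeatedly move to
    #the unvisited 4-neighbor with the highest remaining probability mass
    visited = np.zeros(prob.shape, dtype=bool)
    r, c = np.unravel_index(np.argmax(prob), prob.shape)
    tour = [(r, c)]
    visited[r, c] = True
    for _ in range(budget - 1):
        neigh = [(r + dr, c + dc) for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < prob.shape[0] and 0 <= c + dc < prob.shape[1]
                 and not visited[r + dr, c + dc]]
        if not neigh:
            break  #boxed in, one reason this greedy approach is suboptimal
        r, c = max(neigh, key=lambda rc: prob[rc])
        tour.append((r, c))
        visited[r, c] = True
    return tour, prob[tuple(zip(*tour))].sum()

#toy surface with two humps; for multiple searchers you would call this
#repeatedly, zeroing out the cells already covered between calls
x, y = np.meshgrid(np.linspace(-3, 3, 60), np.linspace(-3, 3, 60))
surface = np.exp(-((x - 1)**2 + (y - 1)**2)) + 0.6*np.exp(-((x + 1.5)**2 + (y + 1)**2))
path, mass = greedy_tour(surface, budget=300)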

This is going to bug me forever now, but posting a blog to move on. So if you know of a solution please fill me in!

Mapping attitudes paper published

My paper (joint work with Jasmine Silver, Rob Worden, and Sarah McLean), Mapping attitudes towards the police at micro places, has been published in the most recent issue of the Journal of Quantitative Criminology. Here is the abstract:

Objectives: We examine satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. We illustrate the utility of this approach by comparing micro- and meso-level aggregations of policing attitudes, as well as by predicting views about the police from crime data at micro places.

Methods: In each survey, respondents provided the nearest intersection to their address. Using that geocoded survey data, we use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city and compare the micro-level pattern of policing attitudes to survey data aggregated to the census tract. We also use spatial and multi-level regression models to estimate the effect of local violent crimes on attitudes towards police, controlling for other individual and neighborhood level characteristics.

Results: We demonstrate that there are no systematic biases for respondents refusing to answer the nearest intersection question. We show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime. Models predicting satisfaction with police show that local counts of violent crime are a strong predictor of attitudes towards police, even above individual level predictors of race and age.

Conclusions: Asking survey respondents to provide the nearest intersection to where they live is a simple approach to mapping attitudes towards police at micro places. This approach provides advantages beyond those of using traditional neighborhood boundaries. Specifically, it provides more precise locations police may target interventions, as well as illuminates an important predictor (i.e., nearby violent crimes) of policing attitudes.

This was also one of my favorite papers to make maps for. We show how to take surveys and create analogs of hot spot maps for negative sentiment towards police. We do this by asking individuals to list their closest intersection (to still give some anonymity), and then creating inverse distance weighted maps of negative attitudes towards police.
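For a rough idea of the mapping step, here is a minimal sketch of inverse distance weighted interpolation of point level scores onto a regular grid. The survey points and scores here are made up, it is not the code or data from the paper.

import numpy as np

def idw_surface(pts_xy, values, grid_xy, power=2):
    #inverse distance weighted average of the point scores for every grid cell
    d = np.linalg.norm(grid_xy[:, None, :] - pts_xy[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-9)**power  #avoid divide by zero at the points
    return (w * values).sum(axis=1) / w.sum(axis=1)

#hypothetical example, 200 geocoded intersections with a 1-5 satisfaction score
rng = np.random.default_rng(0)
pts = rng.uniform(0, 10000, size=(200, 2))
score = rng.integers(1, 6, size=200).astype(float)
gx, gy = np.meshgrid(np.arange(0, 10000, 200), np.arange(0, 10000, 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])
smooth = idw_surface(pts, score, grid)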

We also find in this work that nearby crimes are the biggest factor in predicting negative sentiment towards police. This hints that past work aggregating attitudes to neighborhoods is inappropriate, and that police reducing crime is likely to have the best margin in terms of making people happier with the police in general.

As always, feel free to reach out for a copy of the paper if you cannot access JQC. (Or you can go and view the pre-print.)

Using the Google Vision and Streetview API to Explore Hotspots

So previously I have shown how to automate the process of downloading google street view imagery (for individual addresses & running down a street). One interesting application is to then code those streetview images. There are many applications in criminology of coding these images for disorder. Rob Sampson initially had the idea of ecometrics, in which he used systematic social observation, taking video going down various streets, to code physical disorder such as garbage on the street (Raudenbush & Sampson, 1999). Others then leveraged Google streetview imagery to do those same audits instead of collecting their own footage (Bader et al., 2017).

In those studies a human looks at the images and codes, there is XYZ in this photo and ABC in that photo. I was interested in testing out the Google Vision API to automate identifying parts of the images. So instead of a human manually reviewing, you build a score automatically. See for example work on identifying the perceived safety of streets (Naik et al., 2014).

Here I was motivated by some recent work of a colleague, Nate Connealy, in which he used this imagery to identify the differences in hot spots vs not hot spots (Connealy, 2020). Also I am pretty sure I saw George Mohler present on this at some ASC before I had the idea (it was similar to this paper, Khorshidi et al., 2019, not 100% sure it was the same one though). For an overview of crim applications using streetview and google maps, which also span CPTED type analyses, check out Vandeviver (2014).

So with Google's automated Vision API, say I submit this photo of a parking garage (this is actually the image I get if I submit the address Bad Address, Dallas, TX to the streetview API, so keep in mind errors like that in my subsequent analysis).

You get back these labels, where the first item is the description and the second is the ‘score’ for whether the item is in the image:

('Architecture', 0.817379355430603),
('Floor', 0.7577666640281677),
('Room', 0.7444316148757935),
('Building', 0.7440816164016724),
('Parking', 0.7051371335983276),
('Ceiling', 0.6624311208724976),
('Flooring', 0.6004095673561096),
('Wood', 0.5958532094955444),
('House', 0.5928719639778137),
('Metal', 0.5114516019821167)

So I don't tell Google what to look for, it just gives me back a ton of different labels depending on what it detects in the image. What I do here, based on my hotspot work (Wheeler & Reuter, 2020), is grab a sample of 300 addresses inside my Dallas based hot spot areas, and 300 addresses outside of hot spots. (These addresses are based on crime data themselves, so similar to Nate's work I only sample locations that have at least 1 crime.)

So this isn't a way to do predictions, but I think it is a potentially interesting application of exploratory data analysis for hot spots or high crime areas.
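One note, the code below reads a Sampled_Adds.csv file that already contains the 300 inside / 300 outside sample. A hedged sketch of how such a sample could be drawn with pandas is below; the CrimeAddresses.csv input is hypothetical, but the IncidentAddress and Inside columns match what the later snippet expects.

import pandas as pd

#hypothetical input, one row per crime address with a 0/1 'Inside' hot spot flag
adds = pd.read_csv('CrimeAddresses.csv')
adds = adds.drop_duplicates('IncidentAddress')
samp = adds.groupby('Inside').sample(n=300, random_state=10)
samp.to_csv('Sampled_Adds.csv', index=False)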

Python Code Snippet

I am just going to paste the python code-snippet in its entirety.

'''
Grabbing streetview images and detecting
labels using the google vision API
'''

from google.cloud import vision
import pandas as pd
import io
import os
import urllib.parse    #for URL encoding the address
import urllib.request  #for downloading the image
import time

os.chdir(r'D:\Dropbox\Dropbox\Documents\BLOG\GoogleLabels_hotspots\analysis')

add_dat = pd.read_csv('Sampled_Adds.csv')
add_dat['FullAdd'] = add_dat['IncidentAddress'] + ", DALLAS, TX"

# Code to download image based on address 
# https://andrewpwheeler.com/2015/12/28/using-python-to-grab-google-street-view-imagery/

myloc = r"./Images" #replace with your own location
key = "&key=????YourKeyHere????" 

def GetStreet(Add,SaveLoc,Name):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Name + ".jpg"
  loc_tosav = os.path.join(SaveLoc,fi)
  urllib.request.urlretrieve(MyUrl, loc_tosav)

# Code to get the google vision API labels
# for the image

client = vision.ImageAnnotatorClient.from_service_account_json('Geo Dallas-b5543ff0bb6d.json')

def LabelImage(ImageLoc):
    # Loads the image into memory
    with io.open(ImageLoc, 'rb') as image_file:
        content = image_file.read()
    image = vision.types.Image(content=content)
    response = client.label_detection(image=image)
    labels = response.label_annotations
    res = []
    if response.error.message:
        print(f'Error for image {ImageLoc}')
        print(f'Error Message {response.error.message}')
        res.append( ('Error', 1.0 ) )
    else:
        res = []
        for l in labels:
            res.append( (l.description , l.score) )
    return res

#A random parking garage!
GetStreet('Bad Address, Dallas, TX',myloc,'Bad_Address')    
LabelImage(os.path.join(myloc,'Bad_Address.jpg'))
            
long_tup = []
for index, row in add_dat.iterrows():
    #Name of the image
    nm = str(index) + "_" + str(row['Inside'])
    #Download the image    
    GetStreet(row['FullAdd'],myloc,nm)
    #Get the labels
    labs = LabelImage(os.path.join(myloc,nm + '.jpg'))
    #Build the new data tuples
    for l in labs:
        long_dat = (index, nm +'.jpg', row['Inside'], row['FullAdd'], l[0], l[1])
        long_tup.append(long_dat)
    #Sleep for a second to not spam the servers
    time.sleep(1)
    print(f'Done with index {index}')

long_dat = pd.DataFrame(long_tup, 
                        columns=['Index','Image','Inside','Address','Description','Score'])
            
long_dat.to_csv('LabeledData.csv',index=False)

To get this to work you need a few things. First, you need to enable both the Vision API and the Streetview API in your Google API console. The streetview API has a key you can get directly from the API console (as described in my prior posts). But the vision API is different; you download a json file with all the necessary credential info and feed it into the client call. Once that is all done, you have it set up to query both APIs to get the images and then get the labels. But this is quick and dirty, and it only does minimal error checking.

Here is a screenshot of some of the images downloaded. You can see that the streetview API doesn't fail when there is no image available, it just returns a mostly blank gray image.

Analyzing the Results

I am not above just piping the results into an Excel document and doing some quick pivot tables. (I like doing that when there are many categories I want to explore quickly.) So here is a pivot table of the sum of the scores across the 300 outside hotspot (column 0) and 300 inside (column 1) images. So you can see the label of property is in more than half of the images for each (since the score value is never above 1). But property is more common outside hot spots than it is inside hot spots.

Here are contrast coded sums, which identify the labels that are more common either inside or outside of hotspots. Outside of hotspots, trees and plants appear more common (see Kondo et al., 2017 and Kondo's other work on the topic). Inside hotspots we have more cars & asphalt, for example.
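If you prefer to stay in python, here is a rough equivalent of those summaries using pandas on the LabeledData.csv file generated above. This assumes the Inside column is coded 0/1 as in the pivot table, and treats the contrast as simply the inside minus outside difference in summed scores.

import pandas as pd

long_dat = pd.read_csv('LabeledData.csv')

#sum of label scores outside (0) vs inside (1) hot spots
piv = long_dat.pivot_table(index='Description', columns='Inside',
                           values='Score', aggfunc='sum', fill_value=0)

#contrast coded difference, positive means more common inside hot spots
piv['Contrast'] = piv[1] - piv[0]
print(piv.sort_values('Contrast').head(10))  #labels more common outside
print(piv.sort_values('Contrast').tail(10))  #labels more common inside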

This is just a quick and dirty analysis though. I do not take missing images into account here. The Screenshot label shows missing images are more common inside hotspots. And since I use the addresses, sometimes it gives me a shot down the street instead of the view perpendicular to the street. (I am not 100% sure of the best way to do it; if you geocode and then use the lat/lon, you may not have the right view of the property either depending on the geocoding engine, so maybe going with the address directly is better?)

Future Work

In terms of predictive applications, I think using the streetview imagery is not likely to improve crime forecasts, and that it is really only worthwhile for EDA or theory testing. For prediction, I actually think using satellite imagery has more potential (see Jay, 2020 for an example, although that is causal rather than predictive analysis).

Prior work has used 311 calls for service to identify high disorder areas (Magee, 2020; O'Brien & Winship, 2017; Wheeler, 2018), so I wonder if you can specifically build an image detector to identify particular disorder aspects that are not redundant with 311 calls, and also perhaps scales directly relevant to CPTED. The Google Vision labels are a bit superficial to really use for many theoretical crim applications I am afraid, but they are an interesting exploratory data analysis to check out.


Street Network Distances and Correlations

Wouter Steenbeek (a friend and co-author on a few articles) has a few recent blog posts replicating some of my prior work on street network vs Euclidean distances in Albany, NY (Wouter's posts 1, 2; my posts 1, 2).

In Wouter's second post, he was particularly interested in checking out shorter distances (as that is what we are often interested in in criminology, e.g. checking crime clustering). When doing that, the relationship between network and Euclidean distances sometimes appears less strong, so my initial statement that they tend to be highly correlated is incorrect.

But this is an artifact of restricting the range, and it applies to the correlation between any two measures – worth pointing out in general for analysis. If you artificially restrict the domain of one variable, the correlation goes down. See some examples on the cross-validated site (1, 2) that illustrate this with nicer graphs than I can whip up in a short time.

But for a quick idea about the issue, imagine a scenario where you slice out Euclidean distances in some bin of width X, and check the scatterplot between Euclidean and network distances. You will get less variation on the X axis relative to the variation on the Y axis. Now take this to the extreme, and slice the Euclidean distances at only one value, say 100 meters exactly. In this scatterplot there is no X variation, it is just a vertical line of points. So in that scenario the correlation is 0.
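Here is a quick simulation of that restriction of range effect, using two correlated normal variables rather than actual distances:

import numpy as np

rng = np.random.default_rng(1)
#two variables with a true correlation of about 0.95
x = rng.normal(size=100000)
y = 0.95*x + np.sqrt(1 - 0.95**2)*rng.normal(size=100000)

print(np.corrcoef(x, y)[0, 1])            #~0.95 on the full sample
sel = (x > 0.0) & (x < 0.2)               #restrict x to a narrow slice
print(np.corrcoef(x[sel], y[sel])[0, 1])  #much closer to 0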

So I should not say the correlation between the two measures is high, as this is not always true – you can construct an artificial sample in which that statement is false. So a more accurate statement is that you can use the Euclidean distance to predict the network distance fairly accurately, or that the linear relationship between Euclidean and network distances is quite regular – no matter what the Euclidean distance is.

I have posted the python code for my analysis here. But for a quick rundown, I grab the street network for a buffer around Albany, NY using the osmnx library (so it is open street map network data). I convert this street network to an undirected graph (so no worrying about one-way streets) in a local projection. Then using all of the intersections in Albany (a few over 4,000), I calculate all of the pairwise distances (around 8.7 million pairs, which takes my computer a little over a day to crunch out in the background).
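The linked code has the full details, but a minimal sketch of that osmnx workflow looks something like the below. The place query is illustrative, and the all-pairs step is the part that takes on the order of a day (and a lot of memory, in practice you would stream the results to disk).

import osmnx as ox
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist

#grab the drivable street network and project to a local CRS (lengths in meters)
G = ox.graph_from_place("Albany, New York, USA", network_type="drive")
G = ox.project_graph(G)
G = G.to_undirected()  #ignore one-way streets, as in the post

#Euclidean distances between all node pairs (nodes are roughly the intersections)
xy = np.array([[data["x"], data["y"]] for _, data in G.nodes(data=True)])
euclid = pdist(xy)

#network distances via Dijkstra, this is the slow part
net = dict(nx.all_pairs_dijkstra_path_length(G, weight="length"))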

So again, the overall correlation is quite high:

But if you chunk the data up into tinier intervals, here 200 meter intervals, the correlations are smaller (an index of 100 means [0-200), 300 means [200-400), etc.).

But this does not mean the linear relationship between the two changes. Here is a comparison of the linear regression line for the whole sample (orange) vs a broken-stick type model (the blue line). Imagine you take a slice of data, e.g. all Euclidean distances in the bin [100-200), and fit a regression line, and then do the same for the Euclidean distances in [200-300), etc. The blue line shows those regression fits for each of those individual binned estimates. You can see that the two estimates are almost indistinguishable, so the relationship doesn't change if you subset the data to shorter distances.

Technically the way I have drawn the blue line is misleading; I should have breaks in the line (it is not forced to be connected between bins, unlike the restricted cubic splines in my prior post). But I am too lazy to write code to do those splits at the moment.

Now, what does this mean exactly? For research designs that want to use network distance as an independent variable, e.g. looking at prison visitation as a function of distance, or in my work on patrol redistricting where I had to impute some missing travel time distances, it is likely OK to use typical Euclidean distances. Even my paper on survivability for gun shot fatality shows improved accuracy estimates using network distances, but very similar overall effects compared to using Euclidean distances.

Here I have my computer crunch out the network distances for a day, whereas the Euclidean distances for the same data only take a second, e.g. using scipy.spatial.distance. So it depends on the nature of the analysis whether that extra effort is worth it. (It helps to have good libraries to ease the work; here I used osmnx for python, and Wouter showed R code using sf to deal with the street networks. The hardest part is that the networks are often not stored in a way that makes doing the routing very easy. Neither of those libraries were available back in 2014.) Also note you only need to do the network calculations once and can then cache them (and I could have made these network computations go faster if I parallelized the lookup). So it is slightly onerous to do the network computations, but not impossible.

So where might it make a difference? One common use of these network distances in criminology is for analyses like Ripley's K or near-repeat patterns. I don't believe using network distances makes a big difference here, but I cannot say for sure. What I believe happens is that using network distances will dilate the distances, e.g. if you conclude two point patterns are clustered starting at 100 meters using Euclidean distances, then with network distances the clustering may spread out further and not show up until 200 meters. I do not think it would change overall inferences, such as whether two point patterns are clustered or not. (One point where it does make a difference is doing spatial permutations in Ripley's K; you should definitely restrict the simulations to generating hypothetical distributions on the street network and not anywhere in the study area.)

Also Stijn Ruiter makes the point (noted in Wouter's second post) that street networks may be preferable for prediction purposes. Stijn's point is related to spatial units of analysis, not to Euclidean vs network distances. You could have a raster spatial unit of analysis but incorporate street network statistics, and vice versa you could have a vector street segment unit of analysis and use Euclidean distance measures for different measures related to those vector units.

Wouter's post also brought up another idea I've had for awhile: when using spatial buffers around areas as controls, they can be bad control areas, as even if you normalize the area they have a very tiny sliver of network distance attributable to them. I will need to show that in another blog post though. (This was mostly my excuse to learn osmnx to do the routing!)

Recent Papers on Hot Spots of Crime in Dallas

So I have two different papers that were published recently. Both are on hot spots in Dallas, so might as well discuss them together.

For each I have posted the code to replicate the results (and that spreadsheet has links to preprints as well).

For a bit of background on the motivation for the projects, Dallas has had official hot spots, named TAAG (Target Area Action Grid) areas. These were clearly larger than what would be considered best practice in identifying hot spots (they were more like entire neighborhoods). I realize 'best practices' is a bit wishy-washy, but the TAAG areas ended up covering around 20% of the city (a smidge over 65 square miles). Here is a map of the 2017 areas. There were 54 TAAG areas, so on average each covered a little over 1 square mile.

Additionally I knew the Dallas police department was interested in purchasing the RTM software to do hot spots. And a separate group, the Dallas Crime Task Force, was interested in using the software as well for non-police related interventions.

So I did these projects on my own (with my colleagues Wouter and Sydney of course). It wasn’t paid work for any of these groups (I asked DPD if they were interested, and had shared my results with folks from CPAL before that task force report came out, but nothing much came of it unfortunately). But my results for Dallas data are very likely to generalize to other places, so hopefully they will be helpful to others.

Machine Learning to Predict and Understand Hot Spots

So I see the appeal for folks who want to use RTM. It is well validated in both theory and practice, and Joel has made a nice software as a service app. But I knew going in that I could likely improve upon the predictions compared to RTM.

RTM tries to find a middle ground between prediction and causality (which isn't a critique, it is sort of what we are all doing). RTM in the end spits out predictions like "Within 800 feet of a Subway Entrance is Risk Factor 1" and "The Density of Bars within 500 Feet is Risk Factor 2". So it prefers simple models that have prognostic value for PDs (or other agencies) to identify potential causal reasons for why a location is high crime. And subsequently it helps not only to identify where hot spots are, but also to frame the potential interventions an agency may be interested in.

But this simplicity has a few drawbacks. One is that it is a global model, e.g. "within 800 feet of a subway entrance" applies to all subway entrances in the city. Most crime generators have a distribution where the majority of, say, subway entrances are relatively safe and only a few end up being high crime. Another is that it forces the way different crime generators predict crime to be a series of step functions, e.g. "within 600 ft" or "a high density within 1000 ft". In reality, most geographic processes follow a distance decay function. E.g. if you are looking at the relationship between check-cashing stores and street robbery, there are likely to be more robberies very near the store, tailing off gradually the further away you get.
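To make that distinction concrete, here is a toy illustration of the two kinds of features, using hypothetical distances (in feet) from one grid cell to nearby check-cashing stores:

import numpy as np

#hypothetical distances (feet) from one grid cell to nearby check-cashing stores
d = np.array([150.0, 420.0, 780.0, 1500.0, 2600.0])

step_feature = (d <= 800).sum()           #RTM style, count within 800 feet
decay_feature = np.exp(-d / 500.0).sum()  #smooth distance decay alternative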

So I fit a more complicated random forest model that has neither of those limitations and can learn much more complicated functions, both in terms of distance to crime generators as well as spatially varying over the city. But because of that you don’t get the simple model interpretation – they are fundamentally conflicting goals. In terms of predictions either my machine learning model or a simpler comparison of using prior crime = future crime greatly outperforms RTM for several different predictive metrics.

So this shows the predictions are better than RTM no matter how you slice the hot spot areas, but again you lose the prognostic value of RTM. To replace that, I show local interpretability scores for hot spots. I have an online map here for an example. If you click on one of the high crime predicted areas, it gives you a local breakdown of the different variables that contribute to the risk score.

So it is still more complicated than RTM, but gets you a local set of factors that potentially contribute to why places are hot spots. (It is still superficial in terms of causality, but PDs aren’t going to be able to get really well identified causal relationships for these types of predictions.)

Return on Investment for Hot Spots Policing

The second part of this is that Dallas is no doubt in a tight economic bind, and this was even before all the discussion about reforming police budgets. Policing academics have been saying for years that PDs should shift many more resources from reactive to proactive policing. But how do you make the argument that it is in police departments' best interest to shift resources or invest in additional resources?

To do this I aimed to calculate a return on investment for hot spots policing. Priscilla Hunt (from RAND) recently came up with labor cost estimates for crime specifically relevant to police departments. So if an aggravated assault happens, PDs (in Texas) typically spend around $8k in labor costs to respond to the crime and investigate (it is $125k for a homicide). Based on this you can say, if I can prevent 10 agg assaults, I then save $80k in labor costs. I use this logic to estimate a return on investment for PDs to do hot spots policing.
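A back of the envelope sketch of that arithmetic is below. The baseline count and the assumed effect size are hypothetical, only the $8k labor cost figure comes from Hunt's estimates quoted above.

#hypothetical baseline and effect size, just to show the arithmetic
agg_assaults_per_year = 40  #baseline aggravated assaults in one hot spot
effect = 0.17               #assumed proportional reduction from hot spots policing
years = 1.5
labor_cost = 8000           #labor cost per aggravated assault (Hunt's estimate)

prevented = agg_assaults_per_year * years * effect
savings = prevented * labor_cost
print(f"~{prevented:.0f} prevented agg assaults, ~${savings:,.0f} in saved labor costs")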

So first I generate hot spots, weighting for the costs of those crimes. Here is an interactive map to check them out, and below is a screenshot of the map.

I have an example of then calculating a return on investment for the hot spot area that captured the most crime. I get this estimate by transforming meta-analysis estimates of hot spots policing effects into an average crime reduction, and then backing out how much in labor costs that would save a police department. So in this hot spot, the ROI for hot spots policing (over 1.5 years) is $350k.

That return would justify at least one (probably more like two) full time officers just to be assigned to that specific hot spot. So if you actually hire more officers, it will be around net-zero in terms of labor costs. If you shift around current officers it should be a net gain in labor resources for the PD.

If you do this ROI calculation, most of the hot spots I identify in the study likely aren't hot enough to justify hot spots policing from this perspective (they would probably never justify the intensive overtime that is typical of crackdown style interventions). But a few clearly are, and definitely should be the targets of some type of hot spot intervention.

New paper out: Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims

A recent paper with Gio Circo, Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims, was published in Applied Spatial Analysis and Policy. In this work, Gio and I estimate the marginal effect that drive time distance to the nearest Level 1 trauma center has on the probability a victim dies of a gun shot wound, using open Philadelphia data.

If you do not have access to that published version, here is a pre-print version. (And you can always email me or Gio and ask for a copy.) Also because we use open data, we have posted the data and code used for the analysis. (Gio did most of the work!)

For a bit of the background on the project, Gio had another paper estimating a similar model using Detroit data. But Gio estimated those models with aggregate data. I was familiar with more detailed Philly shooting data, as I used it for an example hot spot cluster map in my GIS crime mapping class.

There are two benefits to leveraging micro data instead of the aggregated data. One is that you can incorporate micro level incident characteristics into the model. The other is that you can get the exact XY coordinates where the incident occurred. And using those exact coordinates we calculate drive time distances to the hospital, which offer a slight benefit in terms of leave-one-out cross-validated accuracy compared to Euclidean distances.

So in terms of incident level characteristics, the biggest factor in determining your probability of death is not the distance to the nearest hospital, but where you physically get shot on your body. Here is a marginal effect plot from our models, showing how the joint effect of injury location (as different colors) and the drive time distance impacts the probability of death. So if you get shot in the head vs the torso, you have around a 30% jump in the probability of death from that gun shot wound. Or if you get shot in an extremity you have a very low probability of death as well.

But you can see from that graph that the margins for drive times are not negligible. So if you are nearby a hospital and shot in the torso your probability of dying is around 20%, whereas if you are 30 minutes away your probability rises to around 30%. You can then use this to map out isochrone type survivability estimates over the city. This example map shows, if you get shot in the torso, the probability of death based on the drive time distance to the nearest Level 1 trauma location.
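Just to show the shape of that relationship, here is a tiny sketch with made up logit coefficients chosen only to reproduce the ballpark torso numbers quoted above (around 20% nearby, rising to around 30% at a 30 minute drive). It is not the fitted model from the paper.

import numpy as np

#hypothetical coefficients, picked only to match the ballpark torso numbers
b0, b_drive = -1.39, 0.017
drive_minutes = np.arange(0, 35, 5)
p_death = 1.0 / (1.0 + np.exp(-(b0 + b_drive*drive_minutes)))
print(dict(zip(drive_minutes.tolist(), p_death.round(2).tolist())))
#{0: 0.2, 5: 0.21, 10: 0.23, 15: 0.24, 20: 0.26, 25: 0.28, 30: 0.29}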

Fortunately many shootings do not occur in the northernmost parts of Philadelphia; here is a map of the number of shootings over the city for our sample.

You can subsequently use these models for hypotheticals, either taking a trauma center away or adding a trauma center. So given the density of shootings and the drive time distances, it might make sense for Philly to invest in a trauma center in the shooting hot spot in the Kensington area (northeast of Temple). (You could technically figure out an 'optimal' location given the distribution of shootings, but since you can't just plop down a hospital wherever, it would make more sense to do hypothetical investments in current hospitals.)

For a simplified example, imagine you had 100 shootings in the torso that were an average 20 minutes away. The average probability of death in that case is around 25% (so ~25 homicides). If you hypothetically have a location that is only 5 minutes away, the probability goes down to more like 20% (so ~20 homicides). So in that hypothetical, the distance margin would have prevented 5 deaths.

One future piece of research I would be interested in examining is pre-post ShotSpotter. In that article Jen Doleac is right that the empirical evidence for ShotSpotter reducing shootings is pretty flimsy, but preventing mortality by getting to the scene faster may be one mechanism by which ShotSpotter can justify its cost.

Making aoristic density maps in R

I saw the other day that Jerry made/updated an R package to do aoristic analysis. A nice part of this is that it returns the weights breakdown for individual cases, which you can then make maps of. My go-to hot spot map for data visualization, the kernel density map, is a bit tough to work with for weighted data in R though (tough is maybe not the right word; to use ggplot it takes a bit of work leveraging other packages). So here are some notes on that.

I have provided the data/code here. It is burglaries in Dallas, specifically I filter out just for business burglaries.

R Code Snippet

First, for my front end I load the libraries I will be using, and change the working directory to where my data is located.

############################
library(aoristic) #aoristic analysis 
library(rgdal)    #importing spatial data
library(spatstat) #weighted kde
library(raster)   #manipulate raster object
library(ggplot2)  #for contour graphs
library(sf)       #easier to plot sf objects

my_dir <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\aoristic_maps_R\\data_analysis"
setwd(my_dir)
############################

Next I just have one user defined function. It takes an input polygon (here the polygon that defines the borders of Dallas) and returns a raster grid covering the bounding box. It also has an extra data field saying whether the grid cell is inside/outside of the boundary. (This is mostly convenient when creating an RTM style dataset, to make all the features conform to the same grid cells.)

###########################
#Data Manipulation Functions

#B is border, g is size of grid cell on one side
BaseRaster <- function(b,g){
    base_raster <- raster(ext = extent(b), res=g)
    projection(base_raster) <- crs(b)
    mask_raster <- rasterize(b, base_raster, getCover=TRUE) #percentage of cover, 0 is outside
    return(mask_raster)
}
###########################

In the next part I grab the datasets I will be using: a boundary file for Dallas (in which I chopped off the Lochs, so I will not be doing an analysis of boat house burglaries today), and then the crime data. In R I believe you always have to convert date-times when reading from a CSV (it never smartly infers that a column is a date/time). Then I do some other data fiddling – Jerry has a nice function to check and make sure the date/times are all in order, and then I get rid of points outside of Dallas using the sp over function. Finally, the dataset covers both residential and commercial burglaries, but I just look at the commercial ones here.

###########################
#Get the datasets

#Geo data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
base_Dallas <- BaseRaster(b=boundary,g=200) 
base_df <- as.data.frame(base_Dallas,long=TRUE,xy=TRUE)

#Crime Data
crime_dat <- read.csv('Burglary_Dallas.csv', stringsAsFactors=FALSE)
#prepping time fields
crime_dat$Beg <- as.POSIXct(crime_dat$StartingDateTime, format="%m/%d/%Y %H:%M:%OS")
crime_dat$End <- as.POSIXct(crime_dat$EndingDateTime, format="%m/%d/%Y %H:%M:%OS")

#cleaning up data
aor_check <- aoristic.datacheck(crime_dat, 'XCoordinate', 'YCoordinate', 'Beg', 'End')
coordinates(crime_dat) <- crime_dat[,c('XCoordinate', 'YCoordinate')]
crs(crime_dat) <- crs(boundary)
over_check <- over(crime_dat, boundary)
keep_rows <- (aor_check$aoristic_datacheck == 0) & (!is.na(over_check$city))
crime_dat_clean <- crime_dat[keep_rows,]

#only look at business burgs to make it go abit faster
busi_burgs <- crime_dat_clean[ crime_dat_clean$UCROffense == 'BURGLARY-BUSINESS', ]
###########################

The next part preps the aoristic weights. First, the aoristic.df function from Jerry's aoristic package returns the weights broken down into the 168 hours of the week. Here I then collapse across the weekdays into the same hour of the day, which is simple to do, you just add up the weights.

After that it is some more geographic data munging, using the spatstat package to do the heavy lifting for the weighted kernel density estimate, and then stuffing the result back into another data frame. My bandwidth here, 3000 feet, is a bit large but makes nicer looking maps. If you make it smaller you will have more bumpy and localized hot spots in the kernel density estimate.

###########################
#aoristic weights

#This takes like a minute
res_weights <- aoristic.df(busi_burgs@data, 'XCoordinate', 'YCoordinate', 'Beg', 'End')

#Binning into same hourly bins
for (i in 1:24){
    cols <- (0:6*24)+i+5
    lab <- paste0("Hour",i)
    res_weights[,c(lab)] <- rowSums(res_weights[,cols])
}

#Prepping the spatstat junk I need
peval <- rasterToPoints(base_Dallas)[,1:2]
spWin <- as.owin(as.data.frame(peval))
sp_ppp <- as.ppp(res_weights[,c('x_lon','y_lat')],W=spWin) #spp point pattern object

#Creating a dataframe with all of the weighted KDE
Hour_Labs <- paste0("Hour",1:24)

for (h in Hour_Labs){
  sp_den <- density.ppp(sp_ppp,weights=res_weights[,c(h)],
                        sigma=3000,
                        edge=FALSE,warnings=FALSE)
  sp_dat <- as.data.frame(sp_den)
  kd_raster <- rasterFromXYZ(sp_dat,res=res(base_Dallas),crs=crs(base_Dallas))
  base_df[,c(h)] <- as.data.frame(kd_raster,long=TRUE)$value
}
###########################

If you are following along, you may be wondering why all the hassle. It is partly because I want to use ggplot to make maps, but its geom_contour does not accept weights, so I need to do the data manipulation myself to supply ggplot the weighted data in the proper format.

First I turn my Dallas boundary into a simple features sf object, then I create my filled contour graph, supplying the regular grid X/Y and the Z values for the first hour of the day (so between midnight and 1 am).

###########################
#now making contour graphs

dallas_sf <- st_as_sf(boundary)

#A plot for one hour of the day
hour1 <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Hour1), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [0-1)') + 
  theme_void() + theme(legend.position = "none")
hour1

png('Hour1.png', height=5, width=5, units="in", res=1000, type="cairo") 
hour1
dev.off()
###########################

Nice right! I have in the code my attempt to make a super snazzy small multiple plot, but that was not working out so well for me. You can then go ahead and make up other slices if you want. Here is an example taking an extended lunchtime time period.

###########################
#Plot for the afternoon time period
base_df$Afternoon <- rowSums(base_df[,paste0("Hour",10:17)])

afternoon <- ggplot() + 
  geom_contour_filled(data=base_df, aes(x, y, z = Afternoon), bins=9) +
  geom_sf(data=dallas_sf, fill=NA, color='black') +
  scale_fill_brewer(palette="Greens") +
  ggtitle('       Hour [9:00-17:00)') + 
  theme_void() + theme(legend.position = "none")
afternoon
###########################

So you can see that the patterns only slightly changed compared to the prior midnight graph.

Note that these plots will have different breaks, but you could set them to be equal by simply specifying a breaks argument in the geom_contour_filled call.

I will leave it up to someone who is more adept at R code than me to make a cool animated viz over time from this. But that is a way to mash up the temporal weights in a map.

Notes on making Leaflet maps in R

The other day I wrote a blog post for crimrxiv about posting interactive graphics on their pre-print sharing service. I figured it would be good to share my notes on making interactive maps, and to date I’ve mostly created these using the R leaflet library.

The reason I like these interactive maps is they allow you to zoom in and look at hot spots of crime. With the slippy base maps you can then see, oh OK this hot spot is by a train station, or an apartment complex, etc. It also allows you to check out specific data labels via pop-ups as I will show.

I'm using data from my paper on creating cost of crime weighted hot spots in Dallas (which will be forthcoming in Police Quarterly soonish). But I have posted a more direct set of replication code for the blog post here.

R Code

So first I load the R libraries I am using, and also change the working directory to where I have my data located on my Windows machine.

##########################################################
#This code creates a nice leaflet map of my DBSCAN areas

library(rgdal)       #read in shapefiles
library(sp)          #spatial objects
library(leaflet)     #for creating interactive maps
library(htmlwidgets) #for exporting interactive maps

#will need to change baseLoc if replicating on your machine
baseLoc <- "D:\\Dropbox\\Dropbox\\Documents\\BLOG\\leaflet_R_examples\\Analysis"
setwd(baseLoc)
##########################################################

Second, I read in my shapefiles using the rgdal library. This is important, as it includes the projection information. To plot the spatial objects on a slippy map they need to be in WGS84 latitude/longitude coordinates (leaflet then handles projecting them onto the Web Mercator basemap). As another trick I like with these basemaps, for the outlined area (the Dallas boundary here) it is easier to plot a line spatial object as opposed to an empty filled polygon. You don't need to worry as much about the order of the layers that way.

##########################################################
#Get the boundary data and DBSCAN data
boundary <- readOGR(dsn="Dallas_MainArea_Proj.shp",layer="Dallas_MainArea_Proj")
dbscan_areas <- readOGR(dsn="db_scan.shp",layer="db_scan")

#Now convert to WGS
DalLatLon <- spTransform(boundary,CRS("+init=epsg:4326"))
DallLine <- as(DalLatLon, 'SpatialLines') #Leaflet useful for boundaries to be lines instead of areas
dbscan_LatLon <- spTransform(dbscan_areas,CRS("+init=epsg:4326") )

#Quick and Dirty plot to check projections are OK
plot(DallLine)
plot(dbscan_LatLon,add=TRUE,col='blue')
##########################################################

In the next part, I have a custom function I made to create pop-up labels for these leaflet maps. First I need to read in a table with the data info for the hot spot areas and merge that into the spatial object. The way my custom function works is I pass it the dataset, then I have arguments for the variables I want and the way I want them labeled. The function does the work of making the labels bolded and putting line breaks into the HTML. (No doubt others have created nice libraries to do HTML tables/graphs inside the pop-ups that I am unaware of.) If you check out the final print statement, it shows the HTML it builds for one of the labels: <strong>ID: </strong>1<br><strong>$ (Thousands): </strong>116.9<br><strong>PAI: </strong>10.3<br><strong>Street Length (Miles): </strong>0.4

##########################################################
#Function for labels

#read in data
crime_stats <- read.csv('ClusterStats_wlen.csv', stringsAsFactors=FALSE)
dbscan_stats <- crime_stats[crime_stats$type == 'DBSCAN',]
dbscan_stats$clus_id <- as.numeric(dbscan_stats$AreaStr) #because factors=False!

#merge into the dbscan areas
dbscan_LL <- merge(dbscan_LatLon,dbscan_stats)

LabFunct <- function(data,vars,labs){
  n <- length(labs)
  add_lab <- paste0("<strong>",labs[1],"</strong>",data[,vars[1]])
  for (i in 2:n){
    add_lab <- paste0(add_lab,"<br><strong>",labs[i],"</strong>",data[,vars[i]])
  }
  return(add_lab)
}

#create labels
vs <- c('AreaStr', 'val_th', 'PAI_valth_len', 'LenMile')
#Lazy, so just going to round these values
for (v in vs[-1]){
  dbscan_LL@data[,v] <- round(dbscan_LL@data[,v],1)
}  
lb <- c('ID: ','$ (Thousands): ','PAI: ','Street Length (Miles): ')
diss_lab <- LabFunct(dbscan_LL@data, vs, lb)

print(diss_lab[1]) #showing off just one
##########################################################

Now finally onto the hotspot map. This is a bit to chew over, so I will go through bit-by-bit.

##########################################################
HotSpotMap <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
  addProviderTiles(providers$CartoDB.Positron, group = "CartoDB Lite") %>%
  addPolylines(data=DallLine, color='black', weight=4, group="Dallas Boundary") %>%
  addPolygons(data=dbscan_LL,color = "blue", weight = 2, opacity = 1.0, 
              fillOpacity = 0.5, group="DBSCAN Areas",popup=diss_lab, 
              highlight = highlightOptions(weight = 5,bringToFront = TRUE)) %>%
  addLayersControl(baseGroups = c("Open Street Map","CartoDB Lite"),
                   overlayGroups = c("Dallas Boundary","DBSCAN Areas"),
                   options = layersControlOptions(collapsed = FALSE))  %>%
  addScaleBar(position = "bottomleft", options = scaleBarOptions(maxWidth = 100, 
              imperial = TRUE, updateWhenIdle = TRUE))
                      
HotSpotMap #this lets you view interactively

#or save to a HTML file to embed in webpage
saveWidget(HotSpotMap,"HotSpotMap.html", selfcontained = TRUE)
##########################################################

First I create the empty leaflet() object. Because I am superimposing multiple spatial layers, I don’t worry about setting the default spatial layer. Second, I add in two basemap providers, OpenStreetMap and the grey scale CartoDB positron. Positron is better IMO for visualizing global data patterns, but the open street map is better for when you zoom in and want to see exactly what is around a hot spot area. Note when adding in a layer, I give it a group name. This allows you to later toggle which provider you want via a basegroup in the layers control.

Next I add in the two spatial layers, the Dallas boundary lines and then the hot spots. For the DBSCAN hot spot layer, I include the diss_lab pop-up labels. This allows you to click on a polygon and get the info I stuffed into that label vector earlier; the HTML is just to make it print nicely.

Finally I add in a layers control, so you can toggle layers on/off. Base groups mean that only one of the options can be selected (it doesn't make sense to have multiple basemaps selected), while overlays you can toggle on/off as needed. Here the overlay control doesn't matter much due to the nature of the map, but if you have many layers (e.g. a hot spot map and a choropleth map of demographics) being able to toggle the layers on/off helps a bit more.

Then as a final touch I add in a scale bar (that automatically updates depending on the zoom level). These aren't my favorite with slippy maps, as I'm not even 100% sure offhand what location the scale bar refers to (the center of the map? or literally where the scale bar is located?). But when zoomed into a smaller area like a city I guess it is not misleading.

Here is a screenshot of this created map zoomed out to the whole city using the Positron grey scale base map. So it is tough to visualize the distribution of hot spots from this. If I wanted to do that in a static map I would likely just plot the hot spot centroids, and then make the circles bigger for areas that capture more crime.

But since we can zoom in, here is another screenshot zoomed in using the OpenStreetMap basemap, and also illustrating what my pop-up labels look like.

I’m too lazy to post this exact map, but it is very similar to one I posted for my actual hot spots paper if you want to check it out directly. I host it on GitHub for free.

Here I did not show how to make a choropleth map, but Jacob Kaplan has a nice example of that in his R book. And in the future I will have to update this to show how to do the same thing in python using the Folium library. I used Folium in this blog post if you want to dig into an example for now.

Some more examples

For some other examples of what is possible with Leaflet maps in R, here are some maps I made for my undergrad Communities and Crime class. I had students submit prediction assignments (e.g. predict the neighborhood with the most crime in Dallas, predict the street segment in Oak Cliff with the most violent crime, predict the bar with the most crimes nearby, etc.). I would then show the class the results, as well as where other students predicted. So here are some screenshots of those maps.

Choropleth

Graduated Points

Street Segment Viz