Checking a Poisson distribution fit: An example with officer involved shooting deaths WaPo data (R functions)

So besides code on my GitHub page, I have a list of various statistic functions I’ve scripted on the blog over the years on my code snippets page. One of those functions I will illustrate today is some R code to check the fit of the Poisson distribution. Many of my crime analysis examples rely on crime data being approximately Poisson distributed. Additionally it is relevant in regression model building, e.g. should I use a Poisson GLM or do I need to use some type of zero-inflated model?

Here is a brief example to show how my R code works. You can source it directly from my dropbox page. Then I generated 10k simulated rows of Poisson data with a mean of 0.2. So I see many people in CJ make the mistake that, OK my data has 85% zeroes, I need to use some sort of zero-inflated model. If you are working with very small spatial/temporal units of analysis and/or rare crimes, it may be the mean of the distribution is quite low, and so the Poisson distribution is actually quite close.

# My check Poisson function
source('https://dl.dropboxusercontent.com/s/yj7yc07s5fgkirz/CheckPoisson.R?dl=0')

# Example with simulated data
set.seed(10)
lambda <- 0.2
x <- rpois(10000,lambda)
CheckPoisson(x,0,max(x),mean(x))

Here you can see in the generated table from my CheckPoisson function, that with a mean of 0.2, we expect around 81.2% zeroes in the data. And since we simulated the data according to the Poisson distribution, that is what we get. The table shows that out of the 10k simulation rows, 8121 were 0’s, 1692 rows were 1’s etc.

In real life data never exactly conform to hypothetical distributions. But we often want to see how close they are to the hypothetical before building predictive models. A real life example as close to Poisson distributed data as I have ever seen is the Washington Post Fatal Use of Force data. Every year WaPo has been collating the data, the total number of Fatal uses of Police Force in the US have been very close to 1000 events per year. And even in all the turmoil this past year, that is still the case.

# Washington Post Officer Involved Shooting Deaths Data
oid <- read.csv('https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv',
                stringsAsFactors = F)

# Year Stats
oid$year <- as.integer(substr(oid$date,1,4))
year_stats <- table(oid$year)[1:6]
year_stats 
mean(year_stats)
var(year_stats)

One way to check the Poison distribution is that the mean and the variance should be close, and here at the yearly level the data have some evidence of underdispersion according to the Poisson distribution (most crime data is overdispersed – the variance is much greater than the mean). If the actual mean is around 990, you would expect typical variations of say around plus/minus 60 per year (~ 2*sqrt(990)). But that only gives us a few observations to check (6 years). We can dis-aggregate the data to smaller intervals and check the Poisson assumption. Here I aggregate to days (note that this includes zero days in the table levels calculation). Then we again check the fit of the Poisson distribution.

#Now aggregating to count per day
oid$date_val <- as.Date(oid$date)
date_range <- paste0(seq(as.Date('2015-01-01'),max(oid$date_val),by='days'))
day_counts <- as.data.frame(table(factor(oid$date,levels=date_range)))
head(day_counts)
pfit <- CheckPoisson(day_counts$Freq, 0, 10, mean(day_counts$Freq))
pfit

According to the mean and the variance, it appears the distribution is a very close fit to the Poisson. We can see in this data we expected to have around 147 days with 0 fatal encounters, and in reality there were 160. I like seeing the overall counts, but another way is via the proportions in the final three columns of the table. You can see for all of the integers, we are less than 2 percentage points off for any particular integer count. E.g. we expect the distribution to have 3 fatal uses of force on about 22% of the days, but in the observed distribution days with 3 events only happened around 21% of the days (or 20.6378132 without rounding). So overall these fatal use of force data of course are not exactly Poisson distributed, but they are quite close.

So the Poisson distribution is motivated via a process in which the inter-arrival dates of events being counted are independent. Or in more simple terms one event does not cause a future event to come faster or slower. So offhand if you had a hypothesis that publicizing officer fatalities made future officers more hesitant to use deadly force, this is not supported in this data. Given that this is officer involved fatal encounters in the entire US, it is consistent with the data generating process that a fatal encounter in one jurisdiction has little to do with fatal encounters in other jurisdictions.

(Crime data we are often interested in the opposite self-exciting hypothesis, that one event causes another to happen in the near future. Self-excitation would cause an increase in the variance, so the opposite process would result in a reduced variance of the counts. E.g. if you have something that occurs at a regular monthly interval, the counts of that event will be underdispersed according to a Poisson process.)

So the above examples just checked a univariate data source for whether the Poisson distribution was a decent fit. Oftentimes academics are interested in whether the conditional distribution is a good fit post some regression model. So even if the marginal distribution is not Poisson, it may be you can still use a Poisson GLM, generate good predictions, and the conditional model is a good fit for the Poisson distribution. (That being said, you model has to do more work the further away it is from the hypothetical distribution, so if the marginal is very clearly off from Poisson a Poisson GLM probably won’t fit very well.)

My CheckPoisson function allows you to check the fit of a Poisson GLM by piping in varying predicted values over the sample instead of just one. Here is an example where I use a Poisson GLM to generate estimates conditional on the day of the week (just for illustration, I don’t have any obvious reason fatal encounters would occur more or less often during particular days of the week).

#Do example for the day of the week
day_counts$wd <- weekdays(as.Date(day_counts$Var1))
mod <- glm(Freq ~ as.factor(wd) - 1, family="poisson", data=day_counts)
#summary(mod), Tue/Wed/Thu a bit higher
lin_pred <- exp(predict(mod))
pfit_wd <- CheckPoisson(day_counts$Freq, 0, 10, lin_pred)
pfit_wd

You can see that the fit is almost exactly the same as before with the univariate data, so the differences in days of the week does not explain most of the divergence from the hypothetical Poisson distribution, but again this data is already quite close to a Poisson distribution.

So it is common for people to do tests for goodness-of-fit using these tables. I don’t really recommend it – just look at the table and see if it is close. Departures from hypothetical can inform modeling decisions, e.g. if you do have more zeroes than expected than you may need a negative binomial model or a zero-inflated model. If the departures are not dramatic, variance estimates from the Poisson assumption are not likely to be dramatically off-the-mark.

But if you must, here is an example of generating a Chi-Square goodness-of-fit test with the example Poisson fit table.

# If you really want to do a test of fit
chi_stat <- sum((pfit$Freq - pfit$PoisF)^2/pfit$PoisF)
df <- length(pfit$Freq) - 2
dchisq(chi_stat, df)

So you can see in this example the p-value is just under 0.06.

I really don’t recommend this though for two reasons. One is that with null hypothesis significance testing you are really put in a position that large data samples always reject the null, even if the departures are trivial in terms of the assumptions you are making for whatever subsequent model. The flipside of this is that with small samples the test is underpowered, so there are never many good scenarios where it is useful in practice. Two, you can generate superfluous categories (or collapse particular categories) in the Chi-Square test to increase the degrees of freedom and change the p-value.

One of the things though that this is useful for is checking the opposite, people fudging data. If you have data too close to the hypothetical distribution (so very high p-values here), it can be evidence that someone manipulated the data (because real data is never that close to hypothetical distributions). A famous example of this type of test is whether Mendel manipulated his data.

I intentionally chose the WaPo data as it is one of the few that out of the box really appears to be close to Poisson distributed in the wild. One of my next tasks though is to do some similar code for negative binomial fits. Like Paul Allison, for crime count data I rarely see much need for zero-inflated models. But while I was working on that I noticed that the parameters in NB fits with even samples of 1,000 to 10,000 observations were not very good. So I will need to dig into that more as well.

The WDD test with different pre/post time periods

Eric Piza asked the other day if my and Jerry’s WDD test can be used when the pre/post time periods are different. The answer is yes out of the box, the identification strategy does not rely on equality of time periods. So for example, say we had two years pre and one year post data, and the crime counts in treated/control looked like this:

         Pre  Post 
Treated   80    20
Control  100    50

So then our difference-in-difference Poisson estimate of the treatment effect would be:

(20 - 80) - (50 - 100) =  -10

What the parallel trends assumption means here is that since you saw a decrease in 50 crimes in the control area, you would expect a decrease of 50 crimes in the treated area as well. The variance of this estimate is then 20 + 80 + 50 + 100 = 250, and so the standard error is sqrt(250) ~ 15.8. So this is not a statistically significant effect.

It is hard to interpret this effect size though, since it is not a standard unit of time comparison. Also the variance of the estimate will be larger if you have a longer pre time period, which is the opposite of what you want. We can actually amend the statistic though to be a per-unit-time comparison, which will reduce the variance of the estimate. It ends up being similar to my prior post on adding Harm Weights to the WDD, you can’t just pipe in the per unit time estimates in the spreadsheet I shared, but I will show here how to incorporate them into the estimator (and share some python code to show the estimator behaves as expected in simulations).

So again with a pre-time period of 2 years, and post of 1 year, we could do the prior table as per year estimates.

         Pre  Post 
Treated   40    20
Control   50    50

And here our estimate of the crime reduction effect is different:

(20 - 40) - (50 - 50) =  -20

So with a Poisson variable with a mean of 100, the variance of that variable is also 100. So here we are dividing that 100 by a constant 2 – this changes the variance to 100/(2^2). (Var(X*a) = a^2*Var(X) where X is a random variable and a is a constant.) The post variables are simply divided by one, so does not change their variance. So to carry this forward to our standard error estimate, we would calculate (edit, had some errors in this part, should be using the original counts, not the average counts, but my later python code was correct, so the simulation part is OK):

20/1 + 80/4 + 50/1 + 100/4 = 115

So you can see that our variance estimate here is much smaller, and that the standard error is sqrt(115) ~ 10.7. So here the reduction is not quite a statistically significant reduction in crimes at the 0.05 level. A 95% confidence interval would be -20 +/- 2*10.7 ~ [+1, -41]. Here the WDD estimate is easier to interpret as well, and that confidence interval corresponds to a per year estimate of somewhere between +1 and -41 crimes.

Below I share some python code to conduct simulations similar to the original WDD paper. This code will then establish the estimator has the null distribution as expected (when there are no changes it really is a standard normal distribution) and that the confidence intervals have coverage like you would expect.

Python Simulation Code

For set up, I import the libraries I need (stat distributions, numpy and pandas). I am not going to go into detail into the functions, but it allows you to generate simulated distributions in various ways to conduct analysis of the properties of my time weighted estimator I have specified above.

'''
WDD Simulation with differing time periods
Andy Wheeler
'''

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform

#This works for the scipy functions
np.random.seed(seed=10)

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    return (est, var_est, true_val, z_score)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
    else:
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
    else:
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
    else:
        base['cont1'] = base['cont0']
    base.drop(columns='index',inplace=True)
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['est','var_est','true_val','z_score']
    return pd.concat([base,sim_vals], axis=1)

So for a first example, this code generates treatment/control areas with a Poisson mean of 5 in both the pre/post time periods. But, the pre time period is 4 units of time, and the post time period is only 1 unit. So this means there is no change, and the Z score estimator should on average have a 0 estimate and a standard deviation of 1. I do 10,000 simulations to keep it going a bit faster, but you can up that if you want.

# No change, with baseline of 5 crimes per unit time
sim_dat = make_data(10000, 5, 5, 5, 5, 4, 1)
sim_dat['z_score'].describe()

So here we can see these 10k simulated Poisson data have a mean z-score of 0 and a standard deviation of 1, right like we expected.

So I haven’t extensively tested, but if you have average crime counts well under 5, I would be a bit hesitant to use this estimator. (So you either need larger area aggregations or larger time aggregations.) Although you could do simulations on your own to see how it holds up.

The way I wrote the functions you can also pass in random variables as well, so here is an example with again no change, but the baseline varies uniformily from 5 to 100. And here also the pre time periods are 6, and the post time period is again just 1.

# Can pass in random functions instead of constant values
sim_n = 10000
tf = uniform.rvs(loc=5, scale=100, size=sim_n)

sim_dat2 = make_data(sim_n, tf, None, None, None, 6, 1)
sim_dat2.head()
sim_dat2['z_score'].describe()

So you can see the base simulated dataset pre/post always has the same means, but instead of being a set of constant 5’s, it changes for each row (simulation) in the dataset. And again the null distribution is right on the money with a mean of 0 and standard deviation of 1.

So those are examples of the null distribution of no changes in the time weighted estimator. This establishes that the false positive alpha rates are as you would expect. E.g. if you use the usual p-value < 0.05, if the differences are really 0 you only have a false positive reject the null 5 times out of 100.

But we also want to establish that when there is a difference, the estimator is not biased and that the variance estimates are correct. For the later part looking at the coverage rates of the confidence intervals is one way to do that. So here I show that with my hypothetical example in the intro part of this blog, the 95% and 90% confidence interval coverage rates are exactly as they should be. And the z-score estimate is right about where it should be as well.

# Lets look at the coverage rate for a decline from 40 to 20
def cover(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    cover = ( data['true_val'] > low) & ( data['true_val'] < high )
    return cover

sim_dat3 = make_data(sim_n, 40, 20, 50, 50, 2, 1)
sim_dat3.head()

# This should be centered on 2
sim_dat3['z_score'].describe()

# Should be ~ 0.9
co_90 = cover(sim_dat3, ci=0.9)
co_90.mean()

# Should be ~ 0.95
co_95 = cover(sim_dat3, ci=0.95)
co_95.mean()

So you can see the coverage is right on the money. The estimator is slightly biased downward in this simulation (should get a z-score on average around -2, but here the mean is -1.85). But it is good enough IMO to not worry about much in this situation.

Again, the original estimator without weighted for time is fine, if we do the same motions without doing weighting for different time periods, the coverage is still all fine and dandy.

# Note you can do the same coverage estimate without time weighted
sim_dat4 = make_data(sim_n, 80, 20, 100, 50, 1, 1)
sim_dat4.head()

# This should be around -0.6
sim_dat4['z_score'].describe()

co_90w = cover(sim_dat4, ci=0.9)
co_90w.mean()

co_95w = cover(sim_dat4, ci=0.95)
co_95w.mean()

So you can see again coverage is right on the money, and the z-score estimator actually has less bias than the time weighted one, it is right on the money as expected.

So why would you prefer the time weighted estimator if it shows more bias? It is because it has a lower variance, this code shows the length of the confidence intervals in the simulations.

# Does it make a difference?
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return high - low

len4 = len_ci(sim_dat4)
len4.describe()

len3 = len_ci(sim_dat3)
len3.describe()

So you can see here that the non-time weighted estimator tends to have a confidence interval with a length of 62, whereas the time weighted estimator has a confidence interval on average of 42.

So above establishes that the time weighted estimator behaves as you would expect. You can also use this code to conduct some potential power analyses. So for the time weighted estimator we show, even though the reduction is around 50% in the treated area (going from 40 to 20), the power is not great, around 60%.

# Example power analyses, ONE TAILED
def reject_rate(data, alpha=0.05):
    p_vals = norm.cdf(data['z_score'])
    return p_vals < alpha
    
r3 = reject_rate(sim_dat3)
r3.mean()

So this means if you did this experiment in real life and it was that effective, you would still fail to reject the null of no differences 2/5 times.

But what if we say we will get more historical data? So 4 years back instead of just 2? How does that impact our power estimates?

# How about with more historical data
sim_dat5 = make_data(sim_n, 40, 20, 50, 50, 4, 1)
r5 = reject_rate(sim_dat5)
r5.mean()

The power goes up by alittle, to 0.67. The same is true if we up the post period to 4 time periods instead of 1:

# How about with more post data
sim_dat6 = make_data(sim_n, 40, 20, 50, 50, 4, 4)
r6 = reject_rate(sim_dat6)
r6.mean()

So now in this example you have an over 90% power to detect a crime reduction, going from 40 to 20 per time period (where the control has an average of 50 crimes per time period), if you have 4 pre time periods and 4 post time periods.

Future Stuff

So a few caveats with this. For one, you may think that since dividing per time period reduces the variance, why not divide by smaller time slivers. So instead of one year, why not divide by 365 days?

I have not studied extensively this property of the estimator. So I cannot say how it behaves with more/less time aggregation into smaller Poisson estimates. You will need to take that on yourself if you want to examine very fine time units and very small Poisson counts per unit time. Again I think a baseline rule of thumb that they should not be lower the 5 counts per unit time is the best advice I can give without doing simulations for your exact circumstances.

A second part is that with longer time periods comes the risk that the control areas are not as good. This is a problem intrinsic to synthetic control analysis as well (that I don’t believe anyone has a particular answer to). And I don’t have an answer either.

For the pre-time period, you can check the parallel trends assumption by simply plotting the two time series, they should be close to in step with one another. So that is not a big deal. But with the post time period, I think if you monitor long enough they will eventually depart from one another.

So I think it is best to set up a time period at the start you have committed to doing the experiment. And you can use the power analysis simulations like I showed to help you figure out that period. But it may be possible to extend this WDD estimate to continuously monitor an intervention (see here for example).

New book: Micro geographic analysis of Chicago homicides, 1965-2017

In joint work with Chris Herrmann and Dick Block, we now have a book out – Understanding Micro-Place Homicide Patterns in Chicago (1965 – 2017). It is a Springer Brief book, so I recommend anyone who has a journal article that is too long that this is a potential venue for the work. (Really this is like the length of three journal articles.)

A few things occurred to prompt me to look into this. First, Chicago increased a big spike of homicides in 2016 and 2017. Here is a graph breaking them down between domestic related homicides and all other homicides. You can see all of the volatility is related to non-domestic homicides.

So this (at least to me) begs the question of whether those spiked homicides show similar characteristics compared to historical homicides. Here we focus on long term spatial patterns and micro place grid cells in the city, 150 by 150 meter cells. Dick & Carolyn Block had collated data, including the address where the body was discovered, using detective case notes starting in 1965 (ending in 2000). The data from 2000 through 2017 is the public incident report data released by Chicago PD online. Although Dick and Carolyn’s public dataset is likely well known at this point, Dick has more detailed data than is released publicly on ICPSR and a few more years (through 2000). Here is a map showing those homicide patterns aggregated over the entire long time period.

So we really have two different broad exploratory analyses we employed in the work. One was to examine homicide clustering, and the other was to examine temporal patterns in homicides. For clustering, we go through a ton of different metrics common in the field, and I introduce even one more, Theil’s decomposition for within/between neighborhood clustering. This shows Theil’s clustering metric within neighborhoods in Chicago (based on the entire time period).

So areas around the loop showed more clustering in homicides, but here it appears it is somewhat confounded with neighborhood size – smaller neighborhoods appear to have more clustering. This is sort of par for the course for these clustering metrics (we go through several different Gini variants as well), in that they are pretty fickle. You do a different temporal slice of data or treat empty grid cells differently the clustering metrics can change quite a bit.

So I personally prefer to focus on long term temporal patterns. Here I estimated group based trajectory models using zero-inflated Poisson models. And here are the predicted outputs for those grid cells over the city. You can see unlike prior work David Weisburd (Seattle), myself (Albany), or Martin Andresen (Vancouver) has done, they are much more wavy patterns. This may be due to looking over a much longer horizon than any of those prior works though have.

The big wave, Group 9, ends up being clearly tied to former large public housing projects, which their demolitions corresponds to the downturn.

I have an interactive map to explore the other trajectory groups here. Unfortunately the others don’t show as clear of patterns as Group 9, so it is difficult to answer any hard questions about the uptick in 2016/2017, you could find evidence of homicides dispersing vs homicides being in the same places but at a higher intensity if you slice the data different ways.

Unfortunately the analysis is never ending. Chicago homicides have again spiked this year, so maybe we will need to redo some analysis to see if the more current trends still hold. I think I will migrate away from the clustering metrics though (Gini and Theil), they appear to be too volatile to say much of anything over short term patterns. I think there may be other point pattern analysis that are more diagnostic to really understand emerging/changing spatial patterns.

The coffee next to the cover image is Chris Herrmann’s beans, so go get yourself some as well at Fellowship Coffee!

Changepoints in CCTV Effects

So I am a big fan of using splines in regression equations to model non-linear effects. But a limitation of these is that you need to upfront say how many knots you want, as well as where the knots are. So I have explored a bit on fitting models that can identify the changepoints themselves. It was a tricky road, I tried building some in deep learning using pytorch, then tried variational auto-encoders in pyro, then pystan (marginalizing the changepoint out), and then pymc3 (using different samplers). All of my attempts failed! But when I used the R mcp library (Lindeløv, 2020), it was able to find my changepoint using simulated data. (It uses JAGS under the hood, no idea why JAGS behaved better than my other attempts.)

Usecase: Dropoff effect of CCTV on clearance rates

So in spatial criminology, a popular hypothesis is estimating distance decay effects. Ratcliffe (2012) was the first example of using a changepoint regression model to do this, showing a changepoint in the effect of bars on the spatial density of crime nearby. This has been replicated in Xu & Griffiths (2017), and in my work using machine learning and partial dependence plots I show similar changepoint patterns as well (Wheeler & Steenbeek, 2020).

One example use case though I want to mention is not in terms of estimating the spatial density of crime, but with the characteristics of the crime events themselves. Sometimes people I think mistakenly think since I have spatial data, I need to aggregate it to some areal unit, and then do analysis of that areal unit data. That approach is not per-se wrong, but is sometimes a step removed from what you want, and can result in some tricky inferences.

Take for example a recent paper looking at clearances and using RTM by Kennedy et al. (2020). What they do is spatially aggregate homicides cleared and homicides not cleared, and run RTM on each. You might be tempted to interpret if a factor is selected for both models that it does not impact clearances, but it also depends on the size of the effect. So for example, in Brooklyn for drug markets they report a rate ratio of 3.1 and 2.4 (both at the same spatial distance). To translate this into a clearance rate, you need to add the two density estimates for all cases, and then take the cleared cases as the numerator.

# Example R code
clear <- exp(-0.1 + log(3.1))
nonclear <- exp(-0.1 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.5636364

Here I am treating -0.1 as the intercept. So here this is lower, but close to the overall clearance in Brooklyn, 58%. This 56% will be the estimate iff the intercept for each equation is the same, if they are not though it could change the clearance rate estimate either way. Since the Kennedy paper did not report this, we cannot know. So for instance, if we change the intercept estimates so clearances are higher and non-clearances are lower, we get an estimate that drug markets increase clearances slightly, not decrease them:

clear <- exp(-0.05 + log(3.1))
nonclear <- exp(-0.2 + log(2.4))
prop <- clear/(clear + nonclear)
prop #0.6001124

In this example it probably won’t push them too far either way, but takes a bit of work going from the aggregate data analysis to the estimate we want, how those spatial risk factors impact the clearance rate. There is an easier way though – just incorporate your spatial features, such as the distance the nearest crime generator factor, and estimate a model on the micro level incident data. This is what Kennedy et al. (2020) do later in the paper when incorporating the RTM predictions – I just think they should have done the RTM machinery directly on this problem, instead of the two-step approach.

Examples of my work I have done this approach in the past (incorporating spatial data into the micro level incidents) is with fatalities from gun shot wounds (Circo & Wheeler, 2020). We actually investigated non-linear effects though of distance/drive-time, and did not find evidence of that. Going back to the crime clearance example though, another pre-print I examine the effects of CCTV cameras and find a diminishing effect of case clearances given the distance to the camera (Jung & Wheeler, 2019).

So here we use a pre-post design to show there are some selection effects, and we do further analysis to show this camera bump in clearances is only limited to thefts. But we set the splines at 500, 1000, and 1500 feet pre-emptively for the analysis. A reviewer critique of this is that those three locations are arbitrary (which is correct), so here I will see if I do a changepoint model that allows us to find the knot locations if it will show the same ones.

The idea behind this analysis is that CCTV are often used in investigations. Yeondae is an officer in Korea, same as here in the states first things detectives do is to go and grab CCTV footage. Analysis of cameras are often aggregated to their viewsheds, but I think estimating distance decay effects make as much sense. So events closer to the cameras presumably will provide more clear evidence than events at the border of the viewshed. A second point is that even if the event takes place off-camera, there may be evidence cross by the camera viewshed. Detectives will often try to follow individuals across multiple cameras. So both of those factors suggest a distance decay effect both within a cameras viewshed and a decaying effect even outside of the viewshed. (In addition to this, geo coordinates of crime locations are not perfectly accurate measures either, so that could cause effects outside of the viewshed as well.)

Here I am just limiting the data to the post camera data within 3000 feet for thefts, which still is over 26,000 observations. I’ve posted the data/code to follow along here.

Analysis using mcp in R

Again given my hardship in coding this up myself in python, I created a simulated data example and checked the results using mcp (which you can check in my code). Since mcp recovered my simulated changepoint, (and my python attempts did not), going to go ahead with the mcp library! First, we will import my clearance data and get rid of a few missing cases.

#################
library(mcp)
library(ggplot2)
set.seed(10)
#can see I planned on doing this in pytorch at first!
setwd('D:\\Dropbox\\Dropbox\\Documents\\BLOG\\changepoint_pytorch\\Analysis')
theft_clear <- read.csv('PostTheft_CCTV.csv')
theft_clear <- theft_clear[complete.cases(theft_clear),]
#################

So first for a reference, if I assume there is a linear changepoint at 1000 feet, here are what my results look like. Note here that this is not aggregated data to spatial locations, each row in this dataset is a theft offense, whether it was cleared, and the distance to the nearest CCTV camera.

#################
#What are the coefficients if assume a changepoint of 1000 feet
theft_clear$x_dif <- (theft_clear$CAM.DIST - 1000)*(theft_clear$CAM.DIST > 1000)
theft_mod <- glm(formula = 'STATUSi ~ CAM.DIST + x_dif', family = "binomial", data = theft_clear)
summary(theft_mod) #This gives an estimate of 
#################

And here you can visualize the results alittle easier than trying to back out probabilities for the regression equation:

#################
pred_mod <- predict(theft_mod,type='response')
plot(theft_clear$CAM.DIST,pred_mod, main="Changepoint at 1000 ft",
  xlab="Distance from Camera (ft)", ylab="Probability Clearance")
#################

So this shows clearances nearby cameras in Dallas are around 15%, and they trail off to around 9% at 1000 feet. After that they continue to tail off, but are nearly flat. But again that is assuming a change point at 1000 feet. But the mcp package lets us actual estimate the changepoint itself using Bayesian regression. Here is the set up that is equivalent to my formulation earlier, in that the changepoint cannot be discontinuous.

#################
theft_clear$x <- theft_clear$CAM.DIST 
model = list(
  STATUSi | trials(const) ~ 1 + x,
  ~ 0 + x  #joined changing rate
)

fit = mcp(model, data = theft_clear, family = binomial(), iter = 3000, adapt = 500)
#################

And then if you are following along you can go ahead and take a nap (maybe took 2 hours on my machine?), and when we get back summary(fit) gives us:

So we have very similar coefficients to the manual changepoint model earlier, but the changepoint is around 1600 feet, not 1000. (Although note these are Bayesian credible intervals, not frequentist confidence intervals.) And now to make a nice plot of the fitted model.

#Fitted values for new data
newdat <- data.frame(x = (0:300)*10)
newdat$const <- 1
newdat$CAM.DIST <- newdat$x
res <- fitted(fit, newdata = newdat)

p_pred <- ggplot(data=res) + 
  geom_line(size=1.2, color='black', aes(x = x, y = fitted)) + 
  geom_ribbon(alpha=0.5, fill='black', aes(x = x, ymin=Q2.5 , ymax=Q97.5)) + 
  scale_x_continuous(name="Feet from Camera",breaks=seq(0,3000,500),minor_breaks=NULL) + 
  scale_y_continuous(name="P(Clearance)",breaks=seq(0.06,0.16,0.02),minor_breaks=NULL) +
  theme_bw() + theme(panel.grid.major = element_line(colour = 'grey', linetype = 'dashed', size=0.1)) + 
  theme(text = element_text(size=20))

p_pred

So you can see that here it is a nearly linear drop off until 1600 feet, and then starts to climb back up. The climb up I think is likely due to selection effects, but we can’t 100% rule out displacement effects. Displacement effects could occur with cameras if detectives prioritize events around cameras and de-prioritize other events not nearby cameras. Skeptical that applies to thefts in Dallas though, as they very rarely will be assigned a detective at all.

Wrap Up

So this ended up taking me for a few different turns. One of the things I wanted to be able to test multiple changepoints, maybe if I can ever get pymc3 to give me a reasonable fit, this example is a good illustration. That should also maybe say if you should have no changepoint as well. I think maybe it is much harder to fit those models with binomial data though than with continuous (maybe good for another blog post as well, did simulations at first with 1000 observations and that was a bad idea).

One thing that would be good for evaluating whether change points are reasonable are out of sample predictive comparisons. So say estimate a no changepoint model, a linear changepoint model, and then a model with fixed spline locations. Then see which of those better fits the out of sample data. But since this is a blog post, will leave it as is. But this is a simple illustration to extend prior spatial analysis of changepoints in distance decay effects to one example – crime clearances and CCTV cameras – that I think makes alot of sense.

References

Amending the WDD test to incorporate Harm Weights

So I received a question the other day about amending my and Jerry Ratcliffe’s Weighted Displacement Difference (WDD) test to incorporate crime harms (Wheeler & Ratcliffe, 2018). This is a great idea, but unfortunately it takes a small bit of extra work compared to the original (from the analysts perspective). I cannot make it as simple as just piping in the pre-post crime weights into that previous spreadsheet I shared. The reason is a reduction of 10 crimes with a weight of 10 has a different variance than a reduction of 25 crimes with a weight of 4, even though both have the same total crime harm reduction (10*10 = 4*25).

I will walk through some simple spreadsheet calculations though (in Excel) so you can roll this on your own. HERE IS THE SPREADSHEET TO DOWNLOAD TO FOLLOW ALONG. What you need to do is to calculate the traditional WDD for each individual crime type in your sample, and then combine all those weighted WDD’s estimates in the end to figure out your crime harm weighted estimate in the end (with confidence intervals around that estimated effect).

Here is an example I take from data from Worrall & Wheeler (2019) (I use this in my undergrad crime analysis class, Lab 6). This is just data from one of the PFA areas and a control TAAG area I chose by hand.

So first, go through the motions for your individual crimes in calculating the point estimate for the WDD, and then also see the standard error of that estimate. Here is an example of piping in the data for thefts of motor vehicles. The WDD is simple, just pre-post crime counts. Since I don’t have a displacement area in this example, I set those cells to 0. Note that the way I calculate this, a negative number is a good thing, it means crime went down relative to the control areas.

Then you want to place those point estimates and standard errors in a new table, and in those same rows assign your arbitrary weight. Here I use weights taken from Ratcliffe (2015), but these weights can be anything. See examples in Wheeler & Reuter (2020) for using police cost of crime estimates, and Wolfgang et al. (2006) for using surveys on public perceptions of severity. Many of the different indices though use sentencing data to derive the weights. (You could even use negative weights and the calculations here all work, say you had some positive data on community interactions.)

Now we have all we need to calculate the harm-weighted WDD test. The big thing here to note is that the variance of Var(x*harm_weight) = Var(x)*harm_weight^2. So that allows me to use all the same machinery as the original WDD paper to combine all the weights in the end. So now you just need to add a few additional columns to your spreadsheet. The point estimate for the harm reduction is simply the weight multiplied by the point estimate for the crime reduction. The variance though you need to square the standard error, and square the weight, and then multiply those squared results together.

Once that is done, you can pool the harm weighted stats together, see the calculations below the table. Then you can use all the same normal distribution stuff from your intro stats class to calculate z-scores, p-values, and confidence intervals. Here are what the results look like for this particular example.

I think this is actually a really good idea to pool results together. Many place based police interventions are general, in that you might expect them to reduce multiple crime types. Various harm scores are a good way to pool the results, instead of doing many individual tests. A few caveats though, I have not done simulations like I did in the WDD peer reviewed paper, I believe these normal approximations will do OK under the same circumstances though that we suggest it is reasonable to do the WDD test. You should not do the WDD test if you only have a handful of crimes in each area (under 5 in any cell in that original table is a good signal it is too few of crimes).

These crime count recommendations I think are likely to work as well for weighted crime harm. So even if you give murder a really high weight, if you have fewer than 5 murders in any of those original cells, I do not think you should incorporate it into the analysis. The large harm weight and the small numbers do not cancel each other out! (They just make the normal approximation I use likely not very good.) In that case I would say only incorporate individual crimes that you are OK with doing the WDD analysis to begin with on their own, and then pool those results together.

Sometime I need to grab the results of the hot spots meta-analysis by Braga and company and redo the analysis using this WDD estimate. I think the recent paper by Braga and Weisburd (2020) is right, that modeling the IRR directly makes more sense (I use the IRR to do cost-benefit analysis estimates, not Cohen’s D). But even that is one step removed, so say you have two incident-rate-ratios (IRRs), 0.8 and 0.5, the latter is bigger right? Well, if the 0.8 study had a baseline of 100 crimes, that means the reduction is 100 - 0.8*100 = 20, but if the 0.5 study had a baseline of 30 crimes, that would mean a reduction of 30 - 0.5*30 = 15, so in terms of total crimes is a smaller effect. The WDD test intentionally focuses on crime counts, so is an estimate of the actual number of crimes reduced. Then you can weight those actual crime decreases how you want to. I think worrying about the IRR could even be one step too far removed.

References

A latent variable approach to RTM using hidden layers in deep learning

Sorry about the long title! Previously I have blogged about how to use Deep Learning to generate an RTM like model variable selection and positive constraints. Deep learning frameworks often do not rely on variable selection like that though, they more often leverage hidden layers. For social scientists familiar with structural equation modelling, these hidden layers are very much akin to formative latent variables. (More traditionally folks use reflective latent variables in factor analysis, so the latent variable causes the observed measures. This is the obverse, the observed measures cause/define the latent variable, and we find the loadings that best predict some outcome further down the stream.)

In a nutshell, instead of the typical RTM way of picking the best variable to use, e.g. Alcohol Density < 100 meters OR Alcohol Density < 500 meters, it allows both to contribute to a latent variable, call it AlcoholDens, but allows those weights to vary. Then I see how well the AlcoholDens latent variable predicts crime. I will show later in the results that the loadings are often spread out among different density/distance measures in this sample, suggesting the approach just pick one is perhaps misguided.

I’ve posted the data and code to follow along here. There are two py files, 00_RTMHidden.py runs the main analysis, but dl_rtm_funcs.py has various functions used to build the deep learning model in pytorch. I am just going to hit some of the highlights instead of walking through bit by bit.

Some helper functions

First, last blog post I simply relied on using Poisson loss. This time, I took some effort to figure out my own loss function for the negative binomial model. Here I am using the NB2 form, and you can see I took the likelihood function from the Stata docs (they are a really great reference for various regression model info). To incorporate this into your deep learning model, you need to add a single parameter in your model, here I call it disp.

#Log likelihood taken from Stata docs, pg 11 
#https://www.stata.com/manuals13/rnbreg.pdf
def nb2_loss(actual, log_pred, disp):
    m = 1/disp.exp()
    mu = log_pred.exp()
    p = 1/(1 + disp.exp()*mu)
    nll = torch.lgamma(m + actual) - torch.lgamma(actual+1) - torch.lgamma(m)
    nll += m*torch.log(p) + actual*torch.log(1-p)
    return -nll.mean()

A second set of helper functions I will illustrate at the end of the post is evaluating the fit for Poisson/Negative Binomial models. I’ve discussed these metrics before, they are just a python rewrite of older SPSS code I made.

def pred_nb(mu, disp, int_y):
    inv_disp = 1/disp
    p1 = gamma(int_y + inv_disp) / ( factorial(int_y)*gamma(inv_disp) )
    p2 = ( inv_disp / (inv_disp + mu) ) ** inv_disp
    p3 = ( mu / (inv_disp + mu) ) ** int_y
    pfin = p1*p2*p3
    return pfin
    
def nb_fit(mu, obs, disp, max_y):
    res = []
    cum_fit = mu - mu
    for i in range(max_y+1):
        pred_fit = pred_nb(mu=mu, disp=disp, int_y=i)
        pred_obs = (obs == i)
        res.append( (str(i), pred_obs.mean(), pred_fit.mean(), pred_obs.sum(), pred_fit.sum()) )
        cum_fit += pred_fit
    fin_fit = 1 - cum_fit
    fin_obs = (obs > max_y)
    res.append( (str(max_y+1)+'+', fin_obs.mean(), fin_fit.mean(),
                  fin_obs.sum(), fin_fit.sum()) )
    dat = pd.DataFrame(res, columns=['Int','Obs','Pred','ObsN','PredN'])
    return dat

Main Analysis

Now onto the main analysis. Skipping the data loading (it is near copy-paste from my prior RTM Deep Learning post), here are the main guts to building and fitting the RTM model.

model = dl_rtm_funcs.RTM_hidden(gen_list=[alc_set,metro_set,c311_set], 
                                gen_names=['AlcOutlets','MetroEntr','Dens311'])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #1e-4

for t in range(5001):
    #Forward pass
    y_pred = model(comb_ten)
    #Loss 
    loss_insample = dl_rtm_funcs.nb2_loss(y_ten, y_pred, model.dispersion)
    optimizer.zero_grad()
    loss_insample.backward() #retain_graph=True
    optimizer.step()
    if t % 100 == 0:
        loss_out = dl_rtm_funcs.nb2_loss(out_ten, y_pred, model.dispersion)
        print(f'iter {t}: loss in = {loss_insample.item():.5f}, loss out = {loss_out.item():.5f}')

And in terms of iterations, on my machine this takes less than 20 seconds to do the 5000 iterations, and it has clearly peaked out by then (both in sample 2011 and out of sample 2012).

I’ve loading the RTM model object with a few helper functions, so if you then run print( model.coef_table() ), you get out the final regression coefficients, including the dispersion term. For my negative binomial models for my dissertation, the dispersion term tended to be around ~4 for many models, so this corresponds pretty closely with my prior work.

These have interpretations as latent variables representing the effect of nearby alcohol outlets (both distance and density), metro entrances (just distance), and 311 calls for service (just density). Similar to original RTM, I have restricted the crime generator effects to be positive.

I also have another helper function, model.loadings(), that gives you a nice table. Here this shows how the original variables contribute to the latent variable. So here are the loadings for the distance to the nearest metro.

You can see that the dummy variables for met_dis_300 (meters) and smaller all contribute to the latent variable. So instead of picking one variable in the end, it allows multiple variables to contribute to the latent risk score. It may make more sense in this set up to encode variables as not cumulative, e.g. < 50 meters, < 100 meters, but orthogonal, e.g. [0,50),[50,100), etc.), but just stuck with the prior data in the same format for now. I force the loadings to sum to 1 and be positive, so the latent variables still have a very apples-to-apples comparison in terms of effect sizes.

Here are the loadings for alcohol outlets, so we have both some distance and density effects in the end.

And here are the loadings for 311 density variables:

So you can see for the last one, only the furthest away had an effect at all. Which is contra to the broken windows theory! But also shows that this is more general than the original RTM approach. If it only should be one variable the model will learn that, but if it should be more it will incorporate a wider array of weights.

Next is to check out how well the model does overall. For calibration for Poisson/Negative Binomial models, I just detach my pytorch tensors, and feed them into my functions to do the evaluations.

#Calibration for Negative Binomial predictions
pred_pd = pd.Series( y_pred.exp().detach().numpy() )
disp_val = model.dispersion.exp().item()

nb_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                             disp=disp_val, max_y=10)
print( nb_fit )

So this shows that the model is pretty well calibrated in terms of overall predictions. Both samples predict 83% zeroes. I predict a few more 3/4 crime areas than observed, and my tails are somewhat thinner than they should be, but only by a tiny bit. (No doubt this would improve if I incorporated more covariates, kept it simple to debug on purpose.)

We can ignore the negative binomial dispersion term and see what our model would predict in the usual Poisson case (the mean functions are the same, it is just changing the variance). To do this, just pass in a dispersion term of 1.

pois_fit = dl_rtm_funcs.nb_fit(mu=pred_pd, obs=crime_data['Viol_2011'], 
                               disp=1, max_y=10)
print( pois_fit )

You can see that the Poisson model is a much worse fit. Underpredicting zero crime areas by 6%, and areas with over 10 crimes should pretty much never happen according to the Poisson model.

We should be assessing these metrics out of sample as well, and you can see that given crime is very historically stable, the out of sample 2012 violent crime counts are similarly well calibrated.

Finally, I have suggested in the past to use a weighted ROC curve as a metric for crime counts. Here is a simple example of doing that in python.

crime_data['Weights'] = crime_data['Viol_2012'].clip(1)
crime_data['Outcome'] = crime_data['Viol_2012'].clip(0,1)

fpr, tpr, thresh = roc_curve(crime_data['Outcome'], pred_pd, sample_weight=crime_data['Weights'])
weighted_auc = auc(fpr, tpr)
print( weighted_auc ) 

So you can see the AUC is nothing to brag about here, 0.61 (it is only 0.63 in the 2011 sample). But again I am sure I could get that up by quite a bit by incorporating more covariates into the model.

Open Source Criminology Related Network Datasets

So I am a big proponent of open source data analysis. There is a problem with using criminal justice data sources though – they often have private information that prevents us from sharing the data. For example, I have posted quite a few of my projects here (mostly spatial data analysis), but there are a few I cannot share. For example, I worked on a paper with chronic offender predictions, and I cannot share that data (Wheeler et al., 2019). The outcome, being a victim or perpetrator of gun violence, is so rare that by itself basically makes it impossible to publicly share the data without exposing the individuals under study.

One good resource all criminologists should be aware of is ICPSR, in particular NACJD. Many datasets on there though anymore are restricted, in that you need to get IRB permission and ICPSR permision to download the dataset to use. (Which typically takes like 2~3 months in my experience doing it a few times, which includes both your local Uni IRB and the ICPSR process.) For example here is one I went through the motions to get to (in the end) validate different survival prediction methods.

ICPSR is a great resource to be able to handle sharing potentially sensitive data. But this falls short in two areas. One is in teaching – you cannot go through the IRB ritual in a timely enough fashion to be able to use those datasets in a course environment. The other is in terms of methods, so for example if you wanted to say your model provides better predictions than some other model, they should be established on the same datasets. Current state of affairs in criminology in this regard is pretty bad to be curt – most everybody uses their own data they have access to. So much of the research on different risk assessment instruments for bail/probation/parole are pretty much impossible to say one is better than another.

One example type of data source that is almost entirely missing from NACJD (that I am aware of) is social network datasets relevant for criminology/criminal justice. So I have started a spreadsheet to collate different open source network datasets relevant for criminologists. So I have some from my work and a few other random examples I have come across on the internet.

SPREADSHEET OF NETWORK DATASETS

I have made that spreadsheet open, so anyone should be able to edit in more sources. (Feel free to include links to ICPSR as well, but if you do edit a note to say whether it is restricted access or not.) For here I would be interested in really large networks, for example would love to try to replicate Marie’s work on gang network transitions (Oullet et al., 2019a).

And also while I am here, Jacob Young has created a very nice introductory course to social network analysis. I have a brief lecture in my advanced research design class, but Jacob’s is much more thorough (and he is more of an expert in this area than I am for sure).

I will add to that spreadsheet over time as well. I have made a separate sheet for survival analysis datasets. I would be particularly keen for example criminal justice examples. So for network analysis we have examples of looking at use-of-force networks (Oullet et al., 2019b), and for survival analysis I would be interested in a time to solve example dataset. Unfortunately for the solved cases, NIBRS is a good resource but has a large confound in they don’t measure whether a case was ever assigned to a detective.

Feel free to add whatever in that spreadsheet, but what I was thinking was oriented towards different methods (again as a main motivation is for teaching). So for example if you knew of datasets for age-period-cohort modelling, or for estimating group-based-trajectory models, I think those would be good examples to start new sheets and collate different data sources.

References

  • Ouellet, M., Bouchard, M., & Charette, Y. (2019a). One gang dies, another gains? The network dynamics of criminal group persistence. Criminology, 57(1), 5-33.
  • Ouellet, M., Hashimi, S., Gravel, J., & Papachristos, A. V. (2019b). Network exposure and excessive use of force: Investigating the social transmission of police misconduct. Criminology & Public Policy, 18(3), 675-704.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The accuracy of the violent offender identification directive tool to predict future gun violence. Criminal Justice and Behavior, 46(5), 770-788.

Using the Google Vision and Streetview API to Explore Hotspots

So previously I have shown how to automate the process of downloading google street view imagery (for individual addresses & running down a street). One interesting application is to then code those streetview images. There are many applications in criminology of coding these images for disorder. So Rob Sampson initially had the idea of ecometrics, in which he used systematic social observations via taking a video going down various streets to code physical disorder, such as garbage on the street (Raudenbush & Sampson, 1999). Others than leveraged Google streetview imagery to do those same audits instead of collecting their own footage (Bader et al., 2017).

Those are all someone looks at the images and a human says, there is XYZ in this photo and ABC in this photo. I was interested in testing out the Google Vision API to automate identifying parts of the images. So instead of a human manually reviewing, you build a score automatically. See for example work on identifying the percieved safety of streets (Naik et al., 2014).

Here I was motivated by some recent work of a colleague, Nate Connealy, in which he used this imagery to identify the differences in hot spots vs not hot spots (Connealy, 2020). Also I am pretty sure I saw George Mohler present on this at some ASC before I had the idea (it was similar to this paper, Khorshidi et al., 2019, not 100% sure it was the same one though). For an overview of crim applications using streetview and google maps, which also span CPTED type analyses, check out Vandeviver (2014).

So with Google’s automated vision API, if I submit this photo of a parking garage (this is actually the image I get if I submit the address Bad Address, Dallas, TX to the streetview API, so take in mind errors like that in my subsequent analysis).

You get back these labels, where the first item is the description and the second is the ‘score’ for whether the item is in the image:

('Architecture', 0.817379355430603),
('Floor', 0.7577666640281677),
('Room', 0.7444316148757935),
('Building', 0.7440816164016724),
('Parking', 0.7051371335983276),
('Ceiling', 0.6624311208724976),
('Flooring', 0.6004095673561096),
('Wood', 0.5958532094955444),
('House', 0.5928719639778137),
('Metal', 0.5114516019821167)

So I don’t tell Google what to look for, it just gives me back a ton of different labels depending on what it detects in the image. So what I do here is based on my hotspot work (Wheeler & Reuter, 2020), I grab a sample of 300 addresses inside my Dallas based hot spot areas, and 300 addresses outside of hot spots. (These addresses are based on crime data themselves, so similar to Nate’s work I only sample locations that at least have 1 crime).

So this isn’t a way to do predictions, but I think it is potentially interesting application of exploratory data analysis for hot spots or high crime areas.

Python Code Snippet

I am just going to paste the python code-snippet in its entirety.

'''
Grabbing streetview images and detecting
labels using the google vision API
'''

from google.cloud import vision
import pandas as pd
import io
import os
import urllib
import time

os.chdir(r'D:\Dropbox\Dropbox\Documents\BLOG\GoogleLabels_hotspots\analysis')

add_dat = pd.read_csv('Sampled_Adds.csv')
add_dat['FullAdd'] = add_dat['IncidentAddress'] + ", DALLAS, TX"

# Code to download image based on address 
# https://andrewpwheeler.com/2015/12/28/using-python-to-grab-google-street-view-imagery/

myloc = r"./Images" #replace with your own location
key = "&key=????YourKeyHere????" 

def GetStreet(Add,SaveLoc,Name):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Name + ".jpg"
  loc_tosav = os.path.join(SaveLoc,fi)
  urllib.request.urlretrieve(MyUrl, loc_tosav)

# Code to get the google vision API labels
# for the image

client = vision.ImageAnnotatorClient.from_service_account_json('Geo Dallas-b5543ff0bb6d.json')

def LabelImage(ImageLoc):
    # Loads the image into memory
    with io.open(ImageLoc, 'rb') as image_file:
        content = image_file.read()
    image = vision.types.Image(content=content)
    response = client.label_detection(image=image)
    labels = response.label_annotations
    res = []
    if response.error.message:
        print(f'Error for image {ImageLoc}')
        print(f'Error Message {response.error.message}')
        res.append( ('Error', 1.0 ) )
    else:
        res = []
        for l in labels:
            res.append( (l.description , l.score) )
    return res

#A random parking garage!
GetStreet('Bad Address, Dallas, TX',myloc,'Bad_Address')    
LabelImage(os.path.join(myloc,'Bad_Address.jpg'))
            
long_tup = []
for index, row in add_dat.iterrows():
    #Name of the image
    nm = str(index) + "_" + str(row['Inside'])
    #Download the image    
    GetStreet(row['FullAdd'],myloc,nm)
    #Get the labels
    labs = LabelImage(os.path.join(myloc,nm + '.jpg'))
    #Build the new data tuples
    for l in labs:
        long_dat = (index, nm +'.jpg', row['Inside'], row['FullAdd'], l[0], l[1])
        long_tup.append(long_dat)
    #Sleep for a second to not spam the servers
    time.sleep(1)
    print(f'Done with index {index}')

long_dat = pd.DataFrame(long_tup, 
                        columns=['Index','Image','Inside','Address','Description','Score'])
            
long_dat.to_csv('LabeledData.csv',index=False)

To get this to work you need a few things. First, you need to enable both the Vision API and the Streetview API in your Google API console. The streetview API has a key you can get directly from the API console (as described in my prior posts). But the vision API is different, and you can download a json file with all the necessary info and feed it into the client call. Once that is all done, you have it set up to query both API’s to get the images and then get the labels. But this is quick and dirty, it does not check for errors in either.

Here is a screenshot of some of the images downloaded, you can see that the streetview API doesn’t fail when their is no image available, it just does a mostly blank gray screenshot.

Analyzing the Results

I am not above just piping the results into an Excel document and doing some quick pivot tables. (I like doing that when there are many categories I want to explore quickly.) So here is a pivot table of the sum of the scores across the 300 outside hotspot (column 0) and 300 inside (column 1) images. So you can see the label of property is in more than half of the images for each (since the score value is never above 1). But property is more common outside hot spots than it is inside hot spots.

Here are contrast coded sums, so these identify the different labels that are more common in either hotspots or outside of hotspots. So outside of hotspots trees and plants appear more common (see Kondo et al., 2017 and Kondo’s other work on the topic). Inside hotspots we have more cars & asphault for examples.

This is just a quick and dirty analysis though. I do not take into account here missing images. The Screenshot label shows missing images are more common inside hotspots. And here since I use the addresses sometimes it gives me a shot of the street instead of the view perpendicular to the street. (I am not 100% sure the best way to do it, if you geocode and then use the lat/lon, you may not have the right view of the property either depending on the geocoding engine, so maybe going with the address directly is better?)

Future Work

In terms of predictive applications, I think using the streetview imagery is not likely to improve crime forecasts, that it is really only worthwhile for EDA or theory testing. In terms of predictive analysis, I actually think using the satellite imagery has more potential (see Jay, 2020 for an example, although that isn’t predictive but causal analysis).

So prior work has used 311 calls for service to identify high disorder areas (Magee, 2020; O’Brien & Winship, 2017; Wheeler, 2018), so I wonder if you can specifically build an image detector to identify particular disorder aspects that are not redundant with 311 calls. And also perhaps scales directly relevant to CPTED. The Google Vision labels are a bit superficial to really use for many theory crim applications I am afraid, but is an interesting exploratory data analysis to check them out.

References

Incorporating treatment non-compliance into call-ins

I have previously published work on identifying optimal individuals to prioritize for call-ins in Focused Deterrence interventions. The idea is we want to identify optimal people to spread the message, so you call in a small number of individuals and they should spread the message to the remaining group. There are better people than others to seed the message to to make sure it spreads throughout the network.

I knew of a direct improvement on that algorithm I published (very similar to the TURF problem I described the other day). But the bigger issue was that even when you call in individuals they do not always come to the meeting – treatment non-compliance. When working with state parole and/or local probation, the police department can ask those agencies to essentially make people come in, but otherwise it is voluntary.

The TURF problem I did the other day gave me a bit of inspiration on how to tackle that treatment non-compliance problem though. In a nutshell when you calculate whether someone is reached (via being directly connected to someone called-in), they can be partially reached based on the probability of the selected nodes treatment compliance. I have posted the code to follow along on dropbox here. I won’t go through the whole thing, but just some highlights.

The Model

First, in some quick and dirty text math, the model is:

Maximize Sum( R_i )

Subject to:

  • R_i <= Sum( S_j*p_j ) for each i
  • Sum( S_j ) = k
  • S_i element of [0,1]
  • R_i <= 1 for each i

Here i refers to an individual node in the gang/group network.

The first constraint R_i <= Sum( S_j*p_j ), the j’s are the nodes that are connected to i (and i itself). The p_j are the estimates that an individual will comply with coming into the call-in. For one agency we worked with for that project, they guessed that those who don’t need to come in comply about 1/6th of the time, so I use that estimate here in my examples, and give people who are on probation/parole a 1 for the probability of compliance.

Second constraint is we can only call in so many people, here k. The model solves very fast, so you can generate results for various k until you get the reach you want to in the end. (You could do the model the other way, minimize S_i while constraining the minimized acceptable reach, e.g. Sum( R_i ) >= threshold, I don’t suggest this in practice though, as when dealing with compliance there may be no feasible solution that gets you the amount of reach in the network you want.)

For the third constraint, the decision variables S_i are binary 0/1’s, but the R_i are continuous. But the trick here is that the last constraint, R_i <= 1, means that the expected reach is capped at 1. Here is a way to think about this, imagine you want to know the chance that person A is reached, and they are connected to two called-in individuals, who each have a 40% chance at complying with the treatment (coming to the call-in). The expected times person A would be reached then is additive in the probabilities, 0.4 + 0.4 = 0.8. If we had 3 people connected to A again at 40% apiece, the expected number of times A would be reached is then 0.4 + 0.4 + 0.4 = 1.2. So a person can be reached multiple times. (Note this is not the probability a person is reached at least once! It is a non-linear problem to model that.)

But if we took away the last constraint, what would happen is that the algorithm would just pick the nodes that had the highest number of neighbors. Since we are maximizing expected reach, if we had a sample of two people, the expected reach values of [2.5, 0] would be preferable to [1, 1], although clearly we rather have the reach spread out. So to prevent that, I cap the expected reach variable at 1, R_i <= 1 for each i, so this spreads out the selected individuals. So in the end the expected number of times people are reached are a lower bound estimate, but those are only people who are expected to receive the message multiple times.

This is a bit of a hack, but in my tests works quite well. I attempted to model the non-linear problem of estimating the probabilities at the person level and still maximizing the expected reach (in the code I have an example of using the CVXR R package). But it was quite fickle in when it would return a solution. So I am focusing on the linear program here, which is not perfect, but is an improvement over my prior published work.

Some Python Snippets

So for my example code, I am using City 4 Gang 4 from my paper. The reason is this was the largest network, and my original algorithm performed the worst. 99 nodes, and my original algorithm identified a 33 person dominant set, but Borgotti’s tool (that uses a genetic algorithm) identified a 29 dominant set.

Here is an example of calling my function to select the individuals for a call-in based on the non-compliance estimates. (g4 is the networkx graph object, the second arg is the number of individuals, and compliance is the node attribute that has the probability of treated compliance.) If we call in only 5 people, we still expect a reach of 29 individuals. Here there ends up being some highly connected people on parole/probation, so they have a 1 probability of complying with the treatment.

A consequence of this algorithm is that if you pipe in 1’s for the treatment compliance, you basically get an improvement to my original algorithm. So for a test we can see if I get the same minimal dominating set as Borgotti did for his algorithm here, where const is just everybody complies 100% of the time.

And yep we get a dominating set (all 99 people are reached). What happens if we go down one, and only select 28 people?

We only reach 98 out of the 99. So it appears a 29 set is the minimal dominating set here. But like I said the treatment non-compliance is a big deal in this setting. What is our expected reach if we take that into account, but still call-in 29 people?

It is still pretty high, around 2/3s of the network, but is still much smaller. Also if you look at the overlap between the constant versus non-compliance model, they select quite a few different individuals. It makes a big difference.

Here is a graph I made of selecting 20 individuals. Red means I selected that person, pink means they are reached at least some, and the size of the reach is proportion to the node. Then grey folks I wouldn’t expect to be reached by the message (at least by first degree connections).

So you can see that most of the people selected have that full 1 expected reach, so the algorithm does prioritize individuals on probation/parole who have a 100% expected compliance. But you can see a few folks who have a lower compliance who are selected as they are in places in the network not covered by those on probation/parole.

I have a tough time getting network layouts to look nice in python (even with the same layout algorithms, I feel like igraph in R just looks much better out of the box).

Future Work

Out of the box, this algorithm could incorporate several different pieces of information. So here I use the non-compliance estimate as a constant, but you could have varying estimates for that based on some other model no problem (e.g. older individuals comply more often than younger, etc.). Also another interesting extension (if you could get estimates) would be the probability a called-in individual spreads the message. In the part Sum( S_j*p_j ) it would just be something like Sum( S_j*p_cj*p_sj ), where p_cj is the compliance probability for attending, and p_sj is the probability to spread the message to those they are connected to.

Getting worthwhile estimates for either of those things will be tough though. Only way I can see it is via some shoe leather qualitative or survey approach.

Street Network Distances and Correlations

Wouter Steenbeek (a friend and co-author for a few articles) has a few recent blog posts replicating some of my prior work replicating some of my work on street network vs Euclidean distances in Albany, NY (Wouters, 1, 2) and my posts (1,2).

In Wouter’s second post, he was particularly interested in checking out shorter distances (as that is what we are often interested in in criminology, checking crime clustering). When doing that, the relationship between network and Euclidean distances sometimes appear less strong, so my initial statement that they tend to be highly correlated is incorrect.

But this is an artifact for the correlation between any two measures – worth pointing out in general for analysis. If you artificially restrict the domain of one variable the correlation always goes down. See some examples on the cross-validated site (1, 2) that illustrate this with nicer graphs than I can whip up in a short time.

But for a quick idea about the issue, imagine a scenario where you slice out Euclidean distances in some X bin width, and check the scatterplot between Euclidean and network distances. So you will get less variation on the X axis, and more variation on the Y axis. Now take this to the extreme, and slice on Euclidean distances at only one value, say 100 meters exactly. In this scatterplot, there is no X variation, it is just a vertical line of points. So in that scenario the correlation is 0.

So I should not say the correlation between the two measures is high, as this is not always true – you can construct an artificial sample in which that statement is false. So a more accurate statement is that you can use the Euclidean distance to predict the network distance fairly accurately, or that the linear relationship between Euclidean and network distances is quite regular – no matter what the Euclidean distance is.

My analysis I have posted the python code here. But for a quick rundown, I grab the street networks for a buffer around Albany, NY using the osmnx library (so it is open street map network data). I convert this street network to an undirected graph (so no worrying about one-way streets) in a local projection. Then using all of the intersections in Albany (a few over 4000), I calculate all of the pairwise distances (around 8.7 million pairs, takes my computer alittle over a day to crunch it out in the background).

So again, the overall correlation is quite high:

But if you chunk the data up into tinier intervals, here 200 meter intervals, the correlations are smaller (an index of 100 means [0-200), 300 means [200-400), etc.).

But this does not mean the linear relationship between the two change. Here is a comparison of the linear regression line for the whole sample (orange), vs a broken-stick type model (the blue line). Imagine you take a slice of data, e.g. all Euclidean distances in the bin [100-200) and fit a regression line. And then do the same for the Euclidean distances [200-300) etc. The blue line here are those regression fits for each of those individual binned estimates. You can see that the two estimates are almost indistinguishable, so the relationship doesn’t change if you subset the data to shorter distances.

Technically the way I have drawn the blue line is misleading, I should have breaks in the line (it is not forced to be connected between bins, like my post on restricted cubic splines is). But I am too lazy to write code to do those splits at the moment.

Now, what does this mean exactly? So for research designs that may want to use network distances and an independent variable, e.g. look at prison visitation as a function of distance, or in my work on patrol redistricting I had to impute some missing travel time distances, these are likely OK to use typical Euclidean distances. Even my paper on survivability for gun shot fatality shows improved accuracy estimates using network distances, but very similar overall effects compared to using Euclidean distances.

So while here I have my computer crunch out the network distances for a day, where the Euclidean distances with the same data only takes a second, e.g. using scipy.spatial.distance. So it depends on the nature of the analysis whether that extra effort is worth it. (It helps to have good libraries ease the work, like here I used osmnx for python, and Wouter showed R code using sf to deal with the street networks, hardest part is the networks are often not stored in a way that makes doing the routing very easy. Neither of those libraries were available back in 2014.) Also note you only need to do the network calculations once and then can cache them (and I could have made these network computations go faster if I parallelized the lookup). So it is slightly onerous to do the network computations, but not impossible.

So where might it make a difference? One common use of these network distances in criminology is for analyses like Ripley’s K or near-repeat patterns. I don’t believe using network distances makes a big deal here, but I cannot say for sure. What I believe happens is that using network distances will dilate the distances, e.g. if you conclude two point patterns are clustered starting at 100 meters using Euclidean distances, then if using network it may spread out further and not show clustering until 200 meters. I do not think it would change overall inferences, such as where you make an inference whether two point patterns are clustered or not. (One point is does make a difference is doing spatial permutations in Ripley’s K, you should definitely restrict the simulations to generating hypothetical distributions on the street network and not anywhere in the study area.)

Also Stijn Ruiter makes the point (noted in Wouter’s second post), that street networks may be preferable for prediction purposes. Stijn’s point is related to spatial units of analyses, not to Euclidean vs Network distances. You could have a raster spatial unit of analysis but incorporate street network statistics, and vice-versa could have a vector street unit spatial unit of analysis and use Euclidean distance measures for different measures related to those vector units.

Wouter’s post also brought up another idea I’ve had for awhile, that when using spatial buffers around areas they can be bad control areas, as even if you normalize the area they have a very tiny sliver of network distance attributable to them. I will need to show that for another blog post though. (This was mostly my excuse to learn osmnx to do the routing!)