Drawing Google Streetview images down an entire street using python

Update: I have been asked several times over the years for help on this. In response I have made a simple GUI tool that, given a list of addresses and a download folder location, will download all of the images. Applications include websites listing properties and marketers preparing mailings.

The tool costs $300; check out the CRIME De-Coder Store to purchase.


I’ve previously written about grabbing Google Streetview images given a particular address. For a different project I sampled images running along an entire street, so I figured I would share that code. It is a bit more complicated though, because when you base it on an address you do not need to worry about drawing the same image twice. So I will walk through an example.

So first we will import the necessary libraries, then globally define your API key and the download folder you want to save the streetview images into.

#Upfront stuff you need
import urllib.request, os, json
key = "&key=" + "!!!!!!!!!!!!!YourAPIHere!!!!!!!!!!!!!!!!"
DownLoc = r'!!!!!!!!!!!YourFileLocationHere!!!!!!!!!!!!!!'  

Second are a few functions. The first, MetaParse, grabs the date (month and year) and pano_id for a particular street view image. This matters because if you submit even a slightly different set of lat-lon coordinates, Google will happily serve up the same image again. To prevent that, we do a sort of memoization: we grab the meta-data first and stuff it in a global list, PrevImage. If you have already downloaded that image once, the second function, GetStreetLL, will not download it again, as it checks the PrevImage list. If you are doing a ton of images you may want to limit the size of PrevImage, but it is no problem doing a few thousand images as is. (With a free account you can IIRC get 25,000 images in a day, but the meta-queries count against that as well.)

def MetaParse(MetaUrl):
    response = urllib.request.urlopen(MetaUrl)
    jsonRaw = response.read()
    jsonData = json.loads(jsonRaw)
    #return jsonData
    if jsonData['status'] == "OK":
        if 'date' in jsonData:
            return (jsonData['date'],jsonData['pano_id']) #sometimes it does not have a date!
        else:
            return (None,jsonData['pano_id'])
    else:
        return (None,None)

PrevImage = [] #Global list that has previous images sampled, a sort of memoization
        
def GetStreetLL(Lat,Lon,Head,File,SaveLoc):
    base = r"https://maps.googleapis.com/maps/api/streetview"
    size = r"?size=1200x800&fov=60&location="
    end = str(Lat) + "," + str(Lon) + "&heading=" + str(Head) + key
    MyUrl = base + size + end
    fi = File + ".jpg"
    MetaUrl = base + r"/metadata" + size + end
    #print(MyUrl, MetaUrl) #can check out image in browser to adjust size, fov to needs
    met_lis = list(MetaParse(MetaUrl))                           #does not grab image if no date
    if (met_lis[1],Head) not in PrevImage and met_lis[0] is not None:   #PrevImage is global list
        urllib.request.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))
        met_lis.append(fi)
        PrevImage.append((met_lis[1],Head)) #append new Pano ID to list of images
    else:
        met_lis.append(None)
    return met_lis  

Now we are ready to download images running along an entire street. To get the necessary coordinates and heading information I worked it out in a GIS. Using a street centerline file I regularly sampled points along the streets. Based on those sample points you can calculate a local trajectory of the street, and then based on that trajectory turn the camera how you want it. Most social science folks, I imagine, want it to look at the sidewalk, so you calculate a heading 90 degrees from the orientation of the street.

Using trial and error I found that spacing the samples around 40 feet apart tended to get a new image. I have the pixel size and fov parameters to the streetview api hard set in the function, but you could easily amend the function to take those as arguments as well.
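
If you want to approximate that step without a GIS, below is a minimal sketch in plain Python. The Bearing helper is just for illustration; the two points are the first two locations in the example DataList below, and adding 90 degrees to the roughly 7 degree street bearing recovers the 97 degree headings used in that data.

#Sketch of the heading calculation, no GIS needed
from math import atan2, radians, degrees, sin, cos

def Bearing(lat1, lon1, lat2, lon2):
    #initial compass bearing from point 1 to point 2, in degrees [0,360)
    dlon = radians(lon2 - lon1)
    la1, la2 = radians(lat1), radians(lat2)
    x = sin(dlon)*cos(la2)
    y = cos(la1)*sin(la2) - sin(la1)*cos(la2)*cos(dlon)
    return (degrees(atan2(x, y)) + 360) % 360

p1 = (40.7036043470179800, -74.0143908501053400) #consecutive sample points
p2 = (40.7037139540670900, -74.0143727485309500)
street = Bearing(p1[0], p1[1], p2[0], p2[1])
head = (street + 90) % 360 #turn the camera toward one sidewalk
print(street, head) #~7 degrees along the street, ~97 facing the sidewalk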

So next I have an example list of tuples with lat-lons and orientations. Then I just loop over those sample locations and draw the images. Here I also have another list, image_list, that records the file each image is saved to, as well as the pano-id and date meta-data.

DataList = [(40.7036043470179800,-74.0143908501053400,97.00),
            (40.7037139540670900,-74.0143727485309500,97.00),
            (40.7038235569946140,-74.0143546472568100,97.00),
            (40.7039329592712600,-74.0143365794219800,97.00),
            (40.7040422704154500,-74.0143185262956300,97.00),
            (40.7041517813782500,-74.0143004403322000,97.00),
            (40.7042611636045350,-74.0142823755611700,97.00),
            (40.7043707615693800,-74.0142642750708300,97.00)]

    
image_list = [] #to stuff the resulting meta-data for images
ct = 0
for i in DataList:
    ct += 1
    fi = "Image_" + str(ct)
    temp = GetStreetLL(Lat=i[0],Lon=i[1],Head=i[2],File=fi,SaveLoc=DownLoc)
    if temp[2] is not None:
        image_list.append(temp)

I have posted the entire python code snippet here. If you want to see the end result, you can check out the photo album. Below is one example image out of the 8 in that street segment, but when viewing the whole album you can see how it runs along the entire street.

Still, one of the limitations of this, as far as I can tell, is that there is no easy way to draw older images; with this approach you just get the most recent image. You need to know the pano-id to query older images. Preferably the meta-data json would contain multiple entries, but that is not the case. Let me know if there is a way to amend this to grab older imagery or imagery over time. Here is a great example from Kyle Walker showing changes over time in Detroit.

Work on Shootings in Dallas Published

I have two recent articles that examine racial bias in decisions to shoot using Dallas Police Data:

  • Wheeler, Andrew P., Scott W. Phillips, John L. Worrall, and Stephen A. Bishopp. (2018) What factors influence an officer’s decision to shoot? The promise and limitations of using public data. Justice Research and Policy Online First.
  • Worrall, John L., Stephen A. Bishopp, Scott C. Zinser, Andrew P. Wheeler, and Scott W. Phillips. (2018) Exploring bias in police shooting decisions with real shoot/don’t shoot cases. Crime & Delinquency Online First.

In each the main innovation is using control cases in which officers pulled their firearm and pointed at a suspect, but decided not to shoot. Using this design we find that officers are less likely to shoot African-Americans, which runs counter to most recent claims of racial bias in police shootings. Besides the simulation data of Lois James, this is a recurring finding in the recent literature — see Roland Fryer’s estimates of this as well (although he uses TASER incidents as control cases).

The reason for the two articles is that John and I found out through casual conversation that we were both pursuing very similar projects, so we decided to collaborate. The paper on which John is first author examined individual officer level outcomes; in particular it retrieved personnel complaint records for individual officers and found they did correlate with officer decisions to shoot. In my article I intentionally stuck with the publicly available open data, as a main point of the work was to articulate where the public data fall short and in turn suggest what information would be needed in such a public database to reasonably identify racial bias. (The public data are aggregated to the incident level; one incident can have multiple officers shooting.) From that I suggest that instead of a specific officer involved shooting database, it would make more sense to have officer use of force (at all levels) attached to incident based reporting systems (i.e., NIBRS should have use of force fields included). In a nutshell, when examining any particular use-of-force outcome, you need a counter-factual in which that use-of-force could have happened but didn't. The natural way to do that is to have all levels of force recorded.

Both John and I thought prior work that only looked at shootings was fundamentally flawed. In particular, analyses where armed/unarmed was the main outcome among only a set of shooting cases confuse cause and effect, and subsequently cannot be used to determine racial bias in officer decision making. Another way to think about it is that when only looking at shootings you are limiting yourself to examining potentially bad outcomes; officers often use their discretion for good (the shooting rate in the Dallas data is only 3%). So in this regard databases that only include officer involved shooting cases are fundamentally limited in assessing racial bias: you need cases in which officers did not shoot to assess bias in officer decision making.

This approach of course has some limitations as well. In particular, it uses another point of discretion for officers: when to draw their firearm. It could be the case that there is no bias in terms of when officers pull the trigger, but that they are more likely to pull their gun against minorities; our studies cannot deny that interpretation. But it is also the case that other explanations could account for why minorities are more likely to have an officer point a gun at them, such as geographic policing, or even more basically that minorities call the police more often. In either case, at the specific decision point of pulling the trigger, there is no evidence of racial bias against minorities in the Dallas data.

I did not post pre-prints of this work due to the potentially contentious nature of the subject, as well as the fact that colleagues were working on additional projects based on the same data. For the paper on which I am first author, I have posted the last version before the journal's copy-edits here. If you would like a copy of the article on which John is first author, always feel free to email.

Making interactive plots with R and Plotly

I wrote a small op-ed based on the homicide studies work I recently published about interpreting crime trends. Unfortunately that op-ed was not picked up by anyone (I missed the timing a bit; maybe next year when the UCR stats come out I can just update the numbers and make the same point). I've posted that op-ed here, and I wanted to make a quick blog post detailing how I made the interactive graphs in that post using R and the Plotly library. All the data and code to replicate this can be downloaded from here.

Unfortunately with my free wordpress blog I cannot embed the actual interactive graphics, but I will provide links to working online versions at my UT Dallas page, and show a screenshot of each. So first, let's load all of the libraries that you will need, as well as set the working directory. (Of course change it to where you have your data saved on your local machine.)

#########################################################
#Making a shiny app for homicide rate chart
library(shiny)
library(ggplot2)
library(plotly)
library(htmlwidgets)
library(scales)

mydir <- "C:\\Users\\axw161530\\Box Sync\\Projects\\HomicideGraphs\\Analysis\\Analysis" 
setwd(mydir)
#########################################################

Now I just read in the data. I have two datasets; the funnel rates dataset just has additional columns already created to draw the funnel graphs. (See here or here, or the data in the original Homicide Studies paper linked at the top, on how to construct these.)

############################################################
#Get the data 

FunnRates <- read.csv(file="FunnelData.csv",header=TRUE)
summary(FunnRates)
FunnRates$Population <- FunnRates$Pop1 #These are just to make nicer labels 
FunnRates$HomicideRate <- FunnRates$HomRate

IntRates <- read.csv(file="IntGraph.csv",header=TRUE)
summary(IntRates)
############################################################

Funnel Chart for One Year

First, plotly makes it dead easy to take graphs you created via ggplot and turn them into an interactive graph. So here is a link to the interactive chart, and below is a screenshot.

To walk through the code, first you make your (almost) plain Jane ggplot object. Here I name it p. You will get a warning for an "unknown aesthetics: text", but that aesthetic will be used by plotly to create tooltips. Then you use the ggplotly function to turn the original ggplot graph p into an interactive graph. By default the plotly object has more stuff in the tooltip than I want, which you can strip out by going into the innards of the plotly object. The final part just sets the margins a little larger than the default, as the axis labels were otherwise slightly cut off.

############################################################
#Make the funnel chart
year_sel <- 2015
p <- ggplot(data = FunnRates[FunnRates$Year == year_sel,]) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab), pch=21) +
     geom_line(aes(x=Population,y=LowLoc99)) + geom_line(aes(x=Population,y=HighLoc99)) + 
     labs(title = paste0("Homicide Rates per 100,000 in ",year_sel)) + 
     scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
     scale_y_continuous("Homicide Rate", breaks=seq(0,110,10)) + 
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

pl <- ggplotly(p, tooltip = c("HomicideRate","text"))
#pl <- plotly_build(p, width=1000, height=900)
#See https://stackoverflow.com/questions/45801389/disable-hover-information-for-a-specific-layer-geom-of-plotly
pl$x$data[[2]]$hoverinfo <- "none"
pl$x$data[[3]]$hoverinfo <- "none"
pl <- pl %>% layout(margin = list(l = 75, b = 65))
############################################################

After this point you can just type pl into the console and it will open up an interactive window. Or you can use the saveWidget function from the htmlwidgets package, something like saveWidget(as_widget(pl), "FunnelChart_2015.html", selfcontained=TRUE) to save the graph to an html file.

Now there are a couple of things to note. You can edit various parts of the graph, such as its size and label text size, but depending on your application these might not be a good idea. If you need to take into account smaller screens, I think it is best to use some of the defaults, as they adjust to the screen in use. For the size of the graph, if you are embedding it in a webpage using iframes you can set the size at that point. If you look at my linked op-ed you can see I make the funnel chart taller than wide; that is done through the iframe specs.

Funnel Chart over Time

Ok, now onto the fun stuff. So we have a funnel chart for one year, but I have homicide data from 1965 through 2015. Can we examine those over time? Plotly has an easy to use additional argument for ggplot graphs, named frame, that allows you to add a slider to the interactive chart for animation. The additional argument ids links one object over time, a la the Hans Rosling bubble chart. Here is a link to the interactive version, and below is a screen shot:

############################################################
#Making the funnel chart where you can select the year
py <- ggplot(data = FunnRates) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab, frame=Year,ids=ORI), pch=21) +
      geom_line(aes(x=Population,y=LowLoc99,frame=Year)) + geom_line(aes(x=Population,y=HighLoc99,frame=Year)) + 
      labs(title = paste0("Homicide Rates per 100,000")) + 
      scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
      scale_y_continuous("Homicide Rate", breaks=seq(0,110,10), limits=c(0,110)) + 
      theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

ply <- ggplotly(py, tooltip = c("HomicideRate","text")) %>% animation_opts(0, redraw=FALSE)
ply$x$data[[2]]$hoverinfo <- "none"
ply$x$data[[3]]$hoverinfo <- "none"
saveWidget(as_widget(ply), "FunnelChart_YearSelection.html", selfcontained=FALSE)
############################################################

The way I created the data it does not make sense to do a smooth animation for the funnel line, so this just flashes to each new year (via the animation_opts spec). (I could make the data so it would look nicer in an animation, but I will wait for someone to pick up the op-ed before I bother too much more with this.) But the slider still lets you pick whichever year you want.

Fan Chart Just One City

Next we are onto the fan charts for each individual city with the prediction intervals. Again you can just create this simple chart in ggplot, and then use plotly to make a version with tooltips. Here is a link to an interactive version, and below is a screenshot.

###################################################
#Making the fan graph for New Orleans
titleLab <- unique(IntRates[,c("ORI","NiceLab","AgencyName","State")])
p2 <- ggplot(data=IntRates[IntRates$ORI == "LANPD00",], aes(x=Year, y=HomRate)) + 
     geom_ribbon(aes(ymin=LowB, ymax=HighB), alpha=0.2) +
     geom_ribbon(aes(ymin=LagLow25, ymax=LagHigh25), alpha=0.5) +
     geom_point(shape=21, color="white", fill="red", size=2) +
     labs(x = "Year", y="Homicide Rate per 100,000") +
     #scale_x_continuous(breaks=seq(1960,2015,by=5)) + 
     ggtitle(paste0("Prediction Intervals for ",titleLab[titleLab$ORI == "LANPD00",c("NiceLab")])) +
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))
#p2
pl2 <- ggplotly(p2, tooltip = c("Year","HomRate"), dynamicTicks=TRUE)
pl2$x$data[[1]]$hoverinfo <- "none"
pl2$x$data[[2]]$hoverinfo <- "none"
pl2 <- pl2 %>% layout(margin = list(l = 100, b = 65))
#pl2
saveWidget(as_widget(pl2), "FanChart_NewOrleans.html", selfcontained=FALSE)
###################################################

Note when you save the widget with selfcontained=FALSE, it puts several parts of the data into separate folders. I always presumed this was more efficient than making one huge html file, but I don't know for sure.

Fan Chart with Dropdown Selection

Unfortunately the frame type animation does not make as much sense here. It would be hard for someone to find a particular city of interest in that slider (as a note though, the slider can take nominal data; if I only had a few cities it would work out ok, but with a few hundred it will not). So, a feature request if anyone from plotly is listening: please add a dropdown type option for ggplot graphs! In the meantime there is an alternative using a traditional plot_ly type chart. Here is that interactive fan chart with a police agency dropdown, and below is a screenshot. Note there were problems with this; I believe updates to plotly and ggplot have since made this easier, so I will need to revisit it.

###################################################
#Making the fan graph where you can select the city of interest
#Need to have a dropdown for the city

titleLab <- unique(IntRates[,c("ORI","NiceLab","State")])
nORI <- length(titleLab[,1])
choiceP <- vector("list",nORI)
for (i in 1:nORI){
    choiceP[[i]] <- list(method="restyle", args=list("transforms[0].value", unique(IntRates$NiceLab)[i]), label=titleLab[i,c("NiceLab")])
}

trans <- list(list(type='filter',target=~NiceLab, operation="=", value=unique(IntRates$NiceLab)[1]))
textLab <- ~paste("Homicide Rate:",HomRate,'$ Year:',Year,'$ Homicides:',Homicide,'$ Population:',Pop1,'$ Agency Name:',NiceLab)

#Lets try with the default plotly
#See https://community.plot.ly/t/need-help-on-using-dropdown-to-filter/6596
ply4 <- IntRates %>%
        plot_ly(x= ~Year, y= ~HighB, type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name="90%", hoverinfo="none", transforms=trans) %>%
        add_trace(y=~LowB,  type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name='10%', hoverinfo="none", transforms=trans,
          fill = 'tonexty', fillcolor='rgba(105,105,105,0.3)') %>%
        add_trace(x=~Year,y=~HomRate, text=~NiceLab, type='scatter', mode='markers', marker = list(size=10, color = 'rgba(255, 182, 193, .9)', line = list(color = 'rgba(152, 0, 0, .8)', width = 1)),
          hoverinfo='text', text=textLab, transforms=trans) %>%
        layout(title = "Homicide Rates and 80% Prediction Intervals by Police Department",
          xaxis = list(title="Year"),
          yaxis = list(title="Homicide Rate per 100,000"),
          updatemenus=list(list(type='dropdown',active=0,buttons=choiceP)))
                               
saveWidget(as_widget(ply4), "FanChart_Dropdown.html", selfcontained=FALSE)
###################################################

So in short, plotly makes it super easy to make interactive graphs with tooltips. As a long term goal I would like to make a visual supplement to the traditional UCR report (I find complaints about which tables to include to miss the point; there are much better ways to show the information than worrying about the specific tables). So if you would like to work on that with me, always feel free to get in touch!

Digg reader is shutting down, giving Twitter a try

I've used RSS feeds for quite a while now to keep up with blogs I enjoy. I also use them to follow scholarly journals of interest. Unfortunately, my current feed reader of choice (Digg Reader) is shutting down.

This is the second time my feed reader has shuttered (I used Google Reader before it shut down as well). Another particular problem I never really solved was link rot. Google Reader had some metrics where you could see old feeds that had not had any new posts for a while. Digg had no such service, and I tried my hand at writing python code to do this myself, but that code never quite worked out.

To partially replace this service, instead of migrating to another feed reading service I will give Twitter a shot. Twitter is a bit chaotic from what I can tell; I much prefer the spreadsheet like listing of just titles to peruse news and events of interest in the morning. I had been using Google+ and liked it (yes, I know I'm one of those nerds), but it is a bit of a ghost town. So I will migrate entirely over to Twitter and give it a shot.

Paper published: The effect of housing demolitions on crime in Buffalo, New York

I have a new paper published with a few of my colleagues up in Buffalo, Dae-Young Kim and Scott Phillips. This work looks at the crime reduction effects of widespread demolitions in Buffalo, is titled The Effect of Housing Demolitions on Crime in Buffalo, New York, and was published in the Journal of Research in Crime & Delinquency. In short, at the micro level there is very strong evidence that demolitions reduce crime; at the neighborhood level the evidence is not as strong. This is likely partly due to the neighborhood level analysis being underpowered, as several of the estimates between the two levels are very similar overall.

If you cannot get access to that published article, you can always send me an email for a copy, or you can download a pre-print version from SSRN.

Below is one of the images from the paper, a set of small-multiple maps showing demographic characteristics of Buffalo census tracts:

 

Someone could surely replicate this micro level result in other cities that have experienced widespread demolitions (like Detroit). But for long term city planners I would consider more rigorous designs that incorporate not only selective demolition, but also other neighborhood investment strategies to improve neighborhoods over the long term. That is, this research is good evidence of the near-term crime reduction effects of demolitions, but for the long haul leaving empty lots is not going to greatly improve neighborhoods.

New working paper: Modeling the Spatial Patterns of Intra-Day Crime Trends

I have a new working paper out with Cory Haberman, Modeling the Spatial Patterns of Intra-Day Crime Trends. Below is the abstract:

Several prior studies have found that despite theoretical expectations otherwise, facilities (such as on-premise alcohol outlets) have consistent effects on crime regardless of time of the day (Bernasco et al., 2017; Haberman & Ratcliffe, 2015). We explain these results by failure to account for the regular background wave of crime, which results from ubiquitous patterns of human routine activities. Using eight years of data on assaults and robberies in Seattle (WA), we demonstrate the regularity of the within-day crime wave for all areas of the city. Then using models to predict when a crime will most likely occur, we demonstrate how schools and on-premise alcohol outlets cause bumps in the background wave at particular times of the day, such as when school dismisses. But those bumps dissipate quite rapidly in space, and are relatively small compared to the amplitude of the regular background wave of crime. Although facilities have theoretical times in which they should have a greater influence on crime patterns, they are situated within a community of other human activity uses, making it difficult to uniquely identify their effects separately from other aspects of the built environment.

And here is a joyplot showing the changes in the hour of day wave depending on how close robberies are to a public high school or middle school:

You can see bumps very near schools at 7 am, then around noon and throughout the later afternoon, but these are smoothed out by the time you get to around 2,000 feet away from schools.

The idea behind this paper is that several recent articles have not found much of a conditional relationship between crime generators and time of day. For example, you would think bars only affect crime at nighttime when most people are at the bar, but several recent articles found the time of day does not make much of a difference (Bernasco et al., 2017; Haberman & Ratcliffe, 2015). We hypothesize this is because the background wave of crime per hour of the day is much larger in magnitude than any local factor. An intuitive reason for this is that a place never has just a bar in isolation; there are other local land uses nearby that influence criminal patterns. You can see places nearby crime generators cause slight bumps in the background wave, but they are tiny compared to the overall amplitude of the general within-day crime wave.
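
To make that intuition concrete, here is a toy numeric sketch (all of the numbers are invented, not estimates from the paper): a large sinusoidal daily wave plus a small afternoon bump near a hypothetical facility. The bump is real, but tiny relative to the amplitude of the wave.

#Toy example of a big daily wave plus a small facility bump, invented numbers
import numpy as np

hours = np.arange(24)
background = 100 + 80*np.sin((hours - 9)*2*np.pi/24) #city-wide hourly wave
bump = 15*np.exp(-0.5*((hours - 15)/1.0)**2)         #small 3 pm facility bump
total = background + bump
print(background.max() - background.min()) #wave amplitude, 160
print(bump.max())                          #bump height, only 15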

The article has a link to data and code to reproduce the findings. As always if you have feedback I am all ears.

Paper published: Evaluating Community Prosecution Code Enforcement in Dallas, Texas

Some work John Worrall and I collaborated on was just published in Justice Quarterly, Evaluating Community Prosecution Code Enforcement in Dallas, Texas. I have two links to share:

If you need access to the article always feel free to email.

Below is the abstract:

We evaluated a community prosecution program in Dallas, Texas. City attorneys, who in Dallas are the chief prosecutors for specified misdemeanors, were paired with code enforcement officers to improve property conditions in a number of proactive focus areas, or PFAs, throughout the city. We conducted a panel data analysis, focusing on the effects of PFA activity on crime in 19 PFAs over a six-year period (monthly observations from 2010 to 2015). Control areas with similar levels of pre-intervention crime were also included. Statistical analyses controlled for pre-existing crime trends, seasonality effects, and other law enforcement activities. With and without dosage data, the total crime rate decreased in PFA areas relative to control areas. City attorney/code enforcement teams, by seeking the voluntary or court-ordered abatement of code violations and criminal activity at residential and commercial properties, apparently improved public safety in targeted areas.

This was a neat program, as PFAs are near equivalents of hot spots that police focus on. So for the evaluation we drew control areas from Dallas PD’s Target Area Action Grid (TAAG) Areas:

New preprint: The accuracy of the violent offender identification directive (VOID) tool to predict future gun violence

I have a new preprint out, The accuracy of the violent offender identification directive (VOID) tool to predict future gun violence. This is work with Rob Worden and Jasmine Silver from our time at the Finn Institute. Below is the abstract:

We evaluate the Violent Offender Identification Directive (VOID) tool, a risk assessment instrument implemented within a police department to prospectively identify offenders likely to be involved with future gun violence. The tool uses a variety of static measures of prior criminal history that are readily available in police records management systems. The VOID tool is assessed for predictive accuracy by taking a historical sample and calculating scores for over 200,000 individuals known to the police at the end of 2012, and predicting 103 individuals involved with gun violence (either as a shooter or a victim) during 2013. Despite weights for the instrument being determined in an ad-hoc manner by crime analysts, the VOID tool does very well in predicting involvement with gun violence compared to an optimized logistic regression and generalized boosted models. We discuss theoretical reasons why such ad-hoc instruments are likely to perform well in identifying chronic offenders for all police departments.

There were just slightly over 100 violent gun offenders we were trying to pick out of over 200,000 individuals. The VOID tool did really well! Here is a graph showing how many of those offenders VOID captured compared to a generalized boosted model (GBM) and two different logistic regression equations.

I have some thoughts in this article as to why a simple tool does just as well as more complicated regression and machine learning techniques, which is a common finding in recidivism studies as well. My elevator pitch for why is that most offenders are generalists: you can basically swap prior arrests for robbery with prior arrests for motor vehicle theft, and they both provide essentially the same signal of future potential criminality. See also the discussion of this in Dan Simpson's post on the Stat Modeling, Causal Inference and Social Science blog, which in turn makes me think the idea behind simple models can be readily applied to many decision points in the criminal justice field.
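
To illustrate that intuition, here is a small simulation sketch (the data generating process is entirely made up, not the VOID data): a handful of correlated criminal history flags drive a rare outcome, and a simple unit-weighted sum of the flags is compared against a fitted logistic regression. Because the flags are near-exchangeable signals of the same underlying propensity, the ad-hoc weights discriminate almost as well.

#Synthetic sketch: unit weights vs logistic regression, made up data
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
n = 20000
latent = rng.normal(size=n) #underlying criminal propensity
X = (latent[:,None] + rng.normal(size=(n,5)) > 0.8).astype(int) #5 correlated history flags
y = rng.binomial(1, 1/(1 + np.exp(-(latent - 3)))) #rare violence outcome

simple = X.sum(axis=1) #ad-hoc unit weighted score
logit = LogisticRegression().fit(X, y)
pred = logit.predict_proba(X)[:,1]

print(roc_auc_score(y, simple)) #the two AUCs end up very close
print(roc_auc_score(y, pred))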

The simple takeaway from this for crime analysts making chronic offender lists is to not let the perfect be the enemy of the good. Analysts can likely create an ad-hoc weighting to prioritize chronic offenders and it will do quite well compared to fancier models.

I will be presenting this work at the ACJS conference in New Orleans on Saturday 2/17/18. It is a great session, with YongJei Lee, Jerry Ratcliffe, Bryanna Fox, and Stacy Sechrist (see session 384 in the ACJS program), so stop on by. If you want to catch up with me in New Orleans just send me an email. And as always if you have feedback on the draft I am all ears.

New preprint: Testing for Similarity in Area-Based Spatial Patterns: Alternative Methods to Andresen’s Spatial Point Pattern Test

I just posted another pre-print to SSRN, Testing for Similarity in Area-Based Spatial Patterns: Alternative Methods to Andresen’s Spatial Point Pattern Test. This is work with Wouter Steenbeek and Martin Andresen. Below is the abstract:

Andresen’s spatial point pattern test (SPPT) compares two spatial point patterns on defined areal units: it identifies areas where the spatial point patterns diverge and aggregates these local (dis)similarities to one global measure. We discuss the limitations of the SPPT and provide two alternative methods to calculate differences in the point patterns. In the first approach we use differences in proportions tests corrected for multiple comparisons. We show how the size of differences matter, as with large point patterns many areas will be identified by SPPT as statistically different, even if those differences are substantively trivial. The second approach uses multinomial logistic regression, which can be extended to identify differences in proportions over continuous time. We demonstrate these methods on identifying areas where pedestrian stops by the New York City Police Department are different from violent crimes from 2006 through 2016.

And here is an example map using our proportion differences test and graduated circles to identify places with larger differences in the percentages:

This is opposed to the traditional SPPT output, which just identifies whether two areas are different and does not focus on the size of the difference, like below:

You can see that with a large sample size, basically everything is statistically different! (This uses over 4 million stops and over 800,000 violent crimes.) Focusing on the magnitude of the differences gives a much clearer indication of the patterns.
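
As a back-of-the-envelope illustration (the area shares below are made-up numbers of roughly the right magnitude, not values from the paper), a two-proportion z-test with samples this large flags even a trivial gap, say an area containing 1.05% of stops versus 1.00% of violent crimes:

#Toy two-proportion z-test with very large samples, made up shares
from math import sqrt
from scipy.stats import norm

n1, n2 = 4000000, 800000 #approximate total stops and violent crimes
p1, p2 = 0.0105, 0.0100  #one area's share of each point pattern
se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
z = (p1 - p2)/se
print(z, 2*norm.sf(abs(z))) #z of about 4, p-value well under 0.001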

The paper includes a dropbox link to download the data and code used to estimate the different techniques (it includes code in SPSS, R, and Stata). If you have any feedback as always let me know. This was submitted as a GISScience presentation for the 2018 ESRI User conference in July in San Diego, so I should have news about that presentation in the near future as well.

New preprint: A Gentle Introduction to Creating Optimal Patrol Areas

I have a new preprint posted, A Gentle Introduction to Creating Optimal Patrol Areas. Below is the abstract:

Models to create optimal patrol areas have been in existence for over 45 years, but police departments still regularly construct patrol areas in an ad-hoc fashion. This essay walks the reader through formulating an integer linear program to create a set number of patrol areas that have near equal call load and that are contiguous, using simple examples. Then the technique is illustrated using a case study in Carrollton, TX. Creating optimal patrol areas not only has the potential to improve efficiency in response times, but can also encourage hot spots policing. Applications of linear programming can additionally be applied to a wide variety of problems within criminal justice agencies, and this essay provides a gentle introduction to understanding the mathematical notation of linear programming.

In this paper I introduce a very simple integer linear program to create patrol beats, and then build up the complexity into the fuller p-median problem with additional constraints applicable to making patrol areas. The constraints I introduce to make the call loads near equal are the only real novel aspect of the paper (although no doubt someone else has done something similar previously), but I was a bit frustrated reading other linear programs to create patrol areas. Most work was concentrated in operations research journals and in my opinion was totally inaccessible to a typical crime analyst. So I frame the paper as an introduction to integer linear programs, walk through some simplified examples, and then apply the full model in Carrollton. I also provide an extensive walkthrough in the supplementary materials using the python library PuLP, so others can replicate the work with their own data.
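
To give a flavor of the math, below is a heavily simplified PuLP sketch. The distances and call loads are toy numbers, and the contiguity constraints discussed in the paper are omitted; this just shows a p-median style assignment with beat call loads constrained to be within 20% of an equal split.

#Simplified p-median sketch with call load balance, toy data (no contiguity)
import pulp

areas = range(6)
dist = [[abs(i - j) for j in areas] for i in areas] #stand-in distance matrix
calls = [10, 30, 20, 25, 15, 20] #calls for service per area
p = 2 #number of patrol beats
tot = sum(calls)
lo, hi = 0.8*tot/p, 1.2*tot/p #beat call load within 20% of an equal split

prob = pulp.LpProblem("PatrolAreas", pulp.LpMinimize)
x = pulp.LpVariable.dicts("assign", (areas, areas), cat="Binary") #x[i][j]=1 if area i in beat centered at j
y = pulp.LpVariable.dicts("center", areas, cat="Binary") #y[j]=1 if j is a beat center

#minimize the call weighted distance to the assigned centers
prob += pulp.lpSum(calls[i]*dist[i][j]*x[i][j] for i in areas for j in areas)
prob += pulp.lpSum(y[j] for j in areas) == p #exactly p beat centers
for i in areas:
    prob += pulp.lpSum(x[i][j] for j in areas) == 1 #each area assigned once
    for j in areas:
        prob += x[i][j] <= y[j] #only assign to open centers
for j in areas: #near equal call loads per beat
    prob += pulp.lpSum(calls[i]*x[i][j] for i in areas) <= hi*y[j]
    prob += pulp.lpSum(calls[i]*x[i][j] for i in areas) >= lo*y[j]

prob.solve()
for j in areas:
    if y[j].value() == 1:
        print(j, [i for i in areas if x[i][j].value() == 1])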

Here is my end example map of the optimal patrol areas in Carrollton.

  

You can see that my areas are not as nice and convex as one might like, although most approaches to correcting for that introduce multiple objective functions and/or non-linear functions, making the problem much harder to estimate in practice (which was part of my pet peeve with prior publications – none provided code to estimate the models described, with the exception of some of the work of Kevin Curtin).

Part of the reason I tackled this problem is that the question of how to make new patrol areas comes up all the time on the IACA list-serve. If you are an analyst interested in applying this in your jurisdiction and would like help, always feel free to contact me.