All posts in category Criminal Justice

Prelim results for NIJ Recidivism Challenge

So the prelim results for the NIJ recidivism challenge are up. My team, MCHawks with Gio Circo, did ok. Here is a breakdown of team winnings (minus the student category) per 1k. So while we won the most in the small team category, IdleSpeculation overall kicked our butt!

We actually biased our predictions to meet the racial fairness constraint, so you can see we did much better in those categories in Round 1 and Round 2. Unfortunately you only win if you get top in this category – no second place winners here (it says Brier score in these tables, but this is (1 - BrierScore)*(1 - FPDifference)):

But we got lucky and won the overall in Round 2 despite biasing our predictions. Round 3 we have no excuse really, while the predictions were biased it did not matter.

We will do a paper for the results, but overall our approach is pretty standard. For each round we did a grid search over various models – for R1 and R3 we did an L1 logit, for R2 we did an XGBoost model. I did attempt a specialized logit model with the fairness constraints in the loss function (and just used backpropagation to fit the model, à la deep learning), but in practice, given the way the fairness metric is calculated, this just added noise into the estimate.

I will have more to say in the future about fairness metrics, unfortunately here I do not think it was well thought out. It was simply the difference in the false positive rate between the white/black subgroups, assuming a threshold of 0.5, which does not make sense in practice. (I’ve written about calculating the threshold for bail here, it applies the same to parole though as well.) So for each model we simply clipped probabilities to be below 0.5 to meet this – if no one is predicted high risk, there are 0 false positives in each group.
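
To make the scoring rule concrete, here is a minimal Python sketch of the fairness-penalized score as described in this post, (1 - BrierScore)*(1 - FPDifference), with false positives counted at a 0.5 threshold. The function names, the toy data, and the choice of >= for the threshold are assumptions for illustration; this is not the official NIJ scoring code.

```python
# Sketch of the fairness-penalized score described in the post.
# Function names and toy data are hypothetical, not NIJ's scoring code.

def brier(y, p):
    # mean squared error between outcomes (0/1) and predicted probabilities
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def fp_rate(y, p, thresh=0.5):
    # share of true negatives predicted at/above the threshold (assumed >=)
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    if not neg:
        return 0.0
    return sum(pi >= thresh for pi in neg) / len(neg)

def fair_score(y, p, group, thresh=0.5):
    # (1 - BrierScore) * (1 - |FP rate difference between the two groups|)
    rates = []
    for g in sorted(set(group)):
        yg = [yi for yi, gi in zip(y, group) if gi == g]
        pg = [pi for pi, gi in zip(p, group) if gi == g]
        rates.append(fp_rate(yg, pg, thresh))
    fp_diff = abs(rates[0] - rates[1])
    return (1 - brier(y, p)) * (1 - fp_diff)

# toy data: outcomes, predicted probabilities, and two subgroups
y     = [0,   0,   1,   0,   0,   1]
p     = [0.7, 0.2, 0.8, 0.3, 0.1, 0.6]
group = ['a', 'a', 'a', 'b', 'b', 'b']

raw = fair_score(y, p, group)

# clip everything to just below 0.5, so no one is predicted high risk
clipped_p = [min(pi, 0.499) for pi in p]
clipped = fair_score(y, clipped_p, group)

print(raw, clipped)
```

In this toy example the clipped predictions have a worse Brier score but a better overall score, because the false positive difference drops to zero.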

So the higher threshold makes it silly, also the multiplication between the metrics I don’t think is a good idea either. I think it can be amended though to be a more reasonable additive fairness constraint. E.g. BrierScore + lambda*FPDifference, where lambda is a tuner to set how you want to make the tradeoff (and FP may be the total N difference, not a proportion difference, which can be volatile for small N). (Also I think it makes more sense to balance false negatives than false positives in the CJ example, but any algorithm to balance one can be flipped to balance the other.)
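
As a rough illustration of the additive version, here is a small Python sketch that treats BrierScore + lambda*FPDifference as a loss to minimize. The function name, the two models, and their numbers are made up; it only shows how lambda sets the tradeoff.

```python
# Hypothetical additive loss: BrierScore + lambda * FPDifference
# (lower is better); names and numbers are illustrative only.

def additive_loss(brier_score, fp_difference, lam=1.0):
    # lam sets how much one unit of false positive difference costs
    # relative to one unit of Brier score
    return brier_score + lam * fp_difference

# two made-up models: A is more accurate, B is more balanced
model_a = {'brier': 0.15, 'fp_diff': 0.10}
model_b = {'brier': 0.18, 'fp_diff': 0.02}

for lam in [0.0, 0.25, 1.0]:
    la = additive_loss(model_a['brier'], model_a['fp_diff'], lam)
    lb = additive_loss(model_b['brier'], model_b['fp_diff'], lam)
    best = 'A' if la < lb else 'B'
    print(f'lambda={lam}: A={la:.3f}, B={lb:.3f}, prefer {best}')
```

With a small lambda the more accurate model wins, and with a large lambda the more balanced model wins, which is the tradeoff the tuner is meant to expose.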

I do like how NIJ spreads prizes out, instead of having only 1/2/3 big prizes like Kaggle. I wish here we could submit two predictions though (one for main and one for fair). (I am pretty sure we would have placed in Year 1 if we did not bias our predictions.)

by Andy Wheeler on July 17, 2021 • Permalink

Posted in Criminal Justice, data science

Tagged prediction, recidivism

https://andrewpwheeler.com/2021/07/17/prelim-results-for-nij-recidivism-challenge/

CCTV and clearance rates paper published

My paper with Yeondae Jung, The effect of public surveillance cameras on crime clearance rates, has recently been published in the Journal of Experimental Criminology. Here is a link to the journal version to download the PDF if you have access, and here is a link to an open read access version.

The paper examines the increase in case clearances (almost always arrests in this sample) for incidents that occurred near 329 public CCTV cameras installed and monitored by the Dallas PD from 2014-2017. Quite a bit of the criminological research on CCTV cameras has examined crime reductions after CCTV installations, and it consistently finds a small decrease in crimes. Cameras are often argued to help solve cases though, e.g. catch the guy in the act. So we examined that in the Dallas data.

We did find evidence that CCTV increases case clearances on average, here is the graph showing the estimated clearances before the cameras were installed (based on the distance between the crime location and the camera), and the line after. You can see the bump up for the post period, around 2% in this graph and tapering off to an estimate of no differences before 1000 feet.

When we break this down by different crimes though, we find that the increase in clearances is mostly limited to theft cases. We also estimate the counterfactual of how many extra clearances the cameras were likely to cause. So based on our model, we can say something like: a case would have an estimated probability of clearance of 10% without a camera, but 12% with a camera. We can then do that counterfactual for many of the events around cameras, e.g.:

Probability No Camera   Probability Camera   Difference
    0.10                      0.12             + 0.02
    0.05                      0.06             + 0.01
    0.04                      0.10             + 0.06

And in this example for the three events, we calculate that the cameras increased the total expected number of clearances by 0.02 + 0.01 + 0.06 = 0.09. This marginal benefit varies across crimes, mostly depending on the distance to the camera, but it can also change based on when the crime was reported and some other covariates.
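
The arithmetic for the three example events can be checked in a few lines of Python. The probabilities are the ones from the table above; the variable names are just for illustration.

```python
# Expected extra clearances for the three example events in the table.
p_no_camera = [0.10, 0.05, 0.04]
p_camera    = [0.12, 0.06, 0.10]

# per-event marginal benefit of the camera
diffs = [pc - pn for pc, pn in zip(p_camera, p_no_camera)]

# summing the differences gives the expected number of additional clearances
extra_clearances = sum(diffs)
print(round(extra_clearances, 2))  # 0.09
```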

We do this exercise for all thefts nearby cameras post installation (over 15,000 in the Dallas data), and then get this estimate of the cumulative number of extra theft clearances we attribute to CCTV:

So even with 329 cameras and over a year of post data, we only estimate cameras resulted in fewer than 300 additional theft clearances. So it is unlikely that any reasonable cost-benefit analysis would suggest cameras are worthwhile for their benefit in clearing additional cases in Dallas.

For those without access to journals, we have the pre-print posted here. The analysis was not edited any from pre-print to published, just some front end and discussion sections were lightly edited over the drafts. Not sure why, but this pre-print is likely my most downloaded paper (over 4k downloads at this point) – even in the good journals when I publish a paper I typically do not get 1000 downloads.

To go on, complaint number 5631 about peer review – this took quite a while to publish because it was rejected on R&R from Justice Quarterly, and with me and Yeondae both having outside of academia jobs it took us a while to do revisions and resubmit. I am not sure the overall prevalence of rejects on R&R’s, I have quite a few of them though in my career (4 that I can remember). The dreaded send to new reviewers is pretty much guaranteed to result in a reject (pretty much asking to roll a Yahtzee to get it past so many people).

We then submitted to a lower journal, The American Journal of Criminal Justice, where we had reviewers who are not familiar with what counterfactuals are. (An irony of trying to go to a lower journal for an easier time, they tend to have much worse reviewers, so can sometimes be not easier at all.) I picked it up again a few months ago, and re-reading it thought it was too good to drop, and resubmitted to the Journal of Experimental Criminology, where the reviews were reasonable and quick, and Wesley Jennings made fast decisions as well.

by Andy Wheeler on July 1, 2021 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, R, scholarly, writing

Tagged CCTV, Papers, peer-review

https://andrewpwheeler.com/2021/07/01/cctv-and-clearance-rates-paper-published/

Using google places API in criminology research?

In my ask me anything series, Thom Snaphaan, a criminologist at Ghent University writes in with this question (slightly edited by me):

I read your blog post on using the Google Places API for criminological research. I am interested in using these data in the context of my PhD research. Can I ask you some questions on this matter? We think Google Places might be a very rich data source, specifically the user ratings of places. (1) Is it allowed to use these data on a large scale (two large cities) for scientific research? (2) Is it possible to download a set without the limit of 1,000 requests per day? (3) Are there, in your experience, other (perhaps more interesting) data sources to conduct this study? Many thanks! Best, Thom

And for my responses to Thom,

For 1) I believe it is OK to use for research purposes. You are not allowed to download the data and resell it though.

For 2) The quotas for the Places API are now much larger: you get $200 in credit per month, which amounts to 100,000 API calls. So that should be sufficient even for a large city.

For 3) I do not know, I haven’t paid much attention to the different online apps that do user reviews. Here in the states we have another service called Yelp (mostly for restaurants), I am not sure if that has more reviews or not though.

One additional piece of information not commonly used in place based research (but I have seen it used some, e.g. Hipp, 2016; Perenzin-Askey et al., 2018) is the number of employees or sales volume at particular crime generators/attractors. This is not available via Google, but is via Reference USA or Lexis Nexis. For Dallas IIRC Reference USA had much better coverage (almost twice as many businesses), but I recently reviewed a paper that did boots on the ground validation for Google data in the Indian city of Chennai, and the validation for Google businesses was very high (Kuralarasan & Bernasco, 2021).

Answer in the comments if you think you have more helpful information on leveraging the place based user reviews in research projects.

In the past I have written about using various Google APIs, which I have used in my research for several different projects:

Google Places API, and here is a code snippet with some functions to scrape places data (although would need to be updated to include the atmosphere ratings)
Google Distance API
Google Streetview API (Address Based, Running Down Street)
Google Geocoding API
Google Vision API with streetview images

Google has new pricing now, where you get $200 in credits per month per API. But overall with the Places and Streetview APIs you get a crazy ton of potential calls, so they will work for most research projects. Looking it over I actually don’t think I have used Google Places data in any projects; in Wheeler & Steenbeek, 2021 I use Reference USA and some other sources.

Geocoding and distance API limits are tougher, I ended up accidentally charging myself ~$150 for my work with Gio on gunshot fatalities (Circo & Wheeler, 2021) calculating network distance and approximate drive times. The vision API is also quite low (1000 per month), so will need to budget/plan if you need those services for your project. Geocoding you should be able to find alternatives, like the census geocoder (R, python) and then only use google for the leftovers.
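
For budgeting, a quick back-of-the-envelope sketch in Python can help. The monthly allowances below are the figures quoted in this post (and may well be out of date); the function name and the project sizes are hypothetical.

```python
import math

# Monthly call allowances as quoted in the post (may be out of date).
monthly_quota = {'places': 100_000, 'vision': 1_000}

def months_needed(n_calls, api):
    # months required to stay inside the monthly allowance for that API
    return math.ceil(n_calls / monthly_quota[api])

# hypothetical project: 250,000 places lookups and 4,500 images to label
print(months_needed(250_000, 'places'))  # 3
print(months_needed(4_500, 'vision'))    # 5
```

So a project that needs a few thousand Vision API calls would either have to be spread over several months or budgeted for explicitly.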

References

Circo, G. M., & Wheeler, A. P. (2021). Trauma Center Drive Time Distances and Fatal Outcomes among Gunshot Wound Victims. Applied Spatial Analysis and Policy, 14(2), 379-393.
Hipp, J. R. (2016). General theory of spatial crime patterns. Criminology, 54(4), 653-679.
Kuralarasan, K., & Bernasco, W. (2021). Location Choice of Snatching Offenders in Chennai City. Journal of Quantitative Criminology, Online First.
Perenzin-Askey, A., Taylor, R., Groff, E., & Fingerhut, A. (2018). Fast food restaurants and convenience stores: Using sales volume to explain crime patterns in Seattle. Crime & Delinquency, 64(14), 1836-1857.
Wheeler, A. P., & Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37(2), 445-480.

by Andy Wheeler on June 20, 2021 • Permalink

Posted in ask me anything, Crime Analysis, Criminal Justice

Tagged crime-mapping, google, google-places-api

https://andrewpwheeler.com/2021/06/20/using-google-places-api-in-criminology-research/

Open source code projects in criminology

TLDR; please let me know about open source code related criminology projects.

As part of my work with CrimRxiv, we have started the idea of creating a page to link to various open source criminology focused projects. That is overly broad, but high level here we are thinking for pragmatic resources (e.g. code repositories/packages, open source text books), as opposed to more traditional literature.

As part of our overlay journal we are starting, D1G1TAL & C0MPUTAT10NAL CR1M1N0L0GY, we are trying to get folks to submit open source work for a paper. (As a note, this will not have any charges to publish.) The motivation is two-fold: 1) this gives a venue to get your code peer reviewed (e.g. similar to the Journal of Open Source Software). This is mainly for the writer, to give academic recognition for your open source work. 2) Is for the consumer of the information, it is a nice place to keep up on current developments. If you write an R package to do some cool analysis I want to be aware of it!

For 2, we can accomplish something similar by just linking to current projects. I have started a spreadsheet of links I am collating for now, (in the future will update to this page, you need to be signed into CrimRxiv to see that list). For examples of the work I have collated so far:

Crime Analysis ArcGIS John Beck/Chris Delaney have started a video series on using ArcGIS crime analyst plug-in. (Even though ArcGIS is not open source, the tutorials are, so I am counting this.)
Grant Drawve has video tutorials on Youtube using Excel to conduct various crime analyses. (Again Excel is not open source, but the tutorials are.)
Jacob Kaplan’s Crime by the Numbers is an R tutorial.
Reka Solymosi & Juanjo Medina, Crime Mapping in R
Matt Ashby, crime open database and crimedata for managing the open crime data easier (both in R)
Jill Dando Institute JDI open resources has a bunch of different lectures on open science, such as Patricio Estévez-Soto has a tutorial on creating R packages

Then we have various R packages from folks floating around: Greg Ridgeway, Jerry Ratcliffe, Wouter Steenbeek (as well as the others I mentioned previously, you can check out their other projects on Github). Please add info to the Google spreadsheet, comment here, or send me an email if you have done some work (or know of work others have done) that should be added.

Again I want to know about your work!

by Andy Wheeler on June 8, 2021 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, R, writing

Tagged crimrxiv, open-science

https://andrewpwheeler.com/2021/06/08/open-source-code-projects-in-criminology/

Some ACS download helpers and Research Software Papers

The blog has been a bit sparse recently, as moving has been kicking my butt (hanging up curtains and recycling 100 boxes today!). So just a few quick notes.

Downloading ACS Data

First, I have posted some helper functions to work with American Community Survey data (ACS) in python. For a quick overview, if you import/define those functions, here is a quick example of downloading the 2019 Texas micro level files (for census tracts and block groups) from the census FTP site. You can pipe in another year (if available) and whatever state into the function.

# Python code to download American Community Survey data
# (get_acs5yr is one of the helper functions mentioned above)
import os

base = r'??????' #put your path here where you want to download data
temp = os.path.join(base,'2019_5yr_Summary_FileTemplates')
data = os.path.join(base,'tables')

get_acs5yr(2019,'Texas',base)

Some locations have census tract data to download, I think the FTP site is the only place to download block group data though. And then based on those files you downloaded, you can then grab the variables you want, and here I show selecting out the block groups from those fields:

interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B11003_013','B14006_002','B01001_003','B23025_005',
            'B22010_002','B16002_004','GEOID','NAME']
labs, comp_tabs = merge_tabs(interest,temp,data)
bg = comp_tabs['NAME'].str.find('Block Group') == 0

Then based on that data, I have an additional helper function to calculate proportions given two lists of the numerators and denominators that you want:

top = ['B17010_002',['B11003_016','B11003_013'],'B08141_002']
bot = ['B17010_001',        'B11002_001'       ,'B08141_001']
nam = ['PovertyFamily','SingleHeadwithKids','NoCarWorkers']
prep_sdh = prop_prep(bg, top, bot, nam)

So here to do Single Headed Households with kids, you need to add in two fields for the numerator ['B11003_016','B11003_013']. I actually initially did this example with census tract data, so not sure if all of these fields are available at the block group level.
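
For readers without the helper functions handy, the numerator/denominator logic can be sketched in plain Python. This is a hypothetical stand-in, not the actual prop_prep helper: the function name proportions, the list-of-dicts input, and the toy counts are all assumptions. Only the ACS field names come from the example above.

```python
# Hypothetical stand-in for a proportion helper; NOT the author's prop_prep.
def proportions(rows, top, bot, names):
    # rows: list of dicts keyed by ACS field name
    # top/bot entries: a single field name, or a list of fields to sum
    def total(row, fields):
        if isinstance(fields, str):
            fields = [fields]
        return sum(row[f] for f in fields)

    out = []
    for row in rows:
        res = {}
        for t, b, n in zip(top, bot, names):
            den = total(row, b)
            # areas with a zero denominator get None instead of an error
            res[n] = total(row, t) / den if den > 0 else None
        out.append(res)
    return out

# toy counts for two areas (made-up numbers, real ACS field names)
toy = [
    {'B11003_016': 10, 'B11003_013': 2, 'B11002_001': 100,
     'B08141_002': 20, 'B08141_001': 80},
    {'B11003_016': 0, 'B11003_013': 0, 'B11002_001': 0,
     'B08141_002': 0, 'B08141_001': 0},
]
top = [['B11003_016', 'B11003_013'], 'B08141_002']
bot = ['B11002_001', 'B08141_001']
nam = ['SingleHeadwithKids', 'NoCarWorkers']

res = proportions(toy, top, bot, nam)
print(res)
```

The second toy area has zero households, which is common for small geographies like block groups, so the sketch returns None there rather than dividing by zero.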

I have been doing some work on demographics looking at the social determinants of health (see SVI data download, definitions), hence the work with census data. I have posted my prior example fields I use from the census, but criminologists may just use the social-vulnerability-index from the CDC – it is essentially the same as how people typically define social disorganization.

Peer Review for Criminology Software

Second, jumping the gun a bit on this, but in the works is an overlay journal for CrimRxiv. Part of the contributions we will accept are software contributions, e.g. if you write an R package to do some type of analysis function common in criminology.

It is still in the works, but we have some details up currently and a template for submission (I need to work on a markdown template, currently just a word doc). High level I wanted something like the Journal of Statistical Software or the Journal of Open Source Software (I do not think the level of detail of JSS is necessary, but wanted an example use case, which JoSS does not have).

Just get in touch if you have questions whether your work is on topic. Aim is to be more open to contributions at first. Really excited about this, as publicly sharing code is currently a thankless prospect. Having a peer reviewed venue for such code contributions for criminologists fills a very important role that traditional journals do not.

Future Posts?

Hopefully I can steal some time to continue writing posts here and there, but I will definitely be busy getting the house in order in the next month. Mapping grids and KDE in python/geopandas, and the relationship between healthcare data and police incident report data, are two topics I hope to get some time to work on in the near future for the blog.

If folks have requests for particular topics on the blog though feel free to let me know in the comments or via email!

by Andy Wheeler on April 22, 2021 • Permalink

Posted in Criminal Justice, data science, Python, scholarly

Tagged census, research-software

https://andrewpwheeler.com/2021/04/22/some-acs-download-helpers-and-research-software-papers/

Costs and Benefits and CrimeSolutions.gov

The Trace the other day presented an article giving a bit of (superficial overall in the end) critique of CrimeSolutions.gov. They are right in that the particular scenario with the Bronx defenders office highlights the need for a change in the way content aggregators like CrimeSolutions present overall recommendations. I have reviewed for CrimeSolutions, and I think they did a reasonable job in creating a standardized form, but will give my opinion here about how we can think about social programs like the Bronx defenders program beyond the typical null hypothesis significance testing – we need to think about overall costs and benefits of the programs. The stat testing almost always just focuses on the benefits part, not the cost part.

But first before I go into more details on CrimeSolutions, I want to address Thomas Abt’s comments about potential political interference in this process. This is pizzagate level conspiracy theory nonsense from Abt. So the folks reviewing for Crime Solutions are other professors like me (or I should more specifically say I was a former professor). I’d like to see the logic from Abt how Kate Bowers, a professor at University College London, is compromised by ties to Donald Trump or the Republican Party.

Us professors get a standardized form to fill in the blank on the study characteristics, so there is no reasonable way that the standardized form biases reviews towards any particular political agenda. They are reviewed by multiple people (e.g. if I disagree with another researcher, we have emails back and forth to hash out why we had different ratings). So it not only has to be individuals working for the man, but collusion among many of us researchers to be politically biased like Abt suggests.

The only potential way I can see any political influence in the process is if people at DSG selectively choose particular studies. (This would only make sense though to say promote more CJ oriented interventions over other social service type interventions.) Since anyone can submit a study (even non US ones!), I am highly skeptical political bias happens in that aspect either. Pretty sure the DSG folks want people to submit more studies FYI.

FYI Abt’s book Bleeding Out is excellent, not sure why he is spouting this nonsense about politics in this case though. So to be clear, claiming political bias in these reviews is total nonsense, but of course the current implementation of the CrimeSolutions final end recommendation could be improved. (I really like the Trace as well, have talked to them before over Gio’s/my work on shooting fatalities, this article however doesn’t have much meat to critique CrimeSolutions beyond some study authors are unhappy and Abt’s suggestion of nefarious intentions.)

How does CrimeSolutions work now?

At a high level, CrimeSolutions wants to be a repository for policy makers to help make simple decisions on different policy decisions – what I take as a totally reasonable goal. So last I knew, they had five different end results a study could fall into (I am probably violating some TOS here sharing this screenshot but whatever, we do a lot of work filling in the info as a reviewer!). These include Effective, Promising, Ineffective, Null Effect, and Inconclusive.

You get weights based on not only the empirical evidence presented, but aspects of the design itself (e.g. experiments are given a higher weight than quasi-experiments), the outcomes examined (shorter time periods less weight than longer time periods), the sample size, etc. It also includes fuzzy things like description of the program (enough to replicate), and evidence presented of adherence to the program (which gets the most points for quantitative evidence, but has categories for qualitative evidence and no evidence of fidelity as well).

So Promising is basically some evidence that it works, but the study design is not the strongest. You only get null effect if the study design is strong and there were no positive effects found. Again I mean ‘no positive effects’ in the limited sense that there are crime end points specified, e.g. reduced recidivism, overall crime counts in an area, etc. (it is named CrimeSolutions). But there can of course be other non-crime beneficial aspects to the program (which is the main point of this blog post).

When I say at the beginning that the Trace article is a bit superficial, it doesn’t actually present any problems with the CrimeSolutions instrument beyond the face argument hey I think this recommendation should be different! If all you take is someone not happy with the end result we will forever be unhappy with CrimeSolutions. You can no doubt ex ante make arguments all day long why you are unhappy for any idiosyncratic reason. You need to objectively articulate the problems with the CrimeSolutions instrument if you want to make any progress.

So I can agree that the brand No Effect for the Bronx defenders office does not tell the whole story. I can also say how the current CrimeSolutions instrument fails in this case, and can suggest solutions about how to amend it.

Going Beyond p-values

So in the case of the Bronx Defenders analysis, what happens is that the results are not statistically significant in terms of crime reductions. Also because it is a large sample and well done experimental design, it unfortunately falls into the more damning category of No Effects (Promising or Inconclusive are actually more uncertain categories here).

One could potentially switch the hypothesis testing on its head and do non-inferiority tests to somewhat fit the current CrimeSolutions mold. But I have an approach I think is better overall – to evaluate the utility of a program, you need to consider both its benefits (often here we are talking about some sort of crime reduction), as well as its costs:

Utility = Benefits - Costs

So here we just want Benefits > Costs to justify any particular social program. We can draw this inequality as a diagram, with costs and benefits as the two axes (I will get to the delta triangle symbols in a minute). Any situation in which the benefits are greater than the costs, we are on the good side of the inequality – the top side of the line in the diagram. Social programs that are more costly will need more evidence of benefits to justify investment.

Often we are not examining a program in a vacuum, but are comparing this program to another counter-factual, what happens if that new proposed program does not exist?

Utility_a = Benefits_a - Costs_a : Program A's utility
Utility_s = Benefits_s - Costs_s : Status Quo utility

So here we want in the end for Utility_a > Utility_s – we would rather replace the current status quo with whatever this program is, as it improves overall utility. It could be the case that the current status quo is do nothing, which in the end is Utility_s = Benefits_s - Costs_s = 0 - 0 = 0.

It could also be the case that even if Benefits_a > Costs_a, that Utility_a < Utility_s – so in that scenario the program is beneficial, but is worse in overall utility than the current status quo. So in that case even if rated Effective in current CrimeSolutions parlance, a city would not necessarily be better off ponying up the cash for that program. We could also have the situation Benefits_a < Costs_a but Utility_a > Utility_s – that is, the program is net negative on its own, but it still has better utility than the current status quo.

So to get whether the new proposed program has added utility over the status quo, we take the difference in two equations:

  Utility_a = Benefits_a - Costs_a : Program A's utility
- Utility_s = Benefits_s - Costs_s : Status Quo utility
--------------------------------------------------------
Δ Utility = Δ Benefits - Δ Costs

And we end up with our changes in the graph I showed before. Note that this implies a particular program can actually have negative effects on crime control benefits, but if it reduces costs enough it may be worth it. For example Megan Stevenson argues pre-trial detention is not worth the costs – although it no doubt will increase crime some, it may not be worth it. Although Stevenson focuses on harms to individuals, she may even be right just in terms of straight up costs of incarceration.
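
The difference of the two utility equations can be written out as a small Python sketch. The function names and all of the numbers are made up for illustration (think of them as dollars in thousands); the point is only that a program with lower crime control benefits can still come out ahead if it is cheap enough.

```python
# Sketch of the utility comparison; all figures are made up.
def utility(benefits, costs):
    return benefits - costs

def delta_utility(program, status_quo):
    # program and status_quo are (benefits, costs) tuples
    d_benefits = program[0] - status_quo[0]
    d_costs = program[1] - status_quo[1]
    return d_benefits - d_costs

status_quo = (100, 80)   # utility of 20
program_a  = (90, 50)    # fewer crime benefits, but much cheaper: utility of 40

print(delta_utility(program_a, status_quo))  # 20
```

Here the change in benefits is -10 and the change in costs is -30, so the change in utility is positive even though the program does worse on crime.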

For the Bronx defenders analysis, they showed no benefits in terms of reduced crime. But the intervention was a dramatic cost savings compared to the current status quo. I represent the Bronx defenders results as a grey box in the diagram. It is centered on the null effects for crime benefits, but is clearly in the positive utility part of the graph. If it happened that it was expensive or no difference in costs though, the box would shift right and not clearly be in the effective portion.

For another example, I show the box as not a point in this graph, but an area. An intervention can show some evidence of efficacy, but not reach the p-value < 0.05 threshold. The Chicago summer jobs program is an example of this. It is rated as no effects. I think DSG could reasonably up the sample size requirement for individual recidivism studies, but even if this was changed to the promising or inconclusive recommendation in CrimeSolutions parlance the problem still remains by having a binary yes/no end decision.

So here the box has some uncertainty associated with it in terms of the benefits, but has more area on the positive side of the utility line. (These are just generic diagrams, not meant to be an exact representation, it could be more area of the square should be above the positive utility line given the estimates.) If the authors want to argue that the correct counter-factual status quo is more expensive – so it would shift the pink box to the left – it could as is be a good idea to invest in more. Otherwise it makes sense for the federal govt to invest in more research programs trying to replicate, although from a local govt perspective may not be worth the risk to invest in something like this given the uncertainty. (Just based on the Chicago experiment it probably would be worth the risk for a local govt IMO, but I believe overall jobs and crime programs have a less than stellar track record.)

So these diagrams are nice, but they leave implicit how CrimeSolutions would in practice measure costs to put a program on the diagram. Worst case scenario costs are totally unknown (so would span the entire X axis here), but in many scenarios I imagine people can give reasonable estimates of the costs of social programs. So I believe a simple solution to the current CrimeSolutions issue is two-fold:

They should incorporate costs somewhere into their measurement instrument. This could either be as another weighted term in the Outcome Evidence/Primary Outcomes portion of the instrument, or as another totally separate section.
It should have breakdowns on the website that are not just a single final decision endpoint, but show a range of potential results in a diagram like I show here. So while not quite as simple as the binary yes/no in the end, I believe that policy makers can handle that minor bit of added level of complexity.

Neither of these will make CrimeSolutions foolproof – but better to give suggestions to improve it than to suggest getting rid of it completely. I can foresee issues of defining in this framework what the relevant costs are. So the Stevenson article I linked to earlier talks about individual harm; it may be someone can argue that is not the right cost to calculate (and could do something like a willingness to pay experiment). But that goes for the endpoint outcomes as well – we could argue whether or not they are reasonable for the situation too. So I imagine the CrimeSolutions/DSG folks can amend the instrument to take these cost aspects into account.

by Andy Wheeler on April 6, 2021 • Permalink

Posted in Criminal Justice, scholarly

Tagged CrimeSolutions, Decision-Analysis, Policy-Analysis, utility

https://andrewpwheeler.com/2021/04/06/costs-and-benefits-and-crimesolutions-gov/

Health Insurance Claims Data via HMS Data Sharing for Researchers

I have been sharing this with a bunch of people recently so figured it would be appropriate to share on the blog. My company, HMS, which audits health insurance claims has a data sharing agreement for researchers.

So this provides access to micro level Medicaid health insurance claims for a set of states. It includes 10 states currently:

It provides a limited set of person level info, provider level info (e.g. the hospital location of the claim), as well as all the info that comes with the insurance claim itself. Mostly folks will be interested in ICD codes associated with the claim I imagine, as well as maybe the CPT codes. (CPT are for particular procedures, whereas ICD are more like broader diagnoses for the overall visit.)

It is only criminology adjacent, and is tough because the coverage is limited to Medicaid for some research designs. But examples criminology folks may be interested in are say you could look for domestic violence ICD codes, or look at provider level behavior for opioid prescriptions, or mental health treatment claims, etc.

One of the things with criminology research is it is very hard to identify the costs of crime. Looking at victimization costs via health insurance claims may be an underestimate, but has a pretty clear societal cost. And the limited coverage to Medicaid will make cost estimates on the low side (although more directly relevant to the state, and among the most vulnerable population).

by Andy Wheeler on March 25, 2021 • Permalink

Posted in Criminal Justice, healthcare

Tagged claims-data

https://andrewpwheeler.com/2021/03/25/health-insurance-claims-data-via-hms-data-sharing-for-researchers/

How arrests reduce near repeats: Breaking the Chain paper published

My paper (with colleagues Jordan Riddell and Cory Haberman), Breaking the chain: How arrests reduce the probability of near repeat crimes, has been published in Criminal Justice Review. If you cannot access the peer reviewed version, always feel free to email and I can send an offprint PDF copy. (For those not familiar, it is totally OK/legal for me to do this!) Or if you don’t want to go to that trouble, I have a pre-print version posted here.

The main idea behind the paper is that crimes often have near-repeat patterns. That is, if you have a car break in on 100 1st St on Monday, the probability you have another car break in at 200 1st St later in the week is higher than typical. This is most often caused by the same person going and committing multiple offenses in a short time period. So a way to prevent that would on its face be to arrest the individual for the initial crime.

I estimate models showing the reduction in the probability of a near repeat crime if an arrest occurs, based on publicly available Dallas PD data (paper has links to replication code). Because near repeat in space & time is a fuzzy concept, I estimate models showing reductions in near repeats for several different space-time thresholds.

So here the model is Prob[Future Crime = I(time < t & distance < d)] ~ f[Beta*Arrest + sum(B_x*Control_x)] where the f function is a logistic function, and I plot the Beta estimates given different time and space look aheads. Points indicate statistical significance, so you can see the estimates tend to be negative across many different crime types and specifications (with a linear coefficient of around -0.3).
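
As a minimal sketch of that model form (this is not the paper's replication code), the snippet below simulates data and fits a logistic regression in base R. The variable names, the single control variable, and the simulated arrest effect of -0.3 are all hypothetical, chosen only to illustrate the formula above.

```r
# Hypothetical simulated data, NOT the Dallas PD data used in the paper
set.seed(10)
n <- 20000
arrest  <- rbinom(n, 1, 0.2)   # 1 if the initial crime ended in an arrest
control <- rnorm(n)            # stand-in for a single control variable

# Outcome: 1 if a follow-up crime occurs within time t and distance d
lin_pred    <- -1 - 0.3*arrest + 0.5*control
near_repeat <- rbinom(n, 1, plogis(lin_pred))

# f[] is the logistic function, so this is a binomial GLM with a logit link
mod <- glm(near_repeat ~ arrest + control, family=binomial(link="logit"))
beta_arrest <- coef(mod)["arrest"]
print(summary(mod)$coefficients)
```

In the paper the outcome is redefined for each time and distance threshold and the model is refit, which is how you end up with a grid of Beta estimates to plot.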

Part of the reason I pursued this is that the majority of criminal justice responses to near repeat patterns in the past were target hardening or traditional police patrol. Target hardening (e.g. when a break in occurs, go to the neighbors and tell them to lock their doors) does not appear to be effective, but traditional patrol does (see the work of Rachel/Robert Santos for example).

It seems to me that finding ways to increase arrest rates is a natural strategy worth exploring for police departments. Easier said than done, but one way may be to prospectively identify incidents that are likely to spawn near repeats and give them higher priority in assigning detectives. In many urban departments, lower level property crimes are never assigned a detective at all.

Open Data and Reproducible Criminology Research

This is part of a special issue put together by Jonathan Grubb and Grant Drawve on spatial approaches to community violence. Jon and Grant specifically asked contributors to discuss a bit about open data standards and replication materials. I repost my thoughts on that here in full:

In reference to reproducibility of the results, we have provided replication materials. This includes the original data sources collated from open sources, as well as python, Stata, and SPSS scripts used to conduct the near-repeat analysis, prepare the data, generate regression models, and graph the results. The Dallas Police Department has provided one of the most comprehensive open sources of crime data among police agencies in the world (Ackerman & Rossmo, 2015; Wheeler et al., 2017), allowing us the ability to conduct this analysis. But it also identifies one particular weakness in the data as well – the inability to match the time stamp of the occurrence of an arrest to when the crime occurred. It is likely the case that open data sources provided by police departments will always need to undergo periodic revision to incorporate more information to better the analytic potential of the data.

For example, much analysis of the arrest and crime relationship relies on either aggregate UCR data (Chamlin et al., 1992), or micro level NIBRS data sources (Roberts, 2007). But both of these data sources lack specific micro level geographic identifiers (such as census tract or addresses of the events), which precludes replicating the near repeat analysis we conduct. If however NIBRS were to incorporate address level information, it would be possible to conduct a wide spread analysis of the micro level deterrence effects of arrests on near repeat crimes across many police jurisdictions. That would allow much broader generalizability of the results, and not be dependent on idiosyncratic open data sources or special relationships between academics and police departments. Although academic & police practitioner relationships are no doubt a good thing (for both police and academics), limiting the ability to conduct analysis of key policing processes to the privileged few is not.

That being said, currently both for academics and police departments there are little to no incentives to provide open data and reproducible code. Police departments have some slight incentives, such as assistance from governmental bodies (or negative conditions for funding conditional on reporting). As academics we have zero incentives to share our code for this manuscript. We do so simply because that is a necessary step to ensure the integrity of scientific research. Relying on the good will of researchers to share replication materials has the same obvious disadvantage that allowing police departments to pick and choose what data to disseminate does – it can be capricious. What a better system to incentivize openness may look like we are not sure, but both academics and police no doubt need to make strides in this area to be more professional and rigorous.

by Andy Wheeler on March 11, 2021 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, Papers, scholarly

Tagged near-repeat

https://andrewpwheeler.com/2021/03/11/how-arrests-reduce-near-repeats-breaking-the-chain-paper-published/

Podcast and Video Shout Outs

So y’all know I really enjoy blogs. So much so I think they often have a higher value added than traditional peer review papers. There are other mediums I would like to recognize, and those are Podcasts and video tutorials. So while I like to do lab tutorials (pretty much like my blog posts in which I step through some code), I know many students would prefer I do videos and lectures. And I admit some of these I have seen done quite well on Coursera for example.

Another source I have been consuming quite a bit lately is podcasts. These often take the form of an interview, so they are not technical in nature but are softer storytelling, such as talking about a particular topical area the interviewee is expert in, or that person’s career path. So here is my list of the resources I have personally learned from and enjoyed.

I have not listened to/watched 100% of the offerings for any of these, but have listened to/watched multiple episodes (and will continue to listen/watch more)! These are very criminal justice focused, so I would love to branch out to data science and health care resources if folks have suggestions!

Podcasts

Reducing Crime – Jerry Ratcliffe interviews a mix of academics and folks working in the criminal justice field. I have quite a few of these episodes I found personally very informative. John Eck, Kim Rossmo, and Phil Goff were perhaps my favorites of academics. Danny Murphy and Thomas Abt were really good as well (for my favorite non-academics offhand).

Niro Knowledge – Nicholas Roy is a current crime analyst, and interviews other crime analysts and academics. Favorite interviews so far are Cynthia Lum and Renee Mitchell. Similar to Reducing Crime, but typically more focused on a particular topic of interest to the person being interviewed (e.g. Renee talked about her work on crime harm indices).

Analyst Talk – This is a podcast hosted by Jason Elder where he interviews crime analysts from all over about their careers. The interviews with Annie Thompson and my former colleague Shelagh Dorn are my favorites so far, but I also need to listen in sometime on Sean Bair’s series of talks as well.

Abt Podcasts – This I only came across a week ago, but have listened to several on data science, CJ, and social determinants of health. These are a bit different than the other podcasts here, they are shorter and have two individuals from different fields discuss social science relevant to the chosen topic.

Videos

Canadian Society of Evidence Based Policing – Has many interviews of academics in crim/cj. I have an interview with them (would not recommend, I need to work on sitting still!). The Peter Neyroud interview is my favorite though.

UARK CASDAL – These are instructional videos uploaded by Grant Drawve, mostly around doing crime analysis in Excel, but also has a few in ArcGIS.

StatQuest with Josh Starmer – This is one of the few non crim/cj examples I watch regularly. As interview questions at my work place for entry data scientists we often ask folks to explain machine learning models (such as random forests or XGBoost) in some simple terms. These videos are excellent resources to get you to understand the basics of the mathematics behind the techniques.

Again, let me know in the comments if there are podcasts/video series I am missing out on!

by Andy Wheeler on February 25, 2021 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, scholarly

Tagged podcast, video-tutorial

https://andrewpwheeler.com/2021/02/25/podcast-and-video-shout-outs/

The spatial dispersion of NYC shootings in 2020

If you had asked me at the start of widespread Covid lockdown measures what the effect would be on crime, I am pretty sure I would have guessed it would make crime go down. Fewer people out and about causes fewer interactions that can lead to a crime. That isn’t how it has shaped up though; quite a few places have seen increases in serious violent crime. One of the most dramatic examples of this is that shootings in NYC doubled from 900 in 2019 to over 1800 in 2020. I am going to show how to generate this chart later via some R code, but it is easier to show than to say. NYPD’s open data on shootings (historical, current) go back to 2006.

I know I am critical on this site of folks overinterpreting crime increases, for example going from 20 to 35 is pretty weak evidence of an increase given the inherent variance for low count Poisson data (a Poisson e-test has a p-value of 0.04 in that case). But going from 900 to 1800 is a much clearer signal.
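
To make that comparison concrete, here is a rough sketch using base R's `poisson.test`. Note this is the exact conditional (binomial) comparison of two Poisson counts, not the e-test mentioned above, so its p-value for 20 vs 35 will not match the 0.04 figure exactly. The object names are just for illustration.

```r
# Conditional exact test comparing two Poisson counts (base R, stats package)
# This is NOT the e-test, so p-values differ somewhat from the one quoted above
small_change <- poisson.test(c(35, 20))     # going from 20 to 35
big_change   <- poisson.test(c(1800, 900))  # going from 900 to 1800

print(small_change$p.value)
print(big_change$p.value)
```

The first comparison should come out borderline, while the second leaves no real doubt that the underlying rate changed.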

Jerry Ratcliffe recently posted an R library to do his crime dispersion analysis, so I figured this would be an excellent example use case. The idea behind this analysis is spatial – we know there is a crime increase, but did the increase happen everywhere, or did it just happen in a few locations. Here I am going to use the NYPD shooting data aggregated at the precinct level to test this.

As another note, while I often use micro-spatial units of analysis in my work, this method, along with others (such as the sppt test), are just not going to work out for very low count, very tiny spatial units of analysis. I would suggest offhand to only do this analysis if the spatial units of analysis under study have an average of at least 10 crimes per area in the pre time period. Which is right about on the mark for the precinct analysis in NYC.

Here is the data and R code to follow along, below I will give a walkthrough.

Crime increase dispersion analysis in R

So first as some front matter, I load in my libraries (Jerry’s crimedispersion you can install from github via devtools, see his page for an example), and the function I define here I’ve gone over in a prior blog post of mine as well.

###############################
library(ggplot2)
library(crimedispersion)

# Increase contours, see https://andrewpwheeler.com/2020/02/21/some-additional-plots-to-go-with-crime-increase-dispersion/
make_cont <- function(pre_crime,post_crime,levels=c(-3,0,3),lr=10,hr=max(pre_crime)*1.05,steps=1000){
    #calculating the overall crime increase
    ov_inc <- sum(post_crime)/sum(pre_crime)
    #Making the sequence on the square root scale
    gr <- seq(sqrt(lr),sqrt(hr),length.out=steps)^2
    cont_data <- expand.grid(gr,levels)
    names(cont_data) <- c('x','levels')
    cont_data$inc <- cont_data$x*ov_inc
    cont_data$lines <- cont_data$inc + cont_data$levels*sqrt(cont_data$inc)
    return(as.data.frame(cont_data))
}

my_dir <- 'D:\\Dropbox\\Dropbox\\Documents\\BLOG\\NYPD_ShootingIncrease\\Analysis'
setwd(my_dir)
###############################

Now we are ready to import our data and stack them into a new data frame. (These are individual incident level shootings, not aggregated. If I ever get around to it I will do an analysis of fatality and distance to emergency rooms like I did with the Philly data.)

###############################
# Get the NYPD data and stack it
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Year-To-Date-/5ucz-vwe8
# And https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8
# On 2/1/2021
old <- read.csv('NYPD_Shooting_Incident_Data__Historic_.csv', stringsAsFactors=FALSE)
new <- read.csv('NYPD_Shooting_Incident_Data__Year_To_Date_.csv', stringsAsFactors=FALSE)

# Just one column off
print( cbind(names(old), names(new)) )
names(new) <- names(old)
shooting <- rbind(old,new)
###############################

Now we just want to do aggregate counts of these shootings per year and per precinct. So first I substring out the year, then use table to get aggregate counts in R, then make my nice time series graph using ggplot.

###############################
# Create the current year and aggregate
shooting$Year <- substr(shooting$OCCUR_DATE, 7, 10)
year_stats <- as.data.frame(table(shooting$Year))
year_stats$Year <- as.numeric(as.character(year_stats$Var1))
year_plot <- ggplot(data=year_stats, aes(x=Year,y=Freq)) + 
             geom_line(size=1) + geom_point(shape=21, colour='white', fill='black', size=4) +
             scale_y_continuous(breaks=seq(900,2100,by=100)) +
             scale_x_continuous(breaks=2006:2020) +
             theme(axis.title.x=element_blank(), axis.title.y=element_blank(),
                   panel.grid.minor = element_blank()) + 
             ggtitle("NYPD Shootings per Year")

year_plot
# Not quite the same as Petes, https://copinthehood.com/shooting-in-nyc-2020/
###############################

Part of the reason I do this is not because I don’t trust Pete’s analysis, but because I don’t want to embed pictures from someone else’s website! So I wanted to recreate the time series graph myself. Next up we need to do the same aggregating, not for the whole city but by each precinct. You can use the same table method again, but simply pass in additional columns. That gets you the data in long format, so then I reshape it to wide for later analysis (so each row is a single precinct and each column is a yearly count of shootings). (Note there have been some splits in precincts over the years IIRC. I don’t worry about that here, it will just cause those precincts to be 0,0 in the 2019/2020 data I look at.)

###############################
#Now aggregating to year and precinct
counts <- as.data.frame(table(shooting$Year, shooting$PRECINCT))
names(counts) <- c('Year','PCT','Count')
# Reshape long to wide
count_wide <-  reshape(counts, idvar = "PCT", timevar = "Year", direction = "wide")
###############################
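
If the long to wide step is unclear, here is the same table and reshape pattern on a tiny made-up data frame (the `toy` objects are hypothetical and only for illustration). One nice side effect of using `table` is that year and precinct combinations with no shootings get an explicit count of 0.

```r
# Toy incident-level data (hypothetical), one row per shooting
toy <- data.frame(Year=c('2019','2019','2020','2020','2020','2020','2020'),
                  PRECINCT=c(1,2,1,1,1,2,3))

# Long format: one row per Year x Precinct combination (zeros included)
toy_counts <- as.data.frame(table(toy$Year, toy$PRECINCT))
names(toy_counts) <- c('Year','PCT','Count')

# Wide format: one row per precinct, columns Count.2019 and Count.2020
toy_wide <- reshape(toy_counts, idvar="PCT", timevar="Year", direction="wide")
print(toy_wide)
```

The `Count.2019` and `Count.2020` column names that `reshape` generates are the ones used in the analysis that follows.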

And now we can give Jerry’s package a test run, where you just pass it your variable names.

# Jerrys function for crime increase dispersion
output <- crimedispersion(count_wide, 'PCT', 'Count.2019', 'Count.2020')
output

The way to understand this is in a hypothetical world in which we could reduce shootings in one precinct at a time, we would need to reduce shootings in 57 of the 77 precincts to reduce 2020 shootings to 2019 levels. So this suggests very widespread increases, it isn’t just concentrated among a few precincts.
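
To illustrate the logic of that hypothetical, here is a simplified sketch of my reading of the idea. It is not the source code of the crimedispersion package, and the function name and example counts are made up. It assumes "reducing" an area means resetting its post count back to its pre count, starting with the areas that had the largest increases.

```r
# Simplified sketch of the dispersion logic, NOT the crimedispersion source code
# Assumption: 'reversing' an area means resetting its post count to its pre count
n_areas_to_reverse <- function(pre, post){
    diff <- sort(post - pre, decreasing=TRUE)  # largest increases first
    total_inc <- sum(post) - sum(pre)          # overall increase to erase
    if (total_inc <= 0) return(0)              # no overall increase to explain
    erased <- cumsum(diff)                     # increase erased after k areas
    which(erased >= total_inc)[1]              # first k that wipes it all out
}

# Hypothetical counts for five areas
pre  <- c(10, 20, 15, 30, 5)
post <- c(20, 25, 14, 45, 6)
n_areas_to_reverse(pre, post)
```

If the increase were concentrated in one or two areas this number would be small relative to the number of areas, whereas a widespread increase needs most areas to be reversed.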

Another graph I have suggested to explore this, while taking into account the typical variance with Poisson count data, is to plot the pre crime counts on the X axis, and the post crime counts on the Y axis.

###############################
# My example contour with labels
cont_lev <- make_cont(count_wide$Count.2019, count_wide$Count.2020, lr=5)

eq_plot <- ggplot() + 
           geom_line(data=cont_lev, color="darkgrey", linetype=2, 
                     aes(x=x,y=lines,group=levels)) +
           geom_point(data=count_wide, shape = 21, colour = "black", fill = "grey", size=2.5, 
                      alpha=0.8, aes(x=Count.2019,y=Count.2020)) +
           scale_y_continuous(breaks=seq(0,140,by=10)) +
           scale_x_continuous(breaks=seq(0,70,by=5)) +
           coord_cartesian(ylim = c(0, 140)) +
           xlab("2019 Shootings Per Precinct") + ylab("2020 Shootings")
eq_plot
###############################

The contour lines show the hypothesis that crime increased (by around 100% here). So if a point is near the middle line, it follows that doubled mark almost exactly. The upper/lower lines indicate the typical variance, which is a very good fit to the data here you can see. Very few points are outside the boundaries.
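
The same contours can be turned into a per-area check. This sketch uses the formula inside `make_cont` (expected count plus or minus 3 times its square root) to flag areas that fall outside the bands. The counts are made up for illustration, not the NYPD data.

```r
# Hypothetical precinct counts, only to illustrate the calculation
pre  <- c(12, 30, 8, 45, 20)
post <- c(25, 58, 35, 92, 41)

ov_inc   <- sum(post)/sum(pre)         # overall increase ratio, as in make_cont
expected <- pre*ov_inc                 # expected post count under a uniform increase
z <- (post - expected)/sqrt(expected)  # Poisson variance equals the mean

# TRUE for areas outside the +/- 3 contour lines
outside <- abs(z) > 3
print(data.frame(pre, post, expected=round(expected,1), z=round(z,2), outside))
```

Areas flagged here are the ones that would plot above the upper line or below the lower line in the graph.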

Both of these analyses point to the fact that shooting increases were widespread across NYC precincts. Pretty much everywhere doubled in the number of shootings, it is just some places had a larger baseline to double than others (and the data has some noise, you can pick out some places that did not increase if you cherry pick the data).

And as a final R note, if you want to save these graphs as a nice high resolution PNG, here is an example with Jerry’s dispersion object:

# Saving dispersion plot as a high res PNG
png(file = "ODI.png", bg = "transparent", height=5, width=9, units="in", res=1000, type="cairo")
output #this is the object from Jerrys crimedispersion() function earlier
dev.off()

Going forward I am wondering if there is a good way to do spatial monitoring for crime data like this, like some sort of control chart that takes into account both space and time. So it isn’t a retrospective recap a year later, but identifies spatial increases in near real time.

Other References of Interest

Justin Nix & company have a few blog posts looking at NYC data as well. In the first they talk about the variance in cities, many are up but several are down as well in violence. A later post though updated with the clear increase in shootings in NYC.
There are too many papers at this point for me to do a bibliography of all the Covid and crime updates, but two open examples are Matt Ashby’s paper on several US cities, and Campedelli et al.’s analysis of Chicago. Each shows variance again, so no universal up or down in trends, but various examples of increases or decreases both between cities and between different crime types within a city.

by Andy Wheeler on February 2, 2021 • Permalink

Posted in Crime Analysis, Criminal Justice, ggplot2, R

Tagged NYC, Poisson, shootings

https://andrewpwheeler.com/2021/02/02/the-spatial-dispersion-of-nyc-shootings-in-2020/
