Dashboards are often not worth the effort

When end users see dashboards, they often think “this is really whiz-bang cool, let’s do that for our data”. There are two issues I commonly see with dashboards though. One is that the task to be accomplished with the dashboard is not well defined, so even a visually well done dashboard mostly goes unused. The second is that there are a ton of headaches deploying dashboards with real data – and the effort to do it right is often not worth it.

For the first part, consider a small agency that has a simple crime trends dashboard. The intent is to identify anomalous upticks in crime. This requires someone to log into the dashboard, click around the different crime trends, and visually check whether they are higher than expected. It would be easier to either have an automated alert when some threshold is met, or a standardized report (e.g. once a week) that is emailed out for review.

This post is not even going to be about ‘most dashboards show stupid data’ or ‘the charts show the data in totally inappropriate ways’. Even in cases where you can build a nice dashboard, the complexity level is IMO not worth it in many situations I have encountered. Automating alerts and building regular standard reports is, for the vast majority of situations, in my opinion a better solution for data products. The whiz-bang ability to interactively click on stuff will only be used by a very small number of people (who can often build stuff like that for themselves anyway).

So onto the second part, deploying dashboards with real data that others can interact with. Many of the cool example dashboards you see online do tricks that would be inappropriate for a production dashboard. So even if someone says ‘yeah, I can do a demo dashboard in a day on my laptop’, there is still a ton more work to expose that dashboard to different individuals, which is probably the ultimate goal.

Dashboards have two parts: A) connecting to a source database, then B) exposing the user-interface to outside parties. Seems simple right? Wrong.

Let’s start with part A, connecting to a source database. Many dashboard examples you see online cheat at this stage – they embed a static version of the data in the dashboard itself. If Gary needs to re-upload the data to the dashboard every week, what happens when Gary goes on vacation or takes a day off? You now have an out-of-date dashboard.

It is quite hard to manage connections between live source data and a dashboard. For a public facing dashboard, I would just host a public file somewhere on the internet (so anyone can download the source data), point the dashboard to that source, and have some automated process update the file. This solves the Gary-went-on-vacation factor at least, and there is no security risk, since you intentionally only upload data that can be disseminated to the public.
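
To make that concrete, here is a minimal sketch of the “point at a public file” pattern, assuming the agency publishes a CSV at a stable URL (the URL and column name below are placeholders, not a real endpoint):

import pandas as pd

# hypothetical public file that a scheduled job overwrites every week
DATA_URL = "https://example.com/open-data/weekly_crime_counts.csv"

def load_data():
    # the dashboard re-reads the public CSV on refresh; no credentials,
    # no database connection, and nothing for Gary to manually upload
    return pd.read_csv(DATA_URL, parse_dates=["week"])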

One potential alternative for a public facing dashboard is to make it serverless. This often embeds the dashboard (and maybe the data) into the webpage itself (e.g. pyscript/wasm), so it may be slow to start up, but it works reasonably well if you can handle a one minute lag. You don’t need to worry about malicious actors in that scenario, as the heavy computation is done on the client’s computer, not your server. I have an example on CRIME De-Coder (note it is currently not up to date – my code is running fine, but Dallas, post cyber-attack, has not been updating its public data).

Managing a direct connection to your source data in a public facing dashboard is not a good idea, as malicious actors can spam your dashboard. This denial of service attack will not only make your dashboard unresponsive, but will also eat up your database processing. (Big companies often have a reporting database server vs a production database server in the best case scenario, but this requires resources most public sector agencies do not have.)

The solution to this is to limit who can access the dashboard, part B above. Unfortunately, with the majority of dashboard software, once you want a live connection and/or to limit who can see the information, you are in the ‘need to pay for this’ scenario. A very unfortunate aspect of ‘you need to pay for this’ is that most of these vendors charge per viewer – it isn’t a flat fee. PowerBI is maybe OK if your organization already pays for Sharepoint; Tableau licensing is so painful I wouldn’t suggest it.

So what is the alternative? You are now in charge of spinning up your own server so people can click a dropdown menu and generate a line graph. Again you need to worry about security, but at least if you can hide behind a local network or VPN it is probably doable for most police departments.

I don’t want to say dashboards are never worth it. I think public facing dashboards are a good thing for police transparency, and if done right are easier to implement than private ones. Private ones are doable as well (especially if limiting to intranet applications). But like I said, this level of effort is over the top compared to just doing a regular report.

Updates on CRIME De-Coder and ASEBP

So I have various updates on my CRIME De-Coder consulting site, as well as new posts on the American Society of Evidence Based Policing Criminal Justician series.

CRIME De-Coder

Blog post, Don’t use percent change for crime data, use this stat instead. I have written a bunch here about using Poisson Z-scores, so if you are reading this it is probably old news. Do us all a favor and in your Compstat reports drop ridiculous percent change metrics with low baselines, and use 2 * ( sqrt(Current) - sqrt(Past) ).

Blog post, Dashboards should be up to date. I will have a more techy blog post here on my love/hate relationship with dashboards (most of the time static reports are a better solution). One scenario where they do make sense is public facing dashboards, but they should be up to date. The “free” versions of popular tools (Tableau, PowerBI) don’t allow you to link to a source dataset and get auto-updated, so you see many out-of-date dashboards online. If you contract with me, I can automate it so it is up to date and doesn’t rely on an analyst manually updating the information.

Demo page – that page currently includes demonstrations for:

The WDD Tool is pure javascript – picking up more of that slowly (the Folium map has a few javascript listener hacks to get it to look the way I want). As a reference for web development, I like Jon Duckett’s three books (HTML, javascript, PHP).

Ultimately there is too much stuff to learn, but on the agenda is figuring out google cloud compute + cloud databases a bit more thoroughly. Then maybe adding some PHP to my CRIME De-Coder site (a nicer contact me form, auto-updated sitemap, and RSS feed). I also want to learn how to make ArcGIS dashboards.

Criminal Justician

The newest post is Situational crime prevention and offender planning – discussing one of my favorite examples of crime prevention through environmental design (on suicide prevention) and how it is a signal about offender behavior that goes beyond simplistic impulsive behavior. I then relate this back to current discussion of preventing mass shootings.

If you have ideas for potential posts for the society (or this blog, or the CRIME De-Coder blog), always feel free to make a pitch.

Hacking folium for nicer legends

I have accumulated various code to hack folium based maps over several recent projects, so I figured I would share. It is a bit too much to walk through the entire code inline in a blog post, but at a high level the extras this code does:

  • Adds in svg elements for legends in the layer control
  • Has a method for creating legends for choropleth maps
  • Inserts additional javascript to make a nice legend title (here a clickable company logo) + additional attributions

Here is the link to a live example, and below is a screenshot:

So as a quick rundown, if you are adding in an element to a folium layer, e.g.:

folium.FeatureGroup(name="Your Text Here", overlay=True, control=True)

You can insert arbitrary svg code into the name parameter, e.g. something like <span><svg .... /svg>HotSpots</span>, and it will render correctly. So I have functions to make the svg icon match the passed color – you can see I have a nice icon for the city boundary, as well as a blobby thing for hotspots.
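
Here is a small self-contained sketch of that trick (the make_svg_name helper is illustrative, not the exact function I use):

import folium

def make_svg_name(label, color):
    # small colored circle swatch followed by the layer label
    svg = f'<svg width="12" height="12"><circle cx="6" cy="6" r="5" fill="{color}" /></svg>'
    return f'<span>{svg} {label}</span>'

m = folium.Map(location=[35.78, -78.64], zoom_start=12)
hot = folium.FeatureGroup(name=make_svg_name("HotSpots", "red"), overlay=True, control=True)
folium.CircleMarker(location=[35.78, -78.64], radius=10, color="red").add_to(hot)
hot.add_to(m)
folium.LayerControl(collapsed=False).add_to(m)
m.save("map.html")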

There are in the end so many possible parameters that I try to make reasonable functions without too crazy many options. So if someone wanted different icons, I might just make a different function (probably wouldn’t worry about passing in different potential svg).

I have a function for choropleth maps as well – I tend not to like the functions where you pass in a continuous variable and it picks a color map for you. So here it is simple: you pass in a discrete variable with the label, and a second dictionary with the mapped colors. I tend to not use choropleth maps in interactive maps very often, as they are more difficult to visualize against the background map. But there you have it if you want it.
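
A hedged sketch of that discrete mapping idea (the toy GeoJSON square and the color dictionary are purely illustrative):

import folium

color_map = {"Low": "#2b83ba", "High": "#d7191c"}  # label -> fill color

toy_geojson = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"risk": "High"},
     "geometry": {"type": "Polygon", "coordinates": [[
         [-78.65, 35.78], [-78.64, 35.78], [-78.64, 35.79],
         [-78.65, 35.79], [-78.65, 35.78]]]}}]}

def style_function(feature):
    # look up the discrete label and return its fill color
    return {"fillColor": color_map[feature["properties"]["risk"]],
            "color": "black", "weight": 0.5, "fillOpacity": 0.6}

m = folium.Map(location=[35.785, -78.645], zoom_start=14)
folium.GeoJson(toy_geojson, style_function=style_function).add_to(m)
m.save("choropleth.html")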

The final part is using javascript to insert the Crime Decoder logo (as a title in the legend/layer control), as well as the map attribution with additional text. These are inserted via additional javascript functions that I append to the html (so this wouldn’t work say inline in a jupyter notebook). The logo part is fairly simple, the map attribution though is more complicated, and requires creating an event listener in javascript on the correct elements.

The way that this works, I actually have to save the HTML file, then I reread the text back into python, add in additional CSS/javascript, and then resave the file.
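
A bare-bones version of that save/patch/resave workflow looks something like this (the injected snippet is just a placeholder, not the actual logo/attribution javascript):

import folium

m = folium.Map(location=[35.78, -78.64], zoom_start=11)
m.save("map.html")

extra_js = """
<script>
// placeholder for the legend title / attribution event listeners
console.log("custom javascript injected");
</script>
"""

# reread the saved file, splice the extra javascript in before </body>, and resave
with open("map.html", "r", encoding="utf-8") as f:
    html = f.read()

html = html.replace("</body>", extra_js + "\n</body>")

with open("map.html", "w", encoding="utf-8") as f:
    f.write(html)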

If you want something like this for your business/website/analysts, just get in contact.

For next tasks, I want to build a demo dashboard for CRIME De-Coder (probably a serverless dashboard using wasm/pyscript). But in terms of leaflet extras, since you can embed SVG into different elements, you can create charts in popups/tooltips – which would be a cool addition to my hotspots (click one and it shows a time series chart inside).

A statistical perspective on year-to-date metrics

Jerry Ratcliffe, and now more recently Jeff Asher, have written about how volatile early-year projections of year-to-date (YTD) percent changes are. I am going to write about why this is not the right way to frame the problem in my opinion – I will present a better behaved estimate that is less volatile, but clearly doesn’t give police departments what they want.

Going to the end advice first – people find me irksome for the suggestion, but you shouldn’t be using percent changes at all. A simple alternative I have suggested for low count crime data is a Poisson Z-score, which is simply 2*(sqrt(Current) - sqrt(Past)) – a value greater than 3 or 4 is a signal the two processes are significantly different (under the null hypothesis that the counts have a Poisson distribution).
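
If you want it in a script instead of a spreadsheet, the whole stat is a one-liner (the example counts here are made up):

from math import sqrt

def poisson_z(current, past):
    # 2*(sqrt(Current) - sqrt(Past)); roughly a Z-score under Poisson counts
    return 2 * (sqrt(current) - sqrt(past))

print(poisson_z(50, 30))  # ~3.2, right around the "worth a look" threshold
print(poisson_z(80, 30))  # ~6.9, a clear signal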

A Better YTD estimate

So here I am going to present a more accurate YTD percent change metric – but don’t take that as advice you should be using YTD percent change. It is more of an exercise to say why you shouldn’t be using this metric to begin with. Year end percent change is defined as:

(Current - Past)/Past = % Change

Note that you can rewrite this as:

Current/Past - Past/Past  = % Change
Current/Past - 1          = % Change

So really it is only the ratio Current/Past that we care about estimating; the translation to a percent doesn’t matter. In the above equations, I am writing these as cumulative totals for the whole year. So let’s do breakdowns via subscripts, and shorten Current and Past to C and P respectively. Say we have data through January; people typically estimate the YTD percent change then as:

(C_January - P_January)/P_January = % Change January

To make it easier, I am going to write an e subscript for early, and an l subscript for later. So if we estimate YTD through February, we have C_January + C_February = C_e. Also note that C_e + C_l = Current – the early observed values plus the later unobserved values equal the year total. This identifies a clear error when people use only subsets of the data to do YTD year end projections (what both Jerry and Jeff did in their posts to argue against early YTD estimates). You should not just use P_e in your estimate, you should use the full prior year counts.

Let’s go back to our year end estimate, writing it in early/later form:

[C_e + C_l - (P_e + P_l)]/(P_e + P_l) = % Change

This only has one unknown in the equation – C_l, the unknown rest-of-year count we need to project. You should not use (C_e - P_e)/P_e, as this introduces stochastic elements where none are needed – P_e is not necessarily representative of the full prior year total P_e + P_l. So let’s do a simple example; imagine we had these homicide totals:

     Past Current
Jan    2     1
Feb    0      
Mar    1      
Apr    1      
May    1      
Jun    1      
Jul    1      
Aug    1      
Sep    1      
Oct    1      
Nov    1      
Dec    1      
---
Tot   12

The naive way of doing YTD estimates, we would say our January YTD estimate is (1 - 2)/2 = -50%. Whereas I am saying you should use (1 + C_l - 12)/12 – filling in whatever value you project for the rest-of-year total C_l. Simple fill-ins you can do in a spreadsheet are ‘no change’, just using the prior year’s remaining counts, which here would be C_l = 10 and gives a YTD percent change estimate of (11 - 12)/12 ~ -8%. Or another simple one is extrapolate, scaling up the observed pace, C_e*(1/year_proportion) = 1*12 = 12 for the full year total, so (12 - 12)/12 = 0%. (You would really want to fit a model with seasonal and trend components and project out the remaining part of the year, which will often land somewhere between these two simpler methods.)
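
Here is a small sketch of those two spreadsheet-friendly fill-ins, using the toy homicide table above:

def ytd_pct_change(c_early, c_late_fill, past_total):
    # percent change using the full prior year total in the denominator
    return 100 * (c_early + c_late_fill - past_total) / past_total

c_e, past_total, months_observed = 1, 12, 1   # January: 1 this year, 2 last January, 12 total last year

# 'no change': fill the rest of the year with last year's remaining counts (12 - 2 = 10)
running = ytd_pct_change(c_e, 10, past_total)

# 'extrapolate': scale the observed pace up to a full year (1 * 12 = 12 total)
extrap_total = c_e * (12 / months_observed)
extrapolate = ytd_pct_change(c_e, extrap_total - c_e, past_total)

print(round(running), round(extrapolate))   # about -8 and 0 percent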

So far this is just a theoretical “should be a better estimator” – let’s show it with some actual data. Python code to replicate is here, but I took open data from Cary, NC, which goes back to 2000, so we have a sample of 22 years. Estimates of the error, broken down by month and version, are below. The naive estimate is how it is typically done (equivalent to Jeff/Jerry’s blog posts), the running estimate uses the prior year totals to fill in C_l, and extrapolate uses the observed months so far to fill it in. The error metric is | (estimated % change) - (actual year end % change) |, and the stats show the mean (standard deviation) over the sample (n=22). Here are the metrics for larceny, which averages 123 per month over the sample:

       Naive   Running  Extrapolate
Jan   12 (7)    6 (4)     10 (7)
Feb    8 (6)    6 (4)     11 (7)
Mar    9 (6)    5 (3)      8 (6)
Apr    9 (7)    5 (3)      8 (5)
May    7 (6)    5 (3)      6 (4)
Jun    6 (4)    4 (3)      4 (3)
Jul    5 (3)    4 (3)      4 (3)
Aug    4 (3)    3 (2)      3 (2)
Sep    3 (2)    3 (2)      2 (2)
Oct    2 (1)    2 (1)      2 (1)
Nov    1 (1)    1 (1)      1 (1)
Dec    0 (0)    0 (0)      0 (0)

And here are the metrics for burglary, which averages 28 per month over the sample. Although these have higher error metrics (due to lower/more volatile baseline counts), my estimator is still better than the naive one for the majority of the year.

       Naive   Running  Extrapolate
Jan   34 (25)   12 (8)    24 (23)
Feb   15 (14)   11 (7)    16 (13)
Mar   15 (14)   12 (7)    15 (11)
Apr   15 (11)   10 (7)    13 ( 8)
May   14 (10)   10 (7)    10 ( 7)
Jun   11 ( 8)   10 (7)     8 ( 6)
Jul    9 ( 7)    9 (7)     7 ( 5)
Aug    7 ( 5)    8 (5)     6 ( 3)
Sep    6 ( 4)    6 (5)     4 ( 3)
Oct    6 ( 4)    5 (4)     3 ( 3)
Nov    3 ( 3)    3 (3)     2 ( 2)
Dec    0 ( 0)    0 (0)     0 ( 0)

Running tends to do better earlier in the year (and for smaller N samples). Both the running and extrapolate estimates are closer to the true year end percent change than the naive estimate in around 70% of the observations in this sample. (This tends to be even more pronounced in the smaller crime count categories – closer to 80% to 90% of the time better.)

In Jerry’s and Jeff’s posts, they use a +/- 5 threshold to say “it is close” – this corresponds in my tables to absolute errors in the range of 5 percentage points. You meet that criterion on average in this sample by March for larcenies (running) and by September for burglaries (extrapolate).

To be clear though, even with the more accurate projections, you should not use this estimate.

What do police departments want?

So Jeff may literally want an end-of-year projection for when he writes a Times article – similar to how a government might give a year end projection for GDP growth. But this is not what most police departments want when they calculate YTD metrics. So saying in turn “you shouldn’t use YTD because the error is high” to me misses the boat a bit. I can give a metric that has lower error rates, but you still shouldn’t use YTD percent change.

What police departments want to examine is the more general question “are my numbers high?” – you can further parse this into “are my numbers high consistently over the past date range” (of which the past year is just a convenient demarcation) or “are my numbers anomalously high right now”. The former is asking about long term trends, the latter about short term increases. Part of why I don’t like YTD is that it masks these two questions – a spike early in the year can look like a perpetual long term upward trend later in the year.

I have training material showing off two different types of charts I like to use in lieu of YTD metrics. These can identify anomalous short term and long term trends. Here is an example weekly chart showing trends (in black line) and short term spikes (outside the error intervals):
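
This is not the exact chart from my training material, but a minimal sketch of the idea – weekly counts, a smoothed trend line, and rough Poisson error intervals to flag short term spikes – would look something like this (the simulated data is just for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)
weeks = pd.date_range("2022-01-02", periods=104, freq="W")
counts = pd.Series(rng.poisson(20, size=len(weeks)), index=weeks)

trend = counts.rolling(9, center=True).mean()        # long term trend
upper = trend + 3 * np.sqrt(trend)                   # rough 3-sigma Poisson band
lower = (trend - 3 * np.sqrt(trend)).clip(lower=0)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(weeks, counts, "o", ms=3, color="grey", label="weekly count")
ax.plot(weeks, trend, color="black", label="trend")
ax.fill_between(weeks, lower, upper, alpha=0.2)
ax.set_ylabel("crimes per week")
ax.legend()
plt.show()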

So this is an uber nerd post – I hope it has general lessons though. One is that if you need to estimate Y, and you can write Y as a function of other variables, some that are variable and some that are not, e.g. Y = f(x1,c), then maybe you should just focus on estimating x1 in this scenario, not model Y directly.

In terms of more general statistical modeling of crime trends, I have debated in the past examining more thoroughly seasonal-trend decomposition techniques, but I think the examples I give above are quite sufficient for most analysis (and can be implemented in a spreadsheet).

Crime De-Coder LLC Website

So I have created CRIME De-Coder LLC, a firm to do my consulting work with police departments. Check out my website, crimede-coder.com.

Feedback is welcome. In particular check out the services pages, and my first blog post on what distinguishes my services from most firms. Providing computer code to generate the end product is “teaching a man to fish”, whereas most firms just drop a final report and leave.

And of course feel free to reach out to consult@crimede-coder.com if you are interested in pursuing a project. Going forward I plan on making a new post around once a month, so sign up in your feed reader or using a service like IFTTT.


Setting up a standalone website is not that hard in the end. Currently it is a static site with some custom javascript (hosted on Hostinger). I should eventually set up a PHP server for the new blog posts and RSS feed, but for now this is fine. For those interested in doing the same, I suggest the Jon Duckett books (HTML/Javascript/PHP) for an overview of the tech, and then check out Dani Kross’s youtube tutorials (for random things like editing the htaccess file).

I am not doing a newsletter for the blog posts, as I am concerned it will get my email on random block lists. But if there is demand for it in the future I will figure out some other service to do that.

I wanted a more bare-metal setup (not a hosted wordpress like this site), as in the future I will likely do demos of dashboards, host some pyscript, make a sign-in for paid content, etc. I just wanted flexibility from the start. So stay tuned for more content from CRIME De-Coder!

Scorpion was probably not doing hot spots policing

So the Wall Street Journal had a recent article describing how crackdowns in hot spots of crime may not be the best policing tactic, Tyre Nichols Case Prompts Questions About Police Tactics in Crime Hot Spots. This is actually an OK article, but to be clear, “hot spots” policing isn’t really defined by police tactics – hot spots are just a method to identify small areas with the most crime in a city. Identifying the hot spots does not explicitly determine the policing (or non-policing) tactic one should use to reduce crime in that area. The Washington Post had a recent article in a similar vein critiquing the work of Tamara Herold in Breonna Taylor’s death. The WaPo article even prompted a response by a group of well known criminologists arguing it was inappropriate to blame Herold’s strategy.

So hot spots have always had a mix of different policing tactics that go with them; the most common strategies I would say are problem oriented policing (Braga et al., 1999), increased street or traffic stops (MacDonald et al., 2016; Sherman & Rogan, 1995), or simply patrolling/hanging out in the area (Groff et al., 2015; Koper, 1995). The WSJ article talks about Joel Caplan’s RTM group (which I think does good work), and they are really just doing a version of problem oriented policing. (POP has always had a component of working in tandem with the community and different public/private sector agencies.)

One of the reasons I wanted to write this post, though, is that often in my career I see a disconnect between purportedly hot spots policing (or similar tactics, such as DDACTS) on paper and what is actually happening on the ground. So using the Memphis Open Crime Data, I identified the top 100 street segments in terms of violent crime (code on github to replicate). As I suspected, the place where Nichols was pulled over is not a hot spot of crime, making the connection between the Scorpion unit’s behavior and hot spots policing tactics a bit suspect.
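
The replication code is on github, but the core of that ranking step is just a groupby – something like the sketch below, where the file and column names are hypothetical stand-ins for however you attach incidents to street segments:

import pandas as pd

# assumes incidents already filtered to violent crimes and tagged with a street
# segment identifier (the file/column names here are illustrative)
crimes = pd.read_csv("memphis_violent_crimes.csv")
top100 = (crimes.groupby("segment_id")
                .size()
                .sort_values(ascending=False)
                .head(100))
print(top100.head())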

If the embedded google map does not work, here is a screenshot to show how none of the top 100 street midpoints are around the location where Nichols was initially stopped:

It happens to be the case that officers often have misperceptions of where hot spots are (Macbeth & Ariel, 2019; Ratcliffe & McCullagh, 2001). And that if left to no oversight, there tends to be a mismatch between where police proactivity is occurring and where the most serious crime is spatially concentrated (Wheeler et al., 2018). That is why a system to feed back information to officers for whether they are making high quality stops is so important (Worden et al., 2018).

To be clear, this is not me making excuses for researchers or crime analysts who do not know what is actually occurring in their jurisdictions, or who ignore the secondary harms that can come with intensive policing. But in my experience, taking the time to do hot spots policing right – which at its most basic means actually identifying hot spots using data – is a good sign that a police department takes the tactics it uses seriously and thinks seriously about mitigating some of those secondary harms. Hot spots policing does not intrinsically result in unequal outcomes – harms can be mitigated via tactics such as problem oriented policing, or by constructing a hot spots policy that promotes racial equity in outcomes from the start (Wheeler, 2020).

References

  • Braga, A.A., Weisburd, D.L., Waring, E.J., Mazerolle, L.G., Spelman, W., & Gajewski, F. (1999). Problem‐oriented policing in violent crime places: A randomized controlled experiment. Criminology, 37(3), 541-580.
  • Groff, E. R., Ratcliffe, J. H., Haberman, C. P., Sorg, E. T., Joyce, N. M., & Taylor, R. B. (2015). Does what police do at hot spots matter? The Philadelphia policing tactics experiment. Criminology, 53(1), 23-53.
  • Koper, C.S. (1995). Just enough police presence: Reducing crime and disorderly behavior by optimizing patrol time in crime hot spots. Justice Quarterly, 12(4), 649-672.
  • Macbeth, E., & Ariel, B. (2019). Place-based statistical versus clinical predictions of crime hot spots and harm locations in Northern Ireland. Justice Quarterly, 36(1), 93-126.
  • MacDonald, J., Fagan, J., & Geller, A. (2016). The effects of local police surges on crime and arrests in New York City. PLoS one, 11(6), e0157223.
  • Ratcliffe, J.H., & McCullagh, M.J. (2001). Chasing ghosts? Police perception of high crime areas. British Journal of Criminology, 41(2), 330-341.
  • Sherman, L.W., & Rogan, D.P. (1995). Effects of gun seizures on gun violence: “Hot spots” patrol in Kansas City. Justice Quarterly, 12(4), 673-693.
  • Wheeler, A.P. (2020). Allocating police resources while limiting racial inequality. Justice Quarterly, 37(5), 842-868.
  • Wheeler, A. P., Steenbeek, W., & Andresen, M. A. (2018). Testing for similarity in area‐based spatial patterns: Alternative methods to Andresen’s spatial point pattern test. Transactions in GIS, 22(3), 760-774.
  • Worden, R.E., McLean, S.J., Wheeler, A.P., Reynolds, D.L., Dole, C., & Cochran, H. (2018). Smart Stops: An Inquiry into Proactive Policing. Summary Report to the National Institute of Justice, Award No. 2013-MU-CX-0012.

ptools R package update

So as an update to my R package ptools, I have bumped a major version change to 2.0, which is now up on CRAN.

There is no new functionality, but I wanted to bump the version because I swapped out rgdal/rgeos for sf (rgdal and rgeos are being deprecated). All the functions currently still take sp objects as inputs/outputs. If I ever get around to it, I will convert the functions to take either. They are somewhat inefficient with conversions, but if you are doing something where that matters you should likely switch your data engineering to another system entirely (such as SQL in postgis directly). Generating hexagons should actually be faster now, as the sf version I swapped in is vectorized (whereas the way I was using sp prior was a loop).

I debate every now and then just letting this go. I can see on cranlogs I have a grand total of just over 1k downloads (as of 2/7/2023), and around 200 of those in the last month.

Time is finite, so I have debated dropping this and just porting most of the functions over to python. Those cumulative downloads are partially bots (I may have racked up 100 of those downloads in my own CI/CD actions). Let me know if you actually use this, as that gives me feedback on whether to bother continuing to develop/update it.

Preprint: Analysis of LED street light conversions on firearm crimes in Dallas, Texas

I have a new pre-print out, Analysis of LED street light conversions on firearm crimes in Dallas, Texas. This work was conducted in collaboration with the Child Poverty Action Lab, in reference to the Dallas Taskforce report. Instead of installing new lights at the hot spots CPAL suggested, though, Dallas stepped up conversion of street lamps to LED. Here is the number of conversions over time:

And here is an aggregated quadrat map at quarter square mile grid cells (of the total number of LED conversions):

I use a diff-in-diff design (compare firearm crimes in daytime vs nighttime) to test whether the cumulative LED conversions led to reduced firearm crimes at nighttime. Overall I don’t find any compelling evidence that firearm crimes were reduced post LED installs (for a single effect or looking at spatial heterogeneity). This graph shows in the aggregate the DiD parallel trends assumption holds citywide (on the log scale), but the identification strategy really relies on the DiD assumption within each grid cell (any good advice for graphically showing that with noisy low count data for many units I am all ears!).
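
For readers who want the flavor of the design, a hedged sketch of the daytime-vs-nighttime comparison (not the preprint’s exact specification) is a count model with an interaction between a night indicator and the cumulative LED conversions, plus fixed effects:

import pandas as pd
import statsmodels.formula.api as smf

# assumes a long panel: one row per grid cell x quarter x day/night period, with
# columns 'firearm_crimes', 'night' (0/1), and 'led_count' (cumulative conversions);
# the file and column names are illustrative
panel = pd.read_csv("led_panel.csv")

mod = smf.poisson("firearm_crimes ~ night * led_count + C(cell) + C(quarter)",
                  data=panel).fit()
print(mod.summary())   # the night:led_count coefficient is the DiD quantity of interest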

For now just wanted to share the pre-print. To publish in peer-review I would need to do a bunch more work to get the lit review where most CJ reviewers would want it. Also want to work on spatial covariance adjustments (similar to here, but for GLM models). Have some R code started for that, but needs much more work/testing before ready for primetime. (Although as I say in the pre-print, these should just make standard errors larger, they won’t impact the point estimates.)

So no guarantees that will be done anytime in the near future. But no reason not to share the pre-print in the meantime.

Hot spots of crime in Raleigh and home buying

So my realtor, Ellen Pitts (who is highly recommended – she helped us a ton moving to Raleigh remotely), has a YouTube channel where she talks about real estate trends. In her most recent video she discussed crime in Raleigh relative to other cities because of the most recent shooting.

My criminologist hot take is that generally most cities in the US are relatively low crime. So Ellen shows Dallas has quite a few more per-capita shootings than Raleigh, but Dallas is quite safe “overall”. Probably somewhat contra to what most people think, the cities that in my opinion really have the most crime problems tend to be smaller rust belt cities. I love Troy, NY (where I was a crime analyst for a few years), but Troy is quite a bit rougher around the edges than Raleigh or Dallas.

So this post is more about, you have already chosen to move to Raleigh – if I am comparing house 1 and house 2 (or looking at general neighborhoods), do I need to worry about crime in this specific location?

So here are a few specific resources/strategies for the home hunter. Not just Raleigh – many cities now have an open data portal where you can often look at crime data. Here is an example with the Raleigh open data:

So if you have a specific address in mind, you can go and see the recent crime around that location (cities often fuzz the address a bit, so the actual points are just nearby on that block of the street). Blue dots in that screenshot are recent crimes in 2022 against people (you can click on each dot and get a more specific breakdown). Be prepared when you do this – crime is everywhere. But that said the vast majority of minor crime incidents should not deter you from buying a house or renting at a particular location.

Note I recommend looking at actual crime data (points on a map) for this. Several vendors release crime stats aggregated to neighborhoods or zipcodes, but these are of very low quality. (Often they “make up” data when it doesn’t exist, and when data does exist they don’t have a real great way to rank areas of low or high crime.)

For the more high level question – should I worry about this neighborhood? – I made an interactive hotspot map.

For the methodology, I focused on crimes that I would personally be concerned with as a homeowner. If I pulled larceny crimes, I am sure the Target in North Hills would be a hotspot (but I would totally buy a condo in North Hills). So this pulls the recent crime data from the Raleigh open data portal starting in 2020, and scoops up aggravated assaults, interpersonal robberies, weapon violations, and residential burglaries. Folks may be concerned about drug incidents and cars being broken into as well, but in my experience those also do not tend to be concentrated in residential areas. The python code to replicate the map is here.

Then I created DBSCAN clusters that had at least 34 crimes – so these areas average at least one of these crimes per month over the time period I sampled. Zooming in, even though I tried to filter for more potentially residential-related crimes, you can see the majority of these hot spots are commercial areas in Raleigh. So for example you can zoom in and check out the string of hot spots on Capital Blvd (and if you click a hot spot you can get breakdowns of the specific crime stats I looked at):
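
The full replication code is linked above, but the clustering step itself is short – here is a hedged sketch (the eps distance and the file/column names are illustrative):

import pandas as pd
from sklearn.cluster import DBSCAN

# assumes the selected crimes have projected X/Y coordinates (e.g. in feet)
crimes = pd.read_csv("raleigh_selected_crimes.csv")
db = DBSCAN(eps=1500, min_samples=34).fit(crimes[["X", "Y"]])
crimes["cluster"] = db.labels_   # -1 is noise, 0+ are the hot spot clusters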

Very few of these hot spots are in residential neighborhoods – most are in more commercial areas. So when considering homes in Raleigh, there are very few spots in the city where I would worry about crime at all when making a housing choice. Moving into a neighborhood with a higher proportion of renters is, I think, a potentially more important long term signal than crime here in Raleigh.

Home buying and collective efficacy

With the recent large appreciation in home values, around 20% in the prior year, there has been an increase in private investors purchasing homes to rent out. Recent stories on this by Tyler Dukes and colleagues have collated open parcel data to identify the scope of these companies across all of North Carolina.

For a bit of background, I tried to purchase a home in Plano, TX in early 2018. Homes in our price range at that time were going in a single day and typically a few thousand over asking price.

Fast forward to early 2021: I am a full remote data scientist instead of a professor, and kiddo is in online school. Even with the pay bump, housing competition was even worse in Plano at that point, so we knew we were likely going to have to move school districts to be able to purchase a home. So we decided to strike out, and ended up looking around Raleigh. We quite quickly decided to purchase a new build home in the suburb of Clayton (totally recommend our realtor, Ellen Pitts – her crew did quite a bit of work for us remotely).

It appears I was lucky to get in when I did – many of the new developments in the area are being heavily scooped up by these equity firms (and rent would be ~$600 more for my home than the mortgage). So I downloaded the public data Dukes put together, and loaded it into Excel to make a quick map of the properties.

For a NC state view, we have big clusters in Charlotte, Greensboro and Raleigh:

We can zoom in, and here is an overview of triangle area:

So you can see that inside the loop in Raleigh is pretty sparse, but many of the newer developments on the east side have many more of the private firm purchased houses. Charlotte is much more infilled with these private firms purchasing properties.

Zooming in even further to my town of Clayton, there is quite a bit of variance in the proportion of private vs residential purchases across the various developments. My development is less than 50% of these purchases, while several developments appear to be almost 100% private purchased. (This is not my home/neighborhood FYI.)


So what does this have to do with collective efficacy? Traditionally, areas with higher home ownership have been associated with lower rates of crime. For non-criminologists reading my blog, one of the most prominent criminological theories is that state actions only move the needle slightly on increasing/decreasing crime – people enforcing social norms is a bigger factor explaining high crime vs low crime areas. Places where people churn in and out more frequently – which occurs in areas with more renters – tend to have fewer people effectively keeping the peace. Because social scientists love to make up words, we call this concept collective efficacy.

Downloading and looking at this data, while I was mostly just interested in zooming into my neighborhood and seeing the infill of renters, sparked a criminological hypothesis: I expect neighborhoods with higher rates of private equity purchased housing in the long run to have higher rates of criminal behavior.

This hypothesis will be difficult to test in the wild. It is partially confounded with capital – those who buy their homes accumulate more wealth over time (again mortgage is quite a bit cheaper than rent, so even ignoring home value appreciation this is true). But the variance in the number of homes purchased by private equity firms in different areas makes me wonder if there is enough variation to do a reasonable research design to test my hypothesis, especially in the Charlotte area in say two or three years post a development being finished.