Some ad-hoc fuzzy name matching within Police databases

A repeated annoying task I have had to undertake is take a list of names and date-of-births and match them to a reference set. This can happen when you try to merge data from different sources. Or when working with police RMS data, a frequent problem is the same individual can go into the master name list multiple times. This can be either due to data error, or someone being malicious and providing the PD with fraudulent data.

Most often I am trying to match a smaller set of names to a bigger set from a police RMS system. Typically what I do is grab all of the names in the police RMS, make them the same case, and then simply sort the file by last then first name. Then I typically go through one by one from the smaller file and identify the name ID’s that are in the sorted bigger police database. Ctrl-F can make it a quick search for only a few people.

This works quite well for small numbers, but I wanted to see if I could make some simple rules when I need to match a larger list. For example, the above workflow may be fine if I need to look up 10 names quickly, but say you want to eliminate duplicates in the entire PD RMS system? A manually hand search through 100,000+ names is crazy (and will be out of date before you finish).

A tool I’ve used in the past (and would recommend) is FRIL, Fine-grained records integration and linkage tool. In a nutshell the way that tool works is that you can calculate string distances between names and/or date distances between date-of-births from two separate files. Then you specify how close you want the records to be to either automatically match the records or manually view and make a personal determination if the two records are the same person. FRIL has a really great interface to quickly view the suggested matches and manually confirm or reject certain matches.

FRIL uses the Potter Stewart I know it when I see it approach to finding matches. There is no ground truth, you just use your best judgement whether two names belong to the same person, and FRIL uses some metrics to filter out the most unlikely candidates. I have a bit of a unique strategy though to identify typical string and date of birth differences in fuzzy name matches by using Police RMS data itself. Police RMS’s tend to have a variety of people who are linked up to the same master name index value, but for potentially several reasons they have various idiosyncrasies among different individual incidents. This allows me to calculate distances within persons, so a ground truth estimate, and then I can evaluate different distances compared to a control sample to see how well they the metrics discriminate matches.

There can be several reasons for slightly different data among an individuals incidents in a police RMS, but that they end up being linked to the same person. One is that frequently RMS systems incorporate tables from several different data sources, dispatchers have their own CAD system, the PD has a system to type in paper records, custodial arrests/finger printing may have another system, etc. Merging this data into one RMS may simply cause differences in how the data is stored or even how particular fields tend to be populated in the database. A second reason is that individuals can be ex-ante associated to a particular master name index, but can still have differences is various person fields for any particular incident.

One simple example is that for an arrest report the offender may have an old address in the system, so the officer types in the new address. The same thing can happen to slight name changes or DOB changes. The master name index should update with the most recent info, but you have a record trail of all the minor variations through each incident. Depending on the type of involvement in an incident has an impact on what information is collected and the quality of that information as well. For example, if I was interviewed as a witness to a crime, I may just go down in the report as Andy Wheeler with no date of birth info. If I was arrested, someone would take more time to put in my full name, Andrew Wheeler, and my date of birth. But if the original person inputting the data took the time, they would probably realize I was the same person and associate me with the same master ID.

So I can look at these within ID changes to see the typical distances. What I did was take a name database from a police department I work with, make all pair-wise comparisons between unique names and date of births, and then calculate several string distances between the names and the date differences between listed DOB’s. I then made a randomly matched sample for a comparison group. For the database I was working with this ended up being over 100,000 people with the same ID, but different names/DOB’s somewhere in different incidents, with an average of between 2~3 different names/dob’s per person (so a sample of nearly 200,000 same name comparisons, two names only results in one comparison, but three names results in 3 comparisons). My control sample took one of these names person and matched another random person in the database as a control group, so I have a control group sample of over 100,000 cases.

The data I was working with is secondary, so the names were already aggregated to Last, First Middle. If I had the original database I could do distances for the individual fields (and probably not worry about the middle name) but it somewhat simplifies the analysis as well. Here are some histograms of the Levenshtein distance between the name strings for the same person and random samples. The Levenshtein distance is the number of single edits it takes to transform one string to another string, so 0 would be the same word.

Part of the reason the distances within the same name have such a long tail is because of the already aggregated data. There end up being some people with full middle names, some with middle initials, and some with no middle names at all. So what I did was calculate a normalized Levenshtein distance based on the max and min possible values (listed at the Wikipedia page) the string can take given the size of the two input strings. The minimum value is the difference in the length of the two strings, the maximum is the length of the longest string. So then I calculate NormLevenDist = (LevenDist - min)/(max - min). This would cause Wheeler, Andy P and Wheeler, Andy Palmer to have a normalized distance of zero, whereas the edit distance would be 5. So in these histograms you can see even more discrimination between the two classes, based mainly on such names being perfect subsets of other names.

If you eliminate the 0’s in the normalized distance, you can get a better look at the shapes of each distribution. There is no clear cut-off between the samples, but there is a pretty clear difference in the distributions.

I also calculated the Jaro-Winkler and the Dice (bi-gram) string distances. All four of these metrics had a fairly high correlation, around 0.8 with one another, and all did pretty well classifying the same ID’s according to their ROC curves.

If I wanted to train a classifier as accurately as possible, I would use all of these metrics and probably make some sort of decision tree (or estimate their effects via logistic regression), but I wanted to make a simple function (since it will be doing quite a few comparisons) that calculates as few of the metrics as possible, so I just went with the normalized Levenshtein distance here. Jaro-Winkler would probably be more competitive if I had the separate first and last names (and played around with the weights for the beginning and ends of the strings). If you had mixed strings, like some are First Middle Last and others are Last First I suspect the dice similarity would be the best.

But in the end I think all of the string metrics will do a pretty similar job for this input data, and the normalized Levenshtein distance will work pretty well so I am going to stick with that. (I don’t consider soundex matching here, I’ve very rarely come across an example where soundex would match but had a high edit distance for names, e.g. typos are much more common than intentional mis-spellings based on enunciation I believe, and even the intentional mis-spellings tend to have a small edit distance.)

Now looking at the absolute differences in the DOB’s (where both are available) provides a bit of a different pattern. Here is the histograms

But I think the easiest illustration of this is to examine the frequency table of the most common day differences for the same individuals.

Obviously zero is the most common, but you can see a few other bumps that illustrate the nature of data mistakes. The second is 365 days – exactly off by one year. Also in the top 10 are 366, 731 & 730 – off by either a year and a day or two years. 1096, 1095, 1461, 1826, are examples of these yearly cycles as well. 36,525 are examples of being off by a century! Somewhere along the way some of the DOB fields were accidentally assigned to dates in the future (such as 2032 vs. 1932). The final examples are off by some other number typical of a simple typo, such as 10, 20, or by the difference in one month 30,31,27. By the end of this table of PDF of the same persons is smaller than the PDF of the control sample.

I also calculated string distances by transforming the DOB’s to mm/dd/yy format, but when incorporating the yearly cycles and other noted mistakes they did not appear to offer any new information.

So based on this information, I made a set of ad-hoc rules to classify matched names. I wanted to keep the false positive rate at less than 1 in 1,000, but make the true positive as high as possible. The simple rules I came up with were:

  • If a normalized Levenshtein distance of less than 0.2, consider a match
  • If a normalized Levenshtein distance of less than 0.4 and a close date, consider a match.

Close dates are defined as:

  • absolute difference of within 10 days OR
  • days apart are 10,20,27,30,31 OR
  • the number of days within a yearly cycle are less than 10

This match procedure produces the classification table below:

The false positive rate is right where I wanted it to be, but the true positive is a bit lower than I hoped. But it is a simple tool though to implement, and built into it you can have missing data for a birthday.

It is a bit hard to share this data and provide reproducible code, but if you want help doing something similar with your own data just shoot me an email and I will help. This was all done in SPSS and Python (using the extended transforms python code). In the end I wanted to make a simple Python function to use with the FUZZY command to automatically match names.

Tables and Graphs paper rejection/update – and on the use of personal pronouns in scientific writing

My paper, Tables and Graphs for Monitoring Temporal Crime Patterns was recently rejected from Policing: An International Journal of Police Strategies & Management. I’ve subsequently updated the SSRN draft based on feedback from the review, and here I post the reviews and my responses to those reviews (in the text file).

One of the main critiques by both reviewers was that the paper was too informal, mainly because of the use of "I" in the paper. I use personal pronouns in writing intentionally, despite typical conventions in scientific writing, so I figured a blog post about why I do this is in order. I’ve been criticized for it on other occasions as well, but this is the first time it was listed as a main reason to reject an article of mine.

My main motivation comes from Michael Billig’s book Learn to Write Badly: How to Succeed in the Social Sciences (see a prior blog post I wrote on the contents). In a nut-shell, when you use personal pronouns it is clear that you, the author, are doing something. When you rewrite the sentence to avoid personal pronouns, you often obfuscate who the actor is in a particular sentence.

For an example of Billig’s point that personal pronouns can be more informative, I state in the paper:

I will refer to this metric as a Poisson z-score.

I could rewrite this sentence as:

This metric will be referred to as a Poisson z-score.

But that is ambiguous as to its source. Did someone else coin this phrase, and I am borrowing it? No – it is a phrase I made up, and using the personal pronoun clearly articulates that fact.

Pretty much all of the examples where I eliminated first person in the updated draft were of the nature,

In this article I discuss the use of percent change in tables.

which I subsequently changed to:

This article discusses the use of percent changes as a metric in tables.

Formal I suppose, but insipid. All rewriting the sentence to avoid the first person pronoun does is make the article seem like a sentient being, as well as forces me to use the passive tense. I don’t see how the latter is better in any way, shape, or form – yet this is one of the main reasons my paper is rejected above. The use of "we" in academic articles seems to be more common, but using "we" when there is only one author is just silly. So I will continue to use "I" when I am the only author.

Presentation at IACA 2015: An Exploratory Network Analysis of Hot People and Places

My work was accepted at the 2015 International Association of Crime Analysts Conference, so I will be going to Denver this fall to present. These are "NIJ Research Track" presentations, but basically took the place of the old MAPS conference presentations. So on Thursday at 2:30 (Panel 9) I will be presenting. The title of the presentation is An Exploratory Network Analysis of Hot People and Places, and below is the abstract for the talk:

Intelligence led policing practices focus on chronic offenders and hot spots of crime. I examine the connections between these hot people and hot places by considering micro places (street segments and intersections) and people as nodes in an interconnected network. I focus on whether hot people tend to have a finite set of locations they congregate, and whether hot places have unique profiles of chronic offenders. The end goal is to identify if observed patterns can help police combine targeted enforcement of hot people and hot places in one overarching strategy.

This is still a work in progress, but here is a quick preview. This graph is a random sample of offender footprints in each panel.

Also during this slot there are two other presentations, below is their info:

Title: Offender Based Investigations: A Paradigm Shift in Policing, Author: Chief James W. Buie, Gaston County Police

Abstract: Routine activity theory and research conducted by Wiles, P & Costello, A. (2000) ‘The Road to Nowhere’ prove criminals 1) commit crimes where comfortable and 2) are very likely to re-offend. Therefore it makes sense to elevate the importance of offender locations in relation to crimes. Our focus on the offender is called Offender Based Investigations (OBI). We’re using GIS to plot not only crimes but criminals as well. Our experience over the past seven years of utilizing OBI has proven that mapping offenders plays a critical role in solving crimes.

Title: A Spatial Network Analysis of Gangs Author: Davin Hall, Crime Analysis Specialist for the Greensboro PD

Abstract: Gangs are characterized by the associations between members and the territory that members operate in. Many representations of gang territory and gang networks are separate from one another; a map shows the territory while a network map shows linkages between individuals. This presentation demonstrates the combination of territory and network within a single map, for both gangs and gang members. This analysis shows law enforcement a clearer picture of the complex relationships that occur between and within gangs.

You can see the full agenda here. I really enjoyed the presentations last year at Seattle, and this year looks great as well. If you want any more info. on my particular work feel free to send me an email, or if you want to get together at Denver this fall.

How wide to make the net in actuarial tools? (false positives versus false negatives)

An interesting debate/question came up in my work recently. I conducted an analysis of a violence risk assessment tool for a police department. Currently the PD takes around the top 1,000 scores of this tool, and then uses further intelligence and clinical judgements to place a small number of people on a chronic offender list (who are then subject to further interventions). My assessment of the predictive validity when examining ROC curves suggested the tool does a pretty good job discriminating violent people up to around the top 6,000 individuals and after that flattens out. In a sample of over 200,000, the top 1000 scores correctly classified 30 of the 100 violent cases, and the top 6000 classified 60.

So the question came up should we recommend that the analysts widen the net to the top 6,000 scores, instead of only examining the top 1,000 scores? There are of course costs and limitations of what the analysts can do. It may simply be infeasible for the analysts to review 6,000 people. But how do you set the limit? Should the clinical assessments be focused on even fewer individuals than 1,000?

We can make some estimates of where the line should be drawn by setting weights for the cost of a false positive versus a false negative. Implicit in the whole exercise of predicting violence in a small set of people is that false negatives (failing to predict someone will be violent when they are) greatly outweigh a false positive (predicting someone will be violent but they are not). The nature of the task dictates that you will always need to have quite a few false positives to classify even a few true positives, and no matter what you do there will only be a small number of false negatives.

Abstractly, you can place a value on the cost of failing to predict violence, and a cost on the analysts time to evaluate cases. In this situation we want to know whether the costs of widening the net to 6,000 individuals are less than the costs of only examining the top 1,000 individuals. Here I will show we don’t even need to know what the exact cost of a false positive or a false negative is, only the relative costs, to make an estimate about whether the net should be cast wider.

The set up is that if we only take the top 1,000 scores, it will capture 30 out of the 100 violent cases. So there will be (100 – 30) false negatives, and (1000 – 30) false positives. If we increase the scores to evaluate the top 6,000, it will capture 60 out the 100 violent cases, but then we will have (6000 – 60) false positives. I can not assign a specific number to the cost of a false negative and a false positive. So we can write these cost equations as:

1) (100 - 30)*FN + (1000 - 30)*FP = Cost Low
2) (100 - 60)*FN + (6000 - 60)*FP = Cost High

Even though we do not know the exact cost of a false negative, we can talk about relative costs, e.g. 1 false negative = 1000*false positives. There are too many unknowns here, so I am going to set FP = 1. This makes the numbers relative, not absolute. So with this constraint the reduced equations can be written as:

1) 70*FN +  970 = Cost Low
2) 40*FN + 5940 = Cost High

So we want to know the ratio at which there is a net benefit over including the top 6,000 scores versus only the top 1,000. So this means that Cost High < Cost Low. To figure out this point, we can subtract equation 2 from equation 1:

3) (70 - 40)*FN - 4970 = Cost Low - Cost High

If we set this equation to zero and solve for FN we can find the point where these two equations are equal:

30*FN - 4970 = 0
30*FN = 4970
FN = 4970/30 = 165 + 2/3

If the value of a false negative is more than 166 times the value of a false positive, Cost Low - Cost High will be positive, and so the false negatives are more costly to society relative to the analysts time spent. It is still hard to make guesses as to whether the cost of violence to society is 166 times more costly than the analysts time, but that is at least one number to wrap your head around. In a more concrete example, such as granting parole or continuing to be incarcerated, given how expensive prison is net widening (with these example numbers) would probably not be worth it. But here it is a bit more fuzzy especially because the analysts time is relatively inexpensive. (You also have to guess how well you can intervene, in the prison example incarceration essentially reduces the probability of committing violence to zero, whereas police interventions can not hope to be that successful.)

As long as you assume that the classification rate is linear within this range of scores, the same argument holds for net widening any number. But in reality there are diminishing returns the more scores you examine (and 6,000 is basically where the returns are near zero). If you conduct the same exercise between classifying zero and the top 1,000, the rate of the cost of a false negative to a false positive needs be 32+1/3 to justify evaluating the top 1,000 scores. If you actually had an estimate of the ratio of the cost of false positives to false negatives you could then figure out exactly how wide to make the net. But if you think the ratio is well above 166, you have plenty of reason to widen the net to the larger value.

New paper: Tables and graphs for monitoring temporal crime patterns

I’ve uploaded a new pre-print, Tables and graphs for monitoring temporal crime patterns. The paper basically has three parts, which I will briefly recap here:

  • percent change is a bad metric
  • there are data viz. principles to constructing nicer tables
  • graphs >> tables for monitoring trends

Percent change encourages chasing the noise

It is tacitly understood that percent change when the baseline is small can fluctuate wildly – but how about when the baseline average is higher? If the average of crime was around 100 what would you guess would be a significant swing in terms of percent change? Using simulations I estimate for a 1 in 100 false positive rate you need an over 40% increase (yikes)! I’ve seen people make a big deal about much smaller changes with much smaller baseline averages.

I propose an alternative metric based on the Poisson distribution,

2*( SQRT(Post) - SQRT(Pre) )

This approximately follows a normal distribution if the data is Poisson distributed. I show with actual crime data it behaves pretty well, and using a value of 3 to flag significant values has a pretty reasonable rate of flags when monitoring weekly time series for five different crimes.

Tables are visualizations too!

Instead of recapping all the points I make in this section, I will just show an example. The top table is from an award winning statistical report by the IACA. The latter is my remake.

Graphs >> Tables

I understand tables are necessary for reporting of statistics to accounting agencies, but they are not as effective as graphs to monitor changes in time series. Here is an example, a seasonal chart of burglaries per month. The light grey lines are years from 04 through 2013. I highlight some outlier years in the chart as well. It is easy to see whether new data is an outlier compared to old data in these charts.

I have another example of monitoring weekly statistics in the paper, and with some smoothing in the chart you can easily see some interesting crime waves that you would never comprehend by looking at a single number in a table.

As always, if you have comments on the paper I am all ears.

Stacking Intervals

Motivated by visualizing overlapping intervals from the This is not Rocket Science blog (by Valentine Svensson) (updated link here) I developed some code in SPSS to accomplish a similar feat. Here is an example plot from the blog:

It is related to a graph I attempted to make for visualizing overlap in crime events and interval censored crime data is applicable. The algorithm I mimic by Valentine here is not quite the same, but similar in effect. Here is the macro I made to accomplish this in SPSS. (I know I could also use the python code by the linked author directly in SPSS.)

DEFINE !StackInt (!POSITIONAL = !TOKENS(1)
                 /!POSITIONAL = !TOKENS(1) 
                 /Fuzz = !DEFAULT("0") !TOKENS(1) 
                 /Out = !DEFAULT(Y) !TOKENS(1) 
                 /Sort = !DEFAULT(Y) !TOKENS(1) 
                 /TempVals = !DEFAULT(100) !TOKENS(1) )
!IF (!Sort = "Y") !THEN
SORT CASES BY !1 !2.
!IFEND
VECTOR #TailEnd(!TempVals).
DO IF ($casenum = 1).  
  COMPUTE #TailEnd(1) = End + !UNQUOTE(!Fuzz).
  COMPUTE !Out = 1.
  LOOP #i = 2 TO !TempVals.
    COMPUTE #TailEnd(#i) = Begin-1.
  END LOOP.
ELSE.
  DO REPEAT i = 1 TO !TempVals /T = #TailEnd1 TO !CONCAT("#TailEnd",!TempVals).
    COMPUTE T = LAG(T).
    DO IF (Begin > T AND MISSING(Y)).
      COMPUTE T = End + !UNQUOTE(!Fuzz).
      COMPUTE !Out = i.
    END IF.
  END REPEAT.
END IF.
FORMATS !Out !CONCAT("(F",!LENGTH(!TempVals),".0)").
FREQ !Out /FORMAT = NOTABLE /STATISTICS = MAX.
!ENDDEFINE.

An annoyance for this is that I need to place the ending bins in a preallocated vector. Since you won’t know how large the offset values are to begin with, I default to 100 values. The function prints a set of frequencies after the command though (which forces a data pass) and if you have missing data in the output you need to increase the size of the vector.

One simple addition I made to the code over the original is what I call Fuzz in the code above. What happens for crime data is that there are likely to be many near overlaps in addition to many crimes that have an exact known time. What the fuzz factor does is displaces the lines if they touch within the fuzz factor. E.g. if you had two intervals, (1,3) and (3.5,5), if you specified a fuzz factor of 1 the second interval would be placed on top of the first, even though they don’t exactly overlap. This appears to work reasonably well for both small and large n in some experimentation, but if you use too large a fuzz factor it will produce river like runs through the overlap histogram.

Here is an example of using the macro with some fake data.

SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 1000.
  COMPUTE Begin = RV.NORMAL(50,15).
  COMPUTE End = Begin + EXP(RV.GAMMA(1,2)).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

!StackInt Begin End Fuzz = "1" TempVals = 101.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Begin End Mid
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: End=col(source(s), name("End"))
  DATA: Begin=col(source(s), name("Begin"))
  DATA: Y=col(source(s), name("Y"), unit.category())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), null())
  ELEMENT: edge(position((End+Begin)*Y))
END GPL.

And here is the graph:

Using the same burglary data from Arlington I used for my aoristic macro, here is an example plot for a few over 6,300 burglaries in Arlington in 2012 using a fuzz factor of 2 days.

This is somewhat misleading, as events with a known exact time are not given any area in the plot. Here is the same plot but with plotting the midpoint of the line as a circle over top of the edge.

You can also add in other information to the plot. Here I color edges and points according to the beat that they are in.

It produces a pretty plot, but it is difficult to discern any patterns. I imagine this plot would be good to evaluate when there is the most uncertainty in crime reports. I might guess if I did this type of plot with burglaries after college students come back from break you may see a larger number of long intervals, showing that they have large times of uncertainty. It is difficult to see any patterns in the Arlington data though besides more events in the summertime (which an aoristic chart would have shown to begin with.) At the level of abstraction of over a year I don’t think you will be able to spot any patterns (like with certain days of the week or times of the month).

I wonder if this would be useful for survival analysis, in particular to spot any temporal or cohort effects.

Presentation at IACA 2014 – Making Field Stops Smart

Part of the work I am doing with the Finn Institute in collaboration with the Albany Police Department was accepted as a presentation at the upcoming IACA conference in Seattle next week. The NIJ used to have a separate Crime Mapping conference, but they folded it into the yearly IACA conference. So this is one of the NIJ Crime Mapping presentations.

The title of the presentation is Making Field Stops Smart, and below is the abstract:

Mapping hot spots of crime incidents for use in allocating patrol resources has become commonplace. This research is intended to extend the logic to mapping locations of field interviews. The project has two specific spatial analysis components; 1) are most of the stops being conducted a high crime locations, and 2) are locations with the most stops the locations with the most productive stops (in terms of arrests, contraband recovery, stopping chronic offenders). Making stops smart is being conducted as a research partnership between the Albany, NY police department and the Finn Institute of Public Safety.

The time of the presentation is at 15:30 on Thursday 9/11. Two other presenters, Eric Paull from Akron, Ohio and Christian Peterson from Portland, Oregon have presentations on the panel as well (see the IACA agenda for their talk abstracts).

I am uncomfortable publicly releasing the pre-print white papers given the collaboration (Rob Worden and Sarah McLean are co-authors) and because that APD’s name is directly attached to the work. But if you send me an email I can forward the white paper for this presentation and related work we are doing.

If you see me at IACA feel free to come up and say hi. I do not have any other plans while I am in town besides going to presentations.

 

Understanding Uncertainty – crime counts and the Poisson distribution

A regular occurrence for me when I was a crime analyst went along the lines of, "There was a noteworthy crime event in the media, can you provide some related analysis". Most of the time this followed one or multiple noteworthy crimes that caught the public’s attention, which could range from a series of thefts from vehicles over a month, the same gas station being robbed on consecutive days, or a single murder.

Any single violent crime is awful, and this is not meant to deny that. But often single noteworthy events are often misconstrued as crime waves, or general notions that the neighborhood is in decline or the city is a more dangerous place now than it ever was. The media is intentionally hyperbole, so it is not an effective gauge whether or not crime is really increasing or decreasing. Here I will show an example of using the Poisson distribution to show whether or not a recent spree of crimes is more than you would expect by chance.

So lets say that over the course of 20 years, the mean number of homicides in a jurisdiction is 2. Lets also say that in the year so far, we have 5 homicides. Ignoring that the year has not concluded, what is the probability of observing 5 or more homicides? Assuming the number of homicides is a Poisson distributed random variable (often not too unreasonable for low counts over long time periods) the probability is 5.7%. Small, but not totally improbable. To calculate this probability it is just 1 minus the cumulative distribution function for a Poisson distribution with the given mean. This can be calculated easily in R by using ppois, i.e. 1 - ppois(5-1,2) (just replace 5 with your observed count and 2 with your mean). Note that I subtract 1 from 5, otherwise it would be testing the probability of over 5 instead of 5 or more. For those analysts using Excel, the formula is =1 - POISSON.DIST(5-1,2,TRUE). For SPSS it would be COMPUTE Prob = 1 - CDF.POISSON(5-1,2).

The reason for making these calculations is specifically to understand chance variations given the numbers historically. For the crime analyst it is necessary to avoid chasing noise. For the public it is necessary to understand the context of the current events in light of historical data. For another example, say that the average number of robberies in a month is 15, what is the probability of observing 21 or more robberies? It is 8%, so basically you would expect this high of a number to happen at least one time every year (i.e. 0.08*12~1). Without any other information, there is little reason for a crime analyst to spend extra time examining 21 robberies in a month based on the total number of events alone. Ditto for the public there is little reason to be alarmed by that many robberies in a month given the historical data. (The analyst may want to examine the robberies for other reasons, but there is no reason to be fooled into thinking there is an unexpected increase.)

Here I’ve ignored some complications for the sake of simplicity in the analysis. One is that crime may not be Poisson distributed, but may be under or over-dispersed. In the case of over-dispersion (which seems to happen more often with crime data) the series likely has a higher number of 0’s and then high bursts of activity. In this case you would expect the higher bursts more often than you would with the Poisson distribution. For under-dispersed Poisson data, the variance is smaller than the mean, and so higher bursts of activity are less likely. These are fairly simple to check (at least to see if they are grossly violated), either see if the mean approximately equals the variance, or draw a histogram and superimpose a density estimate for the Poisson distribution. This also ignores seasonal fluctuations in crime (e.g. more burglaries occur in the summer than in the winter).

Even if you do not like making the Poisson assumption a very simple analysis to conduct is to plot the time series of the event over a long period. The rarer the crime the larger aggregation and time series you might need, but this is pretty straightforward to conduct with a SQL query and whatever program you use to conduct analysis. If it is for UCR crime counts, you can try going onto the UCR data tool to see if your jurisdiction has historical annual data going back to 1985. My experience is the vast majority of crime waves depicted in the media are simply chance fluctuations, clearly visible as such just by inspecting the time series plot. Similarly such a plot will show if there is an increase or a decrease compared to historical numbers. Another simple analysis is to take the current numbers and rank them against, say the prior 50 to 100 values in the series. If it is abnormal it should be the the highest or very near the highest in those prior values.

Another complication I have ignored is that of multiple testing. When one is constantly observing a series, even a rare event is likely to happen over a long period of observation. So lets say that in your jurisdiction on average there are 3 domestic assaults in a week, and one week you observe 9. The probability of observing 9 or more is 0.003, but over a whole year the probability of this happening at least once is around 18% (i.e. 1 - ppois(8,3)^52 in R code). Over the course of 10 years (around 520 weeks) we would expect around 2 weeks to have 9 or more domestic assaults (i.e. 520*(1-ppois(8,3))). (This probability goes up higher if we consider sliding windows, e.g. 9 or more domestic assaults in any 7 day span, instead of just over different weeks.) These statistics make the assumption that events are independent (likely not true in practice) but I rather make that false assumption to get a sense of the probability then rely on gut feelings or opinions based on the notoriety of the recent crime(s).

The title for this blog post is taken from David Spiegelhalter’s site Understanding Uncertainty, and that link provides a synonymous example with the recent cluster of plane crashes.

New paper – Testing for Randomness in Day of Week Crime Sprees with Small Samples

I’ve uploaded a new paper, Testing for Randomness in Day of Week Crime Sprees with Small Samples, to SSRN. The abstract is below:

This paper discusses exact tests for evaluating whether a series of offences are randomly distributed across days of the week for small sample sizes. The given context is if a crime analyst has identified a series of events that are committed by the same offender(s), can the analyst determine if those events are randomly distributed with respect to the day of week given only a few offences? I detail how one can develop exact reference distributions since the number of potential permutations are small. I also discuss the power of the chi^2 and Kuiper’s V test for small sample sizes, and give an example of hypothesis testing when crimes have uncertain days of occurrence.

Here is an example graph of the power of the tests under different alternate hypotheses of crimes only having probability of being committed on a certain subset of days in the week. Comments on the paper would be wonderful (I will have to bug someone to give me some pre peer-review feedback) – so feel free.

Sparklines for Time Interval Crime Data

I developed some example sparklines for tables when visualizing crime data that occurs in an uncertain window. The use case is small tables that list the begin and end date-times, and the sparklines provide a quick visual assessment of the day of week and time of day. Examining for overlaps in two intervals is one of the hardest things to do when examining a table, and deciphering days of week when looking at dates is just impossible.

Here is an example table of what they look like.

The day of week sparklines are a small bar chart, with Sunday as the first bar and Saturday as the last bar. The height of the bar represents the aoristic estimate for that day of week. An interval over a week long (entirely uncertain what day of week the crime took place) ends up looking like a dashed line over the week. This example uses the sparkline bar chart built into Excel 2010, but the Sparklines for Excel add-on provides synonymous functionality. The time of day sparkline is a stacked bar chart in disguise; it represents the time interval with a dark grey bar, and the remaining stack is white. This allows you to have crimes that occur overnight and are split in the middle of the day. Complete ignorance of when the crime occurred during the day I represent with a lighter grey bar.

The spreadsheet can be downloaded from my drop box account here.

A few notes the use of the formulas within the sheet:

  • The spreadsheet does have formulas to auto-calculate the example sparklines (how they exactly work is worth another blog post all by itself) but it should be pretty clear to replicate the example bar chart for the day of week and time of day in case you just want to hand edit (or have another program return the needed estimates).
  • For the auto-calculations to work for the Day of Week aoristic estimates the crime interval needs to have a positive value. That is, if the exact same time is listed in the begin and end date column you will get a division by zero error.
  • For the day of week aoristic estimates it calculates the proportion as 1/7 if the date range is over one week. Ditto for the time range it is considered the full range if it goes over 24 hours.

A few notes on the aesthetics of sparklines:

  • For the time of day sparkline if you have zero (or near zero) length for the interval it won’t show up in the graph. Some experimentation suggests the interval needs around 15 to 45 minutes for typical cell sizes to be visible in the sheet (and for printing).
  • For the time of day sparkline the empty time color is set to white. This will make the plot look strange if you use zebra stripes for the table. You could modify it to make the empty color whatever the background color of the cell is, but I suspect this might make it confusing looking.
  • A time of day bar chart could be made just the same as the day of week bar chart. It would require the full expansion for times of day, which I might do in the future anyway to provide a conveniant spreadsheet to calculate aoristic estimates. (I typically do them with my SPSS MACRO – but it won’t be too arduous to expand what I have done here to an excel template).
  • If the Sparklines for Excel add-on allowed pie charts with at least two categories or allowed the angle of the pie slice to be rotated, you could make a time of day pie chart sparkline. This is currently not possible though.

I have not thoroughly tested the spreadsheet calculations (missing values will surely return errors, and if you have the begin-end backwards it may return some numbers, but I doubt they will be correct) so caveat emptor. I think the sparklines are a pretty good idea though. I suspect some more ingenious uses of color could be used to cross-reference the days of week and the time of day, but this does pretty much what I hoped for when looking at the table.