Advice for crime analyst to break into data science

I recently received a question from a crime analyst looking to break into data science. Figured it would be a good topic for a blog post. I have written many resources over the years targeting recent PhDs, but the advice for crime analysts is not all that different: you need to pick up some programming, and likely some more advanced tech skills.

For background, the individual had SQL + Excel skills (many analysts may just have Excel). For the vast majority of analyst roles, you should be quite adept at SQL. But SQL alone is not sufficient for even an entry level data science role.


For entry level data science roles, you will need to demonstrate competency in at least one programming language. The majority of positions will want you to have python skills. (I wrote an entry level python book exactly for someone in your position.)

You will likely also need to demonstrate competency in machine learning or in using large language models for data science roles. It used to be that Andrew Ng's courses were the best recommendation (I see he has a spin-off, DeepLearningAI, now), though that is second hand; I have not personally taken them. LLMs are more popular now, so prioritizing learning how to call those APIs, build RAG systems, and do prompt engineering will, I think, make you slightly more marketable than traditional machine learning.

I have personally never hired anyone in a data science role without a masters. That said, I would not have a problem if you had a good portfolio. (Nice website, Github contributions, etc.)

You should likely start looking at and applying to “analyst” roles now. Don't worry if they ask for programming experience you do not have, just apply. For many roles the posting is clearly wrong or has totally unrealistic expectations.

At larger companies, analyst roles can have a better career ladder, so you may just decide to stay in that role. If not, you can continue pursuing additional learning opportunities to work toward a data science career.

Remote is more difficult than in person, but I would start by identifying companies that are crime analysis adjacent (Lexis Nexis, ESRI, Axon) and start applying to current open analyst positions.

For additional resources I have written over the years:

The alt-ac newsletter has various programming and job search tips. The 2023 blog post goes through different positions (if you want, it may be easier to break into project management than data science, though you have a good background to get senior analyst positions), and the 2025 blog post goes over how to have a portfolio of work.

Cover page, data science for crime analysis with python

I translated my book for $7 using openai

The other day an officer from the French Gendarmerie commented that they use my python for crime analysis book. I asked that individual, and he stated they all speak English. But given my book is written in plain text markdown and compiled using Quarto, it is not that difficult to pipe the text through a tool to translate it to other languages. (Knowing that epubs under the hood are just html, it would not surprise me if there is some epub reader that can use google translate.)

So you can see I now have four new books available in the Crime De-Coder store:

ebook versions are normally $39.99, and print is $49.99 (both available worldwide). For the next few weeks, you can use promo code translate25 (until 11/15/2025) to purchase epub versions for $19.99.

If you want to see a preview of the books' first two chapters, here are the PDFs:

And here I added a page on my crimede-coder site with testimonials.

As the title says, this in the end cost (less than) $7 to convert to French (and ditto to convert to Spanish).

Here is code demo’ing the conversion. It uses OpenAI’s GPT-5 model, but likely smaller and cheaper models would work just fine if you did not want to fork out $7. It ended up being a quite simple afternoon project (parsing the markdown ended up being the bigger pain).

So the markdown for the book in plain text looks like this:
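(An illustrative snippet in the style of the book source, not a verbatim excerpt:)

## Running in the REPL

Now, we are going to run an interactive python session, sometimes
people call this the REPL, read-eval-print-loop.

```python
print('Hello world')
```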

It ends up that because markdown uses blank lines to denote different sections, those are a fairly natural break for doing the translation. These GenAI tools cannot reliably repeat back very long sequences, but a paragraph is a good length: long enough to have additional context, but short enough for the machine to not go off the rails when trying to return the text you input. Then I just have extra logic to not parse code sections (which start/end with three backticks). I don't even bother to parse out the other sections (like LaTeX or HTML); I just include in the prompt to not modify those.

So I just read in the quarto document, split it on blank lines ("\n\n"), then feed the text sections into OpenAI. I did not test this very much, just used the current default gpt-5 model with medium reasoning. (It is quite possible a non-reasoning smaller model will do just as well. I suspect the open models will do fine.)
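A minimal sketch of that loop (the file names and prompt wording here are my own stand-ins, not the exact script, and it assumes fenced code blocks contain no blank lines):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Translate the following markdown from English to French. "
          "Do not modify code, LaTeX, or HTML. Return only the translation.")

def translate(chunk, model="gpt-5"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": chunk}])
    return resp.choices[0].message.content

with open("book.qmd", encoding="utf-8") as f:
    text = f.read()

parts = []
for chunk in text.split("\n\n"):
    # pass code sections through untranslated
    if chunk.lstrip().startswith("```"):
        parts.append(chunk)
    else:
        parts.append(translate(chunk))

with open("book_fr.qmd", "w", encoding="utf-8") as f:
    f.write("\n\n".join(parts))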

You will ultimately still want someone to spot check the results, and then do some light edits. For example, here is the French version when I am talking about running code in the REPL, first in English:

Running in the REPL

Now, we are going to run an interactive python session, sometimes people call this the REPL, read-eval-print-loop. Simply type python in the command prompt and hit enter. You will then be greeted with this screen, and you will be inside of a python session.

And then in French:

Exécution dans le REPL

Maintenant, nous allons lancer une session Python interactive, que certains appellent le REPL, boucle lire-évaluer-afficher. Tapez simplement python dans l’invite de commande et appuyez sur Entrée. Vous verrez alors cet écran et vous serez dans une session Python.

So the acronym is carried forward, but the expansion of the acronym is translated, so it no longer spells out REPL. (And I went and edited that for the versions on my website.) But look at this section in the intro talking about GIS:

There are situations when paid for tools are appropriate as well. Statistical programs like SPSS and SAS do not store their entire dataset in memory, so can be very convenient for some large data tasks. ESRI’s GIS (Geographic Information System) tools can be more convenient for specific mapping tasks (such as calculating network distances or geocoding) than many of the open source solutions. (And ESRI’s tools you can automate by using python code as well, so it is not mutually exclusive.) But that being said, I can leverage python for nearly 100% of my day to day tasks. This is especially important for public sector crime analysts, as you may not have a budget to purchase closed source programs. Python is 100% free and open source.

And here in French:

Il existe également des situations où les outils payants sont appropriés. Les logiciels statistiques comme SPSS et SAS ne stockent pas l’intégralité de leur jeu de données en mémoire, ils peuvent donc être très pratiques pour certaines tâches impliquant de grands volumes de données. Les outils SIG d’ESRI (Système d’information géographique) peuvent être plus pratiques que de nombreuses solutions open source pour des tâches cartographiques spécifiques (comme le calcul des distances sur un réseau ou le géocodage). (Et les outils d’ESRI peuvent également être automatisés à l’aide de code Python, ce qui n’est pas mutuellement exclusif.) Cela dit, je peux m’appuyer sur Python pour près de 100 % de mes tâches quotidiennes. C’est particulièrement important pour les analystes de la criminalité du secteur public, car vous n’avez peut‑être pas de budget pour acheter des logiciels propriétaires. Python est 100 % gratuit et open source.

So it translated GIS to SIG in French (Système d’information géographique). Which seems quite reasonable to me.

I paid an individual to review the Spanish translation (if any readers are interested in giving me a quote for copy-editing the French version, I would appreciate it). She stated it is overall very readable, but just has many minor things. Here is a sample of her suggestions:

The total number of edits she suggested was 77 (over 310 pages).

If you are interested in another language just let me know. I am not sure about translation quality for the Asian languages, but I imagine it works OK out of the box for most languages derived from Latin. Another benefit of self-publishing: I can have the French version available now, and if I am able to find someone to help with the copy-edits, I can simply push an updated version later.

The difference between models, drive-time vs fatality edition

Easily one of the most common critiques I make when reviewing peer reviewed papers is the concept that the difference between statistically significant and not statistically significant is not itself statistically significant (Gelman & Stern, 2006).

If you cannot parse that sentence, the idea is simple to illustrate. Imagine you have two models:

Model     Coef  (SE)  p-value
  A        0.5  0.2     0.01
  B        0.3  0.2     0.13

So often social scientists will say “well, the effect in model B is different” and then post-hoc make up some reason why the effect in Model B is different than in Model A. This is a waste of time, as comparing the effects directly, they are quite similar. We have an estimate of their difference (assuming 0 covariance between the effects):

Effect difference = 0.5 - 0.3 = 0.2
SE of effect difference = sqrt(0.2^2 + 0.2^2) = 0.28
z of difference = 0.2/0.28 = 0.71 (two-sided p ≈ 0.48)

So when you compare the models directly (which is probably what you want to do when describing comparisons between your work and prior work), this is a bit of a nothing burger. It does not matter that Model B is not statistically significant; a coefficient of 0.3 is totally consistent with the prior work given the standard errors of both models.
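To do that arithmetic in python (a quick sketch of the Gelman & Stern point, not code from either paper):

from math import sqrt
from scipy.stats import norm

coef_a, se_a = 0.5, 0.2
coef_b, se_b = 0.3, 0.2

diff = coef_a - coef_b             # 0.2
se_diff = sqrt(se_a**2 + se_b**2)  # 0.28, assuming 0 covariance
z = diff/se_diff                   # 0.71
p = 2*(1 - norm.cdf(abs(z)))       # 0.48, two-sided p-value
print(f'diff={diff:.2f}, se={se_diff:.2f}, z={z:.2f}, p={p:.2f}')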

I was reminded of this concept again, as Arredondo et al. (2025) do a replication of my paper with Gio on drive time, driving distance, and gunshot fatalities (Circo & Wheeler, 2021). They find that distance (whether Euclidean or drive time) is not statistically significant in their models. Here is the abstract:

Gunshot fatality rates vary considerably between cities with Baltimore, Maryland experiencing the highest rate in the U.S. Previous research suggests that proximity to trauma care influences such survival rates. Using binomial logistic regression models, we assessed whether proximity to trauma centers impacted the survivability of gunshot wound victims in Baltimore for the years 2015-2019, considering three types of distance measurements: Euclidean, driving distance, and driving time. Distance to a hospital was not found to be statistically associated with survivability, regardless of measure. These results reinforce previous findings on Baltimore’s anomalous gunshot survivability and indicate broader social forces’ influence on outcomes.

This ends up being a clear example of the error I describe above. To make it simple, here is a comparison between their effects and the effects in my and Gio’s paper (in the format Coef (SE)):

Paper      Euclid         Network        Drive Time
Philly     0.042 (0.021)  0.030 (0.016)  0.022 (0.010)
Baltimore  0.034 (0.022)  0.032 (0.020)  0.013 (0.006)

At least for these coefficients, there is literally nothing anomalous at all compared to the work Gio and I did in Philadelphia.

To translate these coefficients to something meaningful, Gio and I estimate marginal effects – basically a reduction of 2 minutes results in a decrease of 1 percentage point in the probability of death. So if you compare someone who is shot 10 minutes from the hospital and has a 20% chance of death, if you could wave a wand and get them to the ER 2 minutes faster, we would guess their probability of death goes down to 19%. Tiny, but over many such cases makes a difference.

I went through some power analysis simulations in the past for a paper comparing longer drive time distances as well (Sierra-Arévalo et al., 2022). So the (very minor) differences could also be due to omitted variable bias (in logit models, omitted variables can bias coefficients towards 0 even when not confounded with the other covariates). The Baltimore paper does not include where a person was shot, which was easily the most important factor in my research for the Philly work.

To wrap up – we as researchers cannot really change broader social forces (nor can we likely change the location of level 1 trauma centers). What we can change, however, are the different methods to get gunshot victims to the ER faster. These include things like scoop-and-run (Winter et al., 2022), or even gunshot detection tech to get people to scenes faster (Piza et al., 2023).


Using Esri + python: arcpy notes

I shared a series of posts this week using Esri + arcpy tools on my Crime De-Coder LinkedIn page. LinkedIn eventually removes the posts though, so I am putting those same tips here on the blog. Esri's tools do not have great coverage online, so blogging is a way to get them more coverage in those LLM tools long term.


A little arcpy tip: if you import a toolbox, it can be somewhat confusing what the names of the available methods are. For example, if importing some of the tools Chris Delaney has created for law enforcement data management, you can get the original methods available in arcpy, and then see the additional methods after importing the toolbox:

import arcpy
d1 = dir(arcpy) # original methods
arcpy.AddToolbox(r"C:\LawEnforcementDataManagement.atbx") # raw string, since the path has backslashes
d2 = dir(arcpy) # updated methods available after AddToolbox
set(d2) - set(d1) # These are the new methods
# This prints out for me
# {'ConvertTimeField_Defaultatbx', 'toolbox_code', 'TransformCallData_Defaultatbx', 'Defaultatbx', 'TransformCrimeData_Defaultatbx'}
# To call the tool then
arcpy.TransformCrimeData_Defaultatbx(...)

Many of the Arc tools have the ability to copy python code; when I use Chris's tool it copies arcpy.Defaultatbx.TransformCrimeData, but if running from a standalone script outside of an Esri session (using the python environment that ArcGIS Pro installs) that isn't quite the right code to call the function.

You can check out Chris’s webinar that goes over the law enforcement data management tool, and how it fits into the different crime analysis solutions that Chris and company at Esri have built.


I like using conda for python environments on Windows machines, as it is easier to install some particular packages. So I mostly use:

conda create --name new_env python=3.11 pip
conda activate new_env
pip install -r requirements.txt

But for some libraries, like geopandas, I will have conda figure out the install, e.g.:

conda create --name geo_env python=3.11 pip geopandas
conda activate geo_env
pip install -r requirements.txt

This is because geopandas and its dependencies are particularly difficult to pip install on Windows, with many version restrictions.

And if you are using ESRI tools and want to install an additional library, conda is already installed and you can clone the ESRI environment:

conda create --clone "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3" --name proclone
conda activate proclone
pip install -r requirements.txt

As you do not want to modify the original ESRI environment.


Using conda to run scheduled jobs in Windows is a little tricky. Here is an example of setting up a .bat file (which can be pointed to by the Windows Task Scheduler) to activate conda, switch to a new conda environment, and call a python script.

::: For log, showing date/time
echo:
echo --------------------------
echo %date% %time%
::: This sets the location of the script, as conda may change it
set "base=%cd%"
::: setting up conda in Windows, example Arc's conda activate
call "C:\Program Files\ArcGIS\Pro\bin\Python\Scripts\activate.bat"
::: activating a new environment
call conda activate proclone
::: running a python script
cd /d "%base%"
call python auto_script.py
echo --------------------------
echo:

Then, when I set up the script in the Windows Task Scheduler, I often have the log file at that same level. So in the task scheduler I will have the action as:

"script.bat" >> log.txt 2>&1

And set the option so the script runs from the location of script.bat. This will append both the normal output and the error output to the log file. So if something goes wrong, you can open log.txt and see what is up.


When working with arcpy, often you need to have tables inside of a geodatabase to use particular geoprocessing tools. Here is an example of taking an external csv file, and importing that file into a geodatabase as a table.

import arcpy
gdb = "./project/LEO_Tables.gdb"
tt = "TempTable"
arcpy.env.workspace = gdb

# Convert CSV into geodatabase
arcpy.TableToTable_conversion("YourData.csv",gdb,tt)
#arcpy.ListTables() # should show that new table

# convert time fields into text, useful for law enforcement management tools
time_fields = ['rep_date','begin','end']
for t in time_fields:
    new_field = f"{t}2"
    arcpy.management.AddField(tt,new_field,"TEXT")
    arcpy.management.CalculateField(tt,new_field,f"!{t}!.strftime('%Y/%m/%d %H:%M')", "PYTHON3") # %M is minutes, %m would be month

# This will show the new fields
#fn = [f.name for f in arcpy.ListFields(tt)]

When you create a new project, it automatically creates a geodatabase file to go along with that project. If you just want a standalone geodatabase though, you can use something like this in your python script:

import arcpy
import os

gdb = "./project/LEO_Tables.gdb"

# only create the geodatabase if it does not already exist
if not os.path.exists(gdb):
    loc, db = os.path.split(gdb)
    arcpy.management.CreateFileGDB(loc,db)

So if the geodatabase does not exist, this creates it. If it does exist, it does not worry about creating a new one.


One of the examples for automation is taking a basemap, updating some of the elements, and then exporting that map to an image or PDF. This sample code, using Dallas data, shows how to set up a project to do this. And here is the original map:

Because ArcGIS has so many different elements, the arcpy module tends to be quite difficult to navigate. Basically I try to separate out data processing (which often takes inputs and outputs them into a geodatabase) from visual things on a map. So this project has step 1, import data into a geodatabase, and step 2, update the map elements. Here that is the legend, title, copying symbology, etc.

You can go to the github project to download all of the data (including the aprx project file, as well as the geodatabase file). But here is the code to review.

import arcpy
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os

# Set environment to a particular project
gdb = "DallasDB.gdb"
ct = "TempCrimes"
ol = "ExampleCrimes"
nc = "New Crimes"
arcpy.env.workspace = gdb
aprx = arcpy.mp.ArcGISProject("DallasExample.aprx")
dallas_map = aprx.listMaps('DallasMap')[0]
temp_layer = f"{gdb}/{ct}"

# Load in data, set as a spatial dataframe
df = pd.read_csv('DallasSample.csv') # for a real project, will prob query your RMS
df = df[['incidentnum','lon','lat']]
sdf = pd.DataFrame.spatial.from_xy(df,'lon','lat', sr=4326)

# Add the feature class to the map, note this does not like missing data
sdf.spatial.to_featureclass(location=temp_layer)
dallas_map.addDataFromPath(os.path.abspath(temp_layer)) # it wants the abs path for this

# Get the layers, copy symbology from old to new
new_layer = dallas_map.listLayers(ct)[0]
old_layer = dallas_map.listLayers(ol)[0]
old_layer.visible = False
new_layer.symbology = old_layer.symbology
new_layer.name = nc

# Add into the legend, moving to top
layout = aprx.listLayouts("DallasLayout")[0]
leg = layout.listElements("LEGEND_ELEMENT")[0]
item_di = {f.name:f for f in leg.items}
leg.moveItem(item_di['Dallas PD Divisions'], item_di[nc], move_position='BEFORE')

# Update title in layout "TitleText"
txt = layout.listElements("TEXT_ELEMENT")
txt_di = {f.name:f for f in txt}
txt_di['TitleText'].text = "New Title"
# If you need to make larger, can do
#txt_di['TitleText'].elementWidth = 2.0

# Export to high res PNG file
layout.exportToPNG("DallasUpdate.png",resolution=500)

# Cleaning up, to delete the file in geodatabase, need to remove from map
dallas_map.removeLayer(new_layer)
arcpy.management.Delete(ct)

And here is the updated map:

Some notes on ESRI server APIs

Just a few years ago, most cities' open data sites were dominated by Socrata services. More recently though, cities have turned to ArcGIS servers to disseminate not only GIS data, but also plain tabular data. This post collates my notes on querying ESRI's APIs for these services. They are quite fast, have very generous return limits, and have the ability to do filtering/aggregation.

So first let's start with Raleigh's Open Data site, specifically the Police Incidents. Sometimes for data analysis you just want a point-in-time dataset, and can download 100% of the data (which you can do here, see the Download button in the below screenshot). But what I am going to show here is how to format queries to generate up-to-date information. This is useful in web applications, like dashboards.

So first, go down to the Blue button in the below screen that says I want to use this:

Once you click that, you will see a screen that lists several different options. Click to expand View API Resources, and then click the open in API explorer link:

To save a few steps, here are the original link and the API link side by side; you can see you just need to change explore to api in the url:

https://data-ral.opendata.arcgis.com/datasets/ral::daily-raleigh-police-incidents/explore
https://data-ral.opendata.arcgis.com/datasets/ral::daily-raleigh-police-incidents/api

Now this page has a form to fill in a query, but first check out the Query URL string on the right:

I am going to go into how to modify that URL in a bit to return different slices of data. But first check out the link https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services

I often find this simpler view easier for seeing all the available data than the open data websites with their extra fluff. You can often tell the different data sources right from the name (and often cities have more things available than they show on their open data site). But let's go to the Police Incidents Feature Server page, the link is https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0:

This gives you some meta-data (such as the fields and projection). Scroll down to the bottom of the page, and click the Query button, it will then take you to https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query:

I find this tool easier for formatting queries than the Open Data site. Here I put 1=1 in the Where field, set the Out Fields to *, and the Result record count to 3. I then hit the Query (GET) button.

This gives an annoyingly long url. And here are the resulting records:

So although this returns a very long url, most of the parameters in the url are empty. So you could have a more minimal url of https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query?where=1%3D1&outFields=*&resultRecordCount=3&f=json. (There I changed the format to json as well.)

In python, it is easier to work with the json or geojson output. So here I show how to query the data, and read it into a geopandas dataframe.

from io import StringIO
import geopandas as gpd
import requests

base = "https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query"
params = {"where": "1=1",
          "outFields": "*",
          "resultRecordCount": "3",
          "f": "geojson"}
res = requests.get(base, params=params)
gdf = gpd.read_file(StringIO(res.text)) # note I do not use res.json()

Now, the ESRI servers will not return a dataset that has 1,000,000 rows; they limit the output size. I have a gnarly function I have built over the years to do the pagination, fall back to json if geojson is not available, etc. Left otherwise uncommented.

from datetime import datetime
from io import StringIO
import geopandas as gpd
import numpy as np
import pandas as pd
import requests
import time
from urllib.parse import quote

def query_esri(base='https://services.arcgis.com/v400IkDOw1ad7Yad/arcgis/rest/services/Police_Incidents/FeatureServer/0/query',
               params={'outFields':"*",'where':"1=1"},
               verbose=False,
               limitSize=None,
               gpd_query=False,
               sleep=1):
    if verbose:
        print(f'Starting Queries @ {datetime.now()}')
    req = requests
    p2 = params.copy()
    # try geojson first, if fails use normal json
    if 'f' in p2:
        p2_orig_f = p2['f']
    else:
        p2_orig_f = 'geojson'
    p2['f'] = 'geojson'
    fin_url = base + "?"
    amp = ""
    for key,val in p2.items():
        fin_url += amp + key + "=" + quote(val)
        amp = "&"
    # First, getting the total count
    count_url = fin_url + "&returnCountOnly=true"
    if verbose:
        print(count_url)
    response_count = req.get(count_url)
    # If error, try using json instead of geojson
    if 'error' in response_count.json():
        if verbose:
            print('geojson query failed, going to json')
        p2['f'] = 'json'
        fin_url = fin_url.replace('geojson','json')
        count_url = fin_url + "&returnCountOnly=true"
        response_count2 = req.get(count_url)
        count_n = response_count2.json()['count']
    else:
        try:
            count_n = response_count.json()["properties"]["count"]
        except:
            count_n = response_count.json()['count']
    if verbose:
        print(f'Total count to query is {count_n}')
    # Getting initial query
    if p2_orig_f != 'geojson':
        fin_url = fin_url.replace('geojson',p2_orig_f)
    dat_li = []
    if limitSize:
        fin_url_limit = fin_url + f"&resultRecordCount={limitSize}"
    else:
        fin_url_limit = fin_url
    if gpd_query:
        full_response = gpd.read_file(fin_url_limit)
        dat = full_response
    else:
        full_response = req.get(fin_url_limit)
        dat = gpd.read_file(StringIO(full_response.text))
    # If too big, getting subsequent chunks
    chunk = dat.shape[0]
    if chunk == count_n:
        d2 = dat
    else:
        if verbose:
            print(f'The max chunk size is {chunk:,}, total rows are {count_n:,}')
            print(f'Need to do {np.ceil(count_n/chunk):,.0f} total queries')
        offset = chunk
        dat_li = [dat]
        remaining = count_n - chunk
        while remaining > 0:
            if verbose:
                print(f'Remaining {remaining}, Offset {offset}')
            offset_val = f"&cacheHint=true&resultOffset={offset}&resultRecordCount={chunk}"
            off_url = fin_url + offset_val
            if gpd_query:
                part_response = gpd.read_file(off_url)
                dat_li.append(part_response.copy())
            else:
                part_response = req.get(off_url)
                dat_li.append(gpd.read_file(StringIO(part_response.text)))
            offset += chunk
            remaining -= chunk
            time.sleep(sleep)
        d2 = pd.concat(dat_li,ignore_index=True)
    if verbose:
        print(f'Finished queries @ {datetime.now()}')
    # checking to make sure numbers are correct
    if d2.shape[0] != count_n:
        print(f'Warning! Total count {count_n} is different than queried count {d2.shape[0]}')
    # if geojson, just return
    if p2['f'] == 'geojson':
        return d2
    # if json, can drop geometry column
    elif p2['f'] == 'json':
        if 'geometry' in list(d2):
            return d2.drop(columns='geometry')
        else:
            return d2

And so, to get the entire dataset of crime data in Raleigh, it is then df = query_esri(verbose=True). It is pretty large, so here I show limiting the query:

params = {'where': "reported_date >= CAST('1/1/2025' AS DATE)", 
          'outFields': '*'}
df = query_esri(base=base,params=params,verbose=True)

This shows doing a datetime comparison by casting the input text to a date. Sometimes you have to do the opposite: cast a text field to a date, or extract values out of a date field.
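For example, something like the below (a sketch to adapt – the exact SQL functions supported vary by server, and rpt_date_text is a hypothetical field name):

# extract pieces out of a date field
params = {'where': "EXTRACT(YEAR FROM reported_date) = 2025",
          'outFields': '*'}

# or cast a text field to a date before comparing
params = {'where': "CAST(rpt_date_text AS DATE) >= DATE '2025-01-01'",
          'outFields': '*'}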

Example Queries

So I showed above how you can do a WHERE clause in the queries. You can do other stuff as well, such as getting aggregate counts. For example, here is a query that shows how to get aggregate statistics.

If you click the link, it will go to the query form on the ESRI webpage. And that form shows how to enter the output statistics fields.

And this produces counts of the total crimes in the database.

Here are a few additional examples I have saved in my notes:
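For instance, aggregate counts by group can be formed with the outStatistics and groupByFieldsForStatistics parameters (a sketch – crime_category is a hypothetical field name, check the layer metadata for real ones):

import requests

base = "https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query"
params = {"where": "1=1",
          "outStatistics": ('[{"statisticType":"count",'
                            '"onStatisticField":"OBJECTID",'
                            '"outStatisticFieldName":"n"}]'),
          "groupByFieldsForStatistics": "crime_category",
          "f": "json"}
res = requests.get(base, params=params)
print(res.json()["features"]) # one record per category with its count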

Do not use the query_esri function above for aggregate counts; just form the params and pass them into requests directly. The query_esri function is meant to return large sets of individual rows, and so can overwrite the params in unexpected ways.

Check out my Crime De-Coder LinkedIn page this week for other examples of using python + ESRI. This post is more for public data, but those are examples of using arcpy in different production scenarios. Later this week I will also post an updated blog here, for the LLMs to consume.

The story of my dissertation

My dissertation is freely available to read on my website (Wheeler, 2015). I still open up the hardcover copy I purchased every now and then. No one cites it, because no one reads dissertations, but it is easily the work I am the most proud of.

Most of the articles I write have some motivating story behind the work that you would never know about just from reading the words. I think this is important, as the story is often tied to some more fundamental problem, and solving specific problems is the main way we make progress in science. The stifling way academics currently write peer reviewed papers does not allow that extra narrative in.

For example, my first article (and what ended up being my masters thesis; at Albany at that time you could go directly into the PhD from undergrad and get your masters on the way) was about the journey to crime after people move (Wheeler, 2012). The story behind that paper: while I was working at the Finn Institute, Syracuse PD was interested in targeted enforcement of chronic offenders, many of whom drive around without licenses. I thought, why not look at the journey to crime to see where they are likely driving. When I did that analysis, I noticed a few hundred chronic offenders had something like five times as many home addresses as people in the sample. (If you are still wanting to know where they drive: they drive everywhere, chronic offenders have very wide spatial footprints.)

Part of the motivation behind that paper was if people move all the time, how can their home matter? They don’t really have a home. This is a good segue into the motivation of the dissertation.

More of my academic reading at that point had been on macro and neighborhood influences on crime. (Forgive me, as I am likely to get some of the timing wrong in my memory, but this writing is as best as I remember it.) I had a class with Colin Loftin that I do not remember the name of, but discussed things like the southern culture of violence, Rob Sampson’s work on neighborhoods and crime, and likely other macro work I cannot remember. Sampson’s work in Chicago made the biggest impression on me. I have a scanned copy of Shaw & McKay’s Juvenile Delinquency (2nd edition). I also took a spatial statistics class with Glenn Deane in the sociology department, and the major focus of the course was on areal units.

When thinking about the dissertation topic, the only advice I remember receiving was about scope. Shawn Bushway at one point told me about a stapler thesis (three independent papers bundled into a single dissertation). I just wanted something big, something important. I intentionally set out to try to answer some more fundamental question.

So I had the first inkling of “how can neighborhoods matter if people don't consistently live in the same neighborhood”? The second was that in my work at the Finn Institute with police departments, hot spots were the only thing any police department cared about. It is not uncommon even now for an academic to fit a spatial model of crime and demographics at the neighborhood level, and have a throwaway paragraph in the discussion about how it would help police better allocate resources. It is comically absurd – you can just count up crimes at addresses or street segments and rank them, and that will be a much more accurate and precise system (no demographics needed).

So I wanted to do work on micro level units of analysis. But I had on my dissertation committee Glenn and Colin – people very interested in macro and neighborhood level processes. So I would need to justify looking at small units of analysis. Reading the literature, Weisburd and Sherman did not, to me, have clearly articulated reasons to be interested in micro places beyond their utility for police. Sherman had the paper counting up crimes at addresses (Sherman et al., 1989), and none of Weisburd's work had, to me, any clear causal reasoning for looking at micro places to explain crime.

To be clear, wanting to look at small units as the only guidepost in choosing a topic is a terrible place to start (if others pursuing PhDs are reading). You should start from a more specific, articulable problem you wish to solve. But I did not have that level of clarity in my thinking at the time.

So I set out to articulate a reason to be interested in micro level areas that I thought would satisfy Glenn and Colin. I started out thinking about doing a simulation study, similar to what Stan Openshaw did (1984), motivated by Robinson's (1950) ecological fallacy. While doing that I realized there was no point in the simulation, you could figure it all out in closed form (as others have before me). So I proved that random spatial aggregation would not result in the ecological fallacy, but aggregating nearby spatial areas would, assuming there is spatial covariance between nearby areas. I thought at the time it was a novel proof – it was not (footnote 1 on page 9 lists things I read after the fact). Even now the Wikipedia page on the ecological fallacy has an unsourced overview of the issue, that cross-spatial correlations make the micro and macro equations unequal.
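You can see the flavor of the result in a little simulation (my own toy illustration of the point, not code from the dissertation). When x and the regression error share a spatially smooth component, random aggregation leaves the micro slope alone, while aggregating contiguous units changes it:

import numpy as np

rng = np.random.default_rng(10)
n, k = 10_000, 20 # micro units on a line, units per aggregate

# spatially smooth component shared by x and the error term
def smooth(z, w=25):
    return np.convolve(z, np.ones(w)/w, mode="same")

s = smooth(rng.normal(size=n))
x = s + rng.normal(size=n)           # x has a spatial component
y = 0.5*x + 2*s + rng.normal(size=n) # error also loads on s

def agg_slope(labels):
    xm = np.bincount(labels, weights=x)/np.bincount(labels)
    ym = np.bincount(labels, weights=y)/np.bincount(labels)
    return np.polyfit(xm, ym, 1)[0]

contig = np.repeat(np.arange(n//k), k) # nearby units grouped together
randm = rng.permutation(contig)        # same group sizes, random membership

print(np.polyfit(x, y, 1)[0]) # micro slope
print(agg_slope(randm))       # about the same as the micro slope
print(agg_slope(contig))      # noticeably different, aggregation bias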

This in and of itself is not interesting, but the process did clearly articulate to me why you want to look at micro units. The example I like to give is as follows – imagine you have a bar you think causes crime. The bar can cause crime inside the bar, as well as diffuse risk into the nearby area. Think people getting in fights in the bar, vs people being robbed walking away from a night of drinking. If you aggregate to large units of analysis, you cannot distinguish between “inside bar crime” vs “outside bar crime”. So that is a clear causal reason for choosing particular units of analysis – the ability to estimate diffusion/displacement effects is highly dependent on the spatial unit of analysis. If you have an intervention that is “make the bar hire better security” (a la John Eck's work), that should likely not have any impact outside the bar, only inside the bar. So local vs diffusion effects are not entirely academic, they can have specific real world implications.

This logic does not always explicitly value smaller spatial units of analysis though. Another example I like to give: say you are evaluating a city wide gun buy back. You could look at more micro areas than the entire city, e.g. see if crime decreased in neighborhood A and increased in neighborhood B, but that likely does not invalidate the macro city wide analysis. That is just an aggregate estimate over the entire city – which in some cases is preferable.

Glenn Deane at some point told me that I am a reductionist, which was the first time I heard that word, but it did encapsulate my thinking. You could always go smaller, there is no atom to stop at. But often it just doesn't matter – you could examine the differences in crime between the front stoop and the back porch, but there are not likely meaningful causal reasons to do so. This logic works for temporal aggregation and aggregating different crime types as well.

I would need to reread Great American City, but I did not take this to be necessarily contradictory to Sampson's work on neighborhood processes. Rob came to SUNY Albany to give a talk at the sociology department (I don't remember the year). Glenn invited me to whatever they were doing after the talk, and being a hillbilly I said I needed to go back to work at DCJS, I was on my lunch break. (To be clear, no one at DCJS would have cared.) I am sure I would not have been able to articulate anything of importance to him, but I do wish I had taken that opportunity in retrospect.

So with the knowledge of how aggregation bias occurs in hand, I formulated a few different empirical research projects. One was the bars and crime idea I have already given an example of. I had a few interesting findings, one of which is that the diffusion effects were larger than the local effects. I also estimated the bias of bars selecting into high crime areas via a non-equivalent dependent variable design – the only time I have used a DAG in any of my work.

I gave a job talk at Florida State before the dissertation was finished. I had this idea in the hotel room the night before my talk. It was a terrible idea to add it to my talk, and I did not sufficiently prepare what I was going to say, so it came out like a jumbled mess. I am not sure whether I would rather remember or forget that series of events (which includes me asking Ted Chiricos at dinner if you can fish in the Gulf of Mexico; I feel I am OK in one-on-one chats, but at group dinners I am more awkward than you can possibly imagine). It also included nice discussions though. Dan Mears asked me a question about emergent macro phenomena that I did not have a good answer to at the time; now I would say simple causal processes having emergent phenomena is a reason to look at the micro, not the macro. Eric Stewart asked me if there is any reason to look at neighborhoods, and I said no at the time, but I should have given my gun buy back analogy.

The second empirical study I took from broken windows theory (Kelling & Wilson, 1982). For the majority of social science theories, some spatial diffusion is to be expected. Broken windows theory though has a very clear spatial hypothesis – you need to see disorder for it to impact your behavior. So you do not expect spatial diffusion beyond someone's line of sight. To measure disorder, I used 311 calls (I had this idea before I read Dan O'Brien's work, see my prospectus, but Dan published his work on the topic shortly thereafter, O'Brien et al., 2015).

I confirmed this to be the case, conditional on controlling for neighborhood effects. I also discuss how, if the underlying process is smooth, using discrete neighborhood boundaries can result in negative spatial autocorrelation, which I show some evidence of as well.

This suggests that using a smooth measure of neighborhoods, like Hipp's idea of egohoods (Hipp et al., 2013), is probably more reasonable than discrete neighborhood boundaries (which are often quite arbitrary).

While I ended up publishing those two empirical applications (Wheeler, 2018; 2019), which was hard, I was too defeated to even worry about publishing a more specific paper on the aggregation idea. (I think I submitted that paper to Criminology, but it was not well received.) I was partially burned out from the bars and crime paper, which went through at least one R&R at Criminology and was still rejected. And then I went through four rejections for the 311 paper. I had at that point multiple other papers that took years to publish. It is a slog, and degrading, to be rejected so much.

But that is really my only substantive contribution to theoretical criminology in any guise. After the dissertation, I just focused on either policy work or engineering/method applications. Which are much easier to publish.


Reducing folium map sizes

Recently for a crimede-coder project I have been building out a custom library to make nice leaflet maps using the python folium library. See the example I have posted on my website. Below is a screenshot:

This map ended up having around 3,000 elements in it, and was a total of 8mb. 8mb is not crazy to put on a website, but it is at the stage where you can actually notice latency when the map first renders.

Looking at the rendered html code though, it was verbose in a few ways for every element. One is that lat/lon are given crazy precision by default, e.g. [-78.83229390597961, 35.94592660794455], and a single polygon can have many of those. Six digits of precision for lat/lon is still under 1 meter, which is plenty sufficient for my mapping applications. So you can shave 8+ characters per coordinate and not really make a difference to the map (you can technically create invalid polygons doing this, but that is really pedantic and should be fine).

A second part is that in the rendered folium html every object is given a full uuid, e.g. geo_json_a19eff2648beb3d74760dc0ddb58a73d.addTo(feature_group_2e2c6295a3a1c7d4c8d57d001c782482);. This again is not necessary, so I reduce the 32 character uuids to their first 8 alphanumeric characters.

A final part is that the javascript is not minified – it has quite a few extra lines/spaces that are not needed. So here are my notes on using python code to take care of some of those pieces.

To clean up the precision for geometry objects, I do something like this.

import re

# geo is the geopandas dataframe
redg = geo.geometry.set_precision(10**-6).to_json()
# redg still has floats, below regex clips values
rs = r'(\d{2}\.|-\d{2}\.)(\d{6})(\d+)'
redg = re.sub(rs,r'\1\2',redg) # assign the result back to redg

As most of my functions add the geojson objects to the map one at a time (for custom actions/colors), this is sufficient for that step (for markers, you can round the lat/lon directly). It may make more sense to set the precision to 10**-5 and then clip with the regex. (For these regexes there is some risk they will replace something they should not; I think they will be pretty safe though.)

Then to clean up the UUIDs and extra whitespace, what I do is render the final HTML and then use regexes:

# fol is the folium object
html = fol.get_root()
res = html.script.get_root().render()
# replace UUID with first 8
ru = r'([0-9a-f]{8})[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}'
res = re.sub(ru,r'\1',res)
# clean up whitespace
rl = []
for s in res.split('\n'):
    ss = s.strip()
    if len(ss) > 0:
        rl.append(ss)
rlc = '\n'.join(rl)

There is probably a smarter way to handle the UUIDs directly in the folium object. The whitespace though has to be cleaned after the HTML is written. You want to be careful with the whitespace step – it is possible you wanted blank lines in, say, a leaflet popup or tooltip. But for my purposes they are not necessary.

Doing these two steps on the Durham map reduces the size of the rendered HTML from 8mb to 4mb. So it cut the file by around 4 million characters! The savings will be even higher for maps with more elements.

One last part is that my map has redundant svg inserted for the map markers. I may be able to use css to insert the svg, e.g. something like .mysvg {background-image: url("vector.svg");} in the css, and then have the python code for the marker insert <div class="mysvg"></div>. For dense point maps this will also save quite a few characters. (You could add javascript to insert the svg as well, although I think that would be a bit sluggish relative to the css approach, at least on first render.)
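A sketch of how that might look in folium (untested, and the class and file names are just placeholders):

import folium

fol = folium.Map(location=[35.99, -78.90], zoom_start=13)

# serve the svg once via css, so each marker only carries a small stub
css = ('<style>.mysvg{background-image:url("vector.svg");'
       'width:24px;height:24px;background-size:contain;}</style>')
fol.get_root().html.add_child(folium.Element(css))

# each marker is now a short div instead of a full inline svg
icon = folium.features.DivIcon(html='<div class="mysvg"></div>')
folium.Marker([35.99, -78.90], icon=icon).add_to(fol)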

I have not done this yet, as I need to tinker with getting the background svg to look how I want, but it could save another 200-300 characters per marker icon. So I am guessing it will save a megabyte in the map for every 3000-5000 markers.

The main reason I post web demos on the crimede-coder site is that there are quite a few grifters in the tech space. Not just for data analysis, but for front-end development as well. I post stuff like that so you can go and actually see the work I do and its quality. There are quite a few people now claiming to be “data viz experts” who just embed mediocre Tableau or PowerBI apps. Those apps in particular tend to produce very bad maps, so here you can see what I think a good map should look like.

If you want to check out all the interactions in the map, I posted a YouTube video walking through them:

Durham hotspot map walkthrough of interactions

Harmweighted hotspots, using ESRI python API, and Crime De-Coder Updates

I haven't gotten the time to publish a blog post here in a while. There has been a ton of stuff I have put out on my Crime De-Coder website recently. For some samples since I last mentioned it here, I have published four blog posts:

  • on what AI regulation in policing would look like
  • high level advice on creating dashboards
  • overview of early warning systems for police
  • types of surveys for police departments

For surveys, a few different groups have reached out to me in regards to the NIJ measuring attitudes solicitation (which is essentially a follow up to the competition Gio and I won). So get in touch if interested (whether a PD or a research group); I may try to coordinate everyone to have one submission instead of several competing ones.

To keep up with everything, my suggestion is to sign up for the RSS feed on the site. If you want an email, use the If This Then That service. (I may have to stop doing my AltAc newsletter emails, it is so painful to send 200 emails and I really don't care to sign up for another paid service to do that.)

I have also continued the AltAc newsletter. Getting started with LLMs, using secrets, advice on HTML – all sorts of little pieces of advice every other week.

I have created a new page for presentations, including my recent presentation at the Carolina Crime Analysis Association Conference. (Pic courtesy of Joel Caplan, who was repping his Simsi product – thank you Joel!)

If other regional IACA groups are interested in a speaker always feel free to reach out.

And finally, a new demo on creating a static report using quarto/python. It uses a word template I created. (I often like generating word documents, which are easier to edit post-hoc; it is ok to automate 90% and still need a few more tweaks.)

Harmweighted Hotspots

If you like this blog, also check out Iain Agar's posts – GIS/SQL/crime analysis, the good stuff. Here I wanted to make a quick note about his post on weighting crime harm spots.

So the idea is that when mapping harm spots, you could have two different areas with the same high harm, but say one location had 1 murder and the other had 100 thefts. If the murder harm weight = 100 and the theft harm weight = 1, they would be equal in weight. Iain talks about different transformations of harm, but another way to think about it is in terms of variance. Here, assuming Poisson variance (although in practice that is not necessary, you could estimate the variance given enough historical time series data), you would have for your two hotspots:

Hotspot1: mean 1 homicide, variance 1
Hotspot2: mean 100 thefts, variance 100

Weight of 100 for homicides, 1 for theft

Hotspot1: Harmweight = 1*100 = 100
          Variance = 100^2*1 = 10,000
          SD = sqrt(10,000) = 100

Hotspot2: Harmweight = 100*1 = 100
          Variance = 1^2*100 = 100
          SD = sqrt(100) = 10

When you multiply by a constant, which is what you are doing when multiplying by harm weights, the relationship with variance is Var(const*x) = const^2*Var(x). The harm weights add variance, so you may simply add a penalty term, or rank by something like Harmweight - 2*SD (the lower end of the harm CI). In this example, the low end of the CI for Hotspot1 is 100 - 2*100 = -100, which you can clip to 0 since harm cannot be negative, while the low end for Hotspot2 is 100 - 2*10 = 80. So you would rank Hotspot2 higher, even though they have the same point estimate of harm.

The rank by low CI is a trick I learned from Evan Miller’s blog.

You could fancy this up more with estimating actual models, having multiple harm counts, etc. But this is a quick way to do it in a spreadsheet with just simple counts (assuming Poisson variance), which I think is often quite reasonable in practice.
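As a quick sketch of that calculation (my own illustration using the hypothetical counts from above, Poisson variance assumed):

import pandas as pd

dat = pd.DataFrame({'hotspot': ['Hotspot1', 'Hotspot2'],
                    'homicide': [1, 0],
                    'theft': [0, 100]})
weights = {'homicide': 100, 'theft': 1}

dat['harm'] = sum(dat[c]*w for c, w in weights.items())
dat['var'] = sum(dat[c]*w**2 for c, w in weights.items()) # Var(w*x) = w^2*Var(x)
dat['low_ci'] = (dat['harm'] - 2*dat['var']**0.5).clip(lower=0)

# Hotspot2 ranks higher on the lower CI, 80 vs 0
print(dat.sort_values('low_ci', ascending=False))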

Using ESRI Python API

So I knew you could use python in ESRI; they have a notebook interface now. What I did not realize is that now with Pro you can simply do pip install arcgis, and then just interact with your org. For a quick example:

from arcgis.gis import GIS

# Your ESRI url
gis = GIS("https://modelpd.maps.arcgis.com/", username="user_email", password="???yourpassword???")
# For batch geocoding, probably need to do GIS(api_key=<your api key>)

This can be in whatever environment you want, so you don't even need ArcGIS installed on the system to use this; it is all web APIs with Pro. To geocode for example, you would then do:

from arcgis.geocoding import geocode, Geocoder, get_geocoders, batch_geocode

# Can search to see if any nice soul has published a geocoding server

arcgis_online = GIS()
items = arcgis_online.content.search('geocoder north carolina', 'geocoding service', max_items=30)

# And we have four
#[<Item title:"North Carolina Address Locator" type:Geocoding Layer owner:ecw31_dukeuniv>,
# <Item title:"Southeast North Carolina Geocoding Service" type:Geocoding Layer owner:RaleighGIS>, 
# <Item title:"Geocoding Service - AddressNC " type:Geocoding Layer owner:nconemap>, 
# <Item title:"ArcGIS World Geocoding Service - NC Extent" type:Geocoding Layer owner:NCDOT.GOV>]

geoNC = Geocoder.fromitem(items[0]) # lets try Duke
#geoNC = Geocoder.fromitem(items[-1]) # NCDOT.GOV
# can also do directly from URL
# via items[0].url
# url = 'https://utility.arcgis.com/usrsvcs/servers/8caecdf6384144cbafc9d56944af1ccf/rest/services/World/GeocodeServer'
# geoNC = Geocoder(url,gis)

# DPAC
res = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1)
print(res[0])

Note you cannot cache the geocoding results this way. To do that, you need to use credits, and probably sign in via a token and not a username/password.

# To cache, need a token
r2 = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1,for_storage=True)

# If you have multiple addresses, use batch_geocode, again need a token
#dc_res = batch_geocode(FullAddressList, geocoder=geoNC) 

Geocoding to this day is still such a pain. I will need to figure out if you can make a local geocoding engine with ESRI and then call that through Pro (I mean, I know you can, but I am not sure about the pricing for all of that).

Overall, being able to work directly in python makes my life so much easier. I will need to dig more into making some standard dashboards and ETL processes using ESRI's tools.

I have another post that has been half finished about using the ESRI web APIs, hopefully will have time to put that together before another 6 months passes me by!

Won NIJ competition on surveys

The submission Gio and I put together, Using Every Door Direct Mail Web Push Surveys and Multi-level modelling with Post Stratification to estimate Perceptions of Police at Small Geographies, has won the NIJ Innovations in Measuring Community Attitudes Survey challenge.

Specifically, we took 1st in the non-probability section of the competition. The paper has the details, but using every door direct mail + post-stratifying the estimates is the approach we advocate. If you are a city or research group interested in implementing this and need help, feel free to get in touch.

Of course if you want to do this yourself go for it (part of the reason it won was because the method should be doable for many agencies in house), but letting me and Gio know we were the inspiration is appreciated!

Second, for recruiting criminology PhDs, CRIME De-Coder has teamed up with the open access CrimRXiv consortium.

This example shows professor adverts, but I think the best value add is for more advanced local govt positions. These days many of those civil service gigs are very competitive with lagging professor salaries.

For people hiring advanced roles, there are two opportunities. One is advertising – for about the same amount as advertising on LinkedIn, you can publish a job advert. This is much more targeted than LinkedIn, so if you want PhD talent this is a good deal to get your job posting in front of the right eyeballs.

The second service is recruiting for a position. This is commission based – if I place a candidate in the role, then you pay the recruiter (me and CrimRXiv) a commission. For that, I personally reach out to my network of people with PhDs interested in positions, and do the first round of vetting for your role.

Third, over on Crime De-Coder I have another round of the newsletter up. Advice this round is that many smaller cities have good up and coming tech markets, plus advice about making fonts larger in python/R plots. (Note, in response to that post, Greg Ridgeway says it is better to save as vector graphics as opposed to high res PNG. Vector is slightly more work to check everything is kosher in the final produced plot, but that is good advice from Greg. I am lazy with the PNG advice.)

No more newsletters this year, but let me know if you want to sign up and I will add you to the list.

Last little tidbit: in the past I have used the pdftk tool to combine multiple PDFs together. This is useful when using other tools to create documents, so you have multiple outputs in the end (like a cover page or tech appendix) and want to combine them all into a single PDF to share. But one thing I noticed recently: if your PDF has internal table of contents (TOC) links (as is the case for LaTeX, or in my case a document built using Quarto), using pdftk will make the TOC links go away. You can however use ghostscript instead, and the links still work as normal.

On my windows machine, it looks something like:

gswin64 -q -sDEVICE=pdfwrite -o MergedDoc.pdf CoverPage.pdf Main.pdf Appendix.pdf

A few differences from what you will find if you just google: installing the 64 bit version on my windows machine, the executable is gswin64, not gs, at the command line. Second, I needed to manually add C:\Program Files\gs\gs10.02.1\bin to my PATH for this to work at the command prompt the way you would expect; installing did not do that directly.

Quarto is awesome by the way – definitely suggest people go check that out.

Youtube interview with Manny San Pedro on Crime Analysis and Data Science

I recently did an interview with Manny San Pedro on his YouTube channel, All About Analysis. We discuss various data science projects I conducted while either working as an analyst, or in a researcher/collaborator capacity with different police departments:

Here is an annotated breakdown of the discussion, as well as links to various resources I mention in the interview. This is not a replacement for listening to the video, but it is an easier set of notes for linking to more material on the particular items I discuss.

0:00 – 1:40, Intro

For a rundown of my career: I went to do my PhD at Albany (08-15). During that time period I worked as a crime analyst at Troy, NY, as well as a research analyst for my advisor (Rob Worden) at the Finn Institute. My research focused on quant projects with police departments (predictive modeling and operations research). In 2019 I went to the private sector, and now work as an end-to-end data scientist in the healthcare sector working with insurance claims.

You can check out my academic and my data science CV on my about page.

I discuss the workshop I did at the IACA conference in 2017 on temporal analysis in Excel.

Long story short, don’t use percent change, use other metrics and line graphs.

7:30 – 13:10, Patrol Beat Optimization

I have the paper and code available to replicate my work with Carrollton PD on patrol beat optimization with workload equality constraints.

For analysts looking to teach themselves linear programming, I suggest Hillier's book. I also give examples of linear programming on this blog.

It is different than statistical analysis, but I believe it has as much applicability to crime analysis as your more typical statistical analysis.

13:10 – 14:15, Million Dollar Hotspots

There are hotspots of crime so concentrated that the expected reduction in labor costs from having officers assigned full time likely offsets the cost of the position. E.g. if you spend a million dollars in labor addressing crime at that location, and having a full time officer reduces crime by 20%, the return on investment for the hotspot breaks even with paying the officer's salary.

I call these Million dollar hotspots.

14:15 – 28:25, Prioritizing individuals in a group violence intervention

Here I discuss my work on social network algorithms to prioritize individuals to spread the message in a focused deterrence intervention. This is the opposite of how many people view “spreading” in a network: I identify something good I want to spread, and seed the network in a way that optimizes that spread.

I also have a primer on SNA, which discusses how crime analysts typically define nodes and edges using administrative data.

Listen to the interview as I discuss more general advice – in SNA, what you want to accomplish in the end matters for how you define the network. So I discuss how you may want to define edges via victimization to prevent retaliatory violence (I think that would make sense for violence interrupters to be proactive, for example).

I also give an example of how detective case allocation may make sense to base on SNA – detectives have background with an individual's network (e.g. have a rapport with a family based on prior cases worked).

28:25 – 33:15, Be proactive as an analyst and learn to code

Here Manny asked the question of how analysts prevent their role from being turned into a more administrative one (just getting requests and running simple reports). I think the solution to this (not just in crime analysis, but also as an analyst in the private sector) is to be proactive. You shouldn't wait for someone to ask you for specific information; you need to define your own role and conduct analysis on your own.

He also asked about crime analysis being under-used in policing. I think being stronger at computer coding opens up so many opportunities that learning python, R, and SQL is the area where I would like to see stronger skills across the industry. And this is a good career investment, as it translates to private sector roles.

33:15 – 37:00, How ChatGPT can be used by crime analysts

I discuss how ChatGPT may be used by crime analysts to summarize qualitative incident data and help inform responses. (Check out the example by Andreas Varotsis.)

To be clear, I think this is possible, but I don't think the tech is quite up to that standard yet. Also, do not submit LEO sensitive data to OpenAI!

Also always feel free to reach out if you want to nerd out on similar crime analysis questions!