All posts in category Python

I translated my book for $7 using openai

The other day an officer from the French Gendarmerie commented that they use my python for crime analysis book. I asked that individual, and he stated they all speak English. But given my book is written in plain text markdown and compiled using Quarto, it is not that difficult to pipe the text through a tool to translate it to other languages. (Knowing that epubs under the hood are just html, it would not suprise me if there is some epub reader that can use google translate.)

So you can see now I have available in the Crime De-Coder store four new books:

French version (paperback and epub)
Spanish version (paperback and epub)

ebook versions are normally $39.99, and print is $49.99 (both available worldwide). For the next few weeks, can use promo code translate25 (until 11/15/2025) to purchase epub versions for $19.99.

If you want to see a preview of the books first two chapters, here are the PDFs:

And here I added a page on my crimede-coder site with testimonials.

As the title says, this in the end cost (less than) $7 to convert to French (and ditto to convert to Spanish).

Here is code demo’ing the conversion. It uses OpenAI’s GPT-5 model, but likely smaller and cheaper models would work just fine if you did not want to fork out $7. It ended up being a quite simple afternoon project (parsing the markdown ended up being the bigger pain).

So the markdown for the book in plain text looks like this:

It ends up that because markdown uses line breaks to denote different sections, that ends up being a fairly natural break to do the translation. These GenAI tools cannot repeat back very long sequences, but a paragraph is a good length. Long enough to have additional context, but short enough for the machine to not go off the rails when trying to just return the text you input. Then I just have extra logic to not parse code sections (that start/end with three backticks). I don’t even bother to parse out the other sections (like LaTeX or HTML), and I just include in the prompt to not modify those.

So I just read in the quarto document, split by “”, then feed in the text sections into OpenAI. I did not test this very much, just use the current default gpt-5 model with medium reasoning. (It is quite possible a non-reasoning smaller model will do just as well. I suspect the open models will do fine.)

You will ultimately still want someone to spot check the results, and then do some light edits. For example, here is the French version when I am talking about running code in the REPL, first in English:

Running in the REPL

Now, we are going to run an interactive python session, sometimes people call this the REPL, read-eval-print-loop. Simply type python in the command prompt and hit enter. You will then be greeted with this screen, and you will be inside of a python session.

And then in French:

Exécution dans le REPL

Maintenant, nous allons lancer une session Python interactive, que certains appellent le REPL, boucle lire-évaluer-afficher. Tapez simplement python dans l’invite de commande et appuyez sur Entrée. Vous verrez alors cet écran et vous serez dans une session Python.

So the acronym is carried forward, but the description of the acronym is not. (And I went and edited that for the versions on my website.) But look at this section in the intro talking about GIS:

There are situations when paid for tools are appropriate as well. Statistical programs like SPSS and SAS do not store their entire dataset in memory, so can be very convenient for some large data tasks. ESRI’s GIS (Geographic Information System) tools can be more convenient for specific mapping tasks (such as calculating network distances or geocoding) than many of the open source solutions. (And ESRI’s tools you can automate by using python code as well, so it is not mutually exclusive.) But that being said, I can leverage python for nearly 100% of my day to day tasks. This is especially important for public sector crime analysts, as you may not have a budget to purchase closed source programs. Python is 100% free and open source.

And here in French:

Il existe également des situations où les outils payants sont appropriés. Les logiciels statistiques comme SPSS et SAS ne stockent pas l’intégralité de leur jeu de données en mémoire, ils peuvent donc être très pratiques pour certaines tâches impliquant de grands volumes de données. Les outils SIG d’ESRI (Système d’information géographique) peuvent être plus pratiques que de nombreuses solutions open source pour des tâches cartographiques spécifiques (comme le calcul des distances sur un réseau ou le géocodage). (Et les outils d’ESRI peuvent également être automatisés à l’aide de code Python, ce qui n’est pas mutuellement exclusif.) Cela dit, je peux m’appuyer sur Python pour près de 100 % de mes tâches quotidiennes. C’est particulièrement important pour les analystes de la criminalité du secteur public, car vous n’avez peut‑être pas de budget pour acheter des logiciels propriétaires. Python est 100 % gratuit et open source.

So it translated GIS to SIG in French (Système d’information géographique). Which seems quite reasonable to me.

I paid an individual to review the Spanish translation (if any readers are interested to give me a quote for the French version copy-edits, would appreciate it). She stated it is overall very readable, but just has many minor things. Here is a a sample of suggestions:

Total number of edits she suggested were 77 (out of 310 pages).

If you are interested in another language just let me know. I am not sure about translation for the Asian languages, but I imagine it works OK out of the box for most languages that are derivative of Latin. Another benefit of self-publishing, I can just have the French version available now, but if I am able to find someone to help with the copy-edits I will just update the draft after I get their feedback.

Leave a comment

by Andy Wheeler on October 25, 2025 • Permalink

Posted in Crime Analysis, Crime Mapping, data science, Python, writing

Tagged book, epub, generative-ai, translation

Posted by Andy Wheeler on October 25, 2025

https://andrewpwheeler.com/2025/10/25/i-translated-my-book-for-7-using-openai/

I scraped the Crime solutions site

Before I get to the main gist, I am going to talk about another site. The National Institute of Justice (NIJ) paid RTI over $10 million dollars to develop a forensic technology center of excellence over the past 5 years. While this effort involved more than just a website, the only thing that lives in perpetuity for others to learn from the center of excellence are the resources they provide on the website.

Once funding was pulled, this is what RTI did with those resources:

The website is not even up anymore (it is probably a good domain to snatch up if no one owns it anymore), but you can see what it looked like on the internet archive. It likely had over 1000+ videos and pages of material.

I have many friends at RTI. It is hard for me to articulate how distasteful I find this. I understand RTI is upset with the federal government cuts, but to just simply leave the website up is a minimal cost (and likely worth it to RTI just for the SEO links to other RTI work).

Imagine you paid someone $1 million dollars for something. They build it, and then later say “for $1 million more, I can do more”. You say ok, then after you have dispersed $500,000 you say “I am not going to spend more”. In response, the creator destroys all the material. This is what RTI did, except it was they had been paid $11 million and they were still to be paid another $1 million. Going forward, if anyone from NIJ is listening, government contracts to build external resources should be licensed in a way that prevents that from happening.

And this brings me to the current topic, CrimeSolutions.gov. It is a bit of a different scenario, as NIJ controls this website. But recently they cut funding to the program, which was administered by DSG.

Crime Solutions is a website where they have collected independent ratings of research on criminal justice topics. To date they have something like 800 ratings on the website. I have participated in quite a few, and I think these are high quality.

To prevent someone (for whatever reason) simply turning off the lights, I scraped the site and posted the results to github. It is a PHP site under the hood, but changing everything to run as a static HTML site did not work out too badly.

So for now, you can view the material at the original website. But if that goes down, you have a close to same functional site mirrored at https://apwheele.github.io/crime-solutions/index.html. So at least those 800 some reviews will not be lost.

What is the long term solution? I could be a butthead and tomorrow take down my github page (so clone it locally), so me scraping the site is not really a solution as much as a stopgap.

Ultimately we want a long term, public, storage solution that is not controlled by a single actor. The best solution we have now is ArDrive via the folks from Arweave. For a one time upfront purchase, Arweave guarantees the data will last a minimum of 200 years (they fund an endowment to continually pay for upkeep and storage costs). If you want to learn more, stay tuned, as me and Scott Jacques are working on migrating much of the CrimRXiv and CrimConsortium work to this more permanent solution.

Leave a comment

by Andy Wheeler on September 26, 2025 • Permalink

Posted in Criminal Justice, data science, Python

Tagged scraping

Posted by Andy Wheeler on September 26, 2025

https://andrewpwheeler.com/2025/09/26/i-scraped-the-crime-solutions-site/

Reloading classes in python and shared borders

For some housekeeping, if you are not signed up, also make sure to sign up for the RSS feed of my crime de-coder blog. I have not been cross posting here consistently. For the last few posts:

Using the SPPT to Monitor Changes in Crime (soft launching my crimepy python package)
Cost-benefit analysis of Gun Shot Detection Tech (this is a repost from the Criminal Justician series on the American Society of Evidence Based Policing)
Using LLMs to Extract Data from Text
Finding outliers in proportions

For ASEBP, conference submissions for 2026 are open. (I will actually be going to this in 2026, submitted a 15 minute talk on planning experiments.)

Today will just be a quick post on two pieces of code I thought might be useful to share. The first is useful for humans, when testing code in functions, you can use the importlib library to reload functions. That is, imagine you have code:

import crimepy as cpy

test = cpy.func1(....)

And then when you run this, you see that func1 has an error. You can edit the source code, and then run:

from importlib import reload

reload(cpy)
test = cpy.func1(....)

This is my preferred approach to testing code. Note you need to import a library and reload the library. Not from crimepy import *, this will not work unfortunately.

Recently was testing out code that took quite a while to run, my Patrol Districting. This is a class, and I was editing my methods to make maps. The code itself takes around 5 minutes to run it through when remaking the entire class. What I found was a simpler approach, I can dump out the file to pickle, then reload the library, then load the pickle object. The may the pickle module works, it pulls the method definition from the global environment (pickle just saves the dict items under the hood). So the code looked like this:

from crimepy import pmed
...
pmed12 = pmed.pmed(...)
pmed12.map_plot() # causes error

And then instead of using the reload method as is (which would require me to create an entirely new object), use this approach for de-bugging:

# save file
import pickle
with open('pmed12.pkl', 'wb') as file:
    # Dump data with highest protocol for best performance
    pickle.dump(pmed12, file)

from importlib import reload
# edit method
reload(pmed)

# reload the object
with open('pmed12.pkl', 'rb') as file:
    # Load the pickled data
    pmed12_new = pickle.load(file)

# retest the method
pmed12_new.map_plot()

Writing code itself is often not the bottleneck – testing is. So figuring out ways to iterate testing faster is often worth the effort (I might have saved a day or two of work if I did this approach sooner when debugging that code).

The second code snippet is useful for the machines; I have been having Claude help me write quite a bit of the crimepy work. Here was one though it was having trouble with – calculating the shared border length between two polygons. Basically it went down an overly complicated path to get the exact calculation, whereas here I have an approximation using tiny buffers that works just fine and is much simpler.

def intersection_length(poly1,poly2,smb=1e-15):
    '''
    Length of the intersection between two shapely polygons
    
    poly1 - shapely polygon
    poly2 - shapely polygon
    smb - float, defaul 1e-15, small distance to buffer
    
    The way this works, I compute a very small buffer for
    whatever polygon is simpler (based on length)
    then take the intersection and divide by 2
    so not exact, but close enough for this work
    '''
    # buffer the less complicated edge of the two
    if poly1.length > poly2.length:
        p2, p1 = poly1, poly2
    else:
        p1, p2 = poly1, poly2
    # This basically returns a very skinny polygon
    pb = p1.buffer(smb,cap_style='flat').intersection(p2)
    if pb.is_empty:
        return 0.0
    elif hasattr(pb, 'length')
        return (pb.length-2*smb)/2
    else:
        return 0.0

And then for some tests:

from shapely.geometry import Polygon

poly1 = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])  # Rectangle
poly2 = Polygon([(2, 1), (6, 1), (6, 4), (2, 4)])  # Overlapping rectangle

intersection_length(poly1,poly2) # should be close to 0

poly3 = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
poly4 = Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])

intersection_length(poly3,poly4) # should be close to 2

poly5 = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
poly6 = Polygon([(2, 0), (4, 0), (4, 3), (1, 3), (1, 2), (2, 2)])

intersection_length(poly5,poly6) # should be close to 3

poly7 = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
poly8 = Polygon([(3, 0), (5, 0), (5, 2), (3, 2)])

intersection_length(poly7,poly8) # should be 0

Real GIS data often has imperfections (polygons that do not perfectly line up). So using the buffer method (and having an option to increase the buffer size) can often help smooth out those issues. It will not be exact, but the inexactness we are talking about will often be to well past the 10th decimal place.

Leave a comment

by Andy Wheeler on August 26, 2025 • Permalink

Posted in Crime Analysis, data science, Mapping, Python

Posted by Andy Wheeler on August 26, 2025

https://andrewpwheeler.com/2025/08/26/reloading-classes-in-python-and-shared-borders/

Build Stuff

I have had this thought in my head for a while – criminology research to me is almost all boring. Most of the recent advancement in academia is focused on making science more rigorous – more open methods, more experiments, stronger quasi-experimental designs. These are all good things, but to me still do not fundamentally change the practical implementation of our work.

Criminology research is myopically focused on learning something – I think this should be flipped, and the emphasis be on doing something. We should be building things to improve the crime and justice system.

How criminology research typically goes

Here is a screenshot of the recent articles published in the Journal of Quantitative Criminology. I think this is a pretty good cross-section of high-quality, well-respected research in criminology.

Three of the four articles are clearly ex-ante evaluations of different (pretty normal) policies/behavior by police and their subsequent downstream effects on crime and safety. They are all good papers, and knowing how effective a particular policy works (like stop and frisk, or firearm seizures) are good! But they are the literal example where the term ivory tower comes from – these are things happening in the world, and academics passively observe and say how well they are working. None of the academics in those papers were directly involved in any boots on the ground application – they were things normal operations the police agencies in question were doing on their own.

Imagine someone said “I want to improve the criminal justice system”, and then “to accomplish this, I am going to passively observe what other people do, and tell them if it is effective or not”. This is almost 100% of what academics in criminology do.

The article on illicit supply chains is another one that bothers me – it is sneaky in the respect that many academics would say “ooh that is interesting and should be helpful” given its novelty. I challenge anyone to give a concrete example of how the findings in the article can be directly useful in any law enforcement context. Not hypothetical, “can be useful in targeting someone for investigation”, like literal “this specific group can do specific X to accomplish specific Y”. We have plenty of real problems with illicit supply chains – drug smuggling in and out of the US (recommend the contraband show on Amazon, who knew many manufactures smuggle weed from US out to the UK!). Fentanyl or methamphetamine production from base materials. Retail theft groups and selling online. Plenty of real problems.

Criminology articles tend to be littered with absurdly vague accusations that they can help operations. They almost always cannot.

So we have articles that are passive evaluations of policies other people thought up. I agree this is good, but who exactly comes up with the new stuff to try out? We just have to wait around and hope other people have good ideas and take the time to try them out. And then we have theoretical articles larping as useful in practice (since other academics are the ones reviewing the papers, and no one says “erm, that is nice but makes no sense for practical day to day usage”).

Some may say this is the way science is supposed to work. My response to that is I don’t know dude, go and look at what folks are doing in the engineering or computer science or biology department. They seem to manage both theoretical and practical advancements at the same time just fine and dandy.

Well what have you built Andy?

It is a fair critique if you say “most of your work is boring Andy”. Most of my work is the same “see how a policy works from the ivory tower”, but a few are more “build stuff”. Examples of those include:

making patrol areas (also see fairness constraints in patrol)
prioritization for call-ins (an actual algorithm to identify whom to call in in a focused deterrence intervention, not a hypothetical scenario)
simple rules for flagging outlying cases when monitoring crime series

In the above examples, the one that I know has gotten the most traction are simple rules to identify crime spikes. I know because I have spent time demonstrating that work to various crime analysts across the country, and so many have told me “I use your Poisson Z-score Andy”. (A few have used the patrol area work as well, so I should be in the negative for carbon generation.)

Papers are not what matter though – papers are a distraction. The applications are what matter. The biggest waste currently in academic criminology work is peer reviewed papers. Our priorities as academics are totally backwards. We are evaluated on whether we get a paper published, we should be evaluated on whether we make the world a better place. Papers by themselves do not make the world a better place.

Instead of writing about things other people are doing and whether they work, we should spend more of our time trying to create things that improve the criminal justice system.

Some traditional academics may not agree with this – science is about formulating and testing hypotheses. This need not be in conflict with doing stuff. Have a theory about human nature, what better way to prove the theory than building something to attempt to change things for the better according to your theory. If it works in real life to accomplish things people care about guess what – other people will want to do it. You may even be able to sell it.

Examples of innovations I am excited about

Part of what prompted this was I was talking to a friend, and basically none of the things we were excited about have come from academic criminologists. I think a good exemplar of what I mean here is Anthony Tassone, the head of Truleo. To be clear, this is not a dig but a compliment, following some of Anthony’s posts on social media (LinkedIn, X), he is not a Rhodes Scholar. He is just some dude, building stuff for criminal justice agencies mostly using the recent advancements in LLMs.

For a few other examples of products I am excited about how they can improve criminal justice (I have no affiliations with these beyond I talk to people). Polis for evaluating body worn camera feeds. Dan Tatenko for CaseX is building an automated online crime reporting system that is much simpler to use. The folks at Carbyne (for 911 calls) are also doing some cool stuff. Matt White at Multitude Insights is building a SaaS app to better distribute BOLOs.

The folks at Polis (Brian Lande and Jon Wender) are the only two people in this list that have anything remotely to do with academic criminology. They each have PhDs (Brian in sociology and Jon in criminology). Although they were not tenure track professors, they are former/current police officers with PhDs. Dan at CaseX was a detective not that long ago. The folks at Carbyne I believe are have tech backgrounds. Matt has a military background, but pursued his start up after doing an MBA.

The reason I bring up Anthony Tassone is because when we as criminologists say we are going to passively evaluate what other people are doing, we are saying “we will just let tech people like Anthony make decisions on what real practitioners of criminal justice pursue”. Again not a dig on Anthony – it is a good thing for people to build cool stuff and see if there is a market. My point is that if Anthony can do it, why not academic criminologists?

Rick Smith at Axon is another example. While Axon really got its dominate market due to conducted energy devices and then body worn cameras (so hardware), quite a bit of the current innovation at Axon is software. And Rick did not have a background in hardware engineering either, he just had an idea and built it.

Transferring over into professional software engineering since 2020, let me tell my fellow academics, you too can write software. It is more about having a good idea that actually impacts practice.

Where to next?

Since the day gig (working on fraud-waste-abuse in Medicaid claims) pays the bills, most of my build stuff is now focused on that. The technical skills to learn software engineering are currently not effectively taught in Criminal Justice PhD programs, but they could be. Writing a dissertation is way harder than learning to code.

While my python book has a major focus on data analysis, it is really the same skills to jump to more general software engineering. (I specifically wrote the book to cover more software engineering topics, like writing functions and managing environments, as most of the other python data science books lack that material.)

Skills gap is only part of the issue though. The second is supporting work that pursues building stuff. It is really just norms in the current academe that stop this from occurring now. People value papers, NIJ (at least used to) mostly fund very boring incremental work.

I discussed start ups (people dreaming and building their own stuff) and other larger established orgs (like Axon). Academics are in a prime position to pursue their own start ups, and most Universities have some support for this (see Joel Caplan and Simsi for an example of that path). Especially for software applications, there are few barriers. It is more about time and effort spent pursuing that.

I think the more interesting path is to get more academic criminologists working directly with software companies. I will drop a specific example since I am pretty sure he will not be offended, everyone would be better off if Ian Adams worked directly for one of these companies (the companies, Ian’s take home pay, long term advancement in policing operations). Ian writes good papers – it would be better if Ian worked directly with the companies to make their tools better from the get go.

My friend I was discussing this with gave the example of Bell Labs. Software orgs could easily have professors take part time gigs with them directly, or just go work with them on sabbaticals. Axon should support something like that now.

While this post has been focused on software development, I think it could look similar for collaborating with criminal justice agencies directly. The economics will need to be slightly different (they do not have quite as much expendable capital to support academics, the ROI for private sector I think should be easily positive in the long run). But that I think that would probably be much more effective than the current grant based approach. (Just pay a professor directly to do stuff, instead of asking NIJ to indirectly support evaluation of something the police department has decided to already put into operation.)

Scientific revolutions are not happening in journal articles. They are happening by people building stuff and accomplishing things in the real world with those innovations.

For a few responses to this post, Alex sent me this (saying my characterization of Larry as passively observing is not quite accurate), which is totally reasonable:

Nice post on building/ doing things and thanks for highlighting the paper with Larry. One error however, Larry was directly involved in the doing. He was the chief science officer for the London Met police and has designed their new stop and frisk policy (and targeting areas) based directly on our work. Our work was also highlighted by the Times London as effective crime policy and also by the Chief of the London Met Police as well who said it was one of the best policy relevant papers he’s ever seen. All police are now being by trained on the new legislation on stop and search in procedurally just ways. You may not have known this background but it’s directly relevant to your post.

Larry Sherman (and David Weisburd), and their work on hot spots + direct experiments with police are really exemplars of “doing” vs “learning”. (David Kennedy and his work on focused deterrence is another good example.) In the mid 90s when Larry or David did experiments, they likely were directly involved in a way that I am suggesting – the departments are going and asking Larry “what should we do”.

My personal experience, trying to apply many of the lessons of David’s and Larry’s work (which was started around 30 years ago at this point), is not quite like that. It is more of police departments have already committed to doing something (like hotspots), and want help implementing the project, and maybe some grant helps fund the research. Which is hard and important work, but honestly just looks like effective project management (and departments should just invest in researchers/project managers directly, the external funding model does not make sense long term). For a more on point example of what I mean by doing, see what Rob Guerette did as an embedded criminologist with Miami PD.

Part of the reason I wrote the post, if you think about the progression of policing, we have phases – August Vollmer for professionalization in the early 1900’s. I think you could say folks like Larry and David (and Bill Bratton) brought about a new age of metrics to PDs in the 90s.

There are also technology changes that fundamentally impact PDs. Cars + 911 is one. The most recent one is a new type of oversight via body worn cameras. Folks who are leading this wave of professionalization changes are tech folks (like Rick Smith and Anthony Tassone). I think it is a mistake to just sit on the sidelines and see what these folks come up with – I want academic criminologists to be directly involved in the nitty gritty of the implementations of these systems and making them better.

A second response to this is that building stuff is hard, which I agree and did not mean to imply it was as easy as writing papers (it is not). Here is Anthony Tassone’s response on X:

I know this is hard. This is part of why I mentioned the Bell labs path. Working directly for an already established company is much easier/safer than doing your own startup. Bootstrapping a startup is additionally much different than doing VC go big or go home – which academics on sabbaticals and as a side hustle are potentially in a good position to do this.

Laura Huey did this path, and does not have nice things to say about it:

I have not talked to Laura specifically about this, but I suspect it is her experience running the Canadian Society of Evidence Based Policing. Which I would not suggest starting a non-profit either honestly. Even if you start a for-profit, there is no guarantee you will be in a good position in your current academic position to be well supported.

Again no doubt building useful stuff is harder than writing papers. For a counter to these though, doing my bootstrapped consulting firm is definitely not as stressful as building a large company like Anthony. And working for a tech company directly was a good career move for me (although now I spend most of my day building stuff to limit fraud-waste-abuse in Medicaid claims, not improving policing).

My suggestion that the field should be more focused on building stuff was not because it was easier, it was because if you don’t there is a good chance you are mostly irrelevant.

1 Comment

by Andy Wheeler on June 12, 2025 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Python, scholarly

Tagged research-software

Posted by Andy Wheeler on June 12, 2025

https://andrewpwheeler.com/2025/06/12/build-stuff/

Using Esri + python: arcpy notes

I shared a series of posts this week using Esri + arcpy tools on my Crime De-Coder LinkedIn page. LinkedIn eventually removes the posts though, so I am putting those same tips here on the blog. Esri’s tools do not have great coverage online, so blogging is a way to get more coverage in those LLM tools long term.

A little arcpy tip, if you import a toolbox, it can be somewhat confusing what the names of the methods are available. So for example, if importing some of the tools Chris Delaney has created for law enforcement data management, you can get the original methods available for arcpy, and then see the additional methods after importing the toolbox:

import arcpy
d1 = dir(arcpy) # original methods
arcpy.AddToolbox("C:\LawEnforcementDataManagement.atbx")
d2 = dir(arcpy) # updated methods available after AddToolbox
set(d2) - set(d1) # These are the new methods
# This prints out for me
# {'ConvertTimeField_Defaultatbx', 'toolbox_code', 'TransformCallData_Defaultatbx', 'Defaultatbx', 'TransformCrimeData_Defaultatbx'}
# To call the tool then
arcpy.TransformCrimeData_Defaultatbx(...)

Many of the Arc tools have the ability to copy python code, when I use Chris’s tool it copy-pastes arcpy.Defaultatbx.TransformCrimeData, but if running from a standalone script outside of an Esri session (using the python environment that ArcPro installs) that isn’t quite the right code to call the function.

You can check out Chris’s webinar that goes over the law enforcement data management tool, and how it fits into the different crime analysis solutions that Chris and company at Esri have built.

I like using conda for python environments on Window’s machines, as it is easier to install some particular packages. So I mostly use:

conda create --name new_env python=3.11 pip
conda activate new_env
pip install -r requirements.txt

But for some libraries, like geopandas, I will have conda figure out the install. E.g.

conda create --name geo_env python=3.11 pip geopandas
conda activate geo_env
pip install -r requirements.txt

As they are particularly difficult to install with many restrictions.

And if you are using ESRI tools, and you want to install a library, conda is already installed and you can clone that environment.

conda create --clone "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3" --name proclone
conda activate proclone
pip install -r requirements.txt

As you do not want to modify the original ESRI environment.

Using conda to run scheduled jobs in Windows is alittle tricky. Here is an example of setting up a .bat file (which can be set up in Windows scheduler) to activate conda, set a new conda environment, and call a python script.

::: For log, showing date/time
echo:
echo --------------------------
echo %date% %time%
::: This sets the location of the script, as conda may change it
set "base=%cd%"
::: setting up conda in Windows, example Arc's conda activate
call "C:\Program Files\ArcGIS\Pro\bin\Python\Scripts\activate.bat"
::: activating a new environment
call conda activate proclone
::: running a python script
call cd %base%
call python auto_script.py
echo --------------------------
echo:

Then, when I set up the script in Window’s scheduler, I often have the log file at that level. So the task scheduler I will have the action as:

"script.bat" >> log.txt 2>&1

And have the options where the script runs from the location of script.bat. This will append both the normal log and error log to the shell script. So if something goes wrong, you can open log.txt and see what is up.

When working with arcpy, often you need to have tables inside of a geodatabase to use particular geoprocessing tools. Here is an example of taking an external csv file, and importing that file into a geodatabase as a table.

import arcpy
gdb = "./project/LEO_Tables.gdb"
tt = "TempTable"
arcpy.env.workspace = gdb

# Convert CSV into geodatabase
arcpy.TableToTable_conversion("YourData.csv",gdb,tt)
#arcpy.ListTables() # should show that new table

# convert time fields into text, useful for law enforcement management tools
time_fields = ['rep_date','begin','end']
for t in time_fields:
    new_field = f"{t}2"
    arcpy.management.AddField(tt,new_field,"TEXT")
    arcpy.management.CalculateField(tt,new_field,f"!{t}!.strftime('%Y/%m/%d %H:%m')", "PYTHON3")

# This will show the new fields
#fn = [f.name for f in arcpy.ListFields(tt)]

When you create a new project, it automatically creates a geodatabase file to go along with that project. If you just want a standalone geodatabase though, you can use something like this in your python script:

import arcpy
import os

gdb = "./project/LEO_Tables.gdb"

if os.path.exists(gdb):
    pass
else:
    loc, db = os.path.split(gdb)
    arcpy.management.CreateFileGDB(loc,db)

So if the geodatabase does not exist, it creates it. If it does exist though, it will not worry about creating a new one.

One of the examples for automation is taking a basemap, updating some of the elements, and then exporting that map to an image or PDF. This sample code, using Dallas data, shows how to set up a project to do this. And here is the original map:

Because ArgGIS has so many different elements, the arcpy module tends to be quite difficult to navigate. Basically I try to seperate out data processing (which often takes inputs and outputs them into a geodatabase) vs visual things on a map. So to do this project, you have step 1 import data into a geodatabase, and 2 update the map elements. Here legend, title, copying symbology, etc.

You can go to the github project to download all of the data (including the aprx project file, as well as the geodatabase file). But here is the code to review.

import arcpy
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os

# Set environment to a particular project
gdb = "DallasDB.gdb"
ct = "TempCrimes"
ol = "ExampleCrimes"
nc = "New Crimes"
arcpy.env.workspace = gdb
aprx = arcpy.mp.ArcGISProject("DallasExample.aprx")
dallas_map = aprx.listMaps('DallasMap')[0]
temp_layer = f"{gdb}/{ct}"

# Load in data, set as a spatial dataframe
df = pd.read_csv('DallasSample.csv') # for a real project, will prob query your RMS
df = df[['incidentnum','lon','lat']]
sdf = pd.DataFrame.spatial.from_xy(df,'lon','lat', sr=4326)

# Add the feature class to the map, note this does not like missing data
sdf.spatial.to_featureclass(location=temp_layer)
dallas_map.addDataFromPath(os.path.abspath(temp_layer)) # it wants the abs path for this

# Get the layers, copy symbology from old to new
new_layer = dallas_map.listLayers(ct)[0]
old_layer = dallas_map.listLayers(ol)[0]
old_layer.visible = False
new_layer.symbology = old_layer.symbology
new_layer.name = nc

# Add into the legend, moving to top
layout = aprx.listLayouts("DallasLayout")[0]
leg = layout.listElements("LEGEND_ELEMENT")[0]
item_di = {f.name:f for f in leg.items}
leg.moveItem(item_di['Dallas PD Divisions'], item_di[nc], move_position='BEFORE')

# Update title in layout "TitleText"
txt = layout.listElements("TEXT_ELEMENT")
txt_di = {f.name:f for f in txt}
txt_di['TitleText'].text = "New Title"
# If you need to make larger, can do
#txt_di['TitleText'].elementWidth = 2.0

# Export to high res PNG file
layout.exportToPNG("DallasUpdate.png",resolution=500)

# Cleaning up, to delete the file in geodatabase, need to remove from map
dallas_map.removeLayer(new_layer)
arcpy.management.Delete(ct)

And here is the updated map:

Leave a comment

by Andy Wheeler on April 20, 2025 • Permalink

Posted in Crime Analysis, Crime Mapping, Python

Tagged arcpy, conda, ESRI, process-automation

Posted by Andy Wheeler on April 20, 2025

https://andrewpwheeler.com/2025/04/20/using-esri-python-arcpy-notes/

Some notes on ESRI server APIs

Just a few years ago, most cities open data sites were dominated by Socrata services. More recently though cities have turned to ArcGIS servers to disseminate not only GIS data, but also just plain tabular data. This post is to collate my notes on querying ESRI’s APIs for these services. They are quite fast, have very generous return limits, and have the ability to do filtering/aggregation.

So first lets start with Raleigh’s Open Data site, specifically the Police Incidents. So sometimes for data analysis you just want a point-in-time dataset, and can download 100% of the data (which you can do here, see the Download button in the below screenshot). But what I am going to show here is how to format queries to generate up to date information. This is useful in web-applications, like dashboards.

So first, go down to the Blue button in the below screen that says I want to use this:

Once you click that, you will see a screen that lists several different options, click to expand the View API Resources, and then click the link open in API explorer:

To save a few steps, here is the original link and the API link side by side, you can see you just need to change explore to api in the url:

https://data-ral.opendata.arcgis.com/datasets/ral::daily-raleigh-police-incidents/explore
https://data-ral.opendata.arcgis.com/datasets/ral::daily-raleigh-police-incidents/api

Now on this page, it has a form to be able to fill in a query, but first check out the Query URL string on the right:

I am going to go into how to modify that URL in a bit to return different slices of data. But first check out the link https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services

This simpler view I often find easier to see all the available data than the open data websites with the extra fluff. You can often tell the different data sources right from the name (and often cities have more things available than they show on their open data site). But lets go to the Police Incidents Feature Server page, the link is https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0:

This gives you some meta-data (such as the fields and projection). Scroll down to the bottom of the page, and click the Query button, it will then take you to https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query:

I find this tool to format queries easier than the Open Data site. Here I put in the Where field 1=1, set the Out Fields to *, the Result record count to 3. I then hit the Query (GET)

This gives an annoyingly long url. And here are the resulting images

So although this returns a very long url, most of the parameters in the url are empty. So you could have a more minimal url of https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query?where=1%3D1&outFields=*&resultRecordCount=3&f=json. (There I changed the format to json as well.)

In python, it is easier to work with the json or geojson output. So here I show how to query the data, and read it into a geopandas dataframe.

from io import StringIO
import geopandas as gpd
import requests

base = "https://services.arcgis.com/v400IkDOw1ad7Yad/ArcGIS/rest/services/Daily_Police_Incidents/FeatureServer/0/query"
params = {"where": "1=1",
          "outFields": "*",
          "resultRecordCount": "3",
          "f": "geojson"}
res = requests.get(base,params)
gdf = gpd.read_file(StringIO(res.text)) # note I do not use res.json()

Now, the ESRI servers will not return a dataset that has 1,000,000 rows, it limits the outputs. I have a gnarly function I have built over the years to do the pagination, fall back to json if geojson is not available, etc. Left otherwise uncommented.

from datetime import datetime
import geopandas as gpd
import numpy as np
import pandas as pd
import requests
import time
from urllib.parse import quote

def query_esri(base='https://services.arcgis.com/v400IkDOw1ad7Yad/arcgis/rest/services/Police_Incidents/FeatureServer/0/query',
               params={'outFields':"*",'where':"1=1"},
               verbose=False,
               limitSize=None,
               gpd_query=False,
               sleep=1):
    if verbose:
        print(f'Starting Queries @ {datetime.now()}')
    req = requests
    p2 = params.copy()
    # try geojson first, if fails use normal json
    if 'f' in p2:
        p2_orig_f = p2['f']
    else:
        p2_orig_f = 'geojson'
    p2['f'] = 'geojson'
    fin_url = base + "?"
    amp = ""
    fi = 0
    for key,val in p2.items():
        fin_url += amp + key + "=" + quote(val)
        amp = "&"
    # First, getting the total count
    count_url = fin_url + "&returnCountOnly=true"
    if verbose:
        print(count_url)
    response_count = req.get(count_url)
    # If error, try using json instead of geojson
    if 'error' in response_count.json():
        if verbose:
            print('geojson query failed, going to json')
        p2['f'] = 'json'
        fin_url = fin_url.replace('geojson','json')
        count_url = fin_url + "&returnCountOnly=true"
        response_count2 = req.get(count_url)
        count_n = response_count2.json()['count']
    else:
        try:
            count_n = response_count.json()["properties"]["count"]
        except:
            count_n = response_count.json()['count']
    if verbose:
        print(f'Total count to query is {count_n}')
    # Getting initial query
    if p2_orig_f != 'geojson':
        fin_url = fin_url.replace('geojson',p2_orig_f)
    dat_li = []
    if limitSize:
        fin_url_limit = fin_url + f"&resultRecordCount={limitSize}"
    else:
        fin_url_limit = fin_url
    if gpd_query:
        full_response = gpd.read_file(fin_url_limit)
        dat = full_response
    else:
        full_response = req.get(fin_url_limit)
        dat = gpd.read_file(StringIO(full_response.text))
    # If too big, getting subsequent chunks
    chunk = dat.shape[0]
    if chunk == count_n:
        d2 = dat
    else:
        if verbose:
            print(f'The max chunk size is {chunk:,}, total rows are {count_n:,}')
            print(f'Need to do {np.ceil(count_n/chunk):,.0f} total queries')
        offset = chunk
        dat_li = [dat]
        remaining = count_n - chunk
        while remaining > 0:
            if verbose:
                print(f'Remaining {remaining}, Offset {offset}')
            offset_val = f"&cacheHint=true&resultOffset={offset}&resultRecordCount={chunk}"
            off_url = fin_url + offset_val
            if gpd_query:
                part_response = gpd.read_file(off_url)
                dat_li.append(part_response.copy())
            else:
                part_response = req.get(off_url)
                dat_li.append(gpd.read_file(StringIO(part_response.text)))
            offset += chunk
            remaining -= chunk
            time.sleep(sleep)
        d2 = pd.concat(dat_li,ignore_index=True)
    if verbose:
        print(f'Finished queries @ {datetime.now()}')
    # checking to make sure numbers are correct
    if d2.shape[0] != count_n:
        print('Warning! Total count {count_n} is different than queried count {d2.shape[0]}')
    # if geojson, just return
    if p2['f'] == 'geojson':
        return d2
    # if json, can drop geometry column
    elif p2['f'] == 'json':
        if 'geometry' in list(d2):
            return d2.drop(columns='geometry')
        else:
            return d2

And so, to get the entire dataset of crime data in Raleigh, it is then df = query_esri(verbose=True). It is pretty large, so I show here limiting the query.

params = {'where': "reported_date >= CAST('1/1/2025' AS DATE)", 
          'outFields': '*'}
df = query_esri(base=base,params=params,verbose=True)

Here this shows doing a datetime comparison, by casting the input to a date. Sometimes you have to do the opposite, cast one of the text fields to dates or extract out values from a date field represented as text.

Example Queries

So I showed about you can do a WHERE clause in the queries. You can do other stuff as well, such as get aggregate counts. For example, here is a query that shows how to get aggregate statistics.

If you click the link, it will go to the query form ESRI webpage. And that form shows how to enter in the output statistics fields.

And this produces counts of the total crimes in the database.

Here are a few additional examples I have saved in my notes:

Do not use the query_esri function above for aggregate counts, just form the params and pass them into requests directly. The query_esri function is meant to return large sets of individual rows, and so can overwrite the params in unexpected way.

Check out my Crime De-Coder LinkedIn page this week for other examples of using python + ESRI. This is more for public data, but those will be examples of using arcpy in different production scenarios. Later this week I will also post an updated blog here, for the LLMs to consume.

Leave a comment

by Andy Wheeler on April 14, 2025 • Permalink

Posted in Crime Mapping, Criminal Justice, Python

Tagged api, ESRI

Posted by Andy Wheeler on April 14, 2025

https://andrewpwheeler.com/2025/04/14/some-notes-on-esri-server-apis/

Notes on making line plots in matplotlib

Line plots are probably the most common type of plot I make. Here are my notes on making nice line plots in matplotlib in python. You can see the full replication code on Github here.

First, I will be working with UCR crime reports, for national level and then city level data from the Real Time Crime Index. The AH Datalytics crew saves their data in github as a simple csv file, and with the FBI CDE this code also downloads the most recent as well. getUCR are just helper functions to download the data, and cdcplot are some of my plot helpers, such as my personal matplotlib theme.

import cdcplot, getUCR
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm

# Get the National level and City data from Real Time Crime Index
us = getUCR.cache_fbi()
city = getUCR.prep_city()

So first, lets just do a basic plot of the national level MV Theft rate (here the rates are per 100,000 population, not per vehicles).

# Lets do a plot for National of the Motor Vehicle Theft Rate per pop
us['Date'] = pd.to_datetime(us['Date'])
us2020 = us[us['Date'].dt.year >= 2020]
var = 'motor-vehicle-theftrate'

# Basic
fig, ax = plt.subplots()
ax.plot(us2020['Date'],us2020[var])
fig.savefig('Line00.png', dpi=500, bbox_inches='tight')

The big spike in December 2020 is due to the way FBI collects data. (Which I can’t find the specific post, but I am pretty sure Jeff Asher has written about in his substack.) So the glut of December reports are not actually extra reports in December, it is just the silly way the FBI reports the backlogged incidents.

You can also see the X axis labels are too close together. But otherwise (besides lack of labels) is acceptable. One thing I like to do with line plots is to superimpose point markers on the sample points. It doesn’t matter here so much, but this is helpful when you have irregular time points or missing data, it is clear that the time period is missing.

In matplotlib, you can do this by specifying -o after the x/y coordinates for the line. I also like the look of plotting with a white marker edge. Also making the plot slightly larger fixes the X axis labels (which have a nice default to showing Jan/July and the year). And finally, since the simplicity of the chart, instead of doing x or y axis labels, I can just put the info I need into the title. For a publication I would likely also put “per 100,000 population” somewhere (in a footnote on the chart or if the figure caption).

# Marker + outline + size
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(us2020['Date'],us2020[var],'-o',
        color='k',markeredgecolor='white')
ax.set_title('Motor Vehicle Theft Rate in US')
fig.savefig('Line01.png', dpi=500, bbox_inches='tight')

Markers are one way to distinguish between multiple lines as well. So you can do -s for squares superimposed on the lines, -^ for a triangle, etc. The white edge only looks nice for squares and circles though in my opinion. See the list of filled markers in this matplotlib documentation. Circles and squares IMO look the nicest and carry a similar visual weight. Here is superimposed Charlotte and US, showing off stars just to create show how to do it.

# Multiple cities
city[var] = (city['Motor Vehicle Theft']/city['Population'])*100000
nc2020 = city[(city['Year'] >= 2020) & (city['State_ref'] == 'NC')]
ncwide = nc2020.pivot(index='Mo-Yr',columns='city_state',values=var)
cityl = list(ncwide)
ncwide.columns = cityl # removing index name
ncwide['Date'] = pd.to_datetime(ncwide.index,format='%m-%Y')
ncwide.sort_values(by='Date',inplace=True)

fig, ax = plt.subplots(figsize=(8,5))
ax.plot(ncwide['Date'],ncwide['Charlotte, NC'],'-s',
        color='green',markeredgecolor='white',label='Charlotte')
ax.plot(us2020['Date'],us2020[var],'-*',
        color='k',label='US')
ax.legend()
ax.set_title('Motor Vehicle Theft Rate')
fig.savefig('Line02.png', dpi=500, bbox_inches='tight')

Charlotte was higher, but looked like it had parallel trends (just increased by around 10 per 100,000) with national trends from 2020 until early 2022. In early 2022, Charlotte dramatically increased though, and peaked/had high volatility since mid 2023 in a different regime shift from the earlier years.

When you make line plots, you want the lines to be a more saturated color in my opinion. It both helps them stand out, as well as makes it more likely to survive printing. No pastel colors. With the points superimposed, even with greyscale printing it will be fine. I commonly tell crime analysts to make a printable report for regular meetings, it is more likely to be viewed than an interactive dashboard.

You can technically do dashes as well via the text string input. I do not like them though typically, as they are less saturated. Here I show two different dash styles. And you could do dashes and points, e.g. :o, (see this matplotlib doc for the styles) I have never bothered to do that though.

# Dashes instead of points
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(ncwide['Date'],ncwide['Charlotte, NC'],':',
        color='green',markeredgecolor='white',label='Charlotte')
ax.plot(ncwide['Date'],ncwide['Asheville, NC'],'--',
        color='#455778',label='Asheville')
ax.plot(us2020['Date'],us2020[var],'-',
        color='k',label='US')
ax.legend()
ax.set_title('Motor Vehicle Theft Rate')
fig.savefig('Line03.png', dpi=500, bbox_inches='tight')

You can see Asheville had an earlier spike, went back down, and then in 2023 had another pronounced spike. Asheville has close to a 100k population, so the ups/downs correspond pretty closely to just the total counts per month. So the spikes in 2023 are an extra 10, 20, 40 mv thefts than you might have expected based on historical patterns.

If you must have many lines differentiated via colors in a static plot, the Tableau color palette or the Dark2 colors work the best. Here is an example plotting the North Carolina cities in a loop with the Tableau colors:

# It is difficult to untangle multiple cities
# https://matplotlib.org/stable/users/explain/colors/colormaps.html
fig, ax = plt.subplots(figsize=(12,8))
for i,v in enumerate(cityl):
    ax.plot(ncwide['Date'],ncwide[v],'-',color=cm.tab10(i),label=v)

ax.legend()
fig.savefig('Line04.png', dpi=500, bbox_inches='tight')

So you could look at this and see “blue does not fit the same pattern”, and then go to the legend to see blue is Asheville. It is a bit of work though to disentangle the other lines though.

And here is an example using the pandas plotting method with the Dark2 palette. I do this more for exploratory data analysis, I often end up editing so much of the axis that using the pandas short cuts are not less work. Here I would edit the axis so the lines do not abut the x axis ends. For old school R people, this is similar to matplot in R, so the data needs to be in wide format, not long. (And all the limitations that come with that.)

# pandas can be somewhat more succinct
fig, ax = plt.subplots(figsize=(12,8))
ncwide.plot.line(x='Date',ax=ax,color=cm.Dark2.colors)
fig.savefig('Line05.png', dpi=500, bbox_inches='tight')

I tend to like the Tableau colors somewhat better though. The two greenish colors (Asheville and Greensboro) and the two orangish colors (Raleigh and Charlotte) I personally have to look quite closely to tell them apart. Men tend to have lower color resolution than women, I am not color blind and you may find them easier to tell the difference. Depending on your audience it would be good to assume lower than higher color acuity in the audience’s vision in general.

In my opinion, often you can only have 3 lines in a graph and it becomes too busy. It is partly due to how tortuous the lines are, so you can have many lines if they are parallel and don’t cross. But assuming you can have max 3 is a good baseline assumption.

An alternative though is to highlight specific lines. Here I highlight Durham and US, the other cities are light grey and in the background. Also looping over you can specific the order. I draw Durham last (so it goes on top). The grey cities are first (so are at the bottom). Here I only give the first grey background city a label, so the legend does not have duplicates.

# Highlight one city, compared to the rest
fig, ax = plt.subplots(figsize=(12,8))
ncwide.plot.line(x='Date',ax=ax,color=cm.Dark2.colors)
fig.savefig('Line05.png', dpi=500, bbox_inches='tight')


# Highlight one city, compared to the rest
fig, ax = plt.subplots(figsize=(12,8))
lab = 'Other NC'

for v in cityl:
    if v == 'Durham, NC':
        pass
    else:
        ax.plot(ncwide['Date'],ncwide[v],'-',color='lightgrey',label=lab)
        lab = None


ax.plot(us2020['Date'],us2020[var],'-o',
        color='k',markeredgecolor='white',label='US')
ax.plot(ncwide['Date'],ncwide['Durham, NC'],'-',linewidth=2,color='red',label='Durham')
ax.legend()
ax.set_title('Motor Vehicle Theft Rate')
fig.savefig('Line06.png', dpi=500, bbox_inches='tight')

If I had many more lines, I would make the grey lines either smaller in width, e.g. linewidth=0.5, and/or make them semi-transparent, e.g. alpha=0.8. And ditto if you want to emphasize a particular line, making it a larger width (here I use 2 for Durham), makes it stand out more.

Durham here you can see has very similar overall trends compared to Charlotte.

And those are my main notes on making nice line plots in matplotlib. Let me know in the comments if you have any other typical visual flourishes you often use for line plots.

1 Comment

by Andy Wheeler on December 16, 2024 • Permalink

Posted in Crime Analysis, Data Visualization, Python

Tagged matplotlib, mvtheft, ucr

Posted by Andy Wheeler on December 16, 2024

https://andrewpwheeler.com/2024/12/16/notes-on-making-line-plots-in-matplotlib/

Identifying excess rounding

Given the hubbub about Blue Cross and paying for anesthesiologists, there was an interesting paper making the rounds, Comparison of Anesthesia Times and Billing Patterns by Anesthesia Practitioners. (Shared via Crémieux.)

Most medical claims are billed via what are call CPT codes (Current Procedural Terminology). So if you go to the ER, you will get billed for a code 99281 to 99285. The final digit encodes different levels of complexity for the case, with 5’s being more complex. It was news to me that anesthesiologists actually bill for time directly, but the above linked paper showed (pretty plainly) that there is strong evidence they round up to every 5 minutes.

Now the paper just selected the anesthesiologists that have the highest proportion of billed times ending in 5’s. Here I will show a better way to flag specific problematic anesthesiologists (using repeated binomial tests and false discovery rate corrections).

Here I simulate 1,000 doctors, and select 40 of them to be bad, and those 40 round all of their claims up to the nearest 5 minute mark. Whereas the other docs are just billing the time as is. And they have varying total number of claims, between 100 and 500.

import numpy as np
from scipy.stats import gamma,uniform
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import binomtest,false_discovery_control

np.random.seed(10)

docs = 1000
# pick a random set 
bad_docs = np.random.choice(np.arange(docs),40,replace=False)
doci = []
claims = []

for i in range(docs):
    totn = int(np.ceil(uniform.rvs(100,500))) # number of claims
    doci += [i]*totn
    g = gamma.rvs(6,scale=12,size=totn)
    if i in bad_docs:
        g = np.ceil(g/5)*5
    else:
        g = g.round()
    claims += g.tolist()

dat = pd.DataFrame(zip(doci,claims),columns=['Doc','Time'])

# Histogram
fig, ax = plt.subplots(figsize=(8,4))
dat['Time'].hist(bins=range(201),alpha=0.8,color='k',ax=ax)
plt.savefig('Histogram.png',dpi=500,bbox_inches='tight')

You can see my gamma distribution is not as heavy tailed as the JAMA paper, but qualitatively has the same spike traits. Based on this, you can see the spike is relative to the other density, and so that shows in the JAMA paper there are more anesthesiologists rounding at 60 minutes than there are at the other 5 minute intervals.

In this particular example, it would be trivial to spot the bad docs, since they round 100% of the time, and you would only expect around 20% (since billing in minute intervals).

dat['Round'] = dat['Time'] % 5 == 0
dat['N'] = 1
gs = dat.groupby('Doc',as_index=False)[['N','Round']].sum()
gs['Perc'] = gs['Round']/gs['N']
gs.sort_values(by='Perc',ascending=False,ignore_index=True,inplace=True)
gs

# Upper Quantiles
np.quantile(gs['Perc'],[0.75,0.8,0.85,0.9,0.95,0.99])

But you can see a problem with using a top 5% quantile cut off here. Since I only have 4% bad doctors, using that hard cut-off will result in a few false positive flags. My suggested approach is to create a statistical test (Chi-Square, binominal, KS, whatever makes sense for individual doctors). Run the test for each doctor, get the p-value, and then run a false discovery rate correction on the p-values.

The above where doctors round 100% of the time is too easy, so here I simulate 40 doctors will round up 10% to 30% of the time. I also have fewer cases (more cases and more rounding make it much easier to spot).

# Redoing sim, but it is a smaller percentage of the stats are bad
for i in range(docs):
    totn = int(np.ceil(uniform.rvs(100,500))) # number of claims
    doci += [i]*totn
    g = gamma.rvs(6,scale=12,size=totn)
    if i in bad_docs:
        randperc = int(np.round(totn*uniform.rvs(0.1,0.2)))
        badn = np.random.choice(np.arange(totn),randperc,replace=False)
        g[badn] = np.ceil(g[badn]/5)*5
        g = g.round()
    else:
        g = g.round()
    claims += g.tolist()

dat = pd.DataFrame(zip(doci,claims),columns=['Doc','Time'])
dat['Round'] = dat['Time'] % 5 == 0
dat['N'] = 1
gs = dat.groupby('Doc',as_index=False)[['N','Round']].sum()
gs['Perc'] = gs['Round']/gs['N']
gs.sort_values(by='Perc',ascending=False,ignore_index=True,inplace=True)

Now we can apply the binomial test to each doctor, then adjust for false discovery rate.

# Calculating binom test
def bt(x):
    k,n = x.iloc[0],x.iloc[1]
    b = binomtest(k,n,p=0.2,alternative='greater')
    return b.pvalue

gs['p'] = gs[['Round','N']].apply(bt,axis=1)
gs['q'] = false_discovery_control(gs['p'],method='by')

# Captures 28 out of 40 bad docs, no false positives
gs['BadDocs'] = gs['Doc'].isin(bad_docs)
gs[gs['q'] < 0.05]

If you check out the doctors via gs.head(50), you can see that a few of the bad-docs where adjusted to have q-values of 1, but they ended up being low N and in the range you would expect.

While anesthesiologists billing is different, this same approach would be fine for CPT codes that have the 1-5 modifier (you might use a leave one out strategy and a Chi-Square test). Anesthesiologists if they know they will be scrutinized with exact 5 minutes, they will likely adjust and round up, but to not regular numbers. If that is the case, the default distribution will be expected to be uniform on the 0-9 digits (sometimes people use a Benford like test for the trailing digits, this is the same idea). That will be harder to fake.

I don’t have an issue with Blue Cross saying they will only bill for pre-allotted times. But even without making an explicit policy like that, they can identify bad actors and pursue investigations into those problematic anesthesiologists even without making widespread policy changes.

2 Comments

by Andy Wheeler on December 7, 2024 • Permalink

Posted in Crime Analysis, data science, Data Visualization, Python

Tagged fdr, outliers

Posted by Andy Wheeler on December 7, 2024

https://andrewpwheeler.com/2024/12/07/identifying-excess-rounding/

Question Sets and All Paths

I was nerd-sniped with a question at work the other day. The set up was like this, imagine a survey where all of the questions have yes-no answers. Some of the answers are terminating, so if you answer No to a particular question, the questions stop. But some if you reach that part of the tree will keep going.

The ask was a formula to articulate the total number of potential unique answer sets. Which I was not able to figure out, I did however write python code to generate all of the unique sets. So here is that function, where I create a networkx directed graph in a particular format, and then find all of the potential start to end paths in that graph. You just add a dummy begin node, and end nodes at the terminating locations and the final question. You then just search for all of the paths from the begin to end nodes.

import networkx as nx

def question_paths(totq,term):
    '''
    totq -- positive integer, total number of questions
    term -- list of integers, where ints are questions that terminate
    '''
    nodes = [f'Q{i}{r}' for i in range(totq) for r in ['N','Y']]
    edges = []
    for i in range(totq-1):
        edges.append([f'Q{i}Y',f'Q{i+1}N'])
        edges.append([f'Q{i}Y',f'Q{i+1}Y'])
        if i not in term:
            edges.append([f'Q{i}N',f'Q{i+1}N'])
            edges.append([f'Q{i}N',f'Q{i+1}Y'])
    # adding in begin/end
    nodes += ['Begin','End']
    edges += [['Begin','Q0N'],['Begin','Q0Y']]
    for t in term:
        edges.append([f'Q{t}N','End'])
    edges += [[f'Q{totq-1}N','End'],[f'Q{totq-1}Y','End']]
    # Now making graph
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges)
    # Getting all paths
    paths = []
    for p in nx.all_simple_paths(G,source='Begin',target='End'):
        nicer = [v[-1] for v in p[1:-1]]
        paths.append(nicer)
    return paths

And now we can check out where all paths are terminating, you only get further up the tree if you answer all yes for prior questions.

>>> question_paths(3,[0,1,2])
[['N'],
 ['Y', 'N'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

So in that scenario we have four potential answer sets. For all binary, we have the usual 2^3 number of paths:

>>> question_paths(3,[])
[['N', 'N', 'N'],
 ['N', 'N', 'Y'],
 ['N', 'Y', 'N'],
 ['N', 'Y', 'Y'],
 ['Y', 'N', 'N'],
 ['Y', 'N', 'Y'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

And then you can do a mixture, here just question 2 terminates, but 1/3 are binary:

>>> question_paths(3,[1])
[['N', 'N'],
 ['N', 'Y', 'N'],
 ['N', 'Y', 'Y'],
 ['Y', 'N'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

One of the things doing this exercise taught me is that it matters where the terminating nodes are in the tree. Earlier terminating nodes results in fewer potential paths.

# can different depending on where the terminators are
len(question_paths(10,[1,5])) # 274
len(question_paths(10,[6,8])) # 448
len(question_paths(10,[0,1])) # 258
len(question_paths(10,[8,9])) # 768

So the best in terms of a formula for the total number of paths I could figure out was 2^(non-terminating questions) <= paths <= 2^(questions) (which is not a very good bound!) I was trying to figure out a product formula but was unable (any suggestions let me know!)

This reminds me of a bit of product advice from Jennifer Pahlka, you should have eliminating questions earlier in the form. So instead of filling out 100 questions and at the end being denied for TANF, you ask questions that are the most likely to eliminate people first.

It works the same for total complexity of your application. So asking the terminating questions earlier reduces the potential number of permutations you ultimately have in your data as well. Good for the user and good for the developers.

2 Comments

by Andy Wheeler on November 2, 2024 • Permalink

Posted in data science, Python

Tagged NetworkX, survey

Posted by Andy Wheeler on November 2, 2024

https://andrewpwheeler.com/2024/11/02/question-sets-and-all-paths/

Suits, Money Laundering, and Linear Programming

Currently watching Suits with the family, and an interesting little puzzle came up in the show. In Season 1, episode 8, there is a situation with money laundering from one account to many smaller accounts.

So the set up is something like “transfer out 100 million”, and then you have some number of smaller accounts that sum to 100 million.

When Lewis brought this up in the show, and Mike had a stack of papers to go through to identify the smaller accounts that summed up to the larger account, my son pointed out it would be too difficult to go through all of the permutations to figure it out. The total number of permutations would be something like:

N = Sum(b(A choose k) for 1 to k)

Where A is the total number of accounts to look over, and k is the max number of potential accounts the money was spread around. b(A choose k) is my cheeky text format for binomial coefficients.

So this will be a very large number. 10,000 choose 4 for example is 416 416 712 497 500 – over 416 trillion. You still need to add in b(10,000 choose 2) + b(10,000 choose 3) + ….. b(10,000 choose k). With even a small number of accounts, the number of potential combinations will be very large.

But, I said in response to my son I could write a linear program to find them. And this is what this post shows, but doing so made me realize there are likely too many permutations in real data to make this a feasible approach in conducting fraud investigations. You will have many random permutations that add up to the same amount. (I figured some would happen, but it appears to me many happen in random data.)

(Note I have not been involved with any financial fraud examinations in my time as an analyst, I would like to get a CFA time permitting, and if you want to work on projects let me know! So all that said, I cannot speak to the veracity of whether this is a real thing people do in fraud examinations.)

So here I use python and the pulp library (and its default open source CBC solver) to show how to write a linear program to pick bundles of accounts that add up to the right amount. First I simulate some lognormal data, and choose 4 accounts at random.

import pulp
import numpy as np

# random values
np.random.seed(10)
simn = 10000
x = np.round(np.exp(np.random.normal(loc=5.8,scale=1.2,size=simn)),2)

# randomly pick 4
slot = np.arange(0,simn)
choice = np.random.choice(slot,size=4,replace=False)
tot = x[choice].sum()
print(tot,choice)

So we are expecting the answer to be accounts 2756 5623 5255 873, and the solution adds up to 1604.51. So the linear program is pretty simple.

# make a linear program
P = pulp.LpProblem("Bundle",pulp.LpMinimize)
D = pulp.LpVariable.dicts("D",slot,cat='Binary')

# objective selects smallest group
P += pulp.lpSum(D)

# Constraint, equals the total
P += pulp.lpSum(D[i]*x[i] for i in range(simn)) == tot

# Solving with CBC
P.solve()

# Getting the solution
res = []
for i in range(simn):
    if D[i].varValue == 1:
        res.append(i)

In practice you will want to be careful with floating point (later I convert the account values to ints, another way though is instead of the equal constraint, make it <= tot + eps and >= tot - eps, where eps = 0.001. But for the CBC solver this ups the time to solve by quite a large amount.)

You could have an empty objective function (and I don’t know the SAT solvers as much, I am sure you could use them to find feasible solutions). But here I have the solution by the minimal number of accounts to look for.

So here is the solution, as you can see it does not find our expected four accounts, but two accounts that also add up to 1604.51.

You can see it solved it quite fast though, so maybe we can just go through all the potential feasible solutions. I like making a python class for this, that contains the “elimination” constraints to prevent that same solution from coming up again.

# Doing elimination to see if we ever get the right set
class Bundle:
    def __init__(self,x,tot,max_bundle=10):
        self.values = (x*100).astype('int')
        self.totn = len(x)
        self.tot = int(round(tot*100))
        self.pool = []
        self.max_bundle = max_bundle
        self.prob_init()
    def prob_init(self):
        P = pulp.LpProblem("Bundle",pulp.LpMinimize)
        D = pulp.LpVariable.dicts("D",list(range(self.totn)),cat='Binary')
        # objective selects smallest group
        P += pulp.lpSum(D)
        # Constraint, equals the total
        P += pulp.lpSum(D[i]*self.values[i] for i in range(simn)) == self.tot
        # Max bundle size constraint
        P += pulp.lpSum(D) <= self.max_bundle
        self.prob = P
        self.d = D
    def getsol(self,solver=pulp.PULP_CBC_CMD,solv_kwargs={'msg':False}):
        self.prob.solve(solver(**solv_kwargs))
        if self.prob.sol_status != -1:
            res = []
            for i in range(self.totn):
                if self.d[i].varValue == 1:
                    res.append(i)
            res.sort()
            self.pool.append(res)
            # add in elimination constraints
            self.prob += pulp.lpSum(self.d[r] for r in res) <= len(res)-1
            return res
        else:
            return -1
    def loop_sol(self,n,solver=pulp.PULP_CBC_CMD,solv_kwargs={'msg':False}):
        for i in range(n):
            rf = self.getsol(solver,solv_kwargs)
            if rf != 1:
                print(f'Solution {i}:{rf} sum: {self.values[rf].sum()}')
            if rf == -1:
                print(f'No more solutions at {i}')
                break

And now we can run through and see if the correct solution comes up:

be = Bundle(x=x,tot=tot)
be.loop_sol(n=10)

Uh oh – I do not feel like seeing how many potential two solutions there are in the data (it is small enough I could just enumerate those directly). So lets put a constraint in that makes it so it needs to be four specific accounts that add up to the correct amount.

# add in constraint it needs to be 4 selected
choice.sort()
be.prob += pulp.lpSum(be.d) == 4
print(f'Should be {choice}')
be.loop_sol(n=10)

And still no dice. I could let this chug away all night and probably come up with the set I expected at some point. But if you have so many false positives, this will not be very useful from an investigative standpoint. So you would ultimately need more constraints.

Lets say our example the cumulative total will be quite large. Will that help limit the potential feasible solutions?

# Select out a sample of high
sel2 = np.random.choice(x[x > 5000],size=4,replace=False)
tot2 = sel2.sum()
ch2 = np.arange(simn)[np.isin(x,sel2)]
ch2.sort()
print(tot2,ch2)

be2 = Bundle(x=x,tot=tot2)
be2.loop_sol(n=20)

There are still too many potential feasible solutions. I thought maybe a problem with real data is that you will have fixed numbers, say you are looking at transactions, and you have specific $9.99 transactions. If one of those common transactions results in a total sum, it will just be replicated in the solution over and over again. I figured with random data the sum would still be quite unique, but I was wrong for that.

So I was right in that I could write a linear program to find the solution. I was wrong that there would only be one solution!

Leave a comment

by Andy Wheeler on October 28, 2024 • Permalink

Posted in Crime Analysis, data science, Python

Tagged linear-programming, pulp

Posted by Andy Wheeler on October 28, 2024

https://andrewpwheeler.com/2024/10/28/suits-money-laundering-and-linear-programming/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 392 other subscribers
aoristic big-data cartography census choropleth citeulike consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming LLM logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category Python

Running in the REPL

Exécution dans le REPL

How criminology research typically goes

Well what have you built Andy?

Examples of innovations I am excited about

Where to next?

Example Queries

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange