All posts in category data science

Types of websites and trade-offs

For some minor updates on different fronts. I have a new blog post on Crime De-Coder about how to figure out the proper ODBC connection string to query your record management system. I have done this song and dance with maybe a dozen different PDs at this point (and happened to do it twice in the prior week), so figured a post would make sense.

Two, article with Kim Rossmo has been published, The journey-to-crime buffer zone: Measurement issues and methodological challenges. Can check out the github repo and CrimRXiv version for free.

Main reason I wanted to make a post today about the types of websites. I have seen several influencers discuss using GenAI to create simple apps. These tools are nice, but many seem to make bad architecture decisions from the start (many people should not be making python served websites). So I will break down a few of the different options for creating a website in this post.

The most basic is a static html site – this just requires you create the HTML and place it somewhere on a server. A free option is github pages. You can still use javascript apps, but they are run client side and there is no real ability to limit who can see the site (e.g. you cannot make it password protected to log in). These can handle as much traffic as you want. If you have data objects in the site (such as a dashboard) the objects just need to be stored directly on the server (e.g. in json files or csv if you want to parse them). You can build data dashboards using D3 or other WASM apps.

The other types of websites you can do anything you can in HTML, so I focus more on what makes them different.

PHP sites – this probably requires you to purchase an online server (ignoring self hosting in your closet). There are many low cost vendors (my Crime De-Coder site is PHP on Hostinger. But they are pretty low price, think $5 per month. These do have the ability to create password protected content and have server side functions hidden. My $5 a month website has a service level agreement to handle 20k responses per day, it also has a built in MySQL database that can hold 2 gig of data for my plan. (These cheap sites are not bad if all you want is a smallish database.) WordPress is PHP under the hood (although if you want a custom site, I would just start from scratch and not modify WordPress. WordPress is good to use a GUI to style the site with basic templates.)

IMO if you need to protect stuff behind a server, and have fairly low traffic requirements, using a PHP site is a very good and cheap option. (For higher traffic I could pay under $20 a month for a beefier machine as well, we are talking crazy site traffic like well over 100k visits per day before you need to worry about it.)

Node.js – this is a server technology, popular for various apps. It is javascript under the hood, but you can have stuff run server side (so can be hidden from end user, same as PHP). The tech to host a site is a bit more involved than the PHP hosting sites. You can either get a VPS (for typically less than $10 a month, and can probably handle close to the same amount of traffic as the cheap PHP), and write some code to host it yourself. Think of a VPS as renting a machine (so can be used for various things, not just webhosting.) Or use some more off the shelf platform (like FlyIO, which has variable pricing). You typically need to think about a separate database hosting as well with these tools though. (I like Supabase.)

Python – python has several libraries, e.g. django, flask, as well as many different dashboard libraries. Similar to Node, you will need to either host this on a VPS (Hostinger has VPS’s as well, and I know DigitalOcean is popular), or some other service. Which is more expensive than the cheaper PHP options. It is possible to have authentication in python apps, but I do not tend to see many examples of that. Most python websites/dashboards I am familiar with are self-hosted, and so limit who can seem them intrinsically to the companies network (e.g. not online in a public website for everyone to see).

Personal story, at one point my Dallas crime dashboard (which is WASM + python) on my Crime De-Coder site (which is served on the PHP site, so takes a few extra seconds to install), was broken due to different library upgrades. So I hosted the python panel dashboard on Google cloud app while I worked on fixing the WASM. I needed one up from the smallest machine due to RAM usage (maybe 2 gigs of RAM). Google cloud app was slower than WASM on start up, sometimes would fail, and cost more than $100 per month with very little traffic. I was glad I was able to get the WASM version fixed.

It is all about trade-offs though in the architecture. So the WASM app you can right click and see how I wrote the code to do that. Even though it is on a PHP site, it is rendered client side. So there is no way to protect that content from someone seeing it. So imagine I wanted you to pay $5 a month to access the dashboard – someone could right click and copy the code and cancel the subscription (or worse create their own clone for $4 per month). For another example, if I was querying a private database (that you don’t want people to be able to see), someone could right click and see that as well. So the WASM app only makes sense for things that don’t need to be private. Google cloud app though that is not a limitation.

The mistake I see many people make is often picking Node/Python where PHP would probably be a better choice. Besides costs, you need to think about what is exposed to the end user and the level of effort to create/manage the site. So if you say to GenAI “I want to build a dashboard website” it may pop out a python example, but many of the examples I am seeing it would have been better to use PHP and say “I have a php website, build a function to query my database and return an array of crimes by month”, and then as a follow up question say, “ok I have that array, create a line chart in PHP and javascript using D3.js”.

So to me GenAI does not obviate the need to understand the technology, which can be complicated. You need a basic understanding of what you want, the constraints, and then ask the machine for help writing that code.

Leave a comment

by Andy Wheeler on September 24, 2024 • Permalink

Posted in Crime Analysis, data science, Data Visualization, javascript, Python

Tagged php, web-design

Posted by Andy Wheeler on September 24, 2024

https://andrewpwheeler.com/2024/09/24/types-of-websites-and-trade-offs/

Reducing folium map sizes

Recently for a crimede-coder project I have been building out a custom library to make nice leaflet maps using the python folium library. See the example I have posted on my website. Below is a screenshot:

This map ended up having around 3000 elements in it, and was a total of 8mb. 8mb is not crazy to put on a website, but is at the stage where you can actually notice latency when first rendering the map.

Looking at the rendered html code though it was verbose in a few ways for every element. One is that lat/lon are in crazy precision by default, e.g. [-78.83229390597961, 35.94592660794455]. So a single polygon can have many of those. Six digits of precision for lat/lon is still under 1 meter of precision, which is plenty sufficient for my mapping applications. So you can reduce 8+ characters per lat/lon and not really make a difference to the map (you can technically have invalid polygons doing this, but this is really pedantic and should be fine).

A second part of the rendered folium html map for every object is given a full uuid, e.g. geo_json_a19eff2648beb3d74760dc0ddb58a73d.addTo(feature_group_2e2c6295a3a1c7d4c8d57d001c782482);. This again is not necessary. I end up reducing the 32 length uuids to the first 8 alphanumeric characters.

A final part is that the javascript is not minified – it has quite a bit of extra lines/spaces that are not needed. So here are my notes on using python code to take care of some of those pieces.

To clean up the precision for geometry objects, I do something like this.

import re

# geo is the geopandas dataframe
redg = geo.geometry.set_precision(10**-6).to_json()
# redg still has floats, below regex clips values
rs = r'(\d{2}\.|-\d{2}\.)(\d{6})(\d+)'
re.sub(rs,r'\1\2',redg)

As most of my functions add the geojson objects to the map one at a time (for custom actions/colors), this is sufficient to deal with that step (for markers, can round lat/lon directly). It may make more sense for the set precision to be 10**-5 and then clip the regex. (For these regex’s I am showing there is some risk they will replace something they should not, I think it will be pretty safe though.)

Then to clean up the UUID’s and extra whitespace, what I do is render the final HTML and then use regex’s:

# fol is the folium object
html = fol.get_root()
res = html.script.get_root().render()
# replace UUID with first 8
ru = r'([0-9a-f]{8})[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}'
res = re.sub(ru,r'\1',res)
# clean up whitespace
rl = []
for s in res.split('\n'):
    ss = s.strip()
    if len(ss) > 0:
        rl.append(ss)
rlc = '\n'.join(rl)

There is probably a smarter way to do this directly with the folium object for the UUID’s. For whitespace though it would need to be after the HTML is written. You want to be careful with the cleaning up the whitespace step – it is possible you wanted blank lines in say a leaflet popup or tooltip. But for my purposes this is not really necessary.

Doing these two steps in the Durham map reduces the size of the rendered HTML from 8mb to 4mb. So reduced the size of the file by around 4 million characters! The savings will be even higher for maps with more elements.

One last part is my map has redundant svg inserted for the map markers. I may be able to use css to insert the svg, e.g. something like in css .mysvg {background-image: url("vector.svg");}, and then in the python code for the marker svg insert <div class="mysvg"></div>. For dense point maps this will also save quite a few characters. Or you could add in javascript to insert the svg as well (although that would be a bit sluggish I think relative to the css approach, although sluggish after first render if the markers are turned off).

I have not done this yet, as I need to tinker with getting the background svg to look how I want, but could save another 200-300 characters per marker icon. So will save a megabyte in the map for every 3000-5000 markers I am guessing.

The main reason I post webdemo’s on the crimede-coder site is that there a quite a few grifters in the tech space. Not just for data analysis, but for front-end development as well. I post stuff like that so you can go and actually see the work I do and its quality. There are quite a few people now claiming to be “data viz experts” who just embed mediocre Tableau or PowerBI apps. These apps in particular tend to produce very bad maps, so here you can see what I think a good map should look like.

If you want to check out all the interactions in the map, I posted a YouTube video walking through them

Durham hotspot map walkthrough of interactions

Leave a comment

by Andy Wheeler on August 4, 2024 • Permalink

Posted in Crime Analysis, Crime Mapping, data science, Data Visualization, Mapping, Python

Tagged folium, hot spots, leaflet, webmapping

Posted by Andy Wheeler on August 4, 2024

https://andrewpwheeler.com/2024/08/04/reducing-folium-map-sizes/

Using GenAI to describe charts for reports

One of the ideas that has come up with the recent GenAI craze is to use these tools to conduct end-to-end data analysis. So you feed it a dataset + a question and out pops an analysis. Mike Zidar has notes on using a Google tool to do this:

I am not worried about these tools usurping crime analysts though. The reason is that the vast majority of data analysis (crime or business or whatever) is very superficial. The hardest part is not generating the chart, it is knowing what chart to generate, and how it will be used by real people to make real decisions. Often in crime analysis you get the ambiguous “well this will help allocate resources” – when in reality your chart can in no way help dictate any realistic decision a department is going to make.

If you cut out analysts, and just have front line individuals asking google “do crime analysis”, it will be hopelessly superficial, and the front line will just ignore it altogether.

I however do think that GenAI has the ability to become power-tools for super-users. That is, someone who does know what they want to calculate, but uses the computer to help them get that information faster. Not dissimilar to how auto-complete while texting helps you type faster. And here is one use case I have been thinking about – so analysts spend a ton of time automating different products, such as monthly CompStat reports. The reports should have tables/graphs in them, like this chart of gun crimes in DC for example:

Now, most reports you want to also have a plain text summary of what is going on in the chart. Currently when auto-generating reports, it is difficult to mix that plain text in and not make it very superficial using a rule based system. The newest round of many of the GenAI tools lets you upload an image and ask it questions about the image. So this is still very open ended, but has many more guiderails than simply telling the machine “go brr and generate analysis”. You have already decided the chart, you are just asking for a nice description of the chart to fill in your already pre-made metrics.

I did this with ChatGPT, Google’s Gemini, and Claude to see how it does with the gun crime chart, the below Raleigh weekly chart:

And in some cases a more tame monthly time series chart:

Because these models are changing all the time, keep in mind that when you read this the newest models may do even better than what I show here (and see the end section on how you may prompt engineer this to produce better results as well). That said lets check out the results!

ChatGPT

For ChatGPT I used the GPT-4o model, I first just asked about the DC chart “Describe the patterns in this chart”. ChatGPT as it is known, is quite verbose:

I then asked ChatGPT to keep it to two to three sentences, and I think it did very well.

Ok, now to the Raleigh chart with error bars:

So this is OK, it spotted the recent increasing trend. “Frequently exceeded the average of the prior 8 weeks (gray shaded area)” is wrong, there are only 2! But I think the last sentence about notable recent spikes is good.

I then gave it the Durham chart that had the anomalies in early 2019/2020:

And I again think this is quite reasonable. I mean an analyst should probably say “these must be reporting idiosyncrasies”, but I don’t think this is not so bad a description as to be misleading.

All in all very happy with the results for ChatGPT here – these charts are not typical line and bar charts, and ChatGPT interpreted each quite well. At least the description is not so bad that if I did these directly in an automated report it would be embarrassing.

Google

For all of these examples I am using the free tools (that typically have limits that I run out by just these two queries). I did this on 5/22 for Google (which I think is Gemini-1.5, I am not 100% sure). So here is the DC gun crime seasonal chart for Google with the prompt Describe the trends in this chart:

This is very wrong in multiple places. It did not do any better with the Raleigh chart:

There is not strong seasonality here, and it includes some filler “important to note” that is not actually important to note. After this I gave it a more tame monthly crime counts of Robberies in Houston chart to see if Google could redeem itself:

And it flopped pretty hard again. Maybe most charts are increasing, so the model is biased to say increasing (I don’t know).

So again this is Google’s free version, and so the paid may be better (or recent updates may be better). But this isn’t even close to make me want to prompt engineer further.

Claude

I did the tests with Sonnet 3.5 (so around a month after the tests with Google/ChatGPT). I used the shorter 2/3 sentences prompt.

I like this description even slightly better than ChatGPT. How about the Raleigh MV thefts chart:

Similar to ChatGPT it is not quite right but in very subtle ways. It does catch the upward trend. It is wrong in terms of data ends before 2024, and the gray area indicating greater variability is technically true but I would not describe it as noticeable. So again not embarrassingly wrong (like Google), but not quite right either.

How about the anomaly Durham chart:

Very similar to ChatGPT, which I think is again OK.

Prompt Engineering Ideas

So the idea behind prompt engineering is you can ask “Describe this chart” or you can ask “Describe this chart in two to three sentences” and it changes the results (in any of these tools). Subsequently a big part of this is figuring out the prompts that give the most reasonable results. Prompts in these GenAI tools when submitting images have two parts, the image and the text. Do not take me as an expert by any means, but for other analysts here are my guesses as to how to prompt engineer this to maybe return better results.

For the text part, one thing you may do for auto reports is to give examples of the text you want. So for example, you could have a prompt that is “Please fill in the blanks: The chart shows _____ trends over time.”. So provide more guidance as to the structure of how you want to the response to look. Or you could do: “Here is example description 1: …., description 2: ….”. This is how RAG applications work, but with a static report you can give static exemplars, they don’t really need to be dynamically looked up from prior reports.

For the image part of the prompt, in an auto-application you may submit a different image than is actually shown in the report. So for example I may have the X axis for monthly crimes be labeled with the actual months (instead of numbers). Putting all the months in smaller font I bet the GenAI tools will still read it just fine, even if I don’t want it to look like that in the final report.

And I probably shouldn’t include logos (since they are immaterial and just cause extra info that may distract the description), and the text footers. I also think making my legends more descriptive may help guide the tools interpretation. I may remove the title text all together and place the relevant info in the prompt “Here is a chart of Robberies in Houston from …. to … Please describe the chart, including any long term trends, or anomalous spikes (high or low) in any month.” The text prompt may keep the tools on track a bit more with the specific details, but still allow them leeway to interpret the chart without being too rigid.

For the error bar chart, you could insert into the prompt explicit dates they were outside, e.g. “weeks a and b were high, make sure to mention that”. So you could have a mix of explicit anomaly detection, insert those anomalies into the prompt, to just keep the results on track.

It would still be a lot of work to automate a report with such plain text language, but I think it could be a quite reasonable iterative workflow. So you generate the report in a format you can edit, like Word, review it. And then in subsequent reports try to tweak the parameters a wee bit to produce better outputs.

2 Comments

by Andy Wheeler on July 16, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science

Tagged ChatGPT, genai

Posted by Andy Wheeler on July 16, 2024

https://andrewpwheeler.com/2024/07/16/using-genai-to-describe-charts-for-reports/

Wake LP talk on LPRs and javascript hacks in WooCommerce

For some Crime De-Coder news, I will be giving a tech talk on automated license plate readers for the Wake County Libertarian Party on Wednesday July 17th in Raleigh.

See my slides on the CRIME De-Coder website to get a preview of what I will be talking about.

This post will just be a quick one that is hopefully helpful to others. So I use WooCommerce + LuLu to distribute shipping for my print book. For those who have purchased a copy thank you! Many of the paperback purchases will be arriving at your homes very soon. There have been two hiccups with my store website for individuals.

Problem 1 is an error nonce is invalid popping up after trying to add the book to the cart. It is difficult to replicate and is an underlying cache error with WordPress as best I can tell. My advice to fix is go to the individual book page directly to add the book to your cart (paperback, ebook). It seems to mostly happen to me when I am on the main shop page and add directly to cart. If the problem persists for you let me know.

The second problem is that for the print book, to do shipping on LuLu’s end you need to input a phone number. As far as I can tell, most website advice in WooCommerce suggest to pay for an additional plug in to edit this option. I was able to use javascript (and the WPCode free plugin) though to force the phone number to be filled in for free though. Sharing here as I hope it will be helpful to others.

Check out the onload function first, that sets the attribute for either billing-phone or shipping-phone to be required. If you are fast and are looking at the exact right spot of the checkout page, you would be able to see this change from “Phone (Optional)” to “Phone”.

The change_label event listener is to modify the error message when you don’t put in a phone number (by default it confusingly says “Please enter a valid phone number (optional)”. So that part is a bit hacky with attaching the event listener to the entire webpage, but unless you are trying to purchase from a Commodore 64 it should be fine.

<script>
// This script forces the billing/shipping
// phone number to be filled in and not optional
// Andy Wheeler, done via WPCode Extension in Footer
function change_label(){
    var xl = document.getElementById("billing-phone");
    if (xl){
        var ll = xl.nextSibling;
        var nd = ll.nextSibling;
        if (nd) {
            if (nd.getAttribute('role') == 'alert') {
                nd.firstChild.innerText = "Please enter a valid phone number"
            };
        };
    };
    var xs = document.getElementById("shipping-phone");
    if (xs){
        var ls = xs.nextSibling;
        var ns = ls.nextSibling;
        if (ns) {
            if (ns.getAttribute('role') == 'alert') {
                ns.firstChild.innerText = "Please enter a valid phone number"
            };
        };
    };
};

// So click is not working when people
// just use tabs/keyboard select
// not sure how to fix that, but just results in a
// bad red note that says "optional" (but still need
// to fill in
document.addEventListener('click',change_label);

window.onload = function() {
    var x = document.getElementById("billing-phone");
    if (x) {
        var lab = x.nextSibling;
        lab.innerText = "Phone";
        x.setAttribute('aria-label','Phone')
        x.setAttribute('required','')
        // These don't seem to work unfortunately!
        //x.addEventListener("change",change_label);
        //x.setAttribute("onChange","change_label();")
    };
    var x2 = document.getElementById("shipping-phone");
    if (x2) {
        var lab2 = x2.nextSibling;
        lab2.innerText = "Phone";
        x2.setAttribute('aria-label','Phone')
        x2.setAttribute('required','')
    };
};
</script>

Because there is no phone verification, you could technically put in a fake number for these FYI and no one would know. (I have a google voice number I use for instances in which I don’t really want to give out personal.)

Thanks again for those who have purchased a copy – appreciate the support.

Leave a comment

by Andy Wheeler on July 11, 2024 • Permalink

Posted in Crime Analysis, data science, Python

Tagged javascript, woocommerce, wordpress

Posted by Andy Wheeler on July 11, 2024

https://andrewpwheeler.com/2024/07/11/wake-lp-talk-on-lprs-and-javascript-hacks-in-woocommerce/

Some notes on self-publishing a tech book

So my book, Data Science for Crime Analysis with Python, is finally out for purchase on my Crime De-Coder website. Folks anywhere in the world can purchase a paperback or epub copy of the book. You can see this post on Crime De-Coder for a preview of the first two chapters, but I wanted to share some of my notes on self publishing. It was some work, but in retrospect it was worth it. Prior books I have been involved with (Wheeler 2017; Wheeler et al. 2021) plus my peer review experience I knew I did not need help copy-editing, so the notes are mostly about creating the physical book and logistics of selling it.

Academics may wish to go with a publisher for prestige reasons (I get it, I was once a professor as well). But it is quite nice once you have done the legwork to publish it yourself. You have control of pricing, and if you want to make money you can, or have it cheap/free for students.

Here I will detail some of the set up of compiling the book, and then the bit of work to distribute it.

Compiling the documents

So the way I compiled the book is via Quarto. I posted my config notes on how to get the book contents to look how I wanted on GitHub. Quarto is meant to run code at the same time (so works nicely for a learning to code book). But even if I just wanted a more typical science/tech book with text/images/equations, I would personally use Quarto since I am familiar with the set up at this point. (If you do not need to run dynamic code you could do it in Pandoc directly, not sure if there is a way to translate a Quarto yaml config to the equivalent Pandoc code it turns into.)

One thing that I think will interest many individuals – you write in plain text markdown. So my writing looks like:

# Chapter Heading

blah, blah blah

## Subheading

Cool stuff here ....

In a series of text files for each chapter of the book. And then I tell Quarto quarter render, and it turns my writing in those text files into both an Epub and a PDF (and other formats if you cared, such as word or html). You can set up the configuration for the book to be different for the different formats (for example I use different fonts in the PDF vs the epub, nice fonts in one look quite bad in the other). See the _quarto.yml file for the set up, in particular config options that are different for both PDF and Epub.

One thing is that ebooks are hard to format nicely – if I had a book I wanted to redo to be an epub, I would translate it to markdown. There are services online that will translate, they will do a bad job though with scientific texts with many figures (and surely will not help you choose nice fonts). So just learn markdown to translate. Folks who write in one format and save to the other (either Epub/HTML to PDF, or PDF to Epub/HTML) are doing it wrong and the translated format will look very bad. Most advice online is for people who have just books with just text, so science people with figures (and footnotes, citations, hyperlinks, equations, etc.) it is almost all bad advice.

So even for qualitative people, learning how to write in markdown to self-publish is a good skill to learn in my opinion.

Setting up the online store

For awhile I have been confused how SaaS companies offer payment plans. (Many websites just seem to copy from generic node templates.) Looking at the Stripe API it just seems over the top for me to script up all of my own solution to integrate Stripe directly. If I wanted to do a subscription I may need to figure that out, but it ended up being for my Hostinger website I can set up a sub-page that is WordPress (even though the entire website is not), and turn on WooCommerce for that sub-page.

WooCommerce ends up being easy, and you can set up the store to host web-assets to download on demand (so when you purchase it generates a unique URL that obfuscates where the digital asset is saved). No programming involved to set up my webstore, it was all just point and click to set things up one time and not that much work in the end.

I am not sure about setting up any DRM for the epub (so in reality people will purchase epub and share it illegally). I don’t know of a way to prevent this without using Amazon+Kindle to distribute the book. But the print book should be OK. (If there were a way for me to donate a single epub copy to all libraries in the US I would totally do that.)

I originally planned on having it on Amazon, but the low margins on both plus the formatting of their idiosyncratic kindle book format (as far as I can tell, I cannot really choose my fonts) made me decide against doing either the print or ebook on Amazon.

Print on Demand using LuLu

For print on demand, I use LuLu.com. They have a nice feature to integrate with WooCommerce, the only thing I wish shipping was dynamically calculated. (I need to make a flat shipping rate for different areas around the globe the way it is set up now, slightly annoying and will change the profit margins depending on area.)

LuLu is a few more dollars to print than Amazon, but it is worth it for my circumstance I believe. Now if I had a book I expected to get many “random Amazon search buys” I could see wanting it on Amazon. I expect more sales will be via personal advertising (like here on the blog, social media, or other crime analyst events). My Crime De-Coder site (and this blog) will likely be quite high in google searches for some of the keywords fairly quickly, so who knows, maybe just having on personal site is just as many sales.

LuLu does has an option to turn on distribution to other wholesalers (like Barnes & Noble and Amazon) – have not turned that on but maybe I will in the future.

LuLu has a pricing calculator to see how much to print on their website. Paperback and basically the cheapest color option for letter sized paper (which is quite large) is just over $17 for my 310 page book (Amazon was just over $15). For folks if you are less image heavy and more text, you could get away with a smaller size book (and maybe black/white) and I suspect will be much cheaper. LuLu’s printing of this book is higher quality compared to Amazon as well (better printing of the colors and nicer stock for the paperback cover).

Another nice thing about print on demand is I can go in and edit/update the book as I see fit. No need to worry about new versions. Not sure what that exactly means for citing the work (I could always go and change it), you can’t have a static version of record and an easy way to update at the same time.

Other Random Book Stuff

I purchased ISBNs on Bowker, something like 10 ISBNs for $200. (You want a unique ISBN for each type of the book, so you may want three in the end if you have epub/paperback/hardback.) Amazon and LuLu though have options to have them give you an ISBN though, so that may have not been necessary. I set the imprint to be my LLC though in Bowker, so CRIME De-Coder is the publisher.

You don’t technically need an ISBN at all, but it is a simple thing, and there may be ways for me to donate to libraries in the future. (If a University picks it up as a class text, I have been at places you need at least one copy for rent at the Uni library.)

I have not created an index – I may have a go at feeding my book through LLMs and seeing if I can auto-generate a nice index. (I just need a list of key words, after that can just go and find-replace the relevent text in the book to fill in so it auto-compiles an index.) I am not sure that is really necessary though for a how-to book, you should just look at the table of contents to see the individual (fairly small) sections. For epub you can just doing a direct text search, so not sure if people use an index at all in epubs.

Personal Goals

So I debated on releasing the book open source, I do want to try and see if I can make some money though. I don’t have this expectation, but there is potential to get some “data science” spillover, and if that is the case sales could in theory be quite high. (I was surprised in searching the “data science python” market on Amazon, it is definitely not saturated.) Personally I will consider at least 100 sales to be my floor for success. That is if I can sell at least 100 copies, I will consider writing more books. If I can’t sell 100 copies I have a hard time justifying the effort – it would just be too few of people buying the book to have the types of positive spillovers I want.

To make back money relative to the amount of work I put in, I would need more than 1000 sales (which I think is unrealistic). I think 500 sales is about best case, guesstimating the size of the crime analyst community that may be interested plus some additional sales for grad students. 1000 sales it would need to be in the multiple professors using it for a class book over several years. (Which if you are a professor and interested in this for a class let me know, I will give your class a discount.)

Another common way for individuals to make money off of books is not for sales, but to have training’s oriented around the book. I am hoping to do more of that for crime analysts directly in the future, but those opportunities I presume will be correlated with total sales.

I do enjoy writing, but I am busy, so cannot just say “I am going to drop 200 hours writing a book”. I would like to write additional python topics oriented towards crime analysts/criminology grad students like:

GIS analysis in python
Regression
Machine Learning & Optimization
Statistics for Crime Analysis
More advanced project management in python

Having figured out much of this grunt work definitely makes me more motivated, but ultimately in the end need to have a certain level of sales to justify the effort. So please if you like the blog pick up a copy and tell a friend you like my work!

References

Wheeler, A.P. (2024) Data Science for Crime Analysis with Python. CRIME De-Coder. ISBN 979-8-9903770-1-1
Wheeler, A.P. (2017) Geospatial analytics. In SPSS® Statistics for Data Analysis and Visualization. Eds. McCormick, K. & Salcedo J., with contributions by Peck J. & Wheeler, A.P. Wiley. ISBN 978-1119003557
Wheeler, A.P., Herrmann C.R., & Block R.L. (2021). Micro-place homicide patterns in Chicago. Springer. ISBN 978-3-030-61445-4

5 Comments

by Andy Wheeler on July 2, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Personal Productivity, Python, writing

Tagged data visualization, ebook

Posted by Andy Wheeler on July 2, 2024

https://andrewpwheeler.com/2024/07/02/some-notes-on-self-publishing-a-tech-book/

My word template for Quarto

I have posted on Github my notes on creating a word template to use with quarto. And since Quarto is just feeding into pandoc, those who are just using pandoc (so not doing intermediate computations), should maybe find that template worthwhile as well.

So first, why word? Quarto by default looks pretty nice for HTML. That is fine for them to prioritize that, but the majority of reports I want to use quarto for HTML is not the best format. Many times I want a report that can be emailed in PDF and/or printed. And sometimes I (or my clients) want a semi-automated report that can be edited after the fact. In those cases word is a good choice.

Editing LaTeX is too hard, and I am pretty happy with the this template for small reports. I will be sharing my notes on writing my python book in Quarto soonish, but for now wanted to share how I created a word template.

Note some of the items may seem gratuitous (why so many CRIME De-Coder logos?). Part of those are just notes though (like how to insert an image after your author name, I have done this to insert my signature in reports for example). The qmd file has most of the things I am interested in doing in documents, such as how to format markdown tables in python, doings sections/footnotes, references, table/figure captions, etc.

I do like my logo though in the header (it is hyperlinked even, so in subsequent PDFs if you click the logo it will go to my website), and the footer page numbers I commonly need in reports as well. And my title page and TOC do not look bad as well IMO. I am not one to discuss fonts, but I like small caps for titles and the Verdana font is nice to make it look somewhat different.

Creating the Template

So first, you can do from the command line:

quarto pandoc -o custom-reference-doc.docx --print-default-data-file reference.docx

From there, you should edit that reference.docx file to get what you want. So for example, if you want to change the font used for code snippets, in Word you can open up Styles, and on the right hand side select different elements and edit them:

Here for example to change the font for code snippets, you modify the HTML code style (I like Consolas):

There ended up being a ton of things I edited, I did not keep a list. Offhand you will want to modify the Title, Headings 1 & 2, First Paragraph, Body Text. And then you can edit things like the page numbers and header/footer.

So when rendering a document, you can sometimes click on the element in the rendered document and figure out what style it inherits from. Here for example you can see in the test.docx file that the quote section uses the “Block Text” style:

This does not always work though, and it can take some digging/experimentation in the original template file to get the right style modifier. (If you are having a real hard problem, convert the word document format to .zip, and dig into the XML documents. You can see the style formats in inherits from in the XML tree.) It doesn’t work for the code segments for example. Do not render a document and edit the style in that document, only edit the original --print-default-data-file reference.docx that was generated from the command line to update your template.

I have placed a few notes in the readme on Github, but one of my main things was making tables look nice. So this plays nicely with markdown tables, which I can use python to render directly. Here is an example of spreading tables across multiple pages.

One thing to note though is that this has limits – different styles are interrelated, so sometimes I would change one and it would propagate errors to different elements. (I can’t figure out how to change the default bullets to squares instead of circles for example without having bullets in places they should not be in tables – try to figure that one out. I also cannot figure out how to change the default font in tables, I would use monospace, without changing the font for other text elements in normal blocks.) So this template was the best I could figure without making other parts broken.

I have a few notes in the qmd file as well, showing how to use different aspects of markdown, as well as some sneaky things to do extra stuff (like formatting fourth level headings to produce a page break, I do not think I will need that deep of headings).

Even for those not using Quarto for computational workflows, writing in markdown is a really useful skill. You write in plain text, and can then have the output in different formats. Even for qualitative folks (or people in industry creating documents), I think many people would be well served by writing content in plain text markdown, and then rendering to whatever output they wanted.

1 Comment

by Andy Wheeler on June 21, 2024 • Permalink

Posted in data science, Data Visualization, Python, writing

Tagged pandoc, quarto

Posted by Andy Wheeler on June 21, 2024

https://andrewpwheeler.com/2024/06/21/my-word-template-for-quarto/

Conformal Sets Part 2, Estimating Precision

After publishing my post yesterday, in which I said you could not estimate the false positive rate using conformal sets, I realized there was a way given the nature of if you know some parts of the contingency table you can estimate the other parts. In short if you know the proportion of the outcome in your sample, which you can use the same calibration sample to estimate (and should be reasonable given the same exchangeability assumption for conformal sets to work to begin with) you can estimate the false positive rate (or the precision of your estimate given a particular threshold).

Here we know, given a particular threshold, the percentage estimates for the cells below:

           True
          0     1 
       -----------
Pred 0 | tn  | fn |
       ------------
     1 | fp  | tp |
       ------------
         X0    X1

I added two details to the table, X0 and X1, which I take to be the column counts for the cells. So imagine we have a result where you have 90% coverage for the positive class, and 60% coverage for the negative class. Also assume that the proportion of 1’s is 20%. So we have this percentage table:

           True
          0     1 
       -----------
Pred 0 | 60%  | 10% |
       ------------
     1 | 40%  | 90% |
       ------------

Now pretend to translate to counts, where we have 80 0’s and 20 1’s (my 20% specified above):

           True
          0     1 
       -----------
Pred 0 | 48  |  2 |
       ------------
     1 | 32  | 18 |
       ------------
         80    20

So for our false positive estimate, we have 32/(32 + 18) = 0.64. Pretend our fp,tp etc. are our original percent metrics, we could then write our table as:

            True
          0         1 
       -----------
Pred 0 | tn*X0  | fn*X1 |
       ------------
     1 | fp*X0  | tp*X1 |
       ------------
         X0        X1

And so our false positive equation is then:

    fp*X0
--------------
(fp*X0 + tp*X1)

We can make this more dimensionless, by setting X0 = 1, and then writing X0 = m*X1 = 1. Then we can update the above equation to simply be:

    fp*X0                    fp
--------------       =>  ----------
  fp*X0 + tp*X0*m         fp + tp*m

The factor m is prop/(1 - prop), where prop is the proportion of 1s, here 0.2, so m is 0.2/0.8 = 0.25. So if we do 0.4/(0.4 + 0.9*0.25) = 0.64. And that is our false positive estimate for that threshold across the sample, and our precision estimate is the complement, so 36%.

I have updated my code on github, but here is an example class to help calculate all of these metrics. It is short enough to put right on the blog:

import numpy as np
from scipy.stats import ecdf

class ConfSet:
    def __init__(self,y_true,y_score):
        self.p1 = y_score[y_true == 1]
        self.p0 = y_score[y_true == 0]
        self.prop = y_true.mean()
        self.m = self.prop/(1 - self.prop)
        self.ecdf1 = ecdf(self.p1)
        self.ecdf0 = ecdf(self.p0)
    def Cover1(self,k):
        res = np.percentile(self.p1,100-k)
        return res
    def Cover0(self,k):
        res = np.percentile(self.p0,k)
        return res
    def PCover(self,p):
        cov1 = self.ecdf1.sf.evaluate(p)
        cov0 = self.ecdf0.cdf.evaluate(p)
        # false positive estimate
        fp = 1 - cov0
        fprate = fp/(fp + cov1*self.m)
        return cov0, cov1, fprate

So based on your calibration sample, you pass in the predicted probabilities and true outcomes. After that you can either calculate recall (or zero class recall) using the Cover methods. Or given a particular threshold you can calculate the different metrics. Here is an example using this method (with the same data from yesterday) to replicate the metrics in that post (for coverage for 0, I am always working with the predicted probability for the positive class, not its complement):

And this provides a much better out of sample estimator of false positive rates than using the PR curve directly.

The method makes it easy to do a grid search over different thresholds and calculating metrics:

And here is an ugly graph to show that the out of sample matches up very closely to the estimates:

I debated on adding a method to solve for the false positive rate, but I think just doing a grid search may be the better approach. I am partially concerned in small samples the false positive estimate will not be monotonic (a similar problem happens in AUC calculations with censored data). But here in this sample at this grid level it is monotonic – and most realistic cases I have personally worrying about the third digit in predicted probabilities is just noise at that point.

1 Comment

by Andy Wheeler on June 9, 2024 • Permalink

Posted in data science, Python

Tagged conformal-prediction, machine-learning, prediction

Posted by Andy Wheeler on June 9, 2024

https://andrewpwheeler.com/2024/06/09/conformal-sets-part-2-estimating-precision/

Conformal Sets and Setting Recall

I had a friend the other day interested in a hypothesis along the lines of “I think the mix of crime at a location is different”, in particular they think it will be pushed to more lower level property (and fewer violent) based on some local characteristics. I had a few ideas on this – Brantingham (2016) and Lentz (2018) have examples of creating a permutation type test. And I think I could build a regression multinomial type model (similar to Wheeler et al. 2018) to generate a surface of crime category prediction types over a geographic area (e.g. area A has a mix of 50% property and 50% violent, and area B has a mix of 10% violent and 90% property).

Another approach though is pure machine learning and using conformal sets. I have always been confused about them (see my comment on Gelman’s blog) – reading some more about conformal sets though my comments on Andrew Gelman’s post are mostly confused but partly right. In short you can set recall on a particular class using conformal sets, but you cannot set precision (or equivalently set the false positive rate). So here are my notes on that.

For a CJ application of conformal sets, check out Kuchibhotla & Berk (2023). The idea is that you are predicting categorical classes, in the Berk paper it is recidivism classification with three categories {violent,non-violent,no recidivism}. Say we had a prediction for an individual for the three categories as {0.1,0.5,0.4} – you may say that this person has the highest predicted category of non-violent. Conformal sets are different, in that they can return multiple categories based on a decision threshold, e.g. predict {non-violent,no-recidivism} in this example.

My comment on Gelman’s blog is confused, in that I always thought “why not just take the probabilities to get the conformal set”, so if I wanted a conformal set of 90%, the non-violent and no recidivism probabilities add up to 90%, so wouldn’t they count? But that is not what conformal sets give you, conformal sets only make sense in the repeated frequentist sense I do this thing over and over again, what happens. So with conformal sets so you get Prob(Predict > Threshold | True ) = 0.95, or whatever conformal proportion you want. (I like to call this “coverage”, so out of those True outcomes, what threshold will cover 95% of them.)

This blog post with the code examples really helped my understanding, and how conformal sets can be applied to individual categories (which I think makes more sense than the return multiple labels scenario).

I have code to replicate on github, using data from the NIJ recidivism competition (Circo & Wheeler, 2022) as an example. See some of my prior posts for the feature engineering example, but I use the out of bag trick for random forests in lieu of having a separate calibration sample.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# NIJ Recidivism data with some feature engineering
pdata = pd.read_csv('NIJRecid.csv') # NIJ recidivism data

# Train/test split and fit model
train = pdata[pdata['Training_Sample'] == 1]
test = pdata[pdata['Training_Sample'] == 0]

yvar = 'Recidivism_Arrest_Year1'
xvar = list(pdata)[2:]

# Random forest, need to set OOB to true
# for conformal (otherwise need to use a seperate calibration sample)
rf = RandomForestClassifier(max_depth=5,min_samples_leaf=100,random_state=10,n_estimators=1000,oob_score=True)
rf.fit(train[xvar],train[yvar])

# Out of bag predictions
probs = rf.oob_decision_function_

Now I have intentionally made this as simple as possible (the towards data science post has a small sample quantile correction, plus has a habit of going back and forth between P(y=1) and 1 - P(y=1)). But to get a conformal threshold in this scenario to set the recall at 95% is quite simple:

# conditional predictions for actual 1's
p1 = probs[train[yvar]==1,1]

# recall 95% coverage
k = 95
cover95 = np.percentile(p1,100-k)
print(f'Threshold to have conformal set of {k}% for capturing recidivism')
print(f'{cover95:,.3f}')

# Now can check out of sample
ptest = rf.predict_proba(test[xvar])
out_cover = (ptest[test[yvar]==1,1] > cover95).mean()
print(f'\nOut of sample coverage at {k}%')
print(f'{out_cover:,.3f}')

And this results in for this data:

So this is again recall, or I like to call it the capture rate of the true positives. It is true_positives / (true_positives + false_negatives). This threshold value is estimated purely based on the calibration sample (or here the OOB estimates). The model I will show is not very good, but the conformal sets you still get good performance. So this is quite helpful, having a good estimator (based on exchangeability, so no drift over time). I think in practice though that will not be bad (I by default auto-retrain models I put into production on a regular schedule, e.g. retrain once a month), so I don’t bother monitoring drift.

You can technically do this for each class, so you can have a recall set for the true negatives as well:

# can also set the false negative rate in much the same way
p0 = probs[train[yvar]==0,0]

# false negative rate set to 5%
k = 95
cover95 = np.percentile(p0,100-k)
print(f'Threshold (for 0 class) to have conformal set of {k}% for low risk')
print(f'{cover95:,.3f}')

# Now can check out of sample
out_cover = (ptest[test[yvar]==0,0] > cover95).mean()
print(f'\nOut of sample coverage at {k}%')
print(f'{out_cover:,.3f}')

Note that the threshold here is for P(y=0),

Threshold (for 0 class) to have conformal set of 95% for low risk
0.566

Out of sample coverage at 95%
0.953

Going back to the return multiple labels idea, in this example for predictions (for the positive class) we would have this breakdown (where 1 – 0.566 = 0.434):

P < 0.19 = {no recidivism}
0.19 < P < 0.434 = {no recidivism,recidivism}
0.434 < P = {recidivism}

Which I don’t think is helpful offhand, but it would not be crazy for someone to want to set the recall (for either class on its own) as a requirement in practice. So say we have a model that predicts some very high risk event (say whether to open an investigation into potential domestic terrorist activity). We may want the recall for the positive class to be very high, so even if it is a lot of nothing burgers, we have some FBI agent at least give some investigation into the predicted individuals.

For the opposite scenario, say we are doing release on recognizance for pre-trial in lieu of bail. We want to say, of those who would not go onto recidivate, we only want to “falsely hold pretrial” 5%, so this is a 95% conformal set of True Negative/(True Negative + False Positive) = 0.95. This is what you get in the second example above for no-recidivism.

Note that this is not the false positive rate, which is False Positive/(True Positive + False Positive), which as far as I can tell you cannot determine via conformal sets. If I draw my contingency table (use fp as false positive, tn as true negative, etc.) Conformal sets condition on the columns, whereas the false positive rate conditions on the second row.

          True
          0     1 
       -----------
Pred 0 | tn  | fn |
       ------------
     1 | fp  | tp |
       ------------

So what if you want to set the false positive rate? In batch I know how to set the false positive rate, but this random forest model happens to not be very well calibrated:

# This models calibration is not very good, it is overfit
dfp = pd.DataFrame(probs,columns=['Pred0','Pred1'],index=train.index)
dfp['y'] = train[yvar]
dfp['bins'] = pd.qcut(dfp['Pred1'],10)
dfp.groupby('bins')[['y','Pred1']].sum()

So I go for a more reliable logistic model, which does result in more calibrated predictions in this example:

# So lets do a logit model to try to set the false positive rate
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Making a second calibration set
train1, cal1 = train_test_split(train,train_size=10000)
logitm = LogisticRegression(random_state=10,penalty=None,max_iter=100000)
logitm.fit(train1[xvar],train1[yvar])
probsl = logitm.predict_proba(cal1[xvar])

# Can see here that the calibration is much better
dflp = pd.DataFrame(probsl,columns=['Pred0','Pred1'],index=cal1.index)
dflp['y'] = cal1[yvar]
dflp['bins'] = pd.qcut(dflp['Pred1'],10)
dflp.groupby('bins')[['y','Pred1']].sum()

Now the batch way to set the false positive rate, given you have a well calibrated model is as follows. Sort your batch according the predicted probability of the positive class in descending value. Pretend we have a simple set of four cases:

Prob
 0.9
 0.8
 0.5
 0.1

Now if we set the threshold to be 0.6, we would then have {0.9,0.8} as our two predictions, we then estimate that the false positive rate would be (0.1 + 0.2)/2 = 0.3/2, If we set the threshold to be 0.4, we would have a false positive rate estimate of (0.1 + 0.2 + 0.5)/3 = 0.8/3. So this relies on a batch of characteristics that we are predicting, and is not determined beforehand (this is the idea I use in this post on prioritizing audits):

# The batch way to set the false positive rate
ptestl = logitm.predict_proba(test[xvar])
dftp = pd.DataFrame(ptestl,columns=['Pred0','Pred1'],index=test.index)
dftp['y'] = test[yvar]

dftp.sort_values(by='Pred1',ascending=False,inplace=True)
dftp['PredictedFP'] = (1 - dftp['Pred1']).cumsum()
dftp['AcutalFP'] = (dftp['y'] == 0).cumsum()
dftp['CumN'] = np.arange(dftp.shape[0]) + 1
dftp['PredRate'] = dftp['PredictedFP']/dftp['CumN']
dftp['ActualRate'] = dftp['AcutalFP']/dftp['CumN']
dftp.iloc[range(1000,7001,1000)]

What happens if we try to estimate where to set the threshold in the training/calibration data though? NOTE: I have a new blog post showing how to construct a more appropriate estimate of the false positive rate ENDNOTE. In practice, we often need to make decisions one at a time, in the parole case it is not like we hold all parolees in the queue for a month to save and batch process them. So lets use our precision in the calibration sample to get a threshold:

# Using precision to set the threshold (based on calibration set)
fp_set = 0.45
pr_data = precision_recall_curve(cal1[yvar], probsl[:,1])
loc = np.arange(pr_data[0].shape[0])[pr_data[0] > fp_set].min()
thresh_fp = pr_data[2][loc]

print(f'Threshold estimate for FP rate at {fp_set}')
print(f'{thresh_fp:,.3f}')

print(f'\nActual FP rate in test set at threshold {thresh_fp:,.3f}')
test_fprate = 1 - test[yvar][ptest[:,1] > thresh_fp].mean()
print(f'{test_fprate:,.3f}') # this is not a very good estimate!

Which gives us very poor out of sample estimates – we set the false positive rate to 45%, but ends up being 55%:

Threshold estimate for FP rate at 0.45
0.333

Actual FP rate in test set at threshold 0.333
0.549

So I am not sure what the takeaway from that is, whether we need to be doing something else to estimate the false positive rate (like an online learning approach that Chohlas-Wood et al. (2021) discuss). A takeaway though from the NIJ competition I have is that false positives tend to be a noisy measure (and FP for fairness between groups just exacerbates the problem), so maybe we just shouldn’t be worried about false positives at all. In many CJ scenarios, we do not get any on-policy feedback on false positives – think the bail case where you have ROR vs held pre-trial, you don’t observe false positives in that scenario in practice.

Conformal sets though, if you want recall for particular classes are the way to go though. You can also do them for subsets of data, e.g. different conformal thresholds for male/female, minority/white. So have an easy way to accomplish a fairness ideal with post processing. And I may do a machine learning approach to help that friend out with the crime mix in places idea as well (Wheeler & Steenbeek, 2021).

References

Brantingham, P. J. (2016). Crime diversity. Criminology, 54(4), 553-586.
Chohlas-Wood, A., Coots, M., Zhu, H., Brunskill, E., & Goel, S. (2021). Learning to be fair: A consequentialist approach to equitable decision-making. arXiv preprint arXiv:2109.08792.
Circo, G. M., & Wheeler, A. P. (2022). An Open Source Replication of a Winning Recidivism Prediction Model. International Journal of Offender Therapy and Comparative Criminology, Online First
Kuchibhotla, A.K., & Berk, R.A. (2023). Nested conformal prediction sets for classification with applications to probation data. The Annals of Applied Statistics, 17(1), 761-785.
Lentz, T.S. (2018). Crime diversity: reexamining crime richness across spatial scales. Journal of Contemporary Criminal Justice, 34(3), 312-335.
Wheeler, A.P., & Steenbeek, W. (2021). Mapping the risk terrain for crime using machine learning. Journal of Quantitative Criminology, 37, 445-480
Wheeler, A.P., Steenbeek, W., & Andresen, M. A. (2018). Testing for similarity in area‐based spatial patterns: Alternative methods to Andresen’s spatial point pattern test. Transactions in GIS, 22(3), 760-774.

1 Comment

by Andy Wheeler on June 7, 2024 • Permalink

Posted in Crime Analysis, data science, Python

Tagged conformal-prediction, machine-learning

Posted by Andy Wheeler on June 7, 2024

https://andrewpwheeler.com/2024/06/07/conformal-sets-and-setting-recall/

Notes on MMc queues

Recently had a project related to queues at work, so wanted to put some of my notes in a blog post. For a bit of up-front, the notation MMc refers to a queuing system with multiple servers (c), and the inputs are Poisson distributed (the first M), and have exponential service rates M (these Ms can be different though). That is a mouthful, but basically saying events that arrive in independently and have a left skewed distribution of times it takes to resolve those events. (That may seem like a lot of assumptions, they are often reasonable though for many systems, and if not deviations may not be that big of deal to the estimates in practice.)

Main reason for blog post is that the vast majority of stuff online is about MM1 queue systems, so systems that only have 1 server. I basically never deal with this situation. The formulas for multiple servers are much more complicated, so took me a bit to gather code examples and verify correctness. These are notes based on that work.

So for up-front, the group I was dealing with at work had a fundamental problem, their throughput was waaay too small. In this notation, we have:

Number of arrivals per time period, N
Mean time it takes to exit the queue, S
Number of servers, c

So first, you need to have N*S < c! This is simple accounting, so say we are talking about police calls for service, you have on average 5 calls per hour, and they take on average 0.5 hours (30 minutes) to handle. You then need more than 5*0.5 = 2.5 officers to handle this, so a minimum of 3 officers. If you don’t have 3 officers, the queue will grow, you won’t be able to handle all of the calls.

At work I was advising a situation where they were chronically too low of staff serving for a particular project, and it has ballooned over an extended period of time to create an unacceptable backlog. So think S is really tiny and N is very large – at first the too small of servers could cycle through the tickets, but the backlog just slowly grew, and then after months, they had unacceptable wait times. This is a total mess – there is no accounting trick to solve this, you need c > N*S. It makes no sense to talk about anything else like average wait time in the queue unless that condition is met.

OK, so we know you need c > N*S, a common rule of thumb is that capacity should not be over 80%, so that is c > (N*S)/0.8. (This is not for policing, but more common for call centers, see also posts on Erlang-C formulas.) The idea behind 80% it is at the point where wait times (being held in the queue) start to grow.

If you want to get more into the nitty gritty though, such as calculating the actual probability in the queue, average wait time, etc., then you will want to dig into the MMc queue lit. Here I have posted some python notes (that is itself derivative work others have posted). Hoping just posting and giving my thumbs up makes it easier for others.

So first here is an example of using those functions to estimate our queue example above. Note you need to give the inverse of the mean service time for this function.

# queuing functions in python
from queue import MMc, nc

N = 6    # 6 calls per hour
S = 0.5  # calls take 30 minutes to resolve
c = 7    # officers taking calls

# This function expects inverse service average
qS = MMc(N,1/S,c)

# Now can get stats of interest

# This is the probability that when a call comes
# in, it needs to wait for an officer
qS.getQueueProb()

And this prints out 0.0376.... So when a call comes in, we have a 3% probability of having to wait in the queue for an officer to respond. How about how long on average a call will wait in the queue?

# This is how long a call on average needs
# to wait in the queue in minutes
qS.getAvgQueueTime()*60

And this gives 0.28.... The multiplication by 60 goes from hours to minutes, and we are waiting less than 1 minute on average. This seems good, but somewhat counter-intuitively, this is an average of a bunch of calls answered immediately, plus the 3.8% of calls that are held for some time. We can calculate the estimate of if a call is held, how long will it be held on average:

# If a call is queued however, how long to wait?
qS.getAvgQueueTime_Given()*60

And this is a less rosy 17.5 minutes! Queues are tricky. Unless you have a lot of extra capacity, there are going to be wait times. We can also calculate how often all officers will be idle in this setup.

# Idle time, no one taking any calls
qS.getIdleProb()

And this gives rounded 0.05, so we have only 5% idle time in this set up. This is not that helpful though for police planning, you want individual officers to have capacity to do proactive work, that is more you want officers to only spend 40-60% on responding to calls for service, so that suggests c > (N*S)/0.5 is where you want to be. Which is where we are at in this scenario with 7 officers. This is the probability all 7 officers will be idle at once, which does not really matter.

Now you can technically just run this through multiple values of c to get this, but Rosetti (2021) has listed an approximate square root staffing formula that given an input probability wait in the queue, how many servers do you need. So here is that function:

# If you want probability of holding in the queue to only be 3%
est_serv = nc(N,S,0.03)
print(est_serv)

Which prints out 6.387..., so since you need to take the ceiling of this, you will need to 7 officers to keep to that probability (agreeing with the MMc object above).

In terms of values, the nc function will work with very large/small N and S inputs just fine. The MMc function also looks fine, minus one submethod uses a factorial, .getPk (so cannot have very large inputs to that method), but the rest is OK. So if you wanted to do nc(very_big,very_small,0.1) that is fine and should be no floating point issues.

The nc function relies on scipy, but the MMc class is all base python (just the math library). So the MMc functions can really be embedded in any particular python application you want with no real problem.

Rough Estimates for Spatial Police Planning

I have prior work on spatial allocation of patrol units with workload equality constraints (Wheeler, 2018). But generally, you need to first estimate how many units you will have, and after that you can worry about optimally distributing them. The reason for this is that the number of units is much more important, too few and you will have more queuing, in which case the spatial arrangement does not matter at all. Larson & Stevenson (1972) estimate optimal spatial allocation only beats random allocation by 25%.

So for police response times you can think about time waiting in queue, time spent driving to the event, and time spent resolving the event (time to dispatch tends to be quite trivial, but is sometimes included in the wait in the queue part, Verlaan & Ruiter, 2023).

There is somewhat of a relationship between the above “service” time, fewer people have to drive farther, and so service time goes up. But there happens to some simple rules of thumb, if you have N patrol units, you can calculate (2/3)*sqrt(Square Miles)/sqrt(N) = average distance traveled in miles for your jurisdiction (Stenzel, 1993, see page 135 in the PDF). Then you can translate that miles driven to time, by say taking an average of 45 miles per hour. Given a fixed N, you can then just add this into the service time estimate for your given jurisdiction to get a rough estimate of more officers will reduce response times by X amount.

It ends up being though this tends to be trivial relation to the waiting in the queue time (or the typical it takes 30 minutes to resolve police incidents on average). So it is often more important to get rough estimates for that if you want to reduce wait times for calls for service. And this does not even take into account priority levels in calls, but to start simpler folks should figure out a minimum to handle the call stack (whether in policing or in other areas) and then go onto more complicated scenarios.

References

Larson, R.C., & Stevenson, K.A. (1972). On insensitivities in urban redistricting and facility location. Operations Research, 20(3), 595-612.
Rossetti, M.D. (2021). Simulation Modeling and Arena, 3rd and Open Text Edition.
Stenzel, W. (1993). Police Personnel Allocation Manual Users Guide. US Dept of Transportation National Highway Traffic Safety.
Verlaan, T., & Ruiter, S. (2023). Predictors of police response time: a scoping review. Crime Science, 12(1), 19.
Wheeler, A. P. (2018). Creating optimal patrol areas using the p-median model. Policing: An International Journal, 42(3), 318-333.

Leave a comment

by Andy Wheeler on May 29, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Python

Tagged queue

Posted by Andy Wheeler on May 29, 2024

https://andrewpwheeler.com/2024/05/29/notes-on-mmc-queues/

Grabbing the NHAMCS emergency room data in python

So Katelyn Jetelina on her blog, The Local Epidemiologist, had a blog post (with Heidi Moseson) on how several papers examining mifepristone related to emergency room (ER) visits were retracted. (Highly recommend Katelyn’s blog, I really enjoy the mix of virology and empirical data discussions you can’t get from other outlets.)

This reminded me I had on the todo list examining the CDC’s NHAMCS (National Hospital Ambulatory Medical Care Survey) data. This is a sample of ER visit data collated by the CDC. I previously putzed with this data to illustrate predictive models for wait times, and I was interested in examining gun violence survival rates in this dataset.

I had the idea with checking out gun violence in this data after seeing Jeff Brantingham’s paper showing gun shot survival rates in California have been decreasing, and ditto for Chicago via work by Jen’s Ludwig and Jacob Miller. It is not uber clear though if this is a national pattern, Jeff Asher does not think so for example. So I figured the NHAMCS would be a good way to check national survival rates, and maybe see if any metro areas were diverging over time.

Long story short, the NHAMCS data is waaay too small of sample to look at rare outcomes like gun violence. (So probably replicating the bad studies Katelyn mentions in her blog are not worth the effort, they will be similarly rare). But the code is concise enough to share in a quick blog post for others if interested. Looking at the data the other day, I realized you could download SPSS/SAS/Stata files instead of the fixed width from the CDC website. This is easier than my prior post, as you can read those different files into python directly without having to code all of the variable fields from the fixed width file.

So for some upfront, the main library you need is pandas (as well as pyreadstat installed). The rest is just stuff that comes with pythons standard library. The NHAMCS files are zipped SPSS files, so a bit more painful to download but not that much of an issue. (Unfortunately you cannot just read in memory, like I did with Excel/csv here, I have to save the file to disk and then read it back.)

import pandas as pd
import zipfile
from io import BytesIO
import requests
from os import path, remove

# This downloads zip file for SPSS
def get_spss(url,save_loc='.',convert_cat=False):
    ext = url[-3:]
    res = requests.get(url)
    if ext == 'zip':
        zf = zipfile.ZipFile(BytesIO(res.content))
        spssf = zf.filelist[0].filename
        zz = zf.open(spssf)
        zs = zz.read()
    else:
        zs = BytesIO(res.content)
        spssf = path.basename(url)
    sl = path.join('.',spssf)
    with open(sl, "wb") as sav:
        sav.write(zs)
    df = pd.read_spss(sl,convert_categoricals=convert_cat)
    remove(sl)
    return df

Now that we have our set up, we can just read in each year. Note 2005!

# creating urls
base_url = 'https://ftp.cdc.gov/pub/health_statistics/nchs/dataset_documentation/NHAMCS/spss/'
files = ['ed02-spss.zip',
         'ed03-spss.zip',
         'ed04-spss.zip',
         'ed05-sps.zip',
         'ed06-spss.zip',
         'ed07-spss.zip',
         'ed08-spss.zip',
         'ed09-spss.zip',
         'ed2010-spss.zip',
         'ed2011-spss.zip',
         'ed2012-spss.zip',
         'ed2013-spss.zip',
         'ed2014-spss.zip',
         'ed2015-spss.zip',
         'ed2016-spss.zip',
         'ed2017-spss.zip',
         'ed2018-spss.zip',
         'ed2019-spss.zip',
         'ed2019-spss.zip',
         'ed2020-spss.zip',
         'ed2021-spss.zip']
urls = [base_url + f for f in files]

def get_data():
    res_data = []
    for u in urls:
        res_data.append(get_spss(u))
    for r in res_data:
        r.columns = [v.upper() for v in list(r)]
    vars = []
    for i,d in enumerate(res_data):
        year = i + 2001
        vars += list(d)
    vars = list(set(vars))
    vars.sort()
    vars = pd.DataFrame(vars,columns=['V'])
    for i,d in enumerate(res_data):
        year = i + 2001
        uc = [v.upper() for v in list(d)]
        vars[str(year)] = 1*vars['V'].isin(uc)
    return res_data, vars

rd, va = get_data()
all_data = pd.concat(rd,axis=0,ignore_index=True)

Note that the same links with the zipped up sav files also have .sps files, so you can see how the numeric variables are encoded. Or pass in the argument convert_cat=True to the get_spss function and it will turn the data into strings based on the labels.

You can check out which variables are available for which years via the va dataframe. They are quite consistent though. The bigger pain is that for older years, we have ICD9 codes, and for more recent years we have ICD10. So it takes a bit of work to normalize between the two (for ICD10, just looking at the first 3 is ok, for ICD9 you need to look at all 5 though). It is similar to NIBRS crime data, a single event can have different codes associated with it, so you need to look across all of them to identify whether any of the codes are associated with gun assaults.

# Assaulting Gun Violence for ICD9/ICD10
# ICD9, https://www.aapc.com/codes/icd9-codes-range/165/
# ICD9, https://www.cdc.gov/nchs/injury/ice/amsterdam1998/amsterdam1998_guncodes.htm
# ICD10, https://www.icd10data.com/ICD10CM/Codes/V00-Y99/X92-Y09
gv = {'Handgun': ['X93','9650'],
      'Longgun': ['X94','9651','9652','9653'],
      'Othergun': ['X95','9654']}

any_gtype = gv['Handgun'] + gv['Longgun'] + gv['Othergun']
gv['Anygun'] = any_gtype

fields = ['CAUSE1','CAUSE2','CAUSE3']

all_data['Handgun'] = 0
all_data['Longgun'] = 0
all_data['Othergun'] = 0
all_data['Anygun'] = 0


for f in fields:
    for gt, gl in gv.items():
        all_data[gt] = all_data[gt] + 1*all_data[f].isin(gl) + 1*all_data[f].str[:3].isin(gl)

for gt in gv.keys():
    all_data[gt] = all_data[gt].clip(0,1)

There are ranging between 10k and 40k rows in each year, but overall there are very few observations of assaultive gun violence. So even with over 500k rows across the 19 years, there are fewer than 200 incidents of people going to the ER because of a gun assault.

# Not very many, only a handful each
all_data[gv.keys()].sum(axis=0)

# This produces
# Handgun      20
# Longgun      11
# Othergun    139
# Anygun      170

These are far too few in number to estimate the survival rate changes over time. So the Brantingham or Ludwig analysis that looks at larger register data of healthcare claims, or folks looking at reported crime incident data, is likely to be much more reasonable to estimate those trends. If you do a groupby per year this becomes even more stark:

# Per year it is quite tiny
all_data.groupby('YEAR')[list(gv.keys())].sum()

#         Handgun  Longgun  Othergun  Anygun
# YEAR
# 2002        1        0        12      13
# 2003        5        2         4      11
# 2004        1        3        12      16
# 2005        2        1         7      10
# 2007        1        0        14      15
# 2008        2        2        10      14
# 2009        0        0        12      12
# 2010        1        0        10      11
# 2011        2        1         9      12
# 2012        1        0         6       7
# 2013        0        0         0       0
# 2014        1        0         2       3
# 2015        0        0         6       6
# 2016        0        0         5       5
# 2017        0        0         0       0
# 2018        0        1         4       5
# 2019        0        0         6       6
# 2020        0        0         9       9
# 2021        1        0         4       5

While using weights you can get national level estimates, the standard errors are based on the observed number of cases. Which in retrospect I should have realized – gun violence is pretty rare, so rates in the 1 to 100 per 100,000 would be the usual range. If anything these are maybe a tinge higher than I should have guessed (likely due to how CDC does the sampling).

To be able to do the analysis of survival rates I want, the sample sizes here would need to be 100 times larger than they are. Which would require something more akin to NIBRS reporting by hospitals directly, instead of having CDC do boots on the ground samples. Which is feasible of course (no harder for medical providers to do this than police departments), see SPARCS with New York data for example.

But perhaps others can find this useful. It may be easier to do things more like 1/100 events and analyze them. The data has quite a few variables, like readmission due to other events, public/private insurance, different drugs, and then of course all the stuff that is recorded via ICD10 codes (which is both health events as well as behavioral). So probably not a large enough sample to do analysis of other criminal justice related health care incidents, but they do add up to big victim costs to the state that are easy to quantify, as the medicaid population is a large chunk of that.

Leave a comment

by Andy Wheeler on March 30, 2024 • Permalink

Posted in data science, healthcare, Python

Tagged cdc, health data

Posted by Andy Wheeler on March 30, 2024

https://andrewpwheeler.com/2024/03/30/grabbing-the-nhamcs-emergency-room-data-in-python/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 399 other subscribers
aoristic big-data cartography census choropleth citeulike color consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category data science

ChatGPT

Google

Claude

Prompt Engineering Ideas

Compiling the documents

Setting up the online store

Print on Demand using LuLu

Other Random Book Stuff

Personal Goals

References

Creating the Template

References

Rough Estimates for Spatial Police Planning

References

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange