All posts in category Criminal Justice

Year in Review 2024

Past year in review posts I have made focused on showing blog stats. Writing this in early December, but total views will likely be down this year – I am projecting around 140,000 views in total for this site. But I have over 25k views for the Crime De-Coder site, so it is pretty much the same compared to 2023 combining the two sites.

I do not have a succinct elevator speech to tell people what I am working on. With the Crime De-Coder consulting gig, it can be quite eclectic. That Tukey quote being a statistician you get to play in everyone’s backyard is true. Here is a rundown of the paid work I conducted in the past year.

Evidence Based CompStat: Work with Renee Mitchell and the American Society of Evidence Based Policing on what I call Evidence Based CompStat. This mostly amounts to working directly with police departments (it is more project management than crime analysis) to help them get started with implementing evidence based practices. Reach out if that sounds like something your department would be interested in!

Estimating DV Violence: Work supported by the Council on CJ. I forget exactly the timing of events. This was an idea I had for a different topic (to figure out why stores and official reports of thefts were so misaligned). Alex approached me to help with measuring national level domestic violence trends, and I pitched this idea (use local NIBRS data and NCVS to get better local estimates).

Premises Liability: I don’t typically talk about ongoing cases, but you can see a rundown of some of the work I have done in the past. It is mostly using the same stats I used as a crime analyst, but in reference to civil litigation cases.

Patrol Workload Analysis: I would break workload analysis for PDs down into two categories, advanced stats and CALEA reports. I had one PD interested in the simpler CALEA reporting requirement (which I can do for quite a bit cheaper than the other main consulting firm that offers these services).

Kansas City Python Training: Went out to Kansas City for a few days to train their analysts up in using python for Focused Deterrence. If you think the agenda in the pic below looks cool get in touch, I would love to do more of these sessions with PDs. I make it custom for the PD based on your needs, so if you want “python and ArcGIS”, or “predictive models” or whatever, I will modify the material to go over those advanced applications. I have also been pitching the same idea (short courses) for PhD programs. (So many posers in private sector data science, I want more social science PhDs with stronger tech skills!)

Patterson Opioid Outreach: Statistical consulting with Eric Piza and Kevin Wolff on a street outreach intervention intended to reduce opioid overdose in Patterson New Jersey. I don’t have a paper to share for that at the moment, but I used some of the same synthetic control in python code I developed.

Bookstore prices: Work with Scott Jacques, supported by some internal GSU money. Involves scraping course and bookstore data to identify the courses that students spend the most on textbooks. Ultimate goal in mind is to either purchase those books as unlimited epubs (to save the students money), or encourage professors to adopt better open source materials. It is a crazy amount of money students pour into textbooks. Several courses at GSU students cumulatively spend over $100k on course materials per semester. (And since GSU has a large proportion of Pell grant recipients, it means the federal government subsidizes over half of that cost.)

General Statistical Consulting: I do smaller stat consulting contracts on occasion as well. I have an ongoing contract to help with Pam Metzger’s group at the SMU Deason center. Did some small work for AH Datalytics on behind the scenes algorithms to identify anomalous reporting for the real time crime index. I have several times in my career consulted on totally different domains as well, this year had a contract on calculating regression spline curves for some external brain measures.

Data Science Book: And last (that I remember), I published Data Science for Crime Analysis with Python. I still have not gotten my 100 sales I would consider it a success – so if you have not bought a copy go do that right now. (Coupon code APWBLOG will get you $10 off for the next few weeks, either the epub or the paperback.)

Sometimes this seems like I am more successful than I am. I have stopped counting the smaller cold pitches I make (I should be more aggressive with folks, but most of this work is people reaching out to me). But in terms of larger grant proposals or RFPs in that past year, I have submitted quite a few (7 in total) and have landed none of them to date! Submitted a big one to support surveys that myself and Gio won the NIJ competition on for place based surveys to NIJ in their follow up survey solicitation, and it was turned down for example. So it goes.

In addition to the paid work, I still on occasion publish peer reviewed articles. (I need to be careful with my time though.) I published a paper with Kim Rossmo on measuring the buffer zone in journey to crime data. I also published the work on measuring domestic violence supported by the Council on CJ with Alex Piquero.

I took the day gig in Data Science at the end of 2019. Citations are often used as a measure of a scholars influence on the field – they are crazy slow though.

I had 208 citations by the end of 2019, I now have over 1300. Of the 1100 post academia, only a very small number are from articles I wrote after I left (less than 40 total citations). A handful for the NIJ recidivism competition paper (with Gio), and a few for this Covid and shootings paper in Buffalo. The rest of the papers that have a post 2019 publishing date were entirely written before I left academia.

Always happy to chat with folks on teaming up on papers, but it is hard to take the time to work on a paper for free if I have other paid work at the moment. One of the things I need to do to grow the business is to get some more regular work. So if you have a group (academic, think tank, public sector) that is interested in part time (or fractional I guess is what the cool kids are calling it these days), I would love to chat and see if I could help your group out.

1 Comment

by Andy Wheeler on December 3, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Personal Productivity, website, writing

Tagged consulting, training

Posted by Andy Wheeler on December 3, 2024

https://andrewpwheeler.com/2024/12/03/year-in-review-2024/

Some things work

A lawyer and economist Megan Stevenson last year released an essay that was essentially “Nothing Works” 2.0. For those non-criminologists, “nothing works” was a report by Robert Martinson in the 1970’s that was critical of the literature on prisoner rehabilitation, and said to date that essentially all attempts at rehabilitation were ineffective.

Martinson was justified. With the benefit of hindsight, most of the studies Martinson critiques were poorly run (in terms of randomization, they almost all were observational self selected into treatment) and had very low statistical power to detect any benefits. (For people based experiments in CJ, think you typically need 1000+ participants, not a few dozen.)

Field experiments are hard and messy, and typically we are talking about “reducing crime by 20%” or “reducing recidivism by 10%” – they are not miracles. You can only actually know if they work using more rigorous designs that were not used at that point in social sciences.

Stevenson does not deny those minor benefits exist, but moves the goalposts to say CJ experiments to date have failed because they do not generate wide spread sweeping change in CJ systems. This is an impossible standard, and is an example of the perfect being the enemy of the good fallacy.

A recent piece by Brandon del Pozo and colleagues critiques Stevenson, which I agree with the majority of what Brandon says. Stevenson’s main critique are not actually with experiments, but more broadly organizational change (which del Pozo is doing various work in now).

Stevenson’s critique is broader than just policing, but I actually would argue that the proactive policing ideals of hotspots and focused deterrence have diffused into the field broadly enough that her points about systematic change are false (at least in those two examples). Those started out as individual projects though, and only diffused through repeated application in a slow process.

As I get older, am I actually more of the engineer mindset that Stevenson is critical of. As a single person, I cannot change the whole world. As a police officer or crime analyst or professor, you cannot change the larger organization you are a part of. You can however do one good thing at a time.

Even if that singular good thing you did is fleeting, it does not make it in vain.

References

del Pozo, B., Belenko, S., Pivovarova, E., Ray, B., Martins, K. F., & Taxman, F. S. (2024). Using implementation science to improve evidence-based policing: An introduction for researchers and practitioners. Police Quarterly, Online First.
del Pozo, B., Belenko, S., Taxman, F. S., Engel, R. S., Ratcliffe, J., Adams, I., & Piquero, A. R. (2024). Then a miracle occurs: cause, effect, and the heterogeneity of criminal justice research. Journal of Experimental Criminology, Online First
Martinson, R. (1974). What works? Questions and answers about prison reform. The public interest, 35(2), 22-54.
Stevenson, M. (2023) Cause, effect, and the structure of the social world. Boston University Law Review 103 (7), 2001-2047.

Leave a comment

by Andy Wheeler on October 6, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, scholarly

Tagged quasi-experiment

Posted by Andy Wheeler on October 6, 2024

https://andrewpwheeler.com/2024/10/06/some-things-work/

Using GenAI to describe charts for reports

One of the ideas that has come up with the recent GenAI craze is to use these tools to conduct end-to-end data analysis. So you feed it a dataset + a question and out pops an analysis. Mike Zidar has notes on using a Google tool to do this:

I am not worried about these tools usurping crime analysts though. The reason is that the vast majority of data analysis (crime or business or whatever) is very superficial. The hardest part is not generating the chart, it is knowing what chart to generate, and how it will be used by real people to make real decisions. Often in crime analysis you get the ambiguous “well this will help allocate resources” – when in reality your chart can in no way help dictate any realistic decision a department is going to make.

If you cut out analysts, and just have front line individuals asking google “do crime analysis”, it will be hopelessly superficial, and the front line will just ignore it altogether.

I however do think that GenAI has the ability to become power-tools for super-users. That is, someone who does know what they want to calculate, but uses the computer to help them get that information faster. Not dissimilar to how auto-complete while texting helps you type faster. And here is one use case I have been thinking about – so analysts spend a ton of time automating different products, such as monthly CompStat reports. The reports should have tables/graphs in them, like this chart of gun crimes in DC for example:

Now, most reports you want to also have a plain text summary of what is going on in the chart. Currently when auto-generating reports, it is difficult to mix that plain text in and not make it very superficial using a rule based system. The newest round of many of the GenAI tools lets you upload an image and ask it questions about the image. So this is still very open ended, but has many more guiderails than simply telling the machine “go brr and generate analysis”. You have already decided the chart, you are just asking for a nice description of the chart to fill in your already pre-made metrics.

I did this with ChatGPT, Google’s Gemini, and Claude to see how it does with the gun crime chart, the below Raleigh weekly chart:

And in some cases a more tame monthly time series chart:

Because these models are changing all the time, keep in mind that when you read this the newest models may do even better than what I show here (and see the end section on how you may prompt engineer this to produce better results as well). That said lets check out the results!

ChatGPT

For ChatGPT I used the GPT-4o model, I first just asked about the DC chart “Describe the patterns in this chart”. ChatGPT as it is known, is quite verbose:

I then asked ChatGPT to keep it to two to three sentences, and I think it did very well.

Ok, now to the Raleigh chart with error bars:

So this is OK, it spotted the recent increasing trend. “Frequently exceeded the average of the prior 8 weeks (gray shaded area)” is wrong, there are only 2! But I think the last sentence about notable recent spikes is good.

I then gave it the Durham chart that had the anomalies in early 2019/2020:

And I again think this is quite reasonable. I mean an analyst should probably say “these must be reporting idiosyncrasies”, but I don’t think this is not so bad a description as to be misleading.

All in all very happy with the results for ChatGPT here – these charts are not typical line and bar charts, and ChatGPT interpreted each quite well. At least the description is not so bad that if I did these directly in an automated report it would be embarrassing.

Google

For all of these examples I am using the free tools (that typically have limits that I run out by just these two queries). I did this on 5/22 for Google (which I think is Gemini-1.5, I am not 100% sure). So here is the DC gun crime seasonal chart for Google with the prompt Describe the trends in this chart:

This is very wrong in multiple places. It did not do any better with the Raleigh chart:

There is not strong seasonality here, and it includes some filler “important to note” that is not actually important to note. After this I gave it a more tame monthly crime counts of Robberies in Houston chart to see if Google could redeem itself:

And it flopped pretty hard again. Maybe most charts are increasing, so the model is biased to say increasing (I don’t know).

So again this is Google’s free version, and so the paid may be better (or recent updates may be better). But this isn’t even close to make me want to prompt engineer further.

Claude

I did the tests with Sonnet 3.5 (so around a month after the tests with Google/ChatGPT). I used the shorter 2/3 sentences prompt.

I like this description even slightly better than ChatGPT. How about the Raleigh MV thefts chart:

Similar to ChatGPT it is not quite right but in very subtle ways. It does catch the upward trend. It is wrong in terms of data ends before 2024, and the gray area indicating greater variability is technically true but I would not describe it as noticeable. So again not embarrassingly wrong (like Google), but not quite right either.

How about the anomaly Durham chart:

Very similar to ChatGPT, which I think is again OK.

Prompt Engineering Ideas

So the idea behind prompt engineering is you can ask “Describe this chart” or you can ask “Describe this chart in two to three sentences” and it changes the results (in any of these tools). Subsequently a big part of this is figuring out the prompts that give the most reasonable results. Prompts in these GenAI tools when submitting images have two parts, the image and the text. Do not take me as an expert by any means, but for other analysts here are my guesses as to how to prompt engineer this to maybe return better results.

For the text part, one thing you may do for auto reports is to give examples of the text you want. So for example, you could have a prompt that is “Please fill in the blanks: The chart shows _____ trends over time.”. So provide more guidance as to the structure of how you want to the response to look. Or you could do: “Here is example description 1: …., description 2: ….”. This is how RAG applications work, but with a static report you can give static exemplars, they don’t really need to be dynamically looked up from prior reports.

For the image part of the prompt, in an auto-application you may submit a different image than is actually shown in the report. So for example I may have the X axis for monthly crimes be labeled with the actual months (instead of numbers). Putting all the months in smaller font I bet the GenAI tools will still read it just fine, even if I don’t want it to look like that in the final report.

And I probably shouldn’t include logos (since they are immaterial and just cause extra info that may distract the description), and the text footers. I also think making my legends more descriptive may help guide the tools interpretation. I may remove the title text all together and place the relevant info in the prompt “Here is a chart of Robberies in Houston from …. to … Please describe the chart, including any long term trends, or anomalous spikes (high or low) in any month.” The text prompt may keep the tools on track a bit more with the specific details, but still allow them leeway to interpret the chart without being too rigid.

For the error bar chart, you could insert into the prompt explicit dates they were outside, e.g. “weeks a and b were high, make sure to mention that”. So you could have a mix of explicit anomaly detection, insert those anomalies into the prompt, to just keep the results on track.

It would still be a lot of work to automate a report with such plain text language, but I think it could be a quite reasonable iterative workflow. So you generate the report in a format you can edit, like Word, review it. And then in subsequent reports try to tweak the parameters a wee bit to produce better outputs.

2 Comments

by Andy Wheeler on July 16, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science

Tagged ChatGPT, genai

Posted by Andy Wheeler on July 16, 2024

https://andrewpwheeler.com/2024/07/16/using-genai-to-describe-charts-for-reports/

Some notes on self-publishing a tech book

So my book, Data Science for Crime Analysis with Python, is finally out for purchase on my Crime De-Coder website. Folks anywhere in the world can purchase a paperback or epub copy of the book. You can see this post on Crime De-Coder for a preview of the first two chapters, but I wanted to share some of my notes on self publishing. It was some work, but in retrospect it was worth it. Prior books I have been involved with (Wheeler 2017; Wheeler et al. 2021) plus my peer review experience I knew I did not need help copy-editing, so the notes are mostly about creating the physical book and logistics of selling it.

Academics may wish to go with a publisher for prestige reasons (I get it, I was once a professor as well). But it is quite nice once you have done the legwork to publish it yourself. You have control of pricing, and if you want to make money you can, or have it cheap/free for students.

Here I will detail some of the set up of compiling the book, and then the bit of work to distribute it.

Compiling the documents

So the way I compiled the book is via Quarto. I posted my config notes on how to get the book contents to look how I wanted on GitHub. Quarto is meant to run code at the same time (so works nicely for a learning to code book). But even if I just wanted a more typical science/tech book with text/images/equations, I would personally use Quarto since I am familiar with the set up at this point. (If you do not need to run dynamic code you could do it in Pandoc directly, not sure if there is a way to translate a Quarto yaml config to the equivalent Pandoc code it turns into.)

One thing that I think will interest many individuals – you write in plain text markdown. So my writing looks like:

# Chapter Heading

blah, blah blah

## Subheading

Cool stuff here ....

In a series of text files for each chapter of the book. And then I tell Quarto quarter render, and it turns my writing in those text files into both an Epub and a PDF (and other formats if you cared, such as word or html). You can set up the configuration for the book to be different for the different formats (for example I use different fonts in the PDF vs the epub, nice fonts in one look quite bad in the other). See the _quarto.yml file for the set up, in particular config options that are different for both PDF and Epub.

One thing is that ebooks are hard to format nicely – if I had a book I wanted to redo to be an epub, I would translate it to markdown. There are services online that will translate, they will do a bad job though with scientific texts with many figures (and surely will not help you choose nice fonts). So just learn markdown to translate. Folks who write in one format and save to the other (either Epub/HTML to PDF, or PDF to Epub/HTML) are doing it wrong and the translated format will look very bad. Most advice online is for people who have just books with just text, so science people with figures (and footnotes, citations, hyperlinks, equations, etc.) it is almost all bad advice.

So even for qualitative people, learning how to write in markdown to self-publish is a good skill to learn in my opinion.

Setting up the online store

For awhile I have been confused how SaaS companies offer payment plans. (Many websites just seem to copy from generic node templates.) Looking at the Stripe API it just seems over the top for me to script up all of my own solution to integrate Stripe directly. If I wanted to do a subscription I may need to figure that out, but it ended up being for my Hostinger website I can set up a sub-page that is WordPress (even though the entire website is not), and turn on WooCommerce for that sub-page.

WooCommerce ends up being easy, and you can set up the store to host web-assets to download on demand (so when you purchase it generates a unique URL that obfuscates where the digital asset is saved). No programming involved to set up my webstore, it was all just point and click to set things up one time and not that much work in the end.

I am not sure about setting up any DRM for the epub (so in reality people will purchase epub and share it illegally). I don’t know of a way to prevent this without using Amazon+Kindle to distribute the book. But the print book should be OK. (If there were a way for me to donate a single epub copy to all libraries in the US I would totally do that.)

I originally planned on having it on Amazon, but the low margins on both plus the formatting of their idiosyncratic kindle book format (as far as I can tell, I cannot really choose my fonts) made me decide against doing either the print or ebook on Amazon.

Print on Demand using LuLu

For print on demand, I use LuLu.com. They have a nice feature to integrate with WooCommerce, the only thing I wish shipping was dynamically calculated. (I need to make a flat shipping rate for different areas around the globe the way it is set up now, slightly annoying and will change the profit margins depending on area.)

LuLu is a few more dollars to print than Amazon, but it is worth it for my circumstance I believe. Now if I had a book I expected to get many “random Amazon search buys” I could see wanting it on Amazon. I expect more sales will be via personal advertising (like here on the blog, social media, or other crime analyst events). My Crime De-Coder site (and this blog) will likely be quite high in google searches for some of the keywords fairly quickly, so who knows, maybe just having on personal site is just as many sales.

LuLu does has an option to turn on distribution to other wholesalers (like Barnes & Noble and Amazon) – have not turned that on but maybe I will in the future.

LuLu has a pricing calculator to see how much to print on their website. Paperback and basically the cheapest color option for letter sized paper (which is quite large) is just over $17 for my 310 page book (Amazon was just over $15). For folks if you are less image heavy and more text, you could get away with a smaller size book (and maybe black/white) and I suspect will be much cheaper. LuLu’s printing of this book is higher quality compared to Amazon as well (better printing of the colors and nicer stock for the paperback cover).

Another nice thing about print on demand is I can go in and edit/update the book as I see fit. No need to worry about new versions. Not sure what that exactly means for citing the work (I could always go and change it), you can’t have a static version of record and an easy way to update at the same time.

Other Random Book Stuff

I purchased ISBNs on Bowker, something like 10 ISBNs for $200. (You want a unique ISBN for each type of the book, so you may want three in the end if you have epub/paperback/hardback.) Amazon and LuLu though have options to have them give you an ISBN though, so that may have not been necessary. I set the imprint to be my LLC though in Bowker, so CRIME De-Coder is the publisher.

You don’t technically need an ISBN at all, but it is a simple thing, and there may be ways for me to donate to libraries in the future. (If a University picks it up as a class text, I have been at places you need at least one copy for rent at the Uni library.)

I have not created an index – I may have a go at feeding my book through LLMs and seeing if I can auto-generate a nice index. (I just need a list of key words, after that can just go and find-replace the relevent text in the book to fill in so it auto-compiles an index.) I am not sure that is really necessary though for a how-to book, you should just look at the table of contents to see the individual (fairly small) sections. For epub you can just doing a direct text search, so not sure if people use an index at all in epubs.

Personal Goals

So I debated on releasing the book open source, I do want to try and see if I can make some money though. I don’t have this expectation, but there is potential to get some “data science” spillover, and if that is the case sales could in theory be quite high. (I was surprised in searching the “data science python” market on Amazon, it is definitely not saturated.) Personally I will consider at least 100 sales to be my floor for success. That is if I can sell at least 100 copies, I will consider writing more books. If I can’t sell 100 copies I have a hard time justifying the effort – it would just be too few of people buying the book to have the types of positive spillovers I want.

To make back money relative to the amount of work I put in, I would need more than 1000 sales (which I think is unrealistic). I think 500 sales is about best case, guesstimating the size of the crime analyst community that may be interested plus some additional sales for grad students. 1000 sales it would need to be in the multiple professors using it for a class book over several years. (Which if you are a professor and interested in this for a class let me know, I will give your class a discount.)

Another common way for individuals to make money off of books is not for sales, but to have training’s oriented around the book. I am hoping to do more of that for crime analysts directly in the future, but those opportunities I presume will be correlated with total sales.

I do enjoy writing, but I am busy, so cannot just say “I am going to drop 200 hours writing a book”. I would like to write additional python topics oriented towards crime analysts/criminology grad students like:

GIS analysis in python
Regression
Machine Learning & Optimization
Statistics for Crime Analysis
More advanced project management in python

Having figured out much of this grunt work definitely makes me more motivated, but ultimately in the end need to have a certain level of sales to justify the effort. So please if you like the blog pick up a copy and tell a friend you like my work!

References

Wheeler, A.P. (2024) Data Science for Crime Analysis with Python. CRIME De-Coder. ISBN 979-8-9903770-1-1
Wheeler, A.P. (2017) Geospatial analytics. In SPSS® Statistics for Data Analysis and Visualization. Eds. McCormick, K. & Salcedo J., with contributions by Peck J. & Wheeler, A.P. Wiley. ISBN 978-1119003557
Wheeler, A.P., Herrmann C.R., & Block R.L. (2021). Micro-place homicide patterns in Chicago. Springer. ISBN 978-3-030-61445-4

6 Comments

by Andy Wheeler on July 2, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Personal Productivity, Python, writing

Tagged data visualization, ebook

Posted by Andy Wheeler on July 2, 2024

https://andrewpwheeler.com/2024/07/02/some-notes-on-self-publishing-a-tech-book/

Notes on MMc queues

Recently had a project related to queues at work, so wanted to put some of my notes in a blog post. For a bit of up-front, the notation MMc refers to a queuing system with multiple servers (c), and the inputs are Poisson distributed (the first M), and have exponential service rates M (these Ms can be different though). That is a mouthful, but basically saying events that arrive in independently and have a left skewed distribution of times it takes to resolve those events. (That may seem like a lot of assumptions, they are often reasonable though for many systems, and if not deviations may not be that big of deal to the estimates in practice.)

Main reason for blog post is that the vast majority of stuff online is about MM1 queue systems, so systems that only have 1 server. I basically never deal with this situation. The formulas for multiple servers are much more complicated, so took me a bit to gather code examples and verify correctness. These are notes based on that work.

So for up-front, the group I was dealing with at work had a fundamental problem, their throughput was waaay too small. In this notation, we have:

Number of arrivals per time period, N
Mean time it takes to exit the queue, S
Number of servers, c

So first, you need to have N*S < c! This is simple accounting, so say we are talking about police calls for service, you have on average 5 calls per hour, and they take on average 0.5 hours (30 minutes) to handle. You then need more than 5*0.5 = 2.5 officers to handle this, so a minimum of 3 officers. If you don’t have 3 officers, the queue will grow, you won’t be able to handle all of the calls.

At work I was advising a situation where they were chronically too low of staff serving for a particular project, and it has ballooned over an extended period of time to create an unacceptable backlog. So think S is really tiny and N is very large – at first the too small of servers could cycle through the tickets, but the backlog just slowly grew, and then after months, they had unacceptable wait times. This is a total mess – there is no accounting trick to solve this, you need c > N*S. It makes no sense to talk about anything else like average wait time in the queue unless that condition is met.

OK, so we know you need c > N*S, a common rule of thumb is that capacity should not be over 80%, so that is c > (N*S)/0.8. (This is not for policing, but more common for call centers, see also posts on Erlang-C formulas.) The idea behind 80% it is at the point where wait times (being held in the queue) start to grow.

If you want to get more into the nitty gritty though, such as calculating the actual probability in the queue, average wait time, etc., then you will want to dig into the MMc queue lit. Here I have posted some python notes (that is itself derivative work others have posted). Hoping just posting and giving my thumbs up makes it easier for others.

So first here is an example of using those functions to estimate our queue example above. Note you need to give the inverse of the mean service time for this function.

# queuing functions in python
from queue import MMc, nc

N = 6    # 6 calls per hour
S = 0.5  # calls take 30 minutes to resolve
c = 7    # officers taking calls

# This function expects inverse service average
qS = MMc(N,1/S,c)

# Now can get stats of interest

# This is the probability that when a call comes
# in, it needs to wait for an officer
qS.getQueueProb()

And this prints out 0.0376.... So when a call comes in, we have a 3% probability of having to wait in the queue for an officer to respond. How about how long on average a call will wait in the queue?

# This is how long a call on average needs
# to wait in the queue in minutes
qS.getAvgQueueTime()*60

And this gives 0.28.... The multiplication by 60 goes from hours to minutes, and we are waiting less than 1 minute on average. This seems good, but somewhat counter-intuitively, this is an average of a bunch of calls answered immediately, plus the 3.8% of calls that are held for some time. We can calculate the estimate of if a call is held, how long will it be held on average:

# If a call is queued however, how long to wait?
qS.getAvgQueueTime_Given()*60

And this is a less rosy 17.5 minutes! Queues are tricky. Unless you have a lot of extra capacity, there are going to be wait times. We can also calculate how often all officers will be idle in this setup.

# Idle time, no one taking any calls
qS.getIdleProb()

And this gives rounded 0.05, so we have only 5% idle time in this set up. This is not that helpful though for police planning, you want individual officers to have capacity to do proactive work, that is more you want officers to only spend 40-60% on responding to calls for service, so that suggests c > (N*S)/0.5 is where you want to be. Which is where we are at in this scenario with 7 officers. This is the probability all 7 officers will be idle at once, which does not really matter.

Now you can technically just run this through multiple values of c to get this, but Rosetti (2021) has listed an approximate square root staffing formula that given an input probability wait in the queue, how many servers do you need. So here is that function:

# If you want probability of holding in the queue to only be 3%
est_serv = nc(N,S,0.03)
print(est_serv)

Which prints out 6.387..., so since you need to take the ceiling of this, you will need to 7 officers to keep to that probability (agreeing with the MMc object above).

In terms of values, the nc function will work with very large/small N and S inputs just fine. The MMc function also looks fine, minus one submethod uses a factorial, .getPk (so cannot have very large inputs to that method), but the rest is OK. So if you wanted to do nc(very_big,very_small,0.1) that is fine and should be no floating point issues.

The nc function relies on scipy, but the MMc class is all base python (just the math library). So the MMc functions can really be embedded in any particular python application you want with no real problem.

Rough Estimates for Spatial Police Planning

I have prior work on spatial allocation of patrol units with workload equality constraints (Wheeler, 2018). But generally, you need to first estimate how many units you will have, and after that you can worry about optimally distributing them. The reason for this is that the number of units is much more important, too few and you will have more queuing, in which case the spatial arrangement does not matter at all. Larson & Stevenson (1972) estimate optimal spatial allocation only beats random allocation by 25%.

So for police response times you can think about time waiting in queue, time spent driving to the event, and time spent resolving the event (time to dispatch tends to be quite trivial, but is sometimes included in the wait in the queue part, Verlaan & Ruiter, 2023).

There is somewhat of a relationship between the above “service” time, fewer people have to drive farther, and so service time goes up. But there happens to some simple rules of thumb, if you have N patrol units, you can calculate (2/3)*sqrt(Square Miles)/sqrt(N) = average distance traveled in miles for your jurisdiction (Stenzel, 1993, see page 135 in the PDF). Then you can translate that miles driven to time, by say taking an average of 45 miles per hour. Given a fixed N, you can then just add this into the service time estimate for your given jurisdiction to get a rough estimate of more officers will reduce response times by X amount.

It ends up being though this tends to be trivial relation to the waiting in the queue time (or the typical it takes 30 minutes to resolve police incidents on average). So it is often more important to get rough estimates for that if you want to reduce wait times for calls for service. And this does not even take into account priority levels in calls, but to start simpler folks should figure out a minimum to handle the call stack (whether in policing or in other areas) and then go onto more complicated scenarios.

References

Larson, R.C., & Stevenson, K.A. (1972). On insensitivities in urban redistricting and facility location. Operations Research, 20(3), 595-612.
Rossetti, M.D. (2021). Simulation Modeling and Arena, 3rd and Open Text Edition.
Stenzel, W. (1993). Police Personnel Allocation Manual Users Guide. US Dept of Transportation National Highway Traffic Safety.
Verlaan, T., & Ruiter, S. (2023). Predictors of police response time: a scoping review. Crime Science, 12(1), 19.
Wheeler, A. P. (2018). Creating optimal patrol areas using the p-median model. Policing: An International Journal, 42(3), 318-333.

Leave a comment

by Andy Wheeler on May 29, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Python

Tagged queue

Posted by Andy Wheeler on May 29, 2024

https://andrewpwheeler.com/2024/05/29/notes-on-mmc-queues/

Power for analyzing Likert items

First for some other updates of interest to folks on the blog. On CRIME De-Coder a blog post, You should be geocoding crime data locally. I give python code to create a local geocoding engine using the arcpy library.

This is a bit more techy, so would typically post this on this blog instead of the CRIME De-Coder one. But, currently the web is sorely lacking in good advice for local geocoding solutions. Vast majority of sites discuss online geocoding APIs (e.g. google or the census geocoder), which I guess are common for web-apps, but they do not make sense for crime analysis. For the few webpages that are actually relevant to describe local solutions (that do not involve calling an online web API), all the exmples use PostGIS that I am aware of. PostGIS is both very difficult to setup and has worse results compared to ESRI. So I know ESRI is a paid for service, but they have reasonable academic and small business pricing (and most police departments already have access), so to me this is a reasonable use case. If you need to geocode 100k cases, the license fee for ESRI is worth it at that point relative to using the web engines.

Definitely do not spend thousands of dollars if you need to batch geocode a few million records. That is something that is worth getting in touch with me about. And so hopefully that gets picked up by search engines and drives a bit more traffic to my consulting website.

A second example I posted some python code to help construct network experiments. So the idea here is you want to spread out the treated nodes so you have a specific allocation of treated, connected to treated (what I call spillover here), and those not connected to treated (the leftover control group). This python code constructs linear programs to accomplish certain treated/not-touched proportions. So this graph shows if you choose to treat 1 person, but have constraints on 1,2,3 leftover.

And then you can apply this to bigger networks, here the network is 311 nodes, and 90 are treated and I want a total of 150 not treated.

Idea derivative from some work Bruce Desmarais discussed on Twitter, but also have thought about this in some discussion with Barak Ariel in focused deterrence style interventions. So hopefully that comes in handy.

My linear programming formulation is not as svelte as the optimal treatment assignment with spillovers formulation, it is 3*N + 2*E decision variables and 5*N + 2*E constraints (where N is the number of nodes and E is the number of un-directed edges). I have a feeling my formulation is redundant, so if I write my constraints smarter can be more like 2N decision variables and 2N + E constraints.

But for my examples I show it solves quite fast as is (and maybe solvers get rid of the cruft in pre-solve), so not worried about that at the moment. Don’t know the typical size networks people use, but I suspect it will work just fine and dandy on typical machines for networks even as large as 10k. (Generally if I keep the model to under 100k decision variables it is only a few minutes to solve the types of problems I show on this blog.)

Power with Likert items

The other day for a grant application we needed to conduct power analysis. Our design was t-test of mean differences for a simple treated/control group randomized experiment with the outcome being a Likert score survey response. (I know, most of the time people create latent scores with Likert items, analyzing the individual items is fine IMO and simpler to specify for a pre-registration analysis.) I am sure others have needed to do similar things, but I could not find code online to help out with this. So I scripted up a simulation in R to do this, and figured sharing would be useful.

So the rub with Likert data, and why you can’t use typical power calculations, is that they have ceiling effects. If most people answer on average 4.5 out of the your scale up to 5, it is difficult to go much higher. Here I simulate data that has that skew (so ceiling effects come into play), and then go through the motions of doing the t-test. So first for some setup, I have a function that rounds and clips data to limited sets of integers, plus some plotting functions.

# power analysis of Likert data
library(ggplot2)

# custom theme
theme_cdc <- function(){
  theme_bw() %+replace% theme(
     text = element_text(size = 16),
     panel.grid.major= element_line(linetype = "longdash"),
     panel.grid.minor= element_blank()
) }

set.seed(10) # setting the random seed

# rounding/clipping data to integers 
clipint <- function(x,min=1,max=5){
    rx <- round(x)
    rx <- ifelse(rx < min,min,rx)
    rx <- ifelse(rx > max,max,rx)
    return(rx)
}

This following function generates the p-values and standard errors, what I will use later in my simulation. Here I use a t-test of mean differences, but it would be fairly easy to say swap out with an ordinal logistic regression if you prefer that. Probably the bigger deal is the simulation generates data using a normal distribution, and then post rounds/clip the data. There is probably a smarter way to generate the data according to the logistic model and ordinal intercepts (Frank Harrell discusses such things a bit on his blog). But this at least takes into account that the data will be skewed, even in the control group, to have more positive outcomes and thus take the ceiling into account.

# this uses OLS to do t-test of mean differences
# generates normal data, but then rounds/clips
sim_ols <- function(n,eff=0.5,int=4,sd=1){
    df <- data.frame(1:n)
    df$treat <- sample(c(0,1),n,replace=TRUE)
    df$latent <- int + eff*df$treat + rnorm(n,sd=sd)
    df$likert <- clipint(df$latent)
    m1 <- lm(likert ~ treat,data=df)
    cd <- coef(summary(m1))
    pval <- cd[2,4]
    se <- cd[2,2]
    return(c(pval,se))
}

Now we can move onto the simulations, this evaluates sample sizes from 100 to 2000 (in increments of 50), effect sizes of 0.1, 0.3, and 0.5, and repeats the simulations 10k times. I then see the power (how often the two-tailed p-value is less than 0.05), as well as the standard error (precision) of the estimates. Effect sizes are in terms of the original Likert scale values, what I take to be much easier to reason about. (I have seen power analyses here use Cohen’s D, which you really can’t get a very large Cohen’s D value due to ceiling effects with this data.)

# running power estimates over different
# sample sizes and effect sizes
samp_sizes <- seq(100,2000,50)
eff_sizes <- c(0.1,0.3,0.5)
rep_size <- 10000
df <- expand.grid(samps_sizes=samp_sizes,eff_sizes=eff_sizes,pow=c(NA),se=c(NA))
for (i in 1:nrow(df)){
    ss <- df[i,1]
    es <- df[i,2]
    stats <- replicate(rep_size,sim_ols(n=ss,eff=es))
    smean <- rowMeans(stats)
    df[i,3] <- mean(stats[1,] < 0.05) # alpha 0.05
    df[i,4] <- smean[2]
}

df$eff_sizes <- as.factor(df$eff_sizes)

The graph of the power shows what you would expect, so with a few hundred samples you can determine an effect size of 0.5, but with a smaller effect size (on the order of 0.1) you will need more than 2k samples.

# Graph of power
powg <- ggplot(data=df,aes(x=samps_sizes,y=pow)) + 
        geom_line(aes(color=eff_sizes)) + 
        geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
        labs(x='Sample Sizes',y=NULL,title='Power Estimates') +
        scale_y_continuous(breaks=seq(0,1,0.1)) + 
        scale_x_continuous(breaks=seq(100,2000,250)) + 
        scale_color_discrete(name="Effect Sizes") +
        scale_fill_discrete(name="Effect Sizes") + 
        theme_cdc()

Unfortunately, in reality with most survey measures of police data, e.g. rate your officer 1 to 5, a 0.5 effect is a really large increase. In my mapping attitudes paper, some demographics shift global attitudes at max by 0.2, and I doubt most interventions will be that dramatic. So I like plotting the precision of the estimator, which the effect size doesn’t really make a dent here (it could with more severe ceiling effects).

# Graph of Standard Errors (for Precision)
precg <- ggplot(data=df,aes(x=samps_sizes,y=se,color=eff_sizes)) + 
         geom_line(aes(color=eff_sizes)) + 
         geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
         labs(x='Sample Sizes',y=NULL,title='Precision Estimates') +
         scale_x_continuous(breaks=seq(100,2000,250)) + 
         scale_color_discrete(name="Effect Sizes") +
         scale_fill_discrete(name="Effect Sizes") + 
         theme_cdc()

With field experiments when considering post police contacts (general attitude surveys you have more wiggle room, but still they cost money to survey) you really can’t control the sample size. You are at the whims of whatever events happen in the police departments daily duties. So the best you can do is make approximate plans for “how long am I going to collect measures before I analyze the data”, and “how reasonably precise will my estimates be”.

This particular grant I make arguments we care more about a non-inferiority type test (e.g. can be sure attitudes are not worse, within a particular error level, given our treatment is more cost-effective than business as usual). But if we did an intervention specifically intended to improve attitudes, you may be talking more like 5,000+ contacts to detect an effect of 0.1 (given likely sample non-response), which is still a big effect.

You would gain power by doing a scale (e.g. summing together multiple items or conducting a factor analysis), assuming the intervention effects the underlying latent trait, and piecemeal individual items. But that will have to be for another day simulating data to do that end-to-end analysis.

Despite not being an academic anymore, I have in the hopper more grant collaborations than I did when I was a professor. The NIJ season is winding down, so probably too late to collaborate on any more of those. But if you have other ideas and need quant help with your projects, always feel free to reach out. I enjoy doing these side projects with academics when reasonable funding is available.

Leave a comment

by Andy Wheeler on May 2, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, Data Visualization, Python, R

Tagged power-analysis, survey

Posted by Andy Wheeler on May 2, 2024

https://andrewpwheeler.com/2024/05/02/power-for-analyzing-likert-items/

Harmweighted hotspots, using ESRI python API, and Crime De-Coder Updates

Haven’t gotten the time to publish a blog post in a few. There has been a ton of stuff I have put out on my Crime De-Coder website recently. For some samples since I last mentioned here, have published four blog posts:

on what AI regulation in policing would look like
high level advice on creating dashboards
overview of early warning systems for police
types of surveys for police departments

For surveys a few different groups have reached out to me in regards to the NIJ measuring attitudes solicitation (which is essentially a follow up of the competition Gio and myself won). So get in touch if interested (whether a PD or a research group), may try to coordinate everyone to have one submission instead of several competing ones.

To keep up with everything, my suggestion is to sign up for the RSS feed on the site. If you want an email use the if this than that service. (I may have to stop doing my AltAc newsletter emails, it is so painful to send 200 emails and I really don’t care to sign up for another paid for service to do that.)

I also have continued the AltAc newsletter. Getting started with LLMs, using secrets, advice on HTML, all sorts of little pieces of advice every other week.

I have created a new page for presentations. Including, my recent presentation at the Carolina Crime Analysis Association Conference. (Pic courtesy of Joel Caplan who was repping his Simsi product – thank you Joel!)

If other regional IACA groups are interested in a speaker always feel free to reach out.

And finally a new demo on creating a static report using quarto/python. It is a word template I created (I like often generating word documents that are easier to post-hoc edit, it is ok to automate 90% and still need a few more tweaks.)

Harmweighted Hotspots

If you like this blog, also check out Iain Agar’s posts, GIS/SQL/crime analysis – the good stuff. Here I wanted to make a quick note about his post on weighting Crime Harm spots.

So the idea is that when mapping harm spots, you could have two different areas with same high harm, but say one location had 1 murder and one had 100 thefts. So if murder harm weight = 100 and theft harm weight = 1, they would be equal in weight. Iain talks about different transformations of harm, but another way to think about it is in terms of variance. So here assuming Poisson variance (although in practice that is not necessary, you could estimate the variance given enough historical time series data), you would have for your two hotspots:

Hotspot1: mean 1 homicide, variance 1
Hotspot2: mean 100 thefts, variance 100

Weight of 100 for homicides, 1 for theft

Hotspot1: Harmweight = 1*100 = 100
          Variance = 100^2*1 = 10,000
          SD = sqrt(10,000) = 100

Hotspot2: Harmweight = 100*1 = 100
          Variance = 1^2*100 = 100
          SD = sqrt(100) = 10

When you multiply by a constant, which is what you are doing when multiplying by harm weights, the relationship with variance is Var(const*x) = const^2*Var(x). The harm weights add variance, so you may simple add a penalty term, or rank by something like Harmweight - 2*SD (so the lower end of the harm CI). So in this example, the low end of the CI for Hotspot 1 is 0, but the low end of the CI for Hotspot2 is 80. So you would rank Hotspot2 higher, even though they are the same point estimate of harm.

The rank by low CI is a trick I learned from Evan Miller’s blog.

You could fancy this up more with estimating actual models, having multiple harm counts, etc. But this is a quick way to do it in a spreadsheet with just simple counts (assuming Poisson variance). Which I think is often quite reasonable in practice.

Using ESRI Python API

So I knew you could use python in ESRI, they have a notebook interface now. What I did not realize is now with Pro you can simply do pip install arcgis, and then just interact with your org. So for a quick example:

from arcgis.gis import GIS

# Your ESRI url
gis = GIS("https://modelpd.maps.arcgis.com/", username="user_email", password="???yourpassword???")
# For batch geocoding, probably need to do GIS(api_key=<your api key>)

This can be in whatever environment you want, so you don’t even need ArcGIS installed on the system to use this. It is all web-api’s with Pro. To geocode for example, you would then do:

from arcgis.geocoding import geocode, Geocoder, get_geocoders, batch_geocode

# Can search to see if any nice soul has published a geocoding server

arcgis_online = GIS()
items = arcgis_online.content.search('geocoder north carolina', 'geocoding service', max_items=30)

# And we have four
#[<Item title:"North Carolina Address Locator" type:Geocoding Layer owner:ecw31_dukeuniv>,
# <Item title:"Southeast North Carolina Geocoding Service" type:Geocoding Layer owner:RaleighGIS>, 
# <Item title:"Geocoding Service - AddressNC " type:Geocoding Layer owner:nconemap>, 
# <Item title:"ArcGIS World Geocoding Service - NC Extent" type:Geocoding Layer owner:NCDOT.GOV>]

geoNC = Geocoder.fromitem(items[0]) # lets try Duke
#geoNC = Geocoder.fromitem(items[-1]) # NCDOT.GOV
# can also do directly from URL
# via items[0].url
# url = 'https://utility.arcgis.com/usrsvcs/servers/8caecdf6384144cbafc9d56944af1ccf/rest/services/World/GeocodeServer'
# geoNC = Geocoder(url,gis)

# DPAC
res = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1)
print(res[0])

Note you cannot cache the geocoding results. To do that, you need to use credits and probably sign in via a token and not a username password.

# To cache, need a token
r2 = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1,for_storage=True)

# If you have multiple addresses, use batch_geocode, again need a token
#dc_res = batch_geocode(FullAddressList, geocoder=geoNC)

Geocoding to this day is still such a pain. I will need to figure out if you can make a local geocoding engine with ESRI and then call that through Pro (I mean I know you can, but not sure pricing for all that).

Overall being able to work directly in python makes my life so much easier, will need to dig more into making some standard dashboards and ETL processes using ESRI’s tools.

I have another post that has been half finished about using the ESRI web APIs, hopefully will have time to put that together before another 6 months passes me by!

Leave a comment

by Andy Wheeler on March 3, 2024 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, data science, geocoding, Python

Tagged ESRI, speaking

Posted by Andy Wheeler on March 3, 2024

https://andrewpwheeler.com/2024/03/03/harmweighted-hotspots-using-esri-python-api-and-crime-de-coder-updates/

Recoding America review, Data Science CV Update, Sworn Dashboard

Over this Christmas break I read Jennifer Pahlka’s Recoding America. It is a great book and I really recommend it.

My experience working in criminal justice is a bit different than Pahlka’s examples, but even if you are just interested in private sector product/project management this is a great book. It has various user experience gems as well (such as forms that eliminate people, put the eliminating questions in order by how many people they filter).

Pahlka really digs on waterfall, I have critiqued agile on the blog in the past, but we are both just using generic words to describe bad behavior. I feel like a kindred spirit with Pahlka based on some of her anecdotes; concrete boats, ridiculous form questions, PDF inputs that only work on ancient web-browsers, mainframes are not the problem stupid requirements are, hiring too many people makes things worse, people hanging up on them in phone calls when you tell the truth – so many good examples.

To be specific with agile/waterfall, Pahlka is very critical of fixed requirements coming down on high from policy makers. When you don’t have strong communication at the requirements gathering stage between techies, users and owners making the requests (which can happen in private sector too), you can get some comical inefficiencies.

A good example for my CJ followers are policies to do auto-clearance of records in California. So the policy makers made a policy that said those with felony convictions for stealing less than $1,000 can be expunged, but there is no automated way to do this, since the criminal records do not save the specific dollar amount in the larceny charge. (And to do the manual process is very difficult, so pretty much no one bothers.) It probably would make more sense to say something like “a single felony larceny charge that is 5 years old will be auto-cleared”, that is not exactly the same but is similar in spirit to what the legislature wants, and can be easily automated based on criminal records that are already collected by the state. A real effective solution would look like data people working with policy makers directly and giving scenarios “if we set the criteria to X, it will result in Y clearances”. These are close to trivial things to ask a database person to comment on, there is no fundamental reason why policy/techs can’t go back in forth and craft policy that makes sense and is simpler to implement.

To be more generic, what can happen is someone requests X, X is really hard/impossible, but you can suggest a,b,c instead that is easier to accomplish and probably meets the same high level goals. There is asymmetry in what people ask for and understanding of the work it takes to accomplish those requests, an important part of your job as a programmer/analyst is to give feedback to those asking to make the requests better. It takes understanding from the techies of the business requirements (Pahlka suggests govt should hire more product owners, IMO would rather just have senior developer roles do that stuff directly). And it takes people asking to be open to potential changes. Which most people are in my experience, just sometimes you get people who hang up in phone calls when you don’t tell them what they want to hear.

I actually like the longer term plan out a few months waterfall approach (I find that easier to manage junior developers, I think the agile shorter term stuff is too overbearing at times). But it requires good planning and communication between end users and developers no matter whether you say you are doing waterfall or agile. My experience in policing is not much like the policy people giving stone tablets, I have always had more flexibility to give suggestions in my roles. But I do think many junior crime analysts need to learn to say “you asked for percent change, here is a different stat instead that is better for what you want”.

What I am trying to do with CRIME De-Coder is really consistent with Pahlka’s goals with Code for America. I think it is really important for CJ agencies to take on more human investment in tech. Part of the reason I started CRIME De-Coder was anger – I get angry when I see cities pay software vendors six digits for crappy software that a good crime analyst could do. Or pay a consulting firm 6 figures for some mediocre (and often inappropriate) statistical analysis. Cities can do so much better by internally developing skills to take on many software projects, which are not moving mountains, and often outside software causes more problems than they solve.

At work we are starting to hire a new round of data scientists (no links to share, they are offshore in India, and first round is through a different service). The resume over-stating technical expertise for data scientists is at lol levels at this point. Amazing how everyone is an LLM, deep learning, and big data expert these days.

I’ve written before how I am at a loss on how to interview data scientists. The resumes I am getting are also pretty much worthless at this point. One problem I am seeing in these resumes is that people work on teams, so people can legitimately claim “I worked on this LLM”, but when you dig in and ask about specifics you find out they only contributed this tiny thing (which is normal/OK). But the resumes look like they are Jedi masters in advanced machine learning.

I went and updated my data science resume in response to reading others. (I should probably put that in HTML, so it shows up in google search results.) I don’t really have advice for folks “what should your resume look like” – I have no clue how recruiters view these things. No doubt my resume is not immune to a recruiter saying “you have 10+ years with python, but I don’t see any Jira experience, so I don’t think you are qualified”.

What I have done is only include stuff in the resume where I can link to specific, public examples (peer reviewed work, blog posts, web pages, github). I doubt recruiters are going to click on a single link in the resume (let alone all 40+), but that is what I personally would prefer when I am reviewing a resume. Real tangible stuff so someone can see I actually know how to write code.

So for example in the most recent update of the resume, I took Unix, Kubernetes/Docker, Azure, and Databricks off. Those are all tech I have worked with at HMS/Gainwell, but do not have any public footprint to really show off. I have some stuff on Docker on the blog, but nothing real whiz bang to brag about. And I have written some about my deployment strategy for python code in Databricks using github actions. (I do like Azure DevOps pipelines, very similar to building github actions, which are nice for many of the batch script processes I do. My favorite deployment pattern at work is using conda + persistent Fedora VMs. Handling servers/kubernetes everything docker is a total pain.) “Expertise” in those tools is probably too strong, I think claiming basic competence is reasonable though. (Databricks has changed so much in the two years we have been using it at work I’m not sure anyone outside of Databricks themselves could claim expertise – only if you are a very fast learner!)

But there is no real fundamental way for an outsider to know I have any level of competence/expertise in these tech tools. Honestly they do not matter – if you want me to use google cloud or AWS for something equivalent to Azure DevOps, or Snowflake instead of Databricks, it doesn’t really matter. You just learn the local stack in a month or two. Some rare things you do need very specialized tech skills, say if someone wanted me to optimize latency in serving pytorch LLMs, that will be tough given my background. Good luck posting that position on LinkedIn!

But the other things I list, I can at least pull up a web page to say “here is code I wrote to do this specific thing”. Proof is in the pudding. Literally 0 of the resumes I am reviewing currently have outside links to any code, so it could all be made up (and clearly for many people is embellished). I am sure people think mine is embellished as well, best I can do to respond to that is share public links.

For updates on CRIME De-Coder:

Sworn Dashboard, trends in officer hiring using D3.js to make charts and supabase as a backend
How to evaluate predictive policing software blog post
Crime De-Coder Storefront, currently can purchase (in and outside US) my GUI tool to download streetview images, as well as pre-release of python book

I researched ways to do payments for so long, in the end just turning on WooPayments in wordpress (and using an iframe) was a super simple solution (and it works fine for digital downloads and international payments). Will need to figure out webhooks with Stripe to do more complicated stuff eventually (like SaaS services, licenses, recurring payments), but for now this set up works for what I need.

I will start up newsletters again next week.

2 Comments

by Andy Wheeler on January 2, 2024 • Permalink

Posted in Criminal Justice, data science, scholarly

Tagged book-review, resume

Posted by Andy Wheeler on January 2, 2024

https://andrewpwheeler.com/2024/01/02/recoding-america-review-data-science-cv-update-sworn-dashboard/

Year in Review 2023: How did CRIME De-Coder do?

In 2023, I published 45 pages on the blog. Cumulative site views were slightly more than last year, a few over 150,000.

I would have had pretty much steady cumulative views from last year (site views took a dip in April, the prior year had quite a bit of growth, I suspect something to do with the way WordPress counts stats changed), but in December my post Forecasts need to have error bars hit front page on Hackernews. This generated about 12k views for that post over two days. (In 2022 I had just shy of 140k views in total.)

It was very high on the front page (#1) for most of that day. So for folks who want to guesstimate the “death by Hackernews” referrals, I would guess if your site/app can handle 10k requests in an hour you will be ok. WordPress by default this is fine (my Crime De-Coder Hostinger site is maybe not so good for that, the SLA is 20k requests per day). Also interesting note, about 10% of people who were referred to the forecast post clicked at least one other page on the site.

So I started CRIME De-Coder in February this year. I have published a few over 30 pages on that site during the year, and have accumulated a total of a few more than 11k site views. This is very similar to the first year of my personal blog, with publishing around 30 posts and getting just over 7k total views for the year. This is almost entirely via direct referrals (I share posts on LinkedIn, google searches are just a trickle).

Sometimes people are like “cool you started your own company”, but really I did that same type of consulting since I was in grad school. I have had a fairly consistent set of consulting work (around $20k per year) for quite awhile. That was people cold asking me for help with mostly statistical analysis.

The reason I started CRIME De-Coder was to be more intentional about it – advertise the work I do, instead of waiting for people to come to me. Doing your own LLC is simple, and it is more a website than anything.

So how much money did I make this year for CRIME De-Coder? Not that much more than $30k (I don’t count the data competitions I won in that metric, but actual commissioned work.) I do have substantially more work lined up for next year though already (more on the order of $50k so far, although no doubt some of that will fall through).

I sent out something like 30 some soft pitches during the year to people in my extended network (first or strong second degree). I don’t know the typical rate of something like that, but mine was abysmal – I was lucky to get an email response no thanks. These are just ideas like “hey I could build you an interactive dashboard with your data” or “you paid this group $150k, I would do that same thing for less than $30k”.

Having CRIME De-Coder did however did increase my first degree network to “ask me for stat analysis” more. So it was definitely worth spending time doing the website and creating the LLC. Don’t ask me for advice though about making pitches for consulting work!

The goal is ultimately to be able to go solo, and just do my consulting work as my full time job. It is hard to see that happening though – even if I had 5 times the amount of work lined up, it would still just be short term single projects. I have pitched more consistent retainers, but no one has gone for that. Small police departments if interested in outsourcing crime analysis let me know – that is I believe the best solution for them. Also have pitched to think tanks to hire me part time as well, as well as CJ programs to hire me in part time roles as well. I understand the CJ programs no interest, I am way more expensive than typical adjunct, I am a good deal for other groups though. (I mean I am good deal for CJ programs as well, part of the value add is supervising students for research, but universities don’t value that very high.)

I will ultimately keep at it – sending email pitches is easy. And I am hoping that as the website gets more organic search referrals, I will be able to break out of my first degree network.

Leave a comment

by Andy Wheeler on December 26, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, Meta, website, writing

Tagged year-in-review

Posted by Andy Wheeler on December 26, 2023

https://andrewpwheeler.com/2023/12/26/year-in-review-2023-how-did-crime-de-coder-do/

Won NIJ competition on surveys

The submission Gio and myself put together, Using Every Door Direct Mail Web Push Surveys and Multi-level modelling with Post Stratification to estimate Perceptions of Police at Small Geographies, has won the NIJ Innovations in Measuring Community Attitudes Survey challenge.

Specifically we took 1st in the non-probability section of the competition. The paper has the details, but using every door direct mail + post-stratifying the estimates is the approach we advocate. If you are a city or research group interested in implementing this and need help, feel free to get in touch.

Of course if you want to do this yourself go for it (part of the reason it won was because the method should be doable for many agencies in house), but letting me and Gio know we were the inspiration is appreciated!

Second, for recruiting for criminology PhDs, CRIME De-Coder has teamed up with the open access CrimRXiv consortium

This example shows professor adverts, but I think the best value add for this is for more advanced local govt positions. Anymore many of those civil service gigs are very competitive with lagging professor salaries.

For people hiring advanced roles, there are two opportunities. One is advertising – so for about the same amount as advertising on LinkedIn, you can publish a job advert. This is much more targeted than LinkedIn, so if you want PhD talent this is a good deal to get your job posting on the right eyeballs.

The second service is recruiting for a position. This is commission based – if I place a candidate for the role then you pay the recruiter (me and CrimRXiv) a commission. For that I personally reach out to my network of people with PhDs interested in positions, and do the first round of vetting for your role.

Third, over on Crime De-Coder I have another round of the newsletter up, advice this round is that many smaller cities have good up and coming tech markets, plus advice about making fonts larger in python/R plots. (Note in response to that post, Greg Ridgeway says it is better to save as vector graphics as oppossed to high res PNG. Vector is slightly more work to check everything is kosher in the final produced plot, but that is good advice from Greg. I am lazy with the PNG advice.)

No more newsletters this year, but let me know if you want to sign up and I will add you to the list.

Last little tidbit, in the past I have used the pdftk tool to combine multiple PDFs together. This is useful when using other tools to create documents, so you have multiple outputs in the end (like a cover page or tech appendix), and want to combine those all together into a single PDF to share. But one thing I noticed recently, if your PDF has internal table of content (TOC) links (as is the case for LaTeX, or in my case a document built using Quarto), using pdftk will make the TOC links go away. You can however use ghostscript instead, and the links still work as normal.

On my windows machine, it looks something like:

gswin64 -q -sDEVICE=pdfwrite -o MergedDoc.pdf CoverPage.pdf Main.pdf Appendix.pdf

So a few differences that if you just google. Installing the 64 bit version on my windows machine, the executable is gswin64, not gs from the command line. Second, I needed to manually add C:\Program Files\gs\gs10.02.1\bin to my PATH for this to work at the command prompt the way you would expect, installing did not do that directly.

Quarto is awesome by the way – definitely suggest people go check that out.

Leave a comment

by Andy Wheeler on December 12, 2023 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, data science

Tagged post-stratification, survey

Posted by Andy Wheeler on December 12, 2023

https://andrewpwheeler.com/2023/12/12/won-nij-competition-on-surveys/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 400 other subscribers
aoristic big-data cartography census choropleth citeulike color consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category Criminal Justice

References

ChatGPT

Google

Claude

Prompt Engineering Ideas

Compiling the documents

Setting up the online store

Print on Demand using LuLu

Other Random Book Stuff

Personal Goals

References

Rough Estimates for Spatial Police Planning

References

Power with Likert items

Harmweighted Hotspots

Using ESRI Python API

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange