All posts in category data science

Harmweighted hotspots, using ESRI python API, and Crime De-Coder Updates

Haven’t gotten the time to publish a blog post in a few. There has been a ton of stuff I have put out on my Crime De-Coder website recently. For some samples since I last mentioned here, have published four blog posts:

on what AI regulation in policing would look like
high level advice on creating dashboards
overview of early warning systems for police
types of surveys for police departments

For surveys a few different groups have reached out to me in regards to the NIJ measuring attitudes solicitation (which is essentially a follow up of the competition Gio and myself won). So get in touch if interested (whether a PD or a research group), may try to coordinate everyone to have one submission instead of several competing ones.

To keep up with everything, my suggestion is to sign up for the RSS feed on the site. If you want an email use the if this than that service. (I may have to stop doing my AltAc newsletter emails, it is so painful to send 200 emails and I really don’t care to sign up for another paid for service to do that.)

I also have continued the AltAc newsletter. Getting started with LLMs, using secrets, advice on HTML, all sorts of little pieces of advice every other week.

I have created a new page for presentations. Including, my recent presentation at the Carolina Crime Analysis Association Conference. (Pic courtesy of Joel Caplan who was repping his Simsi product – thank you Joel!)

If other regional IACA groups are interested in a speaker always feel free to reach out.

And finally a new demo on creating a static report using quarto/python. It is a word template I created (I like often generating word documents that are easier to post-hoc edit, it is ok to automate 90% and still need a few more tweaks.)

Harmweighted Hotspots

If you like this blog, also check out Iain Agar’s posts, GIS/SQL/crime analysis – the good stuff. Here I wanted to make a quick note about his post on weighting Crime Harm spots.

So the idea is that when mapping harm spots, you could have two different areas with same high harm, but say one location had 1 murder and one had 100 thefts. So if murder harm weight = 100 and theft harm weight = 1, they would be equal in weight. Iain talks about different transformations of harm, but another way to think about it is in terms of variance. So here assuming Poisson variance (although in practice that is not necessary, you could estimate the variance given enough historical time series data), you would have for your two hotspots:

Hotspot1: mean 1 homicide, variance 1
Hotspot2: mean 100 thefts, variance 100

Weight of 100 for homicides, 1 for theft

Hotspot1: Harmweight = 1*100 = 100
          Variance = 100^2*1 = 10,000
          SD = sqrt(10,000) = 100

Hotspot2: Harmweight = 100*1 = 100
          Variance = 1^2*100 = 100
          SD = sqrt(100) = 10

When you multiply by a constant, which is what you are doing when multiplying by harm weights, the relationship with variance is Var(const*x) = const^2*Var(x). The harm weights add variance, so you may simple add a penalty term, or rank by something like Harmweight - 2*SD (so the lower end of the harm CI). So in this example, the low end of the CI for Hotspot 1 is 0, but the low end of the CI for Hotspot2 is 80. So you would rank Hotspot2 higher, even though they are the same point estimate of harm.

The rank by low CI is a trick I learned from Evan Miller’s blog.

You could fancy this up more with estimating actual models, having multiple harm counts, etc. But this is a quick way to do it in a spreadsheet with just simple counts (assuming Poisson variance). Which I think is often quite reasonable in practice.

Using ESRI Python API

So I knew you could use python in ESRI, they have a notebook interface now. What I did not realize is now with Pro you can simply do pip install arcgis, and then just interact with your org. So for a quick example:

from arcgis.gis import GIS

# Your ESRI url
gis = GIS("https://modelpd.maps.arcgis.com/", username="user_email", password="???yourpassword???")
# For batch geocoding, probably need to do GIS(api_key=<your api key>)

This can be in whatever environment you want, so you don’t even need ArcGIS installed on the system to use this. It is all web-api’s with Pro. To geocode for example, you would then do:

from arcgis.geocoding import geocode, Geocoder, get_geocoders, batch_geocode

# Can search to see if any nice soul has published a geocoding server

arcgis_online = GIS()
items = arcgis_online.content.search('geocoder north carolina', 'geocoding service', max_items=30)

# And we have four
#[<Item title:"North Carolina Address Locator" type:Geocoding Layer owner:ecw31_dukeuniv>,
# <Item title:"Southeast North Carolina Geocoding Service" type:Geocoding Layer owner:RaleighGIS>, 
# <Item title:"Geocoding Service - AddressNC " type:Geocoding Layer owner:nconemap>, 
# <Item title:"ArcGIS World Geocoding Service - NC Extent" type:Geocoding Layer owner:NCDOT.GOV>]

geoNC = Geocoder.fromitem(items[0]) # lets try Duke
#geoNC = Geocoder.fromitem(items[-1]) # NCDOT.GOV
# can also do directly from URL
# via items[0].url
# url = 'https://utility.arcgis.com/usrsvcs/servers/8caecdf6384144cbafc9d56944af1ccf/rest/services/World/GeocodeServer'
# geoNC = Geocoder(url,gis)

# DPAC
res = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1)
print(res[0])

Note you cannot cache the geocoding results. To do that, you need to use credits and probably sign in via a token and not a username password.

# To cache, need a token
r2 = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1,for_storage=True)

# If you have multiple addresses, use batch_geocode, again need a token
#dc_res = batch_geocode(FullAddressList, geocoder=geoNC)

Geocoding to this day is still such a pain. I will need to figure out if you can make a local geocoding engine with ESRI and then call that through Pro (I mean I know you can, but not sure pricing for all that).

Overall being able to work directly in python makes my life so much easier, will need to dig more into making some standard dashboards and ETL processes using ESRI’s tools.

I have another post that has been half finished about using the ESRI web APIs, hopefully will have time to put that together before another 6 months passes me by!

Leave a comment

by Andy Wheeler on March 3, 2024 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, data science, geocoding, Python

Tagged ESRI, speaking

Posted by Andy Wheeler on March 3, 2024

https://andrewpwheeler.com/2024/03/03/harmweighted-hotspots-using-esri-python-api-and-crime-de-coder-updates/

Getting started with github notes

I mentioned on LinkedIn the other day I think github is a good resource for crime analysts to learn. Even if you don’t write code, it is convenient to have an audit-trail of changes in documents.

Jerry Ratcliffe made the comment that it is a tough learning curve, and I agree dealing with merge conflicts is a pain in the butt:

In the past I have suggested people to get started by using the github desktop GUI tool. But I do not suggest that anymore because of the issues Jerry mentions. If you get headaches like this, you pretty much need to use the command line to deal with them. I do not have many git commands memorized, and I will give a rundown of my getting started with git and github notes. So I just suggest now people bite the bullet and learn the command line.

Agree it takes some effort, but I think it is well worth it.

Making a project and first commit

Technically github is the (now Microsoft owned) software company that offers web hosted version control, and git is a more general system for version control. (There is another popular web host called Gitlab for example.) Here I will offer advice about using github and git from the command line.

So first, I typically create projects first online on the web-browser on github.com (I do not have the command prompt command memorized to create a new repository). On github.com, click the green New button:

Here I am creating a new repo named example_repo. I do it this way intentionally, as I can make sure I set the repo owner to the correct one (myself or my organization), and set the repo to the correct public/private by default. Many things you want to default to private.

Note on windows, the git command is probably not installed by default. If you install git-bash, it should be available in the command prompt.

Now that you have your repository created, in github I click the green code button, and copy the URL to the repo:

Then from the command line, navigate to where you want to download the repo (I set up my windows machine so I have a G drive mapped to where I download github repos). So from command line, mine looks like:

# cd to to correct location
git clone https://github.com/apwheele/example_repo.git
# now go inside the folder you just downloaded
cd ./example_repo

Now typically I do two things when first creating a repo, edit the README.md to give a high level overview of the project, and also create a .gitignore file (no file extension!). Often you have files that you don’t want committed to the github repository. Most of my .gitignore files look like this for example, where # are comment lines:

# No csv files
*.csv

# No python artifacts
*.pyc
__pycache__

# can prevent uploading entire folders if you want
/folder_dont_upload

Note if you don’t generally want files, but want a specific file for whatever reason, you can use an exclamation point, e.g. !./data/keep_me.csv will include that file, even if you have *.csv as ignored in the .gitignore file in general. And if you want to upload an empty folder, place a .gitkeep file in that folder.

Now in the command prompt, run git status. You will see the files that you have edited listed (minus any file that is ignored in the gitignore file).

So once you have those files edited, then in the command prompt you will do three different commands in a row:

git add .
git commit -m 'making init commit'
git push

The first command git add ., adds all of the files you edited (again minus any file that is ignored in the gitignore file). Note you can add a specific file one at a time if you want, e.g. git add README.md, but using the period adds all of the files you edited at once.

Git commit adds in a message where you should write a short note on the changes. Technically at this point you could go and do more changes, but here I am going to git push, which will send the updates to the online hosted github branch. (Note if doing this the first time from the command prompt, you may need to give your username and maybe set up a github token or do two-factor authentication).

You don’t technically need to do these three steps at once, but in my workflows I pretty much always do. Now you can go checkout the online github repo and see the updated changes.

Branches

When you are working on things yourself for small projects, just those above commands and committing directly to the default main branch is fine. Branches allow for more complicated scenarios like:

you want the main code to not change, but you want to experiment and try out some changes
you have multiple people working on the code at the same time

Branches provide isolation – they allow the code in the branch to change, whereas code in main (or other branches) does not change. Here I am going to show how to make a branch in the command prompt, but first a good habit when working with multiple people is to do at the start of your day:

git fetch
git pull origin main

Git fetch updates the repo if other collaborators added a branch (but does not update the files directly). And git pull origin main pulls the most recent main branch version. So if a colleague updated main, when you do git pull origin main it will update the code on your local computer. (If you want to pull the most recent version of a different branch, it will be git pull origin branch_name.)

To create a new branch, you can do:

git checkout -b new_branch

Note if the branch is already created you can just omit the -b flag, and this just switches to that branch. Make a change, and then when pushing, use git push origin new_branch, which specifies you are specifically pushing to your branch you just created (instead of pushing to the default main branch).

# after editing readme to make a change
git add .
git commit -m 'trivial edit'
git push origin new_branch

Now back in the browser, you can go and checkout the updated code by switching the branch you are looking at in the dropdown on the left hand part of the screen that says “new_branch” with the tiny branching diagram:

A final step, you want to merge the branch back into the main code script. If you see the big green button Compare and Pull Request in the above screenshot, click that, and it will bring up a dialog about creating a pull request. Then click the green Create Pull Request button:

Then after you created the request, it will provide another dialogue to merge in the code into the target (by default main).

If everything is ok (you have correct permissions and no merge conflicts), you can click the buttons to merge the branches and that is that.

Merge Conflicts

The rub with above is that sometimes merge conflicts happen, and as Jerry mentions, these can be a total pain to sort out. It is important to understand why merge conflicts happen first though, and to take steps to prevent them. In my experience merge conflicts most often happen because of two reasons:

Multiple people are working on the same branch, and I forget to run git pull origin branch at the start of my day, so I did not incorporate the most recent changes. (Note these can happen via auto-changes as well, such as github actions running scripts.)

The second scenario is someone updated main, and I did not update my version of main. This tends to occur with more long term development. Typically this means at the start of my day, I should have run git checkout main, then git pull origin main.

I tend to find managing merge conflicts is very difficult using the built in github tools (so I don’t typically use git rebase for example). More commonly, when I have a merge conflict for a single file, first I will save the file that is giving me problems outside of the github repo (so I don’t accidently delete/overwrite it). Then, if my new_branch is conflicting with main, I will do:

# this pulls the exact file from main
git checkout main conflict_file.txt
git add conflict_file.txt
git commit -m 'pulling file to fix conflict'
git push origin new_branch

Then if I want to make edits to conflict_file.txt, make the edits now, then redo add-commit-push.

This workflow tends to be easier in my experience than dealing with rebase or trying to edit the merge conflicts directly.

It is mostly important though to realize what caused the merge conflict to begin with, to prevent the pain of dealing with it again in the future. My experience these are mostly avoidable, and mean you made a personal mistake in not pulling the most recent version, or more rarely collaboration with a colleague wasn’t coordinated correctly (you both editing the same file at the same time).

I realize this is not easy – it takes a bit of work to understand github and incorporate into your workflow. I think it is a worthwhile tool for analysts and data scientists to learn though.

1 Comment

by Andy Wheeler on January 14, 2024 • Permalink

Posted in Crime Analysis, data science

Tagged command-line, github

Posted by Andy Wheeler on January 14, 2024

https://andrewpwheeler.com/2024/01/14/getting-started-with-github-notes/

Recoding America review, Data Science CV Update, Sworn Dashboard

Over this Christmas break I read Jennifer Pahlka’s Recoding America. It is a great book and I really recommend it.

My experience working in criminal justice is a bit different than Pahlka’s examples, but even if you are just interested in private sector product/project management this is a great book. It has various user experience gems as well (such as forms that eliminate people, put the eliminating questions in order by how many people they filter).

Pahlka really digs on waterfall, I have critiqued agile on the blog in the past, but we are both just using generic words to describe bad behavior. I feel like a kindred spirit with Pahlka based on some of her anecdotes; concrete boats, ridiculous form questions, PDF inputs that only work on ancient web-browsers, mainframes are not the problem stupid requirements are, hiring too many people makes things worse, people hanging up on them in phone calls when you tell the truth – so many good examples.

To be specific with agile/waterfall, Pahlka is very critical of fixed requirements coming down on high from policy makers. When you don’t have strong communication at the requirements gathering stage between techies, users and owners making the requests (which can happen in private sector too), you can get some comical inefficiencies.

A good example for my CJ followers are policies to do auto-clearance of records in California. So the policy makers made a policy that said those with felony convictions for stealing less than $1,000 can be expunged, but there is no automated way to do this, since the criminal records do not save the specific dollar amount in the larceny charge. (And to do the manual process is very difficult, so pretty much no one bothers.) It probably would make more sense to say something like “a single felony larceny charge that is 5 years old will be auto-cleared”, that is not exactly the same but is similar in spirit to what the legislature wants, and can be easily automated based on criminal records that are already collected by the state. A real effective solution would look like data people working with policy makers directly and giving scenarios “if we set the criteria to X, it will result in Y clearances”. These are close to trivial things to ask a database person to comment on, there is no fundamental reason why policy/techs can’t go back in forth and craft policy that makes sense and is simpler to implement.

To be more generic, what can happen is someone requests X, X is really hard/impossible, but you can suggest a,b,c instead that is easier to accomplish and probably meets the same high level goals. There is asymmetry in what people ask for and understanding of the work it takes to accomplish those requests, an important part of your job as a programmer/analyst is to give feedback to those asking to make the requests better. It takes understanding from the techies of the business requirements (Pahlka suggests govt should hire more product owners, IMO would rather just have senior developer roles do that stuff directly). And it takes people asking to be open to potential changes. Which most people are in my experience, just sometimes you get people who hang up in phone calls when you don’t tell them what they want to hear.

I actually like the longer term plan out a few months waterfall approach (I find that easier to manage junior developers, I think the agile shorter term stuff is too overbearing at times). But it requires good planning and communication between end users and developers no matter whether you say you are doing waterfall or agile. My experience in policing is not much like the policy people giving stone tablets, I have always had more flexibility to give suggestions in my roles. But I do think many junior crime analysts need to learn to say “you asked for percent change, here is a different stat instead that is better for what you want”.

What I am trying to do with CRIME De-Coder is really consistent with Pahlka’s goals with Code for America. I think it is really important for CJ agencies to take on more human investment in tech. Part of the reason I started CRIME De-Coder was anger – I get angry when I see cities pay software vendors six digits for crappy software that a good crime analyst could do. Or pay a consulting firm 6 figures for some mediocre (and often inappropriate) statistical analysis. Cities can do so much better by internally developing skills to take on many software projects, which are not moving mountains, and often outside software causes more problems than they solve.

At work we are starting to hire a new round of data scientists (no links to share, they are offshore in India, and first round is through a different service). The resume over-stating technical expertise for data scientists is at lol levels at this point. Amazing how everyone is an LLM, deep learning, and big data expert these days.

I’ve written before how I am at a loss on how to interview data scientists. The resumes I am getting are also pretty much worthless at this point. One problem I am seeing in these resumes is that people work on teams, so people can legitimately claim “I worked on this LLM”, but when you dig in and ask about specifics you find out they only contributed this tiny thing (which is normal/OK). But the resumes look like they are Jedi masters in advanced machine learning.

I went and updated my data science resume in response to reading others. (I should probably put that in HTML, so it shows up in google search results.) I don’t really have advice for folks “what should your resume look like” – I have no clue how recruiters view these things. No doubt my resume is not immune to a recruiter saying “you have 10+ years with python, but I don’t see any Jira experience, so I don’t think you are qualified”.

What I have done is only include stuff in the resume where I can link to specific, public examples (peer reviewed work, blog posts, web pages, github). I doubt recruiters are going to click on a single link in the resume (let alone all 40+), but that is what I personally would prefer when I am reviewing a resume. Real tangible stuff so someone can see I actually know how to write code.

So for example in the most recent update of the resume, I took Unix, Kubernetes/Docker, Azure, and Databricks off. Those are all tech I have worked with at HMS/Gainwell, but do not have any public footprint to really show off. I have some stuff on Docker on the blog, but nothing real whiz bang to brag about. And I have written some about my deployment strategy for python code in Databricks using github actions. (I do like Azure DevOps pipelines, very similar to building github actions, which are nice for many of the batch script processes I do. My favorite deployment pattern at work is using conda + persistent Fedora VMs. Handling servers/kubernetes everything docker is a total pain.) “Expertise” in those tools is probably too strong, I think claiming basic competence is reasonable though. (Databricks has changed so much in the two years we have been using it at work I’m not sure anyone outside of Databricks themselves could claim expertise – only if you are a very fast learner!)

But there is no real fundamental way for an outsider to know I have any level of competence/expertise in these tech tools. Honestly they do not matter – if you want me to use google cloud or AWS for something equivalent to Azure DevOps, or Snowflake instead of Databricks, it doesn’t really matter. You just learn the local stack in a month or two. Some rare things you do need very specialized tech skills, say if someone wanted me to optimize latency in serving pytorch LLMs, that will be tough given my background. Good luck posting that position on LinkedIn!

But the other things I list, I can at least pull up a web page to say “here is code I wrote to do this specific thing”. Proof is in the pudding. Literally 0 of the resumes I am reviewing currently have outside links to any code, so it could all be made up (and clearly for many people is embellished). I am sure people think mine is embellished as well, best I can do to respond to that is share public links.

For updates on CRIME De-Coder:

Sworn Dashboard, trends in officer hiring using D3.js to make charts and supabase as a backend
How to evaluate predictive policing software blog post
Crime De-Coder Storefront, currently can purchase (in and outside US) my GUI tool to download streetview images, as well as pre-release of python book

I researched ways to do payments for so long, in the end just turning on WooPayments in wordpress (and using an iframe) was a super simple solution (and it works fine for digital downloads and international payments). Will need to figure out webhooks with Stripe to do more complicated stuff eventually (like SaaS services, licenses, recurring payments), but for now this set up works for what I need.

I will start up newsletters again next week.

2 Comments

by Andy Wheeler on January 2, 2024 • Permalink

Posted in Criminal Justice, data science, scholarly

Tagged book-review, resume

Posted by Andy Wheeler on January 2, 2024

https://andrewpwheeler.com/2024/01/02/recoding-america-review-data-science-cv-update-sworn-dashboard/

Won NIJ competition on surveys

The submission Gio and myself put together, Using Every Door Direct Mail Web Push Surveys and Multi-level modelling with Post Stratification to estimate Perceptions of Police at Small Geographies, has won the NIJ Innovations in Measuring Community Attitudes Survey challenge.

Specifically we took 1st in the non-probability section of the competition. The paper has the details, but using every door direct mail + post-stratifying the estimates is the approach we advocate. If you are a city or research group interested in implementing this and need help, feel free to get in touch.

Of course if you want to do this yourself go for it (part of the reason it won was because the method should be doable for many agencies in house), but letting me and Gio know we were the inspiration is appreciated!

Second, for recruiting for criminology PhDs, CRIME De-Coder has teamed up with the open access CrimRXiv consortium

This example shows professor adverts, but I think the best value add for this is for more advanced local govt positions. Anymore many of those civil service gigs are very competitive with lagging professor salaries.

For people hiring advanced roles, there are two opportunities. One is advertising – so for about the same amount as advertising on LinkedIn, you can publish a job advert. This is much more targeted than LinkedIn, so if you want PhD talent this is a good deal to get your job posting on the right eyeballs.

The second service is recruiting for a position. This is commission based – if I place a candidate for the role then you pay the recruiter (me and CrimRXiv) a commission. For that I personally reach out to my network of people with PhDs interested in positions, and do the first round of vetting for your role.

Third, over on Crime De-Coder I have another round of the newsletter up, advice this round is that many smaller cities have good up and coming tech markets, plus advice about making fonts larger in python/R plots. (Note in response to that post, Greg Ridgeway says it is better to save as vector graphics as oppossed to high res PNG. Vector is slightly more work to check everything is kosher in the final produced plot, but that is good advice from Greg. I am lazy with the PNG advice.)

No more newsletters this year, but let me know if you want to sign up and I will add you to the list.

Last little tidbit, in the past I have used the pdftk tool to combine multiple PDFs together. This is useful when using other tools to create documents, so you have multiple outputs in the end (like a cover page or tech appendix), and want to combine those all together into a single PDF to share. But one thing I noticed recently, if your PDF has internal table of content (TOC) links (as is the case for LaTeX, or in my case a document built using Quarto), using pdftk will make the TOC links go away. You can however use ghostscript instead, and the links still work as normal.

On my windows machine, it looks something like:

gswin64 -q -sDEVICE=pdfwrite -o MergedDoc.pdf CoverPage.pdf Main.pdf Appendix.pdf

So a few differences that if you just google. Installing the 64 bit version on my windows machine, the executable is gswin64, not gs from the command line. Second, I needed to manually add C:\Program Files\gs\gs10.02.1\bin to my PATH for this to work at the command prompt the way you would expect, installing did not do that directly.

Quarto is awesome by the way – definitely suggest people go check that out.

Leave a comment

by Andy Wheeler on December 12, 2023 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, data science

Tagged post-stratification, survey

Posted by Andy Wheeler on December 12, 2023

https://andrewpwheeler.com/2023/12/12/won-nij-competition-on-surveys/

Sentence embeddings and caching huggingface models in github actions

For updates on Crime De-Coder:

AltAc Newsletter posts up to 4 are posted. Get in touch if you are interested in recruiting (nearly 100 people signed up so far, most with Phds, interested in positions)
I have a post on what distinguishes my crime analysis training from others – tech expertise plus I make it customized.

This is actually more born out from demand interacting with crime analysis groups – it doesn’t really make sense for me to swoop in and say “do X/Y/Z, here is my report”. My experience doing that with PDs is not good – my recommendations often do not get implemented. So it is important to upskill your current analysts to be able to conduct more rigorous analysis over time.

Some things it is better to have an external person do the report (such as program evaluation, cost-benefit analysis, or racial bias metrics). But things like generating hotspots and chronic offender lists, those should be something your crime analysis unit learns how to do.

You can of course hire me to do those things, but if you don’t train your local analysts to take over the job, you have to perpetually hire me to do that. Which may make sense for small departments without an internal crime analysis unit. But for large agencies you should be using your crime analysts to do reports like that.

This post, I also wanted to share some work on NLP. I have not been immune at work given all the craze. I am more impressed with vector embeddings and semantic search though than I am with GenAI applications. GenAI I think will be very useful, like a more advanced auto-complete, but I don’t think it will put us all out of work anytime soon.

Vector embeddings, for those not familiar, are taking a text input and turning it into numbers. You can then do nearness calculations with the numbers. So say you had plain text crime narrative notes on modus operandi, and wanted to search “breaking car window and stealing purse” – you don’t want to do an exact search on that phrase, but a search for text that is similar. So the vector similarity search may return a narrative that is “break truck back-window with pipe, take backpack”. Not exactly the same, but semantically very similar.

Sentencetransformers makes this supersimple (no need to use external APIs for this).

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Base query
query = ['breaking car window and stealing purse']

# Example MO
mo = ['break truck back window with pipe, steal backpack',
      'pushing in a window AC unit, steal TV and sneakers',
      'steal car from cloned fab, joyride']

#Compute embedding for both lists
qemb = model.encode(query, convert_to_tensor=True)
moemb = model.encode(mo, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(moemb, qemb)
print(cosine_scores)

So you have your query, and often people just return the top-k results, since to know “how close is reasonably close” is difficult in practice (google pretty much always returns some results, even if they are off the mark!).

I am quite impressed with the general models for this (even very idiosyncratic documents and jargon it tends to do quite well). But if you want to embed specifically trained models on different tasks, I often turn to simpletransformers. Here is a model based on clinical case notes.

from simpletransformers.language_representation import RepresentationModel

model = RepresentationModel("bert", "medicalai/ClinicalBERT")
sent1 = ["Procedure X-ray of wrist"] # needs to be a list
wv1 = model.encode_sentences(sent1, combine_strategy="mean")

For another example, I shared a github repo, hugging_cache, where I did some work on how to cache huggingface models in a github actions script. This is useful for unit tests, so you don’t need to redownload the model everytime.

Github cache size for free accounts is 10 gigs (and you need the environment itself, which is a few gigs). So huggingface models up to say around 4 (maybe 5) gigs will work out OK in this workflow. So not quite up to par with the big llama models, but plenty large enough for these smaller embedding models.

It is not very difficult to build a local, in-memory vector search database. Which will be sufficient for many individuals. So a crime analyst could build a local search engine for use – redo vector DB on a batch job, then just have a tool/server to do the local lookup. (Or just use a local sqlite database is probably sufficient for a half dozen analysts/detectives may use this tool every now and then.)

Leave a comment

by Andy Wheeler on November 29, 2023 • Permalink

Posted in Crime Analysis, data science, Python

Tagged embeddings, nlp

Posted by Andy Wheeler on November 29, 2023

https://andrewpwheeler.com/2023/11/29/sentence-embeddings-and-caching-huggingface-models-in-github-actions/

Forecasts need to have error bars

Richard Rosenfeld in the most recent Criminologist published a piece about forecasting national level crime rates. People complain about the FBI releasing crime stats a year late, academics are worse; Richard provided “forecasts” for 2021 through 2025 for an article published in late 2023.

Even ignoring the stalecasts that Richard provided – these forecasts had/have no chance of being correct. Point forecasts will always be wrong – a more reasonable approach is to provide the prediction intervals for the forecasts. Showing error intervals around the forecasts will show how Richard interpreting minor trends is likely to be misleading.

Here I provide some analysis using ARIMA models (in python), to illustrate what reasonable forecast error looks like in this scenario, code and data on github.

You can get the dataset on github, but just some upfront with loading the libraries I need and getting the data in the right format:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# via https://www.disastercenter.com/crime/uscrime.htm
ucr = pd.read_csv('UCR_1960_2019.csv')
ucr['VRate'] = (ucr['Violent']/ucr['Population'])*100000
ucr['PRate'] = (ucr['Property']/ucr['Population'])*100000
ucr = ucr[['Year','VRate','PRate']]

# adding in more recent years via https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/docApi
# I should use original from counts/pop, I don't know where to find those though
y = [2020,2021,2022]
v = [398.5,387,380.7]
p = [1958.2,1832.3,1954.4]
ucr_new = pd.DataFrame(zip(y,v,p),columns = list(ucr))
ucr = pd.concat([ucr,ucr_new],axis=0)
ucr.index = pd.period_range(start='1960',end='2022',freq='A')

# Richard fits the model for 1960 through 2015
train = ucr.loc[ucr['Year'] <= 2015,'VRate']

Now we are ready to fit our models. To make it as close to apples-to-apples as Richard’s paper, I just fit an ARIMA(1,1,2) model – I do not do a grid search for the best fitting model (also Richard states he has exogenous factors for inflation in the model, which I do not here). Note Richard says he fits an ARIMA(1,0,2) for the violent crime rates in the paper, but he also says he differenced the data, which is an ARIMA(1,1,2) model:

# Not sure if Richard's model had a trend term, here no trend
violent = ARIMA(train,order=(1,1,2),trend='n').fit()
violent.summary()

This produces the output:

                               SARIMAX Results
==============================================================================
Dep. Variable:                  VRate   No. Observations:                   56
Model:                 ARIMA(1, 1, 2)   Log Likelihood                -242.947
Date:                Sun, 19 Nov 2023   AIC                            493.893
Time:                        19:33:53   BIC                            501.923
Sample:                    12-31-1960   HQIC                           496.998
                         - 12-31-2015
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4545      0.169     -2.688      0.007      -0.786      -0.123
ma.L1          1.1969      0.131      9.132      0.000       0.940       1.454
ma.L2          0.7136      0.100      7.162      0.000       0.518       0.909
sigma2       392.5640    104.764      3.747      0.000     187.230     597.898
===================================================================================
Ljung-Box (L1) (Q):                   0.13   Jarque-Bera (JB):                 0.82
Prob(Q):                              0.72   Prob(JB):                         0.67
Heteroskedasticity (H):               0.56   Skew:                            -0.06
Prob(H) (two-sided):                  0.23   Kurtosis:                         2.42
===================================================================================

So some potential evidence of over-differencing (with the negative AR(1) coefficient). Looking at violent.test_serial_correlation('ljungbox') there is no significant serial auto-correlation in the residuals. One could use some sort of auto-arima approach to pick a “better” model (it clearly needs to be differenced at least once, also maybe should also be modeling the logged rate). But there is not much to squeeze out of this – pretty much all of the ARIMA models will produce very similar forecasts (and error intervals).

So in the statsmodels package, you can append new data and do one step ahead forecasts, so this is comparable to Richard’s out of sample one step ahead forecasts in the paper for 2016 through 2020:

# To make it apples to apples, only appending through 2020
av = (ucr['Year'] > 2015) & (ucr['Year'] <= 2020)
violent = violent.append(ucr.loc[av,'VRate'], refit=False)

# Now can show insample predictions and forecasts
forecast = violent.get_prediction('2016','2025').summary_frame(alpha=0.05)

If you print(forecast) below are the results. One of the things I want to note is that if you do one-step-ahead forecasts, here the years 2016 through 2020, the standad error is under 20 (this is well within Richard’s guesstimate to be useful it needs to be under 10% absolute error). When you start forecasting multiple years ahead though, the error compounds over time. So to forecast 2022, you need a forecast of 2021. To forecast 2023, you need to forecast 21,22 and then 23, etc.

VRate        mean    mean_se  mean_ci_lower  mean_ci_upper
2016   397.743461  19.813228     358.910247     436.576675
2017   402.850827  19.813228     364.017613     441.684041
2018   386.346157  19.813228     347.512943     425.179371
2019   379.315712  19.813228     340.482498     418.148926
2020   379.210158  19.813228     340.376944     418.043372
2021   412.990860  19.813228     374.157646     451.824074
2022   420.169314  39.803285     342.156309     498.182318
2023   416.906654  57.846105     303.530373     530.282936
2024   418.389557  69.535174     282.103120     554.675994
2025   417.715567  80.282625     260.364513     575.066620

The standard error scales pretty much like sqrt(steps*se^2) (it is additive in the variance). Richard’s forecasts do better than mine for some of the point estimates, but they are similar overall:

# Richard's estimates
forecast['Rosenfeld'] = [399.0,406.8,388.0,377.0,394.9] + [404.1,409.3,410.2,411.0,412.4]
forecast['Observed'] = ucr['VRate']

forecast['MAPE_Andy'] = 100*(forecast['mean'] - forecast['Observed'])/forecast['Observed']
forecast['MAPE_Rick'] = 100*(forecast['Rosenfeld'] - forecast['Observed'])/forecast['Observed']

And this now shows for each of the models:

VRate        mean  mean_ci_lower  mean_ci_upper  Rosenfeld    Observed  MAPE_Andy  MAPE_Rick
2016   397.743461     358.910247     436.576675      399.0  397.520843   0.056002   0.372095
2017   402.850827     364.017613     441.684041      406.8  394.859716   2.023785   3.023931
2018   386.346157     347.512943     425.179371      388.0  383.362999   0.778155   1.209559
2019   379.315712     340.482498     418.148926      377.0  379.421097  -0.027775  -0.638103
2020   379.210158     340.376944     418.043372      394.9  398.500000  -4.840613  -0.903388
2021   412.990860     374.157646     451.824074      404.1  387.000000   6.715985   4.418605
2022   420.169314     342.156309     498.182318      409.3  380.700000  10.367563   7.512477
2023   416.906654     303.530373     530.282936      410.2         NaN        NaN        NaN
2024   418.389557     282.103120     554.675994      411.0         NaN        NaN        NaN
2025   417.715567     260.364513     575.066620      412.4         NaN        NaN        NaN

So MAPE in the held out sample does worse than Rick’s models for the point estimates, but look at my prediction intervals – the observed values are still totally consistent with the model I have estimated here. Since this is a blog and I don’t need to wait for peer review, I can however update my forecasts given more recent data.

# Given updated data until end of series, lets do 23/24/25
violent = violent.append(ucr.loc[ucr['Year'] > 2020,'VRate'], refit=False)
updated_forecast = violent.get_forecast(3).summary_frame(alpha=0.05)

And here are my predictions:

VRate        mean    mean_se  mean_ci_lower  mean_ci_upper
2023   371.977798  19.813228     333.144584     410.811012
2024   380.092102  39.803285     302.079097     458.105106
2025   376.404091  57.846105     263.027810     489.780373

You really need to graph these out to get a sense of the magnitude of the errors:

Note how Richard’s 2021 and 2022 forecasts and general increasing trend have already been proven to be wrong. But it really doesn’t matter – any reasonable model that admitted uncertainty would never let one reasonably interpret minor trends over time in the way Richard did in the criminologist article to begin with (forecasts for ARIMA models are essentially mean-reverting, they will just trend to a mean term in a short number of steps). Richard including exogenous factors actually makes this worse – as you need to forecast inflation and take that forecast error into account for any multiple year out forecast.

Richard has consistently in his career overfit models and subsequently interpreted the tea leaves in various macro level correlations (Rosenfeld, 2018). His current theory of inflation and crime is no different. I agree that forecasting is the way to validate criminological theories – picking up a new pet theory every time you are proven wrong though I don’t believe will result in any substantive progress in criminology. Most of the short term trends criminologists interpret are simply due to normal volatility in the models over time (Yim et al., 2020). David McDowall has a recent article that is much more measured about our cumulative knowledge of macro level crime rate trends – and how they can be potentially related to different criminological theories (McDowall, 2023). Matt Ashby has a paper that compares typical errors for city level forecasts – forecasting several years out tends to product quite inaccurate estimates, quite a bit larger than Richard’s 10% is useful threshold (Ashby, 2023).

Final point that I want to make is that honestly it doesn’t even matter. Richard can continue to keep making dramatic errors in macro level forecasts – it doesn’t matter if he publishes estimates that are two+ years old and already wrong before they go into print. Because unlike what Richard says – national, macro level violent crime forecasts do not help policy response – why would Pittsburgh care about the national level crime forecast? They should not. It does not matter if we fit models that are more accurate than 5% (or 1%, or whatever), they are not helpful to folks on the hill. No one is sitting in the COPS office and is like “hmm, two years from now violent crime rates are going up by 10, lets fund 1342 more officers to help with that”.

Richard can’t have skin the game for his perpetual wrong macro level crime forecasts – there is no skin to have. I am a nerd so I like looking at numbers and fitting models (or here it is more like that XKCD comic of yelling at people on the internet). I don’t need to make up fairy tale hypothetical “policy” applications for the forecasts though.

If you want a real application of crime forecasts, I have estimated for cities that adding an additional home or apartment unit increases the number of calls per service by about 1 per year. So for growing cities that are increasing in size, that is the way I suggest to make longer term allocation plans to increase police staffing to increase demand.

References

Ashby, M. (2023). Forecasting crime trends to support police strategic decision making. CrimRxiv.
McDowall, D. (2023). Empirical Properties of Crime Rate Trends. Journal of Contemporary Criminal Justice, 10439862231189979.
Rosenfeld, R. (2018). Studying crime trends: Normal science and exogenous shocks. Criminology, 56(1), 5-26.
Yim, H. N., Riddell, J. R., & Wheeler, A. P. (2020). Is the recent increase in national homicide abnormal? Testing the application of fan charts in monitoring national homicide trends over time. Journal of Criminal Justice, 66, 101656.

1 Comment

by Andy Wheeler on November 19, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Python

Tagged arima, forecast

Posted by Andy Wheeler on November 19, 2023

https://andrewpwheeler.com/2023/11/19/forecasts-need-to-have-error-bars/

Fitting a gamma distribution (with standard errors) in python

Over on Crime De-Coder, I have up a pre-release of my book, Data Science for Crime Analysis with Python.

I have gotten asked by so many people “where to learn python” over the past year I decided to write a book. Go check out my preface on why I am not happy with current resources in the market, especially for beginners.

I have the preface + first two chapters posted. If you want to sign up for early releases, send me an email and will let you know how to get an early copy of the first half of the book (and send updates as they are finished).

Also I have up the third installment of the AltAc newsletter. Notes about the different types of healthcare jobs, and StatQuest youtube videos are a great resource. Email me if you want to sign up for the newsletter.

So the other day I showed how to fit a beta-binomial model in python. Today, in a quick post, I am going to show how to estimate standard errors for such fitted models.

So in scipy, you have distribution.fit(data) to fit the distribution and return the estimates, but it does not have standard errors around those estimates. I recently had need for gamma fits to data, in R I would do something like:

# this is R code
library(fitdistrplus)
x <- c(24,13,11,11,11,13,9,10,8,9,6)
gamma_fit <- fitdist(x,"gamma")
summary(gamma_fit)

And this pipes out:

> Fitting of the distribution ' gamma ' by maximum likelihood
> Parameters :
>        estimate Std. Error
> shape 8.4364984  3.5284240
> rate  0.7423908  0.3199124
> Loglikelihood:  -30.16567   AIC:  64.33134   BIC:  65.12713
> Correlation matrix:
>           shape      rate
> shape 1.0000000 0.9705518
> rate  0.9705518 1.0000000

Now, for some equivalent in python.

# python code
from scipy.optimize import minimize
from scipy.stats import gamma

x = [24,13,11,11,11,13,9,10,8,9,6]

res = gamma.fit(x,floc=0)
print(res)

And this gives (8.437256958682587, 0, 1.3468401423927603). Note that python uses the scale version, so 1/scale = rate, e.g. 1/res[-1] results in 0.7424786123640676, which agrees pretty closely with R.

Now unfortunately, there is no simple way to get the standard errors in this framework. You can however minimize the parameters yourself and get the things you need – here the Hessian – to calculate the standard errors yourself.

Here to make it easier apples to apples with R, I estimate the rate parameter version.

# minimize negative log likelihood
def gll(parms,data):
    alpha, rate = parms
    ll = gamma.logpdf(data,alpha,loc=0,scale=1/rate)
    return -ll.sum()

result = minimize(gll,[8.,0.7],args=(x),method='BFGS')
se = np.sqrt(np.diag(result.hess_inv))
# printing results
np.stack([result.x,se],axis=1)

And this gives:

array([[8.43729465, 3.32538824],
       [0.74248193, 0.30255433]])

Pretty close, but not an exact match, to R. Here the standard errors are too low (maybe there is a sample correction factor?).

Some quick tests with this, using this scale version tends to be a bit more robust in fitting than the rate version.

The context of this, a gamma distribution with a shape parameter greater than 1 has a hump, and less than 1 is monotonically decreasing. This corresponds to the “buffer” hypothesis in criminology for journey to crime distributions. So hopefully some work related to that coming up soonish!

And doing alittle more digging, you can get “closed form” estimates of the parameters based on the Fisher Information matrix (via Wikipedia).

# Calculating closed form standard
# errors for gamma via Fisher Info
from scipy.special import polygamma
from numpy.linalg import inv
import numpy as np

def trigamma(x):
    return float(polygamma(1,x))

# These are the R estimates
shape = 8.4364984
rate = 0.7423908
n = 11 # 11 observations

fi = np.array([[trigamma(shape), -1/rate],
               [-1/rate, shape/rate**2]])

inv_fi = inv(n*fi)  # this is variance/covariance
np.sqrt(np.diagonal(inv_fi))

# R reports      3.5284240 0.3199124
# python reports 3.5284936 0.3199193

I do scare quotes around closed form, as it relies on the trigamma function, which itself needs to be calculated numerically I believe.

If you want to go through the annoying math (instead of using the inverse above), you can get a direct one liner for either.

# Closed for for standard error of shape
np.sqrt(shape / (n*(trigamma(shape)*shape - 1)))
# 3.528493591892887

# This works for the scale or the rate parameterization
np.sqrt((trigamma(shape))/ (n*rate**-2*(trigamma(shape)*shape - 1)))
# 0.31995664847081356

# This is the standard error for the scale version
scale = 1/rate
np.sqrt((trigamma(shape))/ (n*scale**-2*(trigamma(shape)*shape - 1)))
# 0.5803962127677769

Leave a comment

by Andy Wheeler on November 10, 2023 • Permalink

Posted in data science, Python

Tagged scipy

Posted by Andy Wheeler on November 10, 2023

https://andrewpwheeler.com/2023/11/10/fitting-a-gamma-distribution-with-standard-errors-in-python/

AMA: Advice on clustering

Ashely Evan’s writes in with a question:

I very recently started looking into clustering which I’ve only touched upon briefly in the past.

I have an unusual dataset, with dichotomous or binary responses for around 25000 patents and 35 categories.

Would you be able to recommend a suitable method, I couldn’t see if you’d done anything like this before on your site.

It’s a similar situation to a survey with 25000 respondents and 35 questions which can only be answered yes/no (1 or 0 is how I’ve represented this in my data).

The motivation for clustering would be to identify which questions/areas naturally cluster together to create distinct profiles and contrast differences.

I tried the k modes algorithm in r, using an elbow method which identified 3 clusters. This is a decent starting point, the size of the clusters are quite unbalanced, two had one common category for every result and the other category was quite fragmented.

I figured this topic would be a good one for the blog. The way clustering is treated in many data analysis courses is very superficial, so this contains a few of my thoughts to help people in conducting real world cluster analysis.

I have never done any project with similar data. So caveat emptor on advice!

So first, clustering can be tricky, since it is very exploratory. If you can put more clear articulation what the end goal is, I always find that easier. Clustering will always spit out solutions, but having clear end-goals makes it easier to tell whether the clustering has any face validity to accomplish those tasks. (And sometimes people don’t want clustering, they want supervised learning or anomaly detection.) What is the point of the profiles? Do you have outcomes you expect with them (like people do in market segmentation)?

The clustering I have done is geospatial – I like a technique called DBSCAN – this is very different than K-means (which every point is assigned into a cluster). You just identify areas of many cases nearby in space, and if this area has greater than some threshold, it is a local cluster. K-means being uneven is typical, as every point needs to be in a cluster. You tend to have a bunch of junk points in the cluster (so sometimes focusing on the mean or modal point in k-means may be better than looking at the whole distribution).

I don’t know if DBSCAN makes sense though for 0/1 data. Another problem with clustering many variables is what is called the curse of dimensionality. If you have 3 variables, you can imagine drawing your 3d scatterplot and clustering of those points in that 3d space. You cannot physically imagine it, but clustering with more variables is like this visualization, but in many higher dimensions.

What happens though is that in higher dimensions, all of the points get pushed away from each other, and closer to the hull of the that k-dimensional sphere (or I should say box here with 0/1 data). So the points tend to be equally far apart, and so clusters are not well defined. This is a different problem, but I like this example of averaging different dimensions to make a pilot that does not exist, it is the same issue.

There may be ways to take your 35 inputs and reduce down to fewer variables (the curse of dimensionality comes at you fast – binary variables may not be as problematic as continuous ones, but it is a big deal for even as few as 6-10 dimensions).

So random things to look into:

factor analysis of dichotomous variables (such as ordination analysis), or simply doing PCA on the columns may identify redundant columns (this doesn’t get you row wise clusters, but PCA followed by K-means is a common thing people do). Note that this only applies to independent categories, turning a single category into 35 dummy variables and then doing PCA does not make sense.

depending on what you want, looking at association rules/frequent item sets may be of interest. So that is identifying cases that tend to cluster with pairs of attributes.

for just looking at means of different profiles, latent class analysis I think is the “best” approach out of the box (better than k-means). But it comes with its own problems of selecting the number of groups.

The regression + mixture model I think is a better way to view clustering in a wider variety of scenarios, such as customer segmentation. I really do not like k-means, I think it is a bad default for many real world scenarios. But that is what most often gets taught in data science courses.

The big thing is though you need to be really clear about what the goals of the analysis are – those give you ways to evaluate the clustering solutions (even if those criteria are only fuzzy).

Leave a comment

by Andy Wheeler on October 31, 2023 • Permalink

Posted in ask me anything, data science

Tagged clustering

Posted by Andy Wheeler on October 31, 2023

https://andrewpwheeler.com/2023/10/31/ama-advice-on-clustering/

Why give advice?

So recently in a few conversations (revolving around the tech recruiting service I am starting), I get asked the question “Why are you bothering to give me advice?”.

It is something I have done regularly for almost a decade – but for many years it was not publicized. So from blog posts I get emails from academics/grad students maybe once a month on stats questions. And more recently with going to the private sector, I get emails once a month from first/second degree connections about my experience with that. (These are actually more often mid-career academics than newly minted PhDs.)

So I have just made it more public that I give that type of advice. On this blog I started an irregular ask me anything. I will often just turn these into their own blog posts, see for example my advice on learning stats/machine learning. And for the tech recruiting I have been having phone calls with individuals recently and forwarding potential opportunities, see my recent post on different tech positions and salary ranges.

It is hard for me to articulate why I do this that is not cheesy or hubristic (if that is even a word). Individuals who have gotten criminal justice (CJ) PhDs in the last 15 years, we likely have very similar shared experiences. One thing that has struck me – and I feel this even more strongly now than I did when I was an academic – is that individuals who I know that have a CJ Phd are really smart. I have not met a single CJ PhD who I was like “how did this person get a PhD?”.

This simultaneously makes me sad/angry/frustrated when I see very talented individuals go through essentially the same struggles I did in academia. But for the grace of God there I go. On the flipside I have gotten some very bad advice in my career – not intentionally malicious but often from senior people in my life who did not know better given their lack of contemporary knowledge. (I wonder if that is inevitable when we get older – always critically examine advice, even from me!)

Some people I know do “life-coaching”, or simply charge per meeting. To be clear I don’t have any plans on doing that. It just doesn’t make sense for me to do that (the hubris thing – I think my advice is worth that, but I am not interested in squeezing people for a few dollars). If I am too busy to have a 30 minute phone call or send an email with quick stat advice I will just say so.

Life isn’t zero sum – if you do well that does not mean I do bad – quite the opposite for the majority of scenarios. I want to see my colleagues and friends be in positions that better appreciate (and compensate) their skills.

Leave a comment

by Andy Wheeler on October 23, 2023 • Permalink

Posted in ask me anything, data science

Tagged job-advice, social-networking

Posted by Andy Wheeler on October 23, 2023

https://andrewpwheeler.com/2023/10/23/why-give-advice/

Fitting beta binomial in python, Poisson scan stat in R

Sharing two pieces of code I worked on recently for various projects. First is fitting a beta binomial distribution in scipy. I had a need for this the other day, looking at the count of near duplicates in surveys. In that code I just used the method of moments estimator (which can often misbehave).

The top google results for this are all very bad (either code that does not work, or alternatives that are not good advice). One of the articles was even auto-generated content (ala ChatGPT type stuff), that had a few off the mark points (although was superficially what the python solution should look like, so the unwary would be led down a wrong path).

So here I show how to estimate the maximum likelihood estimate for the beta binomial distribution. First, because scipy already has a function for the pmf for the beta-binomial distribution it is pretty simple. For all of the discrete distributions, it should look like -dist.logpmf(..data..,...params...).sum(). In complicated stats speak this is “the negative of the sum of the log likelihood”. It is easier for me anyway to think in terms of the PDF/PMF though (the probability of observing your data given fixed parameters). And you find the parameters that maximize that probability over your entire sample. But to make the math easier we take the logs of the probabilities (so we can work with sums instead of multiplications), the log PMF here, and we take the negative so we find the minimum of the function.

Then you just pass the appropriate arguments to minimize and you are good to go.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom
np.random.seed(10)

# simulating some random data
a = 0.8
b = 1.2

sim_size = 1000

n = 90
r = betabinom.rvs(n, a, b, size=sim_size)

# minimize negative log likelihood
def bbll(parms,k,n):
    alpha, beta = parms
    ll = betabinom.logpmf(k,n,alpha,beta)
    return -ll.sum()

result = minimize(bbll,[1,1],args=(r,90),method='Nelder-Mead')
print(result.x) # [alpha, beta]

And this returns for me:

>>> print(result.x)
[0.77065051 1.16323611]

Using simple simulations you can often get a feel for different estimators. Here n and sim_size make a decent difference for estimating beta, and beta I think tends to be biased downward in the smaller sample scenarios. (People don’t realize, for these non-normal distributions it is not un-common to need 1000+ observations to get decent un-biased estimates depending on the distribution.)

Note the nature of the data here, it is something like hits [5,8,9], and then a second either constant for every number (if the denominator is say 10 for all the observations, can just pass a constant 10). The denominator can however be variable in this set up, so you could have a set of different denominators like [6,8,10].

a = 1.5
b = 4.1

n = np.random.choice(range(30,90),size=sim_size)
r = betabinom.rvs(n, a, b, size=sim_size)

result = minimize(bbll,[1,1],args=(r,n),method='Nelder-Mead')
print(result.x)

Which returns:

>>> print(result.x)
[1.50563582 3.99837155]

I note here that some examples of beta-binomial use weighted data (the wikipedia page does this). These functions expect unweighted data. Functions that need to be repeatedly called (like the likelihood function here) I don’t like making them general with ifs and other junk, I would rewrite the bbll function for different forms of data and call that different function.

Also, as always, you need to check these to make sure the fitted parameters make sense and reasonably fit your data (plot the predicted PMF vs the observed histogram). The function here can converge, but could converge to non-sense (you probably don’t need to worry about constraints on the parameters, but better starting values are probably a good idea).

For future notes for myself, Guimaraes (2005) has examples of using fixed effects negative binomial and translating to beta-binomial (for fixed n). Also Young-Xu & Chan (2008) is a very nice reference (has Hessian, so if I wanted to estimate standard errors), as well as discussion of determining whether to use this model or a binomial with no extra dispersion.

The second thing I will post about is a scan statistic. The background is imagine someone comes to you and says “Hey, there were 10 robberies in the past week, is that normal or low?”. In the scenario where you have fixed time intervals, e.g. Monday through Sunday, and your data is approximately Poisson distributed, you can calculate the CDF. So say your mean per week over 2 years is 3.1, the probability of observing a count of 10 or more in a specific week is:

> # R code
> 1 - ppois(10-1,3.1)
> [1] 0.001400924

So alittle more than 1 in 1000. But if you ask the question “What is the probability of observing a single week of 10 or more, when I have been monitoring the series for 2 years”, with 52 weeks per year. You would adjust the probability for monitoring the trends for multiple weeks over time:

> p <- ppois(10-1,3.1)
> 1 - p^(52*2)
> [1] 0.1356679

So the probability of observing 10 or more in a single week over a 2 year period is closer to 14%. This multiple comparison issue is more extreme when you consider a sliding window – so can count events that occur in a span of a week, but not all necessarily in your pre-specified Monday to Sunday time period. So maybe you observed 10 in a week span that goes from Wednesday to Tuesday. What is the probability of observing 10 in that ad-hoc week time period over the 3 year monitoring period? This often comes up in news articles, see this post by David Spiegelhalter on pedestrian deaths for an example.

I have added in the Naus (1982) approximate statistic to calculate this in my ptools R package – scanw. If you install the latest version of ptools from github you can run for yourself:

> # R code
> library(devtools)
> install_github("apwheele/ptools") # get most recent
> library(ptools)
>
> # example given above
> scanw(52*2,1,3.1,10)

Which prints out [1] 0.5221948. So adding in the sliding window considerably ups the probability of observing a large clump of events.

I don’t think this is so useful from a crime analyst perspective, moreso from a journalistic perspective ‘oh we saw this number recently, is that anomalous’. If you are actively monitoring crime stats I would suggest you use the stats I describe in Wheeler (2016) to identify current outliers given fixed time intervals from the start. (And this approximation is for Poisson data. Overdispersed will have a higher probability.)

And for reference as well, Prieto-Curiel et al. (2023) have an approach that examines the cumulative sum. I’ve debated on doing that in a control chart style framework as well, but instead of just cumsum(counts), do cumsum(counts - expected). I don’t know how people effectively reset the cumulative charts though and effectively deal with seasonality.

I think my approach in Wheeler (2016) is better to identify anomalous trends right now, the Prieto-Curiel approach is still examining historical data and looking for breaks.

References

Aberdein, J., & Spiegelhalter, D. (2013). Have London’s roads become more dangerous for cyclists? Significance, 10(6), 46-48.
Guimaraes, P. (2005). A simple approach to fit the beta-binomial model. The Stata Journal, 5(3), 385-394.
Naus, J.I. (1982). Approximations for distributions of scan statistics. Journal of the American Statistical Association, 77, 177-183.
Prieto-Curiel, R., Walther, O., & Davies, E. (2023). Detecting trends and shocks in terrorist activities. PLoS one, 18(9), e0291514.
Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
Wheeler, A.P. (2023). ptools: Tools for Poisson Data. On 10/18/2023.
Young-Xu, Y., & Chan, K. A. (2008). Pooling overdispersed binomial data to estimate event rate. BMC medical research methodology, 8, 1-12.

3 Comments

by Andy Wheeler on October 18, 2023 • Permalink

Posted in Crime Analysis, data science, Python, R

Tagged optimize, Poisson, ptools

Posted by Andy Wheeler on October 18, 2023

https://andrewpwheeler.com/2023/10/18/fitting-beta-binomial-in-python-poisson-scan-stat-in-r/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 397 other subscribers
aoristic big-data cartography census choropleth citeulike color consulting cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning ESRI excel flow-data folium geocoding github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing linear programming logistic-regression machine-learning MACRO mapping matplotlib meta network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly scraping seaborn shootings simulation small-multiples social-media social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping
Top Posts & Pages
Stack Exchange

All posts in category data science

Harmweighted Hotspots

Using ESRI Python API

Making a project and first commit

Branches

Merge Conflicts

References

References

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange