Security issues with sending ChatGPT sensitive data

Part of my job as a data scientist is to be a bridge for lay-people interested in applying artificial intelligence and machine learning to their particular applications. Most quant people with a legit background will snicker at the term “artificial intelligence” – it is a buzzword for sure, but it doesn’t matter really. People have potential applications they need help with, and various statistical and optimization techniques can help.

Given the popularity of ChatGPT and other intelligent chatbots, I figured it would be worthwhile articulating the potential security issues with these technologies in criminal justice and healthcare domains. In particular, you should not send sensitive information in internet chatbot prompts. Examples of this include:

a crime analyst inputting incident narratives (that include names) and asking a chatbot to summarize them
a clinical coder inputting hospital notes and asking for the relevant billing codes
a business analyst inputting text from a set of slides, and asking ChatGPT to edit for grammar

The first two examples should be pretty clear as to why they are sensitive – they contain obviously sensitive and personally identifiable data. The last example is related to intellectual property leakage, which is more fuzzy, but as a general piece of advice: if it is not OK to post publicly for everyone to see on the internet, you should not put it into a prompt. (So crime analysts talking about crime trends is probably OK, since that is already public info, but a business analyst with your pitch deck for internal business applications is probably not.)

Why can’t I send ChatGPT sensitive information?

So the way many online APIs work (including ChatGPT) is this:

You go to a website and input information into a web form
This data gets posted to a web endpoint (someone else’s computer)
Someone else’s computer takes that input and does something with that data
That other computer sends information back to your computer
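To make the steps above concrete, here is a minimal sketch of the request a web form builds behind the scenes. The endpoint URL and the payload shape are made up for illustration (this is not OpenAI’s actual API), and the request is only constructed, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape, for illustration only;
# this is not OpenAI's actual API.
ENDPOINT = "https://chatbot.example.com/api/chat"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request a web form would submit."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Summarize: John Doe was arrested at 123 Main St.")
# Everything typed into the prompt sits in the request body, which is
# what would leave your machine and land on someone else's computer.
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The point is just that the full text of the prompt, names and all, is the payload that gets posted.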

Here is a diagram of that flow:

So there are two potential attack vectors in this diagram. The first is the arrows sending data to/from OpenAI’s computer. Someone could potentially intercept that data. This is not really a huge issue as stated, as the data is likely encrypted in transit. The second, and more important issue, is that the red OpenAI computer now has your sensitive data cached in some capacity.

If the red computer becomes compromised it can cause issues. This is not hypothetical; OpenAI has had issues of leaking sensitive information to other users. This is a computer glitch – bad but fixable. It is a risk you should be aware of though.

A more important issue though is that, under the licensing terms I am aware of, they can use your conversations to improve the product. To my current understanding this is very bad, as your conversations can be leaked to third parties via prompts if they are updating models with your conversations downstream.

This is even worse than, say, Microsoft being able to read your emails – it would be like a third, non-Microsoft party potentially becoming privy to some of your emails. For example, say a crime analyst in Raccoon City inputted crime incident narratives like I said in my prior example. Then I ask ChatGPT “Give me an example crime incident narrative”, and it outputs narratives very similar to the ones the Raccoon City crime analyst previously put into ChatGPT. This is a feature under the current licensing, not a bug.

Let me know in the comments if they are offering paid tiers for the “don’t use my data for training and it is always encrypted and we can’t see it” use case (I don’t know why they do not offer that). Also they would need to meet particular HIPAA standards for medical data, and CJIS standards for CJ data, to be in security compliance for these example applications.

Now it is important to discuss other chatbots, which are often just calling OpenAI under the hood. The data flow diagram then looks like this:

The attack vectors are essentially the same, just doubled; now we have two computers instead of one that are potential vulnerabilities.

Again here the issue is now two different people have your data cached in some capacity (the blue computer and the red computer). We have people making new services all the time now (the blue computers), that are just wrappers on OpenAI. Now you could have your data leaked by the blue computer, in addition to the problems with leaking in OpenAI.

The solution is local hosting, but local hosting is hard

OpenAI is to be commended for making a quality product – its very easy to use APIs are what make having wrapper services on top of it so easy (hence these many chatbot APIs). From a security standpoint though, you just need to do your due diligence now with two (or more) services when using these secondary tools, not just one. There will be malicious apps (so the blue computer is intentionally a bad actor), and there will be cases where the blue computer is compromised (so not intended to be malicious, but the people running the blue computer messed up).

Given that OpenAI, as far as I am aware, doesn’t have the necessary licensing to prevent info leakage, nor the more specific security clearances, the solution like I said is to self host a model. Self hosting here means instead of sending data to the red OpenAI computer, the flow stays entirely in the single black computer you own, or you have your own server (so a second black computer that speaks to the first black computer).

There are open source and freemium models that are reasonable competitors. But it is painful to self host these models. For neophytes, the way these language models work is that they take your text input and turn the text into a set of 1,000s of numbers. They then feed those 1,000s of numbers into a model with billions of parameters to get the final output. You can just think of it as doing several billion mathematical operations, each of which you could individually do on your hand-held calculator.

This takes a computer with a large amount of memory and a GPU to do anything that doesn’t take hours. So self hosting a smaller batch process is maybe doable for a normal person or business, but a live chatbot for even one person is hard (let alone a chatbot for multiple people to use at the same time).
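As a rough back-of-the-envelope sketch of why memory is the bottleneck: just holding the model weights takes (number of parameters) x (bytes per parameter). The parameter counts below and the 2 bytes per parameter (16-bit weights) are my assumptions for illustration, not figures for any particular model, and the calculation ignores the extra memory needed while actually running the model.

```python
def model_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (in GB, counting 1e9 bytes) to hold the weights alone."""
    return n_params * bytes_per_param / 1e9

# Assumed model sizes, for illustration only
for n_params in (7e9, 70e9):
    gb = model_memory_gb(n_params)
    print(f"{n_params / 1e9:.0f}B parameters -> roughly {gb:.0f} GB of weights")
```

Even the smaller assumed size lands above the memory of many consumer graphics cards, which is why this is painful for a normal person or business.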

Several large companies (including OpenAI) are currently even using up the majority of cloud infrastructure that has machines that can host and run these models, so even if you have money to pay AWS for one of their large GPU computers (it is expensive, think 5 digit costs per month), you maybe can’t even get a slot to get one of those cloud resources. And it is questionable how many people can even use that single machine.

I think eventually OpenAI will solve some of these security issues, and offer special paid tiers to accommodate use cases in healthcare and CJ. But until that happens, please do not post sensitive data into ChatGPT.

by Andy Wheeler on August 18, 2023 • Permalink

Posted in Crime Analysis, data science, website

Tagged ChatGPT, security

https://andrewpwheeler.com/2023/08/18/security-issues-with-sending-chatgpt-sensitive-data/

Querying OSM places data in python

For updates on other blogs, we have:

CRIME De-Coder, Using data to establish reasonableness in premises liability cases, I go over ways to make premises liability claims more empirically rigorous at the reasonableness stage.
ASEBP, Cost-benefit analysis of Gun Shot Detection Tech, I estimate that GSD will save 1 life per 100 gun shot victims, and other crime reduction benefits of GSD are pretty weak sauce.

For a quick post here, Simon Willison has a recent post on using duckdb to query OSM data made available by the Overture foundation. I have written in the past about scraping places data (gas stations, stores, etc., often used in spatial crime analysis) from Google (or other sources), so this is a potentially free source.

So I was interested to check out the data. It is quite easy to download given the ability to query parquet data files online. Simon in his post said it was taking a while though, and this example downloading data for Raleigh took around 25 minutes. So no good for a live API, but fine for a batch job.

This python is simple enough to just embed in a blog post:

import duckdb
import pandas as pd
from datetime import datetime
import requests

# setting up in memory duckdb + extensions
db = duckdb.connect()
db.execute("INSTALL spatial")
db.execute("INSTALL httpfs")
db.execute("""LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';""")

# Raleigh bound box from ESRI API
ral_esri = "https://maps.wakegov.com/arcgis/rest/services/Jurisdictions/Jurisdictions/MapServer/1/query?where=JURISDICTION+IN+%28%27RALEIGH%27%29&returnExtentOnly=true&outSR=4326&f=json"
bbox = requests.get(ral_esri).json()['extent']

# check out https://overturemaps.org/download/ for new releases
places_url = "s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*"
query = f"""
SELECT
  *
FROM read_parquet('{places_url}')
WHERE
  bbox.minx > {bbox['xmin']}
  AND bbox.maxx < {bbox['xmax']} 
  AND bbox.miny > {bbox['ymin']} 
  AND bbox.maxy < {bbox['ymax']}
"""

# Took me around 25 minutes
print(datetime.now())
res = pd.read_sql(query,db)
print(datetime.now())

And this currently queries 29k places in Raleigh. Places can have multiple categories, so here I just slice out the main category and check it out:

def extract_main(x):
    return x['main']

res['main_cat'] = res['categories'].apply(extract_main)

res['main_cat'].value_counts().head(30)

And this returns

>>> res['main_cat'].value_counts().head(30)
beauty_salon                         1291
real_estate_agent                     657
landmark_and_historical_building      626
church_cathedral                      567
community_services_non_profits        538
professional_services                 502
real_estate                           452
hospital                              405
automotive_repair                     396
dentist                               350
lawyer                                316
park                                  308
insurance_agency                      298
public_and_government_association     288
spas                                  265
financial_service                     261
gym                                   260
counseling_and_mental_health          256
religious_organization                240
car_dealer                            214
college_university                    185
gas_station                           179
hotel                                 170
contractor                            169
pizza_restaurant                      167
barber                                161
shopping                              160
grocery_store                         160
fast_food_restaurant                  160
school                                158
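One thing to watch with the `extract_main` helper is that it assumes every row has a `categories` entry with a `main` key. I have not checked whether that always holds in the Overture data, so here is a defensive sketch in plain Python (no pandas needed to run it); the example rows are made up.

```python
from collections import Counter

def extract_main_safe(categories):
    """Return the 'main' category, or None when it is missing."""
    if isinstance(categories, dict):
        return categories.get("main")
    return None

# Made-up rows standing in for res['categories']
example = [
    {"main": "beauty_salon", "alternate": ["spas"]},
    {"main": "gas_station", "alternate": None},
    None,                      # a place with no category info
    {"main": "beauty_salon"},
]
counts = Counter(extract_main_safe(c) for c in example)
print(counts.most_common())
```

With the dataframe above, this would be used the same way as the original helper, `res['categories'].apply(extract_main_safe)`.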

I can’t say anything about the coverage of this data. Looking near my house it appears pretty well filled in. There are additional pieces of info in the OSM data as well, such as a confidence score.

So definitely a ton of potential to use that as a nice source for reproducible crime analysis (it probably has the major types of places most people are interested in looking at). But I would do some local checks for your data before wholesale recommending using the OpenStreetMap data over an official business directory (if available – but that may not include things like ATMs) or Google Places API data (but this is free!)

by Andy Wheeler on August 14, 2023 • Permalink

Posted in Crime Analysis, Python

Tagged crime-generator, osm

https://andrewpwheeler.com/2023/08/14/querying-osm-places-data-in-python/

Downloading Police Employment Trends from the FBI Data Explorer

The other day on the IACA forums, an analyst asked about comparing her agency’s per-capita rate for sworn/non-sworn employees to other agencies. This is data available via the FBI’s Crime Data Explorer. Specifically they have released a dataset of employment rates, broken down by various agencies, over time.

The Crime Data Explorer to me is a bit difficult to navigate, so this post is going to show using the API to query the data in python (maybe it is easier to get via direct downloads, I am not sure). So first, go to that link above and sign up for a free API key.

Now, in python, the API works via asking for a specific agency’s ORI, as well as date ranges. (You can do a query for national and overall state as well, but I would rarely want those levels of aggregation.) So first we are just going to grab all of the agencies across the 50 states plus DC. This runs fairly fast, only takes a few minutes:

import pandas as pd
import requests

key = 'Insert your key here'

state_list = ("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
              "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
              "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
              "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI",
              "SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY","DC")

# Looping over states, getting all of the ORIs
fin_data = []
for s in state_list:
    url = f'https://api.usa.gov/crime/fbi/cde/agency/byStateAbbr/{s}?API_KEY={key}'
    data = requests.get(url)
    fin_data.append(pd.DataFrame(data.json()))

agency = pd.concat(fin_data,axis=0).reset_index(drop=True)

And the agency dataframe has just a few shy of 19k ORIs listed. Unfortunately this does not have much else associated with the agencies (such as the most recent population). It would be nice if this list had population counts (so if you just wanted to compare yourself to other similar size agencies), but alas it does not. So the second part here – scraping all 18,000+ agencies – takes a bit (let it run overnight).

# Now grabbing the full employment data
ystart = 1960   # some have data going back to 1960
yend = 2022
emp_data = []

# try/except, as some of these can fail
for i,o in enumerate(agency['ori']):
    print(f'Getting agency {i+1} out of {agency.shape[0]}')
    url = ('https://api.usa.gov/crime/fbi/cde/pe/agency/'
          f'{o}/byYearRange?from={ystart}&to={yend}&API_KEY={key}')
    try:
        data = requests.get(url)
        emp_data.append(pd.DataFrame(data.json()))
    except Exception:
        print(f'Failed to query {o}')

emp_pd = pd.concat(emp_data).reset_index(drop=True)
emp_pd.to_csv('EmployeePoliceData.csv',index=False)

And that will get you 100% of the employee data on the FBI data explorer, including data for 2022.

To plug my consulting firm here, this is something that takes a bit of work. I pared this code example down to be quite minimal, but if you have longer running scraping jobs, you want to periodically save results and have the code be able to run from the last save point. So if you scrape 1000 agencies and your internet goes out, you don’t want to have to start from 0, you want to start from the last point you left off.
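To give a flavor of the save-and-resume idea, here is a minimal sketch using only the standard library. The function names, the JSON checkpoint file, and the `save_every` parameter are all my own for illustration; `fetch` stands in for whatever function queries a single agency (e.g. one wrapping the `requests.get` call above).

```python
import json
import os

def _save(results, path):
    """Write to a temp file, then swap it in, so a crash mid-write keeps the old file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(results, f)
    os.replace(tmp, path)

def scrape_with_checkpoints(ids, fetch, path, save_every=100):
    """Fetch each id, saving progress to `path` so a rerun resumes.

    `ids` should be strings (JSON keys are strings), and `fetch` takes
    an id and returns JSON-serializable data.
    """
    results = {}
    if os.path.exists(path):
        with open(path) as f:
            results = json.load(f)  # ids already scraped in an earlier run
    done_since_save = 0
    for i in ids:
        if i in results:
            continue  # finished before the last save, skip
        try:
            results[i] = fetch(i)
        except Exception as e:
            # not stored, so a later rerun will retry this id
            print(f"Failed to query {i}: {e}")
            continue
        done_since_save += 1
        if done_since_save >= save_every:
            _save(results, path)
            done_since_save = 0
    _save(results, path)
    return results
```

Rerunning the same call after a failure picks up with whatever ids are not yet in the checkpoint file, and ids that errored out get retried.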

If interested in other tutorials like this, I suggest you check out two of my books:

Each can be purchased in either paperback or epub versions worldwide from my Crime De-Coder store.

If that is something you need, it makes sense to send me an email to see if I can help. For that and more, check out my website, crimede-coder.com:

by Andy Wheeler on July 29, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, Python

Tagged scraping

https://andrewpwheeler.com/2023/07/29/downloading-police-employment-trends-from-the-fbi-data-explorer/

Age-Period-Cohort graphs for suicide and drug overdoses

When I still taught advanced research methods for PhD students, I debated on having a section on age-period-cohort (APC) analysis. Part of the reason I did not bother with that though is there were no good open source datasets (that I was aware of). A former student asking about APC analysis, as well as a recent NBER working paper on suicide rates (Marcotte & Hansen, 2023) brought it to mind again.

I initially had plans to do more modelling examples, but I decided on just showing off the graphs I generated. The graphs themselves I believe are quite informative.

So I went and downloaded USA mortality rates for suicides and drug overdoses, spanning 1999-2022 for suicide and 1999-2021 for drug overdoses. Here is the data and R code to recreate the graphs in this post.

To follow along here in brief, we have a dataset of death and population counts, broken down by year and age:

# Age-Period-Cohort plots
library(ggplot2)

# Read in data
suicide <- read.csv('Suicides.csv')

# Calculate Rate & Cohort
suicide$Cohort <- suicide$Year - suicide$Age
suicide$Rate <- (suicide$Deaths/suicide$Population)*100000

# Suicide only 11-84
suicide <- suicide[suicide$Age >= 11,]
head(suicide)

And this produces the output:

> head(suicide)
   Age Year Deaths Population Cohort      Rate
16  11 1999     22    4036182   1988 0.5450696
17  11 2000     24    4115093   1989 0.5832189
18  11 2001     24    4344913   1990 0.5523701
19  11 2002     22    4295720   1991 0.5121377
20  11 2003     15    4254047   1992 0.3526054
21  11 2004     18    4207721   1993 0.4277850
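As a quick sanity check on the two derived columns, here is the same arithmetic in plain Python (a sketch mirroring the R code, not a replacement for it), applied to the first row of the output above.

```python
def cohort(year: int, age: int) -> int:
    """Birth cohort is simply the calendar year minus the age."""
    return year - age

def rate_per_100k(deaths: int, population: int) -> float:
    """Deaths per 100,000 population."""
    return deaths / population * 100_000

# First row of the output above: age 11 in 1999, 22 deaths, 4,036,182 population
print(cohort(1999, 11))
print(round(rate_per_100k(22, 4036182), 7))
```

These should line up with the Cohort (1988) and Rate (0.5450696) columns in that first row.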

A few notes here. 1) I limited the CDC Vital stats data to 1999 onward, because in the Wonder dataset pre-1999 you can’t get individual year-age breakdowns, you need to do 5 year age bins. This can cause issues where you need to age-adjust within those bins (Gelman & Auerbach, 2016), which should be less of a problem with single year breakdowns. So I would go back further were it not for that. 2) When breaking down to individual years, the total count of suicides per age bracket is quite small. Initially I was skeptical of Marcotte & Hansen’s (2023) claims of LGBTQ subgroups potentially accounting for increased trends among young people (I just thought that group was too small for that to make sense), but looking at the counts I don’t think that is the case.

When I think about age-period-cohort analysis, my mind goes age effects > period effects > cohort effects. I think people often mix up cohort effects with things that are actually age effects. (And also generation labels are not real.) In criminology, the age-crime-curve was established back in the 1800’s by Quetelet.

So I focus on graphing the age curve, and look at deviations from that to try to visually identify period effects or cohort effects. Here is the plot to look at each of the age curves, broken down by year.

ap <- ggplot(data=suicide, aes(x = Age, y = Rate, color=Year, group=Year)) + 
             geom_line() +
             scale_colour_distiller(palette = "PuOr") +
             scale_x_continuous(breaks=seq(10,80,10)) +
             scale_y_continuous(breaks=seq(0,30,5)) + 
             labs(x='Age',y=NULL,title='Suicide Rate per 100,000',caption="USA Rates via CDC Wonder")
ap

When using diverging color ramps to visualize a continuous variable, you get a washed out effect in the middle. So I am not sure the best color ramp here, but it does provide a nice delineation and gradual progression from the curve in the early 2000’s compared to the suicide curve in 2022. (Also spot the one outlier year, it is age 75 for the “provisional” 2022 counts. I leave it in as it is a good showcase for how plots can help spot bad data.)

On the blog the graph will be tinier, so open it up in a new tab on your desktop computer to get a good look at the full size image.

Here looking at the graph you can see two things other researchers looking at similar data have discussed. In the early 2000’s, you had a gradual increase from 20’s to the peak suicide rate at mid 40’s. More recent data has shifted that peak to later ages, more like peak 55. Case & Deaton (2015) discussing deaths of despair (of which suicide is a part) focussed on this shift, and noted that females in this age category increased at a higher rate relative to males.

Marcotte & Hansen (2023) focus on the younger ages. So in the year 2000, the age-suicide curve was a gradual incline from ages early 20’s until the peak. Newer cohorts though show steeper inclines in the earlier ages, so the trend from ages 20-60 is flatter than before.

Period effects in these charts will look like the entire curve is the same shape, and it is just shifted up and down. (It may be better to graph these as log rates, but keeping on the linear scale for simplicity.) We have a bit of a shape change though, so these don’t rule out cohort effects.

Here is the same plot, but grouping by cohorts instead of years. So the age-suicide curve is indexed to the birth year for an individual:

cp <- ggplot(data=suicide, aes(x = Age, y = Rate, color=Cohort, group=Cohort)) + 
             geom_line() +
             scale_colour_distiller(palette = "Spectral") +
             scale_x_continuous(breaks=seq(10,80,10)) +
             scale_y_continuous(breaks=seq(0,30,5)) +
             labs(x='Age',y=NULL,title='Suicide Rate per 100,000',caption="USA Rates via CDC Wonder")
cp

My initial cheeky thought (not that there aren’t enough ways to do APC models already), was to use mixture models to identify discrete cohorts. Something along the lines of this in the R flexmix package (note this does not converge):

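library(flexmix)
library(splines) # for bs()
knot_loc <- c(20,35,50,65) # for ages
# knots must be passed by name (the second positional argument of bs is df),
# and the fixed argument belongs to FLXMRglmfix, not FLXMRglm
model <- stepFlexmix(cbind(Deaths, Population - Deaths) ~ bs(Age, knots = knot_loc) | Cohort, 
                     model = FLXMRglmfix(family = "binomial", fixed = ~Year),
                     data = suicide, k = 3)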

But there is an issue with this when looking at the cohort plot, you have missing data for cohorts – to do this you would need to observe the entire age-curve for a cohort. There may be a way to estimate this using latent class models in Stata (and fixing some of the unidentified spline coefficients to a fixed value), but to me just looking at the graphs I think is all I really care about. You could maybe say the orange cohorts in the late 90’s are splitting off, but I think that is consistent with period effects. (And is also a trick of the colors I used in the plot.)

You could do mixtures for the year plots, see some of the work by Elena Erosheva (Erosheva et al., 2014), but that again just isn’t how I think about APC analysis.

Doing this same exercise for drug overdose rates (which I note can overlap with suicide – you can commit suicide via intentionally taking too many drugs), we can clearly see the dramatic rise in recent years. We can also see the same trends in earlier ages now being peak, but also increases and shifts to older ages.

The cohort plot here looks like a Spinosaurus crest:

Which I believe is more consistent with (very strong) period effects, not so much cohort effects. Drug overdoses are increasing across both younger and older cohorts.

Nerd Notes

These datasets don’t have covariates, which you would need to use the APC method in Spelman (2022) (it uses covariates to estimate period effects). I am not so sure that is the best approach to APC decomposition, but it is horses for courses.

What I wish is that the CDC distributed the vital statistics data at the micro level (where each row is a death, with all of the covariates), along with a matching variable dataset of the micro level American Community Survey and the weights. That doesn’t solve the APC issue with identifying the different effects, but makes it easier to do more complicated modelling, e.g. I could fit models or generate graphs for age-gender differences more easily, decompose different death types, etc.

Final nerd note is about forecasting mortality trends. While I am familiar with the PCA-functional data approach advocated by Rob Hyndman (Hyndman & Ullah, 2007), I don’t think that will do very well with this data. I am wondering if doing some type of multi-level GAM model, and then doing short term extrapolation of the period effect, would work better (check out Gavin Simpson’s posts on multi-level smooths, 1, 2, 3).

So maybe something like:

library(mgcv)
smooth_model <- gam(cbind(Deaths, Population - Deaths) ~ s(Year) + s(Age,by=Cohort), 
                    family = binomial("logit"),
                    data = suicide)

Or maybe just use s(Age,Year) and not worry about the cohort effect. Caveat emptor about this model, this is just my musings, I have not studied it in depth to make sure it behaves well (although in a quick check R does not complain when fitting it).

References

Case, A., & Deaton, A. (2015). Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proceedings of the National Academy of Sciences, 112(49), 15078-15083.
Erosheva, E. A., Matsueda, R. L., & Telesca, D. (2014). Breaking bad: Two decades of life-course data analysis in criminology, developmental psychology, and beyond. Annual Review of Statistics and Its Application, 1, 301-332.
Gelman, A., & Auerbach, J. (2016). Age-aggregation bias in mortality trends. Proceedings of the National Academy of Sciences, 113(7), E816-E817.
Hyndman, R. J., & Ullah, M. S. (2007). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51(10), 4942-4956.
Marcotte, D. E., & Hansen, B. (2023). The Re-Emerging Suicide Crisis in the US: Patterns, Causes and Solutions (No. w31242). National Bureau of Economic Research.
Spelman, W. (2022). How cohorts changed crime rates, 1980–2016. Journal of Quantitative Criminology, 38(3), 637-671.

by Andy Wheeler on July 22, 2023 • Permalink

Posted in data science, Data Visualization, ggplot2, R

Tagged APC, mortality

https://andrewpwheeler.com/2023/07/22/age-period-cohort-graphs-for-suicide-and-drug-overdoses/

Too relaxed? Naive Bayes does not improve recidivism forecasting in the NIJ challenge

So the paper Improving Recidivism Forecasting With a Relaxed Naïve Bayes Classifier (Lee et al., 2023), recently published in Crime & Delinquency, has incorrect results. Note I am not sandbagging on the authors, I reviewed this paper for JQC and Journal of Criminal Justice, so I have given the authors this same feedback already (multiple times!). The authors however did not correct their results, and just journal shopped and published the wrong findings.

I have replication code here to review. (Note I initially made a mistake in my code replication: instead of calculating p(x|y), I calculated p(y|x) by accident, see this older code I shared in my prior reviews. But I was still correct in my assertion that Lee’s results were wrong.)

So the main thing that made me go to this effort is that the authors report unbelievable results. They report Brier scores for females (Round 1) of 0.104 and for males of 0.159 – these scores blow the competition out of the water. The leaderboard was 0.15 for females and 0.19 for males. Note how I don’t list those to the third decimal – you needed to go down that far to see the differences between the teams. Lee also reports unbelievably low Brier scores for the alternative logit and random forest models – their results just on their face are not believable.
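For readers not familiar with the metric, the Brier score is just the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. Here is a minimal sketch of the standard definition with made-up toy data (this is not the replication code, and the numbers have nothing to do with the NIJ data):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Toy data (made up): four people, two of whom recidivated
outcomes = [0, 1, 0, 1]
sharp = [0.1, 0.9, 0.2, 0.8]    # confident predictions in the right direction
flat = [0.5, 0.5, 0.5, 0.5]     # predicting the base rate for everyone

print(brier_score(sharp, outcomes))
print(brier_score(flat, outcomes))
```

Because the score is an average of squared errors over everyone in the sample, a large improvement requires much sharper predictions for a large share of individuals, not just a handful.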

If the authors really believe their results this kind of sucks for them they did not participate in the NIJ challenge, they would have won more than $150,000! But I am pretty sure they are miscalculating their Brier scores somewhere. My replication code shows them in the same ballpark as everyone else, but they would not have made the leaderboard. Here are my estimates of what their Brier scores should be reported as (the Brier column below in the two tables):

Folks can go and look at their paper and their set of spreadsheets in the supplemental material – I have posted not many more than 50 lines of (non-comment) python code that replicates their regression model coefficients and shows their Brier scores are wrong though. (And subsequently any points Lee et al. 2023 make about fairness are thus wrong as well.)

NIJ probably released papers at some point, but if you want to see other folks’ discussion, there is Circo & Wheeler (2022) (for mine and Gio’s results for team MCHawks), and Mohler & Porter (2021) for team PASDA.

I may put in the slate sometime to discuss naive Bayes (and other categorical encoding schemes). It is not a bad idea for data with many categories, but for this NIJ data there just isn’t that much to squeeze out of the data. So any future work will be unlikely to dramatically improve upon the competition results (it is difficult to overfit this data). Again given my analysis here, I am pretty sure a valid data analysis (not peeking) at best will “beat” the competition results in the 3rd decimal place (if they can improve at all).

Now part of the authors’ argument is that this method (relaxed naive Bayes) results in simpler interpretations. Typically people interpret “simple” models in terms of the end results, e.g. having a simple checklist of integer weights. The more I deal with predictive models though, I think this is maybe misguided. You could also interpret “simple” in terms of the code used for how someone derived the weights (and evaluated the final metrics). This is important when auditing code that others have written, as you will ultimately take the code and apply it to your data.

I think this “simpler to estimate the same results” is probably more important for scientists and outside groups wanting to verify the integrity of any particular machine learning model than “simple end result weights”. Otherwise scientists can make up results and say my method is better. Which is simpler I suppose, but misses the boat a bit in terms of why we want simple models to begin with.

References

Circo, G. M., & Wheeler, A. P. (2022). An Open Source Replication of a Winning Recidivism Prediction Model. International Journal of Offender Therapy and Comparative Criminology, Online First.
Lee, Y.J., O, S.H., & Eck, J.E. (2023). Improving recidivism forecasting with a relaxed naive Bayes classifier. Crime & Delinquency, Online First.
Mohler, G., & Porter, M.D. (2021). A note on the multiplicative fairness score in the NIJ recidivism forecasting challenge. Crime Science, 10, 17.

by Andy Wheeler on July 17, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, Python

Tagged prediction, relaxed-bayes, replication

https://andrewpwheeler.com/2023/07/17/too-relaxed-naive-bayes-does-not-improve-recidivism-forecasting-in-the-nij-challenge/

Some notes on synthetic control and Hogan/Kaplan

This will be a long one, but I have some notes on synthetic control and the back-and-forth between two groups. So first, if you aren’t familiar, Tom Hogan published an article on the progressive District Attorney (DA) in Philadelphia, Larry Krasner, in which Hogan estimates that Krasner’s time in office contributed to a large increase in the number of homicides. The control homicides are estimated using a statistical technique called synthetic control, in which you derive estimates of the trend in homicides to compare Philly to based on a weighted average of comparison cities.

Kaplan and colleagues (KNS from here on) then published a critique of various methods Hogan used to come up with his estimate. KNS provided estimates using different data and a different method to derive the weights, showing that Philadelphia did not have increased homicides post Krasner being elected. For reference:

Part of the reason I am writing this is if people care enough, you could probably make similar back and forths on every synth paper. There are many researcher degrees of freedom in the process, and in turn you can make reasonable choices that lead to different results.

I think it is worthwhile digging into those in more detail though. For a summary of the method notes I discuss for this particular back and forth:

Researchers determine the treatment estimate they want (counts vs rates) – solvers misbehaving is not a reason to change your treatment effect of interest
The default synth estimator when matching on counts and pop can have some likely unintended side-effects (NYC pretty much has to be one of the donor cities in this dataset)
Covariate balancing is probably a red-herring (so the data issues Hogan critiques in response to KNS are mostly immaterial)

In my original draft I had a note that this post would not be in favor of Hogan nor KNS, but in reviewing the sources more closely, nothing I say here conflicts with KNS (and I will bring a few more critiques of Hogan’s estimates that KNS do not mention). So I can’t argue much with KNS’s headline that Hogan’s estimates are fatally flawed.

An overview of synthetic control estimates

To back up and give an overview of what synth is for general readers, imagine we have a hypothetical city A with homicide counts 10 15 30, where the 30 is after a new DA has been elected. Is the 30 more homicides than you would have expected absent that new DA? To answer this, we need to estimate a counterfactual trend – what the homicide count would have been in a hypothetical world in which a new progressive DA was not elected. You can see the city homicides increased the prior two years, from 10 to 15, so you may say “ok, I expected it to continue to increase at the same linear trend”, in which case you would have expected it to increase to 20. So the counterfactual estimated increase in that scenario is observed - counterfactual, here 30 - 20 = 10, an estimated increase of 10 homicides that can be causally attributed to the progressive DA.

Social scientists tend to not prefer to just extrapolate prior trends from the same location into the future. There could be widespread changes that occur everywhere that caused the increase in city A. If homicide rates accelerated in every city in the country, even those without a new progressive DA, it is likely something else is causing those increases. So say we compare city A to city B, and city B had a homicide count trend during the same time period of 10 15 35. Before the new DA in city A, cities A/B had the same pre-trend (both 10 15). In the post time period city B increased to 35 homicides. So if using city B as the counterfactual estimate, we have that the progressive DA reduced homicides by 5, again observed - counterfactual = 30 - 35 = -5. So even though city A increased, it increased less than we expected based on the comparison city B.
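
To make the arithmetic in these two toy examples concrete, here is a minimal Python sketch. The counts are the hypothetical city A and city B numbers from the text, and the helper name linear_extrapolation is only illustrative.

```python
# Homicide counts from the hypothetical example in the text
city_a = [10, 15, 30]   # the last value is the year after the new DA
city_b = [10, 15, 35]   # comparison city with the same pre-trend

def linear_extrapolation(pre):
    """Project one step ahead using the last observed year-over-year change."""
    return pre[-1] + (pre[-1] - pre[-2])

observed = city_a[-1]

# Counterfactual 1: extend city A's own pre-period trend
cf_trend = linear_extrapolation(city_a[:-1])
effect_trend = observed - cf_trend

# Counterfactual 2: use city B's post-period value
cf_city_b = city_b[-1]
effect_city_b = observed - cf_city_b

print(f"Own-trend counterfactual: {cf_trend}, estimated effect: {effect_trend:+d}")
print(f"City B counterfactual:    {cf_city_b}, estimated effect: {effect_city_b:+d}")
```

The same observed value gives an estimated increase of 10 or a decrease of 5, depending only on which counterfactual you pick.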

Note that this is not a hypothetical concern, it is a pretty basic one that you should always be concerned about when examining macro level crime data. There have been national level homicide increases over the time period when Krasner has been in office (Yim et al., 2020, and see this blog post for updates). U.S. city homicide rates tend to be very correlated with each other (McDowall & Loftin, 2009).

So even though Philly has increased in homicide counts/rates while Krasner has been in office, the question is whether those increases are higher or lower than we would expect. That is where the synthetic control method comes in: we don’t have a perfect city B to compare to Philadelphia, so we create our own “synthetic” counter-factual, based on a weighted average of many different comparison cities.

To make the example simple, imagine we have two potential control cities and homicide trends, city C0 0 30 20, and city C1 20 0 30. Neither looks like a good comparison to city A that has trends 10 15 30. But if we do a weighted average of C0 and C1, with the weights 0.5 for each city, when combined they are a perfect match for the two pre-treatment periods:

C0  C1 Average cityA
 0  20   10     10
30   0   15     15
20  30   25     30

This is what the synthetic control estimator does, although instead of giving equal weights it determines the optimal weights to match the pre-treatment time period given many potential donors. In real data for example C0 and C1 may be given weights of 0.2 and 0.8 to give the correct balance based on the prior to treatment time periods.
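
As a rough sketch of what "determining the optimal weights" means, the snippet below does a brute-force search over weights for the two toy donors in the table above. This is an illustration under simplifying assumptions (two donors, a grid in steps of 0.01), not how the R synth package solves the problem.

```python
# Counts from the table above; the first two periods are pre-treatment
c0 = [0, 30, 20]
c1 = [20, 0, 30]
city_a = [10, 15, 30]
n_pre = 2

def pre_period_sse(w):
    """Squared error in the pre-period for weights (w, 1 - w) on C0 and C1."""
    return sum((w * c0[t] + (1 - w) * c1[t] - city_a[t]) ** 2 for t in range(n_pre))

# Weights are non-negative and sum to 1, so one number w in [0, 1] describes them
grid = [i / 100 for i in range(101)]
best_w = min(grid, key=pre_period_sse)

# Apply the fitted weights to every period, including the post period
synthetic = [best_w * c0[t] + (1 - best_w) * c1[t] for t in range(len(city_a))]
effect = city_a[-1] - synthetic[-1]

print(f"weight on C0: {best_w}, weight on C1: {1 - best_w}")
print(f"synthetic series: {synthetic}, estimated effect: {effect}")
```

The search recovers the 0.5/0.5 weights, and the post-period comparison is 30 observed versus 25 synthetic.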

The fundamental problem with synth

The rub with estimating the synth weights is that there is no one correct way to estimate the weights – you have more numbers to estimate than data points. In the Hogan paper, he has 5 pre time periods, 2010-2014, and he has 82 potential donors (the 99 other largest cities in the US minus 17 with progressive prosecutors). So you need to learn 82 numbers (the weights) based on 5 data points.

Side note: you can also consider the covariates you match on as additional data points, although I will go into more detail on how matching on covariates is potentially a red-herring. Hogan I think uses an additional 5*3=15 time varying points (pop, cleared homicide, homicide clearance rates), and maybe 3 additional time invariant (median income, 1 prosecutor categorization, and homicides again!). So maybe he has 5 + 15 + 3 = 23 data points to match on (so same fundamental problem, 23 data points to learn 82 weights). I am just going to quote the full passage from Hogan (2022a) here where he discusses covariate matching:

The number of homicides per year is the dependent variable. The challenge with this synthetic control model is to use variables that both produce parallel trends in the pre-period and are sufficiently robust to power the post-period results. The model that ultimately delivered the best fit for the data has population, cleared homicide cases, and homicide clearance rates as regular predictors. Median household income is passed in as the first special predictor. The categorization of the prosecutors and the number of homicides are used as additional special predictors. For homicides, the raw values are passed into the model. Abadie (2021) notes that the underlying permutation distribution is designed to work with raw data; using log values, rates, or other scaling techniques may invalidate results.

This is the reason why replication code is necessary – it is very difficult for me to translate this to what Hogan actually did. “Special” predictors here are code words for the R synth package for time-invariant predictors. (I don’t know based on verbal descriptions how Hogan used time-invariant for the prosecutor categorization for example, just treats it as a dummy variable?) Also only using median income – was this the only covariate, or did he do a bunch of models and choose the one with the “best” fit (it seems maybe he did do a search, but doesn’t describe the search, only the end selected result).

I don’t know what Hogan did or did not do to fit his models. The solution isn’t to have people like me and KNS guess or have Hogan just do a better job verbally describing what he did, it is to release the code so it is transparent for everyone to see what he did.

So how do we estimate those 82 weights? Well, we typically have restrictions on the potential weights – such as the weights need to be non-negative numbers, and the weights should sum to 1. These are for a mix of technical and theoretical reasons. A technical reason is that keeping the weights from being too large can reduce the variance of the estimator. A theoretical reason is that we don’t want negative weights, as we don’t think there are bizarro comparison areas that have opposite world trends.

These are reasonable but ultimately arbitrary – there are many different ways to accomplish this weight estimation. Hogan (2022a) uses the R synth package, KNS use a newer method also advocated by Abadie & L’Hour (2021) (very similar, but tries to match to the closest single city, instead of weights for multiple cities). Abadie (2021) lists probably over a dozen different procedures researchers have suggested over the past decade to estimate the synth weights.

The reason I bring this up is because when you have a problem with 82 parameters and 5 data points, the problem isn’t “what estimator provides good fit to in-sample data” – you should be able to figure out an estimator that accomplishes good in-sample fit. The issue is whether that estimator is any good out-of-sample.
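
Here is a small sketch of why in-sample fit alone does not pin down the weights. C0 and C1 are the toy donors from earlier; C2 is a made-up third donor added purely for illustration. With three weights and only two pre-treatment points, several weight sets fit the pre-period perfectly but give different counterfactuals.

```python
# Treated city pre-period values from the toy example
city_a_pre = [10, 15]

# Donor series: [pre period 1, pre period 2, post period]
# C2 is hypothetical, it is not in the original example
donors = {
    "C0": [0, 30, 20],
    "C1": [20, 0, 30],
    "C2": [10, 15, 40],
}

def synth(weights):
    """Weighted average of the donor series for all three periods."""
    return [sum(weights[name] * series[t] for name, series in donors.items())
            for t in range(3)]

# Three valid weight sets (non-negative, sum to 1)
weights_1 = {"C0": 0.5, "C1": 0.5, "C2": 0.0}
weights_2 = {"C0": 0.0, "C1": 0.0, "C2": 1.0}
weights_3 = {"C0": 0.25, "C1": 0.25, "C2": 0.5}

for label, w in [("w1", weights_1), ("w2", weights_2), ("w3", weights_3)]:
    fitted = synth(w)
    print(label, "pre-period fit:", fitted[:2], "post-period counterfactual:", fitted[2])
```

All three match 10 and 15 exactly in the pre-period, yet the post-period counterfactuals are 25, 40, and 32.5. Only out-of-sample checks can tell you which, if any, to trust.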

Rates vs Counts

So besides the estimator used, you can break down 3 different arbitrary researcher data decisions that likely impact the final inferences:

outcome variable (homicide counts vs homicide per capita rates)
pre-intervention time periods (Hogan uses 2010-2014, KNS go back to 2000)
covariates used to match on

Let’s start with the outcome variable question, counts vs rates. So first, as quoted above, Hogan cites Abadie (2021) for saying you should prefer counts to rates, “Abadie (2021) notes that the underlying permutation distribution is designed to work with raw data; using log values, rates, or other scaling techniques may invalidate results.”

This has it backwards though – the researcher chooses whether it makes sense to estimate treatment effects on the count scale vs rates. You don’t goal switch your outcome because you think the computer can’t give you a good estimate for one outcome. So imagine I show you a single city over time:

        Y0    Y1    Y2
Count   10    15    20
Pop   1000  1500  2000

You can see although the counts are increasing, the rate is consistent over the time period. There are times I think counts make more sense than rates (such as cost-benefit analysis), but probably in this scenario the researcher would want to look at rates (as the shifting denominator is a simple explanation causing the increase in the counts).
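
A quick check of the table above, as a minimal sketch: the counts double while the per-capita rate does not move at all. The per 1,000 scaling is an arbitrary choice for readability.

```python
# Values from the single-city table above
counts = [10, 15, 20]
pops = [1000, 1500, 2000]

# Rate per 1,000 population in each period
rates_per_1000 = [1000 * c / p for c, p in zip(counts, pops)]

count_change = counts[-1] - counts[0]
rate_change = rates_per_1000[-1] - rates_per_1000[0]

print("rates per 1,000:", rates_per_1000)
print("change in counts:", count_change, "change in rate:", rate_change)
```

So the estimated "effect" on the count scale would be an increase of 10, while on the rate scale it is zero.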

Hogan (2022b) is correct in saying that the population is not shifting over time in Philly very much, but this isn’t a reason to prefer counts. It suggests the estimator should not make a difference when using counts vs rates, which just points to the problematic findings in KNS (that making different decisions results in different inferences).

Now onto the point that Abadie (2021) says using rates is wrong for the permutation distribution – I don’t understand what Hogan is talking about here. You can read Abadie (2021) for yourself if you want. I don’t see anything about the permutation inferences and rates.

So maybe Hogan mis-cited and meant another Abadie paper – Abadie himself uses rates for various projects (he uses per-capita rates in the 2021 cited paper, and Abadie et al. (2010) uses rates for another example), so I don’t think Abadie thinks rates are intrinsically problematic! Let me know if there is some other paper I am unaware of. I honestly can’t steelman any reasonable source where Hogan (2022a) came up with the idea that counts are good and rates are bad though.

Again, even if they were, it is not a reason to prefer counts vs rates, you would change your estimator to give you the treatment effect estimate you wanted.

Side note: Where I thought the idea with the problem with rates was going (before digging in and not finding any Abadie work actually saying there are issues with rates), was increased variance estimates with homicide data. So Hogan (2022a) estimates for the synth weights Detroit (0.468), New Orleans (NO) (0.334), and New York City (NYC) (0.198), here are those cities’ homicide rates graphed (spreadsheet with data + notes on sources).

You can see NO’s rate is very volatile, so is not a great choice for a matched estimator if using rates. (I have NO as an example in Wheeler & Kovandzic (2018), that much variance though is fairly normal for high crime not too large cities in the US, see Baltimore for example for even more volatility.) I could foresee someone wanting to make a weighted synth estimator for rates, either make the estimator a population weighted average, or penalize the variance for small rates. Maybe you can trick microsynth to do a pop weighted average out of the box (Robbins et al., 2017).

To discuss the Hogan results specifically, I suspect for example that NYC, a control city with high weight in the Hogan paper, which superficially may seem good (both large cities on the east coast), actually isn’t a very good control area considering the differences in homicide trends (either rates or counts) over time. (I am also not so sure about Hogan (2022a) describing NYC and New Orleans as “post-industrial” either. I mean this is true to the extent that all urban areas in the US are basically post-industrial, but they are not rust belt cities like Detroit.)

Here for reference are counts of homicides in Philly, Detroit, New Orleans, and NYC going back further in time:

NYC has such a crazy drop in the 90s, let’s use the post 2000 data that KNS used to zoom in on the graph.

I think KNS are reasonable here to use 2000 as a cut point – it is more empirically based (post crime drop), in that you could argue the 90’s are a “structural break”, and that homicides settled down in most cities around 2000 (but still typically had a gradual decline). Given the strong national homicide trends across cities though (here is an example I use for class, superimposing Dallas/NYC/Chicago), I think using data even back to the 60’s is easily defensible (more so than limiting to post 2010).

It depends on how strict you want to be whether you consider these 3 cities “good” matches for the counts post 2010 in Hogan’s data. Detroit seems a good match on the levels and ok match on trends. NO is ok match on trends. NYC and NO balance each other in terms of matching levels, NYC has steeper declines though (even during the 2010-2014 period).

The last graph though shows where the estimated increases from Hogan (2022a) come from. Philly went up and those 3 other cities went down from 2015-2018 (and had small upward bumps in 2019).

Final point in this section, be careful what you wish for with sparse weights and sum to 1 in the synth estimate. What this means in practice when using counts and matching on pop size, is that you need cities that are above and below Philly on those dimensions. So to get a good match on Pop, it needs to select at least one of NYC/LA/Houston (Chicago was eliminated due to having a progressive prosecutor). To get a good match on homicide counts, it also has to pick at least one city with more homicides per year as well, which limits the options to New York and Detroit (LA/Houston have lower overall homicide counts than Philly).

You can’t do the default Abadie approach for NYC for example (matching on counts and pop) – it will always have a bad fit when using comparison cities in the US as the donor pool. You either need to allow the weights to sum to larger than 1, or the lasso approach with an intercept is another option (so you only match on trend, not levels).
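
The need for donors above and below the treated city follows directly from the weight restrictions: a weighted average with non-negative weights that sum to 1 can never land outside the range of the donors. A minimal sketch, using made-up counts (not the real city figures) and a hypothetical helper best_convex_fit:

```python
def best_convex_fit(target, donor_low, donor_high, steps=100):
    """Search weights (w, 1 - w), both non-negative, for the closest match to target."""
    best = None
    for i in range(steps + 1):
        w = i / steps
        fitted = w * donor_low + (1 - w) * donor_high
        if best is None or abs(fitted - target) < abs(best[1] - target):
            best = (w, fitted)
    return best

# Made-up homicide count for the treated city
treated = 300

# Case 1: one donor below and one donor above the treated city
w_straddle, fit_straddle = best_convex_fit(treated, 200, 400)

# Case 2: both donors below the treated city
w_below, fit_below = best_convex_fit(treated, 150, 250)

print(f"straddling donors: weight {w_straddle}, fitted {fit_straddle}")
print(f"donors both below: weight {w_below}, fitted {fit_below}")
```

In the second case the best the estimator can do is put all of the weight on the larger donor and still miss by 50. That is the situation a city like NYC is in on both counts and population.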

Because matching on trends is what matters for proper identification in this design, not levels, this is all sorts of problematic with the data at hand. (This is also a potential problem with the KNS estimator as well. KNS note though they don’t trust their estimate offhand, their reasonable point is that small changes in the design result in totally different inferences.)

Covariates and Out of Sample Estimates

For sake of argument, say I said Hogan (2022a) is bunk, because it did not match on “per-capita annual number of cheese-steaks consumed”. Even though on its face this covariate is nonsense, how do you know it is nonsense? In the synthetic control approach, there is no empirical, falsifiable way to know whether a covariate is a correct one to match on. There is no way to know that median income is better than cheese-steaks.

If you wish for more relevant examples, Philly has obviously more issues with street consumption of opioids than Detroit/NOLA/NYC, which others have shown relationships to homicide and has been getting worse over the time Krasner has been in office (Rosenfeld et al., 2023). (Or more simply social disorganization is the more common way that criminologists think about demographic trends and crime.)

This uncertainty in “what demographics to control for” is ok though, because matching on covariates is neither necessary nor sufficient to ensure you have estimated a good counter-factual trend. Abadie in his writings intended for covariates to be more like fuzzy guide-rails – they are qualitative things that you think the comparison areas should be similar on.

Because there is effectively an infinite pool of potential covariates to match on, I prefer the approach of simply limiting the donor pool a priori – Hogan limiting to large cities is on its face reasonable. Including other covariates is not necessary, and does not make the synth estimate more or less robust. Whether KNS used good or bad data for covariates is entirely a red-herring as to the quality of the final synth estimate.

Side note: I don’t doubt that Hogan got advice to not share data and code. It is certainly not normative in criminology to do this. It creates a bizarre situation though, in which someone can try to replicate Hogan by collating original sources, and then Hogan always comes back and says “no, the data you have are wrong” or “the approach you did is not exactly replicating my work”.

I get that collating data takes a long time, and people want to protect their ability to publish in the future. (Or maybe just limit their exposure to their work being criticized.) It is blatantly antithetical to verifying the scientific integrity of people’s work though.

Even if Hogan is correct that the covariates KNS used are wrong though, it is mostly immaterial to the quality of the synth estimates. It is a waste of time for outside researchers to even bother to replicate the covariates Hogan used.

So I used the idea of empirical/falsifiable – can anything associated with synth be falsifiable? Why yes it can – the typical approach is to do some type of leave-one-out estimate. It may seem odd because synth estimates an underlying match to a temporal trend in the treated location, but there is nothing temporal about the synth estimate. You could jumble up the years in the pre-treatment sample and still would estimate the same weights.

Because of this, you can leave-a-year-out in the pre-treatment time period, run your synth algorithm, and then predict that left out year. A good synth estimator will be close to the observed value for the out of sample estimates in the pre-treated time period (and as a side bonus, you can use that variance estimate to estimate the error in the post-trend years).
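
Both points (year order does not matter, and you can hold out a pre-treatment year) can be sketched in a few lines. This is a toy illustration with two donors, made-up counts, and a brute-force grid for the weights, not a replication of either paper. Fractions are used so the arithmetic is exact.

```python
from fractions import Fraction
import random

# Candidate weights on donor A (donor B gets 1 - w), in steps of 0.01
GRID = [Fraction(i, 100) for i in range(101)]

def fit_weight(treated, donor_a, donor_b, years):
    """Pick the weight minimizing squared error over the given years."""
    def sse(w):
        return sum((w * donor_a[t] + (1 - w) * donor_b[t] - treated[t]) ** 2
                   for t in years)
    return min(GRID, key=sse)

def leave_one_year_out(treated, donor_a, donor_b):
    """Hold out each pre-treatment year, refit, and return out-of-sample errors."""
    n = len(treated)
    errors = []
    for held_out in range(n):
        train_years = [t for t in range(n) if t != held_out]
        w = fit_weight(treated, donor_a, donor_b, train_years)
        pred = w * donor_a[held_out] + (1 - w) * donor_b[held_out]
        errors.append(treated[held_out] - pred)
    return errors

# Made-up pre-treatment counts for five years (not real city data)
treated = [100, 110, 105, 120, 115]
donor_a = [90, 100, 100, 110, 100]
donor_b = [120, 125, 115, 135, 135]

# Jumbling the order of the years does not change the fitted weight
order = list(range(len(treated)))
random.Random(0).shuffle(order)
w_original = fit_weight(treated, donor_a, donor_b, range(len(treated)))
w_shuffled = fit_weight(treated, donor_a, donor_b, order)

# Out-of-sample errors for each held out year, and their root mean squared error
errors = leave_one_year_out(treated, donor_a, donor_b)
rmse = float(sum(e ** 2 for e in errors) / len(errors)) ** 0.5

print("same weight after shuffling years:", w_original == w_shuffled)
print("held out errors:", [float(e) for e in errors], "RMSE:", round(rmse, 2))
```

Running the same routine on two candidate designs (say a 5 year vs a 15 year pre-period) and comparing the held out RMSE is one way to make the choice between them empirical.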

That is a relatively simple way to determine if the Hogan 5 year vs KNS 15 year time periods are “better” synth controls (my money is on KNS for that one). Because Hogan has not released data/code, I am not going to go through that trouble. As I said in the side note earlier, I could try to do that, and Hogan could simply come back and say “you didn’t do it right”.

This also would settle the issue of “over-fit”. You actually cannot just look at the synth weights, and say that if they are sparse they are not over-fit and if not sparse are over-fit. So for reference, you have in Hogan essentially fitting 82 weights based on 5 datapoints, and he identified a fit with 3 non-zero weights. Flip this around, and say I had 5 data points and fit a model with 3 parameters, it is easily possible that the 3 parameter model in that scenario is overfit.

Simultaneously, it is not necessary to have a sparse weights matrix. Several alternative methods to estimate synth will not have sparse weights (I am pretty sure Xu (2017) will not have sparse weights, and microsynth estimates are not sparse either for just two examples). Because US cities have such clear national level trends, a good estimator in this scenario may have many tiny weights (where good here is low bias and variance out of sample). Abadie thinks sparse weights are good to make the model more interpretable (and prevent poor extrapolation), but that doesn’t mean by default a not sparse solution is bad.

To be clear, KNS admit that their alternative results are maybe not trustworthy due to not sparse weights, but this doesn’t imply Hogan’s original estimates are themselves “OK”. I think maybe a correct approach with city level homicide rate data will have non-sparse weights, due to the national level homicide trend that is common across many cities.

Wrapping Up

If Crim and Public Policy still did response pieces maybe I would go through the trouble of doing the cross validation and making a different estimator (although I would be unlikely to be an invited commenter). But I wanted to at least do this write up, as like I said at the start I think you could do this type of critique with the majority of synth papers in criminology being published at the moment.

To just give my generic (hopefully practical) advice to future crim work:

don’t worry about matching on covariates, worry about having a long pre-period
with the default methods, you need to worry about whether you have enough “comparable” units – this is in terms of levels, not just trends
the only way to know the quality of the modeling procedure in synth is to do out of sample estimates.

Bullet points 2/3 are perhaps not practical – most criminologists won’t have the capability to modify the optimization procedure to the situation at hand. (I spent a few days trying, without much luck, to implement the penalized variants I suggested. I am sharing so others can try themselves, as I need to move on to other projects!) It also takes a bit of custom coding to do the out of sample estimates.

For many realistic situations though, I think criminologists need to go beyond just point and clicking in software, especially for this underdetermined system of equations synthetic control scenario (more weights to estimate than data points). I did a prior blog post on how I think many state level synth designs are effectively underpowered (and suggested using lasso estimates with conformal intervals). I think that is a better default in this scenario as well compared to the typical synth estimators, although you have plenty of choices.

Again I had initially written this as trying to two side the argument, and not being for or against either set of researchers. But sitting down and really reading all the sources and arguments, KNS are correct in their critique. Hogan is essentially hiding behind not releasing data and code, and in that scenario can make an endless set of (ultimately trivial) responses to anyone who publishes a replication/critique.

Even if some of the numbers KNS collated are wrong, it does not make Hogan’s estimates right.

References

Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. Journal of the American Statistical Association, 105(490), 493-505.
Abadie, A., & L’Hour, J. (2021). A penalized synthetic control estimator for disaggregated data. Journal of the American Statistical Association, 116(536), 1817-1834.
Hogan, T.P. (2022a). De‐prosecution and death: A synthetic control analysis of the impact of de‐prosecution on homicides. Criminology & Public Policy, 21(3), 489-534.
Hogan, T.P. (2022b). De-prosecution and death: A cordial reply to Kaplan, Naddeo & Scott.
Kaplan, J., Naddeo, J., & Scott, T. (2022). De-prosecution and death: A comment on the fatal flaws in Hogan (2022).
McDowall, D., & Loftin, C. (2009). Do US city crime rates follow a national trend? The influence of nationwide conditions on local crime patterns. Journal of Quantitative Criminology, 25, 307-324.
Robbins, M. W., Saunders, J., & Kilmer, B. (2017). A framework for synthetic control methods with high-dimensional, micro-level data: evaluating a neighborhood-specific crime intervention. Journal of the American Statistical Association, 112(517), 109-126.
Rosenfeld, R., Roth, R., & Wallman, J. (2023). Homicide and the opioid epidemic: a longitudinal analysis. Homicide Studies, 27(3), 321-337.
Wheeler, A. P., & Kovandzic, T. V. (2018). Monitoring volatile homicide trends across US cities. Homicide Studies, 22(2), 119-144.
Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects models. Political Analysis, 25(1), 57-76.
Yim, H. N., Riddell, J. R., & Wheeler, A. P. (2020). Is the recent increase in national homicide abnormal? Testing the application of fan charts in monitoring national homicide trends over time. Journal of Criminal Justice, 66, 101656.

by Andy Wheeler on July 9, 2023

Posted in Crime Analysis, data science, Papers, Python, R, Regression, scholarly

Tagged causal-inference, synthetic-control

https://andrewpwheeler.com/2023/07/09/some-notes-on-synthetic-control-and-hogan-kaplan/

Youtube interview with Manny San Pedro on Crime Analysis and Data Science

I recently did an interview with Manny San Pedro on his YouTube channel, All About Analysis. We discuss various data science projects I conducted while either working as an analyst, or in a researcher/collaborator capacity with different police departments:

Here is an annotated breakdown of the discussion, as well as links to various resources I discuss in the interview. This is not a replacement for listening to the video, but is an easier set of notes to link to more material on what particular item I am discussing.

0:00 – 1:40, Intro

For a rundown of my career, I did my PhD in Albany (08-15). During that time period I worked as a crime analyst in Troy, NY, as well as a research analyst for my advisor (Rob Worden) at the Finn Institute. My research focused on quant projects with police departments (predictive modeling and operations research). In 2019 I went to the private sector, and now work as an end-to-end data scientist in the healthcare sector working with insurance claims.

You can check out my academic and my data science CV on my about page.

1:40 – 7:30, Outliers in Crime Trends

I discuss the workshop I did at the IACA conference in 2017 on temporal analysis in Excel.

Long story short, don’t use percent change, use other metrics and line graphs.

7:30 – 13:10, Patrol Beat Optimization

I have the paper and code available to replicate my work with Carrollton PD on patrol beat optimization with workload equality constraints.

For analysts looking to teach themselves linear programming, I suggest Hillier’s book. I also give examples on linear programming on this blog.

It is different than statistical analysis, but I believe has as much applicability to crime analysis as your more typical statistical analysis.

13:10 – 14:15, Million Dollar Hotspots

There are hotspots of crime that are so concentrated, the expected labor cost reduction from having an officer assigned full time likely offsets the cost of the position. E.g. if you spend a million dollars in labor addressing crime at that location, and having a full time officer reduces crime by 20%, the return on investment for hotspots breaks even with paying the officer’s salary.

I call these Million dollar hotspots.
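
The break-even logic is simple arithmetic, sketched below. The million dollars and the 20% reduction come from the example above. The officer cost is a made-up placeholder, so swap in your own department’s figure, and the sketch assumes labor costs fall in proportion to crime.

```python
def labor_savings(annual_labor_cost, crime_reduction):
    """Labor cost avoided per year, assuming labor scales with crime at the location."""
    return annual_labor_cost * crime_reduction

# Figures from the example in the text
annual_labor_cost = 1_000_000
crime_reduction = 0.20

# Hypothetical fully loaded annual cost of one officer (assumption, not from the source)
officer_cost = 150_000

savings = labor_savings(annual_labor_cost, crime_reduction)
net = savings - officer_cost

print(f"expected labor savings: ${savings:,.0f}")
print(f"net after paying the officer: ${net:,.0f}")
```

With these inputs the position pays for itself whenever the officer costs less than about $200,000 per year.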

14:15 – 28:25, Prioritizing individuals in a group violence intervention

Here I discuss my work on social network algorithms to prioritize individuals to spread the message in a focused deterrence intervention. This is the opposite of how many people view “spreading” in a network: I identify something good I want to spread, and seed the network in a way to optimize that spread.

I also have a primer on SNA, which discusses how crime analysts typically define nodes and edges using administrative data.

Listen to the interview as I discuss more general advice – in SNA it matters what you want to accomplish in the end as to how you would define the network. So I discuss how you may want to define edges via victimization to prevent retaliatory violence (I think that would make sense for violence interrupters to be proactive for example).

I also give an example of how it may make sense to base detective case allocation on SNA – detectives have background with an individual’s network (e.g. have a rapport with a family based on prior cases worked).

28:25 – 33:15, Be proactive as an analyst and learn to code

Here Manny asked the question of how analysts can prevent their role from being turned into a more administrative role (just get requests and run simple reports). I think the solution to this (not just in crime analysis, but also being an analyst in the private sector) is to be proactive. You shouldn’t wait for someone to ask you for specific information, you need to be defining your own role and conducting analysis on your own.

He also asked about crime analysis being under-used in policing. I think being stronger at computer coding opens up so many opportunities that learning python, R, SQL, is the area I would like to see stronger skills across the industry. And this is a good career investment as it translates to private sector roles.

33:15 – 37:00, How ChatGPT can be used by crime analysts

I discuss how ChatGPT may be used by crime analysts to summarize qualitative incident data. (Check out this example by Andreas Varotsis.)

To be clear, I think this is possible, but I don’t think the tech is quite up to that standard yet. Also do not submit LEO sensitive data to OpenAI!

Also always feel free to reach out if you want to nerd out on similar crime analysis questions!

by Andy Wheeler on July 2, 2023

Posted in ask me anything, Crime Analysis, Crime Mapping, Criminal Justice, data science, Data Visualization, Python, R, social networking

Tagged linear programming, Predictive-Policing, professional-development

https://andrewpwheeler.com/2023/07/02/youtube-interview-with-manny-san-pedro-on-crime-analysis-and-data-science/

Setting conda environments in crontab

I prefer using conda environments to manage python (partly out of familiarity). Conda is a bit different though, in that it is often set up locally for a user’s environment, and not globally as an installed package. This makes using it in bash scripts (or in Windows .bat files) somewhat tricky.

So first, in a Unix environment, you can choose where to install conda. Then it adds into your .bashrc profile a line that looks something like:

__conda_setup="$('/lib/miniconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/lib/miniconda/etc/profile.d/conda.sh" ]; then
        . "/lib/miniconda/etc/profile.d/conda.sh"
    else
        export PATH="/lib/miniconda/bin:$PATH"
    fi
fi
unset __conda_setup

Where here I installed it in /lib. This looks complicated at first glance, but really all it is doing is sourcing the conda.sh script and pre-pending miniconda/bin to the path.

Now to be able to run python code on a regular basis in crontab, I typically have crontab run shell scripts, not python directly, so say that is a file run_code.sh:

#!/bin/bash

# Example shell script (bash, since source is not available in plain sh)
time_start=$(date +%Y_%m_%d_%H:%M)
echo "Begin script $time_start"

# Sourcing conda
source /lib/miniconda/etc/profile.d/conda.sh

# activating your particular environment
# may need to give full path, not just the name
conda activate your_env

# if you want to check environment
python --version

# you may need to change the directory at this point
echo "Current Directory is set to $PWD"
cd ...

# run your python script
log_file="main_log_$time_start.txt"
python main.py > "$log_file" 2>&1

I do not need to additionally add to the path in my experience, just sourcing that script is sufficient. Now edit your crontab (via crontab -e and using the VI editor) to look something like:

20 3 * * * bash /.../run_code.sh >> /.../cron_log.txt 2>&1

Where /.../ is shorthand for an explicit path to where the shell script and cron log live.

This will run the shell script at 3:20 AM every day and append its output to the cron log. If you want environment variables available for all jobs, you can set them at the top of the crontab as plain name=value lines:

# global environment, can set keys
some_key=abc

20 3 * * * bash /.../run_code.sh >> /.../cron_log.txt 2>&1

Note that a crontab is not a shell script, so lines such as export or source are not valid there (variables set this way are already passed to each job). Sourcing conda.sh therefore still needs to happen inside the shell script, and if you need to change environments you really need a shell script anyway. It is good to know you can inject environment variables in the crontab environment though.
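On the Python side, a variable set in the crontab shows up in os.environ like any other environment variable. Here is a minimal sketch of how main.py might read it. The helper name get_required_env is my own for illustration, and some_key is just the example name from the crontab above:

```python
import os


def get_required_env(name):
    """Return an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if value is None:
        # cron starts jobs with a very small environment, so a missing
        # variable usually means it was never set in the crontab
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

Calling get_required_env("some_key") near the top of the script means a missing key produces an obvious error in the log, instead of a confusing failure later on.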

About the only other gotcha is file permissions. Sometimes in business applications you have service accounts running things, so the crontab belongs to the service account. You just need to make sure to chmod files so the service account has appropriate permissions. I tend to have more issues with log files by accident than I do conda environments though.
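Since log files are where I tend to get bitten, one option is to have the Python script check the log directory up front. This is only a sketch under my own assumptions: the function name make_log_path and the main_log prefix are illustrative, and I use underscores instead of a colon in the timestamp so the file name is portable.

```python
import os
from datetime import datetime
from pathlib import Path


def make_log_path(log_dir, prefix="main_log"):
    """Build a timestamped log path and confirm the directory is writable."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        raise FileNotFoundError(f"Log directory does not exist: {log_dir}")
    # os.access checks permissions for the user running the script,
    # which under cron is the account that owns the crontab
    if not os.access(log_dir, os.W_OK | os.X_OK):
        raise PermissionError(f"Cannot write to log directory: {log_dir}")
    stamp = datetime.now().strftime("%Y_%m_%d_%H_%M")
    return log_dir / f"{prefix}_{stamp}.txt"
```

If the service account cannot write to the directory, this fails right away with a message naming the directory, which is easier to debug than an empty cron log.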

Note for people setting up scheduled jobs on Windows, I have an example of setting a conda environment in a Windows bat file.

Additional random pro-tip with conda environments while I am here – if you don’t want conda to set up new environments in your home directory by default (due to space or production processes), or want it to download packages into a different cache location, you can do something like:

conda config --add pkgs_dirs /lib/py_packages
conda config --add envs_dirs /lib/conda_env

I have had issues in the past with having too much junk in home.

by Andy Wheeler on June 28, 2023 • Permalink

Posted in data science, Python

Tagged process-automation, shell-script

https://andrewpwheeler.com/2023/06/28/setting-conda-environments-in-crontab/

Make more money

So I enjoy Ramit Sethi’s Netflix series on money management – fundamentally it is about money coming in and money going out and the ability to balance a budget. On occasion I see other budget coaches focus on trivial expenses (the money going out) whereas for me (and I suspect the majority of folks reading this blog with higher degrees and technical backgrounds) you should almost always be focused on finding a higher paying job.

Let’s go with a common example people use as unnecessary discretionary spending – getting a $10 drink at Starbucks every day. If you do this, over the course of a 365 day year, you will have spent an additional $3,650. If you read my blog about coding and statistics and that expense bothers you, you are probably not making as much money as you should be.

Ramit regularly talks about asking for raises – I am guessing that for most people reading this blog, a raise would be well over that Starbucks expense. But part of the motivation to write this post is in reference to formerly being a professor. I think many criminal justice (CJ) professors are underemployed, and should consider better paying jobs. I am regularly starting to see public sector jobs in CJ that have substantially better pay than being a professor. This morning someone shared a position for an entry level crime analyst at the Reno Police Department with a pay range from $84,000 to $102,000.

The low end of that starting pay range is competitive with the majority of starting assistant professor salaries in CJ. You can go check out what the CJ professors at Reno make (which is pretty par for the course for CJ departments in the US) in comparison. If I had stayed as a CJ professor, even with moving from Dallas to other universities and trying to negotiate raises, I would be lucky to be making over $100k at this point in time. Again, that Reno position is an entry level crime analyst – asking for a BA + 2 years of experience or a Masters degree.

In comparison, entry level private sector data science jobs in the DFW area in 2019 were often starting at a $105k salary (based on personal experience). You can check out BLS data to examine average salaries in data science if you want to look at your particular metro area (it is good to see the total number in that category in an area as well).

While academic CJ salaries can sometimes be very high (over $200k), these are quite rare. There are a few things going against professor jobs, and CJ ones in particular, that depress CJ professor wages overall. Social scientists in general make less than STEM fields, and CJ departments are almost entirely in state schools that tend to have wage compression. Getting an offer at Harvard or Duke is probably not in the cards if you have a CJ degree.

In addition to this, with the increase in the number of PhDs being granted, competition is stiff. There are many qualified PhDs, making it very difficult to negotiate your salary as an early career professor – the university could hire 5 people who are just as qualified in your stead who aren’t asking for that raise.

So even if you are lucky enough to have negotiating power to ask for a raise as a CJ professor (which most people don’t have), you often could make more money by getting a public sector CJ job anyway. If you have quant skills, you can definitely make more money in the private sector.

At this point, most people go back to the idea that being a professor is the ultimate job in terms of freedom. Yes, you can pursue whatever research line you want, but you still need to teach courses, supervise students, and occasionally do service to the university. These responsibilities all by themselves are a job (the entry level crime analyst at Reno will work less overall than the assistant professor who needs to hustle to make tenure).

To me the trade off in freedom is worth it because you get to work directly with individuals who actually care what you do – you lose freedom because you need to make things within the constraints of the real world that real people will use. To me being able to work directly on real problems and implement my work in real life is a positive, not a negative.

Final point to make in this blog: because of the stiff competition for professor positions, I often see people suggesting there are too many PhDs. I don’t think this is the case though; you can apply the skills you learned in getting your CJ PhD to those public and private sector jobs. I think CJ PhD programs just need small tweaks to better prepare students for those roles, in addition to just letting people know different types of positions are available.

It is pretty much at the point that alt-academic jobs are better careers than the majority of CJ academic professor positions. If you had the choice to be an assistant professor in CJ at University of Nevada Reno, or be a crime analyst at Reno PD, the crime analyst is the better choice.

by Andy Wheeler on June 22, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, data science, scholarly

Tagged career

https://andrewpwheeler.com/2023/06/22/make-more-money/

Some adventures in cloud computing

Recently I have been trying to teach myself a bit of cloud architecture – it has not been going well. The zoo of micro-services available from AWS or Google is testing my patience. In my most recent experiment with Google, I had some trial money and spun up the cheapest Postgres database, created a trivial table, added a few rows, and then left it for a month. It racked up nearly $200 of bills in that time span. In addition, the only way I could figure out how to interact with the DB was some hacky sqlalchemy python code from my local system (besides the cloud shell psql).

But I have been testing other services where it is easier for me to see how I can use them for my business. This post will mostly be about supabase (note I am not paid for this!). Alt title for the post: supabase is super easy. Supabase is a cloud postgres database, and out of the box it is set up to make hitting API endpoints very simple. The free tier database can hold 500 MB (and get/post calls I believe are unlimited). Their beta pricing for smaller projects can up the postgres DB to 8 GB (at $25 per month per project). This pricing makes me feel much safer than the big cloud providers, where I am constantly concerned I will accidentally leave something turned on and rack up 4 or 5 digits of expenses.

Unlike the google cloud database, I was able to figure supabase out in a day. So first after creating a project, I created a table to test out:

-- SQL Code
create table
  public.test_simple (
    id bigint generated by default as identity not null,
    created_at timestamp with time zone null default now(),
    vinfo bigint null,
    constraint test_simple_pkey primary key (id)
  ) tablespace pg_default;

I actually created this in the GUI editor. Once you create a table, the top right of the page has documentation on how to call the API.

If you don’t speak curl (it also has javascript examples), you can convert curl to python:

# Python code
import requests

sup_row_key = '??yourpublickey??'

headers = {
    'apikey': sup_row_key,
    'Authorization': f'Bearer {sup_row_key}',
    'Range': '0-9',
}

# Where filter
response = requests.get('https://ytagtevlkzgftkgwhsfv.supabase.co/rest/v1/test_simple?id=eq.1&select=*', headers=headers)

# Getting all rows
response = requests.get('https://ytagtevlkzgftkgwhsfv.supabase.co/rest/v1/test_simple?select=*', headers=headers)

When creating a project, you by default get a public key with read access, and a private key that has write access. You can see the nature of the endpoint is quite simple; you just can’t copy-paste the link into a browser, since you need to pass headers.
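To make the shape of the endpoint explicit, here is a small sketch that builds the same kind of GET request using only the standard library, without sending it. The function name build_select_request and the base URL in the usage note are my own placeholders. The filter syntax (column=eq.value) and the headers are taken from the example above, and I leave out the Range header for brevity.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_select_request(base_url, table, api_key, filters=None):
    """Build (but do not send) a GET request for a Supabase REST table."""
    # filters use the column=operator.value form from the example, e.g. id=eq.1
    params = dict(filters or {})
    params["select"] = "*"
    # keep '*' unescaped so the URL matches the form shown above
    query = urlencode(params, safe="*")
    url = f"{base_url}/rest/v1/{table}?{query}"
    headers = {"apikey": api_key, "Authorization": f"Bearer {api_key}"}
    return Request(url, headers=headers, method="GET")
```

For example, build_select_request("https://your-project.supabase.co", "test_simple", sup_row_key, {"id": "eq.1"}) gives a request object you could pass to urllib.request.urlopen. Building the request separately from sending it makes it easy to check the URL and headers before hitting the API.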

One example I was thinking about was more on-demand webscraping/geocoding. So as a way to be nice to different people you are scraping data from, you can call them once, and cache the results. Now back in Supabase, to do this I enabled the plv8 database extension to be able to define javascript functions. Here is the SQL I used to create a Postgres function:

-- SQL Code
create or replace function public.test_memoize(mid int)
returns setof public.test_simple as $$

    // This is javascript
    var json_result = plv8.execute(
        'select * from public.test_simple WHERE id = $1',
        [mid]
    );
    if (json_result.length > 0) {
        return json_result;
    } else {
        // here just an example, you would use your own function
        var nv = mid + 2;
        var res_ins = plv8.execute(
          'INSERT INTO public.test_simple VALUES ($1,DEFAULT,$2)',
          [mid,nv]
        );
        // not really necessary to do a 2nd get call
        // could just pass the results back, ensures
        // result is formatted the same way though
        var js2 = plv8.execute(
        'select * from public.test_simple WHERE id = $1',
        [mid]);
        return js2;
     }

$$ language plv8;

This is essentially memoizing a function, just using a database backend to cache the call. It checks whether the value you pass in already exists in the table. If not, it computes the result (here just adding 2 to the input), inserts it into the DB, and then returns the result.
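If it helps to see the same pattern outside of Postgres, here is a minimal sketch in Python using an in-memory sqlite3 table as the cache. This is not the Supabase function, just the same lookup-then-insert logic. The function name memoize_lookup is my own, and the table keeps only the id and vinfo columns from the example.

```python
import sqlite3


def memoize_lookup(conn, mid, compute):
    """Return the cached value for mid, computing and storing it if absent."""
    row = conn.execute(
        "SELECT vinfo FROM test_simple WHERE id = ?", (mid,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit, no call to compute
    value = compute(mid)  # cache miss, do the expensive work once
    conn.execute(
        "INSERT INTO test_simple (id, vinfo) VALUES (?, ?)", (mid, value)
    )
    conn.commit()
    return value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_simple (id INTEGER PRIMARY KEY, vinfo INTEGER)")
first = memoize_lookup(conn, 20, lambda x: x + 2)   # computed and inserted
second = memoize_lookup(conn, 20, lambda x: x + 2)  # read back from the table
```

Both calls return the same value, but only the first one runs the compute function. In a real use case compute would be the slow scrape or geocode call.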

Now to call this function from a web-endpoint, we need to post the values to the rpc endpoint:

# Python post to supabase function
json_data = {'mid': 20}

response = requests.post('https://ytagtevlkzgftkgwhsfv.supabase.co/rest/v1/rpc/test_memoize', headers=headers, json=json_data)

This type of memoization is good if you have expensive functions and inputs that are not all that varied (but you can’t make a batch lookup table upfront).
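For the RPC call itself, the request is just a POST with the function arguments as a JSON body. Here is a sketch of building that request with the standard library only, again without sending it. The function name build_rpc_request is my own, and I set the Content-Type header explicitly because, unlike requests with json=, urllib does not add it for you.

```python
import json
from urllib.request import Request


def build_rpc_request(base_url, function_name, api_key, args):
    """Build (but do not send) a POST request to a Supabase RPC endpoint."""
    url = f"{base_url}/rest/v1/rpc/{function_name}"
    headers = {
        "apikey": api_key,
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # function arguments travel as a JSON object keyed by parameter name
    body = json.dumps(args).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")
```

The keys in args need to match the parameter names of the Postgres function, so for test_memoize above that is {"mid": 20}.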

Supabase also has the ability to do edge functions (server side typescript). That may be a better case for this, but very nice to be able to make a quick function and test it out.

Next up in the blog when I get a chance, I have also been experimenting with Oracle Cloud free tier. I haven’t been able to figure out the database stuff on their platform yet, but you can spin up a nice little persistent virtual machine (with 1 gig of ram). Very nice for tiny batch jobs, and next blog post will be setting up conda and showing how to do cron jobs. Batch scraping slow but smaller data jobs I am thinking is a good use case. (And having a persistent machine is nice, for the same reason having your car is nice even if you don’t use it all day every day.)

One thing I am still searching for: if I have more data intensive batch jobs that need more RAM (I often don’t need GPUs, but having more RAM is nice), what is my best cloud solution? So no Github actions (the jobs can be long running), but I need more RAM than the cheap VPS. I am not even sure what the correct comparable products are at the big companies.

Let me know in the comments if you have any suggestions! Just knowing where to get started is sometimes very difficult.

by Andy Wheeler on June 19, 2023 • Permalink

Posted in data science, Python

Tagged cloud, postgres, SQL

https://andrewpwheeler.com/2023/06/19/some-adventures-in-cloud-computing/
