New paper: An Open Source Replication of a Winning Recidivism Prediction Model

Our paper on the NIJ forecasting competition (Gio Circo is the first author) is now out online first in the International Journal of Offender Therapy and Comparative Criminology (Circo & Wheeler, 2022). (Eventually it will be in a special issue on replications and open science organized by Chad Posick, Michael Rocque, and Eric Connolly.)

We ended up doing the same type of biasing as Mohler and Porter (2022) did to meet the fairness constraints. Essentially we biased results to say no one was high risk, and this resulted in “fair” predictions. With fairness constraints or penalties you sometimes have to be careful what you wish for. And because not enough students signed up, Gio and I had more winnings distributed via the fairness competition (although we did quite well in the round 2 competition even with the biasing).

So while that paper is locked down, we have the NIJ tech paper on CrimRXiv, and our ugly code on GitHub. But you can always email me for a copy of the actual published paper as well.

Of course, since I am not an academic anymore, I am not uber focused on potential future work. I would like to learn more about survival-type machine learning forecasts and apply them to recidivism data (instead of doing discrete 1, 2, and 3 year predictions). But in my experience machine learning models need very large datasets; even the 20k rows here are on the fringe where regressions are close to equivalent to non-linear and tree-based models.
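To give a flavor of what I mean by a survival-type forecast, here is a minimal sketch using scikit-survival on made-up data. The column layout and the random survival forest are just my assumptions for illustration, not anything from our competition entry:

```python
# Minimal sketch of a survival-style recidivism forecast, assuming scikit-survival
# (sksurv) and entirely synthetic data -- not the actual NIJ competition fields.
import numpy as np
from sksurv.util import Surv
from sksurv.ensemble import RandomSurvivalForest

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.integers(18, 60, n),   # age
                     rng.poisson(2, n)])        # prior arrests
time = rng.integers(30, 1096, n)                # days to rearrest, censored at 3 years
event = rng.random(n) < 0.4                     # True = rearrested before censoring

# structured outcome (event indicator + time) instead of separate 1/2/3 year flags
y = Surv.from_arrays(event=event, time=time)
rsf = RandomSurvivalForest(n_estimators=100, random_state=0).fit(X, y)

# survival curves let you pull out a probability at any horizon, not just fixed years
surv = rsf.predict_survival_function(X[:5], return_array=True)
idx = np.searchsorted(rsf.event_times_, 365)
print(1 - surv[:, idx])  # estimated probability of rearrest within one year
```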

Another potential application is simple models. Cynthia Rudin has quite a bit of recent work on interpretable trees for this (e.g., Liu et al., 2022), and my linked post has examples for simple regression weights. I suspect the simple regression weights will work reasonably well for this data. Likely not well enough to place on the scoreboard of the competition, but well enough in practice that it would be totally reasonable to swap them in, given the simpler results (Wheeler et al., 2019).
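As a rough illustration of the simple-weights idea (on synthetic data, and with a rounding rule that is just my example, not the exact procedure in Wheeler et al., 2019): fit a regular logistic regression, round the standardized coefficients to small integers, and compare the rounded score to the full model.

```python
# Sketch: compare a full logistic regression to rounded integer "checklist" weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=6, random_state=0)
Xs = StandardScaler().fit_transform(X)

full = LogisticRegression().fit(Xs, y)

# rescale so the largest coefficient is +/-3, then round to the nearest integer
scale = 3 / np.abs(full.coef_).max()
simple_w = np.round(full.coef_[0] * scale)

full_auc = roc_auc_score(y, full.decision_function(Xs))
simple_auc = roc_auc_score(y, Xs @ simple_w)
print(f"full model AUC {full_auc:.3f}, simple integer weights AUC {simple_auc:.3f}")
```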

But for this paper, the main takeaway Gio and I want to tell folks is that creating a (good) model using open source data is totally within the capabilities of PhD criminal justice researchers and data scientists working for these state agencies. These are quantitative skills I wish more students within our field would pursue, as it makes it easier for me to hire you as a data scientist!

References

Text analysis, alt competition sites, and ASC

A bit of a potpourri blog post today. First, I am not much of a natural language processing wiz. But based on the work of Peter Baumgartner at RTI (assigning reduced form codes based on text descriptions), I was pointed to the simpletransformers library. It is very easy to download complicated NLP architectures (like RoBERTa with 100 million+ parameters) and retrain them on your idiosyncratic data.

Much of the issue working with text data is the cleaning, but with these extensive architectures much of that is not so necessary. See for example this blog post on classifying different toxic comments. Out of the box, the multi-label classification gets an AUC score pretty damn close to the winning entry in the Kaggle contest this data was developed for. No text munging necessary.

Playing around on my personal machine, I have been able to download and re-tune the pretrained RoBERTa model. Fitting that same model as in the blog post (with just all the defaults for the model) takes around 7 hours on my GPU.
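For those curious, the setup in simpletransformers is only a handful of lines. Below is a rough sketch of the multi-label case; the column layout and toy labels are mine, not the actual toxic comments fields, and in practice you would use a real training frame and more epochs.

```python
# Sketch of a multi-label classifier in simpletransformers (toy data, not the Kaggle fields).
import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

# train_df needs a "text" column and a "labels" column holding lists of 0/1s
train_df = pd.DataFrame({
    "text": ["you are wonderful", "go away you idiot"],
    "labels": [[0, 0, 0], [1, 0, 1]],
})

model = MultiLabelClassificationModel(
    "roberta", "roberta-base", num_labels=3,
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
    use_cuda=False,  # set True if you have a GPU
)
model.train_model(train_df)

preds, raw_outputs = model.predict(["some new comment to score"])
print(preds)
```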

The simpletransformers library has a ton of different pre-set architectures for different problems. But for the ones I have played around with on labelled data (e.g., you have text data on the right-hand side and want to predict a binary or multinomial outcome), I have had decent success.

Another text library I have played around with (although I have not had as much success with it in production) is dirty_cat. This is for unsupervised modeling, which unfortunately is a harder task to evaluate for success than supervised learning.
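As an example of the type of thing dirty_cat does, its GapEncoder learns latent topics over messy string categories without any labels. A minimal sketch (with made-up strings) looks like:

```python
# Minimal sketch of dirty_cat's GapEncoder on messy categorical strings
# (illustrative job titles, not real data).
import pandas as pd
from dirty_cat import GapEncoder

jobs = pd.DataFrame({"title": [
    "police officer", "senior police ofcr", "police sergeant",
    "fire fighter", "firefighter ii",
]})

enc = GapEncoder(n_components=2, random_state=0)
topics = enc.fit_transform(jobs)
print(topics.shape)  # one row per title, one column per latent topic
```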

Alt Competition Sites

I recently spent two days trying to work on a recent Kaggle competition, a follow-up to the toxic comments one above. My solution is nowhere close to the current leaderboard though, and given the prize total (and what I expect to be something like 5,000 participants), it just isn’t worth my time to work on it more.

Two recent government competitions I did compete in though are the NIJ recidivism challenge and the NICHD maternal morbidity challenge. (I will release my code for the maternal morbidity one when the competition is fully scored; it is a fuzzy-scored one, not a best-predictive-accuracy one.) Each of these competitions had under 50 teams participate, so there is much less competition than on Kaggle. The CDC has a new one as well, for using a network-based approach to violence and drug problems.

For some reason these competitions are not on the Challenge.gov website. Another site I wanted to share as well is DrivenData competitions. If I had found that sooner I might have given the floodwater competition a shot.

I have mixed feelings about the competitions, and they are risky. For NIJ and NICHD, I probably spent something like $10,000 to $20,000 worth of my personal time on the code solutions (for each individually). I knew NIJ would not have many submissions (I did not participate in the geographic forecasting, and saw some people win with silly strategies). If you submitted anything in the student category you would have won close to the same amount as my team did (as not all the slots were filled up). And the NICHD one was quite onerous with all the paperwork, so I figured it would also have low turnout (and the prizes are quite good). So whether I think it is worth it to give one a shot comes down to guessing the total competition pool, the level of effort to make a good submission, and how the prizes are divvied up, as well as the total dollar amount.

The CDC violence one has strangely low prizes, so I wouldn’t bother to submit unless I already had some project I was working on anyway. I think a better use of the federal challenges would be to have easier pilot work, and based on the pilot work fund larger projects. So consider the initial challenge roughly equivalent to a grant proposal. This especially makes sense for generating fairness algorithms (not so much for who has the best hyper-tuned XGBoost model on a particular train/test dataset).

Missing ASC

The American Society of Criminology conference is going on now in Chicago. A colleague emailed the other day asking if I was coming, and I do miss meeting up with friends. The majority of presentations are quite bad (both for content and presentation style), so it is more of an excuse to have a beer with friends than anything.

Earlier in the year I debated with my wife about taking a family vacation to Chicago during this conference. We decided against it because of looming COVID; I correctly predicted it would still be quite prevalent (and I am guessing it will be indefinitely at this point, given vaccine hesitancy and new variants). I incorrectly predicted, though, that I wouldn’t be able to get a vaccine shot until October (so I am very impressed with the distribution on that front). Even my son has a shot (I didn’t even try to guess when that would happen). So I am not sure if I made the correct choice in retrospect: the risk of contraction is as high as I guessed, but the risk of adverse effects, given we have the vaccines, is very low.