Text analysis, alt competition sites, and ASC

A bit of a potpourri blog post today. First, I am not much of a natural language processing wiz. But based on the work of Peter Baumgartner at RTI (assigning reduced form codes based on text descriptions), I was pointed to the simpletransformers library. It makes it very easy to download complicated NLP architectures (like RoBERTa with 100 million+ parameters) and retrain them on your idiosyncratic data.

Much of the issue working with text data is the cleaning, and with these extensive architectures that cleaning is not so necessary. See for example this blog post on classifying different toxic comments. Out of the box the multi-label classification gets an AUC score pretty damn close to the winning entry in the Kaggle contest this data was developed for. No text munging necessary.

Playing around on my personal machine I have been able to download and re-tune the pretrained RoBERTa model – fitting the same model as the blog post (with just all the defaults for the model) takes around 7 hours on my GPU.

The simpletransformers library has a ton of different pre-set architectures for different problems. The ones I have played around with use labelled data (e.g. you have text data on the right hand side, and want to predict a binary or multinomial outcome), and I have had decent success with them.

Another text library I have played around with (although I have not had as much success in production) is dirty_cat. This is for unsupervised modeling, where it is unfortunately harder to evaluate what counts as success than in supervised learning.

Alt Competition Sites

I recently spent two days trying to work on a recent Kaggle competition, a follow up to the toxic comments one above. My solution is nowhere close to the current leaderboard though, and given the prize total (and I expect something like 5,000 participants), this just isn’t worth my time to work on it more.

I did compete in two recent government competitions though, the NIJ recidivism one and the NICHD maternal morbidity one. (I will release my code for the maternal morbidity when the competition is fully scored, it is a fuzzy one not a predictive best accuracy one.) Each of these competitions had under 50 teams participate, so it is much less competition than Kaggle. The CDC has a new one as well, for using a network based approach to violence and drug problems.

For some reason these competitions are not on the Challenge.gov website. Another site I wanted to share as well is DataDriven competitions. If I had found that sooner I might have given the floodwater competition a shot.

I have mixed feelings about the competitions, and they are risky. For NIJ and NICHD I probably spent what I would consider something like $10,000 to $20,000 of my personal time on the code solutions (for each individually). I knew NIJ would not have many submissions (I did not participate in the geographic forecasting, and saw some people win with silly strategies). If you submitted anything in the student category you would have won close to the same amount as my team did (as not all the slots were filled up). And the NICHD was quite onerous to do all the paperwork, so I figured it would also have low turnout (and the prizes are quite good). So whether I think it is worth it for me to give a shot comes down to guessing the total competition pool, the level of effort to submit a good submission, and how the prizes are divvied up as well as the total dollar amount.
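To make that back-of-the-envelope reasoning concrete, here is a minimal sketch of the arithmetic. The function name `expected_net_value`, the prize amounts, and the equal-odds assumption are all hypothetical and only for illustration (they are not the actual NIJ or NICHD prize schedules), and a strong entrant would reasonably expect better than equal odds.

```python
def expected_net_value(prizes, n_teams, effort_cost):
    """Rough expected payoff for one team entering a competition.

    Assumes (hypothetically) that every team is equally likely to land in
    any finishing position, which is a pessimistic baseline for a strong team.
    """
    # If there are more prize slots than teams, the extra slots go unfilled
    filled = min(len(prizes), n_teams)
    awarded = sorted(prizes, reverse=True)[:filled]
    # Each team has a 1/n_teams chance at each awarded prize
    expected_prize = sum(awarded) / n_teams
    return expected_prize - effort_cost

# Hypothetical prize schedule and effort cost, not real competition figures
prizes = [25000, 15000, 10000]
small_field = expected_net_value(prizes, n_teams=40, effort_cost=15000)
kaggle_like = expected_net_value(prizes, n_teams=5000, effort_cost=15000)
print(small_field, kaggle_like)
```

Under equal odds both come out negative, but the gap between a 40 team field and a 5,000 team field shows why the size of the competition pool matters so much relative to the headline prize amount.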

The CDC violence one has strangely low prizes, so I wouldn’t bother to submit unless I already had some project I was working on anyway. I think a better use of the Fed challenges would be to have easier pilot work, and based on the pilot work fund larger projects. So consider the initial challenge sort of equivalent to a grant proposal. This especially makes sense for generating fairness algorithms (not so much for who has the best hypertuned XGBoost model on a particular train/test dataset).

Missing ASC

The American Society of Criminology conference is going on now in Chicago. A colleague emailed the other day asking if I was coming, and I do miss meeting up with friends. The majority of presentations are quite bad (both for content and presentation style), so it is more of an excuse to have a beer with friends than anything.

I debated with my wife about taking a family vacation to Chicago during this conference earlier in the year. We decided against it because of the looming covid – I correctly predicted it would still be quite prevalent (and I am guessing it will be indefinitely at this point given vaccine hesitancy and new variants). I incorrectly predicted though that I wouldn’t be able to get a vaccine shot until October (so I am very impressed with the distribution on that front). Even my son has a shot (I didn’t even try to guess when that would happen). So I am not sure if I made the correct choice in retrospect – the risk of contraction is as high as I guessed, but the risk of adverse effects given we have the vaccines is very low.

CrimCon Roundtable: Flipping a Criminal Justice PhD to an alt-academic Data Science Career

This Thursday 11/19/2020 at 1 PM Eastern, I will be participating in a roundtable for the online CrimCon event. This is free for everyone to zoom in, and here is the link to the program, I am on Stream 3!

The title is above. I have been a private sector data scientist at HMS for not quite a year now. I wanted to organize a panel to help upcoming PhDs in criminal justice get some more exposure to potential data science positions, outside the traditional tenure track. Here is the abstract:

Tenure-track positions in academia are becoming more challenging to obtain, and only a small portion of junior faculty continue in academia to the rank of full professor. Therefore, students may opt to explore alternate options to obtain employment after their PhD is finished. These alternatives to the tenure track are often called “alt-academic” jobs. This roundtable will be focused on discussing various opportunities that exist for PhD’s in criminal justice and behavioral sciences spanning the public sector, the private sector, and non-profits/think tanks. Panelists will also discuss gaps in the typical PhD curriculum, with the goal of aiding current students to identify steps they can take to make themselves more competitive for alt-academic roles.

And here are each of the panelists’ bios:

Dr. Andrew Wheeler is currently a Data Scientist at HMS working on problems related to predictive modeling and optimization in relation to health insurance claims. Before joining HMS, he received a PhD degree in Criminal Justice from SUNY Albany. While in academia his research focused on collaborating with police departments on various problems, including: evaluating crime reduction initiatives, place based and person based predictive modeling, data analytics for crime analysis, and developing models for the efficient and fair delivery of police resources.

Dr. Jennifer Gonzalez is the Senior Director of Population Health at the Meadows Mental Health Policy Institute, where she manages the Institute’s research and data portfolio. She earned her doctoral degree in epidemiology and a M.S. degree in criminal justice. Before joining MMHPI, Dr. Gonzalez was a tenured associate professor at the University of Texas School of Public Health, where she maintained a portfolio of more than $10 million in research funding and published more than one hundred interdisciplinary articles focused on the health of those who come into contact with—and work within—the criminal justice system.

Dr. Kyleigh Clark-Moorman is a Senior Research Associate for the Public Safety Performance Project at The Pew Charitable Trusts, a non-profit public policy organization. Kyleigh began working at Pew in 2019 and completed her PhD in Criminology and Criminal Justice at the University of Massachusetts, Lowell in May 2020. As an early career researcher, Dr. Clark-Moorman’s work has been published in Criminal Justice and Behavior, Criminal Justice Studies, and the Journal of Criminal Justice. In her role at Pew, Kyleigh is responsible for research design and data analysis focused on various criminal justice topics while also working with external partners to produce high-impact reports and analyses to raise awareness and drive public policy.

Matt Vogel is Associate Professor in the School of Criminal Justice at the University at Albany, SUNY and the Director of the Laboratory for Decision Making in Criminology and Criminal Justice. Matt regularly assists local agencies with data and evaluation needs. Some of his ongoing collaborations include assessments of racial representation on capital juries in Missouri, a longitudinal evaluation of a school-based violence reduction program, and the implementation of a police-hospital collaboration to help address retaliatory violence in St. Louis. Prior to joining the faculty at UAlbany, Matt worked in the Department of Criminology and Criminal Justice at the University of Missouri – St. Louis and held a long-term visiting appointment with the Faculty of Architecture at TU Delft (the Netherlands).

If you have any upfront questions you would like addressed by the panel, always feel free to send me a pre-emptive email (or comment below).


Update: The final roundtable is now posted on Youtube. See below for the panel’s thoughts on pursuing non-tenure track jobs with your social science PhD.

A bunch of random shout outs

Busy, busy, busy! Hopefully I will have some time in the near future to write up some more data science posts. But for now, here is a small python snippet to help you build interaction variables between two sets of numpy arrays/dataframes.

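import numpy as np

def np_int(a, b):
    # a is (n, p) and b is (n, q); the result is (n, p*q) with every pairwise product
    rows = a.shape[0]
    cols = a.shape[1] * b.shape[1]
    # einsum builds an (n, p, q) array of a[i, j] * b[i, k]; reshape flattens the last two axes
    return np.einsum('ij,ik->ijk', a, b).reshape((rows, cols))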

This works for pytorch as well (just replace np.einsum with torch.einsum). So coming up (eventually) I will illustrate encoding interaction between hidden layers in a deep learning model. But for now some quicker updates.
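As a quick sanity check, here is a small usage sketch for the function. The toy arrays `a` and `b` are made up for illustration, and the function is repeated so the example runs on its own.

```python
import numpy as np

def np_int(a, b):
    # Row-wise pairwise products between the columns of a and the columns of b
    rows = a.shape[0]
    cols = a.shape[1] * b.shape[1]
    return np.einsum('ij,ik->ijk', a, b).reshape((rows, cols))

# Toy inputs: 2 rows, 2 columns in a and 3 columns in b
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20, 30],
              [40, 50, 60]])

inter = np_int(a, b)
# Columns are ordered a0*b0, a0*b1, a0*b2, a1*b0, a1*b1, a1*b2
print(inter.shape)  # (2, 6)
print(inter)
```

Note the column ordering: all of the interactions for the first column of `a` come first, then all of the interactions for the second column, and so on.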

Shout out #1: Scott Jacques has continued to lead the charge for open access to criminology journals. He has two recent posts about post-prints, and how our main journal (Criminology) has an excessive policy of not allowing authors to share post-prints for over two years (whereas the majority of criminology journals allow you to post them immediately).

Several aspects of open science are tricky – posting pre-prints/post-prints is not. If we can come together as a group this is an easy, no cost way to greatly improve the accessibility of our work to the greater public.

Shout out #2: The folks at Police Rewired have hosted a hackathon intended to Hack Hate. It is too late to participate, but they will be displaying the results this Sunday. I have not had the chance to participate in any code hackathons, so I will need to make a concerted effort in the future to give at least one a shot. (It seems hard, how can you do any work in only a day or a week or two!? But the proof is in the pudding so to speak, I have seen some pretty cool things come out of various hackathons in the past.)

Shout out #3: My workplace, HMS, is involved in a data sharing collaborative called the Digital Health DRC. They also have a hackathon coming up, but this is related to Telehealth use. The Digital Health DRC is pretty cool though, it is basically a way for HMS (and several other private sector entities) to share various datasets with researchers over the globe.

The scope of HMS’s data is somewhat outside the realm of my old stomping grounds of criminology (but not entirely, a big part of my job is identifying potentially fraudulent patterns in claims data). But for folks who have a research question that could be answered using health insurance claims data, this is a good resource to look into. (HMS has pretty good coverage of Medicare claims across the US.)

Finally, I experimented for a few days with hosting ads on the site. I managed to serve up a few thousand and make 10 cents. So I will turn that off for now. I debated putting up a button for folks to donate a coffee, but even that is not necessary. (I can afford the few bucks for the domain, and I use dropbox to back up my files anyway, so hosting extra materials is not a big deal.) I would rather folks just take my nerdy notes and make their own cool stuff (and share it with me!) I may need to figure out a better hosting solution for images though, as google photos is continuing to give me troubles I see (so if you see an image is not coming through feel free to let me know in the comments or send me an email).

Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one has previously done before (including myself).

So here is a more general way to think about (from a police department’s or criminal justice agency’s perspective) whether it would be beneficial to reach out to me.

Should I do X?

So no one is going to be against different evidence based policing practices, but not all strategies make sense for all jurisdictions. For example, while focused deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration the cost as well as the potential benefits of the program.

Should I do X may involve more open ended questions. I’ve previously conducted in person training for crime analysts that goes over various evidence based practices. It also may involve something more specific, such as should I redistrict my police beats? Or I have a theft-from-vehicle problem, what strategies should I implement to reduce them?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono, so you may ask what is in it for me then? As an academic I get evaluated mostly by my research productivity, which involves writing peer reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If it involves applying for a grant (on a project I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments. But my work with police departments is more meaningful. It is not zero-sum either, as I tend to get better ideas when understanding specific agencies’ problems.

So get in touch if you think I can help your agency!

The week at Stackexchange 5/21/2013 Edition

Posts I’ve found interesting during the week (or likely over longer periods!) at various forums I participate in.

CrossValidated

GIS

Academia

Others

SPSS Nabble Group

Hopefully I get more time to blog in the near future, but currently busy, busy, busy! Working on visualizing JTC flow data (presenting at ASC this fall), getting everyone to approve my prospectus, and I have a few more SPSS blog posts I have in mind (restricted cubic splines, visually weighted regression, and using Ripley’s K to analyze temporal crime sprees!)

The week at Stackexchange 4/28/13 Edition

I know several individuals have blog posts in which they list interesting other articles that happened during the week. Mine will be a bit of a different twist though. I participate in a few of the stack exchange sites (and the SPSS Nabble forum), and often I think a question is interesting, but don’t follow along closely enough to see an answer given. Another situation that happens is I give an answer, and I don’t see other answers to the same post. To help me, and bring greater attention to various posts I find interesting, I figured I would create a weekly listing of those particular questions (no guarantee the question had anything to do with the previous week – I’ll try not to be redundant though!)

CrossValidated

GIS

Academia

Others

I’ve just noted this because I’ve seen a ton of nice ggplot2 examples from Didzis Elferts recently on stackoverflow.

SPSS Nabble Group

My posts on CrossValidated Blog

I’ve made several guest posts on the now dormant Cross Validated Community Blog. They are;

The notes on making tables is IMO my most useful collection, with the post on small multiples coming in second. Other contributions currently include;

For those not familiar, Cross Validated is a stackexchange website where one can ask and answer various questions related to statistics. The stackexchange sites are a large improvement over list-serve emails, and I hope to promote their usefulness and encourage individuals to either contribute to current forums and/or start new ones for particular areas. I also participate on the GIS and Academia sites (as well as programming questions for SPSS and R on stackoverflow).

The blog is just an extension of the site, in that Q/A sessions are not well suited for long discussions. So instead of fitting a square peg in a round hole at times, I believe the blog is a useful place for discussions and greater commentary useful for the communities that aren’t quite well placed in Q/A. Unfortunately, community up-take in the blog has been rather minor, so it is currently dormant. Feel free to stop by the Cross Validated chat room Ten fold if you are interested in contributing. I hope to see the blog not die, but IMO there isn’t much point in any of the current people continuing to contribute unless there is greater community contribution from other individuals.

My experience blogging in 2012

I figured I would write a brief post about my experience blogging. I created this blog and published my first post in December of 2011. Since then, in 2012, I published 30 blog posts, and totaled 7,200 views. While I thought the number was quite high (albeit a bit disappointing compared to the numbers of Larry Wasserman), it is still many more people than would have listened to what I had to say if I didn’t write a blog. When starting out I averaged under 10 views a day, but throughout the year it steadily grew, and now I average about 30 views per day. The post that had the most traffic in one day was When should we use a black background for a map?, with 73 views, and that was largely because of some twitter traffic (a result of Steven Romalewski tweeting it and then it being re-tweeted by Kenneth Field).

I started the blog because I really loved reading a lot of others’ blogs, and so I hope to encourage others to blog as well. It is a nice venue to share work and opinions for an academic, as it is more flexible and can be less formal than articles. Also much of what I write about I would just consider helpful tips or generic discussion that I wouldn’t get to discuss otherwise (SPSS programming and graph tips will never make it into a publication). One of my main motivations was actually R-Bloggers and the SAS blog roll; I would like a similarly active community for SPSS, but there is none really that I have found outside of the NABBLE forum (some exceptions are Andy Field, The Analysis Factor, Jon Peck and these few posts by a Louis K that I only found through the labyrinth that is the IBM developerworks site (note I think you need to be signed in to even see that site), but they certainly aren’t very active and/or don’t write much about SPSS). I assume the best way to remedy that is to lead by example! Most of my more popular posts are ones about SPSS, and I frequently get web-traffic via general google searches of SPSS + something else I blogged about (hacking the template and comparing continuous distributions are my two top posts).

The blog is also just another place to highlight my academic work and bring more attention to it. WordPress tells me how often someone clicks a link on the blog, and someone has clicked the link to my CV close to 40 times since I’ve made the blog. Hopefully I have some pre-print journal articles to share on the blog in the near future (as well as my prospectus). My post on my presentation at ASC did not generate much traffic, but I would love to see a similar trend for other criminologists/criminal justicians in the future. My work isn’t perfect for sure, but why not get it out there at least for it to be judged and hopefully get feedback.

I would like to blog more, and I actively try to write something if I haven’t in a few weeks, but I don’t stress about it too much. I certainly have an infinite pool of posts to write about programming and generating graphs in SPSS. I have also thought about talking about historical graphics in criminology and criminal justice, or generally talking about some historical and contemporary crime mapping work. Other potential posts I’d like to write are a more formal treatment of why I loathe most difference-in-differences designs, and perhaps one about the silliness that can ensue when using null-hypothesis significance testing to determine racial bias. But they will both take more careful elaboration, so they might not be coming anytime soon.

So in short, SPSSers, crime mappers, criminologists/criminal justicians, I want you to start blogging, and I will eagerly consume your work (and in the meantime hopefully produce some more useful stuff on my end)!

CJ blog watch! Any ones I’m missing?

I follow a lot of blogs. Although I don’t personally write a lot about criminology or criminal justice related matters (maybe in the future when I have more time or inclination), I figured I would share some of my favorites and query the crowd for more recommendations.

So a few with general discussion related to criminology and criminal justice matters are;

Both sites are written by well known criminologists/criminal justicians. I am aware of a few blogs written by current/former police chiefs;

  • Tom Casady’s The Director’s Desk. Tom Casady is currently the director of public safety for Lincoln, Nebraska and was previously the Police Chief at Lincoln’s department for quite some time. Tom is also very active in a variety of criminology/criminal justice organizations (so if you go to a related conference there is a good chance he is around somewhere!)
  • Chief’s Blog by Chief Ramsay of the Duluth Police Dept in Minnesota.

There are also a few that are highly focused on crime mapping & analysis;

  • Location Based Policing by Drew Dasher. He is a crime analyst for the Lincoln Nebraska PD.
  • Saferview – crime, fear and mapping: A blog by a retired police officer who is a student at University College London.
  • Diego Valle-Jones: Although his blog has a wider variety of topics, he has a series of very detailed posts and analysis on violence in Mexico and Central American nations. I know crime stats are frequent fodder for generic statistical demonstrations, but this is really insightful analysis. My favorite is his investigation into the validity of homicide data statistics.

Are there others I am missing out on or should know about? Let me know in the comments if you have other suggestions.

FYI – the title of the blog post was motivated by Hans Toch’s new book, Cop Watch.

Connect with me!

In taking the advice of this question on the Academia stack exchange site, Is web-presence important for researchers?, I’ve created this blog and joined several other social networking sites. If you are interested in my work, or think I would be interested in yours, feel free to connect with me in any of these following venues.

Are there any other sites I should be joining? Let me know in the comments.