A bunch of random shout outs

Busy, busy, busy! Hopefully I will have some time in the near future to write up some more data science posts. But for now, here is a small Python snippet to help you build interaction variables between two sets of NumPy arrays/dataframes.

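import numpy as np

def np_int(a, b):
    # a is (n, p) and b is (n, q); both must be 2-D with the same number of rows
    rows = a.shape[0]
    cols = a.shape[1] * b.shape[1]
    # Row-wise outer product gives (n, p, q), which is flattened to (n, p*q)
    return np.einsum('ij,ik->ijk', a, b).reshape((rows, cols))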
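To make the column ordering concrete, here is a small worked example. The input arrays are made-up toy values, not from any real dataset, and the function is repeated so the block runs on its own. Each output row holds every pairwise product of that row's columns in a and b, with the columns of a varying slowest.

```python
import numpy as np

def np_int(a, b):
    rows = a.shape[0]
    cols = a.shape[1] * b.shape[1]
    return np.einsum('ij,ik->ijk', a, b).reshape((rows, cols))

# Toy inputs (hypothetical): 2 rows, a has 2 columns, b has 3 columns
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20, 30],
              [40, 50, 60]])

res = np_int(a, b)
print(res.shape)  # (2, 6)
print(res)
# [[ 10  20  30  20  40  60]
#  [120 150 180 160 200 240]]

# Same result built the slow way: each column of a times all columns of b
manual = np.hstack([a[:, [j]] * b for j in range(a.shape[1])])
print(np.array_equal(res, manual))  # True
```

So the first three output columns are the first column of a times each column of b, and the next three are the second column of a times each column of b.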

This works for PyTorch as well (just replace np.einsum with torch.einsum). So coming up (eventually) I will illustrate encoding interactions between hidden layers in a deep learning model. But for now some quicker updates.

Shout out #1: Scott Jacques has continued to lead the charge for open access to criminology journals. He has two recent posts about post-prints, and how our main journal (Criminology) has an excessive policy of not allowing authors to share post-prints for over two years (whereas the majority of criminology journals allow you to post immediately).

Several aspects of open science are tricky – posting pre-prints/post-prints is not. If we can come together as a group this is an easy, no cost way to greatly improve the accessibility of our work to the greater public.

Shout out #2: The folks at Police Rewired have hosted a hackathon intended to Hack Hate. It is too late to participate, but they will be displaying the results this Sunday. I have not had the chance to participate in any code hackathons, so I will need to make a concerted effort in the future to give at least one a shot. (It seems hard: how can you do any work in only a day or a week or two!? But the proof is in the pudding so to speak, as I’ve seen some pretty cool things come out of various hackathons in the past.)

Shout out #3: My workplace, HMS, is involved in a data sharing collaborative called the Digital Health DRC. They also have a hackathon coming up, but this one is related to telehealth use. The Digital Health DRC is pretty cool though; it is basically a way for HMS (and several other private sector entities) to share various datasets with researchers across the globe.

The scope of HMS’s data is somewhat outside the realm of my old stomping grounds of criminology (but not entirely, a big part of my job is identifying potentially fraudulent patterns in claims data). But for folks who have a research question that could be answered using health insurance claims data, this is a good resource to look into. (HMS has pretty good coverage of Medicare claims across the US.)

Finally, I experimented for a few days with hosting ads on the site. I managed to serve up a few thousand and make 10 cents. So I will turn that off for now. I debated putting up a button for folks to donate a coffee, but even that is not necessary. (I can afford the few bucks for the domain, and I use Dropbox to back up my files anyway, so hosting extra materials is not a big deal.) I would rather folks just take my nerdy notes and make their own cool stuff (and share it with me!). I may need to figure out a better hosting solution for images though, as I see Google Photos is continuing to give me trouble (so if you see an image is not coming through, feel free to let me know in the comments or send me an email).

300 blog posts and public good criminology

This isn’t technically my 300th blog post, but the 300th page I’ve constructed on my blog (so e.g. it includes when I’ve made a page for a class). I’ve posted a spreadsheet of the titles and dates of the posts over time (and updating it I noticed I was at 300).

I typically get around 200 to 300 views per day. Most of these are probably bots, but unless, say, over 90% are bots, this website gets way more views than all my academic papers combined. Here is a screenshot of the stats WordPress gives to me. I thought my downtick in 2019 was going to spiral into very few views, but it is still holding on.

I kind of have three different types of blog posts. The first is example code snippets/data analysis. Often these are things I have done multiple times, so I want to create a record that I can more easily look up later. For example, making a hexbin map in ggplot, or a margins plot in Stata. I wrote a recent post because I was talking with a friend about crime weights, and I wanted an example of using regression in Python and an error bar plot for my library. (Quite a few birds with that stone.)

The second is questions I repeatedly get from students. For example, I made a list of demographic variables I use from the census, and where to find or scrape crime generator variables. Consistently my most popular post is testing the equality of two regression coefficients.

The third is just more generic opinion pieces. For example, my notes on (the now late) David Bayley’s writing on the potential of police to reduce crime, or Jane Jacobs’s take on neighborhoods, or that I don’t think latent trajectories are real things.

Some combine several of these categories, particularly opinion pieces with example code snippets to illustrate the points I am making. Like a simulation of why I like to model individual delinquency items, or how to balance false positives in bail decisions.

On Public Good Criminology

None of these per se fit in the framework of typical peer-reviewed output. But despite the lack of peer review, I think things like deriving optimal treatment allocation with network spillovers, or showing that conformal prediction intervals for synthetic control estimates are much smaller than those from permutation tests, are substantive contributions to share!

So that brings me to the public good point. Most criminologists have a default of only valuing a closed peer review system. Despite my blog posts not being peer reviewed (ditto for the pre-prints I post at first), I hope folks can take the time to judge for themselves whether they are valuable or not. We would be much better off as a group if we did things like share code, share class preps, or failed projects by default.

Some of these posts I might write up if we had a short journal for our field akin to Economics Letters, but even that is a lot of work for very little value added to be frank. (If I had infinite time I also might turn my notes on Poisson/Negative Binomial regression into a little Sage green book.) Being a private sector data scientist now without the tenure boot on my neck, I don’t really have any need or desire to go through that process.

If all you value is getting the opinions of a handful of other academics, then by all means keep your work close to the chest and only publish in peer reviewed journals. If you want to provide a public good though, your work actually needs to be public.

Why I publish preprints

I encourage peers to publish preprint articles, that is, journal articles posted before they go through the whole peer review process and are published. It isn’t normative in our field, and I’ve gotten some pushback from colleagues, so I figured I would put on paper why I think it is a good idea. In short, the benefits (increased exposure) outweigh the minimal costs of doing so.

The good — getting your work out there

The main benefit of posting preprints is to get your work more exposure. This occurs in two ways. The first is that traditional peer-reviewed work is often behind paywalls. This prevents the majority of non-academics from accessing your work, and in some cases it prevents other academics from reading it as well. So while the prior blog post I linked by Laura Huey notes that you can get access to some journals through your local library, it takes several steps. Every step you add loses some folks who don’t want to spend the time. Even through my university it is not uncommon for me to be unable to access a journal article. I can technically take the step of getting the article through inter-library loan, but that takes more time, which I am not going to spend unless I really want to see the contents of the article.

This I consider a minor benefit. Ultimately if you want your academic work to be more influential in the field you need to write about your work in non-academic outlets (like magazines and newspapers) and present it directly to CJ practitioner audiences. But there are a few CJ folks who read journal articles you are missing, as well as a few academics who are missing your work because of that paywall.

A bigger benefit is actually that you get your work out much quicker. The academic publishing cycle makes it impossible to publish your work in a timely fashion. If you are lucky, once your paper is finished, it will be published in six months. More realistically it will be a year before it is published online in our field (my linked article only considers when it is accepted, tack on another month or two to go through copy-editing).

Honestly, I publish preprints because I get really frustrated with waiting on peer review. No offense to my peers, but I do good work that I want others to read; I do not need a stamp from three anonymous reviewers to validate my work. I would need to do an experiment to know for sure (having a preprint might displace some views/downloads from the published version), but I believe the earlier and open versions on average double the amount of exposure my papers would have had compared to just publishing in traditional journals. It is likely a much different audience than traditional academic crim people, but that is a good thing.

But even without that extra exposure I would still post preprints, because it makes me happy to self-publish my work when it is at the finish line, in what can be a miserably long and very much delayed gratification process otherwise.

The potential downsides

Besides the actual time cost of posting a preprint (I will detail that more precisely in the next section; it isn’t much work), I will go through several common arguments for why posting preprints is a bad idea. I don’t believe they carry much weight, and I have not personally experienced any of them.

What if I am wrong: Typically I only post a paper either when I am giving a talk on it, or when it is ready to go out for peer review. So I don’t encourage posting really early versions of work. While even at this stage there is never any guarantee you did not make a big mistake (I make mistakes all the time!), the sky will not fall if you post a preprint that is wrong. Just take it down if you feel it is a net negative to the scholarly literature (which is very hard to be; the results of hypothesis tests do not make the work a net positive/negative). If you think it is good enough to send out for peer review, it is definitely at the stage where you can share the preprint.

What if the content changes after peer review: My experience with peer review is that the requests are mostly pedantic: lit. review/framing complaints, do some robustness checks for the analysis, beef up the discussion. I have never had a substantive interpretation change after peer review. Even if it did, you can just update the preprint with the new results. While this could be bad (an early finding gets picked up that is later invalidated), this is again something very rare and a risk I am willing to take.

Note peer review is not infallible, and so hedging that peer review will catch your mistakes is mostly a false expectation. Peer review does not spin your work into gold, you have to do that yourself.

My ideas may get scooped: This I have never personally had happen to me. Posting a preprint can actually prevent this in terms of more direct plagiarism, as you have a time-stamped example of your work. In terms of someone taking your idea and rewriting it, this is a potential risk (the same risk as if you present at a conference), and really only applicable for folks working on secondary data analysis. With the preprint available, the other person should at least cite your work, but sorry, neither presenting some work nor posting a preprint gives you sole ownership of an idea.

Journals will view preprints negatively: Or journals do not allow preprints. I haven’t come across a journal in our field that forbids preprints. I’ve had one reviewer (out of likely 100+ at this point) note as a negative that the preprint was posted (suggesting I was double publishing or plagiarizing my own work). An editor who actually reads reviews should know that is not a substantive critique. That was likely just a dinosaur reviewer who wasn’t familiar with the idea of preprints (and they gave an overall positive review in that one case, so it did not get the paper axed). If you are concerned about this, just email the editor for feedback, but I’ve never had a problem from editors.

Peer reviewers will know who I am: This I admit is a known unknown. Peer review in our crim/cj journals is mostly double blind (most geography and statistics journals I have reviewed for are not, so I know who the authors are). If you presented the work at a conference you have already given up anonymity, and the field is also small enough that for a good chunk of work the reviewers can guess who the author is anyway. So your anonymity is often a moot point at the peer review stage.

So I don’t know how much reviewers are biased if they know who you are (it can work both ways, if you get a friend they may be more apt to give a nicer review). It likely can make a small difference at the margins, but again I personally don’t think the minor risk/cost outweighs the benefits.

These negatives are no doubt real, but again I personally find them minor enough risks to not outweigh the benefits of posting preprints.

The not hard work of actually posting preprints

All posting a preprint involves is uploading a PDF file of your work to either your website or a public hosting service. In my current workflow, I keep the different components of a journal article in several Word documents (I don’t use LaTeX very often, and Word doesn’t work so well with one big file, especially one with many pictures). I then export those components to PDF files and stitch them together using a freeware tool, PDFtk. It has both a GUI and a command line, so I just have a bat file in my paper directory that contains something like:

pdftk.exe TitlePage.pdf MainPaper.pdf TablesGraphs.pdf Appendix.pdf cat output CombinedPaper.pdf

So just a double click to update the combined pdf when I edit the different components.
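For folks who would rather not use a bat file, a rough Python equivalent might look like the sketch below. The helper names build_pdftk_command and combine are made up for illustration, and the sketch assumes PDFtk is installed and callable as pdftk on your PATH. It only builds and prints the same command as above; the last line shows how you would actually run it.

```python
import subprocess

def build_pdftk_command(parts, output, exe="pdftk"):
    """Return the argument list that concatenates the PDFs in order."""
    return [exe, *parts, "cat", "output", output]

def combine(parts, output, exe="pdftk"):
    # Assumes pdftk is installed and on the PATH; raises if pdftk fails
    subprocess.run(build_pdftk_command(parts, output, exe), check=True)

# Same component files as in the bat file example
parts = ["TitlePage.pdf", "MainPaper.pdf", "TablesGraphs.pdf", "Appendix.pdf"]
cmd = build_pdftk_command(parts, "CombinedPaper.pdf")
print(" ".join(cmd))

# To actually write the combined file, uncomment:
# combine(parts, "CombinedPaper.pdf")
```

Keeping the command construction separate from the call that runs it makes it easy to check the file order before anything gets overwritten.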

Public hosting services I have used in the past to post preprints are Academia.edu, SSRN, and SocArXiv, although again you could just post the PDF on your webpage (and Google Scholar will eventually pick it up). I use SocArXiv now, as SSRN currently makes you sign up for an account to download PDFs (again a hurdle, the same as going through inter-library loan). Academia.edu also makes you sign up for an account, and has weird terms of service.

Here is an example paper of mine on SocArXiv. (Note the total downloads, most of my published journal articles have fewer than half that many downloads.) SocArXiv also does not bother my co-authors to create an account when I upload a paper. If we had a more criminal justice focused depository I would use that, but SocArXiv is fine.

There are other components of open science I should write about — such as replication materials/sharing data, and open peer reviewed journals, but I will leave those to another blog post. Posting preprints takes very little extra work compared to what academics are currently doing, so I hope more people in our field start doing it.