Reproducible research and code review for journals

Recently came across two different groups broaching the subject of code reviews and reproducible research more broadly for criminal justice. There are certainly aspects of either that make it difficult in the context of peer review. But I am not one to let the perfect be the enemy of the good, so I will layout the difficulties and give some comments on potential good enough solutions that still make marked improvements on the current state of affairs in crim/cj research.

Reproducible Research

So what do I mean by reproducible research? Jeromy Anglim on crossvalidated has a good breakdown on different ways we may apply the term. So to some it may mean if you did a hot spots policing experiment, can I replicate the same crime reduction results in another city.

These are important to publish (simply because social science experiments will inevitably have quite a bit of variance), but this is often not what we are talking about when we talk about replication. We are often talking about a much smaller in scope goal – if I give you the exact same data, can you reproduce the tables/figures in the manuscript you used to make your inferences?

One problem that is often the case with CJ research is that we are working with sensitive data. If I do analysis on a survey of a sensitive topic, I often cannot share the data. But, I do not believe that should entirely put a spike in the question of reproducible data. I have broken down different levels that are possible in making research more reproducible:

  1. A Sharing data and code files to reproduce the paper results
  2. B Sharing code files and simulated data that illustrate the results
  3. C Sharing the plain-text log files showing the code and results of tables/figures

So I have not seen C proposed anywhere, but it is a dead simple solution that almost everyone should be able to accommodate. It simply involves typing log using "output.txt", text at the top of your Stata file, or OUTPUT EXPORT /PDF DOCUMENTFILE="output.pdf" at the end of your SPSS analysis (or could be done via the GUI), etc. These are the log/output files used to generate the results you report in the paper, and typically contain both the commands run, as well as the resulting tables. These files can quite easily not contain privileged information (in fact they won’t be default most of the time, unless you printed out individual names in a table for example in intermediate results).

To accomplish C does take some modicum of wherewithal in terms of writing code, but it is a pretty low bar. So I see no reason why all quantitative analyses cannot require at least this step right now. I realize it is not foolproof – a bad actor could go and edit the results (same as they could edit the results without this information). But it ups the level of effort to manipulate results by quite a bit, and more importantly has the potential to catch more mundane transcription errors that occur quite frequently.

Sometimes I want more details on the code used, the nature of the data etc. (Most quasi-experimental design for example can be summed up as shape your data in a special way and run a particular regression model.) For people like me who care about that, B helps with that, in that I can see the code front-to-back, can actually go and inspect the shape and values in a particular rectangular dataset, and see how the code interacts with those objects. The only full on example of this I am aware of is a recent example paper in Nature Behavior that shares the code using simulated data.

B is also very similar to people who release statistical packages to reproduce their code. So if you release an R package that conducts your new fancy technique, even if you can’t share your data it is really good for people to be able to view the underlying code even by itself to understand the technique better and in conjunction build on your work more. If you do a new technique, it is a crazy ton of work to replicate that on your own, so most people will not bother.

A is most of the way there to the gold standard – if you can share both the data and the code used to reproduce the analysis. Both A and B take a significant amount of knowledge of statistical programming to accomplish. Most people in our field do not have the skills to write an analysis front-to-back that can run in a series of scripts though. To get to A/B grad programs in crim/cj need to spend crazy more time on teaching these skills, which is near zero now almost across the board.

One brief thing to mention about A is that the boundary is difficult to define. So for example, I share code to reproduce analysis in my 311 and crime at micro places in DC paper (paper link, code). But this starts from a dataset that has the street units in DC and all of the covariates already compiled. But where did that dataset come from? I created it by compiling many different sources, so the base dataset is itself very difficult to replicate. Again not letting the perfect be the enemy of the good, I think just starting from your compiled dataset, and replicating the tables/graphs in the manuscript is better than letting the fuzzy boundary prevent you from sharing anything.

Code Reviews for Journal Submissions

The hardest part of A is that even after you share your data, some journals want to be able to run the code locally to entirely reproduce your results. So while I have shared data code (A above) for many papers, see this spreadsheet, they have not been externally vetted by any of those journals. This vetting is the standard in some economic journals now I believe, and would not be surprised in some poli-sci journals as well. This is a very hard problem though, and requires significant resources from both the journal and the researcher to be able to do that.

The biggest hurdle is that even if you share your data/code, your particular system may be idiosyncratic. You may have different R libraries installed than me. You may have different versions of python packages. I may have used a program on Windows to do some analysis you cannot do on a Mac. You may rely on some paid API I cannot access.

These are often solvable problems, but take quite a bit of time to work out. A comparable example to my work is when data scientists say ‘going to production’. This often involves taking some analysis I did on my local machine, and making it run autonomously on my companies servers. There are some things that make it more or less difficult than the typical academic situation, but I think it is broadly comparable. To go to production for a project will typically take me 3-6 months at 50% of my time, so maybe something like 300 hours for a lowish end estimate. And that is just the time it takes from the researchers end, from the journals end it will also take a significant amount of time to compile every ones code and verify the results.

Because of this, I don’t think the fully reproducible re-run my code and generate the exact same tables are feasible in the current way we do academic research and peer review. But again that is why I list C above – we shouldn’t let the perfect be the enemy of the good.

Validating New Empirical Techniques

The code review above is not really code review in the sense that someone looks at your code and says this is correct, it is simply just saying can I get the same results as you. You may want peer review to accomplish the task of not only saying is it reproducible, but is it valid/correct? There are a few things towards this end I would like to see more often in crim/cj. I realize we are not statistics, so cannot often ask for formal proofs. But there are simpler things we can do to verify the results. These are the responsibility of the researcher to provide, not the reviewer to script up on their own to validate someone elses work.

One, illustrate the technique using a very simplified example. So for instance, in my p-median patrol areas paper, I show an example of constructing the linear program with only four areas. You should be able to calculate what the result should be by hand, so can verify the correctness of your algorithm. This has the added benefit of being a very good pedagogical way to describe your method.

Two, illustrate the technique on a larger sample of simulated data in which you again know the correct result. For one example of this, I showed how to estimate group based trajectory models using deep learning libraries. Again your model/method should be able to recover the correct result (which you know) given the simulated fake data.

Three, validate the result using real data compared to the current standard. For crime mapping papers, this means comparing forecasts compared to RTM, or simpler regression models, or simply prior crime = future crime on out of sample data. Amazingly many machine learning papers in CJ do not do out of sample predictions. If it is an inferential procedure, comparing the results to some other status quo technique is similar, such as showing conformal prediction intervals have smaller widths (so more statistical power) than placebo results for synthetic control designs (at least for that example with state panel level crime data).

You may not have all three of these examples in any particular paper, but I think for very new techniques 1 or 2 is necessary. 3 is often a by-product on the analysis anyway. So I do not believe any of these asks are that onerous. If you have the skills to create some new technique, you should be able to accomplish 1 or 2.

I do not have any special advice in terms of the reviewers perspective. When I do code reviews at work, what we do is go line by line, and my co-workers give high level design advice. E.g. you should use a config file for this instead of defining it inline, you should turn this block into a function, you should make a class to open/close the database connections etc. The code reviews do not validate the technical correctness, so if I queried the wrong data they wouldn’t know in the code review. The proof is in the pudding so to speak, so if my results are performing really badly in the real world I know I am doing something wrong. (And the obverse, if my results are on the mark and making money I am pretty sure I did nothing terribly wrong.)

Because there are not these real world mechanisms to validate code in peer reviewed papers, my suggestions for 1/2/3 are the closest I think we can get in many circumstances. That and simply making your code available will dramatically improve the reproducibility and validity of your research compared to the current status quo in our field.

Publishing in Peer Review?

I am close, but not quite, entirely finished with my current crim/cj peer reviewed papers. Only one paper hangs on, the CCTV clearance paper (with Yeondae Jung). Rejected twice so far (once on R&R from Justice Quarterly), and has been under review in toto around a year and a half so far. It will land somewhere eventually, but who knows where at this point. (The other pre-prints I have on my CV but are not in peer review journals I am not actively seeking to publish anymore.)

Given the typical lags in the peer review process, if you look at my CV I will appear active in terms of publishing in 2020 (6 papers) and 2021 (4 papers and a book). But I have not worked on any peer review paper in earnest since I started working at HMS in December 2019, only copy-editing things I had already produced. (Which still takes a bit of work, for example my Cost of Crime hot spots paper took around 40 hours to respond to reviewers.)

At this point I am not sure if I will pursue any more peer reviewed publications directly in criminology/criminal justice. (Maybe as part of a team in giving support, but not as the lead.) Also we have discussed at my workplace pursuing publications, but that will be in healthcare related projects, not in Crim/CJ.

Part of the reason is that the time it takes to do a peer review publication is quite a bit relative to publishing a simple blog post. Take for instance my recent post on incorporating harm weights into the WDD test. I received the email question for this on Wednesday 11/18, thought about how to tackle the problem overnight, and wrote the blog post that following Thursday morning before my CrimCon presentation, (I took off work to attend the panel with no distractions). So took me around 3 hours in total. Many of my blog posts take somewhat longer, but I definitely do not take any more than 10-20 hours on an individual one (that includes the coding part, the writing part is mostly trivial).

I have attempted to guess as to the relative time it takes to do a peer reviewed publication based on my past work. I averaged around 5 publications per year, worked on average 50 hours a week while I was an academic, and spent something like I am guessing 60% to 80% (or more) of my time on peer review publications. Say I work 51 weeks a year (I definitely did not take any long vacations!, and definitely still put in my regular 50 hours over the summertime), that is 51*50=2550 hours. So that means around (2550*0.6)/5 ~ 300 or (2550*0.8)/5 ~ 400 so an estimate of 300 to 400 hours devoted to an individual peer review publication over my career. This will be high (as it absorbs things like grants I did not get), but is in the ballpark of what I would guess (I would have guessed 200+).

So this is an average. If I had recorded the time, I may have had a paper only take around 100 hours (I don’t think I could squeeze any out in less than that). I have definitely had some take over 400 hours! (My Mapping RTM using Machine Learning I easily spent over 200 hours just writing computer code, not to brag, it was mostly me being inefficient and chasing a few dead ends. But that is a normal part of the research process.)

So it is hard for me to say, OK here is a good blog post that took me 3 hours. Now I should go and spend another 300 to write a peer review publication. Some of that effort to publish in peer review journals is totally legitimate. For me to turn those blog posts into a peer review article I would need a more substantive real-life application (if not multiple real-life applications), and perhaps detailed simulations and comparisons to other techniques for the methods blog posts. But a bunch is just busy work – the front end lit review and answering petty questions from peer reviewers is a very big chunk of that 300 hours (and has very little value added).

My blog posts typically get many more views than my peer review papers do, so I have very little motivation to get the stamp of approval for peer review. So my blog posts take far less time, are more wide read, and likely more accessible than peer reviewed papers. Since I am not on the tenure track and do not get evaluated by peer reviewed publications anymore, there is not much motivation to continue them.

I do have additional ideas I would like to pursue. Fairness and efficiency in siting CCTV cameras is a big one on my mind. (I know how to do it, I just need to put in the work to do the analysis and write it up.) But again, it will likely take 300+ hours for me to finish that project. And I do not think anyone will even end up using it in the end – peer reviewed papers have very little impact on policy. So my time is probably better spent writing a few blog posts and playing video games with all the extra time.

If you are an editor reading this, I still do quite a few peer reviews (so feel free to send me those). I actually have more time to do those promptly since I am not hustling writing papers! I have actually debated on whether it is worth it to start my own peer reviewed journal, or maybe contribute to editing an already existing journals (just joined the JQC editorial board). Or maybe start writing my own crime analysis or methods text books. I think that would be a better use of my time at this point than pursuing individual publications.

Lit reviews are (almost) functionally worthless

The other day I got an email from ACJS about the most downloaded articles of the year for each of their journals. For The Journal of Criminal Justice Education it was a slightly older piece, How to write a literature review in 2012 by Andrew Denney & Richard Tewksbury, DT from here on. As you can guess by the title of my blog post, it is not my most favorite subject. I think it is actually an impossible task to give advice about how to write a literature review. The reason for this is that we have no objective standards by which to judge a literature review – whether one is good or bad is almost wholly subject to the discretion of the reader.

The DT article I don’t think per se gives bad advice. Use an outline? Golly I suggest students do that too! Be comprehensive in your lit review about covering all relevant work? Well who can argue with that!

I think an important distinction to make in the advice DT give is the distinction between functional actions and symbolic actions. Functional in this context means an action that makes the article better accomplish some specific function. So for example, if I say you should translate complicated regression models to more intuitive marginal effects to make your results more interpretable for readers, that has a clear function (improved readability).

Symbolic actions are those that are merely intended to act as a signal to the reader. So if the advice is along the lines of, you should do this to pass peer review, that is on its face symbolic. DT’s article is nearly 100% about taking symbolic actions to make peer reviewers happy. Most of the advice doesn’t actually improve the content of the manuscript (or in the most charitable interpretation how it improves the manuscript is at best implicit). In DT’s section Why is it important this focus on symbolic actions becomes pretty clear. Here is the first paragraph of that section:

Literature reviews are important for a number of reasons. Primarily, literature reviews force a writer to educate him/herself on as much information as possible pertaining to the topic chosen. This will both assist in the learning process, and it will also help make the writing as strong as possible by knowing what has/has not been both studied and established as knowledge in prior research. Second, literature reviews demonstrate to readers that the author has a firm understanding of the topic. This provides credibility to the author and integrity to the work’s overall argument. And, by reviewing and reporting on all prior literature, weaknesses and shortcomings of prior literature will become more apparent. This will not only assist in finding or arguing for the need for a particular research question to explore, but will also help in better forming the argument for why further research is needed. In this way, the literature review of a research report “foreshadows the researcher’s own study” (Berg, 2009, p. 388).

So the first argument, a lit review forces a writer to educate themselves, may offhand seem like a functional objective. It doesn’t make sense though, as lit. reviews are almost always written ex post research project. The point of writing a paper is not to educate yourself, but educate other people on your research findings. The symbolic motivation for this viewpoint becomes clear in DT’s second point, you need to demonstrate credibility to your readers. In terms of integrity if the advice in DT was ‘consider creating a pre-analysis plan’ or ‘release data and code files to replicate your results’ that would be functional advice. But no, it is important to wordsmith how smart you are so reviewers perceive your work as more credible.

Then the last point in the paragraph, articulating the need for a particular piece of research, is again a symbolic action in DT’s essay. You are arguing to peer reviewers about the need for a particular research question. I understand the spirit of this, but think back to what function does this serve? It is merely a signal to reviewers to say, given finite space in a journal, please publish my paper over some other paper, because my topic is more important.

You actually don’t need a literature review to demonstrate a topic is important and/or needed – you can typically articulate that in a sentence or two. For a paper I reviewed not too long ago on crime reductions resulting from CCTV installations in a European city, I was struck by another reviewers critique saying that the authors “never really motivate the study relative to the literature”. I don’t know about you, but the importance of that study seems pretty obvious to me. But yeah sure, go ahead and pad that citation list with a bunch of other studies looking at the same thing to make some peer reviewers happy. God forbid you simply cite a meta-analysis on prior CCTV studies and move onto better things.

What should a lit review accomplish?

So again I don’t think DT give bad advice – mostly vapid but not obviously bad. DT focus on symbolic actions in lit reviews because as lit reviews are currently performed in CJ/Crim journals, they are almost 100% symbolic. They serve almost no functional purpose other than as a signal to reviewers that you are part of the club. So DT give about the best advice possible navigating a series of arbitrary critiques with no clear standard.

As an example for this position that lit reviews accomplish practically nothing, conduct this personal experiment. The next peer review article you pick up, do not read the literature review section. Only read the abstract, and then the results and conclusion. Without having read the literature review, does this change the validity of a papers findings? It for the most part does not. People get feelings hurt by not being cited (including myself), but even if someone fails to cite some of my work that is related it pretty much never impacts the validity of that persons findings.

So DT give advice about how peer review works now. No doubt those symbolic actions are important to getting your paper published, even if they do not improve the actual quality of the manuscript in any clear way. I rather address the question about what I think a lit review should look like – not what you should do to placate three random people and the editor. So again I think the best way to think about this is via articulating specific functions a lit review accomplishes in terms of improving the manuscript.

Broadening the scope abit to consider the necessity of citations, the majority of citations in articles are perfunctory, but I don’t think people should plagiarize. So when you pull a very specific piece of information from a source, I think it is important to cite that work. Say you are using a survey instrument developed by someone else, citing the work that establishes that instruments reliability and validity, as well as the original population those measures were established on, is certainly useful information to the reader. Sources of information/measures, a recent piece saying the properties of your statistical model are I think other good examples of things to cite in your work. Unfortunately I cannot give a bright line here, I don’t cite Gauss every time I use the normal distribution. But if I am using a code library someone else developed that is important, inasmuch as that if someone wants to do a similar project they could use the same library.

In terms of discussing relevant results in prior studies, again the issue is the boundary of what is relevant is very difficult to articulate. If there is a relevant meta-analysis on a topic, it seems sufficient to me to simply state the results of the meta-analysis. Why do I think that is important though? It helps inform your priors about the current study. So if you say a meta-analysis effect size is X, and the current study has an effect size much larger, it may give you pause. It is also relevant if you are generalizing from the results of the study, it is just another piece of evidence in addition to the meta-analysis, not an island all by itself.

I am not saying discussing prior specific results are not needed entirely, but they do not need to be extensive. So if studies Z, Y, X are similar to yours but all had null results, and you think it was because the sample sizes were too small, that is relevant and useful information. (Again it changes your priors.) But it does not need to be belabored on in detail. The current standard of articulating different theoretical aspects ad-nauseum in Crim/CJ journals does not improve the quality of manuscripts. If you do a hot spots policing experiment, you do not need to review all the different minutia of general deterrence theory. Simply saying this experiment is likely to only accomplish general deterrence, not specific deterrence, seems sufficient to me personally.

When you propose a book you need to say ‘here are some relevant examples’ – I think the same idea would be sufficient for a lit review. OK here is my study, here are a few additional studies I think the reader may be interested in that are related. This accomplishes what contemporary lit reviews do in a much more efficient manner – citing more articles makes it much more difficult to pull out the really relevant related work. So admit this does not improve the quality of the current manuscript in a specific way, but helps the reader identify other sources of interest. (I as a reader typically go through the citation list and note a few articles I am interested in, this helps me accomplish that task much quicker.)

I’ve already sprinkled a few additional pieces of advice in this blog post (marginal effect estimates, pre-analysis plans, sharing data code), although you may say they don’t belong in the lit review. Whatever, those are things that actually improve either the content of the manuscript or the actual integrity of the research, not some spray paint on your flowers.

Relevant Other Work

CrimRxiv, Alt-Journal Contributions, and Mike Maltz’s Retrospective

As I’m sure followers of mine know, I am a big proponent of posting pre-prints. Spearheaded by Scott Jacques, he has started a specifically criminology focused pre-print server title CrimRxiv. It is still in beta but anyone can contribute a paper if they want.

One of the things me and Scott have been jamming about is how to leverage crimrxiv to make a journal that not only takes advantage of all the goodies on the internet, such as being able to embed interactive graphics or other rich media directly in a journal articles. But to really widen the scope of what ‘counts’ in terms of scholarly contribution. Why can’t things like a cool app, or a really good video lecture you edited, or a blog post illustrating code be put on the same level with journal articles?

Part of the reason I am writing this blog post is that I saw Michael Maltz recently publish a retrospective on his career on Academia.edu. This isn’t a typical journal article, but despite that there is no reason why you shouldn’t share such pieces. So I was able to convince Mike to post A Retrospective Look at My Professional Life to crimrxiv. When he first posted it on Academia.edu here was my response on how Mike (despite never having crossed paths) has influenced my career.


Hi Michael and thank you for sharing,

I’ve followed your work since a grad student at Albany. I initially got hooked on data viz based on Tufte’s book. When I looked for examples of criminologists discussing data viz you were the only one I found. That was sometime around 2010, so you had that chapter in the handbook of quantitative crim. You also had another article about drawing glyphs to illustrate life course transitions I was familiar with.

When I finished my classes at SUNY, I then worked at Troy as a crime analyst while finishing my dissertation. I doubt any of the coffee shops were the same from your time, but I did like walking over to Famous hotdogs for lunch every now and then.

Most of my work at the PD was making time series graphs and maps. No regression, so most of my stats training was not particularly useful. Even my mapping course I took focused on areal data analysis was not terribly relevant.

I tried to do similar projects to your glyph life-courses with interval censored crime data, but I was never really successful with that, they always ended up being too complicated with even moderately large crime datasets, see https://andrewpwheeler.com/2013/02/28/interval-graph-for-viz-temporal-overlap-in-crime-events/ and https://andrewpwheeler.com/2014/10/02/stacking-intervals/ for my attempts.

What was much more helpful was simply doing monitoring metrics over time, simple running means, and then I just inverted the PDF of the Poisson to give error bars, e.g. https://andrewpwheeler.com/2016/06/23/weekly-and-monthly-graphs-for-monitoring-crime-patterns-spss/. Then cases that were outside the error bands signified an anomalous pattern. In Troy there was an arrest of a single prolific person breaking into cars, and the trend went from a creeping 10 year high to a 10 year low instantly in those graphs.

So there again we have your work on the Poisson distribution and operations research in that JQC article. Also sometime in there I saw a comment you made on Andrew Gelman’s blog pointing to your work with error bands for BJS. Took that ‘fan chart’ idea later on and provided error bands for city level and USA level homicide trends, e.g. https://apwheele.github.io/MathPosts/FanChart_NewOrleans.html. Most of popular discussion of large scale crime trends is misguided over-interpreting short term noise in my opinion.

So all my degrees are in criminal justice, but I have been focusing more on linear programming over time borrowing from operations researchers as well, https://andrewpwheeler.com/2020/05/29/an-intro-to-linear-programming-for-criminologists/. I’ve found that taking outputs from a predictive model and then applying a decision analysis to specifically articulate strategies CJ agencies should take is much more fruitful than the typical way academic research is done.

Thank you again for sharing your story and best, Andy Wheeler

Some more peer review musings

It is academics favorite pastime to complain about peer review. There is no getting around it – peer review is degrading, both in the adjective sense of demoralizing, as well as in the verb ‘wear down a rock’ sense.

There is nothing quite like it in other professional spheres that I can tell. For example if I receive a code review at work, it is not adversarial in nature like peer review is. I think most peer reviewers treat the process like a prosecutor, poking at all the minutia that they can uncover, as opposed to being judges of the truth.

At work I also have a pretty unobjectional criteria for success – if my machine learning model is making money then it is good. Not so for academic papers. Despite everyone learning about the ‘scientific method’, academics don’t really have a coda to follow like that. And that lack of objective criteria causes quite a bit of friction in the peer review process.

But all the things I think are wrong with peer review I have a hard time articulating succinctly. So we have a bias problem, in that reviewers have preferences for particular methods or styles. We have a problem that individuals get judged based on highly arbitrary standards of what is interesting. Many critiques are focused on highly pedantic things that make little to no material difference, such as the use of personal pronouns. Style advice can be quite bad to be frank, in my career I’m up to something like four different peer reviews saying providing results in the introduction is inappropriate. Edit: And these complaints are not exhaustive as well, we have reviewers pile on endless complaints in multiple rounds, and people phone it in with nonsense short descriptions as well. (I’m sure I can continue to add to the list, but I’ve personally experienced all of these things, in addition to being called a racist in peer review.)

I’ve previously provided advice about how I think peer reviews should be done. To sum up my advice:

  • Differentiate between big problems and minor stuff
  • Be very specific what things you want changed (and how to change them)

I think I should add two to this as well – don’t be a jerk, and don’t pile on (or be respectful of peoples time). For the jerk part I don’t even meet that standard if I am being honest with myself. For one of my peer reviews I am going to share a later round I got pretty snippy (with the Urban folks on their CCTV paper in Milwaukee), that was over the top in retrospect. (I don’t think reviewers should get a say on revisions, editors should just arbiter whether the responses by the original authors were sufficient. I am pretty much never happy if I suggest something and an author says no thanks.)

For the pile on part, I recently posted my responses to my cost of crime hot spots paper. Although all three reviewers were positive, it still took me ~40 hours to respond to all of the critiques. Even though all of the reviews were in good faith, it honestly was not worth my time to make those revisions. (I think two were legitimate, changing the front end to say my hot spots are crime cost, not crime harm, and the ask for more details on the Hunt estimates. The rest were just fluff though.)

If you do consulting think about your rate – and whether addressing all those minor critiques meet the threshold of ‘is the paper improved by the amount to justify my consulting fee’. My experience it does not come close, and I am quite a cheap consultant.

So I have shared in the past a few examples of my response to reviewers (besides above, see here for the responses to my how to select participants for focussed deterrence paper, and here for my tables and graphs for crime analysis paper). But maybe instead of bagging on others, I should just try to lead by example.

So below are several of my recent reviews. I’ve only pulled out the recent ones that I know the paper has been published. This is then subject to a selection bias, papers I have more negative things to say are less likely to be published in the end. So keep that bias in mind when judging these.

The major components of how I believe I am different from the modal reviewer is I subdivide between major and minor concerns. Many reviewers take the running commentary through the paper approach, which not only does not distinguish between major/minor, but many critiques are often explicitly addressed (just at a different point in the manuscript from where it popped into the reviewers head).

The way I do my reviews I actually do the running commentary in the side of the paper, then I sleep on it, then I organize into the major sections. In this process of organizing I drop quite a bit of minor fluff, or things that are cleared up at later points in the paper. (I probably end up deleting ~50% of my original notes.)

Second, for papers I give a thumbs up for I take actual time to articulate why they are good. I don’t just give a complement sandwich and pile on a series of minor critiques. Positive comments are fleeting in peer review, but necessary to judge a piece.

So here are some of my examples from peer review, take them for what they are worth. I no doubt do not always follow my advice I lay out above, by try my best to.


Title: Going Local: Do Consent Decrees and Other Forms of Federal Intervention in Municipal Police Departments Reduce Police Killings?

The article is well written and a timely topic. The authors have a quote that I think sums it up quite nicely “This paper is therefore the first to evaluate the effects of civil rights investigations and the consent decree process on an important – arguably the most important – measure of use of force: death.” Well put.

The analysis is also well executed. It is observational, but city fixed effects and the event history study were the main things I was looking for.

I have a three major revision requests. But they are just editing the manuscript type stuff, so can be easily accomplished by the authors.

  1. Drop the analysis of Crime Rates

While I think the analysis of officer deaths is well done, the analysis of crime rates I have more doubts about. Front end is not focused on this at all, and would need a whole different section about it discussing recent work in Chicago/Baltimore/NYC. Also I don’t think the same analysis (fixed effects), is sufficient for crime trends – we are basically talking about the crime drop period. Also very important to assess heterogeneity in that analysis – a big part of the discussion is that Chicago/Baltimore are different than NYC.

The analysis of officer deaths is sufficient to stand on its own. I agree the crime rates question is important – save it for another paper where it can be given enough attention to do the topic justice.

  1. Conclusion

Like I said, I was most concerned about the event study, and the authors show that there was some evidence of pre-treatment selection trends. You don’t talk about the event study results in the conclusion though. It is a limitation that the event study analysis had less clear evidence of crime reductions than the simpler pre/post analysis.

This is likely due to larger error bars with the rare outcome, which is itself a limitation of using shootings as the outcome. I think it deserves to be mentioned even if overall the effects of consent decrees are not 100% clear on reducing officer involved deaths, locally monitoring more frequent outcomes is worthwhile. See the recent work by MacDonald/Braga in NYC. New Orleans is another example, https://journals.sagepub.com/doi/full/10.1177/1098611117709785.

I note the results are important to publish without respect to the results of the analysis. It would be important to publish this work even if it did not show reductions in officer involved deaths.

  1. Front end lit review

For 2.2 racial disparities, this section misses much of the recent work on the topic of officer involved shootings except for Greg Ridgeway’s article. There is a wide array of work that uses other counterfactual approaches to examine implicit bias, see Roland Fryer work (https://www.journals.uchicago.edu/doi/abs/10.1086/701423), or the Wheeler/Worrall papers on Dallas shoot/don’t shoot data (https://journals.sagepub.com/doi/full/10.1177/1525107118759900 & https://journals.sagepub.com/doi/full/10.1177/0011128718756038). These are the observational duel to the experimental lab work by Lois James. Also there are separate papers by Justin Nix (https://www.tandfonline.com/doi/full/10.1080/0735648X.2018.1547269), and Joseph Cesario (https://journals.sagepub.com/doi/10.1177/1948550618775108) that show when using different benchmarks estimates of disparity can vary by quite abit.

For the use of force review (section 2.3), people typically talk about situational factors that influence use of force (in addition to individual/extra-legal and organizational). So you may say “consent decrees don’t/can’t change situational behavior of offenders, so what is the point of writing about it” tis true, but it still should be articulated. To the extent that situational factors are a large determinant of shootings, it may be consent decrees are not the right way to reduce officer deaths then if it is all situational. But, consent decrees may indirectly effect police/citizen interactions (such as via de-escalation or procedural justice training), that could be a mechanism through which fewer officer deaths occur.

Long story short, 2.2 should be updated with more recent work on officer involved shootings and the benchmark problem, and 2.3 should be updated to include discussion of the importance of situational factors in the use of force.

Additional minor stuff:

  • pg 12, killings are not a proxy for use of force (they count as force!)
  • regression equations need some editing. Definitely need a log or exponential function on one of the sides of the equation, and generalized linear models do not have an error term like linear models do. I personally write them as:

log(E[crime_it]) = intercept + B1*X + …..

where E[crime_it] is the expected value of crime at place i and time t (equivalent to lambda in your current formulation).

  • pg 19 monitor misspelled (equation type)

Title: Immigration Enforcement, Crime and Demography: Evidence from the Legal Arizona Workers Act

Well done paper, uses current status quo empirical techniques to estimate the effect employment oversight for illegal immigrant workers had on subsequent crime reductions.

Every critique I thought of was systematically addresses in the paper already. Discussed issues with potential demographic spillovers (biasing estimates because controls may have crime increases). Eliminating states from the pool of placebos with stronger E-Verify support, and later on robustness checks for neighboring states. And using some simple analyses to give decompositions that would be expected due to the decrease in the share of young males.

Minor stuff

  • Light and Miller quote on pg 21 not sure if it is space/dash missing or just kerning problems in LaTex
  • Pg 26, you do the estimates for 08 and 09 separately, but I think you should pool 08-09 together in that table (the graphs do the visual tests for 08 & 09 independently). For example, Violent crimes are negative for both 08 & 09, which in the graphs are not outside the typical bounds for each individual year, but cumulatively that may be quite low (most will have a variable low and then high). This should get you more power I think given the few potential placebo tests. So it should be something like (-0.067 + -0.108) with error bars for violent crimes.
  • I had to stare at your change equation (pg 31) for quite a bit. I get the denominator of equation 2, I’m confused about the numerator (although it is correct). Here is how I did it (using your m1, m2, a, and X).

Pre-Crime = m1*aX + (1 - m1)X = X * (m1*a + 1 - m1)

Post-Crime = m2*aX + (1 - m2)X = X * (m2*a + 1 - m2) #so you can drop the X

% Change = (Post - Pre) / Pre = Post/Pre - 1

At least that is easier for me to wrap my head around. Also should m1 and m2 be the overall share of young adults? Not just limited to immigrants? (Since the estimated crime reduction is for everybody, not just crimes committed by immigrants?)


Title: How do close-circuit television cameras impact crimes and clearances? An evaluation of the Milwaukee Police Department’s public surveillance system

Well written paper. Uses appropriate quasi-experimental methods (matching and diff-in-diff) to evaluate reductions in crimes and potential increases in case clearances.

I have some minor requests on analysis/descriptive stats, but things that can be easily addressed by the authors.

First, I would request a simpler pre-post DiD table. The panel models not only introduce more complications of interpretation, but they are based on asymptotics that I’m not sure if they are met here.

So if you do something like:

              Pre Crime   Post Crime  Difference DiD
Treated      100          80              -20         -30
Control      100        110               10  

It is much more straightforward to interpret, and I provide a stat test and power analysis advice in https://crimesciencejournal.biomedcentral.com/articles/10.1186/s40163-018-0085-5. Your violent crime counts are very low, so I think you would need unrealistic effects (of say 50% reductions) to detect an effect with your group sizes.

You can do the same table for % clearances, and do whatever binomial test of proportions (which will have much more power than the regression equation). And again is simpler to interpret.

Technically doing a poisson regression with an exposure is not the same as modelling the clearance counts with total crimes as an exposure. The predicted PMF can technically go above 1 (so you can have %’s above 100%). It would be more appropriate to use binomial regression, so something like below in Stata:

glm arrests i.Treat i.Post Treat#Post i.Unit, family(binomial crimes) link(logit)

(No xtbinomial unfortunately, maybe can coerce xtlogit with weights to do it, or use meglm since you are doing random effects. I think you should probably do fixed effects here anyway.) I don’t know if it will make a difference in the results here though.

Andy Wheeler


Title: The formation of suspicion: A vignette study

Well done experimental study examining suspiciousness using a vignette approach. There is no power analysis done, but it is a pretty large sample, and only examines first order effects. (With a note about examining a well powered interaction effect described in the analysis section.)

My only big ask is that the analysis should include a dummy variable for the two different sampling frames (e.g. a dummy variable for 1=New York). Everything else is minor/easy stuff.

Minor stuff:

  • How were the vignettes randomized? (They obviously were, balance is really good!)
  • For the discussion, it is important to understand these characteristics that start an interaction because of another KT heuristic bias – anchoring effects. Paul Taylor has some recent work of interest on Dispatch priming that is relevent. (Also Dan Mears had a recent overview paper on biases/heuristics in Journal of Criminal Justice I also think should probably be cited.)
  • For Table 1 I would label the “Dependent Variable” with a more descriptive label (suspiciousness)
  • Also people typically code the variables 0/1 instead of 1/2, it actually won’t impact the analysis here, since it is just a linear shift of +1 (just change the intercept term). The variables of Agency Size, Education, & Experience are coded as ordinal variables though, and they should maybe be included as dummy variables for each category. (I don’t think this will make a big difference though for the main randomized variables of interest though.)

Title: The criminogenic effect of marijuana dispensaries in Denver, Colorado: A microsynthetic controls quasi-experiment and cost-benefit analysis

This work is excellent. It is a timely topic, as many states are still considering whether to legalize and what that will exactly look like.

The work is a replication/extension of a prior study also in Denver, but this is a stronger research design. Using the micro units allows for matching on pre-trends and drawing synthetic controls, which is a stronger design than prior pre/post work at neighborhood level (both in Denver and in LA). Micro units are also more relevant to test direct criminogenic effects of the stores. The authors may be interested in https://www.tandfonline.com/doi/full/10.1080/0735648X.2019.1582351, which is also a stronger research design using matched comparison groups, but is only for medical dispensaries in DC and is a much smaller sample.

Even if one considers that to be a minor contribution (crime increase findings are similar magnitude to Hughes JQ paper), the cost benefit analysis is a really important contribution. It hits on all of the important considerations – that even if the book of costs/benefits is balanced, they are relevant to really different segments of society. So even if tax revenue offsets the books, places may still not want to take on that extra crime burden.

Only two (very minor) suggestions. One, some of the permutation lines are outside of the figure boundaries. Two, I would like a brief ado in the discussion mentioning the trade-off in economic investment making places busier (which fits right into the current discussion of how costs-benefits). Likely if you added 100 ice-cream shops crime might go up due to increased commercial activity – weed has the same potential negative externality – but is not necessarily worse than say opening a bunch of bars or convenience stores. (Same thing is relevant for any BID, https://journals.sagepub.com/doi/full/10.1177/0011128719834559.)


Title: Understanding the Predictors of Street Robbery Hot Spots: A Matched Pairs Analysis and Systematic Social Observation

Note: Reviewed for Justice Quarterly and rejected. Final published version is at Crime & Delinquency (which I was not a reviewer for)

The article uses the case-control method to match high-crime to low-crime street segments, and then examine local land use factors (bars, convenience stores, etc.) as well as the more novel source of physical disorder coded from Google Street View images. The latter is the main motivation for the case-control design, as manual coding prevents one from doing the entire city.

Case-control designs by their nature you cannot manipulate the number of cases you have at your disposal. Thus the majority of such designs typically focus on ONE exposure of interest, and match on any other characteristic that is known to affect the outcome, but is not of direct interest to the study. E.g. if you examined lung cancer given exposure to vehicle emissions, you would need to match controls as to whether they smoked or not. This allows you to assess the exposure of interest with the maximum power given the design limitations, although you can’t say anything about smoking vs not smoking on the outcome.

The authors here match within neighborhoods and on several other local characteristics, but then go onto examine different crime generators (7 factors), as well as the two disorder coded variables. Given that this is a fairly small sample size of 129 matched cases, this is likely a pretty underpowered research design. Logistic regression relies on asymptotic properties, and even with fewer variables it is questionable whether 260 cases is sufficient, see https://www.tandfonline.com/doi/abs/10.1080/02664763.2017.1282441. Thus you get in abstract terms fairly large odds ratios, but are still not significant (e.g. physical disorder has an odds ratio of 1.7, but is insignificant). So you have low power to test all of those coefficients.

I believe a stronger research design would focus on the novel measures (the Google Street View disorder), and match on the crime generator variables. The crime generator factors have been well established in prior research, so that work is a small contribution. The front end focuses on the typical crime generator typology, and lacks any broader “so what” about the disorder measures. (Which can be made, focusing on broken windows theory and the controversy it has caused.)

It would be feasible for authors to match on the crime generators, but it would result in different control cases, so they would need to code additional locations. (If you do match on crime generators, I think it is OK to include 0 crime areas in the pool. Main reason it is sometimes not fair to include 0 crime areas is because they are things like a bridge or a park.)

Minor notes:

  • You cannot interpret the coefficients in causal terms on which you matched in the conclusion. (Top page 23.) It only says the extent to which your matching was successful, not anything about causality. Later on you also attempt to weasel out of causal interpretations (page 26). Saying this is not causality, but otherwise interpreting regression coefficients as if they have any meaning is an act of cognitive dissonance.
  • Given that there is perfect separation between convenience stores and hot spots, the model should have infinite standard errors for that factor. (You have a few coefficients that appear to have explosive standard errors.)
  • I wouldn’t describe as you match on the dependent variable [bottom page 8]. I realize it is confusing mixing propensity score terminology with case-control (although the method is fine). You match cases-to-controls on the independent variables you choose at the onset.
  • page 5, Dan O’Brien in his work has shown that you have super-callers for 311. Which fits right into your point of how coding of images may be better, as a single person can bias the measure at micro places. (This is sort of the opposite of not calling problem you mention.)
  • You may be interested, some folks have tried to automate the scoring part using computer vision, see https://ieeexplore.ieee.org/document/6910072?tp=&arnumber=6910072&url=http:%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6910072 or https://mikebader.net/projects/canvas-usa/ . George Mohler had a talk at ASC where he used Google’s automated image labeling to identify disorder (like graffiti) in pictures https://cloud.google.com/vision/docs/labels

 

The length it takes from submission to publication

The other day I received a positive comment about my housing demolition paper. It made me laugh abit inside — it felt like I finished that work so long ago it was talking about history. That paper was not so ancient though, I submitted it 8/4/17, went through one round of revision, and I got the email from Jean McGloin for conditional acceptance on 1/16/18. It then came online first a few months later (3/15/18), and is in the current print issue of JRCD, which came out in May 2018.

This ignores the time it takes from conception to finishing a project (we started the project sometime in 2015), but focusing just on the publishing process this is close to the best case scenario for the life-cycle of a paper through peer reviewed journals in criminology & criminal justice. The realist best case scenario typically is:

  • Submission
  • Wait 3 months for peer reviews
  • Get chance to revise-resubmit
  • Wait another 3 months for second round of reviews and editor final decision

So ignoring the time it takes for editors to make decisions and the time for you to turn around edits, you should not bank on a paper being accepted under 6 months. There are exceptions to this, some journals/editors don’t bother with the second three month wait period for reviewers to look at your revisions (which I think is the correct way to do it), and sometimes you will get reviews back faster or slower than three months, but that realist scenario is the norm for most journals in the CJ/Crim field. Things that make this process much slower (multiple rounds of revisions, editors taking time to make decisions, time it takes to make extensive revisions), are much more common than things that can make it go shorter (I’ve only heard myths about a uniform accept on the first round without revisions).

Not having tenure this is something that is on my mind. It is a bit of a rat race trying to publish all the papers expected of you, and due to the length of peer review times you essentially need to have your articles out and under review well before your tenure deadline is up. The six month lag is the best case scenario in which your paper is accepted at the first journal you submit to. The top journals are uber competitive though, so you often have to go through that process multiple times due to rejections.

So to measure that time I took my papers, including those not published, to see what this life-cycle time is. If I only included those that were published it would bias the results to make the time look shorter. Here I measured the time it took from submission of the original article until when I received the email of the paper being accepted or conditionally accepted. So I don’t consider the lag time at the end with copy-editing and publishing online, nor do I consider up front time from conception of the project or writing the paper. Also I include three papers that I am not shopping around anymore, and censored them at the date of the last reject. For articles still under review I censored them at 5/9/18.

So first, for 25 of my papers that have received one editorial decision, here is a graph of the typical number of rejects I get for each paper. A 0 for a paper means it was published at the first journal I submitted to, a 1 means I had one reject and was accepted at the second journal I submitted the paper to, etc. (I use "I" but this includes papers I am co-author on as well.) The Y axis shows the total percentage, and the label for each bar shows the total N.

So the proportion of papers of mine that are accepted on the first round is 28%, and I have a mean of 1.6 rejections per article. This does not take into account censoring (not sure how to for this estimate), and that biases the estimate of rejects per paper downward here, as it includes some articles under review now that will surely be rejected at some point after writing this blog post.

The papers with multiple rejects run the typical gamut of why academic papers are sometimes hard to publish. Null results, a hostile reviewer at multiple places, controversial findings. It also illustrates that peer review is not necessarily a beacon showing the absolute truth of an article. I’m pretty sure everything I’ve published, even papers accepted at the first venue, have had one reviewer with negative comments. You could find reasons to reject the findings of anything I write that has been peer reviewed — same as you can think many of my pre-print articles are correct or useful even though they do not currently have a peer review stamp of approval.

Most of those rejections add about three months to the life-cycle, but some can be fast (these include desk rejections), and some can be slower (rejections on later rounds of revisions). So using those begin times, end times, and taking into account censoring, I can estimate the typical survival time of my papers within the peer-review system when lumping all of those different factors together into the total time. Here is the 1 - survival chart, so can be interpreted as the number of days until publication. This includes 26 papers (one more that has not had a first decision), so this estimate does account for papers that are censored.

The Kaplan-Meier estimate of the median survival times for my papers is 290 days. So if you want a 50% chance of your article being published, you should expect 10 months based on my experience. The data is too sparse to estimate extreme quantiles, but say I want an over 80% probability of an article being published based on this data, how much time do I need? The estimate based on this data is at least 460 days.

Different strategies will produce different outcomes — so my paper survival times may not generalize to yours, but I think that estimate will be pretty reasonable for most folks in Crim/CJ. I try to match papers to journals that I think are the best fit (so I don’t submit everything to Criminology or Justice Quarterly at the first go), so I have a decent percent of papers that land on the first round. If I submitted first round to more mediocre journals overall my survival times would be faster. But even many mid-tiered journals in our field have overall acceptance rates below 10%, nothing I submit I ever think is really a slam dunk sure thing, so I don’t think my overall strategy is the biggest factor. Some of that survival time is my fault and includes time editing the article in between rejects and revise-resubmits, but the vast majority of this is simply waiting on reviewers.

So the sobering truth for those of us without tenure is that based on my estimates you need to have your journal articles out of the door well over a year before you go up for review to really ensure that your work is published. I have a non-trivial chunk of my work (near 20%) that has taken over one and a half years to publish. Folks currently getting their PhD it is the same pressure really, since to land a tenure track job you need to have publications as well. (It is actually one I think reasonable argument to take a longer time writing your dissertation.) And that is just for the publishing part — that does not include actually writing the article or conducting the research. The nature of the system is very much delayed gratification in having your work finally published.

Here is a link to the data on survival times for my papers, as well as the SPSS code to reproduce the analysis.

Roadblocks in Buffalo update (plus more complaints about peer-review!)

I’ve updated the roadblocks in Buffalo manuscript due to a rejection and subsequent critiques. So be prepared about my complaints of the peer-review!

I’ve posted the original manuscript, reviews and a line-by-line response here. This was reviewed at Policing: An International Journal of Police Strategies & Management. I should probably always do this, but I felt compelled to post this review by the comically negative reviewer 1 (worthy of an article on The Allium).

The comment of reviewer 1 that really prompted me to even bother writing a response was the critique of the maps. I spend alot of time on making my figures nice and understandable. I’m all ears if you think they can be improved, but best be prepared for my response if you critique something silly.

So here is the figure in question – spot anything wrong?

The reviewer stated it did not have legend, so it does not meet "GIS standards". The lack of a legend is intentional. When you open google maps do they have a legend? Nope! It is a positive thing to make a graphic simple enough that it does not need a legend. This particular map only has three elements: the outline of Buffalo, the streets, and the points where the roadblocks took place. There is no need to make a little box illustrating these three things – they are obvious. The title is sufficient to know what you are looking at.

Reviewer 2 was more even keeled. The only thing I would consider a large problem in their review was they did not think we matched comparable control areas. If true I agree it is a big deal, but I’m not quite sure why they thought this (sure balance wasn’t perfect, but it is pretty close across a dozen variables). I wouldn’t release the paper if I thought the control areas were not reasonable.

Besides arbitrary complaints about the literature review this is probably the most frustrating thing about peer-reviews. Often you will get a list of two dozens complaints, with most being minor and fixable in a sentence (if not entirely arbitrary), but the article will still be rejected. People have different internal thresholds for what is or is not publishable. I’m on the end that even with worts most of the work I review should still be published (or at least the authors given a chance to respond). Of the 10 papers I’ve reviewed, my record is 5 revise-and-resubmits, 4 conditional accepts, and 1 rejection. One of the revise-and-resubmits I gave a pretty large critique of (in that I didn’t think it was possible to improve the research design), but the other 4 would be easily changed to accept after addressing my concerns. So worst case scenario I’ve given the green light to 8/10 of the manuscripts I’ve reviewed.

Many reviewers are at the other end though. Sometimes comically so, in that given the critiques nothing would ever meet their standards. I might call it the golden-cow peer review standard.

Even though both of my manuscripts have been rejected from PSM, I do like their use of a rubric. This experience makes me wonder what if the reviewers did not give a final reject-accept decision – just the editors took the actual comments and made their own decision. Editors do a version of this currently, but some are known to reject if any of the reviewers give a rejection no matter what the reviewers actually say. It would force the editor to use more discretion if the reviewers themselves did not make the final judgement. It also forces reviewers to be more clear in their critiques. If they are superficial the editor will ignore them, whereas the final accept-reject is easy to take into account even if the review does not state any substantive critiques.

I don’t know if I can easily articulate what I think is a big deal and what isn’t though. I am a quant guy, so the two instances I rejected were for model identification in one and for sample selection biases in the other. So things that could not be changed essentially. I haven’t read a manuscript that was so poor I considered it to be unsalvagable in terms of writing. (I will do a content analysis of reviews I’ve recieved sometime, but almost all complaints about the literature review are arbitrary and shouldn’t be used as reasons for rejection.)

Often times I write abunch of notes on the paper manuscript my first read, and then when I go to write up the critique specifically I edit them out. This often catches silly initial comments of mine, as I better understand the manuscript. Examples of silly comments in the reviews of the roadblock paper are claiming I don’t conduct a pre-post analysis (reviewer 1), and asking for things already stated in the manuscript (reviewer 2 asking for how long the roadblocks were and whether they were "high visibility"). While it is always the case things could be explained more clearly, at some point the reviewer(s) needs to be more careful in their reading of the manuscript. I think my motto of "be specific" helps with this. Being generic helps to conceal silly critiques.

Posting my peer reviews on Publons and a few notes about reviewing

Publons is a service that currates your peer review work. In a perfect world this would be done by publishers – they just forward your reviews with some standard meta-data. This would be useful when one paper is reviewed multiple times, as well as identifying good vs. poor reviewers. I forget where I saw the suggestion recently (maybe on the orgtheory or scatterplot blog), but someone mentioned it would be nice if you submit your paper to a different journal to forward your previous reviews at your discretion. I wouldn’t mind that at all, as oft the best journals will reject for lesser reasons because they can be more selective. (Also I’ve gotten copy-paste same reviews from different journals, even though I have updated the draft to address some of the comments. Forwarding would allow me to address those comments directly before the revise-resubmit decision.)

I’ve posted all of my reviews so far, but they are only public if the paper is eventually accepted. So here you can see my review for the recent JQC article Shooting on the Street: Measuring the Spatial Influence of Physical Features on Gun Violence in a Bounded Street Network by Jie Xu and Elizabeth Griffiths.

I’ve done my fair share of complaining about reviews before, but I don’t think the whole peer-review process is fatally flawed despite its low reliability. People take peer review a bit too seriously at times – but that is a problem for most academics in general. Even if you think your idea is not getting a fair shake, just publish it yourself on your website (or places like SSRN and ArXiv). This of course does not count towards things like tenure – but valuing quantity over quality is another separate problem currently in academia.


In the spirit of do unto others as you would have them do unto you, here are a two main points I try to abide by when I review papers.

  • be as specific as possible in your critique

There is nothing more frustrating than getting a vague critique (the paper has multiple mispellings and grammar issues). A frequent one I have come across (both in reviews of my papers and seeing comments others have made on papers I’ve reviewed) is in the framing of the paper – a.k.a. the literature review. (Which makes sending the paper to multiple journals so frustrating, you will always get more arbitrary framing debates each time with new reviewers.)

So for a few examples:

  • (bad) The literature review is insufficient
  • (good) The literature review skips some important literature, see specifically (X, 2004; Y, 2006; Z, 2007). The description of (A, 2000) is awkward/wrong.
  • (bad) The paper is too long, it can be written in half the length
  • (better) The paper could be shortened, section A.1 can be eliminated in my opinion, and section A.2 can be reduced to one paragraph on X.

Being specific provides a clear path for the author to correct what you think, or at least respond if they disagree. The “you can cut the paper in half” I don’t even know how to respond to effectively, nor the generic complaint about spelling. One I’ve gotten before is “x is an innapropriate measure” with no other detail. This is tricky because I have to guess why you think it is innapropriate, so I have to make your argument for you (mind read) and then respond why I disagree (which obviously I do, or I wouldn’t have included that measure to begin with). So to respond to a critique I at first have to critique my own paper – maybe this reviewer is more brilliant than I thought.

Being specific I also think helps cut down on arbitrary complaints that are arguable.

  • provide clear signals to the editor, both about main critiques and the extent to which they can be addressed

Peer review has two potential motivations, one is a gate-keeper and one is to improve the draft. Often times arbitrary advice by reviewers intended for the latter is not clearly delineated in the review, so it is easily confused for evidence pertinent to the gate-keeper function. I’ve gotten reviews of 20 bullet points or 2,000 words that make it seem like a poor paper due to sheer length of the comment, but the majority are minor points or arbitrary suggestions. Longer reviews actually suggest the paper is better – if there is something clearly wrong you can say it in a much shorter space.

Gabriel Rossman states these different parts of peer review a bit more succintly than me:

You need to adopt a mentality of “is it good how the author did it” rather than “how could this paper be made better”

I think this is a good quip to follow. I might add “don’t sweat the small stuff” to that as well. Some editors will read the paper and reviews and make judgement calls – but some editors just follow the reviewers blindly – so I worry with the 20 bullet point minor review that it unduly influenced a reject decision. I’m happy to respond to the bullets, and happy you took the time, but I’m not happy about you (the reviewer) not giving clear advice to the editor of the extent to which I can address those points.

I still give advice about improving the manuscript, but I try to provide clear signals to the editor about main critiques, and I also will explicitly state whether they can be addressed. The “can be addressed” is not for the person writing the paper – it is for the editor making the decision for whether to revise-and-resubmit! The main critiques in my experience will either entail 2-3 main points (or none at all for some papers). I also typically say when things are minor and put them in a separate section, which editors can pretty much ignore.

Being a quantitative guy, the ones that frustrate me the most are complaints about model specifications. Some are legitimately major complaints, but often times it will be things that are highly unlikely to greatly influence the reported results. Examples are adding/dropping/changing a particular control variable and changes in the caliper used for propensity score matching. Note I’m not saying you shouldn’t ask to see differences, I’m asking that you clearly articulate why your suggestion is preferable and make an appropriate judgement as to whether it is a big problem or a little problem. A similar complaint is what information to include in tables or in the main manuscript or appendix. The author already has the information, so it is minor editing, not a major problem.


While I am here I will end with three additional complaints that don’t fit into anywhere previously in my post. One, multiple rounds of review are totally a waste. So the life cycle of the paper-review should be

paper -> review -> editor decision -> reject/accept
                                          or
                                      revise-resumbit -> new paper + responses to reviews -> editor decision

The way the current system works, I have to submit another review after the new paper has been submitted. I rather the editor take the time to see if the authors sufficiently addressed the original complaints, because as a reviewer I am not an unbiased judge of that. So if I say something is important and you retort it is not, what else do you want me to say in my second review! It is the editors job at that point to arbiter disagreements. This then stops the cycle of multiple rounds of review, which have a very large amount of diminishing returns in improving the draft.

This then leads into my second complaint, generally about keeping a civil tone for reviews. In general I don’t care if a reviewer is a bit gruff in the review – it is not personal. But since reviewers have a second go, when I respond I need to keep an uber deferential tone. I don’t think that is really necessary, and I’d rather original authors have similar latitude to be gruff in responses. Reviewers say stupid things all the time (myself included) and you should be allowed to retort that my critique is stupid! (Of course with reasoning as to why it is stupid…)

Finally, I had one reviewer start recently:

This paper is interesting and very well written…I will not focus on the paper’s merits, but instead restrict myself to areas where it can be improved.

The good is necessary to signal to the editor whether a paper should be published. I’ve started to try in my own reviews to include more of the good (which is definately not the norm) and argue why a paper should be published. You can see in my linked review of the Xu and Griffiths paper by the third round I simply gave arguments why the paper should be published, despite a disagreement about the change-point model they reported on in the paper.