Some notes on self-publishing a tech book

So my book, Data Science for Crime Analysis with Python, is finally out for purchase on my Crime De-Coder website. Folks anywhere in the world can purchase a paperback or epub copy of the book. You can see this post on Crime De-Coder for a preview of the first two chapters, but I wanted to share some of my notes on self publishing. It was some work, but in retrospect it was worth it. Given prior books I have been involved with (Wheeler 2017; Wheeler et al. 2021) plus my peer review experience, I knew I did not need help copy-editing, so the notes are mostly about creating the physical book and the logistics of selling it.

Academics may wish to go with a publisher for prestige reasons (I get it, I was once a professor as well). But it is quite nice once you have done the legwork to publish it yourself. You have control of pricing, and if you want to make money you can, or have it cheap/free for students.

Here I will detail some of the set up of compiling the book, and then the bit of work to distribute it.

Compiling the documents

So the way I compiled the book is via Quarto. I posted my config notes on how to get the book contents to look how I wanted on GitHub. Quarto is meant to run code at the same time (so it works nicely for a learning-to-code book). But even if I just wanted a more typical science/tech book with text/images/equations, I would personally use Quarto since I am familiar with the set up at this point. (If you do not need to run dynamic code you could do it in Pandoc directly; I am not sure if there is a way to translate a Quarto yaml config into the equivalent Pandoc commands it turns into.)

One thing that I think will interest many individuals – you write in plain text markdown. So my writing looks like:

# Chapter Heading

blah, blah blah

## Subheading

Cool stuff here ....

In a series of text files, one for each chapter of the book. And then I run quarto render, and it turns my writing in those text files into both an Epub and a PDF (and other formats if you cared, such as word or html). You can set up the configuration for the book to be different for the different formats (for example I use different fonts in the PDF vs the epub, nice fonts in one look quite bad in the other). See the _quarto.yml file for the set up, in particular the config options that are different for PDF and Epub.
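To give a flavor of the kind of per-format config I mean, here is a minimal sketch of a _quarto.yml for a book. The field names come from Quarto's book configuration; the chapter filenames and font choices are just illustrative, my actual setup is in the _quarto.yml on GitHub.

```yaml
project:
  type: book

book:
  title: "Data Science for Crime Analysis with Python"
  chapters:
    - index.qmd      # hypothetical chapter files
    - 01_intro.qmd

format:
  pdf:
    documentclass: scrbook
    mainfont: "Palatino"   # fonts that look good in print
  epub:
    epub-fonts: fonts/*.ttf  # embed different fonts for e-readers
```

Running quarto render then produces both outputs from the same chapter files, applying each format's options.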

One thing is that ebooks are hard to format nicely – if I had a book I wanted to redo to be an epub, I would translate it to markdown. There are services online that will translate for you, but they will do a bad job with scientific texts that have many figures (and surely will not help you choose nice fonts). So just learn markdown and translate it yourself. Folks who write in one format and save to the other (either Epub/HTML to PDF, or PDF to Epub/HTML) are doing it wrong – the translated format will look very bad. Most advice online is for people whose books are just text, so for science people with figures (and footnotes, citations, hyperlinks, equations, etc.) it is almost all bad advice.

So even for qualitative people, learning how to write in markdown to self-publish is a good skill to learn in my opinion.

Setting up the online store

For a while I have been confused about how SaaS companies offer payment plans. (Many websites just seem to copy from generic node templates.) Looking at the Stripe API, it seems over the top for me to script up my own solution to integrate Stripe directly. If I wanted to do a subscription I may need to figure that out, but it ended up that for my Hostinger website I can set up a sub-page that is WordPress (even though the entire website is not), and turn on WooCommerce for that sub-page.

WooCommerce ends up being easy, and you can set up the store to host web-assets to download on demand (so when you purchase it generates a unique URL that obfuscates where the digital asset is saved). No programming involved to set up my webstore, it was all just point and click to set things up one time and not that much work in the end.
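For the curious, the obfuscated-URL idea is simple. This is a toy sketch of what a store plugin does under the hood – the domain, paths, and function are all hypothetical, not WooCommerce's actual implementation:

```python
import secrets

# map an opaque, hard-to-guess token to the real file location on the server
downloads = {}

def create_download_link(real_path):
    """Generate a one-off download URL that hides where the asset lives."""
    token = secrets.token_urlsafe(32)  # ~256 bits of randomness, unguessable
    downloads[token] = real_path
    return f"https://example.com/download/{token}"

link = create_download_link("/srv/assets/crime-book.epub")
# the buyer only ever sees the token, never the real server path
```

When the link is visited, the server looks the token up and streams the file (and can expire the token after a set number of downloads).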

I am not sure about setting up any DRM for the epub (so in reality people can purchase the epub and share it illegally). I don’t know of a way to prevent this without using Amazon+Kindle to distribute the book. But the print book should be OK. (If there were a way for me to donate a single epub copy to all libraries in the US I would totally do that.)

I originally planned on having it on Amazon, but the low margins on both print and ebook, plus the formatting of their idiosyncratic kindle book format (as far as I can tell, I cannot really choose my fonts), made me decide against doing either on Amazon.

Print on Demand using LuLu

For print on demand, I use LuLu.com. They have a nice feature to integrate with WooCommerce; the only thing I wish is that shipping were dynamically calculated. (The way it is set up now, I need to make a flat shipping rate for different areas around the globe, which is slightly annoying and will change the profit margins depending on area.)

LuLu is a few more dollars to print than Amazon, but it is worth it for my circumstance I believe. Now if I had a book I expected to get many “random Amazon search buys” I could see wanting it on Amazon. I expect more sales will be via personal advertising (like here on the blog, social media, or other crime analyst events). My Crime De-Coder site (and this blog) will likely be quite high in google searches for some of the keywords fairly quickly, so who knows, maybe just having it on my personal site will result in just as many sales.

LuLu does have an option to turn on distribution to other wholesalers (like Barnes & Noble and Amazon) – I have not turned that on, but maybe I will in the future.

LuLu has a pricing calculator on their website to see how much printing costs. Paperback with basically the cheapest color option for letter sized paper (which is quite large) is just over $17 for my 310 page book (Amazon was just over $15). If your book is less image heavy and more text, you could get away with a smaller size (and maybe black/white), which I suspect will be much cheaper. LuLu’s printing of this book is higher quality compared to Amazon as well (better printing of the colors and nicer stock for the paperback cover).

Another nice thing about print on demand is I can go in and edit/update the book as I see fit. No need to worry about new editions. I am not sure what that exactly means for citing the work (I could always go and change it) – you can’t have a static version of record and an easy way to update at the same time.

Other Random Book Stuff

I purchased ISBNs on Bowker, something like 10 ISBNs for $200. (You want a unique ISBN for each type of the book, so you may want three in the end if you have epub/paperback/hardback.) Amazon and LuLu have options to give you an ISBN though, so that may not have been necessary. I set the imprint to be my LLC in Bowker, so CRIME De-Coder is the publisher.

You don’t technically need an ISBN at all, but it is a simple thing, and there may be ways for me to donate to libraries in the future. (If a University picks it up as a class text, I have been at places where you need at least one copy available at the Uni library.)

I have not created an index – I may have a go at feeding my book through LLMs and seeing if I can auto-generate a nice index. (I just need a list of key words; after that I can go and find-replace the relevant text in the book so it auto-compiles an index.) I am not sure that is really necessary for a how-to book though – you can just look at the table of contents to see the individual (fairly small) sections. For epub you can just do a direct text search, so I am not sure people use an index at all in epubs.
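For what it is worth, the find-the-keywords part does not even need an LLM. Here is a sketch of a first pass – the chapter text and keyword list are made up for illustration:

```python
import re

def keyword_locations(chapters, keywords):
    """Given a dict of chapter name -> text, return keyword -> chapters it appears in."""
    index = {}
    for kw in keywords:
        # whole-word, case-insensitive match so "pandas" doesn't hit "expands"
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        index[kw] = [name for name, text in chapters.items() if pattern.search(text)]
    return index

chapters = {
    "ch01": "We load data with pandas and plot histograms.",
    "ch02": "Regression with statsmodels; pandas merges for joins.",
}
print(keyword_locations(chapters, ["pandas", "regression"]))
```

An LLM would mostly be useful for generating the candidate keyword list; after that, locating (or tagging) the occurrences is simple string work like this.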

Personal Goals

So I debated on releasing the book open source, but I do want to try and see if I can make some money. I don’t have this expectation, but there is potential to get some “data science” spillover, and if that is the case sales could in theory be quite high. (I was surprised in searching the “data science python” market on Amazon – it is definitely not saturated.) Personally I will consider at least 100 sales to be my floor for success. That is, if I can sell at least 100 copies, I will consider writing more books. If I can’t sell 100 copies I have a hard time justifying the effort – it would just be too few people buying the book to have the types of positive spillovers I want.

To make back money relative to the amount of work I put in, I would need more than 1000 sales (which I think is unrealistic). I think 500 sales is about the best case, guesstimating the size of the crime analyst community that may be interested plus some additional sales to grad students. To hit 1000 sales it would take multiple professors using it as a class book over several years. (So if you are a professor and interested in this for a class, let me know – I will give your class a discount.)

Another common way for individuals to make money off of books is not the sales themselves, but to have trainings oriented around the book. I am hoping to do more of that for crime analysts directly in the future, but those opportunities I presume will be correlated with total sales.

I do enjoy writing, but I am busy, so I cannot just say “I am going to drop 200 hours writing a book”. I would like to write on additional python topics oriented towards crime analysts/criminology grad students, like:

  • GIS analysis in python
  • Regression
  • Machine Learning & Optimization
  • Statistics for Crime Analysis
  • More advanced project management in python

Having figured out much of this grunt work definitely makes me more motivated, but in the end I need a certain level of sales to justify the effort. So please, if you like the blog, pick up a copy and tell a friend you like my work!

My word template for Quarto

I have posted on GitHub my notes on creating a word template to use with Quarto. And since Quarto is just feeding into pandoc, those who are just using pandoc (so not doing intermediate computations) may find that template worthwhile as well.

So first, why word? Quarto by default looks pretty nice for HTML. It is fine for them to prioritize that, but for the majority of reports I want to use Quarto for, HTML is not the best format. Many times I want a report that can be emailed as a PDF and/or printed. And sometimes I (or my clients) want a semi-automated report that can be edited after the fact. In those cases word is a good choice.

Editing LaTeX is too hard, and I am pretty happy with this template for small reports. I will be sharing my notes on writing my python book in Quarto soonish, but for now I wanted to share how I created a word template.

Note some of the items may seem gratuitous (why so many CRIME De-Coder logos?). Part of those are just notes though (like how to insert an image after your author name – I have done this to insert my signature in reports, for example). The qmd file has most of the things I am interested in doing in documents, such as how to format markdown tables in python, doing sections/footnotes, references, table/figure captions, etc.

I do like my logo in the header though (it is even hyperlinked, so in subsequent PDFs if you click the logo it will go to my website), and the footer page numbers I commonly need in reports as well. And my title page and TOC do not look bad either, IMO. I am not one to discuss fonts, but I like small caps for titles, and the Verdana font is nice to make it look somewhat different.

Creating the Template

So first, you can do from the command line:

quarto pandoc -o custom-reference-doc.docx --print-default-data-file reference.docx

From there, you should edit that reference.docx file to get what you want. So for example, if you want to change the font used for code snippets, in Word you can open up Styles, and on the right hand side select different elements and edit them:

Here for example to change the font for code snippets, you modify the HTML code style (I like Consolas):

There ended up being a ton of things I edited; I did not keep a list. Offhand you will want to modify the Title, Headings 1 & 2, First Paragraph, and Body Text. And then you can edit things like the page numbers and header/footer.

So when rendering a document, you can sometimes click on the element in the rendered document and figure out what style it inherits from. Here for example you can see in the test.docx file that the quote section uses the “Block Text” style:

This does not always work though, and it can take some digging/experimentation in the original template file to get the right style modifier. (If you are having a real hard problem, convert the word document format to .zip, and dig into the XML documents. You can see the style formats it inherits from in the XML tree.) It doesn’t work for the code segments, for example. Do not render a document and edit the style in that document – only edit the original reference.docx that was generated from the --print-default-data-file command line call to update your template.

I have placed a few notes in the readme on GitHub, but one of my main things was making tables look nice. The template plays nicely with markdown tables, which I can use python to render directly. Here is an example of spreading tables across multiple pages.
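Generating those markdown tables from python only takes a few lines. Here is a dependency-free sketch of the idea (pandas' DataFrame.to_markdown is another option, though it requires the tabulate package); the headers and numbers are made up:

```python
def to_markdown_table(headers, rows):
    """Format rows of data as a markdown pipe table that pandoc/Quarto will style."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(v) for v in row) + " |")
    return "\n".join(lines)

print(to_markdown_table(["Year", "Crimes"], [(2020, 1234), (2021, 1100)]))
```

The printed pipe table can be emitted directly from a code cell in the qmd file, and the word template's table style is applied to it on render.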

One thing to note though is that this has limits – different styles are interrelated, so sometimes I would change one and it would propagate errors to different elements. (For example, I can’t figure out how to change the default bullets to squares instead of circles without having bullets in places they should not be in tables – try to figure that one out. I also cannot figure out how to change the default font in tables to monospace without changing the font for other text elements in normal blocks.) So this template was the best I could figure out without breaking other parts.

I have a few notes in the qmd file as well, showing how to use different aspects of markdown, as well as some sneaky things to do extra stuff (like formatting fourth level headings to produce a page break, I do not think I will need that deep of headings).

Even for those not using Quarto for computational workflows, writing in markdown is a really useful skill. You write in plain text, and can then have the output in different formats. Even for qualitative folks (or people in industry creating documents), I think many people would be well served by writing content in plain text markdown, and then rendering to whatever output they wanted.


If interested in other tutorials like this, I suggest you check out two of my books:

Each can be purchased in either paperback or epub versions worldwide from my Crime De-Coder store.

Some musings on plagiarism

There have been several high profile cases of plagiarism in the news recently. Chris Rufo, a conservative pundit, has identified plagiarism in Claudine Gay and Christina Cross’s work. In a bit of tit-for-tat, Neri Oxman was then profiled for similar sins (her husband was one of the most vocal critics of Gay).

Much of the discussion around this seems to be more focused on who is doing the complaining than the actual content of the complaints. It can simultaneously be true that Chris Rufo has ulterior motives to malign those particular scholars (ditto for those criticizing Neri Oxman), but the complaints are themselves legitimate.

What is plagiarism? I would typically quote a specific definition here, but I actually do not like the version in the Merriam Webster (or American Heritage) dictionary. I assert plagiarism is the direct copying of material without proper attribution to its original source. The concept of “idea” plagiarism seems to muddy up the waters – I am ignoring that here. That is, even if you don’t “idea” plagiarize, you can do “direct copying of material without proper attribution”.

It may seem weird that you can do one without the other, but many of the examples given (whether Rufo’s or those pointing out Neri Oxman’s) are insipid passages that are mostly immaterial to the main thesis of the paper, or they have some form of attribution but it is not proper. So it goes – the way academics write, you can cut out whole cloth large portions of writing and it won’t make a difference to the story line.

These passages however are clearly direct copying of material without proper attribution. I think Cross’s is a great example, here is one (of several Rufo points out):

Words are so varied, if I gave you a table of descriptions, something like:

survey years: 1998, 2000, ...., 2012
missing data: took prior years survey values

And asked 100 scholars to put that in a plain text description in a paragraph, all 100 would have some overlap in wording, but not to near this extent. For numbers: these paragraphs have around 30 words; say your word vocabulary relevant to describe that passage is 100 words – the overlap would be under 10% of the material. Here is a simple simulation to illustrate.

And this does not consider the word order (and the fact that the corpus is much larger than 100 words). Which will both make the probability of this severe of overlap much smaller.

Unlike what ASA says, it does not matter that what was copied is a straightforward description of a dataset, it clearly still is plagiarism in terms of “direct copying without proper attribution”. The Neri Oxman examples are similar, they are direct copying, even if they have an in text citation at the end, that is not proper attribution. If these cases went in front of ethics boards at universities that students are subject to, they would all be found guilty of plagiarism. The content of what was copied and its importance to the work does not matter at all.

So I find the defense of the clearly indefensible on ideological grounds among academics quite disappointing. But as a criminologist I am curious as to its prevalence – if you took a sample of dissertations or peer reviewed articles, how many would have plagiarized passages? Would it be 1%, or 10%, or 50%? I genuinely do not know (it clearly won’t be 0% though!).

I would do this to my own articles if I could do so easily (I don’t think I did, but it is possible I self-plagiarized portions of my dissertation in later peer reviewed articles). So let me know if you have Turnitin (or are aware of another service) that I can upload PDFs to, to check this.

I’d note here that some of the defense of these scholars rests on the “idea” part of plagiarism, which is a shifting-sands definition saying it is necessary to steal some fundamental idea for it to be plagiarism. Idea plagiarism isn’t really a thing, at least not anything more than vague norms among academics (or even journalists). Scooping ideas is poor form, but that is it.

I am not aware of my words being plagiarized, but I am aware of several examples of scholars clearly taking work from my code repositories or blog posts and not citing them. (As a note to scholars, it is fine to cite my blog posts – I have maybe a dozen or so citations to my blog at this point.) But those would not typically be considered plagiarism. If someone copies one of my functions and applies it to their own data, it is not plagiarism as typically understood. If I noted that and sent those examples to COPE and asked for the articles to be retracted, I am pretty sure COPE would say that is not plagiarism as typically treated.

Honestly it does not matter though. I find it obnoxious to not be cited, but it is a minor thing, and clearly does not impact the quality of the work. I basically expect my work on the blog to be treated as MIT licensed (subject to change) – it is mostly a waste of time for me to police how it is used. Should Cross be disciplined for her plagiarism? Probably not – if it were an article, I would suggest a correction would be sufficient.

I can understand students may be upset that they are held to higher standards than their professors, but I am not sure if that means students should be given more slack or if professors should be given less. These instances of plagiarism by Gay/Cross/Oxman I take more as laziness than anything, but they do not have much to do with whether they are fit for their jobs.

AI writing

Pretty soon, even my plagiarism definition is not going to work anymore, as generative chatbots are essentially stochastic parrots – everything they do is paraphrasing plagiarism, but in a form where the direct copying is hard to see (and that tools like Turnitin will not identify). And I am starting to get links to the blog from Perplexity and Iask. So people may cite ChatGPT or whatever service generated the text, but that service copied from someone else, and no one is the wiser.

These specific services give paraphrasing citations – e.g. you ask a question, it gives the response + 3 citations (these are called RAG applications in LLM speak). So you may think they are OK in terms of the above, since they give paraphrasing-type, end-of-paragraph citations. But I have noticed they frequently spit out near verbatim copies of my work for a few blog posts I get traffic for. The example I have linked to was this post on fitting a beta-binomial model in python. I intentionally wrote that post because the first google results were bad/wrong, and ditto for the top stackoverflow questions. So I am actually happy my blog got picked up by google quite fast and made it to the top of the list.

These services are stochastic, so subject to change, but iask is currently a direct copy of my work and does not give me in its list of citations (although I come up high in their search rankings, so it is not clear to me why I am not credited). Even if it did, though, it would be the same problem as Oxman – it would not be a proper quoted citation.

And the Cross faux pas of editing a few words will not only be common but routine with these tools.

This is pretty much intended from my perspective with these tools – I saw a wrong on the internet and wrote a post so others do not make the same mistakes. Does it matter if the machine does not properly attribute me? I just want you to do your stats right; I don’t get a commission if you properly cite me. (I would rather you just personally hire me to help than get a citation!)

I think these services may make citation policing over plagiarism impossible though. No major insights here on how to prevent it – I said above it is mostly immaterial to the quality of the work, but I do think saying you can carte blanche plagiarize is probably not good.

Year in Review 2023: How did CRIME De-Coder do?

In 2023, I published 45 pages on the blog. Cumulative site views were slightly more than last year, a few over 150,000.

I would have had pretty much steady cumulative views from last year (site views took a dip in April; the prior year had quite a bit of growth, and I suspect something changed in the way WordPress counts stats), but in December my post Forecasts need to have error bars hit the front page on Hackernews. This generated about 12k views for that post over two days. (In 2022 I had just shy of 140k views in total.)

It was very high on the front page (#1) for most of that day. So for folks who want to guesstimate “death by Hackernews” referrals, I would guess if your site/app can handle 10k requests in an hour you will be ok. WordPress by default handles this fine (my Crime De-Coder Hostinger site is maybe not so good for that – the SLA is 20k requests per day). Also an interesting note: about 10% of people who were referred to the forecast post clicked at least one other page on the site.

So I started CRIME De-Coder in February this year. I have published a few more than 30 pages on that site during the year, and have accumulated a total of a bit more than 11k site views. This is very similar to the first year of my personal blog, with publishing around 30 posts and getting just over 7k total views for the year. This is almost entirely via direct referrals (I share posts on LinkedIn; google searches are just a trickle).

Sometimes people are like “cool you started your own company”, but really I did that same type of consulting since I was in grad school. I have had a fairly consistent set of consulting work (around $20k per year) for quite awhile. That was people cold asking me for help with mostly statistical analysis.

The reason I started CRIME De-Coder was to be more intentional about it – advertise the work I do, instead of waiting for people to come to me. Doing your own LLC is simple, and it is more a website than anything.

So how much money did I make this year for CRIME De-Coder? Not that much more than $30k. (I don’t count the data competitions I won in that metric, only actual commissioned work.) I do have substantially more work lined up for next year already (more on the order of $50k so far, although no doubt some of that will fall through).

I sent out something like 30 soft pitches during the year to people in my extended network (first or strong second degree). I don’t know the typical hit rate for something like that, but mine was abysmal – I was lucky to even get a “no thanks” email response. These were just ideas like “hey, I could build you an interactive dashboard with your data” or “you paid this group $150k, I would do that same thing for less than $30k”.

Having CRIME De-Coder did however increase requests from my first degree network to “ask me for stat analysis”. So it was definitely worth spending time doing the website and creating the LLC. Don’t ask me for advice about making pitches for consulting work though!

The goal is ultimately to be able to go solo, and do my consulting work as my full time job. It is hard to see that happening though – even if I had 5 times the amount of work lined up, it would still just be short term single projects. I have pitched more consistent retainers, but no one has gone for that. Small police departments interested in outsourcing crime analysis, let me know – I believe that is the best solution for them. I have also pitched to think tanks to hire me part time, as well as CJ programs to hire me in part time roles. I understand the CJ programs’ lack of interest – I am way more expensive than a typical adjunct – but I am a good deal for other groups. (I mean I am a good deal for CJ programs as well, part of the value add is supervising students for research, but universities don’t value that very highly.)

I will ultimately keep at it – sending email pitches is easy. And I am hoping that as the website gets more organic search referrals, I will be able to break out of my first degree network.

The sausage making behind peer review

Even though I am not on Twitter, I still lurk every now and then. In particular, I can see web traffic referrals to the blog, so I will go and use nitter to look it up when I get new traffic.

Recently my work about why I publish preprints was referenced in a thread. That blog post was from the perspective of why I think individual scholars should post preprints. The thread that post was tagged in was not saying from a perspective of an individual writer – it was saying the whole idea of preprints is “a BIG problem” (Twitter thread, Nitter Thread).

That is, Dan thinks it is a problem other people post preprints before they have been peer reviewed.

Dan’s point is one held by multiple scholars in the field (I have had similar interactions with Travis Pratt back when I was on Twitter). Dan does not explicitly say it in that thread, but I take this as a pretty strong indication Dan thinks posting preprints without peer review is unethical (Dan thinks postprints are ok). In prior conversations I had with Pratt on Twitter, he explicitly said it was unethical.

The logic goes like this – you can make errors, so you should wait until colleagues have peer reviewed your work to make sure it is “OK” to publish. Otherwise, it is misleading to readers of the work. In particular people often mention the media uncritically reporting preprint articles.

There are several reasons I think this opinion is misguided.

One, the peer review system itself is quite fallible. Having received, delivered, and read hundreds of peer review reports, I can confidently say that the entire peer review system is horribly unreliable. It has both a false negative and a false positive problem – in that things that should be published get rejected, and things that should not be published get through. Both happen all the time.

Now, it may be the case that the average preprint is lower quality than the average peer reviewed journal article (given the selection of who posts preprints, I am actually not sure this is the case!). In the end though, you need to read the article and judge it for yourself – you cannot just assume an article is valid simply because it passed peer review. Nor can you assume the opposite – that something not peer reviewed is not valid.

Two, the current peer review system is vast. To dramatically oversimplify, there are “low quality” journals (paid-for journals, some humanities journals, whatever journals publish the “a square of chocolate and a glass of red wine a day increases your life expectancy” garbage), and “high quality” journals. The people Dan wants to protect from preprints are exactly the people who are unlikely to know the difference.

I use scare quotes around low and high quality in that paragraph on purpose, because really those superficial labels are not fair. BMC probably publishes plenty of high quality articles; it just happened to also publish a paper that used a ridiculous methodology that dramatically overestimated vaccine adverse effects (where the peer reviewers just phoned in superficial reviews). Simultaneously, high quality journals publish junk all the time (see Crim, Psych, Econ, Medical examples).

Part of the issue is that the peer review system is a black box. From a journalist’s perspective, you don’t know which papers had reviewers phone it in (or had their buddies give it a thumbs up) versus ones that had rigorous reviews. The only way to know is to judge the paper yourself (even having the reviews is not informative relative to just reading the paper directly).

To me the answer is not “journalists should only report on peer reviewed papers” (or the same, no academic should post preprints without peer review) – all consumers need to read the work for themselves to understand its quality. Suggesting that something that is peer reviewed is intrinsically higher quality is bad advice. Even if on average this is true (relative to non-peer reviewed work), any particular paper you pick up may be junk. There is no difference from the consumer perspective in evaluating the quality of a preprint vs a peer reviewed article.

The final point I want to make, three, is that people publish things that are not peer reviewed all the time. This blog is not peer reviewed. I would actually argue the content I post here is often higher quality than many journal articles in criminology (due to the transparent, reproducible code I often share). But you don’t need to take my word for it – you can read the posts and judge that for yourself. Ditto for many other popular blogs. I find it pretty absurd for someone to think me publishing a blog is unethical – ditto for preprints.

There is no point in arguing with people’s personal opinions about what is ethical vs what is not though. But thinking that you are protecting the public by only allowing peer reviewed articles to be reported on is incredibly naive as well as paternalistic.

We would be better off, not worse, if more academics posted preprints, peer review be damned.

This one simple change will dramatically improve reproducibility in journals

So Eric Stewart is back in the news, and it appears a new investigation has prompted him to resign from Florida State. For a background on the story, I suggest reading Justin Pickett’s EconWatch article. In short, Justin did analysis of his own papers he co-authored with Stewart to show what is likely data fabrication. Various involved parties had superficial responses at first, but after some prodding many of Stewart’s papers were subsequently retracted.

So there is quite a bit of human messiness in the responses to accusations of error/fraud, but I just want to focus on one thing. In many of these instances, the flow goes something like:

  1. individual points out clear numeric flaws in a paper
  2. original author says “I need time to investigate”
  3. multiple months later, original author has still not responded
  4. parties move on (no resolution) OR conflict (people push for retraction)

My solution here is a step that mostly fixes the time lag in steps 2/3. Authors who submit quantitative results should be required to submit statistical software log files along with their article to the journal from the start.

So there is a push in the social sciences to submit fully reproducible results, where an outside party can replicate 100% of the analysis. This is difficult – I work full time as a software engineer – it requires coding skills most scientists don't have, as well as outside firms devoting resources to the validation. (Offhand, if you hired me to do this, I am guessing I would charge something like $5k to $10k given the scope of most journal articles in the social sciences.)

An additional problem in criminology research is that we are often working with sensitive data that cannot easily be shared.

I agree a fully 100% reproducible analysis would be great – let's not make the perfect the enemy of the good, though. What I am suggesting is that authors should directly submit the log files that they used to produce tables/regression results.

Many authors currently are running code interactively in Stata/R/SPSS/whatever, and copy-pasting the results into tables. So in response to 1) above (the finding of a data error), many parties assume it is a data transcription error, and allow the original authors leeway to go and “investigate”. If journals have the log files, it is trivial to see if a data error is a transcription error, and then can move into a more thorough forensic investigation stage if the logs don’t immediately resolve any discrepancies.
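To make that triage concrete, here is a minimal sketch of my own (the log excerpt and the reported values are made up for illustration) of how a journal or outside reviewer could check whether a published number actually appears in the submitted log:

```python
import re

def value_in_log(log_text, reported, tol=0.0005):
    """Return True if any number in the log matches the reported value."""
    numbers = [float(m) for m in re.findall(r"-?\d+\.\d+", log_text)]
    return any(abs(n - reported) <= tol for n in numbers)

# A made-up snippet of regression output saved to a log file
log = """
Coefficients:
            Estimate Std. Error
x             0.4132     0.0821
"""

value_in_log(log, 0.4132)  # value is in the log, likely a simple transcription issue elsewhere
value_in_log(log, 0.3412)  # value is nowhere in the log, warrants a closer look
```

This kind of check takes minutes with a log file on hand, versus months of waiting on the original author.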


If you are asking "Andy, I don't know how to save a log file from my statistical analysis", here is how, below. It is a very simple thing – a single action or line of code.

This is under the assumption people are doing interactive style analysis. (It is trivial to save a log file if you have created a script that is 100% reproducible, e.g. in R it would then just be something like Rscript Analysis.R > logfile.txt.) So my advice here is about saving a log file when doing interactive, partly code/partly GUI type work.

In Stata, at the beginning of your session use the command:

log using "logfile.txt", text replace

In R, at the beginning of your session:

sink("logfile.txt")
...your code here...
# then before you exit the R session
sink()

In SPSS, at the end of your session:

OUTPUT EXPORT /PDF DOCUMENTFILE="local_path\logfile.pdf".

Or you can go to the output file and use the GUI to export the results.

In Python, if you are doing an interactive REPL session, you can do something like:

python > logfile.txt
...inside REPL here...

Or if you are using Jupyter notebooks, you can just save the notebook as an html file.
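If you want the Python session itself to do the logging (a sketch of my own, not part of the original advice), you can wrap sys.stdout so everything printed goes to both the screen and a file:

```python
import sys

class Tee:
    """Forward everything printed to both the console and a log file."""
    def __init__(self, path):
        self.console = sys.stdout
        self.log = open(path, "w")
    def write(self, text):
        self.console.write(text)
        self.log.write(text)
        self.log.flush()  # keep the file current during an interactive session
    def flush(self):
        self.console.flush()

sys.stdout = Tee("logfile.txt")
print("this line shows on screen and is saved to logfile.txt")
```

Unlike shell redirection, this way you still see the output as you work.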

If interested in learning how to code in more detail for regression analysis, I have PhD course notes on R/SPSS/Stata.


This solution is additional work from the author's perspective, but a very tiny amount. I am not asking for 100% reproducible code front to back, I just want a log file that shows the tables. These log files will not show sensitive data (just summaries), so they can be shared.

This solution is not perfect. These log files can be edited. Requiring these files will also not prevent someone from doctoring data outside of the program and then running real analysis on faked data.

It does, though, greatly up the level of effort needed to fake results compared to the current status quo. Currently faking just requires authors to doctor results in one location; this at a minimum requires two locations (and keeping the two sources consistent is additional work). And since the log outputs often have additional statistical summaries, it will be clearer if someone doctored the results than it would be from the simpler table in a peer reviewed article.

This does not 100% solve the reproducibility crisis in social sciences. It does however solve the problem of “I identified errors in your work” and “Well I need 15 months to go and check my work”. Initial checks for transcription vs more serious errors with the log files can be done by the journal or any reasonable outsider in at most a few hours of work.

Getting access to paywalled newspaper and journal articles

So recently several individuals have asked about obtaining articles they do not have access to that I cite in my blog posts. (Here or on the American Society of Evidence Based Policing.) This is perfectly fine, but I want to share a few tricks I have learned on accessing paywalled newspaper articles and journal articles over the years.

I currently only pay for a physical Sunday newspaper for the Raleigh News & Observer (and get the online content for free because of that). Besides that I have never paid for a newspaper article or a journal article.

Newspaper paywalls

There are two techniques for dealing with newspaper paywalls. 1) Some newspapers give you a free number of articles per month. To skirt this, you can open the article in a private/incognito window in your preferred browser (or open the article in another browser entirely, e.g. you use Chrome most of the time, but keep Firefox around just for this).

2) If that does not work, and you have the exact address, you can check the WayBack Machine. For example, here is a search for a WaPo article I linked to in the last post. This works even for very recent articles – if you can stand being a few days behind, the article is often already listed on the WayBack Machine.
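As an aside, the WayBack Machine exposes a simple availability API (archive.org/wayback/available) you can query programmatically. Here is a small sketch (the article address and the API reply below are illustrative, not real data):

```python
import json
from urllib.parse import urlencode

def availability_query(url):
    """Build the availability API request address for a given article."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})

def closest_snapshot(api_response):
    """Pull the archived copy's address out of the JSON reply, if any."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

# An abridged example of the JSON shape the endpoint returns
reply = json.loads(
    '{"archived_snapshots": {"closest": {"available": true, '
    '"url": "https://web.archive.org/web/2023/https://example.com/story"}}}'
)
closest_snapshot(reply)
```

Fetching availability_query(...) with any HTTP client and feeding the JSON to closest_snapshot gives you the archived link (or None if no snapshot exists).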

Journal paywalls

I have a single piece of advice here: use Google Scholar. Here for example is searching for the first Braga POP Criminology article in the last post. Google Scholar will tell you if a free pre- or post-print URL exists somewhere. See the PDF link on the right here. (You can click around to "All 8 Versions" below the article as well, and that will sometimes lead to other open links.)

Quite a few papers have PDFs available, and don't worry if it is a pre-print, they rarely change in substance when going into print.1

For my personal papers, I have a google spreadsheet that lists all of the pre-print URLs (as well as the replication materials for those publications).

If those do not work, you can see if your local library has access to the journal, but that is not as likely. And I still have a Uni affiliation that I can use for this (the library and getting some software cheap are the main benefits!). But if you are at that point and need access to a paper I cite, feel free to email and ask for a copy (it is not that much work).

Most academics are happy to know you want to read their work, and so it is nice to be asked to forward a copy of their paper. So feel free to email other academics as well to ask for copies (and slip in a note for them to post their post-prints to let more people have access).

The Criminal Justician and ASEBP

If you like my blog topics, please consider joining the American Society of Evidence Based Policing. To be clear, I do not get paid for referrals, I just think it is a worthwhile organization doing good work. I have started a blog series (which you need a membership to read), and post once a month. The current articles I have written are:

So if you want to read more of my work on criminal justice topics, please join the ASEBP. And it is of course a good networking resource and training center you should be interested in as well.


  1. You can also sign up for email alerts on Google Scholar for papers if you find yourself reading a particular author quite often.↩︎

Over 10 years of blogging

I just realized the other day that I have been blogging for over 10 years (I am old!). My first hello world post was back in December 2011.

I would recommend folks in academia/coding at a minimum make a personal webpage. I use WordPress for my blog (and used the free WordPress tier for quite a long time). WordPress requires zero code to make a personal page to host your CV.

I treat the blog as mostly my personal nerd journal, and blog about things I am working on or rants on occasion. I do not make revenue off of the blog directly, but in terms of getting me exposure it has given quite a few consulting leads over the years. As well as just given my academic work a much wider exposure.

So I always have a few things I want to blog about in the hopper. But always feel free to ask me anything (similar to how Andrew Gelman answers emails), and if I get a chance I will throw up a blog post in response.

CCTV and clearance rates paper published

My paper with Yeondae Jung, The effect of public surveillance cameras on crime clearance rates, has recently been published in the Journal of Experimental Criminology. Here is a link to the journal version to download the PDF if you have access, and here is a link to an open read access version.

The paper examines the increase in case clearances (almost always arrests in this sample) for incidents that occurred nearby 329 public CCTV cameras installed and monitored by the Dallas PD from 2014-2017. Quite a bit of the criminological research on CCTV cameras has examined crime reductions after CCTV installations, which consistently finds a small decrease in crimes. Cameras are often argued to help solve cases though, e.g. catch the guy in the act. So we examined that in the Dallas data.

We did find evidence that CCTV increases case clearances on average. Here is the graph showing the estimated clearances before the cameras were installed (as a function of the distance between the crime location and the camera), and the line after. You can see the bump up for the post period, around 2% in this graph, tapering off to an estimate of no difference beyond 1000 feet.

When we break this down by different crimes though, we find that the increase in clearances is mostly limited to theft cases. We also estimate, counterfactually, how many extra clearances the cameras were likely to cause. So based on our model, we can say something like: a case would have an estimated probability of clearance without a camera of 10%, but with a camera of 12%. We can then do that counterfactual for many of the events around cameras, e.g.:

Probability No Camera   Probability Camera   Difference
    0.10                      0.12             + 0.02
    0.05                      0.06             + 0.01
    0.04                      0.10             + 0.06

And in this example for the three events, we calculate that the cameras increased the total expected number of clearances by 0.02 + 0.01 + 0.06 = 0.09. This marginal benefit mostly depends on the distance to the camera, but can also change based on when the crime was reported and some other covariates.
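The arithmetic above can be written as a tiny sketch: the expected extra clearances are just the summed differences between the with-camera and without-camera probabilities (the values here are the illustrative numbers from the table, not real data):

```python
# (probability without camera, probability with camera) for three example cases
cases = [(0.10, 0.12), (0.05, 0.06), (0.04, 0.10)]

# expected extra clearances = sum of the per-case probability differences
extra = sum(with_cam - no_cam for no_cam, with_cam in cases)
round(extra, 2)  # 0.09 expected additional clearances across the three cases
```

Doing the same sum over every event near a camera gives the cumulative estimate discussed next.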

We do this exercise for all thefts nearby cameras post installation (over 15,000 in the Dallas data), and then get this estimate of the cumulative number of extra theft clearances we attribute to CCTV:

So even with 329 cameras and over a year of post-installation data, we estimate the cameras resulted in fewer than 300 additional theft clearances. It is unlikely any reasonable cost-benefit analysis would suggest cameras are worthwhile for their benefit in clearing additional cases in Dallas.

For those without access to journals, we have the pre-print posted here. The analysis was not edited at all from pre-print to published version; just the front end and discussion sections were lightly edited over the drafts. Not sure why, but this pre-print is likely my most downloaded paper (over 4k downloads at this point) – even in the good journals, when I publish a paper I typically do not get 1000 downloads.

To go on to complaint number 5631 about peer review – this took quite a while to publish because it was rejected on R&R at Justice Quarterly, and with me and Yeondae both having jobs outside of academia it took us a while to revise and resubmit. I am not sure of the overall prevalence of rejections on R&Rs, but I have had quite a few in my career (4 that I can remember). The dreaded send-to-new-reviewers is pretty much guaranteed to result in a reject (pretty much asking to roll a Yahtzee to get it past so many people).

We then submitted to a lower journal, The American Journal of Criminal Justice, where we had reviewers who were not familiar with what counterfactuals are. (An irony of going to a lower journal for an easier time: they tend to have much worse reviewers, so it can sometimes be no easier at all.) I picked it up again a few months ago, and re-reading it I thought it was too good to drop, so resubmitted to the Journal of Experimental Criminology, where the reviews were reasonable and quick, and Wesley Jennings made fast decisions as well.

Bias and Transparency

Erik Loomis over at the LGM blog writes:

It’s fascinating to be doing completely unfundable research in the modern university. It means you don’t matter to administration. At all. You are completely irrelevant. You add no value. This means almost all humanities people and a good number of social scientists, though by no means all. Because universities want those corporate dollars, you are encouraged to do whatever corporations want. Bring in that money. But why would we trust any research funded by corporate dollars? The profit motive makes the research inherently questionable. Like with the racism inherent in science and technology, all researchers bring their life experiences into their research. There is no “pure” research because there are no pure people. The questions we ask are influenced by our pasts and the world in which we grew up. The questions we ask are also influenced by the needs of the funder. And if the researcher goes ahead with findings that the funder doesn’t like, they are severely disciplined. That can be not winning the grants that keep you relevant at the university. Or if you actually work for the corporation, being fired.

And even when I was an unfunded researcher at a university collaborating with police departments, this mostly still applied. The part about the research being quashed was not an issue for me personally, but the types of questions asked are certainly influenced. A PD is unlikely to say 'hey, let's examine some unintended consequences of my arrest policy' – they are much more likely to say 'hey, can you give me an argument to hire a few more guys?'. I do know of instances of other people's work being limited from dissemination – in the ones I am familiar with, honestly, it was stupid for the agencies not to let the researchers go ahead with the work, but I digress.

So we are all biased in some ways – we might as well admit it. What to do? One of my favorite passages in relation to our inherent bias is from Denis Wood’s introduction to his dissertation (see some more backstory via John Krygier). But here are some snippets from Wood’s introduction:

There is much rodomontade in the social sciences about being objective. Such talk is especially pretentious from the mouths of those whose minds have never been sullied by even the merest passing consideration of what it is that objectivity is supposed to be. There are those who believe it to consist in using the third person, in leaning heavily on the passive voice, in referring to people by numbers or letters, in reserving one’s opinion, in avoiding evaluative adjectives or adverbs, ad nauseum. These of course are so many red herrings.

So we cannot be objective, no point denying it. But a few paragraphs later from Wood:

Yet this is no opportunity for erecting the scientific tombstone. Not quite yet. There is a pragmatic, possible, human out: Bare yourself.

Admit your attitudes, beliefs, politics, morals, opinions, enthusiasms, loves, odiums, ethics, religion, class, nationality, parentage, income, address, friends, lovers, philosophies, language, education. Unburden yourself of your secrets. Admit your sins. Let the reader decide if he would buy a used car from you, much less believe your science. Of course, since you will never become completely self-aware, no more in the subjective case than in the objective, you cannot tell your reader all. He doesn’t need it all. He needs enough. He will know.

This dissertation makes no pretense at being objective, whatever that ever was. I tell you as much as I can. I tell you as many of my beliefs as you could want to know. This is my Introduction. I tell you about this project in value-loaded terms. You will not need to ferret these out. They will hit you over the head and sock you in the stomach. Such terms, such opinions run throughout the dissertation. Then I tell you the story of this project, sort of as if you were in my – and not somebody else’s – mind. This is Part II of the dissertation. You may believe me if you wish. You may doubt every word. But I’m not conning you. Aside from the value-loaded vocabulary – when I think I’ve done something wonderful, or stupid, I don’t mind giving myself a pat on the back, or a kick in the pants. Parts I and II are what sloppy users of the English language might call “objective.” I don’t know about that. They’re conscientious, honest, rigorous, fair, ethical, responsible – to the extent, of course, that I am these things, no farther.

I think I’m pretty terrific. I tell you so. But you’ll make up your mind about me anyway. But I’m not hiding from you in the third person passive voice – as though my science materialized out of thin air and marvelous intentions. I did these things. You know me, I’m

Denis Wood

We will never be able to scrub ourselves clean to be entirely objective – a pure researcher, as Loomis puts it. But we can be transparent about the work we do, and let readers decide for themselves whether the work we bring forth is sufficient to overcome those biases or not.