Ask me anything: Advice for learning statistics?

For a bit of background, Loki, a computer science student in India, was asking me about my solution to the DrivenData algae bloom competition. Much of our back and forth was specific to my coding solution and “how I knew how to do that” (in particular I used a machine learning variant of doubly robust estimation in part of the solution, which I am sure others have used before, but it is not one I see very often – it is more often “causal inference” motivated). As for more general advice on learning, I said:

Only advice is to learn stats – not just for competitions but for real-world jobs. Many people are just copy-pasting code and don’t know what they are doing. Understanding selection bias is important in many real-world scenarios. Oftentimes it is just a matter of knowing a little about the scientific scenario you are modeling, and correctly formulating a model.

In response Loki asks:

I decided to take your suggestion and strengthen my grasp on statistics. I consider myself somewhere between beginner and intermediate in stats. I came across several resources on the internet, but feel confused about what to go with. I am wondering if “The Elements of Statistical Learning” by Trevor Hastie and Robert Tibshirani is a good one to start with. Or could you please suggest any books/lectures/courses that have practical applications to solidify my understanding of statistics that you have personally read or liked?

Which I think is a good piece to expand to the readers on my blog in general. Here is my response:

I would not start with that book. It is a mistake to start with material that is too advanced. (I don’t learn anything that way, anyway.)

Starting from the basics, no joke, Gonick’s Cartoon Guide to Statistics is in my opinion the best intro to statistics and probability book. After that, it is important to understand causality – like really understand it – selection bias lurks everywhere. (I am not sure I have great advice for books that focus on causality; Pearl’s book is quite tough. Maybe Shadish, Cook, and Campbell’s Experimental and Quasi-Experimental Designs and/or Mostly Harmless Econometrics.)

After that, follow questions on https://stats.stackexchange.com – it is high quality on average (many internet sources, like Medium articles or https://datascience.stackexchange.com, are very low quality on average – they can have gems, but more often than not they are bad for anything besides copy/pasting code). Andrew Gelman’s blog is another good source for contemporary discussion around stats/research/pitfalls, https://statmodeling.stat.columbia.edu.

In terms of more advanced modeling, after having the basics down, I would suggest Harrell’s Regression Modeling Strategies before the Hastie book. You can interpret pretty much all of machine learning in terms of regression models. For small datasets, understanding how to do simpler regression modeling the right way is the best approach.

When moving on to machine learning, then maybe the Hastie book is a good resource (although at this point I did not find it much more useful than web resources). Statquest videos are very good walkthroughs of more complicated ML algorithms (trees/boosting/neural-networks), https://www.youtube.com/@statquest.

This is a hodge-podge – I don’t tend to learn things just to learn them – I have a specific project in mind and try to tackle that project the best I can. Many of these resources are items I picked up along the way (Gonick I got to review intro stats books for teaching, Harrell’s I picked up to learn a bit more about non-linear modeling with splines, Statquest I reviewed when interviewing for data science positions).

It is a long road to get to where I am. It was not via picking a book and doing intense study, it was a combination of applied projects and learning new things over time. I learned a crazy lot from the Cross Validated site when I was in grad school. (For those interested in optimization, the Operations Research site is also very high quality.) That was more broad learning though – seeing how people tackled problems in different domains.

Crime De-Coder LLC Website

So I have created CRIME De-Coder LLC, a firm to do my consulting work with police departments. Check out my website, crimede-coder.com.

Feedback is welcome. In particular check out the services pages, and my first blog post on what distinguishes my services from most firms. Providing computer code to generate the end product is “teaching a man to fish”, whereas most firms just drop a final report and leave.

And of course feel free to reach out to consult@crimede-coder.com if you are interested in pursuing a project. Going forward I plan on making a new post around once a month, so sign up in your feed reader or using a service like IFTTT.


Setting up a standalone website is not that hard in the end. Currently it is a static site with some custom javascript (hosted on Hostinger). I should set up a PHP server for the new blog posts and RSS feed eventually, but for now this is fine. For those interested in doing the same, I suggest the Jon Duckett books (HTML/Javascript/PHP) for an overview of the tech, and then Dani Kross’s youtube tutorials (for random things like editing the htaccess file).
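
In fact, since the site is static, the RSS feed can be generated offline and uploaded like any other file. A minimal sketch in python (the post list and the feed.xml filename are made up for illustration):

from email.utils import formatdate

# hypothetical list of (title, url) blog posts
posts = [("Example post", "https://crimede-coder.com/blog/example-post")]

# build the <item> entries, pubDate needs an RFC 2822 date
items = "".join(
    f"<item><title>{title}</title><link>{url}</link>"
    f"<pubDate>{formatdate()}</pubDate></item>"
    for title, url in posts)

rss = ("<?xml version='1.0' encoding='UTF-8'?>"
       "<rss version='2.0'><channel>"
       "<title>CRIME De-Coder Blog</title>"
       "<link>https://crimede-coder.com</link>"
       "<description>Blog posts</description>"
       f"{items}</channel></rss>")

with open("feed.xml", "w", encoding="utf-8") as f:
    f.write(rss)

Rerun a script like that whenever a post is added and upload the new feed.xml, and feed readers are none the wiser that there is no server behind it.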

I am not doing a newsletter for the blog posts, as I am concerned it will get my email on random block lists. But if there is demand for it in the future, I guess I will figure out some other service to do that.

I wanted a more bare-metal setup (not a hosted wordpress like this site), as in the future I will likely do demos of dashboards, host some pyscript, make a sign in for paid content, etc. I just wanted flexibility from the start. So stay tuned for more content from CRIME De-Coder!

Getting access to paywalled newspaper and journal articles

So recently several individuals have asked about obtaining articles I cite in my blog posts that they do not have access to. (Here or on the American Society of Evidence Based Policing.) This is perfectly fine, but I also want to share a few tricks I have learned over the years for accessing paywalled newspaper and journal articles.

I currently only pay for a physical Sunday newspaper for the Raleigh News & Observer (and get the online content for free because of that). Besides that I have never paid for a newspaper article or a journal article.

Newspaper paywalls

Two techniques for dealing with newspaper paywalls. 1) Some newspapers give you a free number of articles per month. To skirt this, you can open up the article in a private/incognito window in your preferred browser (or open up the article in another browser entirely, e.g. you use Chrome most of the time, but keep Firefox around just for this on occasion).

2) If that does not work and you have the exact address, you can check the WayBack machine. For example, here is a search for a WaPo article I linked to in the last post. This works even for very recent articles, so if you can stand being a few days behind, the article is often already listed on the WayBack machine.
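
You can also script the check. The Internet Archive exposes a public availability API, so a short python snippet (the article URL below is just a placeholder) will tell you if a snapshot exists:

import requests

def wayback_lookup(url):
    # ask archive.org for the closest archived snapshot of the page
    r = requests.get("https://archive.org/wayback/available",
                     params={"url": url}, timeout=10)
    snap = r.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

# placeholder address, swap in the article you are after
print(wayback_lookup("https://www.washingtonpost.com/example-article"))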

Journal paywalls

Single piece of advice here: use Google Scholar. Here for example is searching for the first Braga POP Criminology article in the last post. Google scholar will tell you if a free pre- or post-print URL exists somewhere. See the PDF link on the right here. (You can also click around to “All 8 Versions” below the article, and that will sometimes lead to other open links as well.)

Quite a few papers have PDFs available, and don’t worry if it is a pre-print – they rarely change in substance when going into print.1

For my personal papers, I have a google spreadsheet that lists all of the pre-print URLs (as well as the replication materials for those publications).

If those do not work, you can see if your local library has access to the journal, but that is not as likely. And I still have a Uni affiliation that I can use for this (the library and getting some software cheap are the main benefits!). But if you are at that point and need access to a paper I cite, feel free to email and ask for a copy (it is not that much work).

Most academics are happy to know you want to read their work, and so it is nice to be asked to forward a copy of their paper. So feel free to email other academics as well to ask for copies (and slip in a note for them to post their post-prints to let more people have access).

The Criminal Justician and ASEBP

If you like my blog topics, please consider joining the American Society of Evidence Based Policing. To be clear, I do not get paid for referrals – I just think it is a worthwhile organization doing good work. I have started a blog series there (which you need a membership to read), and post once a month. The current articles I have written are:

So if you want to read more of my work on criminal justice topics, please join the ASEBP. And it is of course a good networking and training resource you should be interested in as well.


  1. You can also sign up for email alerts on Google Scholar for papers if you find yourself reading a particular author quite often.↩︎

Counting lines of code

Was asked recently how many lines of python code were in my most recent project. A simple command line check: cd into your project directory and run:

find . -type f -name "*.py" | xargs wc -l

(If on windows, you can download the GOW tools to get the same tools that are available by default on unix/mac.) This will include whitespace and non-functional lines (like docstrings), but I think that is ok. Doing this for my current main project at Gainwell, I have about 30k lines of python code. Myself (and now about 4 other people) have been working on that code base for nearly a year.
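
If you are on windows and would rather not install anything, a pure python stand-in for that pipeline is only a few lines (this sketch counts whitespace and docstring lines the same way wc -l does):

from pathlib import Path

def count_lines(root=".", pattern="*.py"):
    # recursively count lines in matching files, like find | xargs wc -l
    total = 0
    for f in Path(root).rglob(pattern):
        total += sum(1 for _ in f.open(encoding="utf-8", errors="ignore"))
    return total

print(count_lines(".", "*.py"))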

For my first production project at (then) HMS, the total lines of python code are 20k, and I developed the bulk of that in around 7 months of work. Assuming 20 work days in a month, that results in around 20000/140 ~ 143 lines of code per workday. I did other projects during that time span, but this was definitely my main focus (and I was the sole developer/data scientist). I think that is high (more code is not necessarily better, and the overall count may even have decreased as development of this project continued over time), but it is in the ballpark of reasonable expectations for working data scientists (I would have guessed closer to around 100 per day). In the grand scheme of things, this is like 2 functions or unit tests per work day (when considering white space and docstrings).

Doing this for all of my python code on my personal machine gives around 60k (I say around, as I am removing counts for projects that are just cloned). And for all the python code on my work machine it is around 140k (for 3 years of work). (I am only giving fuzzy numbers – I have some projects that are joint work that I am half counting, and some cloned code I am not counting at all.)

Doing this same exercise for R code, I only get around 40k lines of code on my personal machine. For instance, my ptools package has under 3k lines of "*.R" code total. I am guessing this is due not only to R code being more concise than python, but also to the fact that taking code into production takes more work. Maybe worth another blog post, but the gist of the difference is that an academic project needs the code to run one time, whereas for a production project the code needs to keep running on a regular schedule indefinitely.

I have written much more SPSS code over my career than R code, but I have most of it archived on Dropbox, so cannot easily get a count of the lines. I have a total of 1846 sps files (note that this does not use xargs).

find . -type f -name "*.sps" | wc -l

It is possible that the average sps file on my machine is 200 lines per file (it is definitely over 100 lines). So I don’t think my recent python migration has eclipsed my cumulative SPSS work going back over a decade (but maybe it will in two more years).
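
If those sps files were synced locally, the average could be checked directly instead of guessed, using the same approach as the python counter above:

from pathlib import Path

files = list(Path(".").rglob("*.sps"))
lines = [sum(1 for _ in f.open(encoding="utf-8", errors="ignore")) for f in files]
print(len(files), sum(lines), sum(lines) / max(len(files), 1))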

Outputs vs Outcomes and Agile

For my criminal justice followers, there is a project planning strategy, Agile, that dominates software engineering. The idea behind Agile is to formulate plans in short sprints (we do two week sprints at my work). So we have very broad based objectives (Epics) that can span a significant amount of time. Then we have shorter goals (Stories) that are intended to take up the sprint. Within each story, we further break down our work into specific tasks that we can estimate how long they will take. So something at my work may look like:

  • Build Model to Predict Readmission for Heart Attacks (Epic)
    • Create data pipeline for training data (Story)
      • SQL functions to prepare data (Task, 2 days)
      • python code to parameterize SQL (Task, 3 days)
      • Unit tests for python code (Task, 1 day)
    • Build ML Model (Story)
      • evaluate different prediction models (Task, 2 days)
    • Deploy ML Model in production (Story)

Etc. People at this point often compare Agile vs Waterfall, where waterfall is more longish-term planning (often on say a quarterly schedule). And Agile, per its name, is supposed to be more flexible, modifying plans on short notice. Most of my problems with Agile could apply to Waterfall planning as well though – short term project planning (almost by its nature) has to be almost solely focused on outputs and not outcomes.

Folks with a CJ background will know what I am talking about here. So police management systems often contrast focusing on easily quantifiable outputs, such as racking up traffic tickets and low level arrests, vs achieving real outcomes, such as increased traffic safety or reducing violent crime. While telling police officers to never do these things does not make sense, you can give feedback/nudge them to engage in higher quality short term outputs that should better promote those longer term outcomes you want.

Agile boards (where we post these Epics/Stories/Tasks, for people to keep tabs on what everyone is doing) are just littered with outputs that have little to no tangible connection to real life outcomes. Take my Heart Attack example. It may be there is a current Heart Attack prediction system in place based on a simple scorecard – utility in that case would be comparing how much better my system is than the simpler scorecard method. Or, if we are evaluating via dollars and cents, it may only make sense to evaluate how effective my system is in promoting better health outcomes (e.g. evaluating how well my predictive system reduces follow up heart attacks or some other measure of health outcomes).

The former comparison is not a unit of time (and so counts for nothing in the Agile framework), although in reality it should be the first thing you do (and you should drop the project if you cannot sufficiently beat the simple baseline). You don’t get brownie points for failing fast in this framework though. In fact you look bad, as you did not deliver a particular product.

The latter example unfortunately cannot be done in a short time period – we are often talking about timescales of years at that point instead of weeks. People can look uber productive on their Agile board, and can easily accomplish nothing of value over broad periods of time.

I am writing this post as we go through our yearly crisis of “we don’t do Agile right” at my workplace. There are other more daily struggles with Agile – who defines what counts as meeting an objective? Are we being sufficiently specific in our task documentation? Are people over/under worked on different parts of the team? Are we estimating the time it takes to do certain tasks accurately? Do our estimates include only actual work, or do they fold in uncertainty due to things other teams are responsible for?

These short term crises of “we aren’t doing Agile right” totally miss the boat for me though. I formulate my work strategy by defining end goals, and then work backwards to plan the incremental outputs necessary to achieve those end goals. The incremental outputs are a means to that end goal, not the ends themselves. I don’t really care if you don’t fill out your short term tasks or mis-estimate something to take a week instead of a day – I (and the business) care about the value added of the software/models you are building. It isn’t clear to me that looking good on your Agile board helps accomplish that.

Over 10 years of blogging

I just realized the other day that I have been blogging for over 10 years (I am old!). My first hello world post was back in December 2011.

I would recommend folks in academia/coding to at a minimum make a personal webpage. I use wordpress for my blog (and used the free wordpress tier for quite a long time). WordPress takes zero code to make a personal page to host your CV.

I treat the blog as mostly my personal nerd journal, and blog about things I am working on or rants on occasion. I do not make revenue off of the blog directly, but in terms of getting me exposure it has given quite a few consulting leads over the years. As well as just given my academic work a much wider exposure.

So I always have a few things I want to blog about in the hopper. But always feel free to ask me anything (similar to how Andrew Gelman answers emails), and if I get a chance I will throw up a blog post in response.

Some peer review ideas

I recently did two more reviews for CrimeSolutions. I actually have two other reviews due, but I jumped CrimeSolutions up in my queue. This of course is likely to say nothing about anyone but myself and my priorities, but I think I can attribute this behavior to two things:

  1. CrimeSolutions pays me to do a review (not much, $250 – IMO I should get double this, but DSG said it was pre-negotiated with NIJ).
  2. CrimeSolutions has a pre-set template. I just have to fill in the blanks, and write a few sentences to point to the article to support my score for that item.

Number 2 in particular was a determinant in me doing the 2nd review CrimeSolutions forwarded to me in very short order. After doing the 1st, I had the template items fresh in my mind, and knew I could do the second with less mental overhead.

I think these can, on the margins, improve some of the current issues with peer reviews. #1 will encourage more people to do reviews, #2 will improve the reliability of peer reviews (as well as make it easier for reviewers by limiting the scope). (CrimeSolutions has the reviewers hash it out if we disagree about something, but that has only happened once to me so far, because the template to fill in is laid out quite nicely.)

Another problem with peer reviews is not just getting people to agree to review, but also getting them to do the review in a timely manner. For this, I suggest a time-graded pay scale – if you do the review faster, you get paid more. Here are some potential curves, setting the pay scale to either drop linearly with the number of days or with a logarithmic drop off:

So here, using the linear scale with a base rate of $300, if you do the review in two weeks you would make $170, but if you take the full 30 days, you make $10. I imagine people may not like the clock running down so fast, so I also devised a logarithmic pay scale that doesn’t ding you so much for taking a week or two, but after that penalizes you quite heavily. At two weeks it is just under $250.
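
In code the two schedules might look like the below sketch (the exact constants are approximate reconstructions of the curves above, not gospel):

import math

BASE, FLOOR, DEADLINE = 300.0, 10.0, 30

def linear_pay(days):
    # straight line from $300 at day 0 down to $10 at day 30
    return max(BASE - (BASE - FLOOR) * days / DEADLINE, FLOOR)

def log_pay(days):
    # gentle for the first week or two, steep near the deadline
    return FLOOR + (BASE - FLOOR) * math.log(DEADLINE + 1 - days) / math.log(DEADLINE + 1)

for d in (7, 14, 30):
    print(d, round(linear_pay(d)), round(log_pay(d)))

Under the log schedule the first week only costs you around $20, while waiting until the deadline costs you nearly the entire fee.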

I realize pay is unlikely to happen (although it is not crazy unreasonable – publishers extract quite a bit of rent from university libraries via subscriptions). But standardized forms are something journals could do right now.

Reversion in the tech stack and why DS models fail

Hackernews recently shared a story about not using an IDE, and I feel mostly the same way. Hence the title of the post – when I steal some time to work on my R package ptools, my current workflow looks like this, using Rterm from the shell:

I don’t have anything against RStudio, I just only have so much room in my brain. Sometimes conversations at work are like a foreign language, “How are we going to test the NiFi script from Hadoop to our Kubernetes environment” or “I can pull the docker image from JFrog, but I run out of room when extracting the image on our sandbox machine. But df says we have plenty of room on all the partitions?”.

If you notice the (base) at the top in front of the shell prompt, that is because this is within the anaconda shell as well. So if you look at many of my past blog posts (see here for one example), I am just using the snipping tool in windows to take screenshots of the shell output in interactive mode.

I typically just write the code in Notepad++ (as well as this blog post) – and it is quite simple to switch between interactively copying over a function/code snippet and compiling entire scripts. Here is a screenshot of R unit tests for example.

So Notepad++ has some text highlighting (for both R and python), and that is nice, but honestly not that necessary. The main thing I use is bracket matching, to make sure brackets are balanced. I am sure I am missing out on some nice autocomplete features that would make me more productive, and function hints in Spyder are nice for pandas functions (though I mostly still use google for that when I need it).

I do use VS Code for development work on our headless virtual machines at work. But that is more to replicate essentially the workflow on Windows with file explorer + Notepad++ + Shell (I am not a vim ninja – what is it, esc + :wq? I need to look that up every time). I fucked up one of my git repos the other day using VS Code’s git tools, and now I am just using git directly. (Again this means I am the problem, not VS Code!)

Why do data science projects fail?

Some more random musings, but the more I get involved and see what projects work and what don’t at work, pretty much all of the failures I have come across are due to what I will call “not modeling the right thing”. That potentially covers a lot of ground, but quite a few failures are simply a matter of not understanding counterfactual reasoning and selection bias.

The modeling part (in terms of actually fitting models) is typically quite easy – you do some simple but slightly theoretically informed feature engineering and feed that info into a machine learning model that is very flexible. But maybe that is the problem: people can easily fool themselves into thinking a model looks good, but because they are modeling the wrong thing it does not result in better decision making.

Even in most of the failures I have seen, selection bias is surmountable (often it just requires multiple models, or models on different samples of data – reduced form for the win!). Learning how to train/test split the data and feed it into XGBoost only takes a few classes. Knowing the right thing to model, though, takes a bit more thought.
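
To illustrate how little code the mechanical part is, here is a sketch with simulated data (using XGBoost’s sklearn wrapper – the data, and whether the target is worth predicting at all, are the made up parts):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# fake data standing in for whatever outcome you decided to model
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

A good looking AUC here says nothing about whether y was the right thing to model in the first place – and that is where the projects fall down.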

A secondary part of the failure is not learning how to translate the model outputs into actionable decisions. But not modeling the right thing happens at the start, so it makes any downstream decision not work out how you want.

Ask me anything

So I get cold emails probably a few times a month asking random coding questions (which is perfectly fine – the main point of this post!). I’ve suggested in the past that folks use a few different online forums, but like many forums I have participated in, they died out quite quickly (so they are not viable alternatives currently).

I think going forward I will mimic what Andrew Gelman does on his blog, just turn my responses into blog posts for everyone (e.g. see this post for an example). I will of course ask people permission before I post, and omit names same as Gelman does.

I have debated over time doing a Patreon account, but I don’t think that would work very well (imagine I would get 1.2 subscribers for $3 a month!). Ditto for writing books – I debate doing a Data Science for Crime Analysts in Python book or something along those lines, but then I write the outline and think it is too much work to have at best a few hundred people purchase the book in the end. I will do consulting gigs for folks, but the majority of questions people ask do not take long enough to justify running a tab for the work (and I have no desire to rack up charges for grad students asking a few questions).

So feel free to ask me anything.

CrimRxiv, Alt-Journal Contributions, and Mike Maltz’s Retrospective

As I’m sure followers of mine know, I am a big proponent of posting pre-prints. Scott Jacques has spearheaded a specifically criminology focused pre-print server titled CrimRxiv. It is still in beta, but anyone can contribute a paper if they want.

One of the things Scott and I have been jamming about is how to leverage crimrxiv to make a journal that not only takes advantage of all the goodies on the internet, such as being able to embed interactive graphics or other rich media directly in journal articles, but really widens the scope of what ‘counts’ as a scholarly contribution. Why can’t things like a cool app, or a really good video lecture you edited, or a blog post illustrating code be put on the same level as journal articles?

Part of the reason I am writing this blog post is that I saw Michael Maltz recently publish a retrospective on his career on Academia.edu. This isn’t a typical journal article, but despite that there is no reason why you shouldn’t share such pieces. So I was able to convince Mike to post A Retrospective Look at My Professional Life to crimrxiv. When he first posted it on Academia.edu, here was my response on how Mike (despite our paths never having crossed) has influenced my career.


Hi Michael and thank you for sharing,

I’ve followed your work since I was a grad student at Albany. I initially got hooked on data viz based on Tufte’s book. When I looked for examples of criminologists discussing data viz, you were the only one I found. That was sometime around 2010, so you had that chapter in the handbook of quantitative crim. I was also familiar with another article of yours about drawing glyphs to illustrate life course transitions.

When I finished my classes at SUNY, I then worked at Troy as a crime analyst while finishing my dissertation. I doubt any of the coffee shops were the same from your time, but I did like walking over to Famous hotdogs for lunch every now and then.

Most of my work at the PD was making time series graphs and maps. No regression, so most of my stats training was not particularly useful. Even the mapping course I took, which focused on areal data analysis, was not terribly relevant.

I tried to do projects similar to your glyph life-courses with interval censored crime data, but I was never really successful with that; they always ended up being too complicated with even moderately large crime datasets, see https://andrewpwheeler.com/2013/02/28/interval-graph-for-viz-temporal-overlap-in-crime-events/ and https://andrewpwheeler.com/2014/10/02/stacking-intervals/ for my attempts.

What was much more helpful was simply monitoring metrics over time with simple running means, and then I just inverted the CDF of the Poisson to give error bars, e.g. https://andrewpwheeler.com/2016/06/23/weekly-and-monthly-graphs-for-monitoring-crime-patterns-spss/. Cases that fell outside the error bands then signified an anomalous pattern. In Troy there was an arrest of a single prolific person breaking into cars, and the trend went from a creeping 10 year high to a 10 year low instantly in those graphs.
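
The gist of that approach, sketched in python with fake weekly counts (the real version was in SPSS), is just a trailing mean plus Poisson quantiles:

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
counts = rng.poisson(10, size=104)  # two years of fake weekly counts

for week in range(8, len(counts)):
    mu = counts[week - 8:week].mean()            # trailing 8 week mean
    low, high = poisson.ppf([0.025, 0.975], mu)  # inverse CDF error band
    if not low <= counts[week] <= high:
        print(f"week {week}: {counts[week]} outside [{low:.0f}, {high:.0f}]")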

So there again we have your work on the Poisson distribution and operations research in that JQC article. Also sometime in there I saw a comment you made on Andrew Gelman’s blog pointing to your work with error bands for BJS. I took that ‘fan chart’ idea later on and provided error bands for city level and USA level homicide trends, e.g. https://apwheele.github.io/MathPosts/FanChart_NewOrleans.html. Most popular discussion of large scale crime trends is misguided over-interpretation of short term noise in my opinion.

So all my degrees are in criminal justice, but I have been focusing more on linear programming over time, borrowing from operations researchers as well, https://andrewpwheeler.com/2020/05/29/an-intro-to-linear-programming-for-criminologists/. I’ve found that taking outputs from a predictive model and then applying a decision analysis to specifically articulate strategies CJ agencies should take is much more fruitful than the typical way academic research is done.

Thank you again for sharing your story and best, Andy Wheeler