For a bit of background, Loki, a computer science student in India, was asking me about my solution to the DrivenData algae bloom competition. Much of our back and forth was specific to my coding solution and “how I knew how to do that” (in particular I used a machine learning variant of doubly robust estimation in part of the solution, which I am sure others have used before but is not real common that I see, it is more often “causal inference” motivated). As for more general advice in learning, I said:
Only advice is to learn stats – not just for competitions but for real-world jobs. Many people are just copy-pasting code, and don’t know what they are doing. Understanding selection bias is important in many real-world scenarios. Often times it is just knowing a little about the scientific scenario you are modeling, and correctly formulating a model.
In response Loki asks:
I decided to take your suggestion and strengthen my grasp on statistics. I consider myself somewhere between beginner to intermediate in stats. I came across several resources on the internet, but feel confused about what to go with. I am wondering if “The Elements of Statistical Learning” by Trevor Hastie and Robert Tibishirani is a good one to start with. Or could you please suggest any books/lectures/courses that have practical applications to solidify my understanding of statistics that you have personally read or liked?
Which I think is a good piece to expand to the readers on my blog in general. Here is my response:
I would not start with that book. It is a mistake to start with too advanced of material. (I don’t learn anything that way anyway.)
Starting from the basics, no joke Gonick’s Cartoon Guide to Statistics is in my opinion the best intro to statistics and probability book. After that, it is important to understand causality – like really understand it – selection bias lurks everywhere. (I am not sure I have great advice for books that focus on causality, Pearl’s book is quite tough, maybe Shadish, Cook, Campbell Experimental and Quasi-Experimental Designs and/or Mostly Harmless Econometrics).
After that, follow questions on https://stats.stackexchange.com, it is high quality on average (many internet sources, like Medium articles or https://datascience.stackexchange.com, are very low quality on average – they can have gems but more often than not they are bad for anything besides copy/pasting code). Andrew Gelman’s blog is another good source for contemporary discussion around stats/research/pitfalls, https://statmodeling.stat.columbia.edu.
In terms of more advanced modeling, after having the basics down, I would suggest Harrell’s Regression Modeling Strategies before the Hastie book. You can interpret pretty much all of machine learning in terms of regression models. For small datasets, understanding how to do simpler regression modeling the right way is the best approach.
When moving onto machine learning, then maybe the Hastie book is a good resource (I didn’t find it all that much useful at this point beyond web resources). Statquest videos are very good walkthroughs of more complicated ML algorithms, https://www.youtube.com/@statquest, trees/boosting/neural-networks.
This is a hodge-podge – I don’t tend to learn things just to learn them – I have a specific project in mind and try to tackle that project the best I can. Many of these resources are items I picked up along the way (Gonick I got to review intro stats books for teaching, Harrell’s I picked up to learn a bit more about non-linear modeling with splines, Statquest I reviewed when interviewing for data science positions).
It is a long road to get to where I am. It was not via picking a book and doing intense study, it was a combination of applied projects and learning new things over time. I learned a crazy lot from the Cross Validated site when I was in grad school. (For those interested in optimization, the Operations Research site is also very high quality.) That was more broad learning though – seeing how people tackled problems in different domains.