Musings on Project Organization, Books and Courses

Is there a type of procrastination via which people write lists of things? I have that condition.

I have been recently thinking about project organization. At work we have been using the Cookie Cutter Data Science project set up – and I really hate it. I have been thinking about this more recently, as I have taken over several other data scientists models at work. The Cookie Cutter Template is waaaay too complicated, and mixes logic of building python packages (e.g., a LICENSE folder) with data science in production code (who makes their functions pip installable for a production pipeline?). Here is the Cookie Cutter directory structure (even slightly cut off):

Cookie cutter has way too many folders (data folder in source, and data folder itself), multiple nested folders (what is the difference between external data, interim, and raw data?, what is the difference between features and data in the src folder?) I can see cases for individual parts of these needed sometimes (e.g. an external data file defining lookups for ICD codes), but why start with 100 extra folders that you don’t need. I find this very difficult taking over other peoples projects in that I don’t know where there are things and where there are not (most of these folders are empty).

So I’ve reorganized some of my projects at work, and they now look like this:

├──           <- High level overview of project + any special notes
├── requirements.txt    <- Default python libraries we often use (eg sklearn, sqlalchemy)
├                         + special instructions for conda environments in our VMs
├── .gitignore          <- ignore `models/*.pkl`, `*.csv`, etc.
├── /models             <- place to store trained and serialized models
├── /notebooks          <- I don't even use notebooks very often, more like a scratch/EDA folder
├── /reports            <- Powerpoint reports to business (using HMS template)
├── /src                <- Place to store functions

And then depending on the project, we either use secret environment variables, or have a YAML file that has database connection strings etc. (And that YAML is specified in .gitignore.)

And then over time in the root folder it will typically have shell scripts call whatever production pipeline or API we are building. All the function files in source is fine, although it can grow to more modules if you really want it to.

And this got me thinking about how to teach this program management stuff to new data scientists we are hiring, and if I was still a professor how I would structure a course to teach this type of stuff in a social science program.


So in my procrastination I made a generic syllabi for what this software developement course would look like, Software & Project Development For Social Scientists. It would have a class/week on using the command prompt, then a week on github, then a few weeks building a python library, then ditto for an R package. And along the way sprinkle in literate programming (notebooks and markdown and Latex), unit testing, and docker.

And here we could discuss how projects are organized. And social science students get exposed to way more stuff that is relevant in a typical data science role. I have over the years also dreamt up other data science related courses as well.

Stats Programming for CJ. This goes through the basics of data manipulation using statistical programming. I would likely have tutorials for R, python, SPSS, and Stata for this. My experience with students is that even if they have had multiple stats classes in grad school, if you ask them “take this incident dataset with dates, and prepare a weekly level file with counts of crimes per week” they don’t know how to do even that simple task (an aggregation). So students need an entry level data manipulation course.

Optimization for Criminal Justice (or alt title Operations Research and Machine Learning for CJ). This one is not as developed as some of my other courses, but I think I could make it work for a semester. I think learning linear programming is a really great skill not taught at all in any CJ program I am aware of. I have some small notes on machine learning in my Research Design class for PhD students, but that could be expanded out (week for decision trees/forests, week for boosting, week for neural networks, etc.).

And last, I have made syllabi for the one credit entry level course for undergrad students, and the equivalent course for the new PhD students, College Prep. These classes I had I don’t think did a very good job. My intro one at Bloomsburg for undergrad had a textbook lol! The only thing I remember about my PhD one was fear mongering over publications (which at that point I had no idea what was going on), and spending the last class with Julie Horney and David McDowell at whatever the place next to the Washington Tavern in Albany was called (?Gingerbread?).

These are of course just in my head at the moment. I have posted my course materials over the years that I have delivered.

I have pitched to a few programs to hire me as a semi teaching professor (and still keep my private sector gig). This set up is not that uncommon in comp sci departments, but no CJ ones I think are interested. Even though I like musing about courses, adjunct pay is way too low to justify this investment, and should be paid to both develop the material as well as deliver the class.


I have similarly made outlines for books over the years as well. One is Data Science for Crime Analysis with Python. I think there is an opening in the crime analysis market to advance to more professional coding, and so a python book would be good. But the market is overall tiny, my high end guesstimates are only around 800, so hard to justify the effort. (It would be mainly just a collection of my blog posts, but all in a nicer format for everyone to walk through/replicate.)

Another is a reader book, Handbook of Advanced Crime Analysis. That may not be needed though, as Cory Haberman and Liz Groff did a recent book that has quite a bit of overlap (can’t find it at the moment, maybe it is not out yet). Many current advanced techniques are scattered and sometimes difficult to replicate, I figured a reader that also includes code walkthroughs would help quite a few PhD students.

And again if I was still in the publishing game I would like to turn my Poisson course notes into a little Sage green book.

If I was still a professor, this would go hand in hand with developing courses. I know Uni’s do sometimes have grants to develop open source teaching materials, and these would probably best fit those molds. These aren’t going to generate revenue directly from sales.

So complaints and snippets on blog posts are all you are going to get for now from me.