Musings on Project Organization, Books and Courses

Is there a type of procrastination via which people write lists of things? I have that condition.

I have been recently thinking about project organization. At work we have been using the Cookie Cutter Data Science project set up – and I really hate it. I have been thinking about this more recently, as I have taken over several other data scientists models at work. The Cookie Cutter Template is waaaay too complicated, and mixes logic of building python packages (e.g. setup.py, a LICENSE folder) with data science in production code (who makes their functions pip installable for a production pipeline?). Here is the Cookie Cutter directory structure (even slightly cut off):

Cookie cutter has way too many folders (data folder in source, and data folder itself), multiple nested folders (what is the difference between external data, interim, and raw data?, what is the difference between features and data in the src folder?) I can see cases for individual parts of these needed sometimes (e.g. an external data file defining lookups for ICD codes), but why start with 100 extra folders that you don’t need. I find this very difficult taking over other peoples projects in that I don’t know where there are things and where there are not (most of these folders are empty).

So I’ve reorganized some of my projects at work, and they now look like this:

├── README.md           <- High level overview of project + any special notes
├── requirements.txt    <- Default python libraries we often use (eg sklearn, sqlalchemy)
├                         + special instructions for conda environments in our VMs
├── .gitignore          <- ignore `models/*.pkl`, `*.csv`, etc.
├── /models             <- place to store trained and serialized models
├── /notebooks          <- I don't even use notebooks very often, more like a scratch/EDA folder
├── /reports            <- Powerpoint reports to business (using HMS template)
├── /src                <- Place to store functions

And then depending on the project, we either use secret environment variables, or have a YAML file that has database connection strings etc. (And that YAML is specified in .gitignore.)

And then over time in the root folder it will typically have shell scripts call whatever production pipeline or API we are building. All the function files in source is fine, although it can grow to more modules if you really want it to.

And this got me thinking about how to teach this program management stuff to new data scientists we are hiring, and if I was still a professor how I would structure a course to teach this type of stuff in a social science program.

Courses

So in my procrastination I made a generic syllabi for what this software developement course would look like, Software & Project Development For Social Scientists. It would have a class/week on using the command prompt, then a week on github, then a few weeks building a python library, then ditto for an R package. And along the way sprinkle in literate programming (notebooks and markdown and Latex), unit testing, and docker.

And here we could discuss how projects are organized. And social science students get exposed to way more stuff that is relevant in a typical data science role. I have over the years also dreamt up other data science related courses as well.

Stats Programming for CJ. This goes through the basics of data manipulation using statistical programming. I would likely have tutorials for R, python, SPSS, and Stata for this. My experience with students is that even if they have had multiple stats classes in grad school, if you ask them “take this incident dataset with dates, and prepare a weekly level file with counts of crimes per week” they don’t know how to do even that simple task (an aggregation). So students need an entry level data manipulation course.

Optimization for Criminal Justice (or alt title Operations Research and Machine Learning for CJ). This one is not as developed as some of my other courses, but I think I could make it work for a semester. I think learning linear programming is a really great skill not taught at all in any CJ program I am aware of. I have some small notes on machine learning in my Research Design class for PhD students, but that could be expanded out (week for decision trees/forests, week for boosting, week for neural networks, etc.).

And last, I have made syllabi for the one credit entry level course for undergrad students, and the equivalent course for the new PhD students, College Prep. These classes I had I don’t think did a very good job. My intro one at Bloomsburg for undergrad had a textbook lol! The only thing I remember about my PhD one was fear mongering over publications (which at that point I had no idea what was going on), and spending the last class with Julie Horney and David McDowell at whatever the place next to the Washington Tavern in Albany was called (?Gingerbread?).

These are of course just in my head at the moment. I have posted my course materials over the years that I have delivered.

I have pitched to a few programs to hire me as a semi teaching professor (and still keep my private sector gig). This set up is not that uncommon in comp sci departments, but no CJ ones I think are interested. Even though I like musing about courses, adjunct pay is way too low to justify this investment, and should be paid to both develop the material as well as deliver the class.

Books

I have similarly made outlines for books over the years as well. One is Data Science for Crime Analysis with Python. I think there is an opening in the crime analysis market to advance to more professional coding, and so a python book would be good. But the market is overall tiny, my high end guesstimates are only around 800, so hard to justify the effort. (It would be mainly just a collection of my blog posts, but all in a nicer format for everyone to walk through/replicate.)

Another is a reader book, Handbook of Advanced Crime Analysis. That may not be needed though, as Cory Haberman and Liz Groff did a recent book that has quite a bit of overlap (can’t find it at the moment, maybe it is not out yet). Many current advanced techniques are scattered and sometimes difficult to replicate, I figured a reader that also includes code walkthroughs would help quite a few PhD students.

And again if I was still in the publishing game I would like to turn my Poisson course notes into a little Sage green book.

If I was still a professor, this would go hand in hand with developing courses. I know Uni’s do sometimes have grants to develop open source teaching materials, and these would probably best fit those molds. These aren’t going to generate revenue directly from sales.

So complaints and snippets on blog posts are all you are going to get for now from me.

My online course lab materials and musings about online teaching

I often refer folks to the courses I have placed online. Just for an update for everyone, if you look at the top of my website, I have pages for each of my courses at the header of my page. Several of these are just descriptions and syllabi, but the few lab based courses I have done over the years I have put my materials entirely online. So those are:

And each of those pages links to a GitHub page where all the lab goodies are stored.

The seminar in research focuses on popular quasi-experimental designs in CJ, and has code in R/Stata/SPSS for the weekly lessons. (Will need to update with python, I may need to write my own python margins library though!)

Grad GIS is mostly old ArcGIS tutorials (I don’t think I will update ArcPro, will see when Eric Piza’s new book comes out and just suggest that probably). Even though the screenshots are perhaps old at this point though the ideas/workflow are not. (It also has some tutorials on other open source tools, such as CrimeStat, Jerry’s Near Repeat Calculator, GeoDa, spatial regression analysis in R, and Mallesons/Andresens SPPT tool are examples I remember offhand.)

Undergrad Crime Analysis is mostly focused on number crunching relevant to crime analysts in Excel, although has a few things in Access (making SQL queries), and making a BOLO in publisher.

So for folks self-learning of course use those resources however you want. My suggestion is to skim through the syllabus, see if you want to learn about any particular lesson, and then jump right to that one. No need to slog through the whole course if you are just interested in one specific thing.

They are also freely available to any instructors who want to adapt those materials for their own courses as well.


One of the things that has disappointed me about the teaching response to Covid is instead of institutions taking the opportunity to really invest in online teaching, people are just running around with their heads cut off and offering poor last minute hybrid courses. (This is both for the kiddos as well as higher education.)

If you have ever taken a Coursera course, they are a real production! And the ones I have tried have all been really well done; nice videos, interactive quizzes with immediate feedback, etc. A professor on their own though cannot accomplish that, we would need investment from the University in filming and in scripting the webpage. But once it is finished, it can be delivered to the masses.

So instead of running courses with a tiny number of students, I think it makes more sense for Universities to actually pony up resources to help professors make professional looking online courses. Not the nonsense with a bad recorded lecture and a discussion board. It is IMO better to give someone a semester sabbatical to develop a really nice online course than make people develop them at the last minute. Once the course is set up, you really only need to administer the course, which takes much less work.

Another interested party may be professional organizations. For example, the American Society of Criminology could make an ad-hoc committee to develop a model curriculum for an intro criminology course. You can see in my course pages I taught this at one point – there is no real reason why every criminology teacher needs to strike out on their own. This is both more work for the individual teacher, as well as introduces quite a bit of variation in the content that crim/cj students receive.

Even if ASC started smaller, say promoting individual lessons, that would be lovely. Part of the difficulty in teaching a broad course like Intro to Criminology is that I am not an expert on all of criminology. So for example if someone made a lesson plan/video for bio-social criminology, I would be more apt to use that. Think instead of a single textbook, leveraging multi-media.


It is a bit ironic, but one of the reasons I was hired at HMS was to internally deliver data science training. So even though I am in the private sector I am still teaching!

Like I said previously, you are on your own for developing teaching content at the University. There is very little oversight. I imagine many professors will cringe at my description, but one of the things I like at HMS is the collaboration in developing materials. So I initially sat down with my supervisor and project manager to develop the overall curricula. Then for individual lessons I submit my slides/lab portion to my supervisor to get feedback, and also do a dry run in front of one of my peers on our data science team to get feedback. Then in the end I do a recorded lecture – we limit to something like 30 people on WebEx so it is not lagging, but ultimately everyone in the org can access the video recording at a later date.

So again I think this is a better approach. It takes more time, and I only do one lecture at a time (so take a month or two to develop one lecture). But I think that in the end this will be a better long term investment than the typical way Uni’s deliver courses.

New course in the spring – Crime Science

This spring I will be teaching a new graduate level course, Crime Science. A better name for the course would be evidence based policing tactics to reduce crime — but that name is too long!

Here you can see the current syllabus. I also have a page for the course, which I will update with more material over the winter break.

Given my background it has a heavy focus on hot spots policing (different tactics at hot spots, time spent at hot spots, crackdowns vs long term). But the class covers other policing strategies; such as chronic offenders, the focused deterrence gang model, and CPTED. We also discuss the use of technology in policing (e.g. CCTV, license plate readers, body-worn-cameras).

I will weave in ethical discussions throughout the course, but I reserved the last class to specifically talk about predictive policing strategies. In particular the two main concerns are increasing disproportionate minority contact through prediction, and privacy concerns with police collecting various pieces of information.

So take my course!

Spatial analysis course in CJ (graduate) – Spring 2016 SUNY Albany

This spring I am teaching a graduate level GIS course for the school of criminal justice on the downtown SUNY campus. There are still seats available, so feel free to sign up. Here is the page with the syllabus, and I will continue to add additional info./resources to that page.

Academics tend to focus on regression of lattice/areal data (e.g. see Matt Ingrams course over in Poli. Sci.), and in this course I tried to mix in more things I regularly encountered while working as a crime analyst that I haven’t seen coverage of in other GIS courses. For example I have a week devoted to the journey to crime and geographic offender profiling. I also have a week devoted to introducing the current most popular models used to forecast crime.

I’ve started a specific wordpress page for courses, which I will update with additional courses I prepare.