Dashboards are often not worth the effort

When end users see dashboards, they often think “this is really whiz-bang cool, let’s do that for our data”. There are two issues I commonly see with dashboards though. One is that the nature of the task to be accomplished with the dashboard is not well defined, and so even a visually well done dashboard mostly goes unused. The second is that there are a ton of headaches deploying dashboards with real data – and the effort to do it right is not worth it.

For the first part, consider a small agency that has a simple crime trends dashboard. The intent is to identify anomalous upticks of crime. This requires someone to log into the dashboard, click around the different crime trends, and visually check whether they are higher than expected. It would be easier to either have an automated alert when some threshold is met, or a standardized report (e.g. once a week) that is emailed for review.
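To make the alert idea concrete, here is a minimal sketch. The function name, the "mean plus 3 standard deviations" rule, and the counts are all hypothetical choices for illustration, not a recommendation for any particular agency. A script like this could run on a schedule and only email when something is flagged:

```python
import statistics

def flag_upticks(history, latest, k=3.0):
    """Return crime types whose latest weekly count is unusually high.

    history: dict of crime type -> list of past weekly counts
    latest:  dict of crime type -> most recent weekly count
    k:       how many standard deviations above the mean counts as an uptick
    """
    flagged = []
    for crime, past_counts in history.items():
        mean = statistics.mean(past_counts)
        sd = statistics.stdev(past_counts)
        if latest[crime] > mean + k * sd:
            flagged.append(crime)
    return flagged

# Hypothetical weekly counts, for illustration only
history = {
    "burglary": [8, 10, 9, 11, 7, 9],
    "larceny": [100, 95, 105, 98, 102, 100],
}
latest = {"burglary": 25, "larceny": 103}

alerts = flag_upticks(history, latest)
print(alerts)  # only burglary exceeds its threshold here
```

Nobody has to remember to log in and click around; the analyst only hears about it when a threshold is crossed.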

This post is not going to even be about ‘most dashboards show stupid data’ or ‘the charts show the data in totally inappropriate ways’. Even in cases in which you can build a nice dashboard, the complexity level is IMO not worth it in many situations I have encountered. Automating alerts and building regular standard reports for the vast majority of situations is in my opinion a better solution for data products. The whiz-bang being able to interactively click stuff will only be used by a very small number of people (who can often build stuff like that for themselves anyway).

So onto the second part, deploying dashboards with real data that others can interact with. Many of the cool example dashboards you see online do tricks that would be inappropriate for a production dashboard. So even if someone is like ‘yeah I can do a demo dashboard in a day on my laptop’, there is still a ton more work to expose that dashboard to different individuals, which is probably the ultimate goal.

Dashboards have two parts: A) connecting to a source database, then B) exposing the user-interface to outside parties. Seems simple right? Wrong.

Let’s start with part A, connecting to a source database. Many dashboard examples you see online cheat at this stage – they embed a static version of the data in the dashboard itself. If Gary needs to re-upload the data to the dashboard every week, what happens when Gary goes on vacation or takes a day off? You now have an out of date dashboard.

It is quite hard to manage connections between live source data and a dashboard. In the case you have a public facing dashboard, I would just host a public file somewhere on the internet (so anyone can download the source data), and point to that source, and then have some automated process update that source data. This solves the Gary went on vacation factor at least, and there is no security risk. You intentionally only upload data that can be disseminated to the public.

One potential alternative with the public facing dashboard is to make a serverless file. This will often embed the dashboard (and maybe data) into the webpage itself (e.g. pyscript wasm), so it may be slow to start up, but will work reasonably well if you can handle a one minute lag. You don’t need to worry about malicious actors in that scenario, as the heavy computation is done on the client’s computer, not your server. I have an example on CRIME De-Coder (note it is currently not up to date; my code is running fine, but Dallas has not been updating their public data since the cyber-attack).

Managing a direct connection to your source data in a public facing dashboard is not a good idea, as malicious actors can spam your dashboard. This denial of service attack will not only make your dashboard unresponsive, but will also eat up your database processing. (Big companies often have a reporting database server vs a production database server in the best case scenario, but this requires resources most public sector agencies do not have.)

The solution to this is to limit who can access the dashboard, part B above. Unfortunately, with the majority of dashboard software, once you want to make a live connection and/or limit who can see the information, you are in the ‘need to pay for this’ scenario. A very unfortunate aspect of the ‘you need to pay for this’ is that most of these vendors charge per viewer – it isn’t a flat fee. PowerBI is maybe OK if your organization already pays for Sharepoint; Tableau licensing is so painful I wouldn’t suggest it.

So what is the alternative? You are now in charge of spinning up your own server so people can click a dropdown menu and generate a line graph. Again you need to worry about security, but at least if you can hide behind a local network or VPN it is probably doable for most police departments.

I don’t want to say dashboards are never worth it. I think public facing dashboards are a good thing for police transparency, and if done right are easier to implement than private ones. Private ones are doable as well (especially if limiting to intranet applications). But like I said, this level of effort is over the top compared to just doing a regular report.

Updates on CRIME De-Coder and ASEBP

So I have various updates on my CRIME De-Coder consulting site, as well as new posts on the American Society of Evidence Based Policing Criminal Justician series.

CRIME De-Coder

Blog post, Don’t use percent change for crime data, use this stat instead. I have written a bunch here about using Poisson Z-scores, so if you are reading this it is probably old news. Do us all a favor and in your Compstat reports drop ridiculous percent change metrics with low baselines, and use 2 * ( sqrt(Current) - sqrt(Past) ).

Blog post, Dashboards should be up to date. I will have a more techy blog post here on my love/hate relationship with dashboards (most of the time static reports are a better solution). One scenario where they do make sense is public facing dashboards, but those should be up to date. The “free” versions of popular tools (Tableau, PowerBI) don’t allow you to link to a source dataset and get auto-updated, so you see many old, out of date dashboards online. If you contract with me, I can automate it so it is up to date and doesn’t rely on an analyst manually updating the information.

Demo page – that page currently includes demonstrations for:

The WDD Tool is pure javascript – picking up more of that slowly (the Folium map has a few javascript listener hacks to get it to look the way I want). As a reference for web development, I like Jon Duckett’s three books (HTML, javascript, PHP).

Ultimately too much stuff to learn, but on the agenda are figuring out google cloud compute + cloud databases a bit more thoroughly. Then maybe add some PHP to my CRIME De-Coder site (a nicer contact me form, auto-update sitemap, and rss feed). I also want to learn how to make ArcGIS dashboards as well.

Criminal Justician

The newest post is Situational crime prevention and offender planning – discussing one of my favorite examples of crime prevention through environmental design (on suicide prevention) and how it is a signal about offender behavior that goes beyond simplistic impulsive behavior. I then relate this back to current discussion of preventing mass shootings.

If you have ideas about potential posts for the society (or this blog or the CRIME De-Coder blog), always feel free to make a pitch.

Hacking folium for nicer legends

I have accumulated various code to hack folium based maps over several recent projects, so I figured I would share. It is a bit too much to walk through the entire code inline in a blog post, but at a high level the extras this code does:

  • Adds in svg elements for legends in the layer control
  • Has a method for creating legends for choropleth maps
  • Inserts additional javascript to make a nice legend title (here a clickable company logo) + additional attributions

Here is the link to a live example, and below is a screenshot:

So as a quick rundown, if you are adding in an element to a folium layer, e.g.:

folium.FeatureGroup(name="Your Text Here",overlay=True,control=True)

You can insert arbitrary svg code into the name parameter, e.g. you can do something like <span><svg .... /svg>HotSpots</span>, and it will correctly render. So I have functions to make the svg icon match the passed color. So you can see I have a nice icon for city boundaries, as well as a blobby thing for hotspots.
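As a rough sketch of that idea (this is not the function from my linked code; the name svg_label and the plain square icon are made up for illustration), a helper can build the label string with the icon colored to match the layer:

```python
def svg_label(text, color, size=12):
    """Build an HTML label with a small colored square SVG icon.

    Hypothetical helper: the real icons could be any SVG shape.
    """
    svg = (
        f'<svg width="{size}" height="{size}" xmlns="http://www.w3.org/2000/svg">'
        f'<rect width="{size}" height="{size}" fill="{color}" /></svg>'
    )
    return f"<span>{svg} {text}</span>"

label = svg_label("HotSpots", "#d62728")
print(label)

# With folium installed, the string is passed as the layer name, e.g.
# folium.FeatureGroup(name=label, overlay=True, control=True)
```

The layer control then shows the icon next to the text, so the layer control doubles as a legend.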

In the end there are so many possible parameters that I try to write reasonable functions without a crazy number of them. So if someone wanted different icons, I might just make a different function (I probably wouldn’t worry about passing in different potential svg).

I have a function for choropleth maps as well – I tend to not like the functions that you pass in a continuous variable and it does a color map for you. So here it is simple, you pass in a discrete variable with the label, and a second dictionary with the mapped colors. I tend to not use choropleth maps in interactive maps very often, as they are more difficult to visualize with the background map. But there you have it if you want it.

The final part is using javascript to insert the Crime Decoder logo (as a title in the legend/layer control), as well as the map attribution with additional text. These are inserted via additional javascript functions that I append to the html (so this wouldn’t work say inline in a jupyter notebook). The logo part is fairly simple, the map attribution though is more complicated, and requires creating an event listener in javascript on the correct elements.

The way that this works, I actually have to save the HTML file, then I reread the text back into python, add in additional CSS/javascript, and then resave the file.
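The save, reread, and resave step can be sketched as below. This is a simplified stand-in, not my actual code: the function name, the demo HTML, and the choice to insert just before the closing body tag are all assumptions, and scripts that depend on the map object may need to go later in the file.

```python
from pathlib import Path
import tempfile

def inject_before_body_end(html_path, extra):
    """Reread a saved HTML file and insert extra CSS/javascript before </body>."""
    path = Path(html_path)
    html = path.read_text(encoding="utf-8")
    head, sep, tail = html.rpartition("</body>")
    if not sep:
        raise ValueError("no closing </body> tag found")
    path.write_text(head + extra + sep + tail, encoding="utf-8")

# Stand-in for a map file that folium has already saved to disk
demo = Path(tempfile.mkdtemp()) / "map.html"
demo.write_text("<html><body><div id='map'></div></body></html>", encoding="utf-8")

snippet = "<script>console.log('legend title goes here');</script>"
inject_before_body_end(demo, snippet)

result = demo.read_text(encoding="utf-8")
print(result)
```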

If you want something like this for your business/website/analysts, just get in contact.

For next tasks, I want to build a demo-dashboard for Crime De-Coder (probably a serverless dashboard using wasm/pyscript). In terms of leaflet extras, since you can embed SVG into different elements, you can create charts in popups/tooltips, which would be a cool addition to my hotspots (click and it has a time series chart inside).

Saving data files in wheel packages

As a small update, I continue to use my retenmod python package as a means to illustrate various python packaging and CICD tricks. Most recently, I have added in an example of saving local data files inside of the wheel package. So instead of just packaging the .py files in the wheel package, it also bundles up two data files: a csv file and a json data file for illustration.

The use case I have seen for this: sometimes I see individual .py files in people’s packages that have thousands of lines, and they are typically just lookup tables. It is better to save those lookup tables in more traditional formats than it is to coerce them into python objects.

It is not too difficult, but here are the two steps you need:

Step 1, in setup.cfg in the root, I have added this package_data option (it goes under the [options.package_data] section).

* = *.csv, *.json

Step 2, create a set of new functions to load in the data. You need to use pkg_resources to do this. It is simple enough to just copy-paste the entire data_funcs.py file here in a blog post to illustrate:

import json
import numpy as np
import pkg_resources

# Reading the csv data
def staff():
    stream = pkg_resources.resource_stream(__name__, "agency.csv")
    df = np.genfromtxt(stream, delimiter=",", skip_header=1)
    return df

# Reading the metadata
def metaf():
    stream = pkg_resources.resource_stream(__name__, "notes.json")
    res = json.load(stream)
    return res

# just having it available as object
metadata = metaf()

So instead of doing something like pd.read_csv('agency.csv') (or here I use numpy, as I don’t have pandas as a package dependency for retenmod), you create a stream object, and the __name__ is just the way for python to figure out all of the relative path junk. Depending on the different downstream modules, you may need to stream.read(), but here for both json and numpy you can just pass them to their subsequent read functions and it works as intended.
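One caveat: pkg_resources is deprecated in recent setuptools releases. The standard library's importlib.resources can do the same job. Below is a sketch of that alternative. It is not what retenmod currently uses, and the demo_pkg package is a throwaway built in a temp folder just so the example runs on its own.

```python
import importlib
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path

def load_json_resource(package, filename):
    """Load a JSON file bundled inside an importable package."""
    # resources.files() requires Python 3.9 or later
    res = resources.files(package).joinpath(filename)
    with res.open("r", encoding="utf-8") as f:
        return json.load(f)

# Build a throwaway package (hypothetical name) containing one data file
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "notes.json").write_text(json.dumps({"source": "example"}))
sys.path.insert(0, str(root))
importlib.invalidate_caches()

metadata = load_json_resource("demo_pkg", "notes.json")
print(metadata)
```

For a real package you would pass the package's own name (e.g. __package__ from inside a module) instead of the demo name.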

And again you can check out the github actions to see the inner workings of testing the package and generating the wheel file all in one go.

If you install the latest via the github repo:

pip install https://github.com/apwheele/retenmod/blob/main/dist/retenmod-0.0.1-py3-none-any.whl?raw=true

And to test out this, you can do:

from retenmod import data_funcs

data_funcs.metadata # dictionary from json
data_funcs.staff() # loading in numpy array from CSV file

If you go check out wherever your package is installed on your machine, you can see that it will have the agency.csv and the notes.json file, along with the .py files with the functions.

Next on the todo list, auto uploading to pypi and incrementing minor tags via CICD pipeline. So if you know of example packages that do that already let me know!

Machine learning models and the market for lemons

The market for lemons is an economic concept in which buyers of a good cannot distinguish between quality products and poor products (the lemons). This lack of knowledge makes it so that people selling lemons can always underbid people with higher quality products. In the long run, all quality vendors are driven out, and only cheap lemon sellers remain.

I believe people selling predictive models (or machine learning models, or forecasting products, or artificial intelligence, to round out all the SEO terms) are highly susceptible to this. This occurs in markets in which the predictive models cannot be easily evaluated.

What reminded me of this is I recently saw a vendor saying they have the “most accurate” population health predictive models. This is a patently absurd assertion (even if you hosted a kaggle style competition, it would only apply to that kaggle dataset, not a more general claim about that particular institution’s population). But the majority of buyers (different healthcare systems) likely have no way to evaluate my company’s claims vs this vendor’s.

ChatGPT is another recent example. Although it can generate on its face “quality” answers, don’t use it to diagnose your illnesses. ChatGPT is very impressive at generating grammatically correct responses, so to a layman may appear to be high quality, but really it is very superficial in most domains (no different than using google searches to do anything complicated, which can be useful sometimes but is very superficial).

So what is the solution? From a consumer perspective, here is my advice. You should ask the vendor to do a demonstration on your own data. So you ask the vendor, “here is my data for 2019, can you forecast the 2020 data?”, or something along those lines where you provide a training set and a test set. Then you have the vendor generate predictions for the test set, and you do the evaluation yourself to see if their predictions are worth the cost of the product.
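Doing the evaluation yourself does not need to be fancy. Here is a minimal sketch, with made up numbers and mean absolute error as one assumed choice of metric, comparing a vendor's test set predictions to a dumb baseline you can produce yourself:

```python
def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out values and predictions, for illustration only
actual_2020 = [12, 15, 9, 20]
vendor_pred = [10, 15, 12, 18]   # what the vendor sends back
naive_pred = [11, 14, 10, 17]    # e.g. last year's values carried forward

vendor_mae = mean_abs_error(actual_2020, vendor_pred)
naive_mae = mean_abs_error(actual_2020, naive_pred)

print(f"vendor MAE: {vendor_mae:.2f}, naive MAE: {naive_mae:.2f}")
if vendor_mae >= naive_mae:
    print("The vendor model does not beat the free baseline on this data.")
```

The key design choice is that the vendor never sees the test set outcomes; you keep those and score the predictions on your end.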

This is a situation in which academic peer review has some value as well (as well as data competitions). You can see that the method a particular group used was validated by its peers, but ultimately the local tests on your own data will be needed. Even if my recidivism model is accurate for Georgia, it won’t necessarily generalize to your state.

If you are in a situation in which you do not have data to validate the results in the end, you need to rely on outside experts and understanding the methodology used to generate the estimates. A good example of this is people selling aggregate crime data (that literally make numbers up). I have slated a blog post about that in the near future to go into more detail, but in short there is no legitimate seller of second hand crime data in the US currently.

If you are interested in building or evaluating predictive models, please get in touch with my consulting services. While I say that markets for lemons can drive prices down, I still see quite a few ridiculous SaaS prices, like $900k for a black box, unevaluated early intervention system for police.

At least so far many of these firms are using the Joel Spolsky 6 figure sales approach for crappy products. My consulting firm can easily beat a six digit price tag, so the lemons have not driven me out yet.

A statistical perspective on year-to-date metrics

Jerry Ratcliffe, and now more recently Jeff Asher, have written about how volatile early year projections of year-to-date (YTD) percent changes are. I am going to argue this is not the right way to frame the problem in my opinion – I will present a better behaved estimate that is less volatile, but clearly doesn’t give police departments what they want.

Going to the end advice first – people find me irksome for the suggestion, but you shouldn’t be using percent changes at all. A simple alternative I have stated for low count crime data is a Poisson Z-score, which is simply 2*(sqrt(Current) - sqrt(Past)) – a value of greater than 3 or 4 is a signal the two processes are significantly different (under the null hypothesis that the counts have a Poisson distribution).
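To show why the low baseline matters, here is a small sketch with made up counts (the function names are just for illustration) comparing percent change to the Poisson Z-score for a low count series and a high count series:

```python
import math

def pct_change(current, past):
    """Percent change from past to current."""
    return 100 * (current - past) / past

def poisson_z(current, past):
    """Poisson Z-score: 2*(sqrt(Current) - sqrt(Past))."""
    return 2 * (math.sqrt(current) - math.sqrt(past))

# Hypothetical (past, current) pairs: one low baseline, one high baseline
for past, current in [(4, 9), (400, 441)]:
    pc = pct_change(current, past)
    z = poisson_z(current, past)
    print(f"past={past}, current={current}, pct change={pc:.2f}%, z={z:.2f}")
```

Both pairs have the same Z-score of 2, below the 3 to 4 signal range, but the low count series shows a scary looking 125% increase while the high count series shows about 10%.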

A Better YTD estimate

So here I am going to present a more accurate YTD percent change metric – but don’t take that as advice you should be using YTD percent change. It is more of an exercise to say why you shouldn’t be using this metric to begin with. Year end percent change is defined as:

(Current - Past)/Past = % Change

Note that you can rewrite this as:

Current/Past - Past/Past  = % Change
Current/Past - 1          = % Change

So really it is only the ratio of Current/Past that we care about estimating; translating it to a percent doesn’t matter. In the above equations, I am writing these as cumulative totals for the whole year. So let’s do breakdowns via subscripts, and shorten Current and Past to C and P respectively. So say we have data through January, people typically estimate the YTD percent change then as:

(C_January - P_January)/P_January = % Change January

To make it easier, I am going to write e subscript for early, and l subscript for later. So if we then estimate YTD for February, we then have C_January + C_February = C_e. Also note that C_e + C_l = Current, the early observed values plus the later unobserved values equals the year totals. This identifies a clear error when people use only subsets of the data to do YTD year end projections (what both Jerry and Jeff did in their posts to argue against early YTD estimates). You should not just use P_e in your estimate, you should use the full prior year counts.

Let’s go back to our year end estimate, writing in early/later form:

[C_e + C_l - (P_e + P_l)]/(P_e + P_l) = % Change

This only has one unknown in the equation – C_l, the unknown rest of year projection. You should not use (C_e - P_e)/P_e, as this introduces several stochastic elements where none are needed. P_e is not necessarily a good estimate of P_e + P_l. So let’s do a simple example, imagine we had homicide totals:

     Past Current
Jan    2     1
Feb    0      
Mar    1      
Apr    1      
May    1      
Jun    1      
Jul    1      
Aug    1      
Sep    1      
Oct    1      
Nov    1      
Dec    1      
Tot   12

With the naive way of doing YTD estimates, we would say our January YTD estimate is (1 - 2)/2 = -50%. Whereas I am saying you should use (1 + C_l - 12)/12 – filling in whatever value you project for the rest of the year total C_l. Simple ones you can do in a spreadsheet are ‘no change’, just fill in the prior year, which here would be C_l = 10, and would give a YTD percent change estimate of (11 - 12)/12 ~ -8%. Or another simple one is extrapolate, which projects the full year total as C_e*(1/year_proportion) = 1*12 = 12 (so C_l = 11), giving (12 - 12)/12 = 0%. (You would really want to fit a model with seasonal and trend components and project out the remaining part of the year, which will often be somewhere between these two simpler methods.)
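The three estimates for the homicide example can be checked with a short sketch. The function and variable names here are my own shorthand for this post, not from the replication code:

```python
def ytd_pct_changes(past, current):
    """Year end percent change estimates from partial current year data.

    past:    12 monthly counts for the prior year
    current: monthly counts observed so far this year
    """
    n = len(current)
    c_e = sum(current)        # early (observed) current counts
    p_e = sum(past[:n])       # prior year counts over the same months
    p_total = sum(past)       # full prior year total
    p_l = p_total - p_e       # prior year counts for the remaining months

    naive = 100 * (c_e - p_e) / p_e            # undefined if p_e is 0
    running = 100 * (c_e + p_l - p_total) / p_total
    projected_total = c_e * 12 / n             # extrapolate observed months
    extrapolate = 100 * (projected_total - p_total) / p_total
    return {"naive": naive, "running": running, "extrapolate": extrapolate}

past = [2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
current = [1]  # only January observed

est = ytd_pct_changes(past, current)
print({k: round(v, 1) for k, v in est.items()})
```

Note the naive version divides by P_e alone, which is what makes it blow up for low counts early in the year.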

So far this is just theoretical “should be a better estimator” – let’s show with some actual data. Python code to replicate here, but I took open data from Cary, NC, which goes back to 2000, so we have a sample of 22 years. Estimates of the error, broken down by month and version, are below. The naive estimate is how it is typically done (equivalent to Jeff/Jerry’s blog posts), the running estimate takes the prior year to fill in C_l, and extrapolate uses the current months to fill in. The error metrics are | (estimated % change) - (actual year end % change) |, and the stats show the mean (standard deviation) of the sample (n=22). Here are the metrics for larceny, which average 123 per month over the sample:

       Naive   Running  Extrapolate
Jan   12 (7)    6 (4)     10 (7)
Feb    8 (6)    6 (4)     11 (7)
Mar    9 (6)    5 (3)      8 (6)
Apr    9 (7)    5 (3)      8 (5)
May    7 (6)    5 (3)      6 (4)
Jun    6 (4)    4 (3)      4 (3)
Jul    5 (3)    4 (3)      4 (3)
Aug    4 (3)    3 (2)      3 (2)
Sep    3 (2)    3 (2)      2 (2)
Oct    2 (1)    2 (1)      2 (1)
Nov    1 (1)    1 (1)      1 (1)
Dec    0 (0)    0 (0)      0 (0)

And here are the metrics for burglary, which average 28 per month over the sample. Although these have higher error metrics (due to lower/more volatile baseline counts), my estimator is still better than the naive one for the majority of the year.

       Naive   Running  Extrapolate
Jan   34 (25)   12 (8)    24 (23)
Feb   15 (14)   11 (7)    16 (13)
Mar   15 (14)   12 (7)    15 (11)
Apr   15 (11)   10 (7)    13 ( 8)
May   14 (10)   10 (7)    10 ( 7)
Jun   11 ( 8)   10 (7)     8 ( 6)
Jul    9 ( 7)    9 (7)     7 ( 5)
Aug    7 ( 5)    8 (5)     6 ( 3)
Sep    6 ( 4)    6 (5)     4 ( 3)
Oct    6 ( 4)    5 (4)     3 ( 3)
Nov    3 ( 3)    3 (3)     2 ( 2)
Dec    0 ( 0)    0 (0)     0 ( 0)

Running tends to do better for earlier in the year (and for smaller N samples). Both the running and extrapolate estimates are closer to the true year end percent change compared to the naive estimate in around 70% of the observations in this sample. (And tends to be even more pronounced in the smaller crime count categories, closer to 80% to 90% of the time better.)

In Jerry’s and Jeff’s posts, they use a metric of +/- 5 to say “it is close” – in my tables this corresponds to absolute errors in the range of 5 percentage points. You meet that criterion on average in this sample for my estimator in March for Larcenies (running) and September for Burglaries (extrapolate).

To be clear though, even with the more accurate projections, you should not use this estimate.

What do police departments want?

So Jeff may literally want an end-of-year projection for when he writes a Times article – similar to how a government might give a year end projection for GDP growth. But this is not what most police departments want when they calculate YTD metrics. So saying in turn “you shouldn’t use YTD because the error is high” to me misses the boat a bit. I can give a metric that has lower error rates, but you still shouldn’t use YTD percent change.

What police departments want to examine is the more general question “are my numbers high?” – you can further parse this into “are my numbers high consistently over the past date range” (of which the past year is just a convenient demarcation) or “are my numbers anomalously high right now”. The former is asking about long term trends, and the latter is asking about short term increases. Part of why I don’t like YTD is that it masks these two metrics – a spike early in the year can look like a perpetual long term upward trend later in the year.

I have training material showing off two different types of charts I like to use in lieu of YTD metrics. These can identify anomalous short term and long term trends. Here is an example weekly chart showing trends (in black line) and short term spikes (outside the error intervals):

So this is an uber nerd post – I hope it has general lessons though. One is that if you need to estimate Y, and you can write Y as a function of other variables, some that are variable and some that are not, e.g. Y = f(x1,c), then maybe you should just focus on estimating x1 in this scenario, not model Y directly.

In terms of more general statistical modeling of crime trends, I have debated in the past examining more thoroughly seasonal-trend decomposition techniques, but I think the examples I give above are quite sufficient for most analysis (and can be implemented in a spreadsheet).

Ask me anything: Advice for learning statistics?

For a bit of background, Loki, a computer science student in India, was asking me about my solution to the DrivenData algae bloom competition. Much of our back and forth was specific to my coding solution and “how I knew how to do that” (in particular I used a machine learning variant of doubly robust estimation in part of the solution, which I am sure others have used before but is not real common that I see, it is more often “causal inference” motivated). As for more general advice in learning, I said:

Only advice is to learn stats – not just for competitions but for real-world jobs. Many people are just copy-pasting code, and don’t know what they are doing. Understanding selection bias is important in many real-world scenarios. Often times it is just knowing a little about the scientific scenario you are modeling, and correctly formulating a model.

In response Loki asks:

I decided to take your suggestion and strengthen my grasp on statistics. I consider myself somewhere between beginner to intermediate in stats. I came across several resources on the internet, but feel confused about what to go with. I am wondering if “The Elements of Statistical Learning” by Trevor Hastie and Robert Tibshirani is a good one to start with. Or could you please suggest any books/lectures/courses that have practical applications to solidify my understanding of statistics that you have personally read or liked?

Which I think is a good piece to expand to the readers on my blog in general. Here is my response:

I would not start with that book. It is a mistake to start with too advanced of material. (I don’t learn anything that way anyway.)

Starting from the basics, no joke Gonick’s Cartoon Guide to Statistics is in my opinion the best intro to statistics and probability book. After that, it is important to understand causality – like really understand it – selection bias lurks everywhere. (I am not sure I have great advice for books that focus on causality, Pearl’s book is quite tough, maybe Shadish, Cook, Campbell Experimental and Quasi-Experimental Designs and/or Mostly Harmless Econometrics).

After that, follow questions on https://stats.stackexchange.com, it is high quality on average (many internet sources, like Medium articles or https://datascience.stackexchange.com, are very low quality on average – they can have gems but more often than not they are bad for anything besides copy/pasting code). Andrew Gelman’s blog is another good source for contemporary discussion around stats/research/pitfalls, https://statmodeling.stat.columbia.edu.

In terms of more advanced modeling, after having the basics down, I would suggest Harrell’s Regression Modeling Strategies before the Hastie book. You can interpret pretty much all of machine learning in terms of regression models. For small datasets, understanding how to do simpler regression modeling the right way is the best approach.

When moving onto machine learning, then maybe the Hastie book is a good resource (I didn’t find it all that much useful at this point beyond web resources). Statquest videos are very good walkthroughs of more complicated ML algorithms, https://www.youtube.com/@statquest, trees/boosting/neural-networks.

This is a hodge-podge – I don’t tend to learn things just to learn them – I have a specific project in mind and try to tackle that project the best I can. Many of these resources are items I picked up along the way (Gonick I got to review intro stats books for teaching, Harrell’s I picked up to learn a bit more about non-linear modeling with splines, Statquest I reviewed when interviewing for data science positions).

It is a long road to get to where I am. It was not via picking a book and doing intense study, it was a combination of applied projects and learning new things over time. I learned a crazy lot from the Cross Validated site when I was in grad school. (For those interested in optimization, the Operations Research site is also very high quality.) That was more broad learning though – seeing how people tackled problems in different domains.

ASEBP blog posts, and auto screenshotting websites

I wanted to give an update here on the Criminal Justician series of blogs I have posted on the American Society of Evidence Based Policing (ASEBP) website. These include:

  • Denver’s STAR Program and Disorder Crime Reductions
    • Assessing whether Denver’s STAR alternative mental health responders can be expected to decrease a large number of low-level disorder crimes.
  • Violent crime interventions that are worth it
    • Two well-vetted methods – hot spots policing and focused deterrence – are worth the cost for police to implement to reduce violent crime.
  • Evidence Based Oversight on Police Use of Force
    • Collecting data in conjunction with clear administrative policies has strong evidence it overall reduces officer use of force.
  • We don’t know what causes widespread crime trends
    • While we can identify whether crime is rising or falling, retrospectively identifying what caused those ups and downs is much more difficult.
  • I think scoop and run is a good idea
    • Keeping your options open is typically better than restricting them. Police should have the option to take gun shot wound victims directly to the emergency room when appropriate.
  • One (well done) intervention is likely better than many
    • Piling on multiple interventions at once makes it impossible to tell if a single component is working, and is likely to have diminishing returns.

Going forward I will do a snippet on here, and refer folks to the ASEBP website. You need to sign up to be able to read that content – but it is an organization that is worth joining (besides for just reading my takes on science around policing topics).

So my CRIME De-Coder LLC has a focus on the merger of data science and policing. But I have a bit of wider potential application. Besides statistical analysis in different subject areas, one application I think will be of wider interest to public and private sector agencies is my experience in process automation. These often look like boring things – automating generating a report, sending an email, updating a dashboard, etc. But they can take substantial human labor, and automating also has the added benefit of making a process more robust.

As an example, I needed to submit my website as a PDF file to obtain a copyright. To do this, you need to take screenshots of your website and all its subsequent pages. Googling on this for selenium and python, the majority of the current solutions are out of date (due to changes in the Chrome driver in selenium over time). So here is the solution I scripted up the morning I wanted to submit the copyright – it took about 2 hours total in debugging. Note that this produces real screenshots of the website, not the print to pdf (which looks different).

It is short enough for me to just post the entire script here in a blog post:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
from PIL import Image
import os

home = 'https://crimede-coder.com/'

url_list = [home,
            home + 'about',
            home + 'blog',
            home + 'contact',
            home + 'services/ProgramAnalysis',
            home + 'services/PredictiveAnalytics',
            home + 'services/ProcessAutomation',
            home + 'services/WorkloadAnalysis',
            home + 'services/CrimeAnalysisTraining',
            home + 'services/CivilLitigation',
            home + 'blogposts/2023/ServicesComparisons']

res_png = []

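def save_screenshot(driver, url, path, width):
    """Save a screenshot of the full page at `url` to `path`."""
    # Ref: https://stackoverflow.com/a/52572919/
    driver.get(url)
    time.sleep(1)  # give the page a moment to render
    original_size = driver.get_window_size()
    #required_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    required_width = width
    required_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    # Resize the window to the full page height so the whole page is captured
    driver.set_window_size(required_width, required_height)
    #driver.save_screenshot(path)  # has scrollbar
    driver.find_element(By.TAG_NAME, 'body').screenshot(path)  # avoids scrollbar
    driver.set_window_size(original_size['width'], original_size['height'])

options = Options()
# Newer selenium releases dropped options.headless, so pass the flag directly
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)

page_width = 1400  # assumed width in pixels, adjust for your site

for url in url_list:
    if url == home:
        name = "index.png"
    else:
        res_url = url.replace(home,"").replace("/","_")
        name = res_url + ".png"
    save_screenshot(driver, url, name, page_width)
    res_png.append(name)

driver.quit()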

# Now appending to PDF file
images = [Image.open(f).convert('RGB') for f in res_png if f[-3:] == 'png']
i1 = images.pop(0)
i1.save(r'Website.pdf', save_all=True, append_images=images)

# Now removing old PNG files
for f in res_png:
    os.remove(f)

One of the reasons I want to expand knowledge of coding practices into policing (as well as other public sector fields) is that something this simple doesn’t make sense for me to package up and try to monetize. The IP involved in a 2 hour script is not worth that much. I realize most police departments won’t be able to take the code above and actually use it, so it is better for your agency to simply do a small contract with me to help you automate the boring stuff.

I believe this is in large part a better path forward for many public sector agencies, as opposed to buying very expensive Software-as-a-Service solutions. It is better to have a consultant provide a custom solution for your specific agency than to spend money on some big tool and hope your specific problems fit their mold.

An alt take on opioid treatment coverage in North Carolina

The Raleigh News & Observer has been running multiple stories on the recent Medicaid expansion in North Carolina, with one recently about expanded opioid treatment coverage. Kaden Call and I have worked in the past on developing an algorithm to identify underprovided areas (see background blog post, and Kaden’s work at Gainwell while an intern).

I figured I would run our algorithm through to see what North Carolina looks like. So here is an interactive map, with the top 10 zipcodes that have need for service (in red polygons), and CMS certified opioid treatment providers (in blue pins). (Below is a static image)

My initial impression was that this did not really jibe with the quotes in the News & Observer article that suggested NC was a notorious service desert, since there are quite a few treatment providers across the state. So the cited Rural HealthInfo source disagrees with what my map shows. I cannot find their definition offhand, but I am assuming this is due to only counting in-patient treatment providers, whereas my list of CMS certified providers is mostly out-patient.

So although my algorithm identified various areas in the state that likely could use expanded services, this raises the question of whether NC is really a service desert. It hinges on whether you think people need in-patient or out-patient treatment. Just a quick sampling of those providers, maybe half say they only take private, so it is possible (although not certain) that the recent Medicaid expansion will open up many treatment options to people who are dependent on opioids.

SAMHSA estimates that of those who get opioid treatment, around 5% get in-patient services. So maybe in the areas of high need I identify there is enough demand to justify opening new in-patient service centers. It is close though, and I am not sure the demand justifies opening more in-patient centers (as opposed to making it easier to access out-patient services).

Asking folks with a medical background at work, it seems out-patient has proven to be as effective as in-patient, and that the biggest hurdle is to get people on buprenorphine/methadone/naltrexone (which the out-patient can do). So I am not as pessimistic as many of the health experts that are quoted in the News & Observer article.

The serenity prayer and being a senior developer

The serenity prayer, for those who don’t know it, is:

God, grant me the serenity to accept the things I cannot change, courage to change the things I can, and wisdom to know the difference.

I think this is an important concept that distinguishes good senior developers from junior developers (or data scientists, or crime analysts, the title doesn’t really matter).

Many very green junior developers tend to err on the ‘I cannot change anything’ side. Or put another way, they are told ‘we are going to do XYZ’, and instead of saying ‘we don’t need to do Y, we can just do XZ’ they just go with the flow and do what others tell them to do. For a more concrete example, for close to every project at my workplace that uses Hadoop, it is probably unnecessary. Groups often come in and say ‘we need to go from DatabaseX -> Hadoop -> Machine Learning Model -> DatabaseY’. So people go down this path, even though you could just process the data in smaller chunks that fit in memory and cut out Hadoop entirely.
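As a rough illustration of the chunking idea, here is a minimal sketch that scores rows in fixed-size batches, so only one batch needs to sit in memory at a time. The `read_rows` and `score_row` functions are hypothetical stand-ins for a database cursor and a model's predict step, not anything specific to my workplace.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def read_rows(n):
    # Hypothetical stand-in for a cursor streaming rows from DatabaseX
    for i in range(n):
        yield {"id": i, "amount": float(i)}

def score_row(row):
    # Hypothetical stand-in for a machine learning model's predict step
    return row["amount"] * 0.1

def score_in_chunks(rows, size=1000):
    """Score rows batch by batch and return how many rows were processed."""
    total = 0
    for batch in chunked(rows, size):
        results = [(r["id"], score_row(r)) for r in batch]
        # In practice, write `results` to DatabaseY here, then discard them
        total += len(results)
    return total

n_scored = score_in_chunks(read_rows(2500), size=1000)
print(n_scored)
```

Because `read_rows` is a generator, the full table is never materialized. The batch size is the only knob you need to tune to stay within memory.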

Another common data science one I come across is ‘the business wants a ranking of priority claims that places them into bins of 1/2/3’. Instead of making a proper utility-derived decision rule, the data scientist gives the business what they ask for, using ad-hoc and clearly suboptimal rules to make the bins. It is similar to the XY problem: juniors just need to recognize they have agency to go back to the business partners and say ‘we should actually do it like this instead’.
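To make ‘utility derived’ a bit more concrete, here is a minimal sketch. It assumes each claim has a predicted probability of paying off, a dollar amount at stake, and a fixed cost to review. Those inputs, the cost figure, and the function names are all made up for illustration. The rule is to review a claim only when its expected return exceeds the review cost, and to rank by expected net value.

```python
def expected_net_value(prob, amount, review_cost):
    """Expected dollars gained from reviewing a claim, net of the review cost."""
    return prob * amount - review_cost

def rank_claims(claims, review_cost):
    """Keep claims with positive expected net value, sorted best first."""
    scored = [(c["id"], expected_net_value(c["prob"], c["amount"], review_cost))
              for c in claims]
    worth_it = [s for s in scored if s[1] > 0]
    return sorted(worth_it, key=lambda s: s[1], reverse=True)

# Made-up example claims
claims = [
    {"id": "A", "prob": 0.9, "amount": 100.0},   # likely, but small
    {"id": "B", "prob": 0.2, "amount": 5000.0},  # unlikely, but large
    {"id": "C", "prob": 0.5, "amount": 150.0},   # middling on both
]

ranked = rank_claims(claims, review_cost=80.0)
print(ranked)
```

Binning on probability alone would put claim A at the top, whereas weighting by the amount at stake puts claim B first and drops claim C entirely, since reviewing it costs more than it is expected to return.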

For a crime analysis example, when I worked at Troy PD and implemented these weekly metrics, the Chief at the time asked me to remove the error bars on the weekly forecasts. I simply explained to him that I used those to tell if a recent uptick was anomalous (if it is inside the bars, it is what we would expect), and he said ‘OK, I understand now why you do that’. On occasion I do things I would not prefer because a higher up asks, but in data science roles you should push back to nudge people (who often do not have as much expertise as you) toward the right metrics. It takes courage, as the prayer goes.
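As a sketch of how error bars can flag an anomalous week, the snippet below builds a prediction interval around a forecasted count using a Poisson distribution. The Poisson assumption, the 99% coverage, and the example numbers are illustrative choices here, not necessarily the exact method behind the Troy PD metrics.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) computed from P(X = i - 1)
        total += term
    return total

def poisson_interval(mean, coverage=0.99):
    """Return (low, high) counts covering at least `coverage` probability."""
    tail = (1 - coverage) / 2
    low = 0
    while poisson_cdf(low, mean) < tail:
        low += 1
    high = low
    while poisson_cdf(high, mean) < 1 - tail:
        high += 1
    return low, high

def is_anomalous(observed, forecast, coverage=0.99):
    """True when the observed count falls outside the error bars."""
    low, high = poisson_interval(forecast, coverage)
    return observed < low or observed > high

forecast = 10.0  # made-up weekly forecast
low, high = poisson_interval(forecast)
print(low, high)
print(is_anomalous(12, forecast), is_anomalous(25, forecast))
```

With a forecast of 10 crimes, a week with 12 falls inside the bars and is what we would expect, while a week with 25 falls outside and is worth a closer look. The same check is easy to turn into an automated alert.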

I used the qualifier good senior developer earlier in the post, as I know senior people who fall into the trap of just going with the flow too much as well. But another failure mode for seniors is not being able to ‘accept the things I cannot change’. I have come across this less often, but there are a few people who are very zealous about particular tools/methods (kubernetes, everything needs to be CICD, agile) even when those cannot be made to fit the particular situation. Many of these methods could be fine if they could be applied easily to the project at hand, but if it takes 2 years to develop your kubernetes or CICD pipeline, whereas I can log into a virtual machine, do a one time set up, and be done in a much shorter period of time, you should probably rethink your approach.

Often the developers don’t realize it will take 2 years (or that there are fundamental problems with the approach that make it not feasible). That is why good seniors have the wisdom to know the difference between things they can change and things they cannot.

I am going to be annoying and plug my consulting firm, CRIME De-Coder LLC for a bit here on the blog. So please check my work and get in touch if you or your agency/business have any needs for statistical analysis, process automation, program analysis, predictive analytics, etc.