Using DuckDB WASM + Cloudflare R2 to host and query big data (for almost free)

The motivation here was prompted by a recent question Abigail Haddad had on LinkedIn:

For the machines, the context is hosting a dataset of 150 million rows (in another post Abigail stated it was around 72 gigs), and you want the public to be able to make ad-hoc queries on that data. An example where you may want to do this is a public dashboard (think a city's open data site that just puts all the data on R2 and has a front end).

This is the point where traditional SQL databases for websites probably don't make sense. Databases like Supabase Postgres or MySQL can hold that much data, but given the cost of cloud compute and what they are typically used for, it does not make much sense to load 72 gigs into them and run data-analysis type queries.

Hosting the data as static files in an online bucket, like Cloudflare's R2, and then querying the data makes more sense at that size. To query the data here, I use DuckDB deployed via WASM. What this means is I don't really have to worry about a server at all – it should scale to however many people want to use the service (I am just serving up HTML). The client's machine handles creating the query and displaying the resulting data via javascript, and Cloudflare basically just pushes data around.

If you want to see it in action, you can check out the github repo, or see the demo deployed on github pages to illustrate generating queries. To try a query against my Cloudflare R2 bucket, you can run SELECT * FROM 'https://data-crimedecoder.com/books.parquet' LIMIT 10;:
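If you would rather poke at the data outside the browser, the same query also works from the duckdb Python package. This is only a minimal sketch, assuming a recent duckdb version where the httpfs extension handles reading parquet over https (it is typically loaded automatically, but installing it explicitly does not hurt):

import duckdb

# httpfs lets DuckDB read parquet files over http(s)
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

# the same query as the WASM demo, run locally
res = duckdb.sql("""
    SELECT *
    FROM 'https://data-crimedecoder.com/books.parquet'
    LIMIT 10
""").df()
print(res)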

Cloudflare is nice here, since there are no egress charges (with big data you need to worry about that). You do get charged for different read/write operations, but the free tiers seem quite generous (I do not know exactly how to map these queries to Class B operations in Cloudflare's parlance, but you get 10 million per month and all my tests only generated a few thousand).

Some notes on this set-up. On Cloudflare, to be able to use DuckDB WASM, I needed to expose the R2 bucket via a custom domain. Using the development url did not work (same issue as here). I also set my CORS policy to:

[
  {
    "AllowedOrigins": [
      "*"
    ],
    "AllowedMethods": [
      "GET",
      "HEAD"
    ],
    "AllowedHeaders": [
      "*"
    ],
    "ExposeHeaders": [],
    "MaxAgeSeconds": 3000
  }
]

While my Crime De-Coder site is PHP, all the good stuff happens client-side. So you can see some example demos of the GSU book prices data.

One of the annoying things about this though: with S3 you can partition the files and query multiple partitions at once. Here something like SELECT * FROM read_parquet('https://data-crimedecoder.com/parquet/Semester=*/*') LIMIT 10; does not work, although you can union the partitions together manually (a sketch of that below). So I am not sure if there is a way to set up R2 to work the same way as the S3 example (set up an FTP server? let me know in the comments!).
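For the manual union, one option that should work is passing read_parquet an explicit list of the partition files. The file names below are hypothetical, and this is only a sketch of the workaround, not something R2 expands for you:

import duckdb

# hypothetical partition urls -- R2 will not expand the Semester=* glob,
# so list the individual parquet files yourself
parts = [
    "https://data-crimedecoder.com/parquet/Semester=Fall/part0.parquet",
    "https://data-crimedecoder.com/parquet/Semester=Spring/part0.parquet",
]

# read_parquet accepts a list of files, which behaves like a UNION ALL
# (adding hive_partitioning=true should recover the Semester column from the paths)
query = f"SELECT * FROM read_parquet({parts}) LIMIT 10"
print(duckdb.sql(query).df())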

For pricing, in the scenario Abigail had of 72 gigs of data, we then have:

  • $10 per year for the domain
  • $0.015 per GB-month * 72 gigs * 12 months ≈ $13 for storage of the 72 gigs

So we have a total cost to run this of roughly $23 per year. And it can scale to a crazy number of users and very large datasets out of the box. (My use case here is just the $10 for the domain, since you get 10 gigs of storage for free.)

Since this can be deployed on a static site, there are free options (like github pages). So the page with the SQL query part is essentially free. (I was not sure at first whether you could double dip on the R2 custom domain by just putting the HTML in the bucket – it turns out you can, and it will render like normal.)

While this example only shows generating a table, you can do whatever additional graphics you want client side. So you could make a normal looking dashboard with dropdowns, where those dropdowns just execute various queries and fill in the graphs/tables.

Year in Review 2024

Past year-in-review posts I have made focused on showing blog stats. I am writing this in early December, but total views will likely be down this year – I am projecting around 140,000 views in total for this site. I have over 25k views for the Crime De-Coder site though, so combining the two sites it is pretty much the same as 2023.

I do not have a succinct elevator speech to tell people what I am working on. With the Crime De-Coder consulting gig, it can be quite eclectic. That Tukey quote, that being a statistician you get to play in everyone's backyard, is true. Here is a rundown of the paid work I conducted in the past year.

Evidence Based CompStat: Work with Renee Mitchell and the American Society of Evidence Based Policing on what I call Evidence Based CompStat. This mostly amounts to working directly with police departments (it is more project management than crime analysis) to help them get started with implementing evidence based practices. Reach out if that sounds like something your department would be interested in!

Estimating DV Trends: Work supported by the Council on CJ. I forget the exact timing of events. This was an idea I had for a different topic (to figure out why store and official reports of thefts were so misaligned). Alex approached me to help with measuring national level domestic violence trends, and I pitched this idea (use local NIBRS data and the NCVS to get better local estimates).

Premises Liability: I don’t typically talk about ongoing cases, but you can see a rundown of some of the work I have done in the past. It is mostly using the same stats I used as a crime analyst, but in reference to civil litigation cases.

Patrol Workload Analysis: I would break workload analysis for PDs down into two categories, advanced stats and CALEA reports. I had one PD interested in the simpler CALEA reporting requirement (which I can do for quite a bit cheaper than the other main consulting firm that offers these services).

Kansas City Python Training: Went out to Kansas City for a few days to train their analysts in using python for Focused Deterrence. If you think the agenda in the pic below looks cool, get in touch – I would love to do more of these sessions with PDs. I make the material custom for the PD based on your needs, so if you want "python and ArcGIS", or "predictive models", or whatever, I will modify the material to cover those advanced applications. I have also been pitching the same idea (short courses) to PhD programs. (So many posers in private sector data science, I want more social science PhDs with stronger tech skills!)

Paterson Opioid Outreach: Statistical consulting with Eric Piza and Kevin Wolff on a street outreach intervention intended to reduce opioid overdoses in Paterson, New Jersey. I don't have a paper to share for that at the moment, but I used some of the same synthetic control in python code I developed.

Bookstore prices: Work with Scott Jacques, supported by some internal GSU money. It involves scraping course and bookstore data to identify the courses that students spend the most on textbooks for. The ultimate goal is to either purchase those books as unlimited epubs (to save the students money), or encourage professors to adopt better open source materials. It is a crazy amount of money students pour into textbooks. For several courses at GSU, students cumulatively spend over $100k on course materials per semester. (And since GSU has a large proportion of Pell grant recipients, it means the federal government subsidizes over half of that cost.)

General Statistical Consulting: I do smaller stat consulting contracts on occasion as well. I have an ongoing contract to help with Pam Metzger's group at the SMU Deason center. I did some small work for AH Datalytics on behind the scenes algorithms to identify anomalous reporting for the real time crime index. I have several times in my career consulted on totally different domains as well; this year I had a contract on calculating regression spline curves for some external brain measures.

Data Science Book: And last (that I remember), I published Data Science for Crime Analysis with Python. I still have not hit the 100 sales I would consider a success – so if you have not bought a copy, go do that right now. (Coupon code APWBLOG will get you $10 off for the next few weeks, either the epub or the paperback.)

Sometimes this makes it seem like I am more successful than I am. I have stopped counting the smaller cold pitches I make (I should be more aggressive with folks, but most of this work is people reaching out to me). But in terms of larger grant proposals or RFPs in the past year, I have submitted quite a few (7 in total) and have landed none of them to date! For example, I submitted a big one to NIJ's follow-up survey solicitation, to support the place based surveys that Gio and I won the NIJ competition on, and it was turned down. So it goes.

In addition to the paid work, I still on occasion publish peer reviewed articles. (I need to be careful with my time though.) I published a paper with Kim Rossmo on measuring the buffer zone in journey to crime data. I also published the work on measuring domestic violence supported by the Council on CJ with Alex Piquero.

I took the day gig in Data Science at the end of 2019. Citations are often used as a measure of a scholar's influence on the field – they are crazy slow though.

I had 208 citations by the end of 2019, and I now have over 1300. Of the roughly 1100 citations post academia, only a very small number are from articles I wrote after I left (less than 40 total citations). A handful are for the NIJ recidivism competition paper (with Gio), and a few for this Covid and shootings paper in Buffalo. The rest of the papers that have a post-2019 publishing date were entirely written before I left academia.

Always happy to chat with folks about teaming up on papers, but it is hard to take the time to work on a paper for free if I have other paid work at the moment. One of the things I need to do to grow the business is to get more regular work. So if you have a group (academic, think tank, public sector) that is interested in part time help (or fractional, I guess, is what the cool kids are calling it these days), I would love to chat and see if I could help your group out.

Aoristic analysis, ebooks vs paperback, website footer design, and social media

For a few minor updates, I have created a new Twitter/X account to advertise Crime De-Coder. I do not know if there is some setting where people ignore all unverified accounts, but I would appreciate the follow and reshare if you are still on the platform.

I also have an account on LinkedIn, and sometimes comment on the Crime Analysis Reddit.

I try to share cool data visualizations and technical posts. I know LinkedIn in particular can be full of quite vapid self-help guru type advice, which I avoid. I know being more technical limits the audience, but that is ok. So I appreciate the follow if you are on those platforms, and resharing the work.

Ebooks vs Paperbacks

Part of the reason to start the X account back up is to just try more advertising. I have sold not quite 80 books to date (including pre-sales). My baseline goal was 100.

Excluding pre-sales, I have sold 35% ebooks and 65% paperbacks. So spending some time to distribute your book in paperback seems to me to be worth it.

Again, I feel like for most academics who publish technical books, self-publishing is a very good idea. So read the above linked post about some of the logistics of self-publishing.

Aoristic analysis in python

On the CRIME De-Coder blog, check out my post on Aoristic analysis. It has links to python code on github for those who just want the end result. It has several methods to do hour of day and hour by day of week breakdowns, with the ability to do them by categories in the data and to post hoc generate a few graphs. I like the line graphs the best:

But I can understand why people like the more common heatmap:
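For those unfamiliar, the core idea of aoristic analysis is just spreading each event's weight evenly over the hours of its begin/end window (since the exact time is unknown). The github code linked in the post does the full treatment; below is only a bare-bones sketch of that weighting step, with made-up timestamps:

import pandas as pd

def aoristic_hours(begin, end):
    # spread a total weight of 1 evenly over the hours between begin and end
    hours = pd.date_range(begin.floor("h"), end.floor("h"), freq="h")
    return pd.Series(1 / len(hours), index=hours.hour).groupby(level=0).sum()

# hypothetical burglary reported with an unknown-time window of a few hours
w = aoristic_hours(pd.Timestamp("2024-01-05 22:30"),
                   pd.Timestamp("2024-01-06 01:45"))
print(w)  # 0.25 weight each for hours 22, 23, 0, and 1

Summing those weights over many incidents (optionally within categories) gives the hour of day totals that the line graphs and heatmaps are built from.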

Website Design

I have a few minor website design updates. The homepage is more svelte. My wife suggested that it should be easier to see what I do right when you land on the homepage, so I put the jumbotron box at the bottom and the services tiles (with no pictures) at the top.

It does not look bad on mobile either. (I only recently figured out that Chrome's DevTools has a button to turn on mobile view – very helpful!)

Final part is that I made a footer for my pages:

I am not real happy with this. One of the things you notice when you start doing web design is that everyone's web page looks the same. There are some basic templates for WordPress or Wix (and probably other CMS generators). Here is Supabase's footer for example:

And now that I have shown you, you will see quite a few websites have that design. So I did the svg links to social media, but I may change that. (And maybe go with no footer again; there is not a real obvious need for it.) So open to suggestions!

I intentionally made many of the decisions for the way the Crime De-Coder site looks not only to make it usable but to make it at least somewhat different: avoid super long scrolls, use a sticky header (that still works quite well on phones). The header is quite dense with many sub-pages (I like it though).

I think a lot of the data dashboards public sector agencies are putting out now do not look very nice. Many are just iframed Tableau dashboards. If you want help embedding those data visualizations in a more organic way in your site, that is something Crime De-Coder can help with.

LinkedIn posting and link promotion: impression vs reality

For folks who are interested in following my work, my advice is either email or RSS. On this site you should see 'follow blog via email' and the RSS link on the right hand side. I sometimes post a note here on crimede-coder stuff, but not always, so just do the same (RSS, or use an if-this-then-that service to turn RSS into email) on that site if you want to keep abreast of all my posts.

Another way to follow my work though is on LinkedIn. So feel free to connect with me or follow my content:

I post short form blogs/reactions on occasion (plus share my other posts/work). Promoting your own work on social media is often cringy, but I try to post informative and technical content (and not totally vapid self-help stuff). And I write things for people to view them, so I think it is important to promote my work.

One of the more recent things I have heard a few influencers mention is that they think embedding links directly in LinkedIn posts de-promotes their work. See this discussion on HackerNews, or this person's advice, for two examples.

I had formed a few opinions based on my regular postings over the past year+, but impressions of things over extended periods can often be wrong. So I actually downloaded the data to see! In terms of the claim about links being de-promoted, I don't see that in my data at all – this is a table of impressions broken down by the domain I linked to (for domains with at least 2 posts over the prior year):

I did notice however that two domains – youtube and newsobserver (the Raleigh newspaper) – tend to not have much engagement. So it may be that certain domains are not promoted as much. It is of course possible that particular content was just not popular (I thought my crim observations on the Mark Rober glitterbombs would be more popular, but maybe not). But I think this is a large enough sample to at least give a good hint that those links are not promoted in the same way my other links are. My no-URL posts have slightly less engagement than my posts linking to this blog or the crimede-coder site, so overall the idea that links are penalized does not appear to me to be true without more conditional statements.

Data is important, as again I think impressions can be bad for things that repeatedly happen over a long period of time. So offhand I thought Tue/Thu had less engagement, and stopped posting on those days. What does the data say?

| Day | Avg Impressions | Number Posts |
|-----|-----------------|--------------|
| Sun |           1,860 |           32 |
| Mon |           1,370 |           44 |
| Tue |           1,220 |           35 |
| Wed |           1,273 |           41 |
| Thu |           1,170 |           34 |
| Fri |           1,039 |           39 |
| Sat |           1,602 |           38 |

The data says Sat/Sun have higher impressions, and weekdays are lower. If anything Friday is the low day, not Tuesday/Thursday.

I have had other examples in my career of practitioners in crime analysis or academic circles arguing with me that strike me as similar, in that perceptions (that people strongly believed in) did not align with actual data. So I just don't think something like 'the average of impressions over posts over the past year' is a quantity you can really know from passive observation. Your perceptions are likely to be dominated by a few examples, which may be off the mark. Ditto for knowing how much crime happens at a particular location, or knowing how much different things impact survival rates for gunshots.

It is definitely possible that my small page experience (currently a few over 2700 followers on LinkedIn) is not the same as the large influencers. But without looking at actual data, I don't trust people's instincts on aggregate metrics all that much.

Another meta LinkedIn tip (this one I received from Rob Fornango) is to post tall images, so when people are scrolling your content stays on the screen longer. Here is an example post of Rob's:

It is hard for me to test this though; LinkedIn sometimes expands a link into a bigger preview image and sometimes does not (and sometimes I edit the image it displays as well). And I think after a while they turn them into tiny images as well. Someone tell the folks at LinkedIn to allow us to use markdown!

So I mean I could spend a full time job tinkering, but looking at the data I have at hand I don't plan on changing much. Just posting links to my work, and having an occasional comment as well if I think it will be of interest to more people than myself. Content over micro optimization, that is (since the algorithm could change tomorrow anyway).

One of the things I have debated on is buying adverts to promote my python book. I think they are just on the cusp of a net loss though, given clickthrough rates and the margins on my book. So for example, LinkedIn estimates that if I spend $140 to promote a post, I will get 23-99 clicks. My buy rate on the site is around 5%, so that would generate 1-5 book sales. My margins are not that high on a sale, so I would not make money on that.
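Here is that back-of-the-envelope arithmetic in code form (the per-sale margin is a placeholder assumption, not my actual number):

# LinkedIn's estimate for a $140 promoted post
ad_cost = 140
clicks_low, clicks_high = 23, 99
buy_rate = 0.05          # roughly 5% of site visitors buy
margin_per_sale = 15     # placeholder assumption for per-book margin

for clicks in (clicks_low, clicks_high):
    sales = clicks * buy_rate
    net = sales * margin_per_sale - ad_cost
    print(f"{clicks} clicks -> ~{sales:.1f} sales -> net {net:+.0f} dollars")

Even the optimistic end of the click range does not come close to covering the ad spend.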

I have been wondering how direct adverts for the book on Reddit's learn python forum would go. But I think it would be much the same as LinkedIn (too low of a clickthrough rate to make it worth it). If I do those tests in the future I will write up a blog post on my experience!


On LinkedIn I can only find how to download my stats for the company crimede-coder page, not my personal page. Here is the script I used to convert the LinkedIn short urls back to the original domains I linked, plus the analysis:

'''
Python code to parse the domains from my
crimede-coder linkedin posts
run on 7/24/2024, so only has posts
from that date through the prior year
'''

import requests
import traceback
import pandas as pd
import time
from urllib.parse import urlparse

errors = {}

def get_link(url):
    time.sleep(2)
    try:
        res = requests.get(url)
    except Exception:
        er = traceback.format_exc()
        print(f'Error message is \n\n{er}')
        return ''
    if res.ok:
        # the lnkd.in page embeds the destination in one of the first href attributes
        it = res.text.split()
        it = [i for i in it if i[:4] == 'href']
        if len(it) < 4:  # page did not have the expected structure
            return ''
        rl = it[3]
    else:
        print(f'Not ok, {url}, response: {res.reason}')
        errors[url] = res
        return ''
    return rl[6:].replace('/">','').replace('">','')

# more often than not, linkedin converts the link in the post
# to a lnkd.in short url
def get_refer(txt):
    rs = txt.split()
    rs = [i for i in rs if i[:8] == 'https://']
    if rs:
        url = rs[0]  # if more than one link, only grabs the first
        if url[:15] == 'https://lnkd.in':
            return get_link(url)
        else:
            return url
    else:
        return ''


# data exported from LinkedIn for my Crime De-Coder page; the export only goes back one year
df = pd.read_excel('crime-de-coder_content_1721834275879.xls',sheet_name='All posts',header=1)

# only need to keep a few columns
keep_cols = ['Post title','Post link','Created date','Impressions','Clicks','Likes','Comments','Reposts']
df = df[keep_cols].copy()

df['url'] = df['Post title'].apply(get_refer)

def domain(url):
    if url == '':
        return 'NO URL'
    else:
        pu = urlparse(url)
        return pu.netloc

df['domain'] = df['url'].apply(domain)

# cache results out to file, so we do not need to re-fetch the url info
df.to_csv('ParseInfo.csv',index=False)

# Can aggregate to domain
agg_stats = df.groupby('domain',as_index=False)['Impressions'].describe()
agg_stats.sort_values(by=['count','mean'],ascending=False,ignore_index=True,inplace=True)
count_cols = list(agg_stats)[1:]
agg_stats[count_cols] = agg_stats[count_cols].fillna(0).astype(int)

# This is a nice way to print/view the results in terminal
print('\n\n' + agg_stats.head(22).to_markdown() + '\n\n')

Year in Review 2023: How did CRIME De-Coder do?

In 2023, I published 45 pages on the blog. Cumulative site views were slightly more than last year, a few over 150,000.

I would have had pretty much steady cumulative views compared to last year (site views took a dip in April; the prior year had quite a bit of growth, and I suspect something to do with the way WordPress counts stats changed), but in December my post Forecasts need to have error bars hit the front page of Hackernews. This generated about 12k views for that post over two days. (In 2022 I had just shy of 140k views in total.)

It was very high on the front page (#1) for most of that day. So for folks who want to guesstimate "death by Hackernews" referrals, I would guess that if your site/app can handle 10k requests in an hour you will be ok. WordPress handles this fine by default (my Crime De-Coder Hostinger site is maybe not so good for that; the SLA is 20k requests per day). Also an interesting note: about 10% of people who were referred to the forecast post clicked at least one other page on the site.

So I started CRIME De-Coder in February this year. I have published a few over 30 pages on that site during the year, and have accumulated a total of a few more than 11k site views. This is very similar to the first year of my personal blog, which had around 30 posts and just over 7k total views for the year. The traffic is almost entirely via direct referrals (I share posts on LinkedIn; google searches are just a trickle).

Sometimes people are like "cool, you started your own company", but really I have done that same type of consulting since I was in grad school. I have had a fairly consistent amount of consulting work (around $20k per year) for quite a while. That was people cold asking me for help with mostly statistical analysis.

The reason I started CRIME De-Coder was to be more intentional about it – advertise the work I do, instead of waiting for people to come to me. Doing your own LLC is simple, and it is more a website than anything.

So how much money did I make this year with CRIME De-Coder? Not that much more than $30k. (I don't count the data competitions I won in that metric, just actual commissioned work.) I do have substantially more work lined up for next year already (more on the order of $50k so far, although no doubt some of that will fall through).

I sent out something like 30 soft pitches during the year to people in my extended network (first or strong second degree). I don't know the typical success rate for something like that, but mine was abysmal – I was lucky to even get a "no thanks" email response. These are just ideas like "hey, I could build you an interactive dashboard with your data" or "you paid this group $150k, I would do that same thing for less than $30k".

Having CRIME De-Coder did however increase how often my first degree network asks me for stat analysis. So it was definitely worth spending time doing the website and creating the LLC. Don't ask me for advice though about making pitches for consulting work!

The goal is ultimately to be able to go solo, and do my consulting work as my full time job. It is hard to see that happening though – even if I had 5 times the amount of work lined up, it would still just be short term single projects. I have pitched more consistent retainers, but no one has gone for that. Small police departments, if you are interested in outsourcing crime analysis, let me know – I believe that is the best solution for them. I have also pitched think tanks and CJ programs on hiring me in part time roles. I understand the CJ programs' lack of interest, since I am way more expensive than a typical adjunct, but I am a good deal for other groups. (I mean, I am a good deal for CJ programs as well; part of the value add is supervising students for research, but universities don't value that very highly.)

I will ultimately keep at it – sending email pitches is easy. And I am hoping that as the website gets more organic search referrals, I will be able to break out of my first degree network.

Security issues with sending ChatGPT sensitive data

Part of my job as a data scientist is to be a bridge for lay-people interested in applying artificial intelligence and machine learning to their particular applications. Most quant people with a legit background will snicker at the term “artificial intelligence” – it is a buzzword for sure, but it doesn’t matter really. People have potential applications they need help with, and various statistical and optimization techniques can help.

Given the popularity of ChatGPT and other intelligent chatbots, I figured it would be worthwhile articulating the potential security issues with these technologies in criminal justice and healthcare domains. In particular, you should not send sensitive information in internet chatbot prompts. Examples of this include:

  • a crime analyst inputting incident narratives (that include names) and asking a chatbot to summarize them
  • a clinical coder inputting hospital notes and asking for the relevant billing codes
  • a business analyst inputting text from a set of slides, and asking ChatGPT to edit for grammar

The first two examples should be pretty clear as to why they are sensitive – they contain obviously sensitive and personally identifiable data. The last example is related to intellectual property leakage, which is fuzzier, but as a general piece of advice: if it is not OK to post something publicly for everyone to see on the internet, you should not put it into a prompt. (So crime analysts talking about crime trends is probably OK, since that is already public info, but a business analyst sharing your pitch deck for internal business applications is probably not.)

Why can’t I send ChatGPT sensitive information?

So the way many online APIs work (including ChatGPT) is this:

  1. You go to a website and input information into a webform
  2. That data gets posted to a web endpoint (someone else's computer)
  3. Someone else's computer takes that input and does something with that data
  4. That other computer sends information back to your computer

Here is a diagram of that flow:

So there are two potential attack vectors in this diagram. The first is the arrows sending data to/from OpenAI's computer – someone could potentially intercept that data. This is not really a huge issue as stated, as the data is likely encrypted in transit. The second, and more important issue, is that the red OpenAI computer now has your sensitive data cached in some capacity.

If the red computer becomes compromised it can cause issues. This is not hypothetical; OpenAI has had issues with leaking sensitive information to other users. That is a computer glitch – bad but fixable. It is a risk though that you should be aware of.

A more important issue though: under the licensing I am aware of, they can use your conversations to improve the product. To my current understanding this is very bad, as your conversations can be prompt-leaked to third parties if they update models with your conversations downstream.

This is even worse than, say, Microsoft being able to read your emails – it would be like a third, non-Microsoft party potentially becoming privy to some of your emails. For example, say a crime analyst in Raccoon City inputs crime incident narratives like in my prior example. Then I ask ChatGPT "give me an example crime incident narrative", and it outputs narratives very similar to the ones the Raccoon City crime analyst previously put into ChatGPT. This is a feature under the current licensing, not a bug.

Let me know in the comments if they are offering paid tiers for "don't use my data for training, it is always encrypted, and we can't see it" (I don't know why they do not offer that). They would also need to meet particular HIPAA standards for medical data, and CJIS standards for CJ data, to be in security compliance for these example applications.

Now it is important to discuss other chatbots, who are often just calling OpenAI under the hood. The data flow diagram then looks like this:

It is essentially the same attack vectors, just doubled; now we have two computers instead of one that are a potential vulnerability.

Again the issue here is that now two different parties have your data cached in some capacity (the blue computer and the red computer). We have people making new services all the time now (the blue computers) that are just wrappers on OpenAI. So you could have your data leaked by the blue computer, in addition to the problems with leaking at OpenAI.

The solution is local hosting, but local hosting is hard

OpenAI is to be commended for making a quality product – its very easy to use APIs are what make building wrapper services on top of it so easy (hence the many chatbot apps). From a security standpoint though, you now need to do your due diligence with two (or more) services when using these secondary tools, not just one. There will be malicious apps (so the blue computer is intentionally a bad actor), and there will be cases where the blue computer is compromised (so not intended to be malicious, but the people running the blue computer messed up).

Given that OpenAI, as far as I am aware, doesn't have the necessary licensing to prevent info leakage, as well as the more specific security clearances, the solution like I said is to self host a model. Self hosting here means that instead of sending data to the red OpenAI computer, the flow stays entirely on the single black computer you own, or you have your own server (a second black computer that speaks to the first black computer).

There are open source and freemium models that are reasonable competitors. But it is painful to self host these models. For neophytes, the way these language models work is that they take your text input and turn the text into a set of thousands of numbers. They then feed those thousands of numbers into a model with billions of parameters to get the final output. You can think of it as doing several billion mathematical operations that you individually could do on your hand-held calculator.
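To make the self hosting idea concrete, here is a minimal sketch using the open source Hugging Face transformers library with a small open model (the model choice is just for illustration; a chatbot-quality model is much larger, which is exactly where the memory/GPU pain comes in). The point is that the prompt never leaves your machine:

from transformers import pipeline

# downloads a small open model once, then everything runs locally --
# no prompt text is sent to a third-party API
generator = pipeline("text-generation", model="gpt2")

prompt = "Summarize the following incident narrative:"
out = generator(prompt, max_new_tokens=40, num_return_sequences=1)
print(out[0]["generated_text"])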

Doing those billions of operations takes a computer with a large amount of memory and a GPU to finish in anything less than hours. So self hosting a smaller batch process is maybe doable for a normal person or business, but a live chatbot for even one person is hard (let alone a chatbot for multiple people to use at the same time).

Several large companies (including OpenAI) are currently using up the majority of the cloud infrastructure with machines that can host and run these models. So even if you have the money to pay AWS for one of their large GPU computers (it is expensive, think 5 digit costs per month), you maybe can't even get a slot for one of those cloud resources. And it is questionable how many people can even use that single machine.

I think eventually OpenAI will solve some of these security issues, and offer special paid tiers to accommodate use cases in healthcare and CJ. But until that happens, please do not post sensitive data into ChatGPT.

Updates on CRIME De-Coder and ASEBP

So I have various updates on my CRIME De-Coder consulting site, as well as new posts on the American Society of Evidence Based Policing Criminal Justician series.

CRIME De-Coder

Blog post, Don't use percent change for crime data, use this stat instead. I have written a bunch here about using Poisson Z-scores, so if you are reading this it is probably old news. Do us all a favor: in your Compstat reports, drop the ridiculous percent change metrics with low baselines, and use 2 * ( sqrt(Current) - sqrt(Past) ).
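If you want that stat in code form, it is a one-liner (a minimal sketch of the formula above):

from math import sqrt

def poisson_z(current, past):
    # approximate Z-score for a change in Poisson counts
    return 2 * (sqrt(current) - sqrt(past))

# e.g. going from 5 robberies last month to 10 this month is not even a 2-sigma change
print(round(poisson_z(10, 5), 2))  # ~1.85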

Blog post, Dashboards should be up to date. I will have a more techy blog post here on my love/hate relationship with dashboards (most of the time static reports are a better solution). One scenario where they do make sense is public facing dashboards, but those should be up to date. The "free" versions of popular tools (Tableau, PowerBI) don't let you link to a source dataset and get auto-updated, so you see many old, out of date dashboards online. If you contract with me, I can automate it so the dashboard stays up to date and doesn't rely on an analyst manually updating the information.

Demo page – that page currently includes demonstrations for:

The WDD Tool is pure javascript – I am picking up more of that slowly (the Folium map has a few javascript listener hacks to get it to look the way I want). As a reference for web development, I like Jon Duckett's three books (HTML, javascript, PHP).

Ultimately there is too much stuff to learn, but on the agenda is figuring out google cloud compute + cloud databases a bit more thoroughly. Then maybe adding some PHP to my CRIME De-Coder site (a nicer contact me form, an auto-updated sitemap, and an rss feed). I also want to learn how to make ArcGIS dashboards.

Criminal Justician

The newest post is Situational crime prevention and offender planning – discussing one of my favorite examples of crime prevention through environmental design (on suicide prevention) and how it is a signal about offender behavior that goes beyond simplistic impulsive behavior. I then relate this back to current discussion of preventing mass shootings.

If you have ideas about potential posts for the society (or this blog, or the crime de-coder blog), always feel free to make a pitch.

Crime De-Coder LLC Website

So I have created CRIME De-Coder LLC, a firm to do my consulting work with police departments. Check out my website, crimede-coder.com.

Feedback is welcome. In particular check out the services pages, and my first blog post on what distinguishes my services from most firms. Providing computer code to generate the end product is "teaching a man to fish", whereas most firms just drop a final report and leave.

And of course feel free to reach out to consult@crimede-coder.com if you are interested in pursuing a project. Going forward I plan on making a new post around once a month, so sign up in your feed reader or using a service like IFTTT.


Setting up a stand alone website is not that hard in the end. Currently it is a static site with some custom javascript (hosted on Hostinger). I should set up a PHP server for the new blog posts and RSS feed eventually, but for now this is fine. For those interested in doing the same, I suggest getting the Jon Duckett books (HTML/Javascript/PHP) for an overview of the tech, and then checking out Dani Kross's youtube tutorials (for random things like editing the htaccess file).

I am not doing a newsletter for the blog-posts, as I am concerned it will get my email on random block lists. But if there is demand for it in the future I will figure out some other service I guess to do that.

I wanted a more bare-metal setup (not a hosted wordpress like this site), as in the future I will likely do demos of dashboards, host some pyscript, make a sign in for paid content, etc. I just wanted flexibility from the start. So stay tuned for more content from CRIME De-Coder!

Surpassed 100k views in 2022

For the first time, yearly view counts have surpassed 100,000 for my blog.

I typically get a bump of (at best) a few hundred views when I first post a blog. But the most popular posts are all old ones, and I get the majority of my traffic via google searches.

Around March this year, monthly views bumped up from around 9k to 11k. I am not sure of the reason (it is unlikely due to any specific individual post – as you can see, none of the most popular posts were posted this year). A significant number of the views are likely bots (what percent overall though I have no clue). So it is possible my blog was scooped up in some other aggregators/scrapers around that time (though I would think those would not be counted as search engine referrals).

One interesting source for the blog: when doing academic style posts with citations, my blog gets picked up by google scholar (see here for example). It is not a big source, but it likely brings a more academic type crowd to the blog (I can tell people have google scholar alerts – when scholar indexes a post I get a handful of referrals).

I have some news coming soon about writing a more regular criminal justice column for an organization (readers will have to wait a little over a week). But I also do Ask Me Anything, so always feel free to send me an email or comment on here. (I started AMA because I get a trickle of tech questions via email anyway, and might as well share my responses with everyone.)

I typically just blog generally about things I am working on. So maybe next up is how auto-ml libraries often have terrible defaults for hypertuning random forests, or maybe an example of data envelopment analysis, or quantile regression for analyzing response times, or monitoring censored data – all random things I have been thinking about recently. But no guarantees about any of those topics in particular!

A bunch of random shout outs

Busy, busy, busy! Hopefully I will have some time in the near future to write up some more data science posts. But for now, here is a small python snippet to help you build interaction variables between two sets of numpy arrays/dataframes.

import numpy as np
def np_int(a,b):
    # rows stay the same; columns become every pairwise product of a's and b's columns
    rows = a.shape[0]
    cols = a.shape[1]*b.shape[1]
    # row-wise outer product, then flatten the last two dimensions
    return np.einsum('ij,ik->ijk', a, b).reshape((rows,cols))

This works for pytorch as well (just replace np.einsum with torch.einsum). So coming up (eventually) I will illustrate encoding interactions between hidden layers in a deep learning model. But for now some quicker updates.
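A quick check of what np_int returns, using tiny made-up arrays:

# two toy design matrices with the same number of rows
a = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns
b = np.arange(9).reshape(3, 3)   # 3 rows, 3 columns

inter = np_int(a, b)
print(inter.shape)   # (3, 6) -- each column of a multiplied by each column of b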

Shout out #1: Scott Jacques has continued to push the charge for open access to criminology journals. He has two recent posts about post-prints, and how our main journal (Criminology) has an excessive policy of not allowing authors to post post-prints for over two years (whereas the majority of criminology journals allow you to post immediately).

Several aspects of open science are tricky – posting pre-prints/post-prints is not. If we can come together as a group this is an easy, no cost way to greatly improve the accessibility of our work to the greater public.

Shout out #2: The folks at Police Rewired have hosted a hackathon intended to Hack Hate. It is too late to participate, but they will be displaying the results this Sunday. I have not had the chance to participate in any code hackathons; I will need to make a concerted effort in the future to give at least one a shot. (It seems hard – how can you do any work in only a day or a week or two!? But the proof is in the pudding so to speak, and I have seen some pretty cool things come out of various hackathons in the past.)

Shout out #3: My workplace, HMS, is involved in a data sharing collaborative called the Digital Health DRC. They also have a hackathon coming up, but this is related to Telehealth use. The Digital Health DRC is pretty cool though, it is basically a way for HMS (and several other private sector entities) to share various datasets with researchers over the globe.

The scope of HMS’s data is somewhat outside the realm of my old stomping grounds of criminology (but not entirely, a big part of my job is identifying potentially fraudulent patterns in claims data). But for folks who have a research question that could be answered using health insurance claims data, this is a good resource to look into. (HMS has pretty good coverage of Medicare claims across the US.)

Finally, I experimented for a few days with hosting ads on the site. I managed to serve up a few thousand and make 10 cents. So I will turn that off for now. I debated putting up a button for folks to donate a coffee, but even that is not necessary. (I can afford the few bucks for the domain, and I use dropbox to back up my files anyway, so hosting extra materials is not a big deal.) I would rather folks just take my nerdy notes and make your own cool stuff (and share it with me!) I may need to figure out a better hosting solution for images though – google photos is continuing to give me trouble I see (so if you see an image not coming through, feel free to let me know in the comments or send me an email).