Identifying excess rounding

Given the hubbub about Blue Cross and paying for anesthesiologists, there was an interesting paper making the rounds, Comparison of Anesthesia Times and Billing Patterns by Anesthesia Practitioners. (Shared via Crémieux.)

Most medical claims are billed via what are called CPT codes (Current Procedural Terminology). So if you go to the ER, you will get billed for a code between 99281 and 99285. The final digit encodes the level of complexity for the case, with 5’s being more complex. It was news to me that anesthesiologists actually bill for time directly, but the above linked paper showed (pretty plainly) strong evidence that they round up to the nearest 5 minutes.

Now, the paper just selected the anesthesiologists with the highest proportion of billed times ending in multiples of 5. Here I will show a better way to flag specific problematic anesthesiologists (using repeated binomial tests and a false discovery rate correction).

Here I simulate 1,000 doctors and select 40 of them to be bad; those 40 round all of their claims up to the nearest 5 minute mark, whereas the other docs just bill the time as is. Each doctor has a varying total number of claims, between 100 and 600.

import numpy as np
from scipy.stats import gamma,uniform
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import binomtest,false_discovery_control

np.random.seed(10)

docs = 1000
# pick a random set 
bad_docs = np.random.choice(np.arange(docs),40,replace=False)
doci = []
claims = []

for i in range(docs):
    totn = int(np.ceil(uniform.rvs(100,500))) # number of claims
    doci += [i]*totn
    g = gamma.rvs(6,scale=12,size=totn)
    if i in bad_docs:
        g = np.ceil(g/5)*5
    else:
        g = g.round()
    claims += g.tolist()

dat = pd.DataFrame(zip(doci,claims),columns=['Doc','Time'])

# Histogram
fig, ax = plt.subplots(figsize=(8,4))
dat['Time'].hist(bins=range(201),alpha=0.8,color='k',ax=ax)
plt.savefig('Histogram.png',dpi=500,bbox_inches='tight')

You can see my gamma distribution is not as heavy tailed as in the JAMA paper, but it qualitatively has the same spikes. The height of a spike is relative to the surrounding density, so the larger spike at 60 minutes in the JAMA paper suggests more anesthesiologists round at the 60 minute mark than at the other 5 minute intervals.

In this particular example it would be trivial to spot the bad docs, since they round 100% of the time, and you would only expect around 20% by chance (since billing is in one minute intervals, 1 in 5 times will land on a multiple of 5).

dat['Round'] = dat['Time'] % 5 == 0
dat['N'] = 1
gs = dat.groupby('Doc',as_index=False)[['N','Round']].sum()
gs['Perc'] = gs['Round']/gs['N']
gs.sort_values(by='Perc',ascending=False,ignore_index=True,inplace=True)
gs

# Upper Quantiles
np.quantile(gs['Perc'],[0.75,0.8,0.85,0.9,0.95,0.99])

But you can see a problem with using a top 5% quantile cut-off here. Since only 4% of the doctors are bad, that hard cut-off will result in a few false positive flags. My suggested approach is to create a statistical test (Chi-square, binomial, KS, whatever makes sense for individual doctors), run the test for each doctor to get a p-value, and then run a false discovery rate correction on the p-values.

The above scenario where doctors round 100% of the time is too easy, so here I redo the simulation with the 40 bad doctors only rounding up 10% to 30% of their claims (more claims and a higher rounding percentage make the pattern much easier to spot).

# Redoing sim, resetting the lists, but now only a smaller
# percentage of each bad doctor's claims are rounded
doci = []
claims = []
for i in range(docs):
    totn = int(np.ceil(uniform.rvs(100,500))) # number of claims
    doci += [i]*totn
    g = gamma.rvs(6,scale=12,size=totn)
    if i in bad_docs:
        randperc = int(np.round(totn*uniform.rvs(0.1,0.2)))
        badn = np.random.choice(np.arange(totn),randperc,replace=False)
        g[badn] = np.ceil(g[badn]/5)*5
        g = g.round()
    else:
        g = g.round()
    claims += g.tolist()

dat = pd.DataFrame(zip(doci,claims),columns=['Doc','Time'])
dat['Round'] = dat['Time'] % 5 == 0
dat['N'] = 1
gs = dat.groupby('Doc',as_index=False)[['N','Round']].sum()
gs['Perc'] = gs['Round']/gs['N']
gs.sort_values(by='Perc',ascending=False,ignore_index=True,inplace=True)

Now we can apply the binomial test to each doctor, then adjust for false discovery rate.

# Calculating binom test
def bt(x):
    k,n = x.iloc[0],x.iloc[1]
    b = binomtest(k,n,p=0.2,alternative='greater')
    return b.pvalue

gs['p'] = gs[['Round','N']].apply(bt,axis=1)
gs['q'] = false_discovery_control(gs['p'],method='by')

# Captures 28 out of 40 bad docs, no false positives
gs['BadDocs'] = gs['Doc'].isin(bad_docs)
gs[gs['q'] < 0.05]

If you check out the doctors via gs.head(50), you can see that a few of the bad docs were adjusted to have q-values of 1, but they ended up being low N and within the range you would expect by chance.

While anesthesiologist billing is different, this same approach works for CPT codes that have the 1-5 complexity modifier (there you might use a leave-one-out strategy and a Chi-square test). If anesthesiologists know they will be scrutinized for exact 5 minute times, they will likely still round up, just not to such regular numbers. In that case the expected default distribution is uniform over the trailing 0-9 digits (people sometimes use a Benford-like test for trailing digits, which is the same idea). That is harder to fake.
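As a rough sketch of that trailing digit idea (illustrative only, reusing the simulated dat DataFrame and numpy import from above), you could tabulate the 0-9 trailing digits for each doctor, run a Chi-square test against a uniform expectation, and feed those p-values into the same false discovery rate correction:

# Sketch of a trailing digit test, assumes the dat DataFrame from the simulation above
from scipy.stats import chisquare

def digit_test(times):
    # counts of the trailing 0-9 digits for one doctor
    counts = np.bincount((times % 10).astype(int), minlength=10)
    # chi-square test against a uniform expectation over the 10 digits
    return chisquare(counts).pvalue

digit_p = dat.groupby('Doc')['Time'].apply(digit_test)
# these p-values can then go through false_discovery_control, same as before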

I don’t have an issue with Blue Cross saying they will only pay for pre-allotted times. But even without an explicit policy like that, they can identify bad actors and pursue investigations into those specific anesthesiologists without making widespread policy changes.

Question Sets and All Paths

I was nerd-sniped with a question at work the other day. The set up was like this: imagine a survey where all of the questions have yes-no answers. Some of the questions are terminating, so if you answer No to one of those, the survey stops. For the other questions, you keep going regardless of your answer.

The ask was for a formula to give the total number of potential unique answer sets. I was not able to figure that out, but I did write python code to generate all of the unique sets. Here is that function, where I create a networkx directed graph in a particular format and then find all of the potential start to end paths in that graph. You add a dummy Begin node, connect an End node to the terminating answers and to the final question, and then search for all of the paths from Begin to End.

import networkx as nx

def question_paths(totq,term):
    '''
    totq -- positive integer, total number of questions
    term -- list of integers, where ints are questions that terminate
    '''
    nodes = [f'Q{i}{r}' for i in range(totq) for r in ['N','Y']]
    edges = []
    for i in range(totq-1):
        edges.append([f'Q{i}Y',f'Q{i+1}N'])
        edges.append([f'Q{i}Y',f'Q{i+1}Y'])
        if i not in term:
            edges.append([f'Q{i}N',f'Q{i+1}N'])
            edges.append([f'Q{i}N',f'Q{i+1}Y'])
    # adding in begin/end
    nodes += ['Begin','End']
    edges += [['Begin','Q0N'],['Begin','Q0Y']]
    for t in term:
        edges.append([f'Q{t}N','End'])
    edges += [[f'Q{totq-1}N','End'],[f'Q{totq-1}Y','End']]
    # Now making graph
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges)
    # Getting all paths
    paths = []
    for p in nx.all_simple_paths(G,source='Begin',target='End'):
        nicer = [v[-1] for v in p[1:-1]]
        paths.append(nicer)
    return paths

And now we can check out the case where every question is terminating; you only get deeper into the tree if you answer Yes to all prior questions.

>>> question_paths(3,[0,1,2])
[['N'],
 ['Y', 'N'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

So in that scenario we have four potential answer sets. With no terminating questions, we get the usual 2^3 = 8 paths:

>>> question_paths(3,[])
[['N', 'N', 'N'],
 ['N', 'N', 'Y'],
 ['N', 'Y', 'N'],
 ['N', 'Y', 'Y'],
 ['Y', 'N', 'N'],
 ['Y', 'N', 'Y'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

And then you can do a mixture; here only the second question (index 1) terminates, while questions 1 and 3 are fully binary:

>>> question_paths(3,[1])
[['N', 'N'],
 ['N', 'Y', 'N'],
 ['N', 'Y', 'Y'],
 ['Y', 'N'],
 ['Y', 'Y', 'N'],
 ['Y', 'Y', 'Y']]

One of the things doing this exercise taught me is that it matters where the terminating nodes are in the tree. Earlier terminating nodes result in fewer potential paths.

# counts differ depending on where the terminators are
len(question_paths(10,[1,5])) # 274
len(question_paths(10,[6,8])) # 448
len(question_paths(10,[0,1])) # 258
len(question_paths(10,[8,9])) # 768

The best I could figure out in terms of a formula was the bound 2^(non-terminating questions) <= paths <= 2^(questions) (which is not a very good bound!). I was trying to figure out a product formula but was unable to (any suggestions let me know!).

This reminds me of a bit of product advice from Jennifer Pahlka: you should put eliminating questions earlier in the form. So instead of someone filling out 100 questions and being denied for TANF at the end, you ask the questions most likely to eliminate people first.

The same holds for the total complexity of your application. Asking the terminating questions earlier reduces the potential number of permutations you ultimately have in your data as well. Good for the user and good for the developers.

Suits, Money Laundering, and Linear Programming

Currently watching Suits with the family, and an interesting little puzzle came up in the show. In Season 1, episode 8, there is a situation with money laundering from one account to many smaller accounts.

So the set up is something like “transfer out 100 million”, and then you have some number of smaller accounts that sum to 100 million.

When Louis brought this up in the show, and Mike had a stack of papers to go through to identify the smaller accounts that summed to the larger account, my son pointed out it would be too difficult to go through all of the combinations to figure it out. The total number of combinations would be something like:

N = Sum(b(A choose j) for j = 1 to k)

Where A is the total number of accounts to look over, and k is the max number of accounts the money may have been spread across. b(A choose j) is my cheeky text format for binomial coefficients.

So this will be a very large number. 10,000 choose 4, for example, is 416,416,712,497,500 – over 416 trillion. You still need to add in b(10,000 choose 1) + b(10,000 choose 2) + b(10,000 choose 3), and so on up to b(10,000 choose k). Even with a modest max bundle size, the number of potential combinations is enormous.
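You can verify those counts directly in python with the standard library (exact integer arithmetic, no approximation):

from math import comb

print(comb(10000, 4))                            # 416416712497500
print(sum(comb(10000, j) for j in range(1, 5)))  # adding the j = 1, 2, 3 terms as well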

But, I said in response to my son, I could write a linear program to find them. That is what this post shows, but doing so made me realize there are likely too many combinations in real data for this to be a feasible approach in fraud investigations. You will have many random subsets of accounts that add up to the same amount. (I figured a few collisions would happen, but it appears many happen even in random data.)

(Note I have not been involved with any financial fraud examinations in my time as an analyst. I would like to get a CFA time permitting, and if you want to work on projects let me know! So all that said, I cannot speak to whether this is something people actually do in fraud examinations.)

So here I use python and the pulp library (with its default open source CBC solver) to show how to write a linear program (technically a binary integer program) to pick bundles of accounts that add up to the right amount. First I simulate some lognormal account values, and choose 4 accounts at random.

import pulp
import numpy as np

# random values
np.random.seed(10)
simn = 10000
x = np.round(np.exp(np.random.normal(loc=5.8,scale=1.2,size=simn)),2)

# randomly pick 4
slot = np.arange(0,simn)
choice = np.random.choice(slot,size=4,replace=False)
tot = x[choice].sum()
print(tot,choice)

So we are expecting the answer to be accounts 2756, 5623, 5255, and 873, which sum to 1604.51. The linear program itself is pretty simple.

# make a linear program
P = pulp.LpProblem("Bundle",pulp.LpMinimize)
D = pulp.LpVariable.dicts("D",slot,cat='Binary')

# objective selects smallest group
P += pulp.lpSum(D)

# Constraint, equals the total
P += pulp.lpSum(D[i]*x[i] for i in range(simn)) == tot

# Solving with CBC
P.solve()

# Getting the solution
res = []
for i in range(simn):
    if D[i].varValue == 1:
        res.append(i)

In practice you will want to be careful with floating point. (Later I convert the account values to ints; another way is, instead of the equality constraint, to use <= tot + eps and >= tot - eps, where eps = 0.001. But for the CBC solver this increases the time to solve by quite a large amount.)
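That tolerance version is just a sketch like below, swapping the single equality constraint for two inequalities (using the same D, x, tot, and simn objects as above):

# tolerance version of the constraint, in place of the == tot equality
eps = 0.001
P += pulp.lpSum(D[i]*x[i] for i in range(simn)) <= tot + eps
P += pulp.lpSum(D[i]*x[i] for i in range(simn)) >= tot - eps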

You could have an empty objective function (and I don’t know the SAT solvers as well, but I am sure you could use them to find feasible solutions). Here though, the objective makes the solver return the solution with the minimal number of accounts.

So here is the solution; as you can see it does not find our expected four accounts, but two different accounts that also add up to 1604.51.

You can see it solved quite fast though, so maybe we can just enumerate all the potential feasible solutions. I like making a python class for this, which adds “elimination” constraints to prevent the same solution from coming up again.

# Doing elimination to see if we ever get the right set
class Bundle:
    def __init__(self,x,tot,max_bundle=10):
        self.values = (x*100).astype('int')
        self.totn = len(x)
        self.tot = int(round(tot*100))
        self.pool = []
        self.max_bundle = max_bundle
        self.prob_init()
    def prob_init(self):
        P = pulp.LpProblem("Bundle",pulp.LpMinimize)
        D = pulp.LpVariable.dicts("D",list(range(self.totn)),cat='Binary')
        # objective selects smallest group
        P += pulp.lpSum(D)
        # Constraint, equals the total
        P += pulp.lpSum(D[i]*self.values[i] for i in range(self.totn)) == self.tot
        # Max bundle size constraint
        P += pulp.lpSum(D) <= self.max_bundle
        self.prob = P
        self.d = D
    def getsol(self,solver=pulp.PULP_CBC_CMD,solv_kwargs={'msg':False}):
        self.prob.solve(solver(**solv_kwargs))
        if self.prob.sol_status != -1:
            res = []
            for i in range(self.totn):
                if self.d[i].varValue == 1:
                    res.append(i)
            res.sort()
            self.pool.append(res)
            # add in elimination constraints
            self.prob += pulp.lpSum(self.d[r] for r in res) <= len(res)-1
            return res
        else:
            return -1
    def loop_sol(self,n,solver=pulp.PULP_CBC_CMD,solv_kwargs={'msg':False}):
        for i in range(n):
            rf = self.getsol(solver,solv_kwargs)
            if rf != -1:
                print(f'Solution {i}:{rf} sum: {self.values[rf].sum()}')
            if rf == -1:
                print(f'No more solutions at {i}')
                break

And now we can run through and see if the correct solution comes up:

be = Bundle(x=x,tot=tot)
be.loop_sol(n=10)

Uh oh – I do not feel like seeing how many potential two account solutions there are in the data (it is small enough I could just enumerate those directly). So let’s put a constraint in that forces the solution to be four accounts that add up to the correct amount.

# add in constraint it needs to be 4 selected
choice.sort()
be.prob += pulp.lpSum(be.d) == 4
print(f'Should be {choice}')
be.loop_sol(n=10)

And still no dice. I could let this chug away all night and probably come up with the set I expected at some point. But if you have so many false positives, this will not be very useful from an investigative standpoint. So you would ultimately need more constraints.

Let’s say in our example the cumulative total is quite large. Will that help limit the potential feasible solutions?

# Select out a sample of high
sel2 = np.random.choice(x[x > 5000],size=4,replace=False)
tot2 = sel2.sum()
ch2 = np.arange(simn)[np.isin(x,sel2)]
ch2.sort()
print(tot2,ch2)

be2 = Bundle(x=x,tot=tot2)
be2.loop_sol(n=20)

There are still too many potential feasible solutions. I thought maybe a problem with real data would be that you have repeated fixed amounts – say you are looking at transactions, and you have many identical $9.99 transactions. If one of those common amounts is part of a feasible sum, it will just be replicated in the solutions over and over again. I figured that with random continuous data the target sum would be close to unique, but I was wrong about that.

So I was right in that I could write a linear program to find the solution. I was wrong that there would only be one solution!

Types of websites and trade-offs

Some minor updates on different fronts. One, I have a new blog post on Crime De-Coder about how to figure out the proper ODBC connection string to query your records management system. I have done this song and dance with maybe a dozen different PDs at this point (and happened to do it twice in the prior week), so figured a post would make sense.

Two, an article with Kim Rossmo has been published, The journey-to-crime buffer zone: Measurement issues and methodological challenges. You can check out the github repo and CrimRXiv version for free.

The main reason I wanted to make a post today though is to talk about the types of websites. I have seen several influencers discuss using GenAI to create simple apps. These tools are nice, but many people seem to make bad architecture decisions from the start (many people should not be making python served websites). So I will break down a few of the different options for creating a website in this post.

The most basic is a static html site – this just requires you to create the HTML and place it somewhere on a server. A free option is github pages. You can still use javascript apps, but they run client side and there is no real ability to limit who can see the site (e.g. you cannot make it password protected behind a log in). These can handle as much traffic as you want. If you have data objects in the site (such as a dashboard), the objects just need to be stored directly on the server (e.g. in json files, or csv if you want to parse them). You can build data dashboards using D3 or other WASM apps.

With the other types of websites you can do anything you can do with static HTML, so I focus more on what makes them different.

PHP sites – this probably requires you to purchase an online server (ignoring self hosting in your closet). There are many low cost vendors (my Crime De-Coder site is PHP on Hostinger), and they are pretty low price, think $5 per month. These do have the ability to create password protected content and to hide server side functions. My $5 a month plan has a service level agreement to handle 20k requests per day, and it also includes a built in MySQL database that can hold 2 gigs of data. (These cheap sites are not bad if all you want is a smallish database.) WordPress is PHP under the hood (although if you want a custom site, I would just start from scratch and not modify WordPress. WordPress is good if you want a GUI to style the site with basic templates.)

IMO if you need to protect stuff behind a server, and have fairly low traffic requirements, using a PHP site is a very good and cheap option. (For higher traffic I could pay under $20 a month for a beefier machine as well, we are talking crazy site traffic like well over 100k visits per day before you need to worry about it.)

Node.js – this is a server technology, popular for various apps. It is javascript under the hood, but you can have code run server side (so it can be hidden from the end user, same as PHP). The tech to host a site is a bit more involved than the PHP hosting sites. You can either get a VPS (typically for less than $10 a month, which can probably handle close to the same amount of traffic as the cheap PHP plans) and write some code to host it yourself – think of a VPS as renting a machine (so it can be used for various things, not just webhosting) – or use a more off the shelf platform (like FlyIO, which has variable pricing). You typically need to think about separate database hosting as well with these tools though. (I like Supabase.)

Python – python has several website libraries, e.g. django, flask, as well as many different dashboard libraries. Similar to Node, you will need to either host this on a VPS (Hostinger has VPS’s as well, and I know DigitalOcean is popular) or some other service, which is more expensive than the cheaper PHP options. It is possible to have authentication in python apps, but I do not tend to see many examples of that. Most python websites/dashboards I am familiar with are self-hosted, and so intrinsically limit who can see them to the company’s network (e.g. they are not online on a public website for everyone to see).

Personal story: at one point my Dallas crime dashboard (which is WASM + python) on my Crime De-Coder site (which is served on the PHP site, so takes a few extra seconds to install the libraries) was broken due to different library upgrades. So I hosted the python panel dashboard on Google cloud app while I worked on fixing the WASM version. I needed one step up from the smallest machine due to RAM usage (maybe 2 gigs of RAM). Google cloud app was slower than WASM on start up, sometimes would fail, and cost more than $100 per month with very little traffic. I was glad I was able to get the WASM version fixed.

Dallas Crime Dashboard

It is all about trade-offs in the architecture though. With the WASM app you can right click and view the source to see how I wrote the code. Even though it is on a PHP site, it is rendered client side, so there is no way to protect that content from someone seeing it. So imagine I wanted you to pay $5 a month to access the dashboard – someone could right click, copy the code, and cancel the subscription (or worse, create their own clone for $4 per month). For another example, if I was querying a private database (that you don’t want people to be able to see), someone could right click and see that as well. So the WASM app only makes sense for things that don’t need to be private. With Google cloud app that is not a limitation.

The mistake I see many people make is picking Node/Python where PHP would probably be a better choice. Besides costs, you need to think about what is exposed to the end user and the level of effort to create/manage the site. So if you say to GenAI “I want to build a dashboard website” it may pop out a python example, but for many of the examples I am seeing it would have been better to use PHP and say “I have a php website, build a function to query my database and return an array of crimes by month”, and then as a follow up say, “ok I have that array, create a line chart in PHP and javascript using D3.js”.

So to me GenAI does not obviate the need to understand the technology, which can be complicated. You need a basic understanding of what you want, the constraints, and then ask the machine for help writing that code.

Stacking and clustering matplotlib bar charts

Just a quick post – I was recently trying to find examples of both stacking and clustering matplotlib + pandas bar charts. The few examples that come up online do not use what I consider to be one of the simpler approaches. The idea is that clustering is the hard part, and you can replicate the stacked bars fairly easily by superimposing multiple bars on top of each other. For a quick example, here is a set of data similar to a project I am working on, cases cleared by different units over several years:

import pandas as pd
import matplotlib.pyplot as plt

# data of cases and cleared over time
years = [20,21,22]
acases = [180,190,200]
aclosed = [90,90,90]
bcases = [220,200,220]
bclosed = [165,150,165]
ccases = [90,110,100]
cclosed = [70,100,90]

df = pd.DataFrame(zip(years,acases,aclosed,bcases,bclosed,ccases,cclosed),
                  columns=['Year','Cases_A','Closed_A','Cases_B','Closed_B',
                           'Cases_C','Closed_C'])
print(df)


# Transpose
dfT = df.set_index('Year').T
dfT.index = dfT.index.str.split("_",expand=True) # multi-index

The biggest pain in doing these bar charts is figuring out how to shape the data. My data initially came in the top format, where each row is a year. For this plot though, we want years to be the columns and rows to be the different units. So you can see the format after I transposed the data. And this is pretty much the first time in my life a multi-index in pandas was useful.

To superimpose multiple pandas plots, you can have the first call (which draws the full bar in the background) return its ax object. If you then pass that ax to the subsequent bar calls, they are superimposed on the same plot. To finish, I need to redo the legend.

# Now can plot
ax = dfT.loc['Cases',:].plot.bar(legend=False,color='#a6611a',edgecolor='black')
dfT.loc['Closed',:].plot.bar(ax=ax,legend=False,color='#018571',edgecolor='black')

# Redo legend
handler, labeler = ax.get_legend_handles_labels()
hd = [handler[0],handler[-1]]
lab = ['Not Closed','Closed']
ax.legend(hd, lab)
ax.set_title('Cases Assigned by Unit 20-22')

plt.show()

To make this a percentage filled bar plot, you just need to change the input data. So something like:

den = dfT.loc['Cases',:]
ax = (den/den).plot.bar(legend=False,color='#a6611a',edgecolor='black')
(dfT.loc['Closed',:]/den).plot.bar(ax=ax,legend=False,color='#018571',edgecolor='black')

# Redo legend
handler, labeler = ax.get_legend_handles_labels()
hd = [handler[0],handler[-1]]
lab = ['Not Closed','Closed']
ax.legend(hd, lab,loc='lower left')
ax.set_title('Proportion Cases Closed by Unit 20-22')

plt.show()

This does not have any way to signal that each subsequent bar within a cluster is a different year, so I just put the year range in the title. But I think the chart is pretty simple to interpret even absent that info.

Crime De-Coder Updates

For other Crime De-Coder updates, I was interviewed by Jason Elder on his law enforcement analysts podcast.

Also I am still doing the newsletter; the most recent one talks about my social media strategy (small, regular posts), an interesting research job analyzing administrative datasets at Chase, and some resources I have made for Quarto (which I much prefer over Jupyter notebooks).

Aoristic analysis, ebooks vs paperback, website footer design, and social media

For a few minor updates, I have created a new Twitter/X account to advertise Crime De-Coder. I do not know if there is some setting where people ignore all unverified accounts, but I would appreciate the follow and reshare if you are still on the platform.

I also have an account on LinkedIn, and sometimes comment on the Crime Analysis Reddit.

I try to share cool data visualizations and technical posts. I know LinkedIn in particular can be full of vapid self-help guru type advice, which I avoid. I know being more technical limits the audience, but that is ok. So I appreciate the follow if you are on those platforms, and resharing the work.

Ebooks vs Paperbacks

Part of the reason to start the X account back up is just to try more advertising. I have sold not quite 80 copies to date (including pre-sales). My baseline goal was 100.

For the non pre-sale purchases, I have sold 35% ebooks and 65% paperbacks. So spending some time to distribute a paperback version of your book seems worth it to me.

Again, I feel like for most academics who publish technical books, self-publishing is a very good idea. So read the above linked post about some of the logistics of self-publishing.

Aoristic analysis in python

On the CRIME De-Coder blog, check out my post on aoristic analysis. It has links to python code on github for those who just want the end result. The code has several methods to do hour of day and hour by day of week breakdowns, with the ability to break things down by categories in the data and to generate a few graphs post hoc. I like the line graphs the best:

But I can understand why people like the more common heatmap as well.
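For those not familiar with the technique, the core idea of aoristic analysis is pretty simple: for a crime reported over a begin/end window (so the exact time is unknown), you spread a weight of 1 equally over the hours the event could have occurred, then sum those weights across all events. A minimal sketch of that idea (this is not the code in the linked post, just an illustration):

import pandas as pd

def aoristic_hours(begin, end):
    # spread a weight of 1 equally over the hours the event could have occurred
    hours = pd.date_range(begin.floor('h'), end.floor('h'), freq='h')
    return pd.Series(1/len(hours), index=hours.hour)

# example, an event reported sometime between 10pm and 2:40am the next day
w = aoristic_hours(pd.Timestamp('2024-01-01 22:15'), pd.Timestamp('2024-01-02 02:40'))
print(w)  # weight of 0.2 for each of hours 22, 23, 0, 1, 2
# summing these weights across many events gives the hour of day distribution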

Website Design

I have a few minor website design updates. The homepage is more svelte. My wife suggested it should be easier to see what I do right when you land on the homepage, so I put the jumbotron box at the bottom and the services tiles (with no pictures) at the top.

It does not look bad on mobile either (I only recently figured out that Chrome’s DevTools has a button to turn on mobile view, very helpful!)

The final part is that I made a footer for my pages:

I am not really happy with this. One of the things you notice when you start doing web design is that everyone’s web page looks the same. There are some basic templates for WordPress or Wix (and probably other CMS generators). Here is Supabase’s footer for example:

And now that I have shown you, you will see quite a few websites have that design. So I did the svg links to social media, but I may change that. (And go with no footer again, there is not a real obvious need for it.) So open to suggestions!

I intentionally made many of the decisions for the way the Crime De-Coder site looks not only to make it usable, but to make it at least somewhat different. I avoid super long scrolls and use a sticky header (that still works quite well on phones). The header is quite dense with many sub-pages (I like it though).

I think a lot of the data dashboards public sector agencies are putting out now do not look very nice. Many are just iframed Tableau dashboards. If you want help embedding those data visualizations in a more organic way in your site, that is something Crime De-Coder can help with.

Reducing folium map sizes

Recently for a crimede-coder project I have been building out a custom library to make nice leaflet maps using the python folium library. See the example I have posted on my website. Below is a screenshot:

This map ended up having around 3000 elements in it, and was a total of 8mb. 8mb is not crazy to put on a website, but is at the stage where you can actually notice latency when first rendering the map.

Looking at the rendered html code though, it was verbose in a few ways for every element. One is that lat/lon coordinates have crazy precision by default, e.g. [-78.83229390597961, 35.94592660794455], and a single polygon can have many of those. Six decimal digits of precision for lat/lon is still under 1 meter, which is plenty for my mapping applications. So you can shave 8+ characters per coordinate and not really make a difference to the map (you can technically create invalid polygons doing this, but that is really pedantic and should be fine).
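As a quick back of the envelope check on that claim, one degree of latitude is roughly 111 kilometers, so the sixth decimal place works out to around a tenth of a meter:

# rough size of the 6th decimal place of a degree, in meters
meters_per_degree = 111320             # approximate meters per degree of latitude
print(10**-6 * meters_per_degree)      # ~0.11 meters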

A second issue is that in the rendered folium html every object is given a full uuid, e.g. geo_json_a19eff2648beb3d74760dc0ddb58a73d.addTo(feature_group_2e2c6295a3a1c7d4c8d57d001c782482);. This again is not necessary. I end up reducing the 32 character uuids to just the first 8 characters.

A final part is that the javascript is not minified – it has quite a bit of extra lines/spaces that are not needed. So here are my notes on using python code to take care of some of those pieces.

To clean up the precision for geometry objects, I do something like this.

import re

# geo is the geopandas dataframe
redg = geo.geometry.set_precision(10**-6).to_json()
# redg still has floats, below regex clips values
rs = r'(\d{2}\.|-\d{2}\.)(\d{6})(\d+)'
redg = re.sub(rs,r'\1\2',redg)

As most of my functions add the geojson objects to the map one at a time (for custom actions/colors), this is sufficient to deal with that step (for markers, you can round the lat/lon directly). It may make more sense to set the precision to 10**-5 and then clip to 5 digits in the regex. (For the regexes I am showing, there is some risk they will replace something they should not, but I think they are pretty safe.)

Then to clean up the UUIDs and the extra whitespace, what I do is render the final HTML and then use regexes:

# fol is the folium object
html = fol.get_root()
res = html.script.get_root().render()
# replace UUID with first 8
ru = r'([0-9a-f]{8})[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}'
res = re.sub(ru,r'\1',res)
# clean up whitespace
rl = []
for s in res.split('\n'):
    ss = s.strip()
    if len(ss) > 0:
        rl.append(ss)
rlc = '\n'.join(rl)

There is probably a smarter way to do the UUID step directly with the folium object. The whitespace step though needs to happen after the HTML is written. You want to be careful with the whitespace cleanup – it is possible you wanted blank lines in, say, a leaflet popup or tooltip. But for my purposes those are not really necessary.

Doing these two steps for the Durham map reduces the size of the rendered HTML from 8mb to 4mb – that is around 4 million characters shaved off the file! The savings will be even higher for maps with more elements.

One last part is that my map has redundant svg inserted for every map marker. I may be able to use css to insert the svg instead, e.g. something like .mysvg {background-image: url("vector.svg");} in the css, and then have the python code for the marker insert <div class="mysvg"></div>. For dense point maps this will also save quite a few characters. Or you could add in javascript to insert the svg as well (although that would be a bit sluggish I think relative to the css approach – sluggish after the first render if the markers are toggled off and on).

I have not done this yet, as I need to tinker with getting the background svg to look how I want, but it could save another 200-300 characters per marker icon. So I am guessing it will save a megabyte in the map for every 3000-5000 markers.

The main reason I post web demos on the crimede-coder site is that there are quite a few grifters in the tech space – not just for data analysis, but for front-end development as well. I post stuff like that so you can go and actually see the work I do and its quality. There are quite a few people now claiming to be “data viz experts” who just embed mediocre Tableau or PowerBI apps. Those apps in particular tend to produce very bad maps, so here you can see what I think a good map should look like.

If you want to check out all the interactions in the map, I posted a YouTube video walking through them.

Durham hotspot map walkthrough of interactions

Wake LP talk on LPRs and javascript hacks in WooCommerce

For some Crime De-Coder news, I will be giving a tech talk on automated license plate readers for the Wake County Libertarian Party on Wednesday July 17th in Raleigh.

See my slides on the CRIME De-Coder website to get a preview of what I will be talking about.

The rest of this post is a quick one that is hopefully helpful to others. I use WooCommerce + LuLu to handle printing and shipping for my print book. For those who have purchased a copy, thank you! Many of the paperback purchases will be arriving at your homes very soon. There have been two hiccups with my store website for individuals, though.

Problem 1 is a “nonce is invalid” error popping up after trying to add the book to the cart. It is difficult to replicate, and as best I can tell it is an underlying cache error with WordPress. My suggested fix is to go to the individual book page directly to add the book to your cart (paperback, ebook). It seems to mostly happen to me when I am on the main shop page and add directly to the cart. If the problem persists for you let me know.

The second problem is that for the print book, to do shipping on LuLu’s end you need to input a phone number. As far as I can tell, most WooCommerce advice online suggests paying for an additional plugin to edit this option. I was able to use javascript (and the free WPCode plugin) to force the phone number to be required for free. Sharing here as I hope it will be helpful to others.

Check out the onload function first; it sets the required attribute on either the billing-phone or shipping-phone input. If you are fast and looking at the exact right spot of the checkout page, you can see the label change from “Phone (Optional)” to “Phone”.

The change_label event listener modifies the error message when you don’t put in a phone number (by default it confusingly says “Please enter a valid phone number (optional)”). That part is a bit hacky, attaching the event listener to the entire webpage, but unless you are trying to purchase from a Commodore 64 it should be fine.

<script>
// This script forces the billing/shipping
// phone number to be filled in and not optional
// Andy Wheeler, done via WPCode Extension in Footer
function change_label(){
    var xl = document.getElementById("billing-phone");
    if (xl){
        var ll = xl.nextSibling;
        var nd = ll.nextSibling;
        if (nd) {
            if (nd.getAttribute('role') == 'alert') {
                nd.firstChild.innerText = "Please enter a valid phone number"
            };
        };
    };
    var xs = document.getElementById("shipping-phone");
    if (xs){
        var ls = xs.nextSibling;
        var ns = ls.nextSibling;
        if (ns) {
            if (ns.getAttribute('role') == 'alert') {
                ns.firstChild.innerText = "Please enter a valid phone number"
            };
        };
    };
};

// So click is not working when people
// just use tabs/keyboard select,
// not sure how to fix that, but it just results in a
// bad red note that says "optional" (you still need
// to fill the phone number in)
document.addEventListener('click',change_label);

window.onload = function() {
    var x = document.getElementById("billing-phone");
    if (x) {
        var lab = x.nextSibling;
        lab.innerText = "Phone";
        x.setAttribute('aria-label','Phone')
        x.setAttribute('required','')
        // These don't seem to work unfortunately!
        //x.addEventListener("change",change_label);
        //x.setAttribute("onChange","change_label();")
    };
    var x2 = document.getElementById("shipping-phone");
    if (x2) {
        var lab2 = x2.nextSibling;
        lab2.innerText = "Phone";
        x2.setAttribute('aria-label','Phone')
        x2.setAttribute('required','')
    };
};
</script>

Because there is no phone verification, you could technically put in a fake number here FYI and no one would know. (I have a google voice number I use for instances in which I don’t really want to give out my personal number.)

Thanks again for those who have purchased a copy – appreciate the support.

Some notes on self-publishing a tech book

So my book, Data Science for Crime Analysis with Python, is finally out for purchase on my Crime De-Coder website. Folks anywhere in the world can purchase a paperback or epub copy of the book. You can see this post on Crime De-Coder for a preview of the first two chapters, but I wanted to share some of my notes on self publishing. It was some work, but in retrospect it was worth it. From prior books I have been involved with (Wheeler 2017; Wheeler et al. 2021), plus my peer review experience, I knew I did not need help with copy-editing, so these notes are mostly about creating the physical book and the logistics of selling it.

Academics may wish to go with a publisher for prestige reasons (I get it, I was once a professor as well). But it is quite nice, once you have done the legwork, to publish it yourself. You have control of the pricing, so if you want to make money you can, or you can make it cheap/free for students.

Here I will detail some of the set up of compiling the book, and then the bit of work to distribute it.

Compiling the documents

So the way I compiled the book is via Quarto. I posted my config notes on how to get the book contents to look how I wanted on GitHub. Quarto runs the code as it renders the document (so it works nicely for a learning to code book). But even if I just wanted a more typical science/tech book with text/images/equations, I would personally use Quarto since I am familiar with the set up at this point. (If you do not need to run dynamic code you could do it in Pandoc directly; I am not sure if there is a way to translate a Quarto yaml config into the equivalent Pandoc options it turns into.)

One thing that I think will interest many individuals – you write in plain text markdown. So my writing looks like:

# Chapter Heading

blah, blah blah

## Subheading

Cool stuff here ....

In a series of text files for each chapter of the book. And then I tell Quarto quarto render, and it turns the writing in those text files into both an Epub and a PDF (and other formats if you care, such as word or html). You can set up the configuration for the book to be different for the different formats (for example I use different fonts in the PDF vs the epub; fonts that look nice in one can look quite bad in the other). See the _quarto.yml file for the set up, in particular the config options that differ between PDF and Epub.

One thing to note is that ebooks are hard to format nicely – if I had an existing book I wanted to redo as an epub, I would translate it to markdown. There are services online that will do the translation, but they do a bad job with scientific texts that have many figures (and surely will not help you choose nice fonts). So just learn markdown and translate it yourself. Folks who write in one format and save to the other (either Epub/HTML to PDF, or PDF to Epub/HTML) are doing it wrong, and the translated format will look very bad. Most advice online is for people who have books with just text, so for science people with figures (and footnotes, citations, hyperlinks, equations, etc.) it is almost all bad advice.

So even for qualitative people, learning how to write in markdown to self-publish is a good skill to learn in my opinion.

Setting up the online store

For a while I have been confused about how SaaS companies offer payment plans. (Many websites just seem to copy from generic node templates.) Looking at the Stripe API, it seems over the top for me to script up my own solution to integrate Stripe directly. If I wanted to do a subscription I might need to figure that out, but it ended up being that for my Hostinger website I can set up a sub-page that is WordPress (even though the entire website is not), and turn on WooCommerce for that sub-page.

WooCommerce ends up being easy, and you can set up the store to host web assets to download on demand (so when you purchase, it generates a unique URL that obfuscates where the digital asset is saved). There was no programming involved to set up my webstore – it was all just point and click to set things up one time, and not that much work in the end.

I am not sure about setting up any DRM for the epub (so in reality people will purchase epub and share it illegally). I don’t know of a way to prevent this without using Amazon+Kindle to distribute the book. But the print book should be OK. (If there were a way for me to donate a single epub copy to all libraries in the US I would totally do that.)

I originally planned on having it on Amazon, but the low margins on both, plus the formatting limits of their idiosyncratic kindle book format (as far as I can tell, I cannot really choose my fonts), made me decide against doing either the print or ebook on Amazon.

Print on Demand using LuLu

For print on demand, I use LuLu.com. They have a nice feature to integrate with WooCommerce; the only thing I wish for is that shipping was dynamically calculated. (The way it is set up now I need to make a flat shipping rate for different areas around the globe, which is slightly annoying and changes the profit margins depending on the area.)

LuLu is a few more dollars to print than Amazon, but it is worth it for my circumstance I believe. Now if I had a book I expected to get many “random Amazon search buys” I could see wanting it on Amazon. I expect more sales will be via personal advertising (like here on the blog, social media, or other crime analyst events). My Crime De-Coder site (and this blog) will likely be quite high in google searches for some of the keywords fairly quickly, so who knows, maybe just having it on my personal site results in just as many sales.

LuLu does have an option to turn on distribution to other wholesalers (like Barnes & Noble and Amazon) – I have not turned that on, but maybe I will in the future.

LuLu has a pricing calculator on their website to see how much printing costs. Paperback with basically the cheapest color option for letter sized paper (which is quite large) is just over $17 for my 310 page book (Amazon was just over $15). If your book is less image heavy and more text, you could get away with a smaller size book (and maybe black/white), which I suspect will be much cheaper. LuLu’s printing of this book is higher quality than Amazon’s as well (better printing of the colors and nicer stock for the paperback cover).

Another nice thing about print on demand is that I can go in and edit/update the book as I see fit – no need to worry about new editions. I am not sure what that exactly means for citing the work (I could always go and change it); you can’t have a static version of record and an easy way to update at the same time.

Other Random Book Stuff

I purchased ISBNs on Bowker, something like 10 ISBNs for $200. (You want a unique ISBN for each format of the book, so you may want three in the end if you have epub/paperback/hardback.) Amazon and LuLu have options to give you an ISBN though, so that may not have been necessary. I set the imprint in Bowker to be my LLC, so CRIME De-Coder is the publisher.

You don’t technically need an ISBN at all, but it is a simple thing, and there may be ways for me to donate to libraries in the future. (If a university picks it up as a class text – I have been at places where you need at least one copy available to rent at the university library.)

I have not created an index – I may have a go at feeding my book through LLMs and seeing if I can auto-generate a nice one. (I just need a list of key words; after that I can go and find-replace the relevant text in the book to mark the entries so it auto-compiles an index.) I am not sure that is really necessary for a how-to book though, you should just look at the table of contents to see the individual (fairly small) sections. For epub you can just do a direct text search, so I am not sure people use an index at all in epubs.
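If I do go that route, the find-replace piece is easy enough to script. Below is a rough sketch (the keyword list and file layout are hypothetical, and it assumes the PDF format is set up with the LaTeX makeidx machinery so the raw \index{} entries compile into an index):

import re
from pathlib import Path

keywords = ['DataFrame', 'groupby']   # hypothetical list, e.g. generated via an LLM

for qmd in Path('.').glob('*.qmd'):
    text = qmd.read_text()
    for kw in keywords:
        # tag the first occurrence in each chapter with a raw LaTeX index entry
        pattern = rf'\b{re.escape(kw)}\b'
        text = re.sub(pattern, lambda m: m.group(0) + '\\index{' + kw + '}', text, count=1)
    # write to a copy so the original chapters are untouched
    qmd.with_name(qmd.stem + '_indexed.qmd').write_text(text)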

Personal Goals

So I debated on releasing the book open source, but I do want to try and see if I can make some money. I don’t have this expectation, but there is potential to get some “data science” spillover, and if that is the case sales could in theory be quite high. (I was surprised in searching the “data science python” market on Amazon – it is definitely not saturated.) Personally I will consider at least 100 sales to be my floor for success. That is, if I can sell at least 100 copies, I will consider writing more books. If I can’t sell 100 copies I have a hard time justifying the effort – it would just be too few people buying the book to have the types of positive spillovers I want.

To make back money relative to the amount of work I put in, I would need more than 1000 sales (which I think is unrealistic). I think 500 sales is about the best case, guesstimating the size of the crime analyst community that may be interested plus some additional sales to grad students. For 1000 sales it would need to be multiple professors using it as a class book over several years. (If you are a professor and interested in this for a class let me know, I will give your class a discount.)

Another common way for individuals to make money off of books is not the sales themselves, but trainings oriented around the book. I am hoping to do more of that for crime analysts directly in the future, but I presume those opportunities will be correlated with total sales.

I do enjoy writing, but I am busy, so cannot just say “I am going to drop 200 hours writing a book”. I would like to write additional python topics oriented towards crime analysts/criminology grad students like:

  • GIS analysis in python
  • Regression
  • Machine Learning & Optimization
  • Statistics for Crime Analysis
  • More advanced project management in python

Having figured out much of this grunt work definitely makes me more motivated, but ultimately I need a certain level of sales to justify the effort. So please, if you like the blog, pick up a copy and tell a friend you like my work!


My word template for Quarto

I have posted on Github my notes on creating a word template to use with Quarto. And since Quarto just feeds into pandoc, those who are using pandoc directly (so not doing intermediate computations) should find that template worthwhile as well.

So first, why word? Quarto by default looks pretty nice for HTML. It is fine for them to prioritize that, but for the majority of reports I want to use Quarto for, HTML is not the best format. Many times I want a report that can be emailed as a PDF and/or printed. And sometimes I (or my clients) want a semi-automated report that can be edited after the fact. In those cases word is a good choice.

Editing LaTeX is too hard, and I am pretty happy with this template for small reports. I will be sharing my notes on writing my python book in Quarto soonish, but for now I wanted to share how I created a word template.

Note some of the items may seem gratuitous (why so many CRIME De-Coder logos?). Part of those are just notes though (like how to insert an image after your author name – I have done this to insert my signature in reports, for example). The qmd file has most of the things I am interested in doing in documents, such as how to format markdown tables in python, doing sections/footnotes, references, table/figure captions, etc.

I do like my logo in the header though (it is even hyperlinked, so in subsequent PDFs if you click the logo it will go to my website), and the footer page numbers are something I commonly need in reports as well. My title page and TOC do not look bad either IMO. I am not one to discuss fonts, but I like small caps for titles, and the Verdana font is nice to make it look somewhat different.

Creating the Template

So first, you can run the following from the command line:

quarto pandoc -o custom-reference-doc.docx --print-default-data-file reference.docx

From there, you should edit that custom-reference-doc.docx file to get what you want. So for example, if you want to change the font used for code snippets, in Word you can open up Styles, and on the right hand side select the different elements and edit them:

Here for example to change the font for code snippets, you modify the HTML code style (I like Consolas):

There ended up being a ton of things I edited, and I did not keep a list. Offhand, you will want to modify the Title, Heading 1 & 2, First Paragraph, and Body Text styles. And then you can edit things like the page numbers and header/footer.

So when rendering a document, you can sometimes click on the element in the rendered document and figure out what style it inherits from. Here for example you can see in the test.docx file that the quote section uses the “Block Text” style:

This does not always work though, and it can take some digging/experimentation in the original template file to get the right style modifier. (If you are having a really hard problem, rename the word document to .zip and dig into the XML documents – you can see which styles a format inherits from in the XML tree.) It doesn’t work for the code segments, for example. Do not render a document and then edit the style in that rendered document; only edit the original template generated from the --print-default-data-file command line call to update your template.
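If you would rather poke at those styles programmatically than rename the file to .zip by hand, here is a small sketch in python (a .docx is just a zip archive, and the style definitions live in word/styles.xml):

import re
import zipfile

# peek at the style ids defined in the reference template
with zipfile.ZipFile('custom-reference-doc.docx') as z:
    styles = z.read('word/styles.xml').decode('utf8')

# list the style ids, e.g. to hunt for what a block quote inherits from
print(re.findall(r'w:styleId="([^"]+)"', styles))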

I have placed a few notes in the readme on Github, but one of my main goals was making tables look nice. The template plays nicely with markdown tables, which I can use python to render directly. Here is an example of spreading tables across multiple pages.

One thing to note though is that this has limits – different styles are interrelated, so sometimes I would change one and it would propagate errors to different elements. (I can’t figure out how to change the default bullets to squares instead of circles, for example, without having bullets show up in places they should not be in tables – try to figure that one out. I also cannot figure out how to change the default font in tables – I would use monospace – without changing the font for other text elements in normal blocks.) So this template was the best I could figure out without breaking other parts.

I have a few notes in the qmd file as well, showing how to use different aspects of markdown, as well as some sneaky things to do extra stuff (like formatting fourth level headings to produce a page break – I do not think I will need headings that deep).

Even for those not using Quarto for computational workflows, writing in markdown is a really useful skill. You write in plain text, and can then have the output in different formats. Even for qualitative folks (or people in industry creating documents), I think many people would be well served by writing content in plain text markdown, and then rendering to whatever output they wanted.