Python f string number formatting and SPSS break long labels

Another quick blog post, as moving is not 100% crazy all the time now, but I need a vacation after all that work. So two things in this blog post: formatting numeric f strings in python, and breaking long labels in SPSS meta-data.

Python f-string numeric formatting

This is super simple, but I can never remember it (so making a quick blog post for my own reference). As of python 3.6, you can use f-strings to do simple text substitution. So if you do:

x = 2/3
sub_str = f'This proportion is {x}'
print(sub_str)

Then we will get printed out This proportion is 0.6666666666666666. So packing global items inside of {} expands within the f string. While for more serious string subsitution (like creating parameterized SQL queries), I like to use string templates, these f-strings are very nice to print short messages to the console or make annotations in graphs.

Part of this note is that I never remember how to format these strings. If you are working with integers it is not a big deal, but as you can see above I often do not want to print out all those decimals inside my particular message. A simple way to format the strings are:

f'This proportion is {x:.2f}'

And this prints out to two decimal places 'This proportion is 0.67'. If you have very big numbers (say revenue), you can do something like:

f'This value is ${x*10000:,.0f}'

Which prints out 'This value is $6,667' (so you can modify objects in place, to say change a proportion to a percentage).

Note also to folks that you can have multi-line f-strings by using triple quotes, e.g.:

f'''This is a super
long f-string for {x:.2f}
on multiple lines!'''

But one annoying this is that you need to keep the whitespace correct inside of functions even inside the triple string. So those are cases I like using string templates. But another option is to break up the string and use line breaks via \n.

long_str = (f'This is line 1\n'
            f'Proportion is {x:.1f}\n'
            f'This is line 3')
print(long_str)

Which prints out:

This is line 1
Proportion is 0.7
This is line 3

You could do the line breaks however, either at the beginning of each line or at the end of each line.

SPSS break long labels

This was in reference to a project where I was working with survey data, and for various graphs I needed to break up long labels. So here is an example to illustrate the problem.

* Creating simple data to illustrate.
DATA LIST FREE / X Y(2F1.0).
BEGIN DATA
1 1
2 2
3 3
4 4
END DATA.
DATASET NAME LongLab.
VALUE LABELS X
  1 'This is a reallllllllly long label'
  2 'short label'
  3 'Super long unnecessary label that is long'
  4 'Again another long label what is up with this'
.
VARIABLE LABELS
  X 'Short variable label'
  Y 'This is also a super long variable label that is excessive!'
.
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="g" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))
END GPL.

So you can see, SPSS shrinks the data to accommodate the long labels. (I don’t know how to control the behavior in the graph or the chart template itself, so not sure why only this gets wrapped for the first label.) So we can use the \n line break trick again in SPSS to get these to split where we prefer. Here are some python functions to do the splitting (which I am sure can be improved upon), as well as to apply the splits to the current SPSS dataset. You can decide the split where you want the line to be broken, and so if a word goes above that split level it wraps to the next line.

* Now some python to wrap long labels.
BEGIN PROGRAM PYTHON3.
import spss, spssaux

# Splits a long string with line breaks
def long_str(x,split):
    split_str = x.split(" ")
    cum = len(split_str[0])
    cum_str = split_str[0]
    for s in split_str[1:]:
        cum += len(s) + 1
        if cum <= split:
            cum_str += " " + s
        else:
            cum_str += r"\n" + s
            cum = len(s)
    return cum_str

# This grabs all of the variables in the current SPSS dataset
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]

# This looks at the VALUE LABELS and splits them up on multiple lines
def split_vallab(vList, lsplit):
    vardict = spssaux.VariableDict()
    for v in vardict:
        if v in vList:
            vls= v.ValueLabels.keys()
            if vls:
                for k in vls:
                    ss = long_str(v.ValueLabels[k], lsplit)
                    if ss != v.ValueLabels[k]:
                        vn = v.VariableName
                        cmd = '''ADD VALUE LABELS %(vn)s %(k)s \'%(ss)s\'.''' % ( locals() )
                        spss.Submit(cmd)

# I run this to split up the value labels
split_vallab(varList, 20)

# This function is for VARIABLE LABELS
def split_varlab(vList,lsplit):
    for i,v in enumerate(vList):
        vlab = spss.GetVariableLabel(i)
        if len(vlab) > 0:
            slab = long_str(vlab, lsplit)
            if slab != vlab:
                cmd = '''VARIABLE LABELS %(v)s \'%(slab)s\'.''' % ( locals() )
                spss.Submit(cmd)

# I don't run this right now, as I don't need it
split_varlab(varList, 30)
END PROGRAM.

And now we can re-run our same graph command, and it is alittle nicer:

GGRAPH
  /GRAPHDATASET NAME="g" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("g"))
  DATA: X=col(source(s), name("X"), unit.category())
  DATA: Y=col(source(s), name("Y"))
  COORD: rect(dim(1,2), transpose())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Value"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(X*Y))
END GPL.

And you can also go to the variable view to see my inserted line breaks:

SPSS still does some auto-intelligence when to wrap lines in tables/graphs (so if you do DISPLAY DICTIONARY. it will still wrap the X variable label in my default tables, even though I have no line break). But this gives you at least a slight bit of more control over charts/tables.

Some ACS download helpers and Research Software Papers

The blog has been a bit sparse recently, as moving has been kicking my butt (hanging up curtains and recycling 100 boxes today!). So just a few quick notes.

Downloading ACS Data

First, I have posted some helper functions to work with American Community Survey data (ACS) in python. For a quick overview, if you import/define those functions, here is a quick example of downloading the 2019 Texas micro level files (for census tracts and block groups) from the census FTP site. Can pipe in another year (if available) and and whatever state into the function.

# Python code to download American Community Survey data
base = r'??????' #put your path here where you want to download data
temp = os.path.join(base,'2019_5yr_Summary_FileTemplates')
data = os.path.join(base,'tables')

get_acs5yr(2019,'Texas',base)

Some locations have census tract data to download, I think the FTP site is the only place to download block group data though. And then based on those files you downloaded, you can then grab the variables you want, and here I show selecting out the block groups from those fields:

interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B11003_013','B14006_002','B01001_003','B23025_005',
            'B22010_002','B16002_004','GEOID','NAME']
labs, comp_tabs = merge_tabs(interest,temp,data)
bg = comp_tabs['NAME'].str.find('Block Group') == 0

Then based on that data, I have an additional helper function to calculate proportions given two lists of the numerators and denominators that you want:

top = ['B17010_002',['B11003_016','B11003_013'],'B08141_002']
bot = ['B17010_001',        'B11002_001'       ,'B08141_001']
nam = ['PovertyFamily','SingleHeadwithKids','NoCarWorkers']
prep_sdh = prop_prep(bg, top, bot, nam)

So here to do Single Headed Households with kids, you need to add in two fields for the numerator ['B11003_016','B11003_013']. I actually initially did this example with census tract data, so not sure if all of these fields are available at the block group level.

I have been doing some work on demographics looking at the social determinants of health (see SVI data download, definitions), hence the work with census data. I have posted my prior example fields I use from the census, but criminologists may just use the social-vulnerability-index from the CDC – it is essentially the same as how people typically define social disorganization.

Peer Review for Criminology Software

Second, jumping the gun a bit on this, but in the works is an overlay journal for CrimRxiv. Part of the contributions we will accept are software contributions, e.g. if you write an R package to do some type of analysis function common in criminology.

It is still in the works, but we have some details up currently and a template for submission (I need to work on a markdown template, currently just a word doc). High level I wanted something like the Journal of Statistical Software or the Journal of Open Source Software (I do not think the level of detail of JSS is necessary, but wanted an example use case, which JoSS does not have).

Just get in touch if you have questions whether your work is on topic. Aim is to be more open to contributions at first. Really excited about this, as publicly sharing code is currently a thankless prospect. Having a peer reviewed venue for such code contributions for criminologists fills a very important role that traditional journals do not.

Future Posts?

Hopefully can steal some time to continue writing posts here and there, but will definitely be busy getting the house in order in the next month. Hoping to do some work on mapping grids and KDE in python/geopandas, and writing about the relationship between healthcare data and police incident report data are two topics I hope to get some time to work on in the near future for the blog.

If folks have requests for particular topics on the blog though feel free to let me know in the comments or via email!

Minimum detectable effect sizes for place based designs

So I was reading Blattman et al.’s (2018) work on a hot spot intervention in Bogotá the other day. It is an excellent piece, but in a supplement to the paper Blattman makes the point that while his study is very high powered to detect spillovers, most other studies are not. I am going to detail here why I disagree with his assessment on that front.

In appendix A he has two figures, one for the direct effect comparing the historical hot spot policing studies (technically he uses the older 2014 Braga study, but here is the cite for the update Braga et al., 2020).

The line signifies a Cohen’s D of 0.17, and here is the same graph for the spillover estimates:

So you can see Blattman’s study in total number of spatial units of analysis breaks the chart so to speak. You can see however there are plenty of hot spot studies in either chart that reported statistically significant differences, but do not meet the 0.17 threshold in Chris’s chart. How can this be? Well, Chris is goal switching a bit here, he is saying using his estimator the studies appear underpowered. The original studies on the graph though did not necessarily use his particular estimator.

The best but not quite perfect analogy I can think of is this. Imagine I build a car that gets better gas mileage compared to the current car in production. Then someone critiques this as saying the materials that go into production of the car have worse carbon footprints, so my car is actually worse for the environment. It would be fine to argue a different estimate of total carbon footprint is reasonable (here Chris could argue his estimator is better than the ones the originally papers used). It is wrong though to say you don’t actually improve gas mileage. So it is wrong for Chris to say the original articles are underpowered using his estimator, they may be well powered using a different estimator.

Indeed, using either my WDD estimator (Wheeler & Ratcliffe, 2018) or Wilson’s log IRR estimator (Wilson, 2021), I will show how power does not grow with more experimental units, but with a larger baseline number of crimes for those estimators. They both only have two spatial units of analysis, so in Chris’s chart will never gain more power.

One way I think about the issue for spatial designs is this – you could always split up a spatial lattice into ever finer and finer spatial units of analysis. For example Chris could change his original design to use addresses instead of street segments, and split up the spillover buffers into finer slices as well. Do you gain something for doing nothing though? I doubt it.

I describe in my dissertation how finer spatial units of analysis allow you to check for finer levels of spatial spillovers, e.g. can check if crime spills over from the back porch to the front stoop (Wheeler, 2015). But when you do finer spatial units, you get more cold floor effects as well due to the limited nature of crime counts – they cannot go below 0. So designs with lower baseline crime rates tend to show lower power (Hinkle et al., 2013).

MDE for the WDD and log IRR

For minimum detectable effect (MDE) sizes for OLS type estimators, you need to specify the variance you expect the underlying treated/control groups to have. With the count type estimators I will show here, the variance is fixed according to the count. So all I need to specify is the alpha level of the test. Here I will do a default of 0.05 alpha level (with different lines for one-tailed vs two-tailed). The other assumption is the distribution of crime counts between treated/control areas. Here I assume they are all equal, so 4 units (pre/post and treated/control). For my WDD estimator this actually does not matter, for the later IRR estimator though it does (so the lines won’t really be exact for his scenario).

So here is the MDE for mine and Jerry’s WDD estimator:

What this means is that if you have an average of 20 crimes in the treated/control areas for each time period separately, you would need to find a reduction of 15 crimes to meet this threshold MDE for a one-tailed. It is pretty hard when starting with low baselines! For an example close to this, if the treated area went from 24 to 9, and the control area was 24 to 24, this would meet the minimal treated reduction of 15 crimes in this example.

And here is the MDE for the log IRR estimator. The left hand Y axis has the logged effect, and the right hand side has the exponentiated IRR (incident rate ratio).

Since the IRR is commonly thought of as a percent reduction, this suggests even with baselines of 200 crimes, for Wilson’s IRR estimator you need percent reductions of over 20% relative to control areas.

So I have not gone through the more recent Braga et al. (2020) meta-analysis. I do not know if they have the data readily available to draw the points on this plot the same as in the Blattman article. To be clear, it may be Blattman is right and these studies are underpowered using either his or my estimator, I am not sure. (I think they probably are quite underpowered to detect spillover, since this presumably will be an even smaller amount than the direct effect. But that would not explain estimates of diffusion of benefits commonly found in these studies!)

I also do not know if one estimator is clearly better or not – for example Blattman could use my estimator if he simply pools all treated/control areas. This is not obviously better than his approach though, and foregoes any potential estimates of treatment effect variance (I will be damned if I can spell that word starting with het even close enough for autocorrect). But maybe the pooled estimate is OK, Blattman does note that he has cold floor effects in his linear estimator – places with higher baselines have larger effects. This suggests Wilson’s log IRR estimator with the pooled data may be just fine and dandy for example.

Python code

Here is the python code in its entirety to generate the above two graphs. You can see the two functions to calculate the MDE given an alpha level and average crime counts in each area if you are planning your own study, the wdd_mde and lirr_mde functions.

'''
Estimating minimum detectable effect sizes
for place based crime interventions

Andy Wheeler
'''

import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\min_det_effect'
os.chdir(my_dir)

#########################################################
#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
#########################################################

#########################################################
# Functions for MDE for WDD and logIRR estimator


def wdd_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( avg_counts*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

def lirr_mde(avg_counts,alpha=0.05,tails='two'):
    se = np.sqrt( (1/avg_counts)*4 )
    if tails == 'two':
        a = 1 - alpha/2
    elif tails == 'one':
        a = 1 - alpha
    z = norm.ppf(a)
    est = z*se
    return est

# Generating regular grid from 10 to 200
cnts = np.arange(10,201)
wmde1 = wdd_mde(cnts, tails='one')
wmde2 = wdd_mde(cnts)

imde1 = lirr_mde(cnts, tails='one')
imde2 = lirr_mde(cnts)

# Plot for WDD MDE
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(cnts, wmde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts, wmde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_axisbelow(True)
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('Crime Count Reduction')
ax.legend(loc='upper left')
plt.xticks(np.arange(0,201,20))
plt.yticks(np.arange(10,61,5))
plt.title("WDD MDE alpha level 0.05")
plt.savefig('WDD_MDE.png', dpi=500, bbox_inches='tight')

# Plot for IRR MDE
fig, ax = plt.subplots(figsize=(8,6))
ax2 = ax.secondary_yaxis("right", functions=(np.exp, np.log))
ax.plot(cnts,-1*imde1,color='k',linewidth=2, label='One-tailed')
ax.plot(cnts,-1*imde2,color='blue',linewidth=2, label='Two-tailed')
ax.set_axisbelow(True)
ax.set_xlabel('Average Number of Crimes in Treated/Control')
ax.set_ylabel('log IRR')
ax.set_ylim(-0.16, -1.34)
ax.legend(loc='upper right')
ax.set_yticks(-1*np.arange(0.2,1.31,0.1))
ax2.set_ylabel('IRR')
ax2.grid(False)
plt.xticks(np.arange(0,201,20))
plt.title("IRR MDE alpha level 0.05")
plt.savefig('IRR_MDE.png', dpi=500, bbox_inches='tight')

#########################################################

References

Comparing the WDD vs the Wilson log IRR estimator

So this is maybe my final post on the WDD estimator for the time being (Wheeler & Ratcliffe, 2018). Recently David Wilson had an article in JQC that proposes a different estimator using the same basic information, just pre-post crime counts for treated and control areas (Wilson, 2021). So say we had the table:

         Pre   Post
Treated   50     30
Control   60     55

So in this scenario, my WDD estimate is -20 in the treated area, and -5 in the control area, so the overall estimate is -20 – -5 = -15.

30 - 50 - (55 - 60) = -15

So an estimated reduction of -15 crimes overall. David’s estimator is the logged incident rate ratio (IRR), and so is just like above, except logs all of the values:

log(30) - log(50) - ( log(55) - log(60) ) = -0.4238142

This is a logged incident rate adjustment, so most of the time people exponentiate this value, which is exp(-0.4238142) = 0.6545455. So this suggests crime is reduced by approximately 35% in the treated area relative to the control area in this hypothetical. Or another way to write it is (30/50)/(55/60) = 0.6545455.

So instead of a linear estimate of the total numbers of crimes reduced, this is an estimate of the overall rate reduction. So this begs the question when would you prefer my WDD vs the IRR? I will try to answer that below – in short I think David’s estimator makes sense for meta-analyses (as I have said before in reference to the work in Braga & Weisburd, 2020). But for an individual agency doing an experimental evaluation I much prefer my estimator. The skinny of this logic is that we only really care about the overall crime reduction estimate from a cost-benefit analysis perspective. Backing out this total crime reduction count estimate from David’s IRR estimate can result in some funny business for an individual study.

Identifying Assumptions

So there are really two different assumptions my WDD estimator and David’s IRR estimator make. To generate a standard error estimate around the point estimate for either estimator, both require the data are Poisson distributed. So that makes no difference between the two. The assumption that really distinguishes between the WDD and the IRR estimate is the parallel trends assumption. The WDD assumes parallel trends are on the linear scale, whereas the IRR assumes parallel trends are on the ratio scale.

What exactly does this mean? Imagine we have a treated and control area, but look at the crime trends per time period before the treatment occurred. This set of areas has a set of parallel trends on the linear scale:

Time Treated Control
 0     50      60
 1     40      50
 2     45      55
 3     50      60

When the treated area goes down by 10 crimes, the control area goes down by 10 crimes. That is a parallel on the linear scale. Whereas this scenario is parallel on the ratio scale:

Time Treated Control
 0     50      60
 1     40      48
 2     45      54
 3     50      60

When crime goes down by 20% in the treated area, it goes down by 20% in the control area.

So while this gives a potential way to say you should use the WDD (parallel on the linear scale), or the IRR (parallel on the ratio scale), in practice it is not so simple. For one thing, if you only has the pre/post counts of crime, you cannot distinguish between these two scenarios. You can only tell in the case you have historical data to examine.

For a second part of this, you typically can choose your own control area (see for example the synthetic control estimator). So in most scenarios you could choose a control area to obey the linear or the ratio parallel trends assumption if you wanted to. However it may be in many scenarios there is a natural/easy control area, and you may see what is a better fit in that case for linear/ratio.

A final wee bit of a perverse aspect about this I will mention – pretend we have a treated/control area have approximately the same baseline crime counts/rates:

Time Treated Control
 0      30     30
 1      25     25
 2      20     20
 3      25     25

You actually cannot tell in this scenario whether the parallel trends are on the linear scale for my WDD or the ratio scale for the IRR estimate. They are consistent with either! In practice I think in many cases it will be like this – with noisy data, if you choose a control area that has approximately the same baseline crime counts, it will be quite hard to tell whether the linear parallel trends makes more sense or the ratio parallel trends makes more sense.

There are situations where the linear changes do not make sense, but they tend to be scenarios such as the control area has very little crime (so cannot go below 0 to match larger ups/downs in the treated area). So in that case sure the IRR is plausible and the WDD is not, but those are cases where the control area itself is quite questionable. Also note the IRR is not defined for any cells with 0 crimes – but again it is not good for either of our estimators in that case (although mine won’t fail to spit out a number, the power is so low the number it spits out won’t be worth much).

Bias/Coverage

So I have adapted the same simulation code I used in prior studies/blog posts to evaluate the null distribution and the coverage of David’s IRR estimator. I partly did not pursue it initially back when me and Jerry were discussing this idea, because I thought it would be biased. Generalized linear models are based on maximum likelihood estimators, which are only asymptotically valid. In short it appears I was wrong here and David’s IRR estimator is fine even with just four observations, at least for the handful of scenarios I have tried it (have not looked at very tiny counts of crime, it is undefined if any cell has 0 crimes, as you cannot take the log of 0).

Python code pasted at the very end of the blog post, but for example if we generate a set of null no changes pre/post with a baseline of 50 crimes, the logged irr estimate (converted into a z-score here) is just fine and dandy and has a very close to standard normal distribution based on 10k simulations.

So lets look at the scenario where the control area doesn’t change, but the treated area goes from 50 to 30. We can see again the point estimate in this scenario is spot on the money.

And then we can see the coverage of the logged irr estimator is spot on as well:

So if you are interested in slightly different baseline scenarios, you can use that same simulation code to check out the behavior of David’s estimator and conduct simulated power analysis the same way I have shown for the WDD estimator in prior blog posts.

So if both are unbiased and have good coverage again, why would you prefer the WDD estimator over the IRR estimator (or vice-versa)? Well, lets take the 35% reduction I talked about at the beginning of the post, and the department needs to spend $250k on extra officers to conduct whatever hot spot policing intervention. A 35% reduction may be worth it if we start with a baseline of 200 crimes (so would expect to go down to 130, for a reduction of 70 crimes). If the baseline is 20 crimes, it goes down to 13 crimes (so only a reduction of 7 crimes). The actual benefit of the IRR estimate is entirely dependent on the baseline count of crimes it is applied to.

Even if the IRR estimate is itself unbiased and has proper coverage, for even an individual study backing out the estimated reduction in total crimes from the IRR is biased. So here in this same simulated data (50 to 30 in treated, and 50 to 50 in control areas). The true count reduction is -20, and here is the point estimate on the X axis and the length of the confidence interval for each simulation on the Y axis for my WDD test. You can see they are nicely centered on -20, and the length of the confidence intervals has a very tiny variance – they are mostly just a smidge over 50 in total length. So that is probably tough to wrap your head around, but the variance of the variance estimates for the WDD are small.

Now lets do the same graph for the IRR estimate, but translated back out to a count crime reduction based on the simulated values:

We either have a ton of bias in this estimate (if the estimate of the count reduction is too large, the confidence interval is too small). Or the opposite, the estimate of the count reduction is too small, and the confidence interval is crazy wide. In Andrew Gelman’s terminology, it can result in pretty large type M (magnitude) errors in this simulated example (Gelman & Carlin, 2014). So the variance of the variance estimates in this scenario are quite large.

To be clear – if you are interested in estimating a percent reduction, by all means use David’s IRR estimator. If you however want to translate that percent reduction into an estimate of the total crimes reduced though you should use my WDD estimator in that case. You should not back out a total crimes reduced estimate from the IRR.

Final Thoughts

So I have said a few times I think the IRR estimator makes more sense for meta-analyses. Why do I think that? Well, imagine we have an underlying causal process through which a hot spots policing experiment can randomly deter/prevent a particular proportion of crimes. That underlying causal process suggests an IRR effect. And also the problem I mention with translating back to crime counts I believe should get smaller with tighter estimates.

For a causal process that is more akin to my WDD estimator, imagine some crimes will always be deterred/prevented from a hot spots policing experiment, and some will never be. And we don’t know up-front which is which, so the observed reduction is based on whatever mixture of the two we have at that particular location.

The proportion reduction seems to make more sense to me for active patrol type interventions (which are ephemeral) vs permanent CPTED like interventions which should prevent certain criminal acts in perpetuity. But of course any situation in the real world could have both occurring at the same time.

When you go and look at the meta-analysis of hot spots policing, those interventions are all over the place (Hinkle et al., 2020). I think my WDD estimate would not make sense to mash up into a final meta-analytic estimate. The IRR may not make sense either in the end, but it is plausibly more relevant to compare the IRRs from a study with a baseline of 200 crimes vs one with 40 crimes at baseline. I am not sure it makes sense to compare WDDs in that scenario. But that being said, a few of my blog posts have discussed the WDD normalized per unit area or per unit time. Those normalized estimates are probably more apples to apples in the 200 vs 40 scenario.

A final note I have not discussed here is that David discusses a correction for overdispersion, so that is a potential feather in the cap for his estimator vs the WDD. I’d be a bit hesitant though with that – only four observations to estimate the dispersion term is slicing it a bit thin IMO. But I was wrong about the original estimator, so I may be wrong about that as well. It will take simulation evidence to determine that though – David’s paper just provides the correction term, he doesn’t provide evidence for its utility with small sample data.

And to be fair I have not done simulations to see how my estimator behaves in the presence of overdispersion either. I believe it will simply just cause the standard errors to be too small, so like in Wheeler (2016), I imagine it will just require upping the interval (e.g. use a z-score of 3 instead of 2) to get proper coverage for real crime data.

References

Other Posts of Interest

Python simulation code

Here is a copy-pasted chunk of the entire python simulation code.

'''
Comparing WDD to log(IRR) from Wilson's
recent paper, https://link.springer.com/article/10.1007/s10940-021-09494-w

Andy Wheeler
'''

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform
import matplotlib
import matplotlib.pyplot as plt
import os
my_dir = r'D:\Dropbox\Dropbox\Documents\BLOG\wdd_vs_irr'
os.chdir(my_dir)

#########################################################
#Settings for matplotlib

andy_theme = {'axes.grid': True,
              'grid.linestyle': '--',
              'legend.framealpha': 1,
              'legend.facecolor': 'white',
              'legend.shadow': True,
              'legend.fontsize': 14,
              'legend.title_fontsize': 16,
              'xtick.labelsize': 14,
              'ytick.labelsize': 14,
              'axes.labelsize': 16,
              'axes.titlesize': 20,
              'figure.dpi': 100}

matplotlib.rcParams.update(andy_theme)
#########################################################


#This works for the scipy functions as well
np.random.seed(seed=10)

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    # WDD estimates
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    # Wilson log IRR estimates
    true_logirr = np.log( (treat1*cont0) / (cont1*treat0) )
    est_logirr = np.log( ((tr_cr_1/post)*(co_cr_0/pre)) / ( (co_cr_1/post)*(tr_cr_0/pre) ) )
    se_logirr = np.sqrt( 1/tr_cr_1 + 1/co_cr_0 + 1/co_cr_1 + 1/tr_cr_0 )
    z_logirr = est_logirr / se_logirr
    return (tr_cr_0, co_cr_0, tr_cr_1, co_cr_0, est, var_est, true_val, z_score, true_logirr, est_logirr, se_logirr, z_logirr)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
    else:
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
    else:
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
    else:
        base['cont1'] = base['cont0']
    base.drop(columns='index',inplace=True)
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['sim_t0','sim_c0','sim_t1','sim_c1','est','var_est','true_val','z_score',
                        'true_logirr','est_logirr','se_logirr','z_logirr']
    return pd.concat([base,sim_vals], axis=1)

# Coverage of the log irr estimate
# Lets look at the coverage rate for a decline from 40 to 20
def cover_logirr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    cover = ( data['true_logirr'] > low) & ( data['true_logirr'] < high )
    return cover

# Length of ci for WDD
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return low, high, high - low

# Length of ci for IRR estimate on count scale
# This depends on the baseline estimate to multiply
# The IRR by, using the baseline average of the 
# Treatment area

def len_irr(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*data['se_logirr']
    low = data['est_logirr'] - dif
    high = data['est_logirr'] + dif
    baseline = data['sim_t0']/data['pre']
    # Even if you use hypothetical, the variance is quite high
    #baseline = data['treat0']
    est_count = baseline*np.exp(data['est_logirr']) - baseline
    c1 = baseline*np.exp(low) - baseline
    c2 = baseline*np.exp(high) - baseline
    return est_count, c1, c2, np.abs(c2 - c1)

##########################
# Example with no change, lets look at the null distribution
sim_n = 10000
no_diff = make_data(sim_n, 50, 50, 50, 50, 1, 1)
no_diff['z_logirr'].describe()
##########################

##########################
# Example with equal time periods, a reduction from 50 to 30 and 50 to 50 in control area
sim_dat = make_data(sim_n, 50, 30, 50, 50, 1, 1)
sim_dat[['true_logirr','est_logirr','se_logirr']].describe()

cl = cover_logirr(sim_dat)
cl.mean()

# Compare length of CI for IRR vs WDD

# WDD length
lowdd, highwdd, lwdd = len_ci(sim_dat)
lwdd.describe()

# IRR length on the count scale
est_cnt_irr, lo_irr, hi_irr, ln_irr = len_irr(sim_dat)
ln_irr.describe()

# Scatterplot of estimated count reduction vs
# Length of CI
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(est_cnt_irr, ln_irr, c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Estimated Count Reduction [IRR]')
ax.set_ylabel('Length of CI on count scale [IRR]')
plt.savefig('IRR_Len_Est.png', dpi=500, bbox_inches='tight')
plt.show()

# Lets compare to the WDD estimate
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(sim_dat['est'], lwdd, c='k', 
            alpha=0.1, s=4)
ax.set_axisbelow(True)
ax.set_xlabel('Estimated Count Reduction [WDD]')
ax.set_ylabel('Length of CI on count scale [WDD]')
plt.savefig('WDD_Len_Est.png', dpi=500, bbox_inches='tight')
plt.show()
##########################

Clumpy hotspots

Read an article by Tim Hart the other day (part of a special issue I will have an article in as well here soon). In it he evaluated hot spot methods not only by how well they forecast crime, but also by the clumpiness of the hot spot method. Some hot spot methods, such as risk terrain modeling (Caplan et al., 2020; Fox et al., 2021), machine learning models (Wheeler & Steenbeek, 2020), or self-exciting point process models (Mohler et al., 2018) can by their nature produce discontinuous hot spots. Here is an example of a RTM map I made in Yoo & Wheeler (2019) for homeless related crime in Los Angeles, and you can see this is quite spotty in the ups/downs in the high risk areas:

Other hot spot methods, like hierarchical clustering (Wheeler & Reuter, 2020) or kernel density maps however this is not as big an issue. Here is an example kernel density map also from Yoo & Wheeler (2019) based on the same data:

So you can see how the hot spots in the kernel density map are spatially contiguous, whereas the RTM example can be little hot spots all over the jurisdiction. So it is obviously easier to patrol a single contiguous area than many islands over the entire jurisdiction. So it may make sense to trade off a contiguous area that captures somewhat fewer crimes than speckled areas that are all over the map.

Adepeju et al. (2016) was the first to use a particular statistic, the clumpiness index, to evaluate different hot spot methods. Their figure below is a pretty good depiction of the idea – count up the number of internal edges to a hot spot (when a hot spot grid-cell neighbors another hot spot), and the number of external edges. Then it is just a particular formula to make the index range from -1 to 1 given different sized hot spots.

So here I flip this idea on its head abit – instead of using a particular hot spot technique and see its clumpiness, I formulate a linear program given a prediction to trade off a smaller number of predicted crimes in the hot spot vs making the hot spot areas more clumpy. I illustrate my clumpy hot spots using just prior data to predict future data, in particular thefts from motor vehicles in Raleigh North Carolina.

I have posted the data/code on github here. It is a bit too long to embed the code directly in the blog post, but just see the 00_PrepData.py file. The crime data and Raleigh border I downloaded from the Raleigh open data website.

A Linear Program to Clump Hot Spots

So for some quick and dirty math in text, the linear program I formulate is:

Maximize { Sum[ theta*S_i*Crime_i + (1 - theta)*E_i ] }
Subject To:
    1) Sum( S_i ) = k
    2) E_i <= Sum(S_n for n in neighbors(i) ) for each i
    3) E_i <= S_i for each i
    4) S_i element of {0,1}, E_i >= 0 (and can be continuous)

The idea behind this is that if theta=1, this is the same as just taking whatever your input areas are and ranking them to pick the top k areas. So if you have 10000 500 by 500 foot grid cells as your spatial units of analysis, and you wanted the top 1% of the city, that is 100 grid cells. So you would choose k=100 in that scenario. Crime_i here I use as prior counts of crime in the grid cell, but it could be the predicted value from whatever model as well. That is the first constraint in this model approach – you need to choose the total area you want. S_i are the decision variables for the final selected hot spot areas.

The second and third constraints determine the values for the second set of decision variables, E_i. These are the decision variables that encode the interconnected links when a selected grid cell touches another grid cell. Constraint 2 sets E_i to the total number of neighbors of i that are selected, except constraint 3 says if S_i is 0 E_i needs to be 0 as well.

In this formulation, S_i need to be integer variables, but the E_i are defined by the sum of S_i, so they can be continuous. In this formulation if you have N grid cells (or whatever spatial units of analysis), this results in 2*N decision variables, and 2*N + 1 constraints. You could maybe save a few constraints here by working with an undirected graph instead of a directed one (in essence this double counts, a-b and b-a would count as two links). But this will just make it 1.5*N constraints instead of 2*N. So not a big deal probably. I did have some issues solving this using pulps default coin/GLPK solver. But CPLEX solved it no problem. (My example is a total of 20,986 500 by 500 foot grid cells, and I use rook contiguity like the Adepeju article as well. And using CPLEX it solves the models in just a few seconds.)

In this formulation you can think of theta as trading off crimes in the hot spot vs interior edges. So imagine you had theta=0.9, and you had a solution with 200 crimes and 100 interior edges. The objective function in that scenario would be 0.9*200 + 0.1*100 = 190. Now imagine you had an alternative scenario with 190 crimes, but 200 internal edges, the objective function would be 0.9*190 + 0.1*200 = 191. So you are saying, it is ok to have hot spots capture a smaller number of crimes, if they are more connected.

Normal Hotspots vs Clumpy Ones in Raleigh

The open data I use for Raleigh, North Carolina for the NIBRS dataset goes back to June 2014, and has data updated in the beginning of March 2021. I pull out larcenies from motor vehicles, and for the historical train dataset use car larcenies from 2014 through 2019 (n = 17,681). For the test dataset I use car larcenies in 2020 and what is available so far in 2021 (n = 3,376). Again these are grid cells generated over the city boundaries at 500 by 500 foot intervals. For illustration I grab out the top 1% of the city (209 grid cells). I use a train/test dataset as out of sample test data will typically result in reduced predictions. Here are the PAI stats for train vs test when selecting the top 1%.

For all subsequent selections I always use the historical training data to select the hot spots, and the test dataset to evaluate the PAI.

If we do the typical approach of just taking the highest crime grid cells based on the historical data, here are the results both for the PAI and the CI (clumpy index). For those not familiar, PAI is % Crime Capture/% Area, so if the denominator is 1%, and the PAI (for the test data) is 17, that means the hot spots capture 17% of the total thefts from vehicles. The CI ranges from -1 (spread apart) to 1 (entirely clustered). Here it is just over 0, suggesting these are basically randomly distributed in terms of clustering.

You may think that almost spatial randomness in terms of clumping seems at odds with that crime clusters! But it is not really – a consistent relationship with crime hot spots is that they are intensely localized, and often you can go down the street and be in a low crime area (Harries, 2006). The same idea when people say high crime neighborhoods often are spotty interior – they tend to have mostly low crime areas and just a few specific hot spots.

OK, so now to show off my linear program. So what happens if we use theta=0.9?

The total crime numbers are here for the historical data, and it ends up capturing the exact same number of crimes as the select top 1% does (3,664). But, it switches the selection of one of the areas. So what happens here is that we have ties – even with basically little weight assigned to the interior connections, it will prioritize tied crime areas to be connected to other chosen hot spots (whereas before the ties are just random in the way I chose the top 1%). So if you have many ties at the threshold for your hot spot, this is a great way to prioritize particular tied areas.

What happens if we turn down theta to 0.5? So this is saying you would trade off one for one – one interior edge is equal to one crime.

You can see that it changed the selections slightly more here, traded off 24 areas compared to the original just rank solution. Lets check out the map and the CI:

The CI value is now 0.17 (up from 0.08). You can see some larger blobs, but it is still pretty spread apart. But the reduction in the total number of crimes captured is pretty small, going from a PAI of 17 to now a PAI of 16. How about if we crank down theta even more to 0.2?

This trades off a much larger number of areas and total amount of crime – over half of the chosen grid cells are flipped in this scenario. In the subsequent map you can see the hot spots are much more clumpy now, and have a CI of 0.64.

The PAI of 12.6 is a bit of a hit as well, but is not too shabby still. I typically take a PAI of 10 to be the ballpark of what is reasonable based on Weisburd’s Law of Crime Concentration – 5% of the areas contain 50% of the crime (which is a PAI of 10).

So this shows one linear programming approach to trade off clumpy chosen areas vs disconnected speckles over the map. It may be the case though that other approaches are more reasonable, such as using some type of clustering to begin with. E.g. I could use DBSCAN on the gridded predicted values (Wheeler & Reuter, 2020) as see how clumpy those hot spots are. This approach is nice though if you have a fixed area you want to cover though.

Why Raleigh?

For a bit of personal news, I will be moving to the Raleigh area here shortly. I recently negotiated to be 100% remote at my job – so I will still be at HMS (or since we were recently purchased I might be employed by Gainwell I guess by the time I move). So looking forward to the new adventure back on the east coast but still in more temperate climates than PA or NY!

References

Transforming predicted variables in regression

The other day on LinkedIn I made a point about how I think scikits TransformedTargetRegressor is very likely to mislead folks. In fact, the example use case in the docs for this function is a common mistake, fitting a model for log(y), then getting predictions phat, and then simply exponentiating those predictions exp(phat).

On LinkedIn I gave an example of how this is problematic for random forests, but here is a similar example for linear regression. For simplicity pretend we only have 3 potential residuals (all equally likely), either a residual of -1, 0, or 1.

Now pretend our logged prediction is 5, so if we simply do exp(5) we get about 148. Now what are our predictions is we consider those 3 potential residuals?

Resid  Pred-Resid Modified_Pred LinPred
  -1     5 - -1        exp(6)     403
   0     5 -  0        exp(5)     148
   1     5 -  1        exp(4)      55

So if we take the mean of our LinPred column, we then get a prediction of about 202. The prediction using this approach is much higher than the naive approach of simply exponentiating 5. The difference is that the exp(5) estimate is the median, and the above estimate taking into account residuals is the mean estimate.

While there are some cases you may want the median estimate, in that case it probably makes more sense to use a quantile estimator of the median from the get go, as opposed to doing the linear regression on log(y). I think for many (probably most) use cases in which you are predicting dollar values, this underestimate can be very problematic. If you are using these estimates for revenue, you will be way under for example. If you are using these estimates for expenses, holy moly you will probably get fired.

This problem will happen for any non-linear transformation. So while some transformations are ok, in scikit for example minmax or standardnormal scalars are ok, things like logs, square roots, or box-cox transformations are not. (To know if it is a linear transformation, if you do a scatterplot of original vs transformed, if it is a straight line it is ok, if it is a curved line it is not!)

I had a friend go back and forth with me for a bit after I posted this. I want to be clear this is not me saying the model of log(y) is the wrong model, it is just to get the estimates for the mean predictions, you need to take a few steps. In particular, one approach to get the mean estimates is to use Duan’s Smearing estimator. I will show how to do that in python below using simulated data.

Example Duan’s Smearing in python

So first, we import the libraries we will be using. And since this is simulated data, will be setting the seed as well.

######################################################
import pandas as pd
import numpy as np
np.random.seed(10)

from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
######################################################

Next I will create a simple linear model on the log scale. So the regression of the logged values is the correct one.

######################################################
# Make a fake dataset, say these are housing prices
n = (10000,1)
error = np.random.normal(0,1,n)
x1 = np.random.normal(10,3,n)
x2 = np.random.normal(5,1,n)
log_y = 10 + 0.2*x1 + 0.6*x2 + error
y = np.exp(log_y)

dat = pd.DataFrame(np.concatenate([y,x1,x2,log_y,error], axis=1),
                   columns=['y','x1','x2','log_y','error'])
x_vars = ['x1','x2']

# Lets look at a histogram of y vs log y
dat['y'].hist(bins=100)
dat['log_y'].hist(bins=100)
######################################################

Here is the histogram of the original values:

And here is the histogram of the logged values:

So although the regression is the conditional relationship, if you see histograms like this I would also by default use a regression to predict log(y).

Now here I do the same thing as in the original function docs, I fit a linear regression using the log as the function and exponential as the inverse function.

######################################################
# Now lets see what happens with the usual approach
tt = TransformedTargetRegressor(regressor=LinearRegression(),
                                func=np.log, inverse_func=np.exp)
tt.fit(dat[x_vars], dat['y'])
print( (tt.regressor_.intercept_, tt.regressor_.coef_) ) #Estimates the correct values

dat['WrongTrans'] = tt.predict(dat[x_vars])

dat[['y','WrongTrans']].describe()
######################################################

So here we estimate the correct simulated values for the regression equation:

But as we will see in a second, the exponentiated predictions are not so well behaved. To illustrate how the WrongTrans variable behaves, I show its distribution compared to the original y value. You can see that on average it is a much smaller estimate. Our sample values have a mean of 7.5 million, and the naive estimate here only has a mean of 4.6 million.

Now here is a way to get an estimate of the mean value. In a nutshell, what you do is take the observed residuals, pretty much like that little table I did in the intro of this blog post, generate predictions given those residuals, and then back transform them and take the mean.

Although this example is using logged regression, I’ve made it pretty general. So if you used any box cox transformation instead of the logged (e.g. sklearns power_transform, it will work.

######################################################
# Duan's smearing, non-parametric approach via residuals

# Can make this general for any function inside of 
# TransformedTargetRegressor
f = tt.get_params()['func']              #function
inv_f = tt.get_params()['inverse_func']  #and inverse function

# Non-parametric approach, approximate via residuals
# Using numpy broadcasting
log_pred = f(dat['WrongTrans'])
resids = f(dat['y']) - log_pred
resids = resids.values.reshape(1,n[0])
dp = inv_f(log_pred.values.reshape(n[0],1) + resids)
dat['DuanPreds'] = dp.mean(axis=1)

dat[['y','WrongTrans','DuanPreds']].describe()
######################################################

So you can see that the Duan Smeared predictions are looking better, at least the mean of the predictions is much closer to the original.

I’ve intentionally done this example without using train/test, as we know the true answers. But in that case, you will want to use the residuals from the training dataset to apply this transformation to the test dataset.

So the residuals and the Duan smearing estimator do not need to be the same dimension. So for example if you have a big data application, you may want to do something like resids = resids.sample(1000) above.

Also another nice perk of this is you can use dp above to give you prediction intervals, so np.quantile(dp,[0.025,0.975], axis=1).T would give you a 95% prediction interval of the mean on the linear scale as well.

Extra, Parametric Estimation

Another approach, which may make sense given the application, is instead of using the observed residuals to give a non-parametric estimate, you can estimate the distribution of the residuals, and then use that to make either an integral estimate of the Smeared estimate back on the original scale. Or in the case of the logged regression there is a closed form solution.

I show how to construct the integral estimator below, again trying to be more general. The integral approach will work for say any box-cox transformation.

######################################################
# Parametric approach, approximating residuals via normal

from scipy.stats import norm
from scipy.integrate import quad

# Look at the residuals again
resids = f(dat['y']) - f(tt.predict(dat[x_vars]))

# Check to make sure that the residuals are really close to normal
# Before doing this
resids.hist(bins=100)

# Fit to a normal distribution 
loc, scale = norm.fit(resids)

# Define integral
def integrand(x,pred):
    return norm.pdf(x, loc, scale)*inv_f(pred - x)

# Pred should be the logged prediction
# -50,50 should be changed if the residuals are scaled differently
def duan_param(pred):
    return quad(integrand, -50, 50, args=(pred))[0]

# This takes awhile to apply to the whole data frame!
dat['log_pred'] = f(tt.predict(dat[x_vars]))
sub_dat = dat.head(100).copy()
sub_dat['DuanParam'] = sub_dat['log_pred'].apply(duan_param)

# Can see that these are very similar to the non-parametric
print( sub_dat[['DuanPreds','DuanParam']].head(10) )

And you can see that this normal based approximation works just fine here, since by construction the model residuals are pretty well behaved in my simulation.

It happens to be the case that there is a simpler estimate than the integral approach (which you can see in my notes takes awhile to estimate).

###########
# Easier way, but only applicable to log transform
# https://en.wikipedia.org/wiki/Smearing_retransformation
test_val = np.log(5000000)

# Integral approach
print( duan_param(test_val) ) 

# Approach for just log transformed
mult = np.exp(0.5*resids.var())
print( np.exp(test_val)*mult )
##########

So you can see the integral vs the closed form function are very close:

The differences could be due to the the integral is simply an estimate (and you can see I did not do negative to positive infinity, but chopped it off, I do not know if there is a better function to estimate the integral or general approach here).

It wouldn’t surprise me if there are closed form solutions for box-cox transforms as well, but I am not familiar with them offhand. Again the integral approach (or the non-parametric approach) will work for whatever function you want. The function itself could be whatever crazy/discontinuous function you want. But this parametric Duan’s Smearing approach relies on the residuals being normally distributed. (I suppose you could use some other types of continuous distribution estimate if you have reason to, I have only seen normal distribution estimates though in practice.)

Other Notes

While this focuses on regression, I do not think this will perform all that badly for other types of models (such as random forests or xgboost). But for forests it may make sense to simply pull out the individual tree estimates, back transform them, and get the mean of that backtransformed estimate. I have a different blog post that has a function showing how to scoop up the individual predictions from a random forest model.

It should also apply the same to any regression model with regularization. But if you want to do this, there are of course other alternative models you may consider that may be better suited towards your end goals of predictions on the linear/original scale.

For example, if you really want prediction intervals, it may make sense to not transform the data, and estimate a quantile regression model at the 5% and 95% quantiles. This would give you a 90% prediction interval.

Another approach is that it may make sense to use a different model, such as Poisson regression or negative binomial regression (or another generalized linear model in general). Even if your data are not integer counts, you can still use these models! (They just need to be 0 and above, no negative values.)

That Stata blog suggests to use Poisson and then robust standard errors, but that is a bad idea if you are really interested in predictions as well (see Gary Kings comment and linked paper). But you can just do negative binomial models in most cases then, and that is a better default than Poisson for many real world datasets.

Geocoding the CMS NPI Registry (python)

So previously I wrote out creating service deserts. I have since found a nicer data source to use for this, the NPI CMS registry. This data file has over 6 million service providers across the US.

Here I will provide an example of using that data to geocode all the pharmacy’s in Texas, again using the census geocoding API and python.

Chunking up the NPI database

So first, you can again download the entire NPI database from here. So I have already downloaded and unzipped that file, which contains a CSV for the January version, named npidata_pfile_20050523-20210110.csv. So as some upfront, here are the libraries I will be using, and I also set the directory to where my data is located.

###############################
import pandas as pd
import numpy as np
import censusgeocode as cg
import time
from datetime import datetime
import os
os.chdir(r'D:\HospitalData\NPPES_Data_Dissemination_January_2021')
###############################

The file is just a bit too big for me to fit in memory on my machine. On Windows, you can use Get-Content npidata_pfile_20050523-20210110.csv | Measure-Object -Line in powershell to get the line counts, or on Unix use wc -l *.csv for example. So I know the file is not quite 6.7 million rows.

So what I do here is create a function to read in the csv file in chunks, only select the columns and rows that I want, and then return that data frame. In the end, you need to search across all of the Taxonomy codes to pull out the type of service provider you want. So for community pharmacies, the code is 3336C0003X, but it is not always in the first Taxonomy slot (some CVS’s have it in the second slot for example). You can see the big list of taxonomy codes here, so my criminology friends may say be interested in mental health or substance abuse service providers for other examples.

In addition to the taxonomy code, I always select organizations, not individuals (Entity Type = 2). And then I only select out pharmacies in Texas (although I bet you could fit all of the US pharmacies in memory pretty easily, maybe 60k in total?) Caveat emptor, I am not 100% sure how to use the deactivation codes properly in this database, as that data is always NaN for Texas pharmacies.

######################################################################
# Prepping the input data in chunks

keep_col = ['NPI','Entity Type Code','Provider Organization Name (Legal Business Name)',
            'NPI Deactivation Reason Code','NPI Deactivation Date','NPI Reactivation Date',
            'Provider First Line Business Practice Location Address',
            'Provider Business Practice Location Address City Name',
            'Provider Business Practice Location Address State Name',
            'Provider Business Practice Location Address Postal Code']
            
taxon_codes = ['Healthcare Provider Taxonomy Code_' + str(i+1) for i in range(15)]
keep_col += taxon_codes
community_pharm = '3336C0003X'
npi_csv = 'npidata_pfile_20050523-20210110.csv' #Newer files will prob change the name

# This defines the rows I want
def sub_rows(data):
    ec = data['Entity Type Code'] == "2"
    st = data['Provider Business Practice Location Address State Name'] == 'TX'
    ta = (data[taxon_codes] == community_pharm).any(axis=1)
    #ac = data['NPI Deactivation Reason Code'].isna()
    all_together = ec & st & ta #& ac
    sub = data[all_together]
    return sub

def csv_chunks(file,chunk_size,keep_cols,row_sub):
    # First lets get the header and figure out the column indices
    header_fields = list(pd.read_csv(npi_csv, nrows=1))
    header_locs = [header_fields.index(i) for i in keep_cols]
    # Now reading in a chunk of data
    skip = 1
    it_n = 0
    sub_n = 0
    ret_chunk = chunk_size
    fin_li_dat = []
    while ret_chunk == chunk_size:
        file_chunk = pd.read_csv(file, usecols=header_locs, skiprows=skip, 
                     nrows=chunk_size, names=header_fields, dtype='str')
        sub_dat = row_sub(file_chunk)
        fin_li_dat.append( sub_dat.copy() )
        skip += chunk_size
        it_n += 1
        sub_n += sub_dat.shape[0]
        print(f'Grabbed iter {it_n} total sub n so far {sub_n}')
        ret_chunk = file_chunk.shape[0]
    fin_dat = pd.concat(fin_li_dat, axis=0)
    return fin_dat


# Takes about 3 minutes
print( datetime.now() )
pharm_tx = csv_chunks(npi_csv, chunk_size=1000000, keep_cols=keep_col, row_sub=sub_rows)
print( datetime.now() )

# No deactivated codes in all of Texas
print( pharm_tx['NPI Deactivation Reason Code'].value_counts() )
######################################################################

So this ends up returning not quite 6800 pharmacies in all of Texas.

Geocoding using the census API

So first, the address data is pretty well formatted. But for those new to geocoding, if you have end parts of address strings like Apt 21 or Suite C, those endings will typically throw geocoders off the mark. So in just a few minutes, I noted the different strings that marked the parts of the addresses I should chop off, and wrote a function to clean those up. Besides that I just limit the zip code to 5 digits, as that field is a mix of 5 and 9 digit zipcodes.

######################################################################
# Now prepping the data for geocoding

ph_tx = pharm_tx.drop(columns=taxon_codes).reset_index(drop=True)

#['Provider First Line Business Practice Location Address', 'Provider Business Practice Location Address City Name', 'Provider Business Practice Location Address State Name', 'Provider Business Practice Location Address Postal Code']

# I just looked through the files and saw that after these strings are not needed
end_str = [' STE', ' SUITE', ' BLDG', ' TOWER', ', #', ' UNIT',
           ' APT', ' BUILDING',',', '#']

 
def clean_add(address):
    add_new = address.upper()
    for su in end_str:
        sf = address.find(su)
        if sf > -1:
            add_new = add_new[0:sf]
    add_new = add_new.replace('.','')
    add_new = add_new.strip()
    return add_new

# Some examples
clean_add('5700 S GESSNER DR STE G')
clean_add('10701-B WEST BELFORT SUITE 170')
clean_add('100 EAST UNIVERSITY BLVD.')
clean_add('5800 BELLAIRE BLVD BLDG 1')
clean_add('2434 N I-35 # S')

ph_tx['Zip5'] = ph_tx['Provider Business Practice Location Address Postal Code'].str[0:5]
ph_tx['Address'] = ph_tx['Provider First Line Business Practice Location Address'].apply(clean_add)
ph_tx.rename(columns={'Provider Business Practice Location Address City Name':'City',
                      'Provider Business Practice Location Address State Name':'State2'},
             inplace=True)
######################################################################

Next is my function to use the batch geocoding in the census api. Note the census api is a bit finicky – technically the census api says you can do batches of up to 5k rows, but I tend to get kicked off for higher values. So here I have a function that chunks it up into tinier batch portions and submits to the API. (A better function would cache intermediate results and wrap all that jazz in a try function.)

 ######################################################################
 #This function breaks up the input data frame into chunks
 #For the census geocoding api
 def split_geo(df, add, city, state, zipcode, chunk_size=500):
     df_new = df.copy()
     df_new.reset_index(inplace=True)
     splits = np.ceil( df.shape[0]/chunk_size)
     chunk_li = np.array_split(df_new['index'], splits)
     res_li = []
     pick_fi = []
     for i,c in enumerate(chunk_li):
         # Grab data, export to csv
         sub_data = df_new.loc[c, ['index',add,city,state,zipcode]]
         sub_data.to_csv('temp_geo.csv',header=False,index=False)
         # Geo the results and turn back into df
         print(f'Geocoding round {int(i)+1} of {int(splits)}, {datetime.now()}')
         result = cg.addressbatch('temp_geo.csv') #should try/except?
         # May want to dump the intermediate results
         #pi_str = f'pickres_{int(i)}.p'
         #pickle.dump( favorite_color, open( pi_str, "wb" ) )
         #pick_fi.append(pi_str.copy())
         names = list(result[0].keys())
         res_zl = []
         for r in result:
             res_zl.append( list(r.values()) )
         res_df = pd.DataFrame(res_zl, columns=names)
         res_li.append( res_df.copy() )
         time.sleep(10) #sleep 10 seconds to not get cutoff from request
     final_df = pd.concat(res_li)
     final_df.rename(columns={'id':'row'}, inplace=True)
     final_df.reset_index(inplace=True, drop=True)
     # Clean up csv file
     os.remove('temp_geo.csv')
     return final_df
 ######################################################################

And now we are onto the final stage, actually running the geocoding function, and piping the end results to a csv file. (Which you can see the final version here.)

######################################################################
# Geocoding the data in chunks

# Takes around 35 minutes
geo_pharm = split_geo(ph_tx, add='Address', city='City', state='State2', zipcode='Zip5', chunk_size=100)

#What is the geocoding hit rate?
print( geo_pharm['match'].value_counts() )
# Only around 65%

# Now merging back with the original data if you want
# Not quite sorted how I need them
geo_pharm['rowN'] = geo_pharm['row'].astype(int)
gp2 = geo_pharm.sort_values(by='rowN').reset_index(drop=True)

# Fields I want
kg = ['address','match','lat','lon']
kd = ['NPI',
      'Provider Organization Name (Legal Business Name)',
      'Provider First Line Business Practice Location Address',
      'Address','City','State2','Zip5']

final_pharm = pd.concat( [ph_tx[kd], gp2[kg]], axis=1 )

final_pharm.to_csv('Pharmacies_Texas.csv',index=False)
######################################################################

Unfortunately the geocoding hit rate is pretty disappointing, only around 65% in this sample. So if I were using this for a project, I would likely do a round of geocoding using the Google API (which is a bit more unforgiving for varied addresses), or perhaps build my own openstreet map geocoder for the US. (Or in general if you don’t have too many to review, doing it interactively in ArcGIS is very nice as well if you have access to Arc.)

Filled contour plot in python

I’ve been making a chart that looks similar to this for a few different projects at work, so figured a quick blog post to show the notes of it would be useful.

So people often talk about setting a decision threshold to turn a predicted probability into a binary yes/no decision. E.g. do I do some process to this observation if the probability is 20%, 30%, 60%, etc. If you can identify the costs and benefits of making particular decisions, you can set a simple threshold to make that decision. For example, say you are sending adverts in the mail for a product. If the person buys the product, your company makes $50, and the advert only costs $1 to send. In this framework, if you have a predictive model for the probability the advert will be successful, then your decision threshold will look like this:

$50*probability - $1

So in this case you need the predicted probability to be above 2% to have an expected positive return on the investment of sending the advert. So if you have a probability of 10% for 2000 customers, you would expect to make 2000 * (50*0.1 - 1) = 8000. The probabilities you get from your predictive model can be thought of as in the long run averages. Any single advert may be a bust, but if your model is right and you send out a bunch, you should make this much money in the end. (If you have a vector of varying probabilities, in R code the estimated revenue will then look like prob <- runif(2000,0,0.1); pover <- prob > 0.02; sum( (50*prob - 1)*pover ).)

But many of the decisions I work with are not a single number in the benefits column. I am working with medical insurance claims data at HMS, and often determining models to audit those claims in some way. In this framework, it is more important to audit a high dollar claim than a lower dollar claim, even if the higher dollar value claim has a lower probability. So I have been making the subsequent filled contour plot I am going to show in the next section to illustrate this.

python contour plot

The code snippet is small enough to just copy-paste entirely. First, I generate data over a regular grid to illustrate different claim amounts and then probabilities. Then I use np.meshgrid to get the data in the right shape for the contour plot. The revenue estimates are then simply the probability times the claims amount, minus some fixed (often labor to audit the claim) cost. After that is is just idiosyncratic matplotlib code to make a nice filled contour.

# Example of making a revenue contour plot
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np

n = 500 #how small grid cells are
prob = np.linspace(0,0.5,n)
dollar = np.linspace(0,10000,n)
#np.logspace(0,np.log10(10000),n) #if you want to do logged

# Generate grid
X, Y = np.meshgrid(prob, dollar)

# Example generating revenue
fixed = 200
Rev = (Y*X) - fixed

fig, ax = plt.subplots()
CS = ax.contourf(X, Y, Rev, cmap='RdPu')
clb = fig.colorbar(CS)
#clb.ax.set_xlabel('Revenue') #Abit too wide
clb.ax.set_title('dollar') #html does not like the dollar sign
ax.set_xlabel('Probability')
ax.set_ylabel('Claim Amount')
ax.yaxis.set_major_formatter(StrMethodFormatter('${x:,.0f}'))
plt.title('Revenue Contours')
plt.xticks(np.arange(0,0.6,0.1))
plt.yticks(np.arange(0,11000,1000))
plt.annotate('Revenue subtracts $200 of fixed labor costs',
(0,0), (0, -50),
xycoords='axes fraction',
textcoords='offset points', va='top')
#plt.savefig('RevContour.png',dpi=500,bbox_inches='tight')
plt.show()

The color bar does nice here out of the box. Next up in my personal learning will be how to manipulate color bars a bit more. Here I may want to either use a mask to not show negative expected returns, or a diverging color scheme (e.g. blue for negative returns).

Buffers and hospital deserts with geopandas

Just a quick blog post today. As a bit of a side project at work I have been looking into medical service provider deserts. Most people simply use a geographic cutoff of say 1 mile (see Wissah et al., 2020 for example for Pharmacy deserts). Also for CJ folks, John Hipp has done some related work for parolees being nearby service providers (Hipp et al., 2009; 2011), measuring nearby as 2 miles.

So I wrote some code to calculate nice sequential buffer areas and dissolve them in geopandas. Files and code to showcase are here on GitHub. First, as an example dataset, I geocode (using the census geocoding API) CMS certified Home Healthcare facilities, so these are hospice facilities. To see a map of those facilities across the US, and you can click on the button to get info, go to here, CMS HOME FACILITY MAP. Below is a screenshot:

Next I then generate sequential buffers in kilometers of 2, 4, 8, 16, and then the leftover (just for Texas). So you can then zoom in and darker areas are at a higher risk of not having a hospice facility nearby. HOSPICE DESERT MAP

Plotting some of these in Folium were giving me fits, so I will need to familiarize myself with that more in the future. The buffers for the full US as well were giving me trouble (these just for Texas result in fairly large files, surprised Github doesn’t yell at me for them being too big).

Going forward, what I want to do is instead of relying on a fixed function of distance, is to fit a model to identify individuals probability of going to the doctor based on distance. So instead of just saying 1+ mile and you are at high risk, fit a function that defines that distance based on behavioral data (maybe using insurance claims). Also I think the distances matter quite a bit for urban/rural and car/no-car. So rural folks traveling a mile is not a big deal, since you need a car to really do anything in rural areas. But for folks in the city relying on public transportation going a mile or two is a bigger deal.

The model then would be similar to the work I did with Gio on gunshot death risk (Circo & Wheeler, 2020), although I imagine the model would spatially vary (so maybe geographically weighted regression may work out well).

References

The WDD test with different pre/post time periods

Eric Piza asked the other day if my and Jerry’s WDD test can be used when the pre/post time periods are different. The answer is yes out of the box, the identification strategy does not rely on equality of time periods. So for example, say we had two years pre and one year post data, and the crime counts in treated/control looked like this:

         Pre  Post 
Treated   80    20
Control  100    50

So then our difference-in-difference Poisson estimate of the treatment effect would be:

(20 - 80) - (50 - 100) =  -10

What the parallel trends assumption means here is that since you saw a decrease in 50 crimes in the control area, you would expect a decrease of 50 crimes in the treated area as well. The variance of this estimate is then 20 + 80 + 50 + 100 = 250, and so the standard error is sqrt(250) ~ 15.8. So this is not a statistically significant effect.

It is hard to interpret this effect size though, since it is not a standard unit of time comparison. Also the variance of the estimate will be larger if you have a longer pre time period, which is the opposite of what you want. We can actually amend the statistic though to be a per-unit-time comparison, which will reduce the variance of the estimate. It ends up being similar to my prior post on adding Harm Weights to the WDD, you can’t just pipe in the per unit time estimates in the spreadsheet I shared, but I will show here how to incorporate them into the estimator (and share some python code to show the estimator behaves as expected in simulations).

So again with a pre-time period of 2 years, and post of 1 year, we could do the prior table as per year estimates.

         Pre  Post 
Treated   40    20
Control   50    50

And here our estimate of the crime reduction effect is different:

(20 - 40) - (50 - 50) =  -20

So with a Poisson variable with a mean of 100, the variance of that variable is also 100. So here we are dividing that 100 by a constant 2 – this changes the variance to 100/(2^2). (Var(X*a) = a^2*Var(X) where X is a random variable and a is a constant.) The post variables are simply divided by one, so does not change their variance. So to carry this forward to our standard error estimate, we would calculate (edit, had some errors in this part, should be using the original counts, not the average counts, but my later python code was correct, so the simulation part is OK):

20/1 + 80/4 + 50/1 + 100/4 = 115

So you can see that our variance estimate here is much smaller, and that the standard error is sqrt(115) ~ 10.7. So here the reduction is not quite a statistically significant reduction in crimes at the 0.05 level. A 95% confidence interval would be -20 +/- 2*10.7 ~ [+1, -41]. Here the WDD estimate is easier to interpret as well, and that confidence interval corresponds to a per year estimate of somewhere between +1 and -41 crimes.

Below I share some python code to conduct simulations similar to the original WDD paper. This code will then establish the estimator has the null distribution as expected (when there are no changes it really is a standard normal distribution) and that the confidence intervals have coverage like you would expect.

Python Simulation Code

For set up, I import the libraries I need (stat distributions, numpy and pandas). I am not going to go into detail into the functions, but it allows you to generate simulated distributions in various ways to conduct analysis of the properties of my time weighted estimator I have specified above.

'''
WDD Simulation with differing time periods
Andy Wheeler
'''

import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import uniform

#This works for the scipy functions
np.random.seed(seed=10)

# A function to generate the WDD estimate for simulated data
def wdd_sim(treat0,treat1,cont0,cont1,pre,post):
    tr_cr_0 = poisson.rvs(mu = treat0, size=int(pre)).sum()
    co_cr_0 = poisson.rvs(mu = cont0, size=int(pre)).sum()
    tr_cr_1 = poisson.rvs(mu = treat1, size=int(post)).sum()
    co_cr_1 = poisson.rvs(mu = cont1, size=int(post)).sum()
    est = ( tr_cr_1/post - tr_cr_0/pre ) - ( co_cr_1/post - co_cr_0/pre )
    post2 = (1/post)**2
    pre2 = (1/pre)**2
    var_est = tr_cr_0*pre2 + tr_cr_1*post2 + co_cr_0*pre2 + co_cr_1*post2
    true_val = ( treat1 - treat0 ) - ( cont1 - cont0 )
    z_score = est / np.sqrt(var_est)
    return (est, var_est, true_val, z_score)

def make_data(n, treat0, treat1, cont0, cont1, pre, post):
    base = pd.DataFrame( range(n), columns=['index'])
    base['treat0'] = treat0
    if treat1 is not None:
        base['treat1'] = treat1
    else:
        base['treat1'] = base['treat0']
    if cont0 is not None:
        base['cont0'] = cont0
    else:
        base['cont0'] = base['treat0']
    if cont1 is not None:
        base['cont1'] = cont1
    else:
        base['cont1'] = base['cont0']
    base.drop(columns='index',inplace=True)
    base['pre'] = pre
    base['post'] = post
    sim_vals = base.apply(lambda x: wdd_sim(**x), axis=1, result_type='expand')
    sim_vals.columns = ['est','var_est','true_val','z_score']
    return pd.concat([base,sim_vals], axis=1)

So for a first example, this code generates treatment/control areas with a Poisson mean of 5 in both the pre/post time periods. But, the pre time period is 4 units of time, and the post time period is only 1 unit. So this means there is no change, and the Z score estimator should on average have a 0 estimate and a standard deviation of 1. I do 10,000 simulations to keep it going a bit faster, but you can up that if you want.

# No change, with baseline of 5 crimes per unit time
sim_dat = make_data(10000, 5, 5, 5, 5, 4, 1)
sim_dat['z_score'].describe()

So here we can see these 10k simulated Poisson data have a mean z-score of 0 and a standard deviation of 1, right like we expected.

So I haven’t extensively tested, but if you have average crime counts well under 5, I would be a bit hesitant to use this estimator. (So you either need larger area aggregations or larger time aggregations.) Although you could do simulations on your own to see how it holds up.

The way I wrote the functions you can also pass in random variables as well, so here is an example with again no change, but the baseline varies uniformily from 5 to 100. And here also the pre time periods are 6, and the post time period is again just 1.

# Can pass in random functions instead of constant values
sim_n = 10000
tf = uniform.rvs(loc=5, scale=100, size=sim_n)

sim_dat2 = make_data(sim_n, tf, None, None, None, 6, 1)
sim_dat2.head()
sim_dat2['z_score'].describe()

So you can see the base simulated dataset pre/post always has the same means, but instead of being a set of constant 5’s, it changes for each row (simulation) in the dataset. And again the null distribution is right on the money with a mean of 0 and standard deviation of 1.

So those are examples of the null distribution of no changes in the time weighted estimator. This establishes that the false positive alpha rates are as you would expect. E.g. if you use the usual p-value < 0.05, if the differences are really 0 you only have a false positive reject the null 5 times out of 100.

But we also want to establish that when there is a difference, the estimator is not biased and that the variance estimates are correct. For the later part looking at the coverage rates of the confidence intervals is one way to do that. So here I show that with my hypothetical example in the intro part of this blog, the 95% and 90% confidence interval coverage rates are exactly as they should be. And the z-score estimate is right about where it should be as well.

# Lets look at the coverage rate for a decline from 40 to 20
def cover(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    cover = ( data['true_val'] > low) & ( data['true_val'] < high )
    return cover

sim_dat3 = make_data(sim_n, 40, 20, 50, 50, 2, 1)
sim_dat3.head()

# This should be centered on 2
sim_dat3['z_score'].describe()

# Should be ~ 0.9
co_90 = cover(sim_dat3, ci=0.9)
co_90.mean()

# Should be ~ 0.95
co_95 = cover(sim_dat3, ci=0.95)
co_95.mean()

So you can see the coverage is right on the money. The estimator is slightly biased downward in this simulation (should get a z-score on average around -2, but here the mean is -1.85). But it is good enough IMO to not worry about much in this situation.

Again, the original estimator without weighted for time is fine, if we do the same motions without doing weighting for different time periods, the coverage is still all fine and dandy.

# Note you can do the same coverage estimate without time weighted
sim_dat4 = make_data(sim_n, 80, 20, 100, 50, 1, 1)
sim_dat4.head()

# This should be around -0.6
sim_dat4['z_score'].describe()

co_90w = cover(sim_dat4, ci=0.9)
co_90w.mean()

co_95w = cover(sim_dat4, ci=0.95)
co_95w.mean()

So you can see again coverage is right on the money, and the z-score estimator actually has less bias than the time weighted one, it is right on the money as expected.

So why would you prefer the time weighted estimator if it shows more bias? It is because it has a lower variance, this code shows the length of the confidence intervals in the simulations.

# Does it make a difference?
def len_ci(data, ci=0.95):
    mult = (1 - ci)/2
    nv = norm.ppf(1 - mult)
    dif = nv*np.sqrt( data['var_est'] )
    low = data['est'] - dif
    high = data['est'] + dif
    return high - low

len4 = len_ci(sim_dat4)
len4.describe()

len3 = len_ci(sim_dat3)
len3.describe()

So you can see here that the non-time weighted estimator tends to have a confidence interval with a length of 62, whereas the time weighted estimator has a confidence interval on average of 42.

So above establishes that the time weighted estimator behaves as you would expect. You can also use this code to conduct some potential power analyses. So for the time weighted estimator we show, even though the reduction is around 50% in the treated area (going from 40 to 20), the power is not great, around 60%.

# Example power analyses, ONE TAILED
def reject_rate(data, alpha=0.05):
    p_vals = norm.cdf(data['z_score'])
    return p_vals < alpha
    
r3 = reject_rate(sim_dat3)
r3.mean()

So this means if you did this experiment in real life and it was that effective, you would still fail to reject the null of no differences 2/5 times.

But what if we say we will get more historical data? So 4 years back instead of just 2? How does that impact our power estimates?

# How about with more historical data
sim_dat5 = make_data(sim_n, 40, 20, 50, 50, 4, 1)
r5 = reject_rate(sim_dat5)
r5.mean()

The power goes up by alittle, to 0.67. The same is true if we up the post period to 4 time periods instead of 1:

# How about with more post data
sim_dat6 = make_data(sim_n, 40, 20, 50, 50, 4, 4)
r6 = reject_rate(sim_dat6)
r6.mean()

So now in this example you have an over 90% power to detect a crime reduction, going from 40 to 20 per time period (where the control has an average of 50 crimes per time period), if you have 4 pre time periods and 4 post time periods.

Future Stuff

So a few caveats with this. For one, you may think that since dividing per time period reduces the variance, why not divide by smaller time slivers. So instead of one year, why not divide by 365 days?

I have not studied extensively this property of the estimator. So I cannot say how it behaves with more/less time aggregation into smaller Poisson estimates. You will need to take that on yourself if you want to examine very fine time units and very small Poisson counts per unit time. Again I think a baseline rule of thumb that they should not be lower the 5 counts per unit time is the best advice I can give without doing simulations for your exact circumstances.

A second part is that with longer time periods comes the risk that the control areas are not as good. This is a problem intrinsic to synthetic control analysis as well (that I don’t believe anyone has a particular answer to). And I don’t have an answer either.

For the pre-time period, you can check the parallel trends assumption by simply plotting the two time series, they should be close to in step with one another. So that is not a big deal. But with the post time period, I think if you monitor long enough they will eventually depart from one another.

So I think it is best to set up a time period at the start you have committed to doing the experiment. And you can use the power analysis simulations like I showed to help you figure out that period. But it may be possible to extend this WDD estimate to continuously monitor an intervention (see here for example).