Ask me anything: Advice for learning statistics?

For a bit of background, Loki, a computer science student in India, was asking me about my solution to the DrivenData algae bloom competition. Much of our back and forth was specific to my coding solution and “how I knew how to do that” (in particular I used a machine learning variant of doubly robust estimation in part of the solution, which I am sure others have used before, but it is not something I see very often, as it is more typically “causal inference” motivated). As for more general advice on learning, I said:

Only advice is to learn stats – not just for competitions but for real-world jobs. Many people are just copy-pasting code, and don’t know what they are doing. Understanding selection bias is important in many real-world scenarios. Often times it is just knowing a little about the scientific scenario you are modeling, and correctly formulating a model.

In response Loki asks:

I decided to take your suggestion and strengthen my grasp on statistics. I consider myself somewhere between beginner and intermediate in stats. I came across several resources on the internet, but feel confused about what to go with. I am wondering if “The Elements of Statistical Learning” by Trevor Hastie and Robert Tibshirani is a good one to start with. Or could you please suggest any books/lectures/courses that have practical applications to solidify my understanding of statistics that you have personally read or liked?

Which I think is a good piece to expand to the readers on my blog in general. Here is my response:

I would not start with that book. It is a mistake to start with too advanced of material. (I don’t learn anything that way anyway.)

Starting from the basics, no joke Gonick’s Cartoon Guide to Statistics is in my opinion the best intro to statistics and probability book. After that, it is important to understand causality – like really understand it – selection bias lurks everywhere. (I am not sure I have great advice for books that focus on causality, Pearl’s book is quite tough, maybe Shadish, Cook, Campbell Experimental and Quasi-Experimental Designs and/or Mostly Harmless Econometrics).

After that, follow questions on https://stats.stackexchange.com, it is high quality on average (many internet sources, like Medium articles or https://datascience.stackexchange.com, are very low quality on average – they can have gems but more often than not they are bad for anything besides copy/pasting code). Andrew Gelman’s blog is another good source for contemporary discussion around stats/research/pitfalls, https://statmodeling.stat.columbia.edu.

In terms of more advanced modeling, after having the basics down, I would suggest Harrell’s Regression Modeling Strategies before the Hastie book. You can interpret pretty much all of machine learning in terms of regression models. For small datasets, understanding how to do simpler regression modeling the right way is the best approach.

When moving on to machine learning, then maybe the Hastie book is a good resource (although at that point I did not find it all that much more useful than web resources). Statquest videos are very good walkthroughs of more complicated ML algorithms (trees/boosting/neural networks), https://www.youtube.com/@statquest.

This is a hodge-podge – I don’t tend to learn things just to learn them – I have a specific project in mind and try to tackle that project the best I can. Many of these resources are items I picked up along the way (Gonick I got to review intro stats books for teaching, Harrell’s I picked up to learn a bit more about non-linear modeling with splines, Statquest I reviewed when interviewing for data science positions).

It is a long road to get to where I am. It was not via picking a book and doing intense study, it was a combination of applied projects and learning new things over time. I learned a crazy lot from the Cross Validated site when I was in grad school. (For those interested in optimization, the Operations Research site is also very high quality.) That was more broad learning though – seeing how people tackled problems in different domains.

Getting access to paywalled newspaper and journal articles

So recently several individuals have asked about obtaining articles they do not have access to that I cite in my blog posts. (Here or on the American Society of Evidence Based Policing.) This is perfectly fine, but I want to share a few tricks I have learned on accessing paywalled newspaper articles and journal articles over the years.

I currently only pay for a physical Sunday newspaper for the Raleigh News & Observer (and get the online content for free because of that). Besides that I have never paid for a newspaper article or a journal article.

Newspaper paywalls

There are two techniques for dealing with newspaper paywalls. 1) Some newspapers give you a free number of articles per month. To skirt this, you can open up the article in a private/incognito window in your preferred browser (or open up the article in another browser entirely, e.g. you use Chrome most of the time, but keep Firefox around just for this on occasion).

2) If that does not work, and you have the exact address, you can check the WayBack machine. For example, here is a search for a WaPo article I linked to in the last post. This works even for fairly recent articles, so if you can stand being a few days behind, the article is often already listed on the WayBack machine.

Journal paywalls

Single piece of advice here: use Google Scholar. Here for example is a search for the first Braga POP Criminology article in the last post. Google scholar will tell you if a free pre- or post-print URL exists somewhere. See the PDF link on the right here. (You can also click through to “All 8 Versions” below the article, and that will sometimes lead to other open links as well.)

Quite a few papers have PDFs available, and don’t worry if it is a pre-print, they rarely change in substance when going into print.1

For my personal papers, I have a google spreadsheet that lists all of the pre-print URLs (as well as the replication materials for those publications).

If those do not work, you can see if your local library has access to the journal, but that is not as likely. And I still have a Uni affiliation that I can use for this (the library and getting some software cheap are the main benefits!). But if you are at that point and need access to a paper I cite, feel free to email and ask for a copy (it is not that much work).

Most academics are happy to know you want to read their work, and so it is nice to be asked to forward a copy of their paper. So feel free to email other academics as well to ask for copies (and slip in a note for them to post their post-prints to let more people have access).

The Criminal Justician and ASEBP

If you like my blog topics, please consider joining the American Society of Evidence Based Policing. To be clear I do not get paid for referrals, I just think it is a worthwhile organization doing good work. I have started a blog series there (which you need a membership to read), and post once a month; you can see the current articles I have written on their site.

So if you want to read more of my work on criminal justice topics, please join the ASEBP. It is of course also a good networking and training resource you should be interested in as well.


  1. You can also sign up for email alerts on Google Scholar for papers if you find yourself reading a particular author quite often.↩︎

Random notes, digital art, and pairwise comparisons is polynomial

So not too much in the hopper for the blog at the moment. Have just a bunch of half-baked ideas (random python tips, maybe some crime analysis using osmnx, scraping javascript apps using selenium, normal nerd data science stuff).

Still continuing my blog series on the American Society of Evidence Based Policing, and will have a new post out next week on officer use of force.

If you have any suggestions for topics always feel free to ask me anything!


Working on some random digital art (somewhat focused on maps but not entirely). For other random suggestions I like OptArt and Rick Wicklin’s posts.

Dall-E is impressive, and since it has an explicit goal of creating artwork I think it is a neat idea. Chat bots I have nothing good to say about. Computer scientists working on them seem to be under the impression that if you build a large/good enough language model, out pops general intelligence. I am a wee bit skeptical of that.


At work a co-worker was working on timing applications for a particular graph-database/edge-detection project. Initial timings on fake data were not looking so good. Here we have number of nodes and timings for the application:

  Nodes    Minutes
   1000       0.16
  10000       0.25
 100000       1.5
1000000      51

Offhand people often speak about exponential functions (or growth), but here what I expect we are really looking at is pairwise comparisons (I am not totally familiar with the tech the other data scientist is using, so I am guessing at the algorithmic complexity). So this likely scales something like (where n is the number of nodes in the graph):

Time = Fixed + C1*(n) + C2*(n choose 2) + e

Fixed is just a small constant, C1 is setting up the initial node database, and C2 is the edge detection which I am guessing uses pairwise comparisons, (n choose 2). We can rewrite this to show that the binomial coefficient is really polynomial time (not exponential) in terms of just the number of nodes.

C2*[n choose 2] = C2*[{n*(n-1)}/2]
                = C2*[(n^2 - n)/2]
                = C2/2*[n^2 - n]
                = C2/2*n^2 - C2/2*n

And so we can rewrite our original equation in terms of simply n:

Time = Fixed + (C1 - C2/2)*n + (C2/2)*n^2

Doing some simple R code, we can estimate our equation:

n <- 10^(3:6)
m <- c(0.16,0.25,1.5,51)
poly_mod <- lm(m ~ n + I(n^2))

Since this fits 3 parameters with only 4 observations, the fit is (not surprisingly) quite good. Which to be clear does not mean much; if I really cared I would do much more sampling (or read the docs more closely about the underlying tech involved):

> pred <- predict(poly_mod)
> cbind(n,m,pred)
      n     m       pred
1 1e+03  0.16  0.1608911
2 1e+04  0.25  0.2490109
3 1e+05  1.50  1.5000989
4 1e+06 51.00 50.9999991

And if you do instead poly_2 <- lm(m ~ n + choose(n,2)) you get a change in scale of the coefficients, but the same predictions.
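As a quick sanity check of that claim, continuing the R code above:

poly_2 <- lm(m ~ n + choose(n,2))
all.equal(predict(poly_mod), predict(poly_2)) # TRUE, identical predictions
cbind(coef(poly_mod), coef(poly_2))           # coefficients differ in scale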

We really need this to scale in our application at work to maybe over 100 million records, so what would we predict in terms of minutes based on these initial timings?

> nd = data.frame(n=10^(7:8))
> predict(poly_mod,nd)/60 # convert to hours
         1          2
  70.74835 6934.56850

So doing 10 million records will take a few days, and doing 100 million will be close to 300 days.

With only 4 observations there is not much to chew over (really it is too few to say whether it should be a different model). I am wondering though how to best handle errors for these types of extrapolations. Errors are probably not homoskedastic for such timing models (the error will be larger for a larger number of nodes). Maybe it is better to use quantile regression (and model the median?). I am not sure (and that advice I think also applies to modeling exponential growth).
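If I were to try the quantile regression idea, a minimal sketch would look something like the below (assuming the quantreg package, and purely illustrative here, since fitting three parameters to four observations is not meaningful):

library(quantreg)

# model the median timing instead of the mean
med_mod <- rq(m ~ n + I(n^2), tau=0.5)
predict(med_mod, newdata=data.frame(n=10^(7:8)))/60 # extrapolated hours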

Surpassed 100k views in 2022

For the first time, yearly view counts have surpassed 100,000 for my blog.

I typically get a bump of (at best) a few hundred views when I first post a blog. But the most popular posts are all old ones, and I get the majority of my traffic via google searches.

Around March this year, monthly traffic bumped up from around 9k to 11k views per month. I am not sure of the reason (it is unlikely due to any specific individual post; as you can see, none of the most popular posts were posted this year). A significant number of the views are likely bots (though what percent overall I have no clue). So it is possible my blog was scooped up in some other aggregators/scrapers around that time (I would think those would not be counted as search engine referrals though).

One interesting source for the blog: when doing academic style posts with citations, my blog gets picked up by google scholar (see here for example). It is not a big source, but it likely refers a more academic type crowd to the blog (I can tell people have google scholar alerts – when scholar indexes a post I get a handful of referrals).

I have some news coming soon about writing a more regular criminal justice column for an organization (readers will have to wait a little over a week). But I also do Ask Me Anything, so always feel free to send me an email or comment on here (I started AMA as I get a trickle of tech questions via email anyway, and might as well share my responses with everyone).

I typically just blog about whatever I am working on. So maybe next up is how auto-ml libraries often have terrible defaults for hypertuning random forests, an example of data envelopment analysis, quantile regression for analyzing response times, or monitoring censored data, which are all random things I have been thinking about recently. But no guarantees about any of those topics in particular!

Using weights in regression examples

I have come across several different examples recently where ‘use weights in regression’ was the solution to a particular problem. I will outline four recent examples.

Example 1: Rates in WDD

Sophie Curtis-Ham asks whether I can extend my WDD rate example to the Poisson regression approach I outlined. I spent some time and figured out the answer is yes.

First, if you install my R package ptools, we can use the same example from that blog post showing rates (or per area, e.g. density) in my wdd function using R code (Wheeler & Ratcliffe, 2018):

library(ptools)

crime <- c(207,308,178,150,110,318,157,140)
type <- c('t','ct','d','cd','t','ct','d','cd')
ti <- c(0,0,0,0,1,1,1,1)
ar <- c(1.2,0.9,1.5,1.6,1.2,0.9,1.5,1.6)

df <- data.frame(crime,type,ti,ar)

# The order of my arguments is different than the 
# dataframe setup, hence the c() selections
weight_wdd <- wdd(control=crime[c(2,6)],
                  treated=crime[c(1,5)],
                  disp_control=crime[c(4,8)],
                  disp_treated=crime[c(3,7)],
                  area_weights=ar[c(2,1,4,3)])

# Estimate -91.9 (31.5) for local

So here the ar vector is a set of areas (imagine square miles or square kilometers) for the treated/control/displacement/displacement-control areas. But it would work the same if you wanted to do per-capita (per person) rates as well.

Note the comment says the estimate is for the local effect; in the glm I will show I am just estimating the local effect, not the displacement effect. At first I tried using an offset, and that did not change the estimate at all:

# Lets do a simpler example with no displacement
df_nod <- df[c(1,2,5,6),]
df_nod['treat'] <- c(1,0,1,0)
df_nod['post'] <- df_nod['ti']

# Attempt 1, using offset
m1 <- glm(crime ~ post + treat + post*treat + offset(log(ar)),
          data=df_nod,
          family=poisson(link="identity"))
summary(m1) # estimate is  -107 (30.7), same as no weights WDD

Maybe to get the correct estimate via the offset approach you need to do some post-hoc weighting, I don’t know. But we can use weights and estimate the rate on the left hand side.

# Attempt 2, estimate rate and use weights
# suppressWarnings is for non-integer notes
df_nod['rate'] <- df_nod['crime']/df_nod['ar']
m2 <- suppressWarnings(glm(rate ~ post + treat + post*treat,
          data=df_nod,
          weights=ar,
          family=poisson(link="identity")))
summary(m2) # estimate is same as no weights WDD, -91.9 (31.5)

The motivation again for the regression approach is to extend the WDD test to scenarios more complicated than simple pre/post, and using rates (e.g. per population or per area) seems to be a pretty simple thing people may want to do!

Example 2: Clustering of Observations

Had a bit of a disagreement at work the other day – statistical models used for inference on coefficients on the right hand side often make the “IID” assumption – independent and identically distributed residuals (or independent observations conditional on the model). This is almost entirely focused on standard errors for right hand side coefficients; when using machine learning models purely for prediction it may not matter at all.

Even if interested in inference, it may be the solution is to simply weight the regression. Consider the most extreme case, we simply double count (or here repeat count observations 100 times over):

# Simulating simple Poisson model
# but replicating data
set.seed(10)
n <- 600
repn <- 100
id <- 1:n
x <- runif(n)
l <- 0.5 + 0.3*x
y <- rpois(n,l)
small_df <- data.frame(y,x,id)
big_df <- data.frame(y=rep(y,repn),x=rep(x,repn),id=rep(id,repn))

# With small data 
mpc <- glm(y ~ x, data=small_df, family=poisson)
summary(mpc)

# Note same coefficients, just SE are too small
mpa <- glm(y ~ x, data=big_df, family=poisson)

vcov(mpc)/vcov(mpa) # ~ 100 times too small

So as expected, the coefficient variance estimates are about 100 times too small (so the standard errors are about 10 times too small). Again this does not cause bias in the coefficient estimates (and so will not cause bias if the equation is used for predictions). But if you are making inferences for coefficients on the right hand side, this suggests you have way more precision in your estimates than you do in reality. One solution is to simply weight the observations by the inverse of the number of repeats they have:

big_df$w <- 1/repn
mpw <- glm(y ~ x, weight=w, data=big_df, family=poisson)
summary(mpw)
vcov(mpc)/vcov(mpw) # correct covariance estimates

And this will be conservative in many circumstances, if you don’t have perfect replication across observations. Another approach though is to cluster your standard errors, which uses data to estimate the residual autocorrelation inside of your groups.

library(sandwich)
adj_mpa <- vcovCL(mpa,cluster=~id,type="HC2")
vcov(mpc)/adj_mpa   # much closer, still *slightly* too small

I use HC2 here as it uses small sample degree of freedom corrections (Long & Ervin, 2000). There are quite a few different types of cluster corrections. In my simulations HC2 tends to be the “right” choice (likely due to the degree of freedom correction), but I don’t know if that should generally be the default for clustered data, so caveat emptor.

Note again though that the cluster standard error adjustments don’t change the point estimates at all – they simply adjust the covariance matrix estimates for the coefficients on the right hand side.
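A quick check of that, continuing the R code above:

# point estimates are identical, only the standard error estimates change
cbind(coef(mpc), coef(mpa), coef(mpw))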

Example 3: What estimate do you want?

So in the above example, I exactly repeated everyone 100 times. You may have scenarios where some observations are repeated more times than others. So above, if I had one observation repeated 10 times and another repeated 2 times, the correct weights in that scenario would be 1/10 and 1/2 for each row inside those clusters/repeats. Here is another scenario though where we want to weight up repeat observations – it just depends on the exact estimate you want.

A questioner wrote in with an example of a discrete choice type set up, but some respondents are repeated in the data (e.g. chose multiple responses). So imagine we have data:

Person,Choice
  1      A  
  1      B  
  1      C  
  2      A  
  3      B  
  4      B  

If you want to know the estimate in this data, “pick a random person-choice, what is the probability of choosing A/B/C?”, the answer is:

A - 2/6
B - 3/6
C - 1/6

But that may not be what you really want, it may be you want “pick a random person, what is the probability that they choose A/B/C?”, so in that scenario the correct estimate would be:

A - 2/4
B - 3/4
C - 1/4

To get this estimate, we should weight up responses! So typically each row would get a weight of 1/nrows, but here we want the weight to be 1/npersons and constant across the dataset.

Person,Choice,OriginalWeight,UpdateWeight
  1      A      1/6             1/4
  1      B      1/6             1/4
  1      C      1/6             1/4
  2      A      1/6             1/4
  3      B      1/6             1/4
  4      B      1/6             1/4
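Here is a minimal R sketch of the two estimands (the data frame and names are just for illustration):

ch <- data.frame(Person = c(1,1,1,2,3,4),
                 Choice = c('A','B','C','A','B','B'))

# "pick a random person-choice" estimate, weight = 1/nrows
w_row <- rep(1/nrow(ch), nrow(ch))
tapply(w_row, ch$Choice, sum) # A=2/6, B=3/6, C=1/6

# "pick a random person" estimate, weight = 1/npersons
w_per <- rep(1/length(unique(ch$Person)), nrow(ch))
tapply(w_per, ch$Choice, sum) # A=2/4, B=3/4, C=1/4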

And this extends to whatever regression model if you want to model the choices as a function of additional covariates. So here technically person 1 gets triple the weight of persons 2/3/4, but that is the intended behavior if we want the estimate to be “pick a random person”.

Depending on the scenario you could do two models – one to estimate the number of choices and another to estimate the probability of a specific choice, but most people I imagine are not using such models for predictions so much as they are for inferences on the right hand side (e.g. what influences your choices).

Example 4: Cross-classified data

The last example has to do with observations that are nested within multiple hierarchical groups. One example that comes up in spatial criminology – we want to do analysis of some crime reduction/increase in a buffer around a point of interest, but multiple buffers overlap. A solution is to weight observations by the number of groups they overlap.

For example consider converting incandescent street lamps to LED (Kaplan & Chalfin, 2021). Imagine that we have four street lamps, {c1,c2,t1,t2}. The figure below displays these four street lamps; the t street lamps are treated, and the c street lamps are controls. Red plus symbols denote crime locations, and each street lamp has a buffer of 1000 feet. The buffers for the two control street lamps overlap, and so a simple buffer count would double-count crimes that fall within both of their boundaries.

If one estimated a treatment effect based on these buffer counts, with the naive count within buffer approach, one would have:

c1 = 3    t1 = 1
c2 = 4    t2 = 0

The average control count would then be 3.5 and the average treated count 0.5, giving an average treatment effect of 3. This however would be an overestimate due to the overlapping buffers for the control locations. Similar to example 3, it depends on how exactly you want to define the average treatment effect – I think a reasonable definition is simply the global estimate of crimes reduced divided by the total number of treated areas.

To account for this, you can weight individual crimes. Those crimes that are assigned to multiple street lamps only get partial weight – if they overlap two street lamps, the crimes are only given a weight of 0.5, if they overlap three street lamps within a buffer area those crimes are given a weight of 1/3, etc. With such updated weighted crime estimates, one would then have:

c1 = 2    t1 = 1
c2 = 3    t2 = 0

And then one would have an average of 2.5 crimes in the control street lamps, and subsequently would have a treatment effect reduction per average street lamp of 2 crimes overall.
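Here is a minimal R sketch of that weighting logic, using a hypothetical crime-to-lamp assignment that is consistent with the counts above (crimes 2 and 3 are the ones falling inside both control buffers):

# long format, one row per crime-lamp pair
cr_lamp <- data.frame(crime = c(1, 2, 2, 3, 3, 4, 5, 6),
                      lamp  = c('c1','c1','c2','c1','c2','c2','c2','t1'))

# naive buffer counts double count crimes 2 and 3
table(cr_lamp$lamp)                  # c1=3, c2=4, t1=1 (t2=0)

# weight each crime by 1/(number of buffers it falls inside)
cr_lamp$w <- 1/ave(cr_lamp$crime, cr_lamp$crime, FUN=length)
tapply(cr_lamp$w, cr_lamp$lamp, sum) # c1=2, c2=3, t1=1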

This idea I first saw in Snijders & Bosker (2011), in which they called this cross-classified data. I additionally used this technique with survey data in Wheeler et al. (2020), in which I nested responses in census tracts. Because responses were mapped to intersections, they technically could be inside multiple census tracts (or more specifically I did not know 100% what tract they were in). I talk about this issue in my dissertation a bit with crime data, see pages 90-92 (Wheeler, 2015). In my dissertation using D.C. data, if you aggregated that data to block groups/tracts the misallocation error is likely ~5% in the best case scenario (and depending on data and grouping, could be closer to 50%).

But again I think a reasonable solution is to weight observations, which is not much different from Hipp & Boessen’s (2013) egohoods.


Job advice for entry crime analysts

I post occasionally on the Crime Analysis Reddit, and in a few recent posts I mentioned expanding the net to private sector gigs for those interested in crime analysis. I got a question from a recent student as well, so figured a blog post on my advice is in order.

For students interested in crime analysis, it is standard advice to do an internship (while a student), and that gets you a good start on networking. But if that ship has sailed, and you are now finished with school and need to get a job, that advice does not help. It is also standard to join the IACA (and if you have a local org, like TXLEAN for Texas, you can join that local org and get IACA membership at the same time). They have job boards for openings, and the local orgs are a good place to network as well for entry level folks. IACA has training material available as well.

Because there are not that many crime analysis jobs, I tell students to widen their net and apply to any job that lists “analyst” in the title. We hire many “business analysts” at Gainwell, and while having a background in healthcare is nice it is not necessary. They mostly do things in Excel, Powerpoint, and maybe some SQL. Probably more have a background in business than healthcare specifically. Feel free to take any background experience in the job description not as requirements but as “nice to have”.

These are pretty much the same data skills people use in crime analysis. So if you can do one you can do the other.

This advice is also true for individuals who are currently crime analysts and wish to pursue other jobs. Unfortunately because crime analysis is more niche in departments, there is not much upward mobility. Other larger organizations that have analysts will just by their nature have more senior positions to work towards over your career. Simultaneously you are likely to have a larger salary in the private sector than public sector for even the same entry level positions.

Don’t get the wrong impression of the technical skills needed for these jobs from reading my blog. Even in more advanced data science jobs I am mostly writing python + SQL; I am not writing bespoke optimization functions very often. So in terms of skills for analyst positions I just suggest focusing on Excel. I intentionally did my crime analysis course materials in a way to give you a broad background that is relevant for other analyst positions as well (some SQL/Powerpoint, but mostly Excel).

Sometimes people like to think doing crime analysis is a public service, and so look down on going to the private sector. Plenty of analysts in banks/healthcare do fraud/waste/abuse work that has just as large an impact on the public as crime analysts have, so I think this opinion is misguided in general.

Many jobs at Gainwell get less than 10 applicants. Even if these jobs have listed healthcare background requirements, if they don’t have options among the pool those doing the hiring will lower their expectations. I imagine it is the same for many companies. Just keep applying to analyst jobs and you will land something eventually.

I wish undergrad programs did a better job preparing social science students with tech skills. It is really just minor modifications – courses teaching Excel/SQL (maybe some coding for real go-getters), and a better job at making stats relevant to real world business applications (calculating expected values/variances and trends in those is a common task; doing null hypothesis significance testing is very rare). But you can level up on Excel with various online resources, my course included.

p-values with large samples (AMA)

Vishnu K, a doctoral student in Finance writes in a question:

Dear Professor Andrew Wheeler

Hope you are fine. I am a big follower of your blog and have used it heavily to train myself. Since you welcome open questions, I thought of asking one here and I hope you don’t mind.

I was reading the blog of Dave Giles, and one of his posts, https://davegiles.blogspot.com/2019/10/everythings-significant-when-you-have.html, asserts that one must adjust p-values when working with large samples. In a related but older post, he says the same:

“So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this is really bad news. Much, much, smaller p-values are needed before we get all excited about ‘statistically significant’ results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned” https://davegiles.blogspot.com/2011/04/drawing-inferences-from-very-large-data.html#more

In one of the posts of Andrew Gelman, he said the same

“When the sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise. In general: small n, unlikely to get small p-values.

Large n, likely to find something. Huge n, almost certain to find lots of small p-values” https://statmodeling.stat.columbia.edu/2009/06/18/the_sample_size/

As Leamer (1978) points out, if the level of significance should be set as a decreasing function of sample size, is there a formula through which we can check the needed level of significance for rejecting a null?

Context 1: Sample Size is 30, number of explanatory variables are 5

Context 2: Sample Size is 1000, number of explanatory variables are 5

In both contexts, can’t we use a p-value < .05, or should we fix a very small p-value for context 2, even though both contexts relate to the same data set and the only difference is that context 2 has a lot more data points?

Worrying about p-values here is in my opinion the wrong way to think about it. You can focus on the effect size, and even if an effect is significant, it may be substantively too small to influence how you use that information.

Finance I see, so I will try to make a relevant example. Let’s say a large university randomizes students to take a financial literacy course, and then 10 years later follows up to see their overall accumulated retirement savings. Say the sample is very large, and we have results of:

Taken Class: N=100,000 Mean=5,000 SD=2,000
   No Class: N=100,000 Mean=4,980 SD=2,000

SE of Difference ~= 9
Mean Difference = 20
T-Stat ~= 2.24
p-value ~= 0.025

So we can see that the treated class saves more! But it is only 20 dollars more over ten years. We have quite a precise estimate. Even though those who took the class save more, do you really think taking the class is worth it? Probably not based on these stats, it is such a trivial effect size given the sample and overall variance of savings.
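For those who want to replicate the arithmetic, here is a quick R sketch of the calculation behind those numbers (using a normal approximation for the p-value):

n1 <- 1e5; n2 <- 1e5
m1 <- 5000; m2 <- 4980
s1 <- 2000; s2 <- 2000

se_diff <- sqrt(s1^2/n1 + s2^2/n2) # ~8.9
t_stat  <- (m1 - m2)/se_diff       # ~2.24
p_val   <- 2*pnorm(-abs(t_stat))   # ~0.025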

And then as a follow up from Vishnu:

Thanks a lot Prof Andrew. One final question: can we use Cohen’s d or any other stats for effect size estimation in these cases?

Cohen’s d = (4980 – 5000) / 2000 = -0.01.

I don’t personally really worry about Cohen’s D to be honest. I like to try to work out the cost-benefits on the scales that are meaningful (although this makes it difficult to compare across different studies). So since I am a criminologist, I will give a crime example:

Treated Areas: 40 crimes
Non-Treated Areas: 50 crimes

Ignore the standard error for this at the moment. Whether a drop of 10 crimes “is worth it” depends on the nature of the treatment and the type of crime. If the drop is simply in stealing small items from the store, but the intervention was hiring 10 security guards, it is likely not worth it (the 10 guards’ salaries are likely much higher than the value of the 10 items whose theft they prevented).

But pretend now that the intervention was nudging police officers to patrol more in hot spots (so no marginal labor cost) and the crimes we examined were shootings. Preventing 10 shootings is a pretty big deal, because they have such large societal costs.

In this scenario the cost-benefit calculations are always on the count scale (how many crimes did you prevent). Using another scale (like Cohen’s D or incident rate ratios or whatever) just obfuscates how to calculate the costs/benefits.

Potential IT student projects with police? (AMA)

Abigail White writes in:

I’m still settling in at my agency and getting an analysis program rolling. Before I came here, I volunteered with a nonprofit and I had partnered 2 years running with a local university to provide real-world opportunities for their 4th-year undergrad IT students to do a final project. Basically, I was the project owner and they were supposed to build an application, modify software for my needs, or do some project along those lines to get on-the-ground experience. The school just reached out again and asked if I wanted to do a project with them again this year with whatever organization I’m currently working with.

I really want to tell them yes because it’s a tremendous opportunity for these students to know about how IT can interface with the field of law enforcement (and to get some IT minds working on law enforcement problems). Plus my agency actually may want to hire in-house IT at some point so it could be a good way for them to screen some people and make a job offer at the end of the school year.

Do you have any ideas for projects that would be good to build bridges between the IT industry and law enforcement? I need to come up with something good before I pitch it to command. The best I can think of so far is a phone-based app where officers could record the lat and long of their exact location and add info about the environment around them – ultimately so we could start getting good data for risk terrain modeling (i.e., at this location, there’s an abandoned house, at another location, there’s an unoccupied house where the lawn hasn’t been mown, etc.) I know that’s not much to go on, but I would love to hear any thoughts you have!

And here is what I said off-the-cuff in response:

It is hard for me to gauge feasibility. In terms of app maybe a nice public facing dashboard (maps/charts), with public info. Can pitch as something for community group meetings for example. That is a good marriage between IT (worrying about the server) and crime analysis (queries to feed the dashboard) I think.

Phone app to me sounds too hard. Even if you build it, it is very difficult to get data security down and to get the PD to actually use it. Public dashboard is for everyone and no security worries. (Building a phone app seems pretty far outside what an IT person would typically do within a PD.)

RTM in terms of building a predictive model is feasible, but making a nice way for the PD to consume it is more difficult (so is more traditional crime analysis type stuff). But could maybe make sense just internally.

I have seen someone try to do the video narratives with officers (similar to your app idea, just using gopro). It is neat, but not sure it results in very actionable intelligence to be honest.

This is just me rambling though. Typically for projects I start with something the PD will find useful based on my own observations. So would look for regular things done low-tech now by the PD that can maybe be automated or improved upon with some coding work.

Sharing here with the hope that others will have more ideas. It is similar in scope to a really good undergrad project or a masters level thesis type project, I would say.

Solving the P-Median model

Wendy Jang, my former student at UTD and now Data Scientist for the Stanislaus County Sheriff’s Dept. writes in:

Hello Andy,

I have some questions about your posting in GitHub. Honestly, I am not good at Python at all. I was playing with your python code and trying to understand the work flow. Then, I can pretty much mimic what you did; however, I encountered some errors along the way.

I was able to follow through lines but then I am stuck on PMed function. Do you know what possibly caused this error? See the screenshot below. Please advise. Thanks!

Specifically, Wendy is asking about my patrol redistricting example with workload inequality constraints. Wendy’s problem here is likely that she does not have CPLEX installed. CPLEX is free for academics, but unfortunately costs a bit of money for anyone else (I am not sure if they have cheaper licensing for the public sector; it looks like currently about $200 a month).

This was a good opportunity to update some of my code, so now see the two .py files in the DataCreated folder; in particular, 01_pmed_class.py has a nice class implementation. I have additionally added functionality to create a map and, more importantly, to eliminate subtours (my solution can potentially return disconnected areas). I have also added the ability to use different solvers.

For this problem CPLEX just works much better (takes around 10 minutes), but I was able to get the SCIP solver to return the same solution in around 5 hours. GLPK did not return a solution even letting it churn out for over 12 hours. I tested CBC for shorter time periods, but that appears to be a no-go as well.

So just a brief overview, the libraries I will be using (check out the end of the post for setting up the conda environment):

import pickle
import pulp
import networkx
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import geopandas as gpd

class pmed():
    """
    Ar - list of areas in model
    Di - distance dictionary organized Di[a1][a2] gives the distance between areas a1 and a2
    Co - contiguity dictionary organized Co[a1] gives all the contiguous neighbors of a1 in a list
    Ca - dictionary of total number of calls, Ca[a1] gives calls in area a1
    Ta - integer number of areas to create
    In - float inequality constraint
    Th - float distance threshold to make a decision variable
    """
    def __init__(self,Ar,Di,Co,Ca,Ta,In,Th):

I do not copy-paste the entire pmed class in the blog post. It is long, but it is a rewrite of my model from the paper. Next, I load in the objects needed to fit the model (which were created/pickled in the 00_PrepareData.py file).

# Loading in data
data_loc = r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataCreated\fin_obj.pkl'
areas, dist_dict, cont_dict, call_dict, nid_invmap, MergeData = pickle.load(open(data_loc,"rb"))

# shapefile for reporting areas
carr_report = gpd.read_file(r'D:\Dropbox\Dropbox\PublicCode_Git\PatrolRedistrict\PatrolRedistrict\DataOrig\ReportingAreas.shp')

And now we can create a pmed object, which has all the info necessary, and gives us a print out of the total number of decision variables and constraints.

# Creating pmed object
pmed12 = pmed(Ar=areas,Di=dist_dict,Co=cont_dict,Ca=call_dict,Ta=12,In=0.1,Th=10)

Writing the class this way allows the user to input their own solver, and you can see it prints out the available solvers on your current machine. So to pass in the SCIP solver you could do pmed12.solve(solver=pulp.SCIP_CMD()), which again for this problem does find the solution, but takes 5 hours. So here I still stick with CPLEX to just illustrate the new class’s functionality:

pmed12.solve(solver=pulp.CPLEX(timeLimit=30*60,msg=True))     # takes around 10 minutes

This prints out a ton of information at the command line that you do not get if you run from within Jupyter notebooks. So for debugging that is much more useful. timeLimit is in seconds, but does not include the presolve phase I believe, so some of these solvers can just get stuck on the presolve forever.

We can look at the results in a nice map if you do have geopandas installed and pass it a geopandas data frame and the key to match it on:

pmed12.map_plot(carr_report, 'PDGrid')

This map shows the new areas in different colors with a thick border, and the source area as a dot with a white outline. So here you can see that we end up having one disconnected area – a subtour in optimization parlance. I have added some methods to deal with this though:

stres = pmed12.collect_subtours()

And you can see that for one of our areas it identified a subtour. I haven’t written this to automatically re-solve for subtours due to the potential length of the solving time. So here we can see the subtours have 0 calls in them. You can assign those areas wherever and it does not change the objective function. That is what I did in the paper, and I showed how in the Jupyter notebook on the main page of the github project.

So if you are using a solver that takes 5 hours, I would suggest you manually fix these subtours if there is an easy solution that is obvious to the human eye. But here, with the shorter solve time with CPLEX, I can rerun the algorithm with a warm start, and it runs even faster (under 5 minutes). The .collect_subtours() method adds subtour constraints into the pulp model, so you just need to redo the solve() method to eliminate that particular subtour (I do not know if this is guaranteed to converge for all problems though).

So here in this data eliminating that one subtour results in a solution with no subtours:

pmed12.solve(solver=pulp.CPLEX_CMD(msg=False,warmStart=True))
stres = pmed12.collect_subtours()

And we can replot the result, which shows it chooses the areas that you would have assigned by eye anyway:

pmed12.map_plot(carr_report, 'PDGrid')

So again for this problem, if you have CPLEX I would recommend it (I have not tried Gurobi). But at least for this particular dataset, SCIP was able to solve the problem in 5 hours. So if you are a crime analyst or someone else without academic access to CPLEX, you can install the SCIP solver and give that a go for your actual data.

Note that this is based off of results for 325 subareas. So if you have more subareas it will take longer (and if you have fewer it may be quite a bit shorter).

Setting up the Conda environment

So once you have python installed, you can typically do something like:

pip install pulp

Or via conda:

conda install pyscipopt

I often have trouble though, especially when working with the python geospatial libraries, installing geopandas, fiona, etc. So here is what I do to create a new conda environment:

conda create -n linprog
conda activate linprog
conda install -c conda-forge python=3 pip pandas numpy networkx scikit-learn dbfread geopandas glpk pyscipopt pulp

And then you can run the pmedian code above in this environment. I suppose I should turn this into a python package, and I see a bunch of folks these days are doing docker images as well for their packages in complicated environments. This is actually not that bad, except that geopandas makes things a bit tricky.

Difference in independent effects for multivariate analysis (SPSS)

For some reason my various posts on testing differences in coefficients are fairly high in google search results. Arden Roeder writes in with another related question on this:

Good evening Dr. Wheeler,

I am a PhD candidate at the University of Oklahoma working on the final phases of data analysis for my dissertation. I found an article on your website that almost answers a question I have about a potential statistical test, and I’m hoping you might be able to help me fill in the gaps.

If you are willing to help me out, I would greatly appreciate it, and if not, I completely understand!

Here’s the setup: I have two independent variables (one measured on a 6-point Likert scale, the other on a 7-point) and six dependent variables. I have hypothesized that IV1 would be a stronger predictor of three of the DVs, and that IV2 would be a stronger predictor of the other three DVs. I ran multiple linear regression tests for each of the DVs, so I have the outputs for those. I know I can compare the standardized betas just to see strength, but what I’d like to know is how I can determine whether the difference between the beta weights is significant, and then to assess this for each of the six DVs.

From reading through your post, it seems like the fourth scenario you set up is quite close to what I’m trying to do, but I’m not sure how to translate the covariance output I have (I’m using SPSS) to what you’ve displayed here. Can I simply square the standard errors I have to get the numbers on the diagonal, and then grab the covariance from the SPSS output and solve accordingly? (I also reviewed your writing here about using the GLM procedure as well, but can’t seem to align my outputs with your examples there either.)

Here’s a sample of the numbers I’m working with:

Any insights you can offer on 1) whether this is the right test to give me the answers I’m looking for about whether the betas are significantly different and 2) how to set up and interpret the results correctly would be a tremendous help.

For 1, yes I think this is an appropriate way to set up the problem. For 2, if sticking to SPSS it is fairly simple syntax in GLM:

*****************************.
*Contrasts with X1 - X2 effect across the variables.
GLM Y1 Y2 Y3 Y4 Y5 Y6 WITH X1 X2
  /DESIGN=X1 X2
  /PRINT PARAMETER
  /LMATRIX = "T1"
             X1  1
             X2 -1.
*****************************.

To get this to do the standardized coefficients, Z-score your variables before the GLM command (this is assuming you are estimating a linear model, and not a non-linear model like logit/Poisson). (I have full simulation SPSS code at the end of the post illustrating this.)

Note that when I say the covariance between beta coefficients, this is NOT the same thing as the covariance between the original metrics. So the correlation matrix for X1 vs X2 here does not give us the information we need.
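For completeness, the manual calculation the contrast is doing (the squaring-the-standard-errors part of the question) is just the standard variance formula for a difference of two estimates:

Var(b1 - b2) = Var(b1) + Var(b2) - 2*Cov(b1,b2)
Z = (b1 - b2)/sqrt(Var(b1 - b2))

where Var(b1) and Var(b2) are the squared standard errors of the two coefficients, and Cov(b1,b2) comes from the coefficient covariance matrix (not the covariance of the original variables).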

For 2, the reporting part, you can see the Contrast Results (K Matrix) table in SPSS. I would just transpose that table, add a footnote/title noting this is testing X1 – X2, and then keep only the columns you want. So here is the original SPSS contrast table for this result:

And here is how I would clean up the table and report the tests:

SPSS Simulation Code

Here I just simulate independent X’s, and give the Y’s consistent effects. You could also simulate multivariate data with specified correlations if you wanted to though.

****************************************************.
* Simulated data to illustrate coefficient diff.
* tests.
SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 10000.
COMPUTE Id = #i.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

COMPUTE X1 = RV.NORMAL(0,1).
COMPUTE X2 = RV.NORMAL(0,1).
COMPUTE Y1 = 1.5*X1 + 0.5*X2 + RV.NORMAL(0,1).
COMPUTE Y2 = 1.5*X1 + 0.5*X2 + RV.NORMAL(0,1).
COMPUTE Y3 = 1.5*X1 + 0.5*X2 + RV.NORMAL(0,1).
COMPUTE Y4 = 0.5*X1 + 1.5*X2 + RV.NORMAL(0,1).
COMPUTE Y5 = 0.5*X1 + 1.5*X2 + RV.NORMAL(0,1).
COMPUTE Y6 = 0.5*X1 + 1.5*X2 + RV.NORMAL(0,1).
EXECUTE.

*Contrasts with X1 - X2 effect across the variables.
GLM Y1 Y2 Y3 Y4 Y5 Y6 WITH X1 X2
  /DESIGN=X1 X2
  /PRINT PARAMETER
  /LMATRIX = "Contrast X1 - X2"
             X1  1
             X2 -1.

*And here is an example stacking equations and using EMMEANS. 
*Stacking equation approach, can work for various.
*Generalized linear models, etc.
VARSTOCASES
  /MAKE Y FROM Y1 TO Y6
  /Index Outcome.

GENLIN Y BY Outcome WITH X1 X2
  /MODEL Outcome X1 X2 Outcome*X1 Outcome*X2
    DISTRIBUTION=NORMAL LINK=IDENTITY
 /CRITERIA COVB=ROBUST
 /REPEATED SUBJECT=Id CORRTYPE=UNSTRUCTURED
 /EMMEANS TABLES=Outcome CONTROL= X1(1) X2(-1).
****************************************************.