A bunch of random shout outs

Busy, busy, busy! Hopefully I will have some time in the near future to write up some more data science posts. But for now, here is a small python snippet to help you build interaction variables between two sets of numpy arrays/dataframes.

import numpy as np
def np_int(a,b):
    rows = a.shape[0]
    cols = a.shape[1]*b.shape[1]
    return np.einsum('ij,ik->ijk', a, b).reshape((rows,cols))

This works for pytorch as well (just replace np.einsum with torch.einsum). So coming up (eventually) I will illustrate encoding interaction between hidden layers in a deep learning model. But for now some quicker updates.

Shout out #1: Scott Jacques has continued to push the charge for open access to criminology journals. He has two recent posts about post-prints, and how our main journal (Criminology) has an excessive policy of not allowing authors to post post prints for over two years (whereas the majority of criminology journals allow you to post immediately).

Several aspects of open science are tricky – posting pre-prints/post-prints is not. If we can come together as a group this is an easy, no cost way to greatly improve the accessibility of our work to the greater public.

Shout out #2: The folks at Police Rewired have hosted a hackathon intended to Hack Hate. It is too late to participate, but they will be displaying the results this Sunday. I have not had the chance to participate in any code hackathons, I will need to make a concerted effort in the future to give at least one a shot. (It seems hard, how can you do any work in only a day or a week or two!? But the proof is in the pudding so to speak, I’ve have seen some pretty cool things come out of various hackathons in the past.)

Shout out #3: My workplace, HMS, is involved in a data sharing collaborative called the Digital Health DRC. They also have a hackathon coming up, but this is related to Telehealth use. The Digital Health DRC is pretty cool though, it is basically a way for HMS (and several other private sector entities) to share various datasets with researchers over the globe.

The scope of HMS’s data is somewhat outside the realm of my old stomping grounds of criminology (but not entirely, a big part of my job is identifying potentially fraudulent patterns in claims data). But for folks who have a research question that could be answered using health insurance claims data, this is a good resource to look into. (HMS has pretty good coverage of Medicare claims across the US.)

Finally, I experimented a few days on the site with hosting ads. I managed to serve up a few thousand and make 10 cents. So I will turn that off for now. I debated on putting the button for folks to donate a coffee, but even that is not necessary. (I can afford the few bucks for the domain, and I use dropbox to back up my files anyway, so hosting extra materials is not a big deal.) I rather folks just take my nerdy notes and make your own cool stuff (and share them with me!) I may need to figure out a better hosting solution for images though — google photos is continuing to give me troubles I see (so if you see an image is not coming through feel free to let me know in the comments or send me an email).

Overview of DataViz books

Keith McCormick the other day on LinkedIn the other day made a post/poll on his favorite data viz books. (I know Keith because I contributed a chapter on geospatial data analysis in SPSS in Keith and Jesus Salcedo’s book, SPSS Statistics for Data Analysis and Visualization, and Jon Peck contributed a chapter as well.)

One thing about this topical area is that there isn’t a standard Data Viz 101 curriculum. So if you pick up Statistics 101 books, they will cover pretty much all the same material (normal distribution, central limit theorem, t-tests, regression). It isn’t 100% overlap (some may spend more time on elementary probability, and others may cover ANOVA), but for someone learning the material there isn’t much point in reading multiple introductory stats books.

This is not so with the Data Viz books in Keith’s picture – they are very different in content. As I have read quite a few different books on the topic over the years I figured I would give my breakdown of the various books.

Albert Cairo’s The Functional Art

While my list is not in rank order, I am putting Cairo’s book first for a reason. Although there is not a Data Viz 101 curriculum, this book is the closest thing to it. Cairo goes through in short order various cognitive aspects on how we view the world that are fundamental to building good data visualizations. This includes things like it is easier to compare lengths along a common axis, and that we can perceive rank order to color saturation, but not to a color’s hue.

It is also enjoyable to read because of all the great journalistic examples. I did not care so much for the interviews at the back, and I don’t like the cover. But if I did a data viz course for undergrads in social sciences (Cairo developed this for journalism students), I would likely assign this book. But despite being very accessible, he covers a broad spectrum of both simple graphs and complicated scientific diagrams.

For this review many of these authors have other books. So I haven’t read Cairo’s The Truthful Art, so I cannot comment on it.

Edward Tufte’s The Visual Display of Quantitative Information

Tufte’s book was the first data viz book I bought in grad school. I initially invested in it as he had a chapter on a critique of powerpoint presentations, which is very straightforward and provides practical advice on what not to do. Most of the critiques of this book are that it is mostly just a collection of Tufte’s opinions about creating minimalist, dense, scientific graphs. So while Cairo dives into the science of perception, Tufte is just riffing his opinions. His opinions are based on his experience though, and they are good!

I believe I have read all of Tufte’s other books as well, but this is the only one that made much of an impression on me (some of his others go beyond graphs, and talk about UI design). I gobbled it up in only two days when I first started reading it, and so if I were stuck on an island with one book scenario I would choose this one over the others I list here (although again think Cairo’s book is the best to start with for most folks). So for scientists I think it is a good investment and an enjoyable read overall.

Nathan Yau’s Visualize This

Of all the books I review, Yau’s is the only how-to actually make graphs in software. Unfortunately, much of Yau’s programmatic advice was outdated already when it was published (e.g. flash was already going by the wayside). So while he has many great examples of creating complicated and beautiful data visualizations, the process he outlines to make them are overly complicated IMO (such as using python to edit parts of a pre-made SVG map). It is a good book for examples no doubt, and maybe you can pick up a few tricks in terms of post editing charts in a vector graphics program, such as Illustrator or Inkscape (most examples are making graphs in base R and then exporting to edit finishing touches).

In terms of making a how-to book it is really hard. Yau I am sure has updates on his Flowing Data website to make charts (and maybe his newer book is better). But I don’t think I would recommend investing in this book for anything beyond looking at pretty examples of data viz.

Stephen Kosslyn’s Graph Design for the Eye and Mind

The prior books all contained complicated, dense, scientific graphs. Kosslyn’s book is specifically oriented to making corporate slide decks/powerpoints, in which the audience is not academic. But his advice is mostly backed on his understanding of the psychology (he relegates extensive endnotes to point to scientific lit, to avoid cluttering up the basic book). He has as few gems of advice I admit, such as it isn’t the number of lines in a graph that make it complicated, but really the number of unique profiles. But then he has some pieces I find bizarre, such as saying pie charts are OK because they are so popular (so have survived a Darwinian survival process in terms of being presented to business people).

I would stick with Tufte’s powerpoint advice (and later will mention a few other books related to giving presentations), as opposed to recommending this book.

Alan MacEachren How maps work: Representation, visualization, and design

MacEachren’s book is encyclopedic in terms of scientific literature on design aspects of both cartography, as well as the psychological literature. So it is like reading an encyclopedia (not 100% sure if I ever finished it front to back to be honest). I would start here if you are interested in designing cognitive experiments to test certain graphs/maps. I think MacEachren pooling from cartography and psychology ends up being a better place to start than say Colin Ware’s Information Visualization (but it is close). They are both very academically oriented though.

Leland Wilkinson’s The Grammar of Graphics

I used SPSS for along time when I read this book, so I was already quite familiar with the grammar of graphics in terms of creating graphs in SPSS. That pre-knowledge helped me digest Wilkinson’s material I believe. Nick Cox has a review of this book, and for this one he notes that the audience for this book is hard to pin down. I agree, in that you need to be pretty far along already in terms of making graphs to be able to really understand the material, and as such it is not clear what the benefit is. Even for power users of SPSS, much of the things Wilkinson talks about are not implemented in SPSS’s GGRAPH language, so they are mostly just on paper.

(Note Nick has a ton of great reviews on Amazon as well for various data viz books. He is a good place to start to decide if you want to purchase a book. For example the worst copy-edited book I have ever seen is Andy Kirk’s via Packt publishing, and Nick notes how poorly it is copy-edited in his review.)

Here is an analogy I think is apt for Wilkinson’s book – if we are talking about cars, you may have a book on the engineering of the car, and another on how to actually drive the car. Knowing how pistons work in a combustible engine does not help you drive a car, but helps you build one. Wilkinson’s book is more about the engineering of a graph from an algebraic perspective. At the fringes it helps in thinking about the components of graphs, but doesn’t really give any advice about what graph to make in-and-of itself, nor what is a good graph or a bad graph.

Note that the R library ggplot2, is actually quite a bit different than Leland’s vision. It is simpler, in that Wickham essentially drops the graph algebra part, so you specify the axes directly, whereas in Wilkinson’s you just say X*Y*Z, and depending on other aspects of the grammer this may produce a 3d scatterplot, a facet gridded scatterplot, a clustered bar chart, etc. I think Wickham was right to make that design choice, but in doing so it really isn’t an implementation of what Wilkinson was talking about in this book.

Jacques Bertin’s Semiology of Graphics: Diagrams, Networks, Maps

Bertin’s book is an attempt to make a dictionary of terms for different aspects of graphs. So it is a bit in the weeds. One unique aspect of Bertin is that he discusses titles and labels for graphs, although I wouldn’t go as far as saying that his discussion leads to straightforward advice. I find Wilkinson’s grammer of graphics a more useful way to overall think about the components of a graph, although Bertin is more encyclopedic in coverage of different types of graphs and maps in the wild.

Short notes on various other books

Most of these books (with the exception of Nathan Yau’s) are not how-to actually write code to make graphs. For those that use R, there are two good options though. Hadley Wickham’s ggplot2: Elegant Graphics for Data Analysis (Use R!) was really good at the time (I am not sure if the newer version is more up to date though, like any software it changes over time so the older one I know is out of date for many different code examples). And though I’ve only skimmed it, Kieran Healy’s Data Visualization: A practical introduction is free and online and looks good (and also for those interested in criminal justice examples Jacob Kaplan has examples in R as well, Crime by the Numbers). So those later two I know are good in terms of being up to date.

For python I just suggest using google (Jake VanderPlas has a book that looks good, and his website is really good). For excel I really like Jorge Camões work (his book is Data at Work, which I don’t think I’ve read, but have followed his website for along time).

In terms of scientific presentations (which covers both graphs and text), I’ve highly suggested in the past Trees, maps, and theorems. This is similar in spirit to Tufte’s minimalist style, but gives practical advice on slides, writing, and presentations. Jon Schwabish’s book, Better Presentations: A Guide for Scholars, Researchers, and Wonks, is very good as well in terms of direct advice. I think for folks in academia I would say go for Doumont’s book, and for those in corporate environment go for Schwabish’s.

Stephen Few’s books deserve a mention here as well, such as Show me the numbers. Stephen is the only one to do a deep dive into the concept of dashboards. Stephen’s advice is very straightforward and more oriented towards a corporate type environment, not so much a scientific one (although it isn’t bad advice for scientists, ditto for Schwabish, just stating more so for an understanding of the intended audience).

I could go on forever, Tukey’s EDA, Calvin Schmid’s book on how to draw graphs with actual splines! How to lie with statistics and how to lie with maps. So many to choose from. But I think if you are starting out in a data oriented role in which you need to make graphs, I would suggest starting with Cairo’s book, then get Tufte to really get some artistic motivation and a good review of bad powerpoint practices. The rest are more advanced material for study though.

From a criminologist, we should restore voting rights

I have donated to the Southern Poverty Law Center in the past (recently my workplace, HMS, matched contributions). I no doubt do not 100% agree with their positions on every little detail (as is probably true for every organization in the criminal justice sphere) , but I believe they do good work. In particular I’ve always though that their identifying hate groups is a valuable public service, see the SPLC’s Hate Map.

They do more work than just the hate group map though. Recently they have been sending information on voter disenfranchisement. It is not uniform across states, but in many places if you have a felony conviction you have your rights to vote stripped entirely. It is even more severe in some places, in that you cannot vote if you simply owe fines or fees to the state.

I figured this would be a good blog post, as I have always had a more extreme view on this than most people. While most argue simply that individuals voting rights should be restored after an individuals imprisonment has ended, I don’t believe they should ever be stripped to begin with. Or more specifically, I believe people who are even currently incarcerated should be allowed to vote.

The reasons I have this opinion are relatively simple. First, there is no evidence that voter disenfranchisement acts as a deterrent to prevent someone from committing a crime. No one thinks, hey, I shouldn’t commit this robbery because I need to cast my ballot this fall. Restoring voting rights, even to those imprisoned, poses no threat to public safety.

The second reason I support restoring voting rights is because an important part of offender reintegration into society is to participate in civil matters. We don’t lock people up and throw away the keys, so we should take steps to help those former offenders come back and have a positive contribution to our society. What simpler way than to allow those individuals to engage in the voting process? (The foremost authority on this subject is Vesla Weaver.)

You may ask how would voting in prison work? For voting in prison the location of the vote should not count where the jail is located, but wherever the last address of the offender was before they were incarcerated. This brings up another issue, that certain state census counting procedures count individuals incarcerated at the location of the prison. This results in gerrymandering, where typically rural areas with prisons get more electoral representation, even though for the most part those individuals have no voting rights.

I believe we would be better off as a nation if not only everyone was allowed to vote, but that everyone did vote.

Open Source Criminology Related Network Datasets

So I am a big proponent of open source data analysis. There is a problem with using criminal justice data sources though – they often have private information that prevents us from sharing the data. For example, I have posted quite a few of my projects here (mostly spatial data analysis), but there are a few I cannot share. For example, I worked on a paper with chronic offender predictions, and I cannot share that data (Wheeler et al., 2019). The outcome, being a victim or perpetrator of gun violence, is so rare that by itself basically makes it impossible to publicly share the data without exposing the individuals under study.

One good resource all criminologists should be aware of is ICPSR, in particular NACJD. Many datasets on there though anymore are restricted, in that you need to get IRB permission and ICPSR permision to download the dataset to use. (Which typically takes like 2~3 months in my experience doing it a few times, which includes both your local Uni IRB and the ICPSR process.) For example here is one I went through the motions to get to (in the end) validate different survival prediction methods.

ICPSR is a great resource to be able to handle sharing potentially sensitive data. But this falls short in two areas. One is in teaching – you cannot go through the IRB ritual in a timely enough fashion to be able to use those datasets in a course environment. The other is in terms of methods, so for example if you wanted to say your model provides better predictions than some other model, they should be established on the same datasets. Current state of affairs in criminology in this regard is pretty bad to be curt – most everybody uses their own data they have access to. So much of the research on different risk assessment instruments for bail/probation/parole are pretty much impossible to say one is better than another.

One example type of data source that is almost entirely missing from NACJD (that I am aware of) is social network datasets relevant for criminology/criminal justice. So I have started a spreadsheet to collate different open source network datasets relevant for criminologists. So I have some from my work and a few other random examples I have come across on the internet.

SPREADSHEET OF NETWORK DATASETS

I have made that spreadsheet open, so anyone should be able to edit in more sources. (Feel free to include links to ICPSR as well, but if you do edit a note to say whether it is restricted access or not.) For here I would be interested in really large networks, for example would love to try to replicate Marie’s work on gang network transitions (Oullet et al., 2019a).

And also while I am here, Jacob Young has created a very nice introductory course to social network analysis. I have a brief lecture in my advanced research design class, but Jacob’s is much more thorough (and he is more of an expert in this area than I am for sure).

I will add to that spreadsheet over time as well. I have made a separate sheet for survival analysis datasets. I would be particularly keen for example criminal justice examples. So for network analysis we have examples of looking at use-of-force networks (Oullet et al., 2019b), and for survival analysis I would be interested in a time to solve example dataset. Unfortunately for the solved cases, NIBRS is a good resource but has a large confound in they don’t measure whether a case was ever assigned to a detective.

Feel free to add whatever in that spreadsheet, but what I was thinking was oriented towards different methods (again as a main motivation is for teaching). So for example if you knew of datasets for age-period-cohort modelling, or for estimating group-based-trajectory models, I think those would be good examples to start new sheets and collate different data sources.

References

  • Ouellet, M., Bouchard, M., & Charette, Y. (2019a). One gang dies, another gains? The network dynamics of criminal group persistence. Criminology, 57(1), 5-33.
  • Ouellet, M., Hashimi, S., Gravel, J., & Papachristos, A. V. (2019b). Network exposure and excessive use of force: Investigating the social transmission of police misconduct. Criminology & Public Policy, 18(3), 675-704.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The accuracy of the violent offender identification directive tool to predict future gun violence. Criminal Justice and Behavior, 46(5), 770-788.

Using the Google Vision and Streetview API to Explore Hotspots

So previously I have shown how to automate the process of downloading google street view imagery (for individual addresses & running down a street). One interesting application is to then code those streetview images. There are many applications in criminology of coding these images for disorder. So Rob Sampson initially had the idea of ecometrics, in which he used systematic social observations via taking a video going down various streets to code physical disorder, such as garbage on the street (Raudenbush & Sampson, 1999). Others than leveraged Google streetview imagery to do those same audits instead of collecting their own footage (Bader et al., 2017).

Those are all someone looks at the images and a human says, there is XYZ in this photo and ABC in this photo. I was interested in testing out the Google Vision API to automate identifying parts of the images. So instead of a human manually reviewing, you build a score automatically. See for example work on identifying the percieved safety of streets (Naik et al., 2014).

Here I was motivated by some recent work of a colleague, Nate Connealy, in which he used this imagery to identify the differences in hot spots vs not hot spots (Connealy, 2020). Also I am pretty sure I saw George Mohler present on this at some ASC before I had the idea (it was similar to this paper, Khorshidi et al., 2019, not 100% sure it was the same one though). For an overview of crim applications using streetview and google maps, which also span CPTED type analyses, check out Vandeviver (2014).

So with Google’s automated vision API, if I submit this photo of a parking garage (this is actually the image I get if I submit the address Bad Address, Dallas, TX to the streetview API, so take in mind errors like that in my subsequent analysis).

You get back these labels, where the first item is the description and the second is the ‘score’ for whether the item is in the image:

('Architecture', 0.817379355430603),
('Floor', 0.7577666640281677),
('Room', 0.7444316148757935),
('Building', 0.7440816164016724),
('Parking', 0.7051371335983276),
('Ceiling', 0.6624311208724976),
('Flooring', 0.6004095673561096),
('Wood', 0.5958532094955444),
('House', 0.5928719639778137),
('Metal', 0.5114516019821167)

So I don’t tell Google what to look for, it just gives me back a ton of different labels depending on what it detects in the image. So what I do here is based on my hotspot work (Wheeler & Reuter, 2020), I grab a sample of 300 addresses inside my Dallas based hot spot areas, and 300 addresses outside of hot spots. (These addresses are based on crime data themselves, so similar to Nate’s work I only sample locations that at least have 1 crime).

So this isn’t a way to do predictions, but I think it is potentially interesting application of exploratory data analysis for hot spots or high crime areas.

Python Code Snippet

I am just going to paste the python code-snippet in its entirety.

'''
Grabbing streetview images and detecting
labels using the google vision API
'''

from google.cloud import vision
import pandas as pd
import io
import os
import urllib
import time

os.chdir(r'D:\Dropbox\Dropbox\Documents\BLOG\GoogleLabels_hotspots\analysis')

add_dat = pd.read_csv('Sampled_Adds.csv')
add_dat['FullAdd'] = add_dat['IncidentAddress'] + ", DALLAS, TX"

# Code to download image based on address 
# https://andrewpwheeler.com/2015/12/28/using-python-to-grab-google-street-view-imagery/

myloc = r"./Images" #replace with your own location
key = "&key=????YourKeyHere????" 

def GetStreet(Add,SaveLoc,Name):
  base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
  MyUrl = base + urllib.parse.quote_plus(Add) + key #added url encoding
  fi = Name + ".jpg"
  loc_tosav = os.path.join(SaveLoc,fi)
  urllib.request.urlretrieve(MyUrl, loc_tosav)

# Code to get the google vision API labels
# for the image

client = vision.ImageAnnotatorClient.from_service_account_json('Geo Dallas-b5543ff0bb6d.json')

def LabelImage(ImageLoc):
    # Loads the image into memory
    with io.open(ImageLoc, 'rb') as image_file:
        content = image_file.read()
    image = vision.types.Image(content=content)
    response = client.label_detection(image=image)
    labels = response.label_annotations
    res = []
    if response.error.message:
        print(f'Error for image {ImageLoc}')
        print(f'Error Message {response.error.message}')
        res.append( ('Error', 1.0 ) )
    else:
        res = []
        for l in labels:
            res.append( (l.description , l.score) )
    return res

#A random parking garage!
GetStreet('Bad Address, Dallas, TX',myloc,'Bad_Address')    
LabelImage(os.path.join(myloc,'Bad_Address.jpg'))
            
long_tup = []
for index, row in add_dat.iterrows():
    #Name of the image
    nm = str(index) + "_" + str(row['Inside'])
    #Download the image    
    GetStreet(row['FullAdd'],myloc,nm)
    #Get the labels
    labs = LabelImage(os.path.join(myloc,nm + '.jpg'))
    #Build the new data tuples
    for l in labs:
        long_dat = (index, nm +'.jpg', row['Inside'], row['FullAdd'], l[0], l[1])
        long_tup.append(long_dat)
    #Sleep for a second to not spam the servers
    time.sleep(1)
    print(f'Done with index {index}')

long_dat = pd.DataFrame(long_tup, 
                        columns=['Index','Image','Inside','Address','Description','Score'])
            
long_dat.to_csv('LabeledData.csv',index=False)

To get this to work you need a few things. First, you need to enable both the Vision API and the Streetview API in your Google API console. The streetview API has a key you can get directly from the API console (as described in my prior posts). But the vision API is different, and you can download a json file with all the necessary info and feed it into the client call. Once that is all done, you have it set up to query both API’s to get the images and then get the labels. But this is quick and dirty, it does not check for errors in either.

Here is a screenshot of some of the images downloaded, you can see that the streetview API doesn’t fail when their is no image available, it just does a mostly blank gray screenshot.

Analyzing the Results

I am not above just piping the results into an Excel document and doing some quick pivot tables. (I like doing that when there are many categories I want to explore quickly.) So here is a pivot table of the sum of the scores across the 300 outside hotspot (column 0) and 300 inside (column 1) images. So you can see the label of property is in more than half of the images for each (since the score value is never above 1). But property is more common outside hot spots than it is inside hot spots.

Here are contrast coded sums, so these identify the different labels that are more common in either hotspots or outside of hotspots. So outside of hotspots trees and plants appear more common (see Kondo et al., 2017 and Kondo’s other work on the topic). Inside hotspots we have more cars & asphault for examples.

This is just a quick and dirty analysis though. I do not take into account here missing images. The Screenshot label shows missing images are more common inside hotspots. And here since I use the addresses sometimes it gives me a shot of the street instead of the view perpendicular to the street. (I am not 100% sure the best way to do it, if you geocode and then use the lat/lon, you may not have the right view of the property either depending on the geocoding engine, so maybe going with the address directly is better?)

Future Work

In terms of predictive applications, I think using the streetview imagery is not likely to improve crime forecasts, that it is really only worthwhile for EDA or theory testing. In terms of predictive analysis, I actually think using the satellite imagery has more potential (see Jay, 2020 for an example, although that isn’t predictive but causal analysis).

So prior work has used 311 calls for service to identify high disorder areas (Magee, 2020; O’Brien & Winship, 2017; Wheeler, 2018), so I wonder if you can specifically build an image detector to identify particular disorder aspects that are not redundant with 311 calls. And also perhaps scales directly relevant to CPTED. The Google Vision labels are a bit superficial to really use for many theory crim applications I am afraid, but is an interesting exploratory data analysis to check them out.

References

Incorporating treatment non-compliance into call-ins

I have previously published work on identifying optimal individuals to prioritize for call-ins in Focused Deterrence interventions. The idea is we want to identify optimal people to spread the message, so you call in a small number of individuals and they should spread the message to the remaining group. There are better people than others to seed the message to to make sure it spreads throughout the network.

I knew of a direct improvement on that algorithm I published (very similar to the TURF problem I described the other day). But the bigger issue was that even when you call in individuals they do not always come to the meeting – treatment non-compliance. When working with state parole and/or local probation, the police department can ask those agencies to essentially make people come in, but otherwise it is voluntary.

The TURF problem I did the other day gave me a bit of inspiration on how to tackle that treatment non-compliance problem though. In a nutshell when you calculate whether someone is reached (via being directly connected to someone called-in), they can be partially reached based on the probability of the selected nodes treatment compliance. I have posted the code to follow along on dropbox here. I won’t go through the whole thing, but just some highlights.

The Model

First, in some quick and dirty text math, the model is:

Maximize Sum( R_i )

Subject to:

  • R_i <= Sum( S_j*p_j ) for each i
  • Sum( S_j ) = k
  • S_i element of [0,1]
  • R_i <= 1 for each i

Here i refers to an individual node in the gang/group network.

The first constraint R_i <= Sum( S_j*p_j ), the j’s are the nodes that are connected to i (and i itself). The p_j are the estimates that an individual will comply with coming into the call-in. For one agency we worked with for that project, they guessed that those who don’t need to come in comply about 1/6th of the time, so I use that estimate here in my examples, and give people who are on probation/parole a 1 for the probability of compliance.

Second constraint is we can only call in so many people, here k. The model solves very fast, so you can generate results for various k until you get the reach you want to in the end. (You could do the model the other way, minimize S_i while constraining the minimized acceptable reach, e.g. Sum( R_i ) >= threshold, I don’t suggest this in practice though, as when dealing with compliance there may be no feasible solution that gets you the amount of reach in the network you want.)

For the third constraint, the decision variables S_i are binary 0/1’s, but the R_i are continuous. But the trick here is that the last constraint, R_i <= 1, means that the expected reach is capped at 1. Here is a way to think about this, imagine you want to know the chance that person A is reached, and they are connected to two called-in individuals, who each have a 40% chance at complying with the treatment (coming to the call-in). The expected times person A would be reached then is additive in the probabilities, 0.4 + 0.4 = 0.8. If we had 3 people connected to A again at 40% apiece, the expected number of times A would be reached is then 0.4 + 0.4 + 0.4 = 1.2. So a person can be reached multiple times. (Note this is not the probability a person is reached at least once! It is a non-linear problem to model that.)

But if we took away the last constraint, what would happen is that the algorithm would just pick the nodes that had the highest number of neighbors. Since we are maximizing expected reach, if we had a sample of two people, the expected reach values of [2.5, 0] would be preferable to [1, 1], although clearly we rather have the reach spread out. So to prevent that, I cap the expected reach variable at 1, R_i <= 1 for each i, so this spreads out the selected individuals. So in the end the expected number of times people are reached are a lower bound estimate, but those are only people who are expected to receive the message multiple times.

This is a bit of a hack, but in my tests works quite well. I attempted to model the non-linear problem of estimating the probabilities at the person level and still maximizing the expected reach (in the code I have an example of using the CVXR R package). But it was quite fickle in when it would return a solution. So I am focusing on the linear program here, which is not perfect, but is an improvement over my prior published work.

Some Python Snippets

So for my example code, I am using City 4 Gang 4 from my paper. The reason is this was the largest network, and my original algorithm performed the worst. 99 nodes, and my original algorithm identified a 33 person dominant set, but Borgotti’s tool (that uses a genetic algorithm) identified a 29 dominant set.

Here is an example of calling my function to select the individuals for a call-in based on the non-compliance estimates. (g4 is the networkx graph object, the second arg is the number of individuals, and compliance is the node attribute that has the probability of treated compliance.) If we call in only 5 people, we still expect a reach of 29 individuals. Here there ends up being some highly connected people on parole/probation, so they have a 1 probability of complying with the treatment.

A consequence of this algorithm is that if you pipe in 1’s for the treatment compliance, you basically get an improvement to my original algorithm. So for a test we can see if I get the same minimal dominating set as Borgotti did for his algorithm here, where const is just everybody complies 100% of the time.

And yep we get a dominating set (all 99 people are reached). What happens if we go down one, and only select 28 people?

We only reach 98 out of the 99. So it appears a 29 set is the minimal dominating set here. But like I said the treatment non-compliance is a big deal in this setting. What is our expected reach if we take that into account, but still call-in 29 people?

It is still pretty high, around 2/3s of the network, but is still much smaller. Also if you look at the overlap between the constant versus non-compliance model, they select quite a few different individuals. It makes a big difference.

Here is a graph I made of selecting 20 individuals. Red means I selected that person, pink means they are reached at least some, and the size of the reach is proportion to the node. Then grey folks I wouldn’t expect to be reached by the message (at least by first degree connections).

So you can see that most of the people selected have that full 1 expected reach, so the algorithm does prioritize individuals on probation/parole who have a 100% expected compliance. But you can see a few folks who have a lower compliance who are selected as they are in places in the network not covered by those on probation/parole.

I have a tough time getting network layouts to look nice in python (even with the same layout algorithms, I feel like igraph in R just looks much better out of the box).

Future Work

Out of the box, this algorithm could incorporate several different pieces of information. So here I use the non-compliance estimate as a constant, but you could have varying estimates for that based on some other model no problem (e.g. older individuals comply more often than younger, etc.). Also another interesting extension (if you could get estimates) would be the probability a called-in individual spreads the message. In the part Sum( S_j*p_j ) it would just be something like Sum( S_j*p_cj*p_sj ), where p_cj is the compliance probability for attending, and p_sj is the probability to spread the message to those they are connected to.

Getting worthwhile estimates for either of those things will be tough though. Only way I can see it is via some shoe leather qualitative or survey approach.

Simulating runs of events

I still lurk on the Cross Validated statistics site every now and then. There was a kind of common question about the probability of a run of events occurring, and the poster provided a nice analytic solution to the problem using Markov Chains and absorbing states I was not familiar with.

I was familiar with a way to approximate the answer though using a simple simulation, and encoding data via run length encoding. Run length encoding works like this, if you have an original sequence that is AABBBABBBB, then the run length encoded version of this sequence is:

A,2
B,2
A,1
B,4

This is a quite convenient sparse data format to be familiar with. E.g. if you are using tensors in various deep learning libraries, you can encode the data like this and then stack the tensor. But the stacked tensor is just a view, so it doesn’t take up as much memory as the initial full tensor.

Using this encoding also makes a simulation to answer the question, how often do runs of 5+ occur in this hypothetical experiment quite easy to estimate. You just calculate the run length encoded version of the data, and see if any of the lengths are equal to or greater than 5. Below are code snippets in R and Python.

While the analytic solution is of course preferable when you can figure it out, simulations are nice to test whether the solution is correct, as well as to provide an answer when you are not familiar with how to analytically derive a solution.

R Code

R has a native run-length encoding command, rle. The reason is that runs tests are a common time series technique for looking at randomness. Encourage you to run the code yourself to see how my simulated answer lines up with the analytic answer provided on the stats site!

##########################################
# R Code
set.seed(10)
die <- 1:6
run_sim <- function(rolls=1000, conseq=5){
    test <- sample(die,rolls,TRUE)
    res <- max(rle(test)$lengths) >= conseq
    return(res)
}

sims <- 1000000
results <- replicate(sims, run_sim(), TRUE)
print( mean(results) )
##########################################

Python Code

The python code is very similar to the R code. Main difference is there is no native run length encoding command in numpy or scipy I am aware of (although there should be)! So I edited a function I found from Stackoverflow to accomplish the rle.

##########################################
# Python code

import numpy as np
np.random.seed(10)

# Edited from https://stackoverflow.com/a/32681075/604456
# input numpy arrary, return tuple (lengths, vals)
def rle(ia):
    y = np.array(ia[1:] != ia[:-1])         # pairwise unequal (string safe)
    i = np.append(np.where(y), len(ia) - 1) # must include last element
    z = np.diff(np.append(-1, i))           # run lengths
    return (z, ia[i])

die = list(range(6))

def run_sim(rolls=1000, conseq=5):
    rlen, vals = rle(np.random.choice(a=die,size=rolls,replace=True))
    return rlen.max() >= conseq

sims = 1000000
results = [run_sim() for i in range(sims)]
print( sum(results)/len(results) )
##########################################

I debated on expanding this post to show how to do these simulations in parallel, this is a bit of a cheesy experiment to show though. To do 1 million simulations on my machine still only takes like 10~20 seconds for each of these code snippets. So that will have to wait until another post!

You may be thinking why do I care about runs of dice rolls? Well, it can be extended to many different types of time series monitoring problems. For example, when I worked as a crime analyst at Troy I thought about this in terms of analyzing domestic violent reports. They were too numerous for me to read through every report, so I needed to devise a system to identify if there were anomalous patterns in the recent number of reports. You could devise a test here, say how many days of 10+ reports in a row, and see how frequently you would expect that occur in say a year of monitoring. The simulations above could easily be amended to do that, via doing simulations of the Poisson distribution instead of dice rolls, or assigning weights to particular outcomes.

Street Network Distances and Correlations

Wouter Steenbeek (a friend and co-author for a few articles) has a few recent blog posts replicating some of my prior work replicating some of my work on street network vs Euclidean distances in Albany, NY (Wouters, 1, 2) and my posts (1,2).

In Wouter’s second post, he was particularly interested in checking out shorter distances (as that is what we are often interested in in criminology, checking crime clustering). When doing that, the relationship between network and Euclidean distances sometimes appear less strong, so my initial statement that they tend to be highly correlated is incorrect.

But this is an artifact for the correlation between any two measures – worth pointing out in general for analysis. If you artificially restrict the domain of one variable the correlation always goes down. See some examples on the cross-validated site (1, 2) that illustrate this with nicer graphs than I can whip up in a short time.

But for a quick idea about the issue, imagine a scenario where you slice out Euclidean distances in some X bin width, and check the scatterplot between Euclidean and network distances. So you will get less variation on the X axis, and more variation on the Y axis. Now take this to the extreme, and slice on Euclidean distances at only one value, say 100 meters exactly. In this scatterplot, there is no X variation, it is just a vertical line of points. So in that scenario the correlation is 0.

So I should not say the correlation between the two measures is high, as this is not always true – you can construct an artificial sample in which that statement is false. So a more accurate statement is that you can use the Euclidean distance to predict the network distance fairly accurately, or that the linear relationship between Euclidean and network distances is quite regular – no matter what the Euclidean distance is.

My analysis I have posted the python code here. But for a quick rundown, I grab the street networks for a buffer around Albany, NY using the osmnx library (so it is open street map network data). I convert this street network to an undirected graph (so no worrying about one-way streets) in a local projection. Then using all of the intersections in Albany (a few over 4000), I calculate all of the pairwise distances (around 8.7 million pairs, takes my computer alittle over a day to crunch it out in the background).

So again, the overall correlation is quite high:

But if you chunk the data up into tinier intervals, here 200 meter intervals, the correlations are smaller (an index of 100 means [0-200), 300 means [200-400), etc.).

But this does not mean the linear relationship between the two change. Here is a comparison of the linear regression line for the whole sample (orange), vs a broken-stick type model (the blue line). Imagine you take a slice of data, e.g. all Euclidean distances in the bin [100-200) and fit a regression line. And then do the same for the Euclidean distances [200-300) etc. The blue line here are those regression fits for each of those individual binned estimates. You can see that the two estimates are almost indistinguishable, so the relationship doesn’t change if you subset the data to shorter distances.

Technically the way I have drawn the blue line is misleading, I should have breaks in the line (it is not forced to be connected between bins, like my post on restricted cubic splines is). But I am too lazy to write code to do those splits at the moment.

Now, what does this mean exactly? So for research designs that may want to use network distances and an independent variable, e.g. look at prison visitation as a function of distance, or in my work on patrol redistricting I had to impute some missing travel time distances, these are likely OK to use typical Euclidean distances. Even my paper on survivability for gun shot fatality shows improved accuracy estimates using network distances, but very similar overall effects compared to using Euclidean distances.

So while here I have my computer crunch out the network distances for a day, where the Euclidean distances with the same data only takes a second, e.g. using scipy.spatial.distance. So it depends on the nature of the analysis whether that extra effort is worth it. (It helps to have good libraries ease the work, like here I used osmnx for python, and Wouter showed R code using sf to deal with the street networks, hardest part is the networks are often not stored in a way that makes doing the routing very easy. Neither of those libraries were available back in 2014.) Also note you only need to do the network calculations once and then can cache them (and I could have made these network computations go faster if I parallelized the lookup). So it is slightly onerous to do the network computations, but not impossible.

So where might it make a difference? One common use of these network distances in criminology is for analyses like Ripley’s K or near-repeat patterns. I don’t believe using network distances makes a big deal here, but I cannot say for sure. What I believe happens is that using network distances will dilate the distances, e.g. if you conclude two point patterns are clustered starting at 100 meters using Euclidean distances, then if using network it may spread out further and not show clustering until 200 meters. I do not think it would change overall inferences, such as where you make an inference whether two point patterns are clustered or not. (One point is does make a difference is doing spatial permutations in Ripley’s K, you should definitely restrict the simulations to generating hypothetical distributions on the street network and not anywhere in the study area.)

Also Stijn Ruiter makes the point (noted in Wouter’s second post), that street networks may be preferable for prediction purposes. Stijn’s point is related to spatial units of analyses, not to Euclidean vs Network distances. You could have a raster spatial unit of analysis but incorporate street network statistics, and vice-versa could have a vector street unit spatial unit of analysis and use Euclidean distance measures for different measures related to those vector units.

Wouter’s post also brought up another idea I’ve had for awhile, that when using spatial buffers around areas they can be bad control areas, as even if you normalize the area they have a very tiny sliver of network distance attributable to them. I will need to show that for another blog post though. (This was mostly my excuse to learn osmnx to do the routing!)

Recent Papers on Hot Spots of Crime in Dallas

So I have two different papers that were published recently. Both are on hot spots in Dallas, so might as well discuss them together.

For each I have posted the code to replicate the results (and that spreadsheet has links to preprints as well).

For each as a bit of a background as to the motivation for the projects, Dallas has had official hot spots, named TAAG (Target Area Action Grid). These were clearly larger than what would be considered best practice in identifying hot spots (they were more like entire neighborhoods). I realize ‘best practices’ is a bit wishy-washy, but the TAAG areas ended up covering around 20% of the city (a smidge over 65 square miles). Here is a map of the 2017 areas. There were 54 TAAG areas that covered, so on average each is alittle over 1 square mile.

Additionally I knew the Dallas police department was interested in purchasing the RTM software to do hot spots. And a separate group, the Dallas Crime Task Force was interested in using the software as well for non-police related interventions.

So I did these projects on my own (with my colleagues Wouter and Sydney of course). It wasn’t paid work for any of these groups (I asked DPD if they were interested, and had shared my results with folks from CPAL before that task force report came out, but nothing much came of it unfortunately). But my results for Dallas data are very likely to generalize to other places, so hopefully they will be helpful to others.

Machine Learning to Predict and Understand Hot Spots

So I see the appeal for folks who want to use RTM. It is well validated in both theory and practice, and Joel has made a nice software as a service app. But I knew going in that I could likely improve upon the predictions compared to RTM.

RTM tries to find a middle ground between prediction and causality (which isn’t a critique, it is sort of what we are all doing). RTM in the end spits out predictions that are like “Within 800 feet of a Subway Entrances is Risk Factor 1” and “The Density of Bars within 500 Feet is Risk Factor 2”. So it prefers simple models, that have prognostic value for PDS (or other agencies) to identify potential causal reasons for why that location is high crime. And subsequently helps to not only identify where hot spots are, but frame the potential interventions an agency may be interested it.

But this simplicity has a few drawbacks. One is that it is a global model, e.g. “800 feet within a subway entrance” applies to all subway entrances in the city. Most crime generators have a distribution that makes it so most subway entrances are relatively safe, only a few end up being high crime (for an example). Another is that it forces the way that different crime generators predict crime to be a series of step functions, e.g. “within 600 ft” or “a high density within 1000 ft”. In reality, most geographic processes follow a distance decay function. E.g. if you are looking at the relationship between check-cashing stores and street robbery, there are likely to be more very nearby the store, and it tails off in a gradual process the further away you get.

So I fit a more complicated random forest model that has neither of those limitations and can learn much more complicated functions, both in terms of distance to crime generators as well as spatially varying over the city. But because of that you don’t get the simple model interpretation – they are fundamentally conflicting goals. In terms of predictions either my machine learning model or a simpler comparison of using prior crime = future crime greatly outperforms RTM for several different predictive metrics.

So this shows the predictions are better for RTM no matter how you slice the hot spot areas, but again you lose out the prognostic value of RTM. To replace that, I show local interpretability scores for hot spots. I have an online map here for an example. If you click on one of the high crime predicted areas, it gives you a local breakdown of the different variables that contributes to the risk score.

So it is still more complicated than RTM, but gets you a local set of factors that potentially contribute to why places are hot spots. (It is still superficial in terms of causality, but PDs aren’t going to be able to get really well identified causal relationships for these types of predictions.)

Return on Investment for Hot Spots Policing

The second part of this is that Dallas is no doubt in a tight economic bind. And this was even before all the stuff about reforming police budgets. So policing academics have been saying PDs should shift many more resources from reactive to proactive policing for years. But how to make the argument that it is in police departments best interest to shift resources or invest in additional resources?

To do this I aimed to calculate a return on investment on investing in hot spots policing. Priscilla Hunt (from RAND) recently came up with labor cost estimates for crime specifically relevant for police departments. So if an aggravated assault happens PDs (in Texas) typically spend around $8k in labor costs to respond to the crime and investigate (it is $125k for a homicide). So based on this, you can say, if I can prevent 10 agg assaults, I then save $80k in labor costs. I use this logic to estimate a return on investment for PDs to do hot spots policing.

So first I generate hot spots, weighting for the costs of those crimes. Here is an interactive map to check them out, and below is a screenshot of the map.

I have an example of then calculating a return on investment for the hot spot area that captured the most crime. I get this estimate by transforming meta-analysis estimates of hot spots policing, estimating an average crime reduction, and then backing out how much labor costs that would save a police department. So in this hot spot, an ROI for hot spots policing (for 1.5 years) is $350k.

That return would justify at least one (probably more like two) full time officers just to be assigned to that specific hot spot. So if you actually hire more officers, it will be around net-zero in terms of labor costs. If you shift around current officers it should be a net gain in labor resources for the PD.

So most of the hot spots I identify in the study if you do this ROI calculation likely aren’t hot enough to justify hot spots policing from this ROI perspective (these would probably never justify intensive overtime that is typical of crackdown like interventions). But a few clearly are, and definitely should be the targets of some type of hot spot intervention.

A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?

You may think this is a bit far-fetched to my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if also there is an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis, below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.

Frontmatter

Here I am using pulp to fit the linear programming, reading in the data, and I am making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.

############################################################
#FRONT MATTER 

import pandas as pd
import pulp
import os

os.chdir('D:\Dropbox\Dropbox\Documents\BLOG\TURF_Analysis\Analysis')

#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

#Need 27 total of match up simulated data
main = ['Steak',
        'Pizza',
        'FriedChicken',
        'BBQChicken',
        'GrilledSalmon',
        'FriedHaddock',
        'LemonHaddock',
        'Roast',
        'Burger',
        'Wings']

sides = ['CeaserSalad',
         'IcebergSalad',
         'TomatoSoup',
         'OnionSoup',
         'Peas',
         'BrusselSprouts',
         'GreenBeans',
         'Corn',
         'DeviledEggs',
         'Pickles']

desserts = ['ChoclateIceCream',
            'VanillaIceCream',
            'ChocChipCookie',
            'OatmealCookie',
            'Brownie',
            'Blondie',
            'CherryPie']

#Renaming columns
surv_data.columns = main + sides + desserts

#Replacing the likert scale data with 0/1
surv_data.replace([1,2,3],0,inplace=True)
surv_data.replace([4,5],1,inplace=True)

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']
############################################################

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.

############################################################
#TRADITIONAL TURF ANALYSIS

k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
    
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model
P.solve()

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )

print(picked) 

#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people
############################################################

One of the things I have done here is to create a ‘weight’ variable associated with each customer. So here I say all of the customers weights are all equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however consider the likert scale rating 1 to 5 directly, subtract 1 and divide by 4. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_i decision variable to a continuous value, and then maybe cap it at 1 (so you can partially reach customers).

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.

############################################################
#CONSTRAINTS ON ITEMS SELECTED

#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides (but could have 0)
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
P2.solve()
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
print(picked2)
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]
############################################################

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc, but each have different costs. So instead of the constraint Prod_j == k where you select so many products, you have the constraint Prod_j*Cost_j <= Budget, where each product is associated with a particular cost.

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach while minimizing cost). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people). Our formula would then be below (where I make up slightly different costs for buffet each of the buffet options).

############################################################
#MINIMIZE COST

#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results
Pmin.solve()

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (n+1,j,cost_prod[j]) )
        cost += cost_prod[j]
print(picked)
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5
############################################################

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we need at a minimum to serve only 3 items to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate constraints for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I’m wondering as well if there is a way to incorporate marginal benefits of sales into this as well, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won’t necessarily buy every flavor if I visit the shop). Similar for marketing adverts some forms may have better reach, but may have worse conversion rates.