Generating multiple solutions for linear programs

I had a question the other day on my TURF analysis about generating not just the single best solution for a linear programming problem, but multiple solutions. This is one arena I like the way people typically do genetic algorithms, in that they have a leaderboard with multiple top solutions. Which is nice for exploratory data analysis.

This example though it is pretty easy to amend the linear program to return exact solutions by generating lazy constraints. (And because the program is fast, you might as well do it this way as well to get an exact answer.) The way it works here, we have decision variables for products selected. You run the linear program, and then collect the k products selected, then add a constraint to the model in the form of:

Sum(Product_k) <= k-1 

Then rerun the model with this constraint, get the solution, and then repeat. This constraint makes it so the group of decision variables selected cannot be replicated (and in the forbidden set). Python’s pulp makes this quite easy, and here is a function to estimate the TURF model I showed in that prior blog post, but generate this multiple leaderboard:

import pandas as pd
import pulp

# Function to get solution from turf model
def get_solution(prod_dec,prod_ind):
    '''
    prod_dec - decision variables for products
    prod_ind - indices for products
    '''
    picked = []
    for j in prod_ind:
        if prod_dec[j].varValue >= 0.99:
            picked.append(j)
    return picked

# Larger function to set up turf model 
# And generate multiple solutions
def turf(data,k,solutions):
    '''
    data - dataframe with 0/1 selected items
    k - integer for products selected
    solutions - integer for number of potential solutions to return
    '''
    # Indices for Dec variables
    Cust = data.index
    Prod = list(data)
    # Problem and Decision Variables
    P = pulp.LpProblem("TURF", pulp.LpMaximize)
    Cu = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust], lowBound=0, upBound=1, cat=pulp.LpInteger)
    Pr = pulp.LpVariable.dicts("Products Selected", [j for j in Prod], lowBound=0, upBound=1, cat=pulp.LpInteger)
    # Objective Function (assumes weight = 1 for all rows)
    P += pulp.lpSum(Cu[i] for i in Cust)
    # Reached Constraint
    for i in Cust:
        #Get the products selected
        p_sel = data.loc[i] == 1
        sel_prod = list(p_sel.index[p_sel])
        #Set the constraint
        P += Cu[i] <= pulp.lpSum(Pr[j] for j in sel_prod)
    # Total number of products selected constraint
    P += pulp.lpSum(Pr[j] for j in Prod) == k
    # Solve the equation multiple times, add constraints
    res = []
    km1 = k - 1
    for _ in range(solutions):
        P.solve()
        pi = get_solution(Pr,Prod)
        # Lazy constraint
        P += pulp.lpSum(Pr[j] for j in pi) <= km1
        # Add in objective
        pi.append(pulp.value(P.objective))
        res.append(pi)
    # Turn results into nice dataframe
    cols = ['Prod' + str(i+1) for i in range(k)]
    res_df = pd.DataFrame(res, columns=cols+['Reach'])
    res_df['Reach'] = res_df['Reach'].astype(int)
    return res_df

And now we can simply read in our data, turn it into binary 0/1, and then generate the multiple solutions.

#This is simulated data from XLStat
#https://help.xlstat.com/s/article/turf-analysis-in-excel-tutorial?language=en_US
prod_data = pd.read_csv('demoTURF.csv',index_col=0)

#Replacing the likert scale data with 0/1
prod_bin = 1*(prod_data > 3)

# Get Top 10 solutions
top10 = turf(data=prod_bin,k=5,solutions=10)
print(top10)

Which generates the results:

>>> print(top10)
        Prod1       Prod2       Prod3       Prod4       Prod5  Reach
0   Product 4  Product 10  Product 15  Product 21  Product 25    129
1   Product 3  Product 15  Product 16  Product 25  Product 26    129
2   Product 4  Product 14  Product 15  Product 16  Product 26    129
3  Product 14  Product 15  Product 16  Product 23  Product 26    129
4   Product 3  Product 15  Product 21  Product 25  Product 26    129
5  Product 10  Product 15  Product 21  Product 23  Product 25    129
6   Product 4  Product 10  Product 15  Product 16  Product 27    129
7  Product 10  Product 15  Product 16  Product 23  Product 27    129
8   Product 3  Product 10  Product 15  Product 21  Product 25    128
9   Product 4  Product 10  Product 15  Product 21  Product 27    128

I haven’t checked too closely, but this looks to me offhand like the same top 8 solutions (with a reach of 129) as the XLStat website. The last two rows are different, likely because there are further ties.

For TURF analysis, you may actually consider a lazy constraint in the form of:

Product_k = 0 for each k

This is more restrictive, but if you have a very large pool of products, will ensure that each set of products selected are entirely different (so a unique set of 5 products for every row). Or you may consider a constraint in terms of customers reached instead of on products, so if solution 1 reaches 129, you filter those rows out and only count newly reached people in subsequent solutions. This ensures each row of the leaderboard above reaches different people than the prior rows.

A linear programming example for TURF analysis in python

Recently on LinkedIn I saw a very nice example of TURF (Total Unduplicated Reach & Frequency) analysis via Jarlath Quinn. I suggest folks go and watch the video, but for a simple example imagine you are an ice cream food truck, and you only have room in your truck to sell 5 different ice-creams at a time. Some people only like chocolate, others like vanilla and neapolitan, and then others like me like everything. So what are the 5 best flavors to choose to maximize the number of people that like at least one flavor?

You may think this is a bit far-fetched to my usual posts related to criminal justice, but it is very much related to the work I did on identifying optimal gang members to deliver the message in a Focused Deterrence initiative. (I’m wondering if also there is an application in Association Rules/Conjunctive analysis.) But most examples I see of this are in the marketing space, e.g. whether to open a new store, or to carry a new product in a store, or to spend money on ads to reach an audience, etc.

Here I have posted the python code and data used in the analysis, below I go through the steps in formulating different linear programs to tackle this problem. I ended up taking some example simulated data from the XLStat website. (If you have a real data example feel free to share!)

A bit of a side story – growing up in rural Pennsylvania, going out to restaurants was sort of a big event. I specifically remember when we would travel to Williamsport, we would often go to eat at a restaurant called Hoss’s and we would all just order the salad bar buffet. So I am going to pretend this restaurant survey is maximizing the reach for Hoss’s buffet options.

Frontmatter

Here I am using pulp to fit the linear programming, reading in the data, and I am making up names for the columns for different food items. I have a set of main course meals, sides, and desserts. You will see in a bit how I incorporate this info into the buffet plans.

############################################################
#FRONT MATTER 

import pandas as pd
import pulp
import os

os.chdir('D:\Dropbox\Dropbox\Documents\BLOG\TURF_Analysis\Analysis')

#This is simulated data from XLStat
surv_data = pd.read_excel('demoTURF.xls',sheet_name='Data',index_col=0)

#Need 27 total of match up simulated data
main = ['Steak',
        'Pizza',
        'FriedChicken',
        'BBQChicken',
        'GrilledSalmon',
        'FriedHaddock',
        'LemonHaddock',
        'Roast',
        'Burger',
        'Wings']

sides = ['CeaserSalad',
         'IcebergSalad',
         'TomatoSoup',
         'OnionSoup',
         'Peas',
         'BrusselSprouts',
         'GreenBeans',
         'Corn',
         'DeviledEggs',
         'Pickles']

desserts = ['ChoclateIceCream',
            'VanillaIceCream',
            'ChocChipCookie',
            'OatmealCookie',
            'Brownie',
            'Blondie',
            'CherryPie']

#Renaming columns
surv_data.columns = main + sides + desserts

#Replacing the likert scale data with 0/1
surv_data.replace([1,2,3],0,inplace=True)
surv_data.replace([4,5],1,inplace=True)

#A customer weight example, here setting to 1 for all
surv_data['CustWeight'] = 1
cust_weight = surv_data['CustWeight']
############################################################

Maximizing Customers Reached

And now onto the good stuff, here is an example TURF model linear program. I end up picking the same 5 items that the XLStat program picked in their spreadsheet as well.

############################################################
#TRADITIONAL TURF ANALYSIS

k = 5 #pick 5 items
Cust_Index = surv_data.index
Prod_Index = main + sides + desserts

#Problem and Decision variables
P = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)

#Objective Function
P += pulp.lpSum( Cust_Dec[i] * cust_weight[i] for i in Cust_Index )

surv_items = surv_data[Prod_Index] #Dont want the weight variable
#Reached Constraint
for i in Cust_Index:
    #Get the products selected
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    #Set the constraint
    P += Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )
    
#Total number of products selected constraint
P += pulp.lpSum( Prod_Dec[j] for j in Prod_Index) == k

#Now solve the model
P.solve()

#Figure out the total reached people
print( pulp.value(P.objective) ) #129

#Print out the products you picked
picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec[j].varValue == 1:
        picked.append( (n+1,j) )

print(picked) 

#Same as XLStat
#[(14, 'OnionSoup'), (15, 'Peas'), (16, 'BrusselSprouts'), 
# (23, 'ChocChipCookie'), (26, 'Blondie')]

#For 5 items, XLStat selected items 
# 14 15 16 23 26 that reached 129 people
############################################################

One of the things I have done here is to create a ‘weight’ variable associated with each customer. So here I say all of the customers weights are all equal to 1, but you could swap out whatever you wanted. Say you had estimates on how much different individuals spend, so you could give big spenders more weight. (In a criminal justice example, for the Focused Deterrence initiative, folks typically want to target ‘leaders’ more frequently, so you may give them more weight in this example.) Since these examples are based on surveys, you may also want the weight to correspond to the proportion that survey respondent represents in the population, aka raking weights. Or if you have a crazy large survey population, you could use frequency weights for responses that give the exact same picks.

One thing to note as well in this formula is that I recoded the data earlier to be 0/1. You might however consider the likert scale rating 1 to 5 directly, subtract 1 and divide by 4. Then take that weight, and instead of the line:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j] for j in sel_prod )

You may want something like:

Cust_Dec[i] <= pulp.lpSum( Prod_Dec[j]*likert_weight[i,j] for j in sel_prod )

In that case you would want to set the Cust_i decision variable to a continuous value, and then maybe cap it at 1 (so you can partially reach customers).

The total number of decision variables will be the number of customers plus the number of potential products, so here only 185 + 27 = 212. And the number of constraints will be the number of customers plus an additional small number. I’d note you can easily solve systems with 100,000’s of decision variables and constraints on your laptop, so at least for the example TURF analyses I have seen they are definitely within the ‘can solve this in a second on a laptop’ territory.

You can add in additional constraints into this problem. So imagine we always wanted to select one main course, at least two side dishes, and no more than three desserts. Also say you never wanted to pair two items together, say you had two chicken dishes and never wanted both at the same time. Here is how you could do each of those different constraints in the problem.

############################################################
#CONSTRAINTS ON ITEMS SELECTED

#Redoing the initial problem, but select 7 items
k = 7
P2 = pulp.LpProblem("TURF", pulp.LpMaximize)
Cust_Dec2 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec2 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
P2 += pulp.lpSum( Cust_Dec2[i] * cust_weight[i] for i in Cust_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    P2 += Cust_Dec2[i] <= pulp.lpSum( Prod_Dec2[j] for j in sel_prod )
P2 += pulp.lpSum( Prod_Dec2[j] for j in Prod_Index) == k

#No Fried and BBQ Chicken at the same time
P2 += pulp.lpSum( Prod_Dec2['FriedChicken'] + Prod_Dec2['BBQChicken']) <= 1
#Exactly one main course
P2 += pulp.lpSum( Prod_Dec2[m] for m in main) == 1
#At least two sides (but could have 0)
P2 += pulp.lpSum( Prod_Dec2[s] for s in sides) >= 2
#No more than 3 desserts
P2 += pulp.lpSum( Prod_Dec2[d] for d in desserts) <= 3

#Now solve the model and print results
P2.solve()
print( pulp.value(P2.objective) ) #137
picked2 = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec2[j].varValue == 1:
        picked2.append( (n+1,j) )
print(picked2)
#[(10, 'Wings'), (12, 'IcebergSalad'), (14, 'OnionSoup'), (15, 'Peas'), 
# (16, 'BrusselSprouts'), (23, 'ChocChipCookie'), (27, 'CherryPie')]
############################################################

You could also draw a trade-off curve for how many more people you will reach if you can up the total number of items you can place on the menu, so estimate the model with 4, 5, 6, etc items and see how many more people you can reach if you extend the menu.

One of the other constraints you may consider in this formula is a budget constraint. So imagine instead of the food example, you are working for a marketing company, and you have an advertisement budget. You want to maximize the customer reach given the budget, so here a “product” may be a billboard, radio ad, newspaper ad, etc, but each have different costs. So instead of the constraint Prod_j == k where you select so many products, you have the constraint Prod_j*Cost_j <= Budget, where each product is associated with a particular cost.

Alt Formula, Minimizing Cost while Reaching a Set Amount

So in that last bit I mentioned costs for selecting a particular portfolio of products. Another way you may think about the problem is minimizing cost while meeting constraints on the reach (instead of maximizing reach while minimizing cost). So say you were a marketer, and wanted an estimate of how much budget you would need to reach a million people. Or going with our buffet example, imagine we wanted to appeal to at least 50% of our sample (so at least 93 people). Our formula would then be below (where I make up slightly different costs for buffet each of the buffet options).

############################################################
#MINIMIZE COST

#Cost dictionary made up prices
cost_prod = {'Steak' : 5.0,
             'Pizza' : 2.0,
             'FriedChicken' : 4.0,
             'BBQChicken' : 3.5,
             'GrilledSalmon' : 4.5,
             'FriedHaddock' : 5.3,
             'LemonHaddock' : 4.7,
             'Roast' : 3.9,
             'Burger' : 1.5,
             'Wings' : 2.4,
             'CeaserSalad' : 1.0,
             'IcebergSalad' : 0.8,
             'TomatoSoup' : 0.4,
             'OnionSoup' : 0.9,
             'Peas' : 0.6,
             'BrusselSprouts' : 0.5,
             'GreenBeans' : 0.4,
             'Corn' : 0.3,
             'DeviledEggs' : 0.7,
             'Pickles' : 0.73,
             'ChoclateIceCream' : 1.3,
             'VanillaIceCream' : 1.2,
             'ChocChipCookie' : 1.5,
             'OatmealCookie' : 0.9,
             'Brownie' : 1.2,
             'Blondie' : 1.3,
             'CherryPie' : 1.9}

#Setting up the model with the same selection constraints
n = 100
Pmin = pulp.LpProblem("TURF", pulp.LpMinimize)
Cust_Dec3 = pulp.LpVariable.dicts("Customers Reached", [i for i in Cust_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
Prod_Dec3 = pulp.LpVariable.dicts("Products Selected", [j for j in Prod_Index], lowBound=0, upBound=1, cat=pulp.LpInteger)
#Minimize this instead of Maximize reach
Pmin += pulp.lpSum( Prod_Dec3[j] * cost_prod[j] for j in Prod_Index )
for i in Cust_Index:
    p_sel = surv_items.loc[i] == 1
    sel_prod = list(p_sel.index[p_sel])
    Pmin += Cust_Dec3[i] <= pulp.lpSum( Prod_Dec3[j] for j in sel_prod )
#Instead of select k items, we want to reach at least n people
Pmin += pulp.lpSum( Cust_Dec3[i]*cust_weight[i] for i in Cust_Index) >= n

#Same constraints on meal choices
Pmin += pulp.lpSum( Prod_Dec3['FriedChicken'] + Prod_Dec3['BBQChicken']) <= 1
Pmin += pulp.lpSum( Prod_Dec3[m] for m in main) == 1
Pmin += pulp.lpSum( Prod_Dec3[s] for s in sides) >= 2
Pmin += pulp.lpSum( Prod_Dec3[d] for d in desserts) <= 3

#Now solve the model and print results
Pmin.solve()

reached = 0
for i in Cust_Index:
    reached += Cust_Dec3[i].varValue
print(reached) #100 reached on the nose

picked = []
for n,j in enumerate(Prod_Index):
    if Prod_Dec3[j].varValue == 1:
        picked.append( (n+1,j,cost_prod[j]) )
        cost += cost_prod[j]
print(picked)
#[(9, 'Burger', 1.5), (13, 'TomatoSoup', 0.4), (15, 'Peas', 0.6)]
print(pulp.value(Pmin.objective)) #Total Cost 2.5
############################################################

So for this example our minimum budget buffet has some burgers, tomato soup, and peas. (Sounds good to me, I am not a picky eater!)

You can still incorporate all of the same other constraints I discussed before in this formulation. So here we need at a minimum to serve only 3 items to get the (over) 50% reach that we desire. If you wanted fairness type constraints, e.g. you want to reach 60% of females and 40% of males, you could do that as well. In that case you would just have two separate constraints for each group level you wanted to reach (which would also be applicable to the prior maximize reach formula, although you may need to keep upping the number of products selected before you identify a feasible solution).

In the end you could mash up these two formulas into one bi-objective function. You would need to define a term though to balance reach and cost. I’m wondering as well if there is a way to incorporate marginal benefits of sales into this as well, e.g. if you sell a Steak you may make a larger profit than if you sell a Pizza. But I am not 100% sure how to do that in this set up (even though I like all ice-cream, I won’t necessarily buy every flavor if I visit the shop). Similar for marketing adverts some forms may have better reach, but may have worse conversion rates.