Fitting a gamma distribution (with standard errors) in python

Over on Crime De-Coder, I have up a pre-release of my book, Data Science for Crime Analysis with Python.

I have gotten asked by so many people “where to learn python” over the past year that I decided to write a book. Go check out my preface on why I am not happy with the current resources in the market, especially for beginners.

I have the preface + first two chapters posted. If you want to sign up for early releases, send me an email and I will let you know how to get an early copy of the first half of the book (and send updates as chapters are finished).

Also, I have up the third installment of the AltAc newsletter, with notes about the different types of healthcare jobs, and a pointer to the StatQuest youtube videos as a great resource. Email me if you want to sign up for the newsletter.


So the other day I showed how to fit a beta-binomial model in python. Today, in a quick post, I am going to show how to estimate standard errors for such fitted models.

So in scipy, you have distribution.fit(data) to fit the distribution and return the estimates, but it does not give standard errors around those estimates. I recently had need for gamma fits to data; in R I would do something like:

# this is R code
library(fitdistrplus)
x <- c(24,13,11,11,11,13,9,10,8,9,6)
gamma_fit <- fitdist(x,"gamma")
summary(gamma_fit)

And this pipes out:

> Fitting of the distribution ' gamma ' by maximum likelihood
> Parameters :
>        estimate Std. Error
> shape 8.4364984  3.5284240
> rate  0.7423908  0.3199124
> Loglikelihood:  -30.16567   AIC:  64.33134   BIC:  65.12713
> Correlation matrix:
>           shape      rate
> shape 1.0000000 0.9705518
> rate  0.9705518 1.0000000

Now, for some equivalent in python.

# python code
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

x = [24,13,11,11,11,13,9,10,8,9,6]

res = gamma.fit(x,floc=0)
print(res)

And this gives (8.437256958682587, 0, 1.3468401423927603). Note that python uses the scale version, so 1/scale = rate, e.g. 1/res[-1] results in 0.7424786123640676, which agrees pretty closely with R.
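
If you also want the log-likelihood and AIC/BIC that R reports, you can compute those from the scipy fit yourself. A quick sketch (variable names here are my own; loc is fixed at 0, so there are two free parameters):

# log-likelihood, AIC, BIC from the scipy fit (k = 2 free parameters)
shape_est, loc_est, scale_est = res
ll = gamma.logpdf(x, shape_est, loc=loc_est, scale=scale_est).sum()
k = 2
print(ll, 2*k - 2*ll, k*np.log(len(x)) - 2*ll)
# should be close to R's -30.16567, 64.33134, 65.12713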

Now unfortunately, there is no simple way to get the standard errors in this framework. You can however minimize the negative log-likelihood yourself and get the pieces you need – here the Hessian – to calculate the standard errors.

Here, to make an apples to apples comparison with R easier, I estimate the rate parameterization.

# minimize negative log likelihood
def gll(parms,data):
    alpha, rate = parms
    # scipy parameterizes the gamma by scale, so pass in 1/rate
    ll = gamma.logpdf(data,alpha,loc=0,scale=1/rate)
    return -ll.sum()

result = minimize(gll,[8.,0.7],args=(x,),method='BFGS')
# standard errors from the BFGS inverse Hessian approximation
se = np.sqrt(np.diag(result.hess_inv))
# printing estimates and standard errors
np.stack([result.x,se],axis=1)

And this gives:

array([[8.43729465, 3.32538824],
       [0.74248193, 0.30255433]])

Pretty close, but not an exact match, to R. Here the standard errors are too low (maybe there is a sample correction factor?).
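
One likely culprit is that the hess_inv BFGS returns is just the approximation built up over the iterations, not the exact Hessian at the solution. Here is a sketch of one workaround, using a simple central finite-difference Hessian of the negative log-likelihood at the estimates (the num_hess helper is my own, not a scipy function):

# finite difference Hessian of the negative log likelihood
def num_hess(f, x0, eps=1e-4):
    x0 = np.asarray(x0, dtype=float)
    k = len(x0)
    hess = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            xpp = x0.copy(); xpp[i] += eps; xpp[j] += eps
            xpm = x0.copy(); xpm[i] += eps; xpm[j] -= eps
            xmp = x0.copy(); xmp[i] -= eps; xmp[j] += eps
            xmm = x0.copy(); xmm[i] -= eps; xmm[j] -= eps
            hess[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm))/(4*eps**2)
    return hess

hess = num_hess(lambda p: gll(p, x), result.x)
np.sqrt(np.diag(np.linalg.inv(hess)))
# should be closer to R's 3.528, 0.320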

In some quick tests with this, the scale version tends to be a bit more robust in fitting than the rate version.
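
For reference, the scale version of the same minimization looks like the sketch below (gll_scale and the starting value of 1.3 are my own choices):

# scale parameterization of the negative log likelihood
def gll_scale(parms, data):
    alpha, scale = parms
    ll = gamma.logpdf(data, alpha, loc=0, scale=scale)
    return -ll.sum()

result_scale = minimize(gll_scale, [8., 1.3], args=(x,), method='BFGS')
se_scale = np.sqrt(np.diag(result_scale.hess_inv))
np.stack([result_scale.x, se_scale], axis=1)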

For context, a gamma distribution with a shape parameter greater than 1 has a hump, and one with a shape parameter less than 1 is monotonically decreasing. This corresponds to the “buffer” hypothesis in criminology for journey to crime distributions. So hopefully some work related to that is coming up soonish!
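
If you want to see that difference directly, here is a minimal sketch plotting the two regimes (the particular shape and scale values are just illustrative):

# gamma density with shape > 1 (hump) vs shape < 1 (monotonically decreasing)
import matplotlib.pyplot as plt

grid = np.linspace(0.01, 10, 200)
fig, ax = plt.subplots()
ax.plot(grid, gamma.pdf(grid, 2.5, scale=2), label='shape=2.5 (hump)')
ax.plot(grid, gamma.pdf(grid, 0.8, scale=2), label='shape=0.8 (decreasing)')
ax.legend()
plt.show()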


And doing a little more digging, you can get “closed form” standard errors for the parameters based on the Fisher Information matrix (via Wikipedia).

# Calculating closed form standard
# errors for gamma via Fisher Info
from scipy.special import polygamma
from numpy.linalg import inv
import numpy as np

def trigamma(x):
    return float(polygamma(1,x))

# These are the R estimates
shape = 8.4364984
rate = 0.7423908
n = 11 # 11 observations

fi = np.array([[trigamma(shape), -1/rate],
               [-1/rate, shape/rate**2]])

inv_fi = inv(n*fi)  # this is variance/covariance
np.sqrt(np.diagonal(inv_fi))

# R reports      3.5284240 0.3199124
# python reports 3.5284936 0.3199193

I put scare quotes around “closed form”, as it relies on the trigamma function, which itself needs to be calculated numerically I believe.

If you want to go through the annoying math (instead of using the inverse above), you can get a direct one-liner for either parameter.

# Closed form for standard error of shape
np.sqrt(shape / (n*(trigamma(shape)*shape - 1)))
# 3.528493591892887

# This works for the scale or the rate parameterization
np.sqrt((trigamma(shape))/ (n*rate**-2*(trigamma(shape)*shape - 1)))
# 0.31995664847081356

# This is the standard error for the scale version
scale = 1/rate
np.sqrt((trigamma(shape))/ (n*scale**-2*(trigamma(shape)*shape - 1)))
# 0.5803962127677769
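
As a sanity check on that last number, since scale = 1/rate, the delta method says SE(scale) is approximately SE(rate)/rate^2, which lines up with the direct formula (up to the rounding of the R reported values):

# delta method check, scale = 1/rate
0.3199124 / rate**2
# ~0.5804, agrees with the direct formula above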

Fitting a plateau effects model in scipy

I have been dealing with a few models recently where people fit non-linear effects (either via polynomials or splines), and the results are just, on their face, too curvy.

There is also a common social science trope where people fit a polynomial to some data, and that clearly exploratory model fitting exercise becomes a main focal point of the paper.

But there is one scenario I commonly see for curves that I think makes sense for quite a bit of social science data – a plateau effect. See for example this Hipp article that finds a plateau effect for poverty -> crime rates. John though uses a cubic function later to fit these effects, so it curves back down – I think a more reasonable model would enforce monotonic constraints so it doesn’t dip back down in the tails of the data. (The same issue often happens with quadratic polynomials as well.) I have some other blog posts on segmented models as well that are subject to the same issue of not being monotonic where they should be.

A plateau model is difficult to fit out of the box though in most current stat software. Rick Wicklin on his blog has a nice formulation though:

It fits a quadratic, and then plateaus after a particular breakpoint. For theory testing I imagine the breakpoint itself will be of interest to many criminologists, and you can estimate that location in this formulation.
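
Written out, the functional form is something like the sketch below (the parameter names are mine; t0 is the breakpoint and p the plateau value):

# quadratic below the breakpoint t0, flat afterwards; continuity pins the
# plateau p to the quadratic at t0, and forcing t0 to be the vertex of the
# quadratic (t0 = -b1/(2*b2)) makes the curve flatten rather than dip down
def plateau_curve(x, b0, b1, b2, t0):
    # written for scalar x just to show the form
    p = b0 + b1*t0 + b2*t0**2
    return b0 + b1*x + b2*x**2 if x < t0 else p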

Rick works for SAS, so if you are familiar with SAS go ahead and use his code. But here I have coded up an example fitting a constrained non-linear regression in python using scipy.

Python Code

Taking the same data from Rick Wicklin’s blog post, this code just reads in the data and converts dates to days since 3/20/2019. I don’t scale the data here to be an exact replicate of Rick’s blog post, but for data with a wider range it would be necessary to prevent some numerical instability.

# Python libraries to replicate

from datetime import datetime
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.optimize import NonlinearConstraint

# Via https://blogs.sas.com/content/iml/2020/12/14/segmented-regression-sas.html
dat = [(1,'3/20/2019',182),
       (3,'5/30/2019',223),
       (5,'6/11/2019',111),
       (7,'7/26/2019',83),
       (9,'8/29/2019',162),
       (11,'10/10/2019',70),
       (13,'10/31/2019',113),
       (15,'11/21/2019',83),
       (17,'12/5/2019',73),
       (19,'12/19/2019',86),
       (21,'1/16/2020',124),
       (23,'1/30/2020',134),
       (25,'6/4/2020',60),
       (2,'5/16/2019',150),
       (4,'6/6/2019',142),
       (6,'7/11/2019',164),
       (8,'8/22/2019',144),
       (10,'9/19/2019',83),
       (12,'10/17/2019',114),
       (14,'11/7/2019',97),
       (16,'12/5/2019',111),
       (18,'12/12/2019',87),
       (20,'1/9/2020',102),
       (22,'1/23/2020',95),
       (24,'3/5/2020',121)]

df = pd.DataFrame(dat,columns=['SurgeryNo','Date','Duration'])
df['Date'] = pd.to_datetime(df['Date'])
df['DaysRef'] = (df['Date'] - pd.to_datetime('3/20/2019')).dt.days
df['DR2'] = df['DaysRef']**2

Now, one of the things I sometimes find confusing in posts that optimize arbitrary functions (in R or python) is that you are minimizing the function with respect to the data at hand. Some folks write functions that take in both the data and the parameters. But I find it easier to just keep the data fixed and only pass in the parameters.

So you can see my non-linear prediction function, prednl, takes in the parameters (which we will estimate) and returns predictions for the fixed dataset. Ditto for the loss function (you could update it to do logistic regression, for example, if predicting 0/1s). The nlconst function then defines the non-linear constraints that make this plateau model work. Then I set the starting values and finally minimize the function:

# Pass in global data into the function
def prednl(x):
    b0 = x[0]
    b1 = x[1]
    b2 = x[2]
    brp = x[3]
    before = (df['DaysRef'] < brp)
    y0 = b0 + b1*df['DaysRef'] + b2*df['DR2']
    y1 = b0 + b1*brp + b2*brp*brp
    return y0*before + (~before)*y1

def lossnl(x):
    yhat = prednl(x)
    squares = (df['Duration'] - yhat)**2
    return squares.sum()

def nlconst(x):
    r1 = x[4] - (x[0] + x[1]*x[3] + x[2]*x[3]*x[3])    # plateau
    r2 = x[3] - ((-0.5*x[1])/x[2])                     # breakpoint
    # Could also consider bounds on breakpoint and curve needs to be non-zero
    return np.array([r1,r2])

nlc = NonlinearConstraint(nlconst, np.array([0.0,0.0]), 
                                   np.array([0.0,0.0]))

start = np.array([185.0,-1.0,0.1,150.0,60.0])

solution = minimize(lossnl,start,method='trust-constr',
                    constraints=nlc,options={'maxiter':50000})

And this returns the same fit as the SAS routine did.
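
To eyeball the fit yourself (the plotting choices below are just my own), you can plot the predicted curve against the observed durations:

# plot observed durations and the fitted plateau curve
import matplotlib.pyplot as plt

ordr = np.argsort(df['DaysRef'].to_numpy())
fig, ax = plt.subplots()
ax.scatter(df['DaysRef'], df['Duration'], c='k', label='Observed')
ax.plot(df['DaysRef'].to_numpy()[ordr], prednl(solution.x).to_numpy()[ordr],
        c='r', label='Fitted')
ax.axvline(solution.x[3], color='grey', linestyle='dashed', label='Breakpoint')
ax.set_xlabel('Days since 3/20/2019')
ax.set_ylabel('Duration')
ax.legend()
plt.show()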

Now I will admit defeat in trying to figure out analytical standard errors (I tried the outer product of gradients approach via autograd, as well as BFGS and its inverse Hessian estimate, which is not even close to the results SAS gives).

So I do the thing all lazy statisticians do at this point – the bootstrap. (SPSS I believe will only give standard errors for its nonlinear estimates via bootstrap.)

# Do the bootstrap, 95% CI
res = []
mess = []
for i in range(19):
    print(f'iter {i+1}: ',datetime.now())
    boot = df.sample(n=df.shape[0],replace=True).reset_index(drop=True)
    days_ref = boot['DaysRef'].to_numpy()
    duration = boot['Duration'].to_numpy()
    dr2 = boot['DR2'].to_numpy()
    def lb(x):
        b0 = x[0]
        b1 = x[1]
        b2 = x[2]
        brp = x[3]
        before = (days_ref < brp)
        y0 = b0 + b1*days_ref + b2*dr2
        y1 = b0 + b1*brp + b2*brp*brp
        yhat = y0*before + (~before)*y1
        squares = (duration - yhat)**2
        return squares.sum()
    sl = minimize(lb,start,method='trust-constr',
                  constraints=nlc,options={'maxiter':50000})
    mess.append(sl.message)
    print(sl.message)
    res.append(sl.x)

rdf = pd.DataFrame(res,columns=['B0','B1','B2','break','plateau'])
rdf.describe() #min/max are the 95% CIs

And we can see that these estimates are very wide. We can look at individual iterations, and in a few the estimates go off the rails (and they still say they converged, they just converged to nonsense).

# Some of the wayward estimates
# still pass convergence
rdf['Eval'] = mess
rdf
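
If you do crank up the number of bootstrap iterations well beyond the 19 here, a percentile interval is another quick option instead of the min/max (a sketch using the usual 2.5/97.5 cut points; the wayward iterations above would still need screening):

# percentile bootstrap interval over the numeric columns
rdf[['B0','B1','B2','break','plateau']].quantile([0.025, 0.975])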

But this is the nature of these non-linear functions. They can be pretty finicky. If a straight line fits the data quite well, the quadratic term will be very small, and so the estimated plateau may be outside of the data (or just totally unstable).

Still, even though it is more work and potentially more finicky in model fitting, I would rather people have explicit functional form predictions for non-linear effects, than simply throwing in polynomial functions and writing a paper about “look at these non-linear effects”.

And this formulation provides an explicit mechanism to measure the location of a plateau effect directly as a parameter.

aggregate retention/churn models in python

Instead of having so much code just randomly floating around in blog posts, I need to start making packages (both in R and python) more often. I took it as a challenge to make a simple python package, here retenmod (pypi, github). I got the idea after answering a question on crossvalidated. (The resources I leveraged the most were these two sites/tutorials, packaging projects and minimal example.)

It is a simple port of the R package foretell that provides several different models to forecast churn based on aggregate survival probabilities. So it only has three functions, and I did not focus too much on extras (like building sphinx docs). But it has just the right amount of complexity to make a nice intro, get-my-feet-wet example.

So you can now download/install the package via pip:

pip install retenmod

And it will automatically install scipy and numpy if you do not have them already installed. For a very simple example, I don’t have retention probabilities for any police department offhand, but this document has estimates for how many staff positions police tend to retain after increases.

Here is a simple example of using the library, in particular the BdW model.

import retenmod
import matplotlib.pyplot as plt

large = [100,66,56,52,49,47,44,42]
time = [0,1,2,3,4,5,10,15]

# Only fitting with the first 4 values (the 100% baseline plus 3 years)
train, ext = 4, 15
lrg_bdw = retenmod.bdw(large[0:train],ext - train + 1)

# Showing predicted vs observed
pt = list(range(16))
fig, ax = plt.subplots()
ax.plot(pt[1:], lrg_bdw.proj[1:],label='Predicted', 
        c='k', linewidth=2, zorder=-1)
ax.scatter(time[1:],large[1:],label='Observed', 
           edgecolor='k', c='r', s=50, zorder=1)
ax.axvline(train - 0.5, label='Train', color='grey', 
           linestyle='dashed', linewidth=2, zorder=-2)
ax.set_ylabel('% Retaining Position')
ax.legend(facecolor='white', framealpha=1)
plt.xticks(pt[1:])
plt.show()

So you can see even with only fitting the data to the first three years, years 4 and 5 were forecasted quite well. It underestimates retention further out at 10 and 15 years (the model has a hard time going down very fast from 100 to 66 and then flattening out in a reasonable way). But even so the super far out forecasts are not that crazy given only three data points.
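
If you want the numbers behind that plot, here is a quick comparison at the observed time points (this assumes lrg_bdw.proj is a plain sequence of projected retention percentages indexed by period, the same way the plotting code above uses it):

# predicted vs observed retention at each observed period
for t, obs in zip(time, large):
    print(t, obs, round(float(lrg_bdw.proj[t]), 1))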

I will have to work on an example later of showing how to translate this to cost-benefit analysis (although would prefer actual retention data from a PD). Essentially you can calculate the benefit of trying to save officers (retain them) vs hiring new officers and training them up based on just aggregate data. If you wanted to do something like estimate if retention is going down due to recent events, I would probably use micro-level data and estimate a survival model directly.

Next up I will try to turn my Exact distribution tests (R Code) for day of week/Benford’s analysis into a simple R package and see if I can get it on CRAN. Posting to pypi is quite easy.