I still lurk on the Cross Validated statistics site every now and then. There was a kind of common question about the probability of a run of events occurring, and the poster provided a nice analytic solution to the problem using Markov Chains and absorbing states I was not familiar with.
I was familiar with a way to approximate the answer though using a simple simulation, and encoding data via run length encoding. Run length encoding works like this, if you have an original sequence that is AABBBABBBB
, then the run length encoded version of this sequence is:
A,2
B,2
A,1
B,4
This is a quite convenient sparse data format to be familiar with. E.g. if you are using tensors in various deep learning libraries, you can encode the data like this and then stack the tensor. But the stacked tensor is just a view, so it doesn’t take up as much memory as the initial full tensor.
Using this encoding also makes a simulation to answer the question, how often do runs of 5+ occur in this hypothetical experiment quite easy to estimate. You just calculate the run length encoded version of the data, and see if any of the lengths are equal to or greater than 5. Below are code snippets in R and Python.
While the analytic solution is of course preferable when you can figure it out, simulations are nice to test whether the solution is correct, as well as to provide an answer when you are not familiar with how to analytically derive a solution.
R Code
R has a native run-length encoding command, rle
. The reason is that runs tests are a common time series technique for looking at randomness. Encourage you to run the code yourself to see how my simulated answer lines up with the analytic answer provided on the stats site!
##########################################
# R Code
set.seed(10)
die <- 1:6
run_sim <- function(rolls=1000, conseq=5){
test <- sample(die,rolls,TRUE)
res <- max(rle(test)$lengths) >= conseq
return(res)
}
sims <- 1000000
results <- replicate(sims, run_sim(), TRUE)
print( mean(results) )
##########################################
Python Code
The python code is very similar to the R code. Main difference is there is no native run length encoding command in numpy or scipy I am aware of (although there should be)! So I edited a function I found from Stackoverflow to accomplish the rle.
##########################################
# Python code
import numpy as np
np.random.seed(10)
# Edited from https://stackoverflow.com/a/32681075/604456
# input numpy arrary, return tuple (lengths, vals)
def rle(ia):
y = np.array(ia[1:] != ia[:-1]) # pairwise unequal (string safe)
i = np.append(np.where(y), len(ia) - 1) # must include last element
z = np.diff(np.append(-1, i)) # run lengths
return (z, ia[i])
die = list(range(6))
def run_sim(rolls=1000, conseq=5):
rlen, vals = rle(np.random.choice(a=die,size=rolls,replace=True))
return rlen.max() >= conseq
sims = 1000000
results = [run_sim() for i in range(sims)]
print( sum(results)/len(results) )
##########################################
I debated on expanding this post to show how to do these simulations in parallel, this is a bit of a cheesy experiment to show though. To do 1 million simulations on my machine still only takes like 10~20 seconds for each of these code snippets. So that will have to wait until another post!
You may be thinking why do I care about runs of dice rolls? Well, it can be extended to many different types of time series monitoring problems. For example, when I worked as a crime analyst at Troy I thought about this in terms of analyzing domestic violent reports. They were too numerous for me to read through every report, so I needed to devise a system to identify if there were anomalous patterns in the recent number of reports. You could devise a test here, say how many days of 10+ reports in a row, and see how frequently you would expect that occur in say a year of monitoring. The simulations above could easily be amended to do that, via doing simulations of the Poisson distribution instead of dice rolls, or assigning weights to particular outcomes.
apwheele
/ October 26, 2020Just for my own reference, David Spiegelhalter has a reference on doing this for Poisson counts with a moving window approximation, https://understandinguncertainty.org/when-cluster-real-cluster