# Random notes, digital art, and pairwise comparisons are polynomial

So not too much in the hopper for the blog at the moment. I just have a bunch of half-baked ideas (random Python tips, maybe some crime analysis using osmnx, scraping JavaScript apps using Selenium, normal nerd data science stuff).

Still continuing my blog series for the American Society of Evidence Based Policing, and I will have a new post out next week on officer use of force.

If you have any suggestions for topics always feel free to ask me anything!

Working on some random digital art (somewhat focused on maps, but not entirely). For inspiration I like OptArt and Rick Wicklin’s posts. DALL-E is impressive, and since it has an explicit goal of creating artwork I think it is a neat idea. Chat bots I have nothing good to say about. Computer scientists working on them seem to be under the impression that if you build a large/good enough language model, out pops general intelligence. Wee bit skeptical of that.

At work, a co-worker was timing an application for a particular graph-database/edge-detection project. Initial timings on fake data were not looking so good. Here we have the number of nodes and the timings for the application:

```
  Nodes    Minutes
   1000       0.16
  10000       0.25
 100000       1.5
1000000      51
```

Offhand people often talk about exponential functions (or growth), but what I expect we are really looking at here is pairwise comparisons (I am not totally familiar with the tech the other data scientist is using, so I am guessing at the algorithmic complexity). So this likely scales something like (where `n` is the number of nodes in the graph):

```
Time = Fixed + C1*n + C2*(n choose 2) + e
```

`Fixed` is just a small constant, `C1` covers setting up the initial node database, `C2` covers the edge detection, which I am guessing uses pairwise comparisons, `(n choose 2)`, and `e` is an error term. We can rewrite this to show that the binomial coefficient is really polynomial time (not exponential) in terms of just the number of nodes.

```
C2*[n choose 2] = C2*[n*(n-1)/2]
                = C2*[(n^2 - n)/2]
                = (C2/2)*(n^2 - n)
                = (C2/2)*n^2 - (C2/2)*n
```
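This algebra is easy to double check numerically in R (`choose` is base R's binomial coefficient function, and it is vectorized):

```r
# choose(n, 2) should equal (n^2 - n)/2 for any n
n <- 10^(3:6)
stopifnot(all.equal(choose(n, 2), (n^2 - n)/2))
```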

And so we can rewrite our original equation in terms of simply `n`:

```
Time = Fixed + (C1 - C2/2)*n + (C2/2)*n^2
```

With some simple R code, we can estimate our equation:

```r
n <- 10^(3:6)
m <- c(0.16, 0.25, 1.5, 51)
poly_mod <- lm(m ~ n + I(n^2))
```

Since this fits 3 parameters with only 4 observations, the fit is (not surprisingly) quite good. To be clear, that does not mean much; if I really cared I would do much more sampling (or read the docs more closely about the underlying tech involved):

```r
> pred <- predict(poly_mod)
> cbind(n, m, pred)
      n     m       pred
1 1e+03  0.16  0.1608911
2 1e+04  0.25  0.2490109
3 1e+05  1.50  1.5000989
4 1e+06 51.00 50.9999991
```

And if you instead fit `poly_2 <- lm(m ~ n + choose(n,2))`, you get a change in scale of the coefficients, but the same predictions.
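That equivalence is easy to check directly, since `choose(n,2)` is just a linear combination of `n` and `n^2`, so the two design matrices span the same column space:

```r
n <- 10^(3:6)
m <- c(0.16, 0.25, 1.5, 51)
poly_mod <- lm(m ~ n + I(n^2))
poly_2   <- lm(m ~ n + choose(n, 2))
# Same column space, so identical fitted values
all.equal(predict(poly_mod), predict(poly_2))  # TRUE
```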

We really need this application at work to scale to maybe over 100 million records, so what would we predict based on these initial timings?

```r
> nd <- data.frame(n = 10^(7:8))
> predict(poly_mod, nd)/60  # convert to hours
        1         2
 70.74835 6934.56850
```

So doing 10 million records will take a few days, and doing 100 million will be close to 300 days.

With only 4 observations there is not much to chew over (really it is too few to even say whether it should be a different model). I am wondering though how best to handle errors for these types of extrapolations. The errors are probably not homoskedastic for such timing models (the error will be larger for a larger number of nodes). Maybe it would be better to use quantile regression (and model the median)? I am not sure (and I think that advice applies to modeling exponential growth as well).
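For the heteroskedasticity, one option I might try (just a sketch, not something I have validated on real timing data, and note it swaps the quadratic for a power-law functional form) is fitting on the log scale, which makes the errors multiplicative on the original scale:

```r
n <- 10^(3:6)
m <- c(0.16, 0.25, 1.5, 51)
# Power-law fit on the log scale; errors become multiplicative on the
# original scale, so the large-n rows do not dominate the fit
log_mod <- lm(log(m) ~ log(n))
# Back-transformed predictions for 10 and 100 million nodes, in hours
nd <- data.frame(n = 10^(7:8))
exp(predict(log_mod, nd))/60
```

Quantile regression (e.g. `rq` in the quantreg package) would be another way to model the median directly, but with only 4 observations any of these approaches is a shot in the dark.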