Saving data files in wheel packages

As a small update, I continue to use my retenmod python package as a means to illustrate various python packaging and CICD tricks. Most recently, I have added an example of saving local data files inside the wheel package. So instead of packaging just the .py files in the wheel, it also bundles up two data files for illustration: a csv file and a json file.

The use case I have seen for this: sometimes people's packages have individual .py files that run to thousands of lines, and they are typically just lookup tables. It is better to save those lookup tables in more traditional data formats than it is to coerce them into python objects.

It is not too difficult, but here are the two steps you need:

Step 1, in setup.cfg in the root, I have added this package_data option.

[options.package_data]
* = *.csv, *.json
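
For this wildcard to pick the files up, the data files need to live inside the package folder alongside the .py files. A simplified layout (roughly, not the exact repo tree) looks like:

retenmod/
    __init__.py
    data_funcs.py
    agency.csv
    notes.json
setup.cfg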

Step 2, create a set of new functions to load in the data. Here I use pkg_resources to do this. It is simple enough to just copy-paste the entire data_funcs.py file into the blog post to illustrate:

import json
import numpy as np
import pkg_resources

# Reading the csv data
def staff():
    stream = pkg_resources.resource_stream(__name__, "agency.csv")
    df = np.genfromtxt(stream, delimiter=",", skip_header=1)
    return df

# Reading the metadata
def metaf():
    stream = pkg_resources.resource_stream(__name__, "notes.json")
    res = json.load(stream)
    return res

# just having it available as object
metadata = metaf()

So instead of doing something like pd.read_csv('agency.csv') (here I use numpy, as I don't have pandas as a package dependency for retenmod), you create a stream object, and __name__ is just the way for python to figure out all of the relative path junk. Depending on the downstream module, you may need to call stream.read(), but here both json and numpy can take the stream directly in their read functions and it works as intended.
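
As an aside, pkg_resources is slated for deprecation in favor of the standard library importlib.resources (python 3.9+), so that is probably the more future-proof route. A minimal sketch of the same two loaders using that approach (not what is currently in retenmod) would be something like:

import json
import numpy as np
from importlib.resources import files  # python 3.9+

# read the bundled csv via importlib.resources instead of pkg_resources
def staff():
    with files("retenmod").joinpath("agency.csv").open("rb") as stream:
        return np.genfromtxt(stream, delimiter=",", skip_header=1)

# read the bundled json metadata
def metaf():
    with files("retenmod").joinpath("notes.json").open("r") as stream:
        return json.load(stream)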

And again you can check out the github actions to see how testing the package and generating the wheel file are handled all in one go.


If you install the latest via the github repo:

pip install https://github.com/apwheele/retenmod/blob/main/dist/retenmod-0.0.1-py3-none-any.whl?raw=true

And to test this out, you can do:

from retenmod import data_funcs

data_funcs.metadata # dictionary from json
data_funcs.staff() # loading in numpy array from CSV file

If you go check out wherever your package is installed on your machine, you can see that it has the agency.csv and notes.json files right alongside the .py files with the functions.
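
If you do not want to dig around in site-packages by hand, a quick sketch of checking this programmatically (assuming the package is installed) is:

import os
import retenmod

# directory the installed package lives in
pkg_dir = os.path.dirname(retenmod.__file__)
print(os.listdir(pkg_dir))  # should show agency.csv and notes.json next to the .py files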

Next on the todo list: auto uploading to pypi and incrementing minor version tags via the CICD pipeline. So if you know of example packages that already do that, let me know!

My journey submitting to CRAN

So my R package ptools is up on CRAN. CRAN obviously does an important service – although I find the issues I had to deal with pedantic – but I will detail my struggles here, mostly so others hopefully do not have to deal with the same issues in the future. Long story short, I knew going in it can be tough, and CRAN did not disappoint.

Initially I submitted the package in early June. It passed the email verification, but I did not receive any email back after that. I falsely presumed it was in manual review. After around a month I sent an email to cran-sysadmin. The CRAN sysadmin promptly sent an email back with the reason it auto-failed – examples took too long – but I am not sure why I did not receive an auto-message (so it never got to the manual review stage). When I got auto-fail messages at the equivalent stage in later submissions, it typically took under an hour to get that auto-fail message back.

So then I went to fixing the examples that took too long (all of which run in under 5 seconds on my personal machine – for background, a $400 low end windows “gaming” desktop with an extra $100 in RAM, so I am not running some supercomputer here). Running devtools check() is not the same as running R CMD check --as-cran path\package.tar.gz, although maybe check_built() is, I am not sure. So first note to self: just use the typical command line tools and don't be lazy with devtools.
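
For reference, the typical command line route is just building the tarball and checking it with the --as-cran flag (the tarball name below is only an example, it depends on the version you build):

R CMD build ptools
R CMD check --as-cran ptools_1.0.1.tar.gz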

Initially I commented out sections of the examples that I knew took too long. Upon manual review though, I was told not to do that and to instead wrap overly long examples in \donttest{}. Stochastic changes in run times even made me fail a few times at this – some examples passed the time check in some runs but failed in others. Some examples that run pretty much instantly on my machine failed to finish in under 10 seconds on CRAN's windows checks. (My examples use plots on occasion, and it may be that spplot was the offender, as well as some of my functions that are not fast and use loops internally.) I have no advice here other than to just always wrap plot functions in \donttest{}, as well as anything too complicated for an abacus. There is no reliable way (that I can figure) to know which examples that are very fast on my machine will take 10+ seconds on CRAN's checks.
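
For concreteness, if you document with roxygen2, wrapping a slow example looks something like the below (quick_fun and slow_plot_fun are just placeholder names, not functions from ptools):

#' @examples
#' quick_fun(dat)        # fast example, fine to run in CRAN's checks
#' \donttest{
#' slow_plot_fun(dat)    # slower plotting example, wrapped per CRAN's request
#' }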

But doing all of these runs resulted in additional Notes about spelling errors in the description. At first it was last names in citations (Wheeler and Ratcliffe), so I took those citations out to prevent the Note. Later in manual review I was asked to put them back in. Occasionally a DOI check would fail as well, even though it is the correct DOI.

One of the things that is confusing to me: some of the Notes cause automatic failures (examples too long) and others do not (spelling errors, DOI check). The end result messages to me are the same though (or at least I don't know how to parse a “this is important” Note vs a “whatever, not a big deal” Note). The irony of this back and forth over the spelling/DOI notes is that the description went through changes only to end up back at what it was originally.

So at this point (somewhere around 10+ submission attempts), on 7/16, it finally gets past the auto/human checks to the point it is uploaded to CRAN. Finished, right? False! I then get an automated email from Brian Ripley/CRAN later that night saying it is up, but will be removed on 8/8 because of Namespace in Imports field not imported from: 'maptools'.

One function had requireNamespace("maptools") to use the conversion functions in maptools to go between sp/spatstat objects. This caused that “final” note about maptools not being loaded. To fix this, I ended up just removing the maptools dependency altogether, as using unexported functions, e.g. maptools:::func, causes a note when I run R CMD check locally (so I presume it will auto-fail). There is probably a smarter/more appropriate way to handle the imports – I default to doing whatever I hope will pass the CRAN checks.
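
For what it is worth, my understanding of the pattern CRAN prefers for an optional dependency is to list maptools under Suggests and guard its use, roughly like the below (convert_fun and sp_obj are just placeholders, not the actual ptools code):

# maptools listed under Suggests: in DESCRIPTION, not Imports:
if (requireNamespace("maptools", quietly = TRUE)) {
    res <- maptools::convert_fun(sp_obj)  # placeholder for an exported maptools function
} else {
    stop("install the maptools package to use this function")
}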

I am not sure why this namespace issue is a deal breaker at this stage (after the package is already on CRAN) and not at earlier stages. Again, this is another Note, not a warning/error. But it is sufficient to get CRAN to remove my package in a few weeks if I don't fix it. This email does not have the option “send email if a false positive”.

When resubmitting after doing my fixes, I then got a new error for the same package version (because it technically is on CRAN at this point), so I guess I needed to increment to 1.0.1 rather than fix the 1.0.0 package. Also, now the DOI issue in the description causes a “warning”. So I am not sure if this update failed because of the package version (which doesn't say note or warning in the auto-fail email) or because of the DOI failure (which again is now a warning, not a Note).

Why a DOI failure is sometimes a warning and other times a note, I do not know. At some later stage I just take the offending DOI out (against the prior manual review advice), as it can cause auto-failures (all of the cites are in the examples/docs as well).

OK, so package version incremented and namespace error fixed. Now in manual review for the 1.0.1 version, I get a note back to fix my errors – one of my tests fails on noLD/M1 Mac (what is noLD you may ask? It is “no long doubles”). These technically failed on prior submissions as well, but I thought I just needed to pass on 2+ OS's to get on CRAN. I send an email to Uwe Ligges at this point (as he sent an email about errors in the prior 1.0.0 versions as well) to get clarity about what exactly they care about (since the reason I started round 2 was the Namespace threat, not the test errors on Macs/noLD). Uwe responds very quickly that they care about my failing test, not the DOI/namespace junk.

So in some of my exact tests I have checks along the lines of ref <- c(0.25,0.58); act <- round(f,2), where f is the results scooped up from my prior function calls. The failure note rounds the results to the first digit, e.g. 0.2 0.5 (I suspect this is just testthat's behavior in terms of what is printed to the console for the error, but I don't know exactly how to fix the function calls so that no long doubles will work). I just admit defeat and comment out the part of this test function that I think is causing the failure, since any solution is not something I can personally test in my setup to see if it works. Caveat emptor: be aware my exact test power calculation functions are not so good if you are on a machine that can't have long doubles (or M1 Macs I guess, I don't fricken know).
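
In hindsight, a likely more robust pattern than rounding and exactly matching is a tolerance based comparison. A sketch is below using the same values quoted above, although I cannot verify it actually fixes the noLD failure without access to that environment:

# instead of rounding f and matching it exactly against ref
ref <- c(0.25, 0.58)
testthat::expect_equal(f, ref, tolerance = 0.01)  # compare with an explicit tolerance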

OK, so with that one test fixed, upon resubmission (the following day) I get a new error in my tests (now on Windows) – Error in sp::CRS(...): PROJ4 argument-value pairs must begin with +. I have no clue why this shows an error now, for the first time, going on close to 20 submissions over the past month and a half.

The projection string I pass definitely has a “+” at the front. I don't know the cause, and subsequent submissions to CRAN, even after my attempts to fix it (submitting projections with simpler epsg codes), continued to fail. I give up and just remove that particular test.

Uwe sends an updated email in manual review, asking why I removed the tests and did not fix them (or fix my code). I go into great detail about the new sp error (which I don't think is my issue), that I don't know the root cause of the noLD/Mac error (and won't be able to debug it before 8/8), that the code has pretty good test coverage (those functions pass the other noLD/Mac tests, it is just the one), and I ask for his grace to upload. He says OK, the patch is going to CRAN. It has been 24 hours since then, so I cannot say for sure I will not get a ‘will be removed’ auto-email.

To be clear, these back and forth issues are on me (I am sure the \donttest{} advice was somewhere in online documentation that I should have known about). About the only legit complaint I have with the process is that the “Note” failure carries some ambiguity – some notes are deal breakers and others aren't. I suspect this is because many legacy packages fail checks this stringent, so CRAN needs Notes to not auto-fail and to leave some discretion.

The noLD errors make me question reality itself – does 0.25 = 0.2 according to M1 Macs? Have I been living a lie my whole life? Do I really know my code works? I will eventually need to spin up a Docker image and try to replicate the noLD environment to know what is going on with that one exact test power function.
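
If I do go down that route, the rhub package may be an easier starting point than rolling my own image, since it can submit a check to CRAN-like builders. The platform name below is my best guess at R-hub's no long double build, so treat it as an assumption rather than something I have run:

# platform name is an assumption (R-hub's "no long double" builder)
rhub::check("ptools_1.0.1.tar.gz", platform = "debian-gcc-devel-nold")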

For the projection errors, I haven’t travelled much recently – does Long Island still exist? Is the earth no longer an ellipsoid? At our core are we just binary bits flipping the neural networks of our brain – am I no better than the machine?

There is an irony here in that people with better test code coverage are more likely to fail the auto-checks (although those packages are also more likely to be correct!). It is intended and reasonable behavior from CRAN, but it puts a very large burden on the developer (it is not easy to debug noLD behavior on your own, and M1 Macs are effectively impossible unless you wish to pony up the cash for one).


CRAN's model is much different than python's PyPI, in that I could submit something to PyPI that won't install at all, or will install but cause instant errors when running import mypackage. CRAN's approach is more thorough, but as I attest to above it is quite a bit on the pedantic side (there were no “functional” changes to my code over the last month of back and forth).

The main thing I really care about in a package repository is that it does not have malicious code that makes suspicious os calls and/or sends suspicious things over the internet. It is on me to verify the integrity of the code in the end (even if the examples work it doesn't mean the code is correct; I have come across a few R packages with functions that are obviously wrong/misleading). This isn't an open vs closed source thing – you need to verify/sanity check that some things work as expected on your own no matter what.

So I am on the fence about whether CRAN's excessive checking is “worth it” or not. Ultimately, since you can do:

library(devtools)
install_github("apwheele/ptools")

maybe it does not matter in the end. And you can peruse the github actions to see whether the package currently runs on different operating systems and avoid CRAN altogether.

aggregate retention/churn models in python

Instead of having so much code just randomly floating around in blog posts, I need to start making packages (both in R and python) more often. I took it as a challenge to make a simple python package, here retenmod (pypi, github). I got the idea after answering a question on crossvalidated. (The resources I leveraged the most were these two sites/tutorials, packaging projects and minimal example.)

It is a simple port of the R package foretell, which provides several different models to forecast churn based on aggregate survival probabilities. So it only has three functions, and I did not focus too much on extras (like building sphinx docs). But it has just enough complexity to make a nice intro, get-my-feet-wet example.

So you can now download/install the package via pip:

pip install retenmod

And it will automatically install scipy and numpy if you do not already have them installed. For a very simple example, I don't have retention probabilities for any police department offhand, but this document has estimates for how many staff positions police tend to retain after increases.

Here is a simple example of using the library, in particular the BdW model.

import retenmod
import matplotlib.pyplot as plt

large = [100,66,56,52,49,47,44,42]
time = [0,1,2,3,4,5,10,15]

# Only fitting the model to the data through year 3
train, ext = 4, 15
lrg_bdw = retenmod.bdw(large[0:train],ext - train + 1)

# Showing predicted vs observed
pt = list(range(16))
fig, ax = plt.subplots()
ax.plot(pt[1:], lrg_bdw.proj[1:],label='Predicted', 
        c='k', linewidth=2, zorder=-1)
ax.scatter(time[1:],large[1:],label='Observed', 
           edgecolor='k', c='r', s=50, zorder=1)
ax.axvline(train - 0.5, label='Train', color='grey', 
           linestyle='dashed', linewidth=2, zorder=-2)
ax.set_ylabel('% Retaining Position')
ax.legend(facecolor='white', framealpha=1)
plt.xticks(pt[1:])
plt.show()

So you can see that even with only fitting the data through the first three years, years 4 and 5 are forecasted quite well. It underestimates retention further out at 10 and 15 years (the model has a hard time dropping quickly from 100 to 66 and then flattening out in a reasonable way). But even so, the far out forecasts are not that crazy given so few data points.

I will have to work up an example later showing how to translate this into cost-benefit analysis (although I would prefer actual retention data from a PD). Essentially you can calculate the benefit of trying to save (retain) officers vs hiring new officers and training them up, based on just aggregate data. If you wanted to do something like estimate whether retention is going down due to recent events, I would probably use micro-level data and estimate a survival model directly.
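
As a very rough sketch of what that back of the envelope math could look like with this package (all of the dollar figures and staffing numbers below are made up for illustration, they are not estimates from any department):

import retenmod

# made up inputs for illustration only
start_cohort = 100             # officers in the starting cohort
hire_cost = 50000              # hypothetical cost to recruit/train one replacement
obs_reten = [100, 66, 56, 52]  # observed % retained (same toy numbers as above)

# project retention out 10 additional periods with the BdW model
fit = retenmod.bdw(obs_reten, 10)

# expected officers remaining each period, and replacements needed to stay at full strength
remaining = [start_cohort*p/100 for p in fit.proj]
replacements = [remaining[i] - remaining[i+1] for i in range(len(remaining)-1)]
print(sum(replacements)*hire_cost)  # baseline churn cost to weigh against a retention program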

Next up I will try to turn my Exact distribution tests (R Code) for day of week/Benford's analysis into a simple R package and see if I can get it on CRAN. Posting to pypi is quite easy.