Managing R environments using conda

DataColada have a recent blog about their groundhog package, intended to aid in reproducible science. This is more from a perspective of “I have this historical code, how can I try to replicate that researchers environment to get the same results”. So more of a forensic task. What I am going to talk about in this post is to create an environment from the get-go that has the info necessary for others to replicate.

First before I get to that though, I have come across people critiquing open science using essentially ‘the perfect is the enemy of the good’ arguments. Sharing code is good, period. Even if there are different standards of replicability, some code is quite a bit better than no code. And scientists are not professional programmers – understanding all of this stuff takes time and training often in short supply in academia (hence me blogging about boring stuff like creating environments and using github). If this stuff is over your head, please feel free to email/ask a question and I can try to help.

At work I have to solve a very similar problem to scientific reproducibility; I need to write code in one environment (a dev environment, or sometimes my laptop), and then have that code run in a production environment. The way we do this at work is either via conda environments (for persistent environments) or docker images (for ephemeral environments). We currently are 100% python for machine learning, but you can also use the same workflow for R environments (or have a mashup of R/python).

Groundhog doesn’t really solve this all by itself – it doesn’t specify the version of R for example. (And there are issues with even using dates to try to forensically recreate environments, see the Hackernews thread.) But you can use conda directly to set up a reproducible environment from the get-go. Again, what is good for reproducible science is good for reproducing my work in different environments at my workplace.

I have a github folder to show the steps, but just here they are quite simple. First to start, in your project directory at the root, have two files. One is a requirements.txt file that specifies the R libraries you want. And this file may look like:

# This is the requirements.txt file
r-spatstat
r-leaflet
r-devtools
r-markdown

Conda has an annoying add r-* at the front to distinguish r packages from python ones. If there happen to be libraries you are using that are not on conda-forge (e.g. just added to CRAN, or more likely just are on github), we can solve that as well. Make a second script, here I name it packs.R, and within this R script you can install these additional packages. Here is an example installing groundhog, and my ptools package that is only on github. Each have ways you can point to a very specific version:

# This is the packs.R script
library(devtools) # for installing github packages

# Install specific commit/version from github
install_github("apwheele/ptools",ref="9826241c93e9975804430cb3d838329b86f27fd3")

# Install a specific library version from CRAN
# Specifying specific version url for cran package (not on conda-forge)
gh_url <- "https://cran.r-project.org/src/contrib/groundhog_1.5.0.tar.gz"
install.packages(gh_url,repos=NULL,type="source")

OK, so now we are ready to set up our conda environment, so from the command line (or more specifically the anaconda prompt), if you are in the root of your project, you can run something like:

conda create --name rnew
conda activate rnew
conda install -c conda-forge r-base=4.0.5 --file requirements.txt

And this installs a specific version of R, as well as those libraries in the text file. Then if you have additional libraries in the packs.R to install, you can then run:

Rscript packs.R

And conda is smart and the library defaults to installing all the R junk in the right folder (can print out .libPaths() in an R session to see where your conda environment lives). (I am more familiar with conda, so cannot comment, but likely this is exchangeable with RStudio’s renv, horses for courses.)

You may notice my requirements.txt file does not have specific versions. Often you want to be generic when you are first setting up your project, and let conda figure out the mess of version dependencies. If you want to be uber vigilant then, you can then save the exact versions of packages via overwriting your initial requirements file, something like:

conda list --export > requirements.txt

And this updated file will have everything in it, R version, conda-forge ID, etc. (although does not have the packages you installed not via conda, so still need to keep the packs.R file to be able to replicate).

I will put on the slate an example of using docker to create a totally independent environment to replicate code on. I think that is a bit over-kill for most academic projects (although is really even more isolated than this work flow). Even all this work is not 100% foolproof. conda or CRAN or the github package you installed could go away tomorrow – no guarantees in life. But again don’t let the perfect be the enemy of the good – share your scientific code, warts and all!

Leave a comment

1 Comment

  1. Managing R environments using conda – NoteForDataScience

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: