Building wheel files in github actions

At work we are using a new databricks environment (claims based pop health related models). Databricks is very nice as a data querying environment, but it is challenging building well vetted code libraries in python. See the blog post Please don’t make me use databricks notebooks for an overview of the issues. (Other environments that make you write in notebooks, such as Apache Zeppelin, have pretty much all the same limitations.)

So we are still working out the design pattern for how to best write well vetted code. It is looking a bit like this workflow by menziess, I have been able to get dbconnect (and databricks-sql), installed on local windows machines. From there I can do all the usual junk – linting pre-commits, writing unit tests, etc. on my local machine. Then I push, and can do some final checks (or run a real life pipeline), in the databricks GUI environment.

One difference though is instead of doing Azure pipelines to build the wheel files, I am using Github Actions. To share I use my retenmod package as an example. The github action is pretty straightforward, and uses the same trick to push inside the action as I wrote about previously.

So here is the action code in-situ, but I can copy-paste the workflow right here in the blog to illustrate the yaml:

# Github actions to build
# and push wheel files
on:
  push:
    branches:
      - main
      - master

jobs:
  build_wheel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          #pip install .
          find ./dist/*.whl | xargs pip install
          python simple_test.py
      - name: Configure Git
        run: |
          git config --global user.email "apwheele@gmail.com"
          git config --global user.name "apwheele"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push

And then in your databricks notebooks, you can then have a locally scoped environment, so can have:

%pip install ./dist/libname.whl

At the front of your notebook. And then in a code cell, can then do:

import retenmod as rm
# do whatever rm functions from the library

Just like any normal python package. There are a few potential gotchas here. 1) I will need to write a python script to also edit libname.whl in the data pipelines whenever I update versions (my unix grep/sed fu is not up to task to grep out whl files). But that should be as simple as calling python edit_files.py inside the github action, and then amending the git add . to scoop up the edited files.

A second part is that with work repos, pushing inside the action is a bit trickier, so we need to work with personal access tokens/actions secrets and set the remote url for the push. It is tough for me to illustrate that with public repos though, so will have to wait until another blog post.

Leave a comment

2 Comments

  1. Setting up pyspark to run sql tests | Andrew Wheeler
  2. Saving data files in wheel packages | Andrew Wheeler

Leave a comment