Getting started with github notes

I mentioned on LinkedIn the other day I think github is a good resource for crime analysts to learn. Even if you don’t write code, it is convenient to have an audit-trail of changes in documents.

Jerry Ratcliffe made the comment that it is a tough learning curve, and I agree dealing with merge conflicts is a pain in the butt:

In the past I have suggested people to get started by using the github desktop GUI tool. But I do not suggest that anymore because of the issues Jerry mentions. If you get headaches like this, you pretty much need to use the command line to deal with them. I do not have many git commands memorized, and I will give a rundown of my getting started with git and github notes. So I just suggest now people bite the bullet and learn the command line.

Agree it takes some effort, but I think it is well worth it.

Making a project and first commit

Technically github is the (now Microsoft owned) software company that offers web hosted version control, and git is a more general system for version control. (There is another popular web host called Gitlab for example.) Here I will offer advice about using github and git from the command line.

So first, I typically create projects first online on the web-browser on github.com (I do not have the command prompt command memorized to create a new repository). On github.com, click the green New button:

Here I am creating a new repo named example_repo. I do it this way intentionally, as I can make sure I set the repo owner to the correct one (myself or my organization), and set the repo to the correct public/private by default. Many things you want to default to private.

Note on windows, the git command is probably not installed by default. If you install git-bash, it should be available in the command prompt.

Now that you have your repository created, in github I click the green code button, and copy the URL to the repo:

Then from the command line, navigate to where you want to download the repo (I set up my windows machine so I have a G drive mapped to where I download github repos). So from command line, mine looks like:

# cd to to correct location
git clone https://github.com/apwheele/example_repo.git
# now go inside the folder you just downloaded
cd ./example_repo

Now typically I do two things when first creating a repo, edit the README.md to give a high level overview of the project, and also create a .gitignore file (no file extension!). Often you have files that you don’t want committed to the github repository. Most of my .gitignore files look like this for example, where # are comment lines:

# No csv files
*.csv

# No python artifacts
*.pyc
__pycache__

# can prevent uploading entire folders if you want
/folder_dont_upload

Note if you don’t generally want files, but want a specific file for whatever reason, you can use an exclamation point, e.g. !./data/keep_me.csv will include that file, even if you have *.csv as ignored in the .gitignore file in general. And if you want to upload an empty folder, place a .gitkeep file in that folder.

Now in the command prompt, run git status. You will see the files that you have edited listed (minus any file that is ignored in the gitignore file).

So once you have those files edited, then in the command prompt you will do three different commands in a row:

git add .
git commit -m 'making init commit'
git push

The first command git add ., adds all of the files you edited (again minus any file that is ignored in the gitignore file). Note you can add a specific file one at a time if you want, e.g. git add README.md, but using the period adds all of the files you edited at once.

Git commit adds in a message where you should write a short note on the changes. Technically at this point you could go and do more changes, but here I am going to git push, which will send the updates to the online hosted github branch. (Note if doing this the first time from the command prompt, you may need to give your username and maybe set up a github token or do two-factor authentication).

You don’t technically need to do these three steps at once, but in my workflows I pretty much always do. Now you can go checkout the online github repo and see the updated changes.

Branches

When you are working on things yourself for small projects, just those above commands and committing directly to the default main branch is fine. Branches allow for more complicated scenarios like:

  • you want the main code to not change, but you want to experiment and try out some changes
  • you have multiple people working on the code at the same time

Branches provide isolation – they allow the code in the branch to change, whereas code in main (or other branches) does not change. Here I am going to show how to make a branch in the command prompt, but first a good habit when working with multiple people is to do at the start of your day:

git fetch
git pull origin main

Git fetch updates the repo if other collaborators added a branch (but does not update the files directly). And git pull origin main pulls the most recent main branch version. So if a colleague updated main, when you do git pull origin main it will update the code on your local computer. (If you want to pull the most recent version of a different branch, it will be git pull origin branch_name.)

To create a new branch, you can do:

git checkout -b new_branch

Note if the branch is already created you can just omit the -b flag, and this just switches to that branch. Make a change, and then when pushing, use git push origin new_branch, which specifies you are specifically pushing to your branch you just created (instead of pushing to the default main branch).

# after editing readme to make a change
git add .
git commit -m 'trivial edit'
git push origin new_branch

Now back in the browser, you can go and checkout the updated code by switching the branch you are looking at in the dropdown on the left hand part of the screen that says “new_branch” with the tiny branching diagram:

A final step, you want to merge the branch back into the main code script. If you see the big green button Compare and Pull Request in the above screenshot, click that, and it will bring up a dialog about creating a pull request. Then click the green Create Pull Request button:

Then after you created the request, it will provide another dialogue to merge in the code into the target (by default main).

If everything is ok (you have correct permissions and no merge conflicts), you can click the buttons to merge the branches and that is that.

Merge Conflicts

The rub with above is that sometimes merge conflicts happen, and as Jerry mentions, these can be a total pain to sort out. It is important to understand why merge conflicts happen first though, and to take steps to prevent them. In my experience merge conflicts most often happen because of two reasons:

Multiple people are working on the same branch, and I forget to run git pull origin branch at the start of my day, so I did not incorporate the most recent changes. (Note these can happen via auto-changes as well, such as github actions running scripts.)

The second scenario is someone updated main, and I did not update my version of main. This tends to occur with more long term development. Typically this means at the start of my day, I should have run git checkout main, then git pull origin main.

I tend to find managing merge conflicts is very difficult using the built in github tools (so I don’t typically use git rebase for example). More commonly, when I have a merge conflict for a single file, first I will save the file that is giving me problems outside of the github repo (so I don’t accidently delete/overwrite it). Then, if my new_branch is conflicting with main, I will do:

# this pulls the exact file from main
git checkout main conflict_file.txt
git add conflict_file.txt
git commit -m 'pulling file to fix conflict'
git push origin new_branch

Then if I want to make edits to conflict_file.txt, make the edits now, then redo add-commit-push.

This workflow tends to be easier in my experience than dealing with rebase or trying to edit the merge conflicts directly.

It is mostly important though to realize what caused the merge conflict to begin with, to prevent the pain of dealing with it again in the future. My experience these are mostly avoidable, and mean you made a personal mistake in not pulling the most recent version, or more rarely collaboration with a colleague wasn’t coordinated correctly (you both editing the same file at the same time).

I realize this is not easy – it takes a bit of work to understand github and incorporate into your workflow. I think it is a worthwhile tool for analysts and data scientists to learn though.

Building wheel files in github actions

At work we are using a new databricks environment (claims based pop health related models). Databricks is very nice as a data querying environment, but it is challenging building well vetted code libraries in python. See the blog post Please don’t make me use databricks notebooks for an overview of the issues. (Other environments that make you write in notebooks, such as Apache Zeppelin, have pretty much all the same limitations.)

So we are still working out the design pattern for how to best write well vetted code. It is looking a bit like this workflow by menziess, I have been able to get dbconnect (and databricks-sql), installed on local windows machines. From there I can do all the usual junk – linting pre-commits, writing unit tests, etc. on my local machine. Then I push, and can do some final checks (or run a real life pipeline), in the databricks GUI environment.

One difference though is instead of doing Azure pipelines to build the wheel files, I am using Github Actions. To share I use my retenmod package as an example. The github action is pretty straightforward, and uses the same trick to push inside the action as I wrote about previously.

So here is the action code in-situ, but I can copy-paste the workflow right here in the blog to illustrate the yaml:

# Github actions to build
# and push wheel files
on:
  push:
    branches:
      - main
      - master

jobs:
  build_wheel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - name: Build wheel and install
        run: |
          python -m pip install --user --upgrade build
          python -m build
          #pip install .
          find ./dist/*.whl | xargs pip install
          python simple_test.py
      - name: Configure Git
        run: |
          git config --global user.email "apwheele@gmail.com"
          git config --global user.name "apwheele"
      - name: Commit and push wheel
        run: |
          git add -f ./dist/*.whl
          git commit -m 'pushing new wheel'
          git push

And then in your databricks notebooks, you can then have a locally scoped environment, so can have:

%pip install ./dist/libname.whl

At the front of your notebook. And then in a code cell, can then do:

import retenmod as rm
# do whatever rm functions from the library

Just like any normal python package. There are a few potential gotchas here. 1) I will need to write a python script to also edit libname.whl in the data pipelines whenever I update versions (my unix grep/sed fu is not up to task to grep out whl files). But that should be as simple as calling python edit_files.py inside the github action, and then amending the git add . to scoop up the edited files.

A second part is that with work repos, pushing inside the action is a bit trickier, so we need to work with personal access tokens/actions secrets and set the remote url for the push. It is tough for me to illustrate that with public repos though, so will have to wait until another blog post.

Some more github action tricks

Hackernews shared the other day a project using github actions to generate a nice readme for your base Github profile. That workflow uses rust to query the github API and get some stats to then insert into the README.

Two things I noticed I did not realize you could do with actions previously; 1) you can schedule actions to run on a regular basis via a cron job, 2) you can push to the repo inside of the action. (And this does not cause some infinite recursion with actions.) So I have updated my profile to run some python code, generating an image of the number of potholes filled in Raleigh per week.

And you can see that this was updated on 4/7, and that was the automated job that was re-run.

It is pretty simple python code. You just have to have a step in your actions to build the python environment, then you can run your code.

With the regular cron job, you could offload different pieces of work to github, say automate scraping a site or sending out emails once a week. You just need to have a python (or whatever language) script to automate that process. Or you could do more fancy analysis for a project, and post that in the readme via a Jupyter notebook script. If the source data can be downloaded via the internet anyway.

pre-commit hooks and github actions

In keeping up with learning about code development in python and R, two things I have added to my retenmod python package recently are pre-commit hooks and github actions. I am not going to give a code example here, you can google pre-commit python black flake and get like a dozen different blog posts to describe the process. Ditto for github actions. I think it is good to do some googling on the processes, and then you can see the final yaml files I have made to do either process:

The idea behind pre-commit, is that it runs a set of commands to check your py files before you commit your changes to github on your local system. So here the pre-commit I created for this package does three things:

  • runs black code formatter for python (formats whitespace nicely where it can)
  • runs flake8 checks (checks whether py files meet pep standards)
  • updates my readme document

Note one thing I want to be able to do but cannot currently with pre-commit is to update all jupyter notebooks in place as well. Unfortunately this does not work, as notebooks generate some time meta-data under the hood (so the file is modified, and fails the test). There is probably a way to fix this (maybe some smart config via nbdime), but it is not a big deal for me at the moment. But it works fine executing notebooks and saving to different files, so the readme hook in that example works just fine. (And won’t be as painful as say for Rmarkdown as for Jupyter with that time meta-data.)

Pre-commit hooks run on your local system, so you might not want to have it do pytests. My retenmod package though is tiny (intentionally did it as a very simple example to learn). At work I do development on both a windows machine and Red Hat virtual machines, and fortunately so far have not had issues with needing different yaml files, although I could potentially seeing that happen in more complicated set ups. Although if that were the case, I would like just not install on windows (I use local windows laptop to edit powerpoint presentations, and use Red Hat for pretty much everything else).

Github actions does not run on your personal machine, but runs after you have pushed your changes to github. And then github basically spins up virtual environments and does whatever tests. This makes sense for package development, to make sure your package can be installed on multiple operating systems. And you can run other unit tests at this stage on those systems as well. But for certain tests that only make sense on your local system (say functions to generate database connections) github actions will not make sense.

Next up I will need to learn whether it makes sense to generate artifacts via github actions (for that Jupyter notebook example where pre-commit is not working so well?). Also I have never really figured out sphinx docs for python. So I will see if I can get that up and running as well for this retenmod package.

Using Jupyter Notebooks to make nice README’s for GitHub

Working on my R package ptools, the devtools folks have you make a readme R markdown file to compile to a nice readme markdown file for github. I thought to myself that you could functionally do the same thing with juypter notebooks for python. So here is a quick example of that for my retenmod python package.

So first, here is the old readme in its entirety, rendered on Github:

You can see I have an example code snippet, but it does not actually output the results. Here is the update, where I use jupyter to render the markdown nicely:

So we have nice syntax highlighting even. (Note that the pip install code is not run in a %sh cell, it is just code formatting inside of a markdown cell.) I do like the way R markdown renders the output a bit nicer than here (I also haven’t tried with pretty pandas tables). But you can also include matplotlib images as well:

You might typically want to add in README.ipynb to your gitignore file, but here I included it in the github package so you can see what this notebook looks like. To compile the notebook to markdown is quite simple:

jupyter nbconvert --execute --to markdown README.ipynb

If you have matplotlib images, it saves them in a folder named README_files (not sure if there is a flag to change this option). To get the images to render you then also need to push that image folder into Github.