Getting started with github notes

I mentioned on LinkedIn the other day I think github is a good resource for crime analysts to learn. Even if you don’t write code, it is convenient to have an audit-trail of changes in documents.

Jerry Ratcliffe made the comment that it is a tough learning curve, and I agree dealing with merge conflicts is a pain in the butt:

In the past I have suggested that people get started with the github desktop GUI tool. I do not suggest that anymore because of the issues Jerry mentions: if you get headaches like this, you pretty much need to use the command line to deal with them. So now I just suggest people bite the bullet and learn the command line. I do not have many git commands memorized, so below is a rundown of my getting started with git and github notes.

Agree it takes some effort, but I think it is well worth it.

Making a project and first commit

Technically github is the (now Microsoft owned) software company that offers web hosted version control, and git is a more general system for version control. (There is another popular web host called Gitlab for example.) Here I will offer advice about using github and git from the command line.

I typically create projects online first, in the web browser on github.com (I do not have the command prompt command to create a new repository memorized). On github.com, click the green New button:

Here I am creating a new repo named example_repo. I do it this way intentionally, as I can make sure I set the repo owner to the correct one (myself or my organization), and set the repo to the correct public/private by default. Many things you want to default to private.

Note on windows, the git command is probably not installed by default. If you install git-bash, it should be available in the command prompt.

Now that you have your repository created, in github I click the green code button, and copy the URL to the repo:

Then from the command line, navigate to where you want to download the repo (I set up my windows machine so I have a G drive mapped to where I download github repos). So from command line, mine looks like:

# cd to the correct location
git clone https://github.com/apwheele/example_repo.git
# now go inside the folder you just downloaded
cd ./example_repo

Now typically I do two things when first creating a repo, edit the README.md to give a high level overview of the project, and also create a .gitignore file (no file extension!). Often you have files that you don’t want committed to the github repository. Most of my .gitignore files look like this for example, where # are comment lines:

# No csv files
*.csv

# No python artifacts
*.pyc
__pycache__

# can prevent uploading entire folders if you want
/folder_dont_upload

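Note if you don’t generally want files, but want a specific file for whatever reason, you can use an exclamation point, e.g. !data/keep_me.csv will include that file, even if you have *.csv as ignored in the .gitignore file in general. (Write the path relative to the .gitignore file without a leading ./, as git does not match patterns that start with ./.) And if you want to upload an empty folder, place a .gitkeep file in that folder, since git does not track empty folders.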
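If you want to check what a .gitignore actually does before committing anything, a throwaway repo is a quick way to test it. This is just a minimal sketch: the folder and file names are made up for illustration, the keep pattern is written as data/keep_me.csv (no leading ./), and it assumes git plus a Unix style shell such as git-bash.

```shell
# Throwaway repo in a temp folder; every file name here is hypothetical
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q .
mkdir data
touch data/drop_me.csv data/keep_me.csv notes.txt

# Ignore all csv files, then re-include one specific file
printf '%s\n' '*.csv' '!data/keep_me.csv' > .gitignore

# Stage everything, then list what git would actually commit
git add .
tracked_files="$(git ls-files)"
echo "$tracked_files"
```

The listing should include data/keep_me.csv but not data/drop_me.csv.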

Now in the command prompt, run git status. You will see the files that you have edited listed (minus any file that is ignored in the gitignore file).

So once you have those files edited, then in the command prompt you will do three different commands in a row:

git add .
git commit -m "making init commit"
git push

The first command git add ., adds all of the files you edited (again minus any file that is ignored in the gitignore file). Note you can add a specific file one at a time if you want, e.g. git add README.md, but using the period adds all of the files you edited at once.

Git commit records the staged changes in your local repository, along with a message where you should write a short note on the changes. Technically at this point you could go and make more changes, but here I am going to git push, which sends the commits to the online hosted github repository. (Note if doing this the first time from the command prompt, you may need to give your username and maybe set up a github token or do two-factor authentication.)

You don’t technically need to do these three steps at once, but in my workflows I pretty much always do. Now you can go checkout the online github repo and see the updated changes.
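To see what each of the three steps changes, here is a rough sketch you can run without touching github at all. It assumes a local bare repository can stand in for the online repo, and the temp paths, user name, and email are placeholders I made up for the demo.

```shell
# A local bare repo plays the role of github here (hypothetical paths)
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/example_repo"
cd "$work/example_repo"

# Placeholder identity so the commit works on a fresh machine
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "# example_repo" > README.md
git status --short      # ?? README.md  (new file, not yet staged)
git add .
git status --short      # A  README.md  (staged for the next commit)
git commit -q -m "making init commit"
git status --short      # prints nothing, the working folder is clean
git push -q origin HEAD # send the current branch to the stand-in remote

# Ask the stand-in remote for the last commit message it received
pushed_message="$(git --git-dir="$work/remote.git" log --all -1 --format=%s)"
echo "$pushed_message"
```

Running git status between the steps is optional, but it is a cheap way to confirm what is about to be committed.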

Branches

When you are working on things yourself for small projects, just those above commands and committing directly to the default main branch is fine. Branches allow for more complicated scenarios like:

  • you want the main code to not change, but you want to experiment and try out some changes
  • you have multiple people working on the code at the same time

Branches provide isolation – they allow the code in the branch to change, whereas code in main (or other branches) does not change. Here I am going to show how to make a branch in the command prompt, but first a good habit when working with multiple people is to do at the start of your day:

git fetch
git pull origin main

Git fetch downloads anything new from github, such as a branch a collaborator added, but does not change the files you are working on. And git pull origin main pulls the most recent version of main into the branch you currently have checked out. So if a colleague updated main and you are on main, git pull origin main will update the code on your local computer. (If you want to pull the most recent version of a different branch, it will be git pull origin branch_name.)
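The difference between fetch and pull is easier to see in a small simulation. This sketch is not part of my usual notes: it assumes a local bare repo standing in for github, a made-up colleague clone, and it adds git log HEAD..origin/main as one way to peek at what a pull would bring in.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
# Make the stand-in remote use main as its default branch
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/main

# Hypothetical colleague creates the first commit and pushes it
git init -q "$work/colleague"
cd "$work/colleague"
git symbolic-ref HEAD refs/heads/main
git config user.name "Colleague"
git config user.email "colleague@example.com"
git remote add origin "$work/remote.git"
echo "first" > notes.txt
git add .
git commit -q -m "first commit"
git push -q origin main

# My copy, cloned before the colleague makes another change
git clone -q "$work/remote.git" "$work/mine"
echo "second" >> notes.txt
git commit -q -am "colleague update"
git push -q origin main

# Start of my day: fetch, look at what is new, then pull
cd "$work/mine"
git fetch -q
incoming="$(git log --format=%s HEAD..origin/main)"
echo "not yet in my files: $incoming"
git pull -q origin main
cat notes.txt
```

After the fetch my notes.txt is still unchanged; only the pull updates the file on disk.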

To create a new branch, you can do:

git checkout -b new_branch

Note if the branch is already created you can just omit the -b flag, and this just switches to that branch. Make a change, and then when pushing, use git push origin new_branch, which specifies you are specifically pushing to your branch you just created (instead of pushing to the default main branch).

# after editing readme to make a change
git add .
git commit -m "trivial edit"
git push origin new_branch
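To convince yourself that a branch really is isolated, you can try it in a scratch repo first. A minimal sketch, assuming a temp folder and a placeholder identity (both made up here), with no remote involved:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q .
# Name the first branch main regardless of the local git default
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "stable" > README.md
git add .
git commit -q -m "making init commit"

# Create the branch and commit a change that only lives there
git checkout -q -b new_branch
echo "experiment" >> README.md
git commit -q -am "trivial edit"

# Switching branches swaps the file contents on disk
git checkout -q main
main_text="$(cat README.md)"
git checkout -q new_branch
branch_text="$(cat README.md)"

echo "on main:"
echo "$main_text"
echo "on new_branch:"
echo "$branch_text"
```

The README on main keeps its single line, while new_branch has the extra one.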

Now back in the browser, you can go and checkout the updated code by switching the branch you are looking at in the dropdown on the left hand part of the screen that says “new_branch” with the tiny branching diagram:

As a final step, you want to merge the branch back into main. If you see the big green Compare & pull request button in the above screenshot, click that, and it will bring up a dialog about creating a pull request. Then click the green Create pull request button:

Then after you created the request, it will provide another dialogue to merge in the code into the target (by default main).

If everything is ok (you have correct permissions and no merge conflicts), you can click the buttons to merge the branches and that is that.

Merge Conflicts

The rub with the above is that sometimes merge conflicts happen, and as Jerry mentions, these can be a total pain to sort out. It is important to first understand why merge conflicts happen though, and to take steps to prevent them. In my experience merge conflicts most often happen for two reasons:

Multiple people are working on the same branch, and I forget to run git pull origin branch at the start of my day, so I did not incorporate the most recent changes. (Note these can happen via auto-changes as well, such as github actions running scripts.)

The second scenario is someone updated main, and I did not update my version of main. This tends to occur with more long term development. Typically this means at the start of my day, I should have run git checkout main, then git pull origin main.

I tend to find managing merge conflicts is very difficult using the built in git tools (so I don’t typically use git rebase for example). More commonly, when I have a merge conflict for a single file, first I will save the file that is giving me problems outside of the github repo (so I don’t accidentally delete/overwrite it). Then, if my new_branch is conflicting with main, I will do:

# this pulls the exact file from main
git checkout main conflict_file.txt
git add conflict_file.txt
git commit -m "pulling file to fix conflict"
git push origin new_branch

Then if I want to make edits to conflict_file.txt, make the edits now, then redo add-commit-push.

This workflow tends to be easier in my experience than dealing with rebase or trying to edit the merge conflicts directly.
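If you want to rehearse this before it happens on a real project, the sketch below manufactures a conflict in a scratch repo and then resolves it by taking the file from main. The repo, the backup folder, and the file contents are all made up, and the conflict is triggered here with a local git merge main, which is an assumption about how the conflict shows up.

```shell
repo="$(mktemp -d)"
backup="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "original" > conflict_file.txt
git add .
git commit -q -m "base version"

# My edit on new_branch
git checkout -q -b new_branch
echo "my edit" > conflict_file.txt
git commit -q -am "edit on new_branch"

# A competing edit to the same line lands on main
git checkout -q main
echo "colleague edit" > conflict_file.txt
git commit -q -am "edit on main"

# Back on my branch: save my copy outside the repo, then hit the conflict
git checkout -q new_branch
cp conflict_file.txt "$backup/"
git merge main > /dev/null 2>&1 || echo "merge stopped on a conflict"

# Take the exact file from main, then finish with add and commit
git checkout main -- conflict_file.txt
git add conflict_file.txt
git commit -q -m "pulling file to fix conflict"
cat conflict_file.txt
```

My own edit is still sitting in the backup folder, so it can be reapplied by hand afterwards.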

It is mostly important though to realize what caused the merge conflict to begin with, to prevent the pain of dealing with it again in the future. In my experience these are mostly avoidable, and mean you made a personal mistake in not pulling the most recent version, or more rarely that collaboration with a colleague wasn’t coordinated correctly (you were both editing the same file at the same time).

I realize this is not easy – it takes a bit of work to understand github and incorporate it into your workflow. I think it is a worthwhile tool for analysts and data scientists to learn though.

Running files locally in SPSS

Say I made a little python script for a friend to scrape data from a website whenever they wanted updates. I write my python script, say scrape.py, and a run_scrape.bat file for my friend on their windows machine (or run_scrape.sh on Mac/Unix). Inside the bat file is the command:

python scrape.py

I tell my friend to save those two files in whatever folder they want; they just need to double click the bat file and it will save the scraped data into the same folder. Here the bat file is run locally, so the current directory is wherever the bat file is located.
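For the run_scrape.sh version, a shell script keeps whatever directory it was called from, so one common approach is to cd to the script's own folder first. This is only a sketch under assumptions: the temp folder is made up, and pwd stands in for the real python scrape.py call so the example runs without python.

```shell
# Write a hypothetical run_scrape.sh into a temp folder
demo="$(mktemp -d)"
cat > "$demo/run_scrape.sh" <<'EOF'
#!/bin/sh
# Move to the folder holding this script, wherever it was launched from
cd "$(dirname "$0")" || exit 1
pwd   # stand-in for: python scrape.py
EOF
chmod +x "$demo/run_scrape.sh"

# Launch it from a completely different directory
cd /
ran_in="$("$demo/run_scrape.sh")"
echo "script ran in: $ran_in"
```

Whatever folder your friend saves the script in is the folder the scraped data would land in.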

This is how the majority of code is packaged – can clone from github or email a zip file, and it will just work no matter where the local user saves those scripts. I need to have my friend have their python environment set up correctly, but most of the stuff I do I can say download Anaconda and click yes to setting python on the path and they are golden.

SPSS makes things more painful. Say I added SPSS to the PATH environment variable on my windows machine, and I run an SPSS production job from the command prompt:

cd "C:\Users\Andrew"
spss "print_dir.spj" -production silent

And say the spj file, all it does is call a syntax show.sps which has as the only command SHOW DIR. This still prints out wherever SPSS is installed as the current working directory inside of the SPSS session. On my machine currently C:\Program Files\IBM\SPSS Statistics. So SPSS takes over the location of the current directory. Also we can open up the spj file (it is just a plain text xml file). Here is what a current spj file looks like for me (note it is all on one line as well!):

And that file also has several hard coded file locations. So to get the same behavior as python scrape.py earlier, we need to dynamically set the paths in the production job as well, not just alter the command line scripts. This can be done with a little command line magic in windows, dynamically replacing the right text in the spj file. So in a bat file, you can do something like:

@echo on
set "base=%cd%"
:: code to define SPJ (SPSS production file)
echo ^<?xml version=^"1.0^" encoding=^"UTF-8^" standalone=^"no^"?^>^<job xmlns=^"http://www.ibm.com/software/analytics/spss/xml/production^" codepageSyntaxFiles=^"false^" print=^"false^" syntaxErrorHandling=^"continue^" syntaxFormat=^"interactive^" unicode=^"true^" xmlns:xsi=^"http://www.w3.org/2001/XMLSchema-instance^" xsi:schemaLocation=^"http://www.ibm.com/software/analytics/spss/xml/production http://www.ibm.com/software/analytics/spss/xml/production/production-1.4.xsd^"^>^<locale charset=^"UTF-8^" country=^"US^" language=^"en^"/^>^<output imageFormat=^"jpg^" imageSize=^"100^" outputFormat=^"text-codepage^" outputPath=^"%base%\job_output.txt^" tableColumnAutofit=^"true^" tableColumnBorder=^"^|^" tableColumnSeparator=^"space^" tableRowBorder=^"-^"/^>^<syntax syntaxPath=^"%base%\show.sps^"/^>^<symbol name=^"setdir^" quote=^"true^"/^>^</job^> > transfer_job.spj
"C:\Program Files\IBM\SPSS Statistics\stats.exe" "%base%\transfer_job.spj" -production silent -symbol @setdir "%base%"

It would be easier to use sed to find/replace the text for the spj file instead of the superlong one-liner on echo, but I don’t know if Windows always has sed installed. Also note the escape characters (it is crazy how windows parses this single long line, although the documented limit for a single cmd.exe command line is 8191 characters, so keep an eye on the length).
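For anyone who does have sed available (git-bash ships with it), the find/replace version could look roughly like the sketch below. It is not what I actually run: the template is a heavily trimmed stand-in for a real spj file, __BASE__ is a placeholder name I made up, and the paths use forward slashes as they would in a Unix style shell.

```shell
demo="$(mktemp -d)"
cd "$demo"

# Hypothetical, trimmed-down template with a made-up __BASE__ placeholder
cat > template.spj <<'EOF'
<job><output outputPath="__BASE__/job_output.txt"/><syntax syntaxPath="__BASE__/show.sps"/></job>
EOF

base="$(pwd)"
# Escape characters that are special in a sed replacement (&, \ and the | delimiter)
safe_base="$(printf '%s' "$base" | sed 's/[&|\\]/\\&/g')"
# Swap every placeholder for the current folder and write the job file
sed "s|__BASE__|$safe_base|g" template.spj > transfer_job.spj
cat transfer_job.spj
```

Using | as the sed delimiter avoids having to escape every slash in the folder path.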

You can see in the call to the production job, I pass a parameter, @setdir, and expand it out in the shell using %base%. In show.sps, I now have this line:

CD @setdir.

And now SPSS has set the current directory to wherever you have the .bat file and .sps syntax file saved. So now everything is dynamic, and runs wherever you have all the files saved. The only thing that is not dynamic in this setup is the location of the SPSS executable, stats.exe. So if you are sharing SPSS code like this, you will need to either tell your friend to add C:\Program Files\IBM\SPSS Statistics to their environment path, or edit the .bat file to the correct path, but otherwise this is dynamically run in the local folder with all the materials.

Git excluding specific files when merging branches

The other day at work I had a mildly annoying problem – merging only selected files between a test and production branch in github. My particular use case was I had a test branch that needed to only interact with a test database, and the master branch needed to talk to the prod database. So I had particular config files with essentially different SQLAlchemy connection strings, but nothing else. Note I did not want these files ignored, just not merged between branches. (If I edit them I will need to make sure to edit both master & test branches in the end.)

I often use the GitHub desktop GUI to commit changes (when working on my local laptop). You can use the GUI to make a pull request, but when accepting the pull request in the browser I think it is all or nothing. I also need an entirely command line solution anyway, for when I am working on a remote headless machine with no GUI. So here are my notes on how I solved the issue.

So just for illustration I added a test branch to my Blog_Code repository, and then some junk files just to illustrate. Via the git bash shell, if you navigate to your repository and do git diff master test --name-only, it shows you the different files in the two branches:

So you can see that I have 5 different files in total. Two config files and three different text files. If we do git diff master test -- special_config1, we can see the more specific differences between those two config files:

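So you can see that the master branch version (in red) and the test branch version (in green) just have a minor difference. (With git diff master test, lines that only exist in master show up as removals and lines that only exist in test show up as additions.) But in the end I want to keep those two files different between the branches, and not merge this config file (along with the other config file).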

So here is the particular logic I put together, piping a bunch of commands together:

git switch master
git diff master test --name-only |
grep -v 'special_config1' |
grep -v 'Python/special_config2' |
sed 's/.*/"&"/' |
xargs git checkout test

The first line git switch is pretty self explanatory – I switch to the master branch (I will typically be doing work on test). Second I grab all the files that are different using git diff branch1 branch2, and only print out the file names. Third/Fourth lines I use grep to get rid of my specific config files out of that resulting list of files. You could also do grep -v -E 'file1.txt|file2.txt' | to drop both in one pass; the -E flag is needed because plain grep treats | as a literal character, which is probably why my first attempt without it was giving me fits.

The fifth sed line I wrap the files in quotes (if you have a file that has a space it will cause problems otherwise).

Sixth line I then use xargs to pass git checkout from the test environment, and pass in all of my files (minus my two config files). This is advice taken from this blog, just a slicker way to grab all of the files that are different minus a few specific config files. So instead of typing git checkout test file1.txt file2.txt etc. and typing the files by hand, I just grab all the files that are different and check them all out together.

Then once that is done it is the usual to commit the updated files. And then here in the end I switch the active branch back to test.

git commit -m 'Example only merging select files'
git push
git switch test
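Here is a self-contained rehearsal of the whole thing in a scratch repo, in case you want to try it before running it on real branches. The file contents, the extra files, and the identity are made up; it uses a single grep -E instead of two greps, adds -- before the file list, and uses git checkout rather than git switch for the setup so it also runs on older git versions.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

# master talks to prod (hypothetical config contents)
mkdir Python
echo "db=prod" > special_config1
echo "db=prod" > Python/special_config2
echo "v1" > "my notes.txt"
git add .
git commit -q -m "initial master"

# test branch: different configs plus real work to carry over
git checkout -q -b test
echo "db=test" > special_config1
echo "db=test" > Python/special_config2
echo "v2" > "my notes.txt"
echo "new" > extra.txt
git add .
git commit -q -m "work on test"

# Bring over every changed file except the two config files
git checkout -q master
git diff master test --name-only |
  grep -v -E 'special_config1|Python/special_config2' |
  sed 's/.*/"&"/' |
  xargs git checkout test --
git commit -q -m "Example only merging select files"

# Only the two config files should still differ between the branches
still_different="$(git diff master test --name-only)"
echo "$still_different"
git checkout -q test
```

The file with a space in its name is there on purpose, to show why the sed quoting step matters.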

Maybe one of these days I will ditch the GUI entirely. But for now I will just have to get by with my limited command line fu compared to the real computer programmers I work with more regularly.