I mentioned on LinkedIn the other day I think github is a good resource for crime analysts to learn. Even if you don’t write code, it is convenient to have an audit-trail of changes in documents.
Jerry Ratcliffe made the comment that it is a tough learning curve, and I agree dealing with merge conflicts is a pain in the butt:
In the past I have suggested that people get started with the github desktop GUI tool. I do not suggest that anymore, because of the issues Jerry mentions: if you get headaches like this, you pretty much need to use the command line to deal with them. So now I just suggest people bite the bullet and learn the command line. I do not have many git commands memorized, so below is a rundown of my getting started notes for git and github.
Agree it takes some effort, but I think it is well worth it.
Making a project and first commit
Technically github is the (now Microsoft owned) software company that offers web hosted version control, and git is a more general system for version control. (There is another popular web host called Gitlab for example.) Here I will offer advice about using github and git from the command line.
So first, I typically create projects online in the web browser on github.com (I do not have the command to create a new repository memorized). On github.com, click the green New button:
Here I am creating a new repo named example_repo. I do it this way intentionally, as I can make sure I set the repo owner to the correct one (myself or my organization), and set the repo to the correct public/private setting. Many things you want to default to private.
Note on windows, the git command is probably not installed by default. If you install git-bash, it should be available in the command prompt.
Now that you have your repository created, in github I click the green code button, and copy the URL to the repo:
Then from the command line, navigate to where you want to download the repo (I set up my windows machine so I have a G drive mapped to where I download github repos). So from command line, mine looks like:
# cd to the correct location
git clone https://github.com/apwheele/example_repo.git
# now go inside the folder you just downloaded
cd ./example_repo
Now typically I do two things when first creating a repo: edit the README.md to give a high level overview of the project, and create a .gitignore file (no file extension!). Often you have files that you don’t want committed to the github repository. Most of my .gitignore files look like this, where lines starting with # are comments:
# No csv files
*.csv
# No python artifacts
*.pyc
__pycache__
# can prevent uploading entire folders if you want
/folder_dont_upload
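Note if you generally don’t want a type of file, but want to keep a specific file for whatever reason, you can use an exclamation point, e.g. !/data/keep_me.csv will include that file, even if you have *.csv ignored in the .gitignore file in general. (The exception line needs to come after the *.csv line, and the path is written relative to the .gitignore file, without a leading ./ prefix.) And if you want to upload an empty folder, place a .gitkeep file in that folder. Git does not track empty folders, and .gitkeep is just a conventional name for a placeholder file.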
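If you are not sure whether a rule does what you expect, git can tell you which pattern matched a file. Below is a minimal sketch in a throwaway folder; the data folder and the two csv file names are made up for illustration.

```shell
# Throwaway repo so nothing real is touched (all paths here are hypothetical)
demo="$(mktemp -d)"
cd "$demo"
git init -q

# Ignore csv files in general, then make one exception.
# The exception line has to come after the rule it overrides.
cat > .gitignore <<'EOF'
*.csv
!/data/keep_me.csv
EOF

mkdir data
touch data/keep_me.csv data/drop_me.csv

# -v prints the .gitignore line that matched each path,
# including the negated (!) rule for keep_me.csv
git check-ignore -v data/drop_me.csv data/keep_me.csv

# Untracked files that "git add ." would pick up
git status --short --untracked-files=all
```

In the status output, data/keep_me.csv should be listed as untracked while data/drop_me.csv does not show up at all.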
Now in the command prompt, run git status. You will see the files that you have edited listed (minus any file that is ignored in the .gitignore file).
So once you have those files edited, then in the command prompt you will do three different commands in a row:
git add .
git commit -m "making init commit"
git push
The first command, git add ., adds all of the files you edited (again minus any file that is ignored in the .gitignore file). Note you can add a specific file one at a time if you want, e.g. git add README.md, but using the period adds all of the files you edited at once.
git commit records the added changes along with a message, where you should write a short note on the changes. Technically at this point you could go and make more changes, but here I am going to git push, which will send the updates to the online hosted github repo. (Note if doing this for the first time from the command prompt, you may need to give your username and maybe set up a github token or do two-factor authentication.)
You don’t technically need to do these three steps at once, but in my workflows I pretty much always do. Now you can go check out the online github repo and see the updated changes.
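If you want to try the add-commit-push cycle without touching a real github repo, here is a rough sketch that uses a local bare repository as a stand-in for github. That stand-in, the demo identity, and the folder names are my assumptions for illustration, and git init -b needs git 2.28 or newer.

```shell
# A local bare repo plays the role of github (assumption for this demo)
work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"

git init -q -b main "$work/example_repo"
cd "$work/example_repo"
git remote add origin "$work/remote.git"

# Demo-only identity, set for this repo alone
git config user.name "Demo Analyst"
git config user.email "demo@example.com"

echo "# example_repo" > README.md
printf '*.csv\n' > .gitignore

git status --short            # what changed
git add .
git diff --staged --stat      # what is about to be committed
git commit -q -m "making init commit"

# -u records origin/main as the upstream branch,
# so later pushes can be a plain "git push"
git push -q -u origin main

# Confirm the commit made it to the remote
git log --oneline origin/main
```

When you clone from github as described above, the upstream is already set up for you, which is why a plain git push works there.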
Branches
When you are working on things yourself for small projects, just those above commands and committing directly to the default main branch is fine. Branches allow for more complicated scenarios like:
- you want the main code to not change, but you want to experiment and try out some changes
- you have multiple people working on the code at the same time
Branches provide isolation: they allow the code in the branch to change, whereas code in main (or other branches) does not change. Here I am going to show how to make a branch in the command prompt, but first, a good habit when working with multiple people is to run these at the start of your day:
git fetch
git pull origin main
git fetch downloads what has changed on the remote, such as a branch another collaborator added, but does not update the files in your working folder. And git pull origin main pulls the most recent version of main and merges it into the branch you currently have checked out. So if a colleague updated main, when you do git pull origin main it will update the code on your local computer. (If you want to pull the most recent version of a different branch, it will be git pull origin branch_name.)
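To see the difference between fetch and pull, here is a small sketch with two local copies of the same repo. The "mine" and "colleague" folders, the notes.txt file, and the local bare repo standing in for github are all made up for the demo.

```shell
work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"   # stand-in for github

# My copy of the repo
git init -q -b main "$work/mine"
cd "$work/mine"
git remote add origin "$work/remote.git"
git config user.name "Me"
git config user.email "me@example.com"
echo "v1" > notes.txt
git add . && git commit -q -m "first version" && git push -q -u origin main

# A colleague clones, edits, and pushes to main
git clone -q "$work/remote.git" "$work/colleague"
cd "$work/colleague"
git config user.name "Colleague"
git config user.email "colleague@example.com"
echo "v2" > notes.txt
git commit -q -am "colleague update" && git push -q origin main

# Back in my copy: fetch downloads the new commit,
# but the files in my folder are unchanged
cd "$work/mine"
git fetch -q
cat notes.txt
git log --oneline main..origin/main   # commits I do not have yet

# pull brings those commits into my local main and updates the files
git pull -q origin main
cat notes.txt
```

The first cat still shows the old contents, and the second shows the colleague's edit.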
To create a new branch, you can do:
git checkout -b new_branch
Note if the branch already exists you can just omit the -b flag, and this just switches to that branch. Make a change, and then when pushing, use git push origin new_branch, which specifies you are pushing to the branch you just created (instead of pushing to the default main branch).
# after editing readme to make a change
git add .
git commit -m "trivial edit"
git push origin new_branch
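You can also check the isolation from the command line instead of the browser. This is a sketch under the same assumptions as before: a throwaway folder, a local bare repo standing in for github, and a demo identity.

```shell
work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"   # stand-in for github
git init -q -b main "$work/example_repo"
cd "$work/example_repo"
git remote add origin "$work/remote.git"
git config user.name "Demo Analyst"
git config user.email "demo@example.com"
echo "# example_repo" > README.md
git add . && git commit -q -m "making init commit" && git push -q -u origin main

# Create the branch, make a change, and push it
git checkout -q -b new_branch
echo "A trivial edit." >> README.md
git add .
git commit -q -m "trivial edit"
git push -q origin new_branch

git branch -a                     # local and remote branches, * marks the current one
git diff --stat main new_branch   # which files differ between the two branches

# Switch back: main still has the original README
git checkout -q main
cat README.md
```

Newer versions of git also offer git switch -c new_branch as an alternative to git checkout -b new_branch; either works for this.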
Now back in the browser, you can go and check out the updated code by switching the branch you are looking at in the dropdown on the left hand part of the screen that says “new_branch” with the tiny branching diagram:
As a final step, you want to merge the branch back into the main branch. If you see the big green Compare and Pull Request button in the above screenshot, click that, and it will bring up a dialog about creating a pull request. Then click the green Create Pull Request button:
Then after you have created the request, it will provide another dialog to merge the code into the target branch (by default main).
If everything is ok (you have correct permissions and no merge conflicts), you can click the buttons to merge the branches and that is that.
Merge Conflicts
The rub with the above is that sometimes merge conflicts happen, and as Jerry mentions, these can be a total pain to sort out. It is important to understand why merge conflicts happen first though, and to take steps to prevent them. In my experience merge conflicts most often happen for two reasons:
- Multiple people are working on the same branch, and I forget to run git pull origin branch_name at the start of my day, so I did not incorporate the most recent changes. (Note these can happen via automated changes as well, such as github actions running scripts.)
- Someone updated main, and I did not update my version of main. This tends to occur with more long term development. Typically this means at the start of my day, I should have run git checkout main, then git pull origin main.
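One way to catch both situations before they bite is to check whether your local branch is behind the remote before you start editing. A rough sketch follows; the folders, the notes.txt file, and the local bare repo standing in for github are assumptions for the demo.

```shell
work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"   # stand-in for github
git init -q -b main "$work/mine"
cd "$work/mine"
git remote add origin "$work/remote.git"
git config user.name "Me"
git config user.email "me@example.com"
echo "v1" > notes.txt
git add . && git commit -q -m "first version" && git push -q -u origin main

# Simulate a colleague (or an automated job) pushing to main in the meantime
git clone -q "$work/remote.git" "$work/other"
git -C "$work/other" config user.name "Other"
git -C "$work/other" config user.email "other@example.com"
echo "v2" > "$work/other/notes.txt"
git -C "$work/other" commit -q -am "update from someone else"
git -C "$work/other" push -q origin main

# Before starting work: refresh what I know about the remote, then compare
git fetch -q origin
git status -sb | head -n 1        # first line reports ahead/behind counts
behind="$(git rev-list --count main..origin/main)"
if [ "$behind" -gt 0 ]; then
  echo "main is $behind commit(s) behind origin/main; run: git pull origin main"
fi
```

If the count is zero you are up to date; anything above zero means someone pushed since your last pull.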
I tend to find managing merge conflicts is very difficult using the built in git and github tools (so I don’t typically use git rebase for example). More commonly, when I have a merge conflict for a single file, first I will save a copy of the file that is giving me problems outside of the repo folder (so I don’t accidentally delete/overwrite it). Then, if my new_branch is conflicting with main, I will do:
# this pulls the exact file from your local copy of main
# (so update main first), overwriting the version in new_branch
git checkout main -- conflict_file.txt
git add conflict_file.txt
git commit -m "pulling file to fix conflict"
git push origin new_branch
Then if I want to make edits to conflict_file.txt (for example, to add back changes from the copy I saved), I make the edits now, then redo add-commit-push.
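To see why this works, here is a sketch that builds a conflict on purpose in a throwaway repo and then clears it by taking the file from main. There is no remote in this demo, so the push step is left out, and the file contents are made up.

```shell
work="$(mktemp -d)"
git init -q -b main "$work/repo"
cd "$work/repo"
git config user.name "Demo Analyst"
git config user.email "demo@example.com"
printf 'line one\n' > conflict_file.txt
git add . && git commit -q -m "base"

# The branch and main both edit the same line
git checkout -q -b new_branch
printf 'branch edit\n' > conflict_file.txt
git commit -q -am "edit on branch"
git checkout -q main
printf 'main edit\n' > conflict_file.txt
git commit -q -am "edit on main"

# Merging main into the branch now conflicts, so back out of the merge
git checkout -q new_branch
git merge main || git merge --abort

# Save my version outside the repo, then take the file from main
cp conflict_file.txt "$work/conflict_file_mine.txt"
git checkout main -- conflict_file.txt
git add conflict_file.txt
git commit -q -m "pulling file to fix conflict"

# Both sides now agree on the file, so the merge goes through
git merge -q --no-edit main
git status --short
```

The saved copy in conflict_file_mine.txt still has the branch edit, so those changes can be added back by hand afterwards.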
This workflow tends to be easier in my experience than dealing with rebase or trying to edit the merge conflicts directly.
It is mostly important though to realize what caused the merge conflict to begin with, to prevent the pain of dealing with it again in the future. In my experience these are mostly avoidable, and mean you made a personal mistake in not pulling the most recent version, or more rarely that collaboration with a colleague wasn’t coordinated correctly (you were both editing the same file at the same time).
I realize this is not easy; it takes a bit of work to understand github and incorporate it into your workflow. I think it is a worthwhile tool for analysts and data scientists to learn though.