Module 5: What is Git and why is it useful for reproducible research?

First, a digression (or an introduction, however you want to look at it). You may be thinking that your research is already reproducible, and you may not see any benefit in reading this module and employing the suggested techniques. We can tell you from personal experience that unless you make a conscious effort to ensure that your research can be reproduced without instruction from you (say, in the unlikely event that you got hit by a bus and were never heard from again), it IS NOT reproducible (unpublished anecdotal results from years of research experience by the authors). Research is fully reproducible when you provide sufficient instructions, along with data and code, to allow another researcher to reproduce your exact results to all decimal places. We repeat: to all decimal places (close enough doesn’t cut it for the reproducible research club). There are many benefits to you besides the altruistic benefit of making science better. These include having a research record that you can go back to (without reinventing the wheel) when you forget what you did 6 months ago, the surprising sense of liberation that comes with transparency (let the world discover my errors!), and no need to explain things to co-workers, grad students, etc. (just send them the link with all the details!).

Now that the digression is over, let’s begin to answer the question for this module.

Git is a version control system for software and documents. It has been most commonly used in computer science for version control of code and to allow people working on the same project to code collaboratively. It has recently become recognized as an essential tool for helping ensure reproducible research. Many people track file versions in a project directory or folder by adding numbers (e.g. docv1, docv2, docv3) or dates (doc02012018, doc02052018, doc02142018) to file names. Sometimes, when we think something is done, it’s actually not, and we end up with file names like final, final2, final3, and finalfinal. Do you have a directory that looks like the one below, containing multiple versions of a file?


An actual messy directory:


The proliferation of dozens of file versions over the course of a project can affect research reproducibility when project staff become uncertain about which file version is correct. This can happen, for example, when numbers in a file are changed after an error is discovered and corrected. Without a robust version control system, an incorrect version is then more likely to be used to generate the results reported in a manuscript. Git tracks file versions behind the scenes and can include messages about the changes in each file version. Only the current file version is visible in your directory. Imagine how much work-related stress might be removed if your file directories for manuscripts looked instead like the one below.



How does Git work in a nutshell?

Git works by taking what can be thought of as ‘snapshots’ of the files in a directory that a user has put under version control. In the image below, files A, B, and C are being tracked by Git. In version 2 (i.e. at snapshot time 2), files A and C have changes but B does not. In version 3 (i.e. at snapshot time 3), only file C has changed.
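
The snapshot idea can be tried out in a throwaway directory. The following is a minimal sketch, not part of the original module: the file names A.txt, B.txt, and C.txt, the commit messages, and the placeholder user name and email are all assumptions made for illustration. The add and commit commands it uses are explained step by step in the Git Bash section later in this module.

```shell
set -e
demo=$(mktemp -d)                          # throwaway directory for the demonstration
cd "$demo"
git init -q
git config user.name "Example User"        # placeholder identity, stored in this repository only
git config user.email "user@example.com"

# Version 1: files A, B, and C are all new
echo "a1" > A.txt; echo "b1" > B.txt; echo "c1" > C.txt
git add A.txt B.txt C.txt
git commit -q -m "Version 1: add A, B, and C"

# Version 2: A and C change, B does not
echo "a2" > A.txt; echo "c2" > C.txt
git add A.txt C.txt
git commit -q -m "Version 2: update A and C"

# Version 3: only C changes
echo "c3" > C.txt
git add C.txt
git commit -q -m "Version 3: update C"

# List each snapshot, oldest first, with the files that changed in it
git log --reverse --name-only --pretty=format:"%s"

# Files that differ between version 2 and version 3
changed_v3=$(git diff --name-only HEAD~1 HEAD)
echo "Changed in version 3: $changed_v3"
```

Only C.txt is reported as changed in version 3, matching the description above, while the directory itself still shows just one copy of each file.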

What is GitHub and why is it useful?

GitHub is a free website that uses Git and allows you to store version controlled files (among other features) on an accessible cloud-based server platform. Of note, you can use version control for any type of file (code, data, Word, PowerPoint, PDFs, etc.). In this module, we will demonstrate how GitHub can be used to host your code, associated data sets, and file versions associated with a project or published manuscript. Making these files publicly accessible improves the quality of scientific research because others can evaluate the accuracy of the work, add to the work, and learn (hopefully) robust analytic strategies that you have created and shared.

Working with Git

We will give instructions for two commonly used Git software tools in this module: GitHub Desktop and Git Bash. GitHub Desktop is a point-and-click graphical user interface (GUI) application. Git Bash is a popular command line interface (CLI) that uses Unix commands. Personally, we prefer Git Bash, not only because it is marginally faster once you learn to use it but also because the command line has a certain cool factor: many perceive it as more hardcore than GUI applications (Whoa, you know command line? You are so cool!). However, our testers for this module told us to put the GitHub Desktop instructions first, so we did. If you are feeling a bit like a renegade, skip the GitHub Desktop instructions and proceed to the Git Bash instructions below. Both tools are free and allow you to do local version control as well as remote version control on GitHub. Of note, you can also drag and drop files into a GitHub website repository that you create (see instructions below). Although this is super easy, it is super softcore, and it doesn’t work when you have no internet or are just using Git for local version control.

Step-by-step instructions for GitHub Desktop

1. Create a GitHub account

Go to: https://github.com to create an account. For more detailed instructions on using GitHub than are provided below, see: https://guides.github.com/activities/hello-world/.

2. Download and install GitHub Desktop

For 64-bit Windows: https://central.github.com/deployments/desktop/desktop/latest/win32

For Mac OS: https://central.github.com/deployments/desktop/desktop/latest/darwin

Or go to https://desktop.github.com/ and find the version for your machine.

3. Open GitHub Desktop, sign in and initialize a repository

When you first open GitHub Desktop, you will be prompted to enter your user name and password from github.com. Enter these, and click File –> New repository.

Enter a name for your new repository. I called mine Project1, and in the description box, I typed This is my first GitHub repo. I then changed the local path from the default to a more convenient directory (I used the desktop). Finally, I clicked the box for “Initialize this repository with a readme” and then “Create repository”. The readme file is optional and it is a text file created with the repository name in it and nothing else. If you choose to create it here, you can open it on your machine and put more information about the repository in it for you and your team (e.g., a list of all the files you’ve added with more details about each, a brief project overview).

4. Save project file(s) in the repository folder

If you already have a project you’re working on and would like to store/share on GitHub, copy and paste, drag and drop, or otherwise move those files to the new project folder/repository you’ve created. For this tutorial, I created a text file with “hello world” in it, because I have no imagination, and saved it as helloWorld.txt in the Project1 folder. Feel free to do so if you don’t have other files handy or ready for prime time.

5. Commit changes to the repository

Once you’ve moved your files to your version of the Project1 folder, they should appear in GitHub Desktop under the “Changes” tab. In the lower left corner there are two boxes, one for a summary and one for a description. Type something in the summary box (I used “This is a test”) and if you like, type a longer description in the description box (I used “My first GitHub commit!”). The summary box is required, and the description box is optional. You can always add more to these or change them later.

Click “Commit to master”. The large GitHub Desktop panel will change to read “No local changes…”

6. Send your new repository to your GitHub account

When you’re ready to send your new repository to GitHub, click on “Publish repository” toward the top right of GitHub Desktop. A popup will show you the details of the repository. Because the goal here is to share your work, uncheck the “Keep this code private” box so that the repository is public. Review the other information and then click “Publish repository”.

Switch to your browser and go to github.com to verify that your repository is there. Each time you’re working on your project, open GitHub Desktop and 1) commit changes, then 2) “Push” the repo to your online GitHub account to save changes in the cloud. You can use the Repository menu –> Push, or use the keyboard shortcut Ctrl+P to push changes once committed.

7. Pull your repository down to a local machine after changes

If you’ve pushed your repository to GitHub and your colleague then makes changes or updates, you’ll need to pull the repository back down before you make and commit new changes, so that you are working from the current version. To do this, first open the repository from your local copy in GitHub Desktop. Then from the Repository menu, click “Pull”. The current version will be downloaded to your local machine. Then you can make changes, commit, and push your new version to GitHub. Don’t forget to type a brief description of the changes you’ve made to this version. Your colleagues (and your future self) will thank you.

Clone a repository from GitHub to your local machine with GitHub Desktop

If another team member created a repository that you’re using, or if you find a repository you’d like to have, you can save a local copy to work on yourself and then commit changes and push them to GitHub online.

  • From GitHub Desktop, simply click on File –> Clone Repository;
  • Choose one of your existing repositories or fill in the URL to the repository you’d like to save locally;
  • Ensure that the local path points where you would like and click “Clone”;
  • Now you can edit the repo like normal and commit, push, and share changes.

Step-by-step instructions for using Git Bash

Git Bash may be a little intimidating for some researchers without any computer science experience so we provide the following warning. It uses a command line interface, which can be fun to use because it is a very efficient way to use version control for your files (it may carve minutes maybe even hours off of your time spent putting stuff on GitHub over your lifetime). However, if this does not sound like an adventure that you would like to have (learning a command line program), you can stop here. GitHub Desktop is a perfectly fine application that will provide you with the same results as Git Bash.

1. Create a GitHub account.

Go to: https://github.com to create an account. For additional instructions on using GitHub, see https://guides.github.com/activities/hello-world/.

2. Download and install Git Bash

For Mac users, download and install Git Bash from: https://git-scm.com/download/mac. Helpful tip: If you get an error message that your security preferences do not allow Git Bash to be opened, go into System Preferences -> Security & Privacy -> General and select “Open Anyway”. If you can’t find git after downloading, go to Finder -> Go (drop-down menu) -> Go to Folder and type the folder path, which will probably be “/usr/local/bin”. You can drag the folder labeled “git” to another, easier to find location of your choice. If you are not sure of the folder path, go to the git installer and open the readme.txt, which should tell you the path.

For Windows users, download and install Git Bash from: https://git-scm.com/download/win.

3. Open Git Bash and set up your username and email using the same email that you used for your GitHub account

To set up your username, type the following at the command prompt (i.e. $):

$ git config --global user.name "Your user name"

To set up your user email, type the following at the command prompt (i.e. $):

$ git config --global user.email [enter your email address here without brackets]

4. Create a directory for version control and change your directory in Git Bash to this directory

The first thing you need to do to start using git is create a project directory (a folder) on your computer (or skip this if you already have one in mind). Change the directory to the one that you will put under version control by typing cd at the Unix command prompt ($) followed by the path in quotation marks, for example (the path shown is only a placeholder for your own):

$ cd "path/to/your/project"

Note: On Mac, you will be doing all of this in the terminal. You can find the terminal by searching “Terminal” in the search bar.

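Once this directory is a git repository (after step 5), Git Bash shows ‘(master)’ at the end of the prompt. This is the name of the default branch, and the rest of this module refers to the directory as the ‘master’ directory. Depending on how Git is configured, the default branch may instead be called ‘main’. For more help with setting up git see: https://docs.gitlab.com/ce/gitlab-basics/start-using-git.html.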

5. Initialize a git repository in the directory you created in step 4

At the command prompt ($), type git init to initialize a git repository in the directory you created in step 4. The folder labelled .git that is AUTOMATICALLY created with the git init command in the local master directory provides all the necessary metadata for using git (the magic). You can ignore this folder but know that it is essential for using git.

$ git init
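
To see what git init does, the sketch below runs it in a throwaway directory and checks that the hidden .git folder appears. The temporary directory stands in for your own project directory and is an assumption made for illustration.

```shell
set -e
project=$(mktemp -d)       # stands in for the project directory from step 4
cd "$project"
git init -q                # -q only suppresses the confirmation message
ls -a                      # the listing now includes the hidden .git folder
inside=$(git rev-parse --is-inside-work-tree)
echo "Under version control: $inside"
```

Because .git is a hidden folder, a plain ls (or your file browser with default settings) will not show it; ls -a does.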

6. Put a file under version control.

If you do not have an existing file you want to put under version control for the purposes of this tutorial, you can make a simple text file using Notepad or some other text editor. I made a file containing the text “hello world” and saved it as helloworld.txt in my master directory. Once I have the file, I can add it for version tracking by git using the git add command.

$ git add helloworld.txt

Instead of the file name alone, you can also give the full path to the file in quotation marks.

The helloworld.txt file has now been added to what is colloquially referred to as the “staging” environment and is ready to be committed as a snapshot (i.e. a version).
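
A quick way to confirm what git add did is to compare git status before and after. This is a minimal sketch in a throwaway directory (an assumption for illustration), using the helloworld.txt file name from this step.

```shell
set -e
project=$(mktemp -d)       # throwaway stand-in for your project directory
cd "$project"
git init -q
echo "hello world" > helloworld.txt

before=$(git status --short)     # "??" marks a file that git is not tracking yet
echo "Before git add: $before"

git add helloworld.txt

after=$(git status --short)      # "A" marks a new file that is staged for the next commit
echo "After git add:  $after"
```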

7. Record changes in your file to your local repository

You can include a message with git commit to provide information about the changes made to the file or if it is your first file you can include any message you want associated with the file. This is helpful so that you (or your collaborators) know what changes occurred in each version.

$ git commit -m 'This is my first commit!'

This is all you have to do if you only want to control file versions locally. If you want to add version controlled files to a remote repository (such as one on GitHub), proceed to step 8.
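
Steps 4 to 7 together form the local version control cycle. The sketch below runs the cycle twice in a throwaway directory so that the history contains two versions. The second edit, its commit message, and the placeholder user name and email are assumptions made for illustration.

```shell
set -e
project=$(mktemp -d)                       # throwaway stand-in for your project directory
cd "$project"
git init -q
git config user.name "Example User"        # placeholder identity, stored in this repository only
git config user.email "user@example.com"

# First version
echo "hello world" > helloworld.txt
git add helloworld.txt
git commit -q -m 'This is my first commit!'

# Edit the file, then add and commit again to record a second version
echo "hello again" >> helloworld.txt
git add helloworld.txt
git commit -q -m 'Add a second line'

git log --oneline                          # one line per version, newest first
count=$(git rev-list --count HEAD)
echo "Number of versions recorded: $count"
```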

8. Create a project repository on GitHub

Go to your GitHub site and log into your account (if you are not already logged in). To create a new GitHub project repository, click the “+” sign next to your icon (in the upper right hand corner of the screen) and select “New repository”. You can also click on the green “New repository” button on the right hand side of the screen. Give the repository a name associated with your project. I am going to call my GitHub repository Git_demo_files. Click the green ‘Create repository’ button.
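
Before anything can be pushed, the local repository has to be told where the remote repository lives, which is what later sections of this module refer to as adding an origin. The sketch below shows the standard git remote add command. As an assumption for illustration, a local ‘bare’ repository stands in for GitHub so that the example runs without an account or internet access. With GitHub, you would instead use the URL shown on your repository page.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"      # local stand-in for the GitHub repository
mkdir "$work/project"
cd "$work/project"
git init -q

# With GitHub, the address is copied from the repository page and looks like
#   https://github.com/<your-username>/Git_demo_files.git
git remote add origin "$work/remote.git"

git remote -v                              # lists the remote named "origin" and its address
origin_url=$(git remote get-url origin)
```

The name origin is only a convention, but it is the name that the push and pull commands in this module expect.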

10. Add committed version controlled files in your local repository (master) to your GitHub project repository

To do this, you use the git push command shown below. Files must be added and committed before they can be pushed to the remote repository.

$ git push -u origin master

You should see the helloworld.txt file now in your GitHub repository.

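Note: You might be prompted for your username and password. Type each and press Enter. Your password will not appear on screen as you type, so trust that you typed it correctly; if you typed it incorrectly, restart this step. GitHub no longer accepts your account password for command line access, so use a personal access token (created in your GitHub account settings) in its place.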
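
The whole sequence from a new directory to a pushed file can be rehearsed without touching GitHub. In the sketch below, a local ‘bare’ repository again stands in for the GitHub repository; that stand-in, the placeholder user name and email, and the git branch -M line (which makes sure the branch is named master, as this module assumes) are additions made for illustration.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"      # local stand-in for the GitHub repository
mkdir "$work/project"
cd "$work/project"
git init -q
git config user.name "Example User"        # placeholder identity, stored in this repository only
git config user.email "user@example.com"

echo "hello world" > helloworld.txt
git add helloworld.txt
git commit -q -m 'This is my first commit!'
git branch -M master                       # name the branch master, whatever the local default is

git remote add origin "$work/remote.git"
git push -q -u origin master               # -u remembers origin/master for later pushes and pulls

# Ask the remote repository which files it now holds
remote_files=$(git --git-dir="$work/remote.git" ls-tree --name-only master)
echo "Files on the remote: $remote_files"
```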

Clone a repository from GitHub to your local machine with Git Bash.

You can also start from an existing GitHub repository instead of creating a directory as in step 4. Go to the GitHub repository you want to copy (you may need to create it first as described in step 8), click the green Clone or download button, and then click Download ZIP. Note that the ZIP file is a plain copy of the files and does not include the repository’s .git folder. Typing git clone followed by the repository URL at the command prompt instead gives you a copy that is already under version control and linked to the GitHub repository.

The zipped folder contains a copy of the repository’s files (if the repository isn’t empty), which you can extract to a location of your choice on your computer.

Then follow steps 4-7 (for step 4 start with cd) and step 10 to push versions of files to the GitHub repository. You can omit step 8; step 9 can also be omitted when the origin is already set, as it is in a directory created with git clone. You can also pull versions from the GitHub repository (in the case that you are working with a collaborator) so that you have the most recent version, using the command git pull origin master. You can then repeat the cycle (work on the new file version, add and commit it as a new version, and then push it back to the GitHub repository).
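
The collaboration cycle described above can be sketched with two local copies of one repository. This is an illustration under stated assumptions rather than a recipe for GitHub itself: a local ‘bare’ repository stands in for GitHub, the directory names mine and colleague are hypothetical, and the user names and emails are placeholders.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"      # local stand-in for the GitHub repository

# Your copy: commit a file and push it
mkdir "$work/mine" && cd "$work/mine" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "hello world" > helloworld.txt
git add helloworld.txt && git commit -q -m "First version"
git branch -M master
git remote add origin "$work/remote.git"
git push -q -u origin master

# A collaborator clones the repository; the clone already has .git and origin set up
git clone -q -b master "$work/remote.git" "$work/colleague"
cd "$work/colleague"
git config user.name "Example Colleague"
git config user.email "colleague@example.com"
echo "edited by a colleague" >> helloworld.txt
git add helloworld.txt && git commit -q -m "Colleague's edit"
git push -q origin master

# Back in your copy: pull to receive the collaborator's version
cd "$work/mine"
git pull -q origin master
last_line=$(tail -n 1 helloworld.txt)
echo "Last line after pull: $last_line"
```

With GitHub, the only difference is the address: git clone is followed by the repository URL instead of a local path.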

Final words on Git Bash

It gets more complicated but this should be enough to get you started with using git for version control to enhance the reproducibility of your research. To learn more about working with the git version control system (including branching and merging that are important concepts for working collaboratively), see this excellent resource: https://git-scm.com/book/en/v2

Other useful commands (from various sources):

To change GitHub repositories:

$ git remote rm origin

Following this command, add a new origin (git remote add origin followed by the URL of the new repository), as described in step 9.

To remove all files from the staging area:

$ git reset HEAD -- .

To remove a single file from staging area:

$ git reset HEAD -- "path/to/file"
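
The two reset commands above can be compared in a throwaway repository. In this minimal sketch, the file names a.txt and b.txt and the placeholder identity are assumptions for illustration. Note that both commands need at least one existing commit, because HEAD refers to the most recent commit.

```shell
set -e
project=$(mktemp -d)
cd "$project"
git init -q
git config user.name "Example User"        # placeholder identity, stored in this repository only
git config user.email "user@example.com"

echo "one" > a.txt; echo "two" > b.txt
git add a.txt b.txt
git commit -q -m "Add a and b"

# Edit both files and stage both edits
echo "one, edited" > a.txt; echo "two, edited" > b.txt
git add a.txt b.txt

# Unstage a single file; its edit stays in the working directory
git reset -q HEAD -- "a.txt"
staged=$(git diff --cached --name-only)
echo "Still staged: $staged"

# Unstage everything that is left
git reset -q HEAD -- .
staged_after=$(git diff --cached --name-only)
echo "Staged after full reset: ${staged_after:-nothing}"
```

Unstaging never alters the files themselves: a.txt and b.txt keep their edits and can be staged again with git add.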

To see the changes that are to be committed (those in your staging area), line by line:

$ git diff --cached

To see which files are in the staging area (along with modified and untracked files):

$ git status

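To revert to an earlier version of your files:

$ git reset --hard 'commit hash'

Be careful: this command moves the whole repository back to the chosen commit, returns every tracked file to that version, and discards any uncommitted changes. To restore a single file from an earlier commit without touching the rest of the directory, you can instead use:

$ git checkout 'commit hash' -- "path/to/file"

Note that for ‘commit hash’, you need to insert the long or abbreviated alphanumeric hash number for the commit. You can get the abbreviated hash numbers for each commit with the command below. (For output options, see: https://git-scm.com/book/en/v2/Git-Basics-Viewing-the-Commit-History):

$ git log --pretty=format:"%h %s" --graph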
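
As a worked example of looking up a hash and using it, the sketch below records two versions of a file, lists them with the log command above, and then brings back the first version of that one file. The file contents, the commit messages, the placeholder identity, and the use of git checkout with a file path are assumptions made for illustration.

```shell
set -e
project=$(mktemp -d)
cd "$project"
git init -q
git config user.name "Example User"        # placeholder identity, stored in this repository only
git config user.email "user@example.com"

echo "version 1" > helloworld.txt
git add helloworld.txt
git commit -q -m "First version"
first=$(git rev-parse --short HEAD)        # abbreviated hash of the first commit

echo "version 2" > helloworld.txt
git add helloworld.txt
git commit -q -m "Second version"

# One line per commit: abbreviated hash, then the commit message
git log --pretty=format:"%h %s" --graph

# Restore helloworld.txt as it was in the first commit; the history is left intact
git checkout "$first" -- helloworld.txt
restored=$(cat helloworld.txt)
echo "Restored contents: $restored"
```

The restored file is staged as a change, so committing it records the rollback as a new version instead of erasing the second one.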



