Version control for science

Have you ever:

You aren't the first person to have these problems. In fact, there is a large set of tools built to handle them. Welcome to version control.

This is a 10 minute introduction to git, one specific version control system. The tutorial has a very specific goal: to teach you the general concepts of version control, and enough git to use it on your own personal projects. It doesn't go into the full power of git or version control systems (that's the next tutorial).

Keep in mind: I can't teach you git, but I can give you ideas and your curiosity can teach you git.

Goals

After completing this tutorial, you should be able to:

Version control for science

What is version control?

Let's look at science without using VCS.

Science without version control

How often have you changed something, drastically changed your results, and then had to spend hours figuring out what you just did?

With the copying system, at least you have some backups. But you have to copy it yourself, and you end up with code.v2.py, code.v3.py, code.final.py, code.submitted.py, code.submitted.final.py, and so on. And then, once you have all these files, you have to keep them organized, and getting any information out of them is a lot of work, too. You probably won't make backups often enough, either.

People often send files back and forth for papers (it could be used for code, but in that case it's probably easier to just work by yourself). You end up with the filename game again, and only one person can edit at once.

What should version control be used for?

Pros use version control for everything: code, papers (LaTeX), websites, notes, etc. All my papers are in version control, and I can even make PDFs showing what changed between revisions. My website is in git: I record changes and "push" to the server to automatically update it. People have written git add-ons for distributed storage of large files (git-annex). These tutorials are stored in a repository.

Usage of git

Terminology

A repository is one directory that is being tracked. The history consists of a chain of commits or revisions identified by hashes like 526b2f9a. When you want to make a record, you commit or check in your code. The differences between commits are diffs.

Installation of git

This tutorial doesn't talk about how to install git! However, this is a very well documented thing, so you should have no problem doing it yourself. If you have a shared computer, it probably already has git installed. You can download it for almost any operating system here:

git is not just one program: there are also graphical user interface (GUI) git clients, which can provide a nicer interface for certain tasks. In this tutorial, I focus on the concepts of git and the command line. At the end I will demonstrate some other programs.

Making a new repository

The specific git repository format is simple in concept but complicated in its details, and each VCS works differently. We don't need to worry about it now.

Once you run git init, you won't notice any changes. The only thing that will happen is the creation of a .git directory.

No versions are saved, and your files are not touched, unless you run a git command yourself. This makes git relatively safe: nothing happens in the background without you knowing. If you delete the .git directory, it's as if the repository was never made.

Notice how easy this is. You should be doing it for every project.
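
As a minimal sketch of the points above, the following can be run in a scratch directory. The directory (from mktemp) and the file name code1.py are only placeholders for illustration. It shows that git init adds a .git directory, leaves your file alone, and that deleting .git removes the repository and nothing else.

```shell
# Work in a scratch directory so no real project is touched.
cd "$(mktemp -d)"
echo "print('hello')" > code1.py
before=$(cat code1.py)

git init --quiet        # creates the .git directory and nothing else
listing=$(ls -A)        # -A also lists dot-directories such as .git
echo "$listing"

rm -rf .git             # deleting .git deletes the repository only
after=$(cat code1.py)   # code1.py itself is unchanged
```

Only delete .git like this in a throwaway directory: in a real project it would discard all recorded history.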

Adding initial files

You have to use git add here, but git add has another use that I am not going to discuss in this tutorial. This is known as "staging" things to the "index". It can be useful, but for now it's an unnecessary complication that you'll learn about when reading other things.

You will usually run git status to check if you forgot anything (next section).

Check status

git status shows what the current state is. You will see sections for "files staged for commit", "modified files", and "untracked files". "Untracked" is files you have not git added yet. "Modified" is tracked files which you have edited since the last commit. "Staged" is files you have run git add on but not yet committed. If you have staged files, you can use git diff --cached to see the staged diff.
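
The three states can be seen side by side in a small sketch. Everything here is made up for illustration: the scratch directory, the file names, and the "Demo User" identity. It uses git status --short, a compact form of the same information, where "A " marks a staged new file, " M" a modified tracked file, and "??" an untracked file.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

echo "v1" > tracked.txt
git add tracked.txt
git commit --quiet -m "Initial commit"

echo "v2" >> tracked.txt      # modified: tracked, edited since the last commit
echo "new" > staged.txt
git add staged.txt            # staged: added, but not yet committed
echo "tmp" > untracked.txt    # untracked: never passed to git add

state=$(git status --short)   # one line per file, with a two-character state code
echo "$state"
git diff --cached             # the diff of what is staged (here: staged.txt)
```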

Make your first commit

$ git commit
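
Plain git commit opens an editor for the message. As a sketch of a first commit that can run without an editor, the example below passes the message with -m instead. The scratch directory, the file name, and the identity are assumptions for illustration only.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

echo "import math" > code1.py
git add code1.py
git commit --quiet -m "Initial commit"    # -m gives the message without opening an editor

count=$(git rev-list --count HEAD)        # number of commits reachable from HEAD
git log --oneline                         # one line: abbreviated hash and message
```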

Regular work flow: edits and status

This is what you do on normal working days:

Why should you look at diffs? First, and most importantly, it lets you check yourself. You can see all changes you have made since your last checkpoint (commit), to see if it makes sense when put together. This may be a bit of extra work, but it is very important for good development practices.
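
A small sketch of checking yourself with a diff, using a hypothetical params.py in a scratch repository: after one edit, git diff shows exactly which line changed since the last commit.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

printf 'alpha = 1\nbeta = 2\n' > params.py
git add params.py
git commit --quiet -m "Add parameters"

printf 'alpha = 1\nbeta = 3\n' > params.py   # an edit made after the commit

# git diff shows changes that are not yet staged; with nothing staged,
# that is everything changed since the last commit.
changes=$(git diff)
echo "$changes"
```

In the output, the line starting with - is the old version and the line starting with + is the new one. Unchanged lines such as alpha = 1 appear only as context.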

Regular work flow: committing

This is the last step. Before doing this, check status and diffs. After doing this, check status and make sure everything is clean.

We'll talk about how to structure and group changes into commits later.
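
The check, commit, check-again loop described above might look like the following sketch. The file name, messages, and identity are illustrative assumptions. git status --porcelain is used at the end because it prints nothing at all when the working tree is clean.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

echo "step 1" > notes.txt
git add notes.txt
git commit --quiet -m "Start notes"

echo "step 2" >> notes.txt
git status --short                              # before: notes.txt shows as modified
git diff                                        # review exactly what will be recorded
git commit --quiet -m "Add step 2" notes.txt    # commit the named file
leftover=$(git status --porcelain)              # after: empty output means clean
```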

Viewing history

To view history in git, run:

$ git log
$ git log --oneline              # abbreviated format
$ git log --patch                # also show patches
$ git log --stat                 # also show stats
$ git log --oneline --graph --decorate --all  # for later use

Getting information

Exercises

Exercise Git-1.1: Connect to Triton

  1. Everything today will be done via ssh on triton

  2. To connect to triton, run:

    $ ssh USERNAME@triton.aalto.fi
    $ cd $WRKDIR
    
  3. If it is not your own account, make a subdirectory and change to it

Exercise Git-1.2: Standard configuration options

  1. Git has a configuration file stored in your home directory at ~/.gitconfig. This has options that are shared among all of your repositories. This can make your life easier.

  2. You should at least set your name and email address wherever you work.

  3. On triton, copy and paste the following commands into a shell (don't paste these into the file yourself - git will do that itself). Don't forget to change the name/email to your own.

    $ git config --global user.name "Your Name"
    $ git config --global user.email your.name@domain.fi
    $ git config --global color.ui auto
    
    $ git config --global alias.log1a "log --oneline --graph --decorate --all"
    $ git config --global alias.st "status"
    $ git config --global alias.cm "commit"
    
  4. You can also set your preferred editor if you don't want to use vim

    $ git config --global core.editor "emacs"
    
  5. Bonus: look at the git manual page for the config file and see the types of things that are available:

    $ man git-config
    

Exercise Git-1.3: Making a new repository

  1. In this exercise, we will go to a directory with a simple project, make a new git repository, and go through the steps needed to make a commit. Copy (cp -r) the prototype to your working directory. The base is in /scratch/scip/git/git-1/.

  2. Change to the directory

    $ cd ~/scip/git/git-1/
    
  3. Run git init to create a new repository in a directory.

    $ git init
    Initialized empty Git repository in /home/darstr1/scip/git-1/.git/
    
  4. Everything is stored in the .git directory within your project. Your files are never modified unless you run a git command that is supposed to modify them.

  5. You need to add all the files you are working on. git doesn't make any guesses: you could have temporary files, backups, and so on that you don't want tracked.

    $ git add code1.py mod2.py README.txt
    
  6. Make your initial commit using git commit. This records all files that have previously been added. An editor will come up. Add the commit message of "Initial commit" at the top of the file and save. (Hint: to save in vim, the default editor, use ESC : w q ENTER)

    $ git commit
    
  7. Check if your commit appears in the log

    $ git log
    

Exercise Git-1.4: Making edits and commits

  1. Edit README.txt and add some lines.
  2. Preview your changes before committing. This is good practice to make sure that you know what you are doing. Run git diff to see the differences, and git status to see a summary showing that README.txt is modified.
  3. Use git commit README.txt to record the file.
  4. Repeat the above several times. Make a) an edit to another file and commit, b) edits to two files at the same time and commit both, and c) add and commit a new file. For each change, make the loop of edit, diff, status, commit, log (to verify changes). Commit different ways. Try using commit -a, commit [FILENAME], commit -p, and so on.
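
If you want something to compare your own attempts against, here is one possible sketch of cases a), b), and c) in a scratch repository. The file contents and commit messages are invented, and -m is used only so that no editor opens. commit -p is interactive, so it is left for you to try by hand.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

echo "a" > README.txt
echo "b" > code1.py
git add README.txt code1.py
git commit --quiet -m "Initial commit"

echo "more" >> README.txt
git commit --quiet -m "Extend README" README.txt   # a) commit one named file

echo "x" >> README.txt
echo "y" >> code1.py
git commit --quiet -a -m "Edit both files"         # b) -a commits all modified tracked files

echo "c" > mod2.py
git add mod2.py                                    # c) a new file must be added first
git commit --quiet -m "Add mod2"

git log --oneline --stat                           # verify: four commits and their files
```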

Exercise Git-1.5: Check information from history

  You can make changes, but how do you use them? Eventually, you will wonder "what was I doing a week ago?". git has lots of tools to use to answer these questions. We will explore them now.
  1. Get the OpenMP Examples repository. We will cover the clone command later, but for now just run this command in your working directory

    $ git clone https://github.com/OpenMP/Examples.git
    

    You should now see a new Examples folder. Change into it.

  2. Run git log to see recent changes. You should be able to see the description, author, and date. Try adding on a -p or --stat options to get more details.

  3. Run git log README to see recent changes to only the README file. You can limit to certain files this way, and even track them if they have been renamed.

  4. What if you want to see an old version of a file? You can see it using git show commit_id:filename:

    $ git show 542c10d:README
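
The same queries can be tried without network access in a scratch repository. This is a sketch under assumed names: the README contents and commit messages are invented. It shows per-file history, history followed across a rename with --follow, and an old version of a file via git show commit_id:filename.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

echo "first draft" > README
git add README
git commit --quiet -m "Add README"
old=$(git rev-parse --short HEAD)         # abbreviated hash of the first commit

echo "second draft" > README
git commit --quiet -a -m "Rewrite README"

git mv README README.txt                  # rename; git mv also stages the change
git commit --quiet -m "Rename README"

git log --oneline -- README.txt                        # only the rename commit
followed=$(git log --oneline --follow -- README.txt)   # history across the rename
echo "$followed"
old_text=$(git show "$old:README")        # file contents as of the old commit
echo "$old_text"
```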
    

Exercise Git-1.6: Bonus: Extra history information (annotate, diff)

  1. Often, you want to know more than just the changes. What happens when you want to know who created a particular line, and when? Well, there's a command for that (obviously). git annotate takes a file and, for every line, shows you who committed it, when it was committed, and the commit hash. You can use this to track down exactly when a bug was introduced, for example.

  2. You should still be in the Examples directory from the previous exercise.

  3. Run git annotate Title_Page.tex to see who has last changed each line. Who is the main author of this file? When was it last modified?

  4. The long hexadecimal numbers are the version numbers. Try to figure out what these git diff commands do:

    $ git diff be603ae            # same as git diff be603ae..HEAD
    $ git diff a17ad37..be603ae
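
To experiment with these commands on history you fully understand, a two-commit scratch repository is enough. The sketch below uses a made-up paper.tex and a placeholder identity. Note that git diff with a single commit compares that commit against your working files, which matches commit..HEAD only while the working tree is clean.

```shell
cd "$(mktemp -d)" && git init --quiet
git config user.name "Demo User"          # placeholder identity for this scratch repo
git config user.email "demo@example.org"

printf 'line one\n' > paper.tex
git add paper.tex
git commit --quiet -m "Start paper"
first=$(git rev-parse --short HEAD)

printf 'line one\nline two\n' > paper.tex
git commit --quiet -a -m "Add second line"
second=$(git rev-parse --short HEAD)

git annotate paper.tex                  # per line: commit hash, author, date, line number
git diff "$first"                       # working tree compared with the first commit
range=$(git diff "$first..$second")     # changes between the two named commits
echo "$range"
```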
    

Exercise Git-1.7: Bonus: .gitignore

  1. Make a file called .gitignore and put patterns of things you want to ignore.

    *.o
    *.pyc
    *~
    
  2. This makes the "git status" output more useful and you generally want to keep your ignore file up to date.

I should really emphasize how important the .gitignore file is! It seems minor, but clean "status" output will really make git much more usable. .gitignore can be checked into version control itself.
  3. Extra bonus: Create a .gitignore file in your home directory. To do this, find the configuration option for the global ignore file and set it to some common path, such as ~/.gitignore.
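
To see the effect of a per-project .gitignore, the following sketch creates some typical clutter in a scratch repository and compares what git status reports. The file names are invented for the example. git check-ignore is a standard git command that reports which pattern caused a file to be ignored.

```shell
cd "$(mktemp -d)" && git init --quiet

printf '*.o\n*.pyc\n*~\n' > .gitignore        # the three patterns from above
touch code1.py code1.pyc main.o 'notes.txt~'  # one real file and three kinds of clutter

visible=$(git status --short)   # ignored files do not appear here
echo "$visible"
git check-ignore -v main.o      # shows the .gitignore line that matched main.o
```

Only .gitignore and code1.py should be listed as untracked.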

Sharing with others

Branches and remotes

Due to time constraints and practicality, we will not go into branches and remotes in great detail.

git remotes

Commands for sending/receiving code

Conflicts

Dealing with conflicts: meta-notes

Dealing with conflicts: resolution steps

Exercise Git-2.1: Cloning

  1. In this set of exercises, we will explore git pushing, pulling, and conflict resolution at a very high level. We aren't going to try to cover everything here, but we will see some of the major points. It is better to become familiar with the basics before going too deep into branches, remotes, and conflicts.

  2. Go to http://github.com. Use the search at the top to find a project related to your field.

  3. Go to the project page. Find the "HTTPS Clone URL" on the right side.

  4. Clone the repository

    $ git clone https://github.com/igraph/igraph.git
    
  5. Check out the log. How many total commits are there in this repository? (Hint: git log | grep ^commit | wc)

Exercise Git-2.2: Pulling

  1. Copy the directory /scratch/scip/git/OpenMP-Examples-2/ to your working directory.
  2. View branches and remotes using git remote -v. You can see that it is set with the github.com server. This is a common project hosting site.
  3. View current commits using git log.
  4. Pull using git pull.
  5. Check current commits using git log. What is new?

Exercise Git-2.3: Resolving a conflict

  1. In this exercise, I have set up a simple git repository, all ready to do a pull and make a conflict.

  2. Change to the directory ~/scip/git/git-conflict/.

  3. Run git log, git diff and git status just to make sure that everything is clean and you know what's going on (no untracked changes, no surprises).

  4. Pull changes from the default remote:

    $ git pull
    

    You will see a big note about a conflict:

    Auto-merging code1.py
    CONFLICT (content): Merge conflict in code1.py
    Automatic merge failed; fix conflicts and then commit the result.
    
  5. We will now resolve the conflict. Run git status to see the situation. It should (again) say that code1.py is the file with conflicts:

    # Unmerged paths:
    #   (use "git add/rm <file>..." as appropriate to mark resolution)
    #
    #       both modified:      code1.py
    
  6. Look at git diff. This is an advanced diff with two columns with + signs indicating what comes from each side.

  7. Open code1.py in an editor. You will see conflict marks:

    <<<<<<< HEAD
    from scipy.stats import gamma
    =======
    from scipy.stats import binom
    >>>>>>> 5de531032424ab6afe5576ee817e0ace9e9937d7
    

    Between <<<<<<< and ======= is what you have done (in HEAD). Between ======= and >>>>>>> is what is changed on the server (in commit 5de5310).

  8. You see that one side imported gamma, and the other imported binom. There's no problem with doing both of these, but since they happened on the same line, git doesn't try to guess how to put them together. A more complicated case would be incompatible edits to the same line.

    To resolve this conflict, we need to import both gamma and binom from scipy.stats. Remove the two parts, and the conflict markers, and make one line having all changes together. The top of the file should look like this after you do the resolution:

    ...
    import scipy
    from scipy.stats import binom, gamma
    import scipy.linalg
    import numpy
    
  9. We will check status to make sure things are OK. Run git diff and see the added and changed lines. This form of diff is particularly useful:

    - from scipy.stats import gamma
     -from scipy.stats import binom
    ++from scipy.stats import binom, gamma
    
  10. Run git add code1.py to tell git that we are done resolving this conflict and prepare it for committing. Run git status before and after this to see what changes. (Hint: it should change from Unmerged paths: to Changes to be committed:.)

  11. Run git commit. An editor will open with a pre-filled commit message (it remembers that you were doing a merge). You can adjust this if you want, for example if you need to explain how you reconciled two opposing features. Since there is nothing to add, just save and close.

  12. Run git log and you should see that all changes are recorded, as well as the merge commit.
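
If you want to practice this again away from triton, a conflict like this one can be reproduced locally. The sketch below is not the prepared exercise repository: the directory names upstream and mine, the identities, and the starting import line are all assumptions. It uses git pull --no-rebase to ask for a merge explicitly, since newer git versions may refuse to pull divergent branches until told how to reconcile them, and git commit --no-edit to accept the pre-filled merge message without opening an editor.

```shell
work=$(mktemp -d) && cd "$work"

# The "server" side: a repository with one commit.
git init --quiet upstream
cd upstream
git config user.name "Server Side"        # placeholder identity
git config user.email "server@example.org"
echo "from scipy.stats import norm" > code1.py
git add code1.py
git commit --quiet -m "Initial commit"

# Your side: a clone of it.
cd "$work"
git clone --quiet upstream mine
cd mine
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.org"

# Both sides now change the same line in different ways.
echo "from scipy.stats import gamma" > code1.py
git commit --quiet -a -m "Use gamma"
( cd "$work/upstream" \
    && echo "from scipy.stats import binom" > code1.py \
    && git commit --quiet -a -m "Use binom" )

git pull --no-rebase --quiet || echo "pull stopped on a conflict, as expected"
conflicted=$(git status --short)          # "UU" means both sides modified the file
cat code1.py                              # shows the conflict markers

# Resolve by hand: one line with both imports and no markers.
echo "from scipy.stats import binom, gamma" > code1.py
git add code1.py                          # mark the conflict as resolved
git commit --quiet --no-edit              # keep the pre-filled merge message
git log --oneline --graph                 # the merge commit joins both sides
```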

Exercise Git-2.4: Bonus: A full cycle of contribution

  1. In this exercise, you will clone a repository from github, add and edit some files, and send the change back. This is a full cycle of what you would do if you are contributing to a real project.

  2. First, clone the repository. The repository you will be cloning is that of this lecture itself. Clone using the git clone command. This makes a local copy of a repository on some server.

    $ git clone https://github.com/rkdarst/scicomp.git
    

    You will now find a new directory scicomp in your current directory. Change into it.

    $ cd scicomp/
    
  3. Now, you need to find some change to make. There are several options here. You can make a serious change that you would like to contribute to this talk, and I will probably actually use it. Or, you can just make some random test edits for your own practice. Go edit the files. This talk is at tut/scip/git.rst.

  4. Commit the changes. Use a good commit message, since someone else will be reading it to judge your commit!

  5. Now, you have to get your commits from your computer to me. Since you don't have rights to push directly to the repository, you will need to send me a patch. (You could open a pull request on github, but that is beyond the scope of this tutorial.) To make the patch, we will use git format-patch. We run it with one argument: "the last upstream commit". We can use the keyword origin/master for this.

    $ git format-patch origin/master
    0001-COMMIT_TITLE.patch
    

    You can look at the .patch file to see the format. It is formatted like a raw email.

  6. Now, you need to get this file (the new .patch) to me. Command line email isn't set up on triton, so you should copy and attach this file to an email to me (rkd@zgib.net). You could copy and paste it directly into an email, but certain mail programs can mess up whitespace and line wrapping, which stops the patch from applying cleanly and makes it hard to use.

  7. Double-bonus: Research the "pull request" model of contributions. Github has good documentation on this. Emailing patches is a little bit old-fashioned, but still always works. Using the power of project hosting sites, you can more easily send changes, discuss them, and get them merged.
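
To see both ends of the patch workflow, here is a local sketch with two scratch repositories standing in for the maintainer and the contributor. All names are invented (project, contributor, git.rst contents, identities). origin/HEAD is used in place of origin/master so the sketch does not depend on what the default branch is called. git am is the standard command for applying a patch file produced by git format-patch.

```shell
work=$(mktemp -d) && cd "$work"

# The maintainer's repository.
git init --quiet project
cd project
git config user.name "Maintainer"         # placeholder identity
git config user.email "maintainer@example.org"
echo "Tutorial text" > git.rst
git add git.rst
git commit --quiet -m "Initial commit"

# The contributor clones it and commits a change.
cd "$work"
git clone --quiet project contributor
cd contributor
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.org"
echo "One more tip" >> git.rst
git commit --quiet -a -m "Add a tip"

# One .patch file per commit that upstream does not have yet.
patch=$(git format-patch origin/HEAD)
echo "$patch"

# The maintainer applies the received file as a new commit.
cd "$work/project"
git am --quiet "$work/contributor/$patch"
git log --oneline
```

The applied commit keeps the contributor's name as author, which is one reason a good commit message matters.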

How does this work in practice?

Conflict notes

Working to reduce conflicts

Other conflict resolution options

Project management systems (e.g. Gitlab and GitHub)

Conclusion

The end

Next steps

Summary of commands: basic

The commands needed, as we know them now.

Summary of commands: sharing and collaborating

These are the extra commands we have learned today.

References

The "staging area" or "index"

Other things to try

Here are some ideas for independent study that you need to try yourself: