Understanding Distributed Version Control Systems

This post is based on a short article I wrote to explain DVCS to my colleagues at work (hence the comparison with TFS). For those wanting to go further, I strongly recommend Eric Sink’s book Version Control By Example. I’ll be presenting some of this material at an upcoming session at NxtGenUG in Southampton.

The last decade has seen a lot of innovation in version control systems, with the rise of the Distributed Version Control System the most notable development. Distributed Version Control Systems (DVCS) are an idea that was first implemented in the 90s, but in the last five years, open source DVCS such as Git and Mercurial have rapidly started to gain market share over the more established centralized version control systems, such as SourceSafe, TFS or Subversion.

The Centralized Model

In the centralized version control systems (CVCS) that you are already familiar with, every developer connects to the central server and asks for the latest version of the source code. They then work locally on their machine until they have completed a feature, and then they check in. As soon as they check in, those changes are made available on the central server for other developers. If someone checked in before them, they must get latest and merge before they check in.

Understanding DVCS with DAGs

To understand Distributed Version Control Systems, you need to think of your repository in terms of a directed acyclic graph (a “DAG”). Each node in the graph represents a revision, and each arrow points from a revision to the revision it changes (its ‘parent’ node), so the arrows go in the opposite direction to time.

Imagine that our repository has had three commits (1, 2 and 3). This can be represented by a simple DAG, 1 ← 2 ← 3, in which revision 3 is a change based on revision 2, and revision 2 is a change based on revision 1.
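If it helps to see the three-commit DAG as data, here is a minimal sketch in Python. The dictionary layout and the ancestors helper are my own illustration, not how any real DVCS stores its history.

```python
# Each revision maps to the list of its parent revisions.
# Revision 1 is the root, so it has no parents.
dag = {1: [], 2: [1], 3: [2]}


def ancestors(dag, rev):
    """Return every revision reachable by following parent arrows from rev."""
    seen = set()
    stack = list(dag[rev])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen


print(sorted(ancestors(dag, 3)))  # prints [1, 2]
```

Because the arrows point from child to parent, walking them from any revision takes you back through its history.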

Cloning

With a CVCS, we would do a get latest and just get the state of all files at revision 3. However, in the world of DVCS, we do a clone instead. This means that we take a copy of the whole DAG. (n.b. This makes some people concerned about speed, but a clone is only done once, and after that most operations run against the local repository rather than over the network, which is why DVCS are known for being fast.) Suppose BILL decides to take a clone from the server. Now his local copy contains the same three revisions, 1, 2 and 3.

In other words, it is identical to the SERVER. This is why people say that you don’t need to have a central server for DVCS, since we could take away SERVER and nothing would be lost. However, in practice, it usually makes sense to designate one computer as the central repository.
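To make the point concrete, here is a rough sketch that treats a repository as a dictionary of revisions and their parents. The clone function is a hypothetical stand-in for what a real DVCS does over the network.

```python
import copy

# The SERVER's DAG: each revision maps to its list of parents.
server = {1: [], 2: [1], 3: [2]}


def clone(repo):
    """Copy the whole DAG, not just the files at the latest revision."""
    return copy.deepcopy(repo)


bill = clone(server)
print(bill == server)  # prints True

# Even if the SERVER's copy is lost, BILL still has the full history.
server.clear()
print(sorted(bill))  # prints [1, 2, 3]
```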

Committing

Now BILL wants to make some changes. He writes some code and performs a commit. Now his DAG has an extra node in it:

However, unlike a checkin in TFS or another CVCS, nothing has been sent to the server. Revision 4 is only on BILL’s machine.

BILL can actually carry on developing and do another commit. Now he has two local revisions that the SERVER doesn’t have.
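Here is a small sketch of those two local commits, again modelling a repository as a dictionary of revisions and parents. The commit function and its arguments are assumptions of mine, chosen only to mirror the revision numbers used in this article.

```python
def commit(repo, rev, parent):
    """Record a new revision whose parent is an existing revision."""
    if parent not in repo:
        raise KeyError(f"unknown parent revision {parent}")
    repo[rev] = [parent]


server = {1: [], 2: [1], 3: [2]}
# BILL's clone starts out as a full copy of the SERVER's DAG.
bill = {rev: list(parents) for rev, parents in server.items()}

commit(bill, 4, 3)
commit(bill, 5, 4)

# The new revisions exist only in BILL's repository.
local_only = sorted(set(bill) - set(server))
print(local_only)  # prints [4, 5]
```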

Already, the benefits should be obvious. It is as if BILL has his own personal branch to work on. If Revision 5 was a mistake, he can roll back to Revision 4 and try again. He is able to make small, frequent checkins without fear that he will break the build by committing to the SERVER.

Pushing

At some point, BILL is ready to share his work with everyone else so he does a push. This simply compares his DAG (which contains Revisions 1-5) with the SERVER’s DAG (which currently still only has revisions 1-3). When he does a push, the DVCS works out that the server needs revisions 4 and 5, so it appends them onto its own DAG. Now the server’s DAG is identical to BILL’s again.
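The comparison a push performs can be sketched in a few lines. This is a simplified model with a hypothetical push function; real tools exchange revisions over a network protocol and identify them by hash rather than by small integers.

```python
def push(local, remote):
    """Copy every revision the remote lacks and return the list of what was sent."""
    missing = [rev for rev in local if rev not in remote]
    for rev in missing:
        remote[rev] = list(local[rev])
    return missing


bill = {1: [], 2: [1], 3: [2], 4: [3], 5: [4]}
server = {1: [], 2: [1], 3: [2]}

sent = push(bill, server)
print(sent)            # prints [4, 5]
print(server == bill)  # prints True
```

Pushing again straight away would send nothing, because the SERVER already has every revision BILL has.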

Pulling

But what if someone else got in there before BILL? Maybe the SERVER now looks like this, with revisions 10 and 11 having been pushed from someone else:

Maybe we would like our revisions 4 and 5 to be added straight after revision 11 when we push.

But this is not allowed (and in fact not possible). Revision 4’s parent is Revision 3, not Revision 11, so we can’t just stick it on the end of our DAG. What will happen is much the same as with a CVCS, which will block you from checking in, and tell you that you need to do a get latest to merge those changes into your working copy. A DVCS will warn you that you probably don’t want to do a push because it will create two “heads”, and instead you should do a pull and a merge.
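One way to picture the check is to count heads before and after the push. The sketch below is my own simplification: the function names are hypothetical, and real tools apply their own rules (Mercurial aborts a push that would create new remote heads unless forced, and Git rejects a non-fast-forward push).

```python
def heads(repo):
    """A head is a revision that no other revision names as its parent."""
    parents = {p for ps in repo.values() for p in ps}
    return sorted(rev for rev in repo if rev not in parents)


def push_would_add_head(local, remote):
    """Return True if pushing would leave the remote with more heads than before."""
    combined = {**remote, **local}
    return len(heads(combined)) > len(heads(remote))


bill = {1: [], 2: [1], 3: [2], 4: [3], 5: [4]}
server_before = {1: [], 2: [1], 3: [2]}
server_after = {1: [], 2: [1], 3: [2], 10: [3], 11: [10]}

print(push_would_add_head(bill, server_before))  # prints False
print(push_would_add_head(bill, server_after))   # prints True
```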

So BILL does a pull, which is the opposite of a push. It looks at the SERVER and sees what nodes it has that he doesn’t. In this case, it is revision 10 and revision 11. They are pulled from the SERVER and added to his DAG. But notice that now both Revision 4 and Revision 10 are changes to Revision 3, and it means that our repository now has two “heads” - revision 5 and revision 11.
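Here is a sketch of that pull, using the same dictionary model of a repository. The pull and heads functions are illustrative assumptions rather than any tool's real API.

```python
def heads(repo):
    """A head is a revision that no other revision names as its parent."""
    parents = {p for ps in repo.values() for p in ps}
    return sorted(rev for rev in repo if rev not in parents)


def pull(local, remote):
    """Copy every revision the local repository lacks and return what arrived."""
    incoming = [rev for rev in remote if rev not in local]
    for rev in incoming:
        local[rev] = list(remote[rev])
    return incoming


bill = {1: [], 2: [1], 3: [2], 4: [3], 5: [4]}
server = {1: [], 2: [1], 3: [2], 10: [3], 11: [10]}

print(pull(bill, server))  # prints [10, 11]
print(heads(bill))         # prints [5, 11]
print(bill[4], bill[10])   # prints [3] [3]
```

Revisions 4 and 10 both name revision 3 as their parent, which is exactly why the repository ends up with two heads.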

The good news for BILL is that his local commits, 4 and 5, are still perfectly safe and intact. Unlike a get latest with CVCS, no automatic merging has taken place that could overwrite or break anything he has already done. The merge takes place as a separate step.

Merging

Now that BILL has the latest changes from the SERVER, he performs a merge. If there are no conflicts then this is quite trivial. If someone else has changed the same bit of code as him, then he must use the typical merge tools you are already familiar with to select which change is the correct one – there is no getting round this. However, once he has done the merge, he then makes another local commit, in this case, revision 6. Now his repository has only one head and is ready to be pushed to the SERVER.
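In graph terms, the merge commit is simply a node with two parents. The sketch below shows only that structural effect; the merge function is hypothetical and ignores the actual work of combining file contents.

```python
def heads(repo):
    """A head is a revision that no other revision names as its parent."""
    parents = {p for ps in repo.values() for p in ps}
    return sorted(rev for rev in repo if rev not in parents)


def merge(repo, new_rev, first_parent, second_parent):
    """Record a merge revision that has both heads as its parents."""
    repo[new_rev] = [first_parent, second_parent]


# BILL's repository after the pull: two heads, 5 and 11.
bill = {1: [], 2: [1], 3: [2], 4: [3], 5: [4], 10: [3], 11: [10]}
print(heads(bill))  # prints [5, 11]

merge(bill, 6, 5, 11)
print(heads(bill))  # prints [6]
print(bill[6])      # prints [5, 11]
```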

What are the benefits of DVCS over CVCS?

Hopefully some of the benefits of DVCS over CVCS are already apparent, but let me list a few.

First, it promotes small, frequent checkins. With a CVCS, you only want to check in if you are 100% sure that your work is fully tested and won’t break the build. This encourages developers to work sometimes for weeks at a time without checking in. With DVCS, you can check in dozens of times a day, and still only push to the SERVER when you are ready. (n.b. It is up to you whether you want to combine all your local checkins into one before you push to the server. Different developers have different philosophies about this.)

Second, it gives every developer multiple personal branches. It is common to have to context switch in the middle of working on something (“drop everything and fix this bug right now”). Whilst TFS has Shelvesets and Workspaces that allow you to separate out multiple tasks you are working on, they are rather cumbersome to use, and end up getting neglected. With DVCS, it is trivially easy to create another local branch (or clone, depending on your preference), to work on the new feature. You can even merge freely between your local branches if necessary.

Third, it allows ad-hoc teams. You don’t have to push everything via the central server. If you are working with another developer and want to share some work-in-progress changes you have made, you can simply pull from each other. Despite both developers now having the same revisions in their local DAG, each one will only get pushed to the server once, and will be attributed to the developer who wrote the code. TFS shelvesets cannot do this.

Fourth, it gives complete branching flexibility. With TFS you can only safely merge into a parent or child branch. This restriction is completely removed with DVCS. You are free to pull or push to whatever branch you want. Obviously there will still need to be processes in place that say what branches ought to be pushed to.

Fifth, it allows for much more flexible handling of merges. Unlike with a CVCS you are not forced to deal with them the moment you do a get latest. Your local repository can have two heads for a time, allowing you to defer the merge until you are ready. It also no longer means that the first developer to check in wins, while the second is lumbered with the merge. You can ask the developer who made the conflicting changes to pull from your local repository, perform the merge, and then you can pull that merge back from them.

Sixth, it allows disconnected working. If you need to work from home without network access, you can still commit locally and push when you have access to the network. This is also good for outsourced teams. They can make commits to their clone of your repository, and you can pull those revisions into your own without ever needing to give them commit access to your own server.

Seventh, it is great for backup. The central server could suffer a catastrophic disk failure, but it can be quickly recreated by cloning from a developer’s repository. Also, a developer can easily back up their work in progress by pushing to a repository on another computer.

What are the limitations of DVCS?

So far I have painted a very positive picture of DVCS. Is there anything it doesn’t do well?

One feature offered by some CVCS that you will probably lose is the ability to lock files so that only one person can work on them at a time. This is most often needed for binary assets. Obviously if your developers are disconnected from a central server, they have no way of knowing whether someone else is also working on a file. However, some DVCS have extensions allowing you to manage files that need to be locked via a central server.

DVCS are not typically a good choice if you want to store very large files or to have very large repositories where people might not want to do a get latest of every folder in the repository. Most users of DVCS simply develop practices that don’t require storing huge binaries in source control, and split vast projects up into smaller sub-projects.

With a DVCS, history is effectively immutable. If you want to modify revision 2, or expunge it altogether from the history of your source control, you can get into trouble, because you effectively end up with a completely new DAG. When the next user does a push, the old revision 2 and its descendants will come back. To change history permanently, you have to make the change on a central repository and then ask all your developers to destroy their local repositories and re-clone from the central one.
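The reason a rewrite produces a new DAG is that tools such as Git and Mercurial identify each revision by a hash that covers its parents as well as its content. The sketch below illustrates the idea only: the revision_id and build_history functions, and the exact inputs to the hash, are my own simplification and not either tool's real format.

```python
import hashlib


def revision_id(content, parent_ids):
    """Derive an identifier from the revision's content and its parents' identifiers."""
    digest = hashlib.sha1()
    for parent_id in parent_ids:
        digest.update(parent_id.encode("ascii"))
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()


def build_history(contents):
    """Build a linear history and return the identifier of each revision in order."""
    ids = []
    for content in contents:
        parents = ids[-1:]  # the previous revision, or none for the first
        ids.append(revision_id(content, parents))
    return ids


original = build_history(["first change", "second change", "third change"])
rewritten = build_history(["first change", "second change (edited)", "third change"])

print(original[0] == rewritten[0])  # prints True: revision 1 is untouched
print(original[1] == rewritten[1])  # prints False: revision 2 was edited
print(original[2] == rewritten[2])  # prints False: its child changes too
```

Under this scheme, editing one revision gives it and every revision after it a new identity, so anyone who still holds the old ones has revisions your repository no longer recognises as its own.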

How do I get started?

There’s nothing stopping you trying out DVCS for yourself, and I strongly recommend it for any small projects you write. Even if you are the only developer, it is great to have a commit history to remind you what you were doing, and the ability to go back in time if you make a mistake. I use Mercurial for all my personal projects, but if you ever collaborate on open source projects you will likely end up needing to learn Git at some point, as it is the most popular DVCS.