Tuesday, 12 February 2008
The Need for Distributed Version Control in the Enterprise
I'm looking at software configuration management tools again. My last three projects at work have all used Subversion. It was a big improvement over CVS, but I'm growing tired of having to push everything through a central server to share it with others (or even between my own computers). I'm also finding that the pain of branching and merging really limits the pace of change. These are the problems that newer SCM tools like Git and Mercurial were created to solve. I'm convinced that enterprises need distributed version control.
It seems many people think tools like Git and Mercurial are only appropriate for large open source projects, and aren't necessarily relevant to the enterprise. I submit that enterprise development teams need the new tools just as much, if for slightly different reasons. Basically, as enterprise developers, we need to write code in a feature-oriented way and then group completed features together into releases. There are two approaches to allocating features to releases: you either decide the features in a release up front and work until you are done (the "feature scoped" method), or you decide the time frame for a release, pull in the features that are complete, and push the incomplete ones to the next release. My thesis here today is that the latter method is better, but that it really stresses your ability to manage branching and merging, and that tools like Git and Mercurial are simply better at managing this.
The big problem in software development (as everyone knows) is that nobody can figure out how to estimate the duration of a task very well. If you think you have a solution to this problem, slap yourself, because you don't. The variability in estimates means that the "feature scoped" method (fixing the feature list up front) incurs heavy waste, because all the features in a release go out at the pace of the slowest one. Most developers finish their tasks well before the slowest one is done and then have degraded productivity until the next release. These problems get worse the more features there are in a branch.
The degraded productivity goes away if developers can push and pull features between releases. In fact, the cost of estimating wrongly drops if we can juggle features easily, because now we aren't holding back a bunch of working features when we estimate wrong.
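To put rough numbers on that argument, here is a minimal sketch comparing the two release policies. The function names, the feature durations, and the idea of measuring waste as "days that finished features sit unreleased" are all my own assumptions for illustration, not measurements from a real project.

```python
from typing import List, Tuple


def feature_scoped_release(durations: List[int]) -> Tuple[int, int]:
    """Release ships only when the slowest feature is done.

    Returns (ship_day, waiting_days), where waiting_days is the total
    number of days that finished features sit around unreleased.
    """
    ship_day = max(durations)
    waiting_days = sum(ship_day - d for d in durations)
    return ship_day, waiting_days


def time_boxed_release(
    durations: List[int], deadline: int
) -> Tuple[List[int], List[int], int]:
    """Release ships on a fixed day with whatever is finished.

    Returns (shipped, slipped, waiting_days). Slipped features move to
    the next release instead of holding this one up.
    """
    shipped = [d for d in durations if d <= deadline]
    slipped = [d for d in durations if d > deadline]
    waiting_days = sum(deadline - d for d in shipped)
    return shipped, slipped, waiting_days


# Hypothetical actual durations in days; one feature badly overran.
durations = [3, 4, 5, 12]

print(feature_scoped_release(durations))          # (12, 24)
print(time_boxed_release(durations, deadline=5))  # ([3, 4, 5], [12], 3)
```

In this made-up case the feature-scoped release holds three finished features hostage for a combined 24 days, while the time-boxed release ships them on day 5 and lets the late feature slip. The catch is that slipping a feature means moving its code between release branches, which is where the source control tool comes in.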
But you can start to see how all of this relies on nimbleness in source control tools. Branching, merging, and conflict resolution have to be easy, quick, and simple. At any given time you might have a released branch, a branch under QA/test, and usually several different release branches in development: the next few minor releases and also the next major release. A feature's code can be juggled around among the unreleased versions. For example, if we find a bug, we want to apply the fix to all in-process development branches. We create a branch just for this bug, maybe, and merge its changes into all the other branches. Often we juggle a feature out of one branch and into another because it isn't ready in time. Whenever we create multiple development branches, we have to merge the earlier ones into the later ones a lot.
The fact is that merging is extremely painful in Subversion for several reasons. First, Subversion doesn't really think in terms of branches. The directories in your /branches folder are really nothing more than ordinary directories. Worse, Subversion doesn't help you at all in terms of keeping track of completed merge operations. Keeping track of merged revisions manually simply doesn't scale - the process becomes error prone, and anyone who has done a lot of branching and merging in Subversion has stories to tell on the theme of "oops".
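To make the bookkeeping problem concrete, here is a toy sketch of the record somebody has to keep, whether it lives in the tool or in a developer's head. This is not how Subversion, Git, or Mercurial implement anything; the MergeLog class, the branch names, and the revision numbers are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


class MergeLog:
    """Toy record of which revisions each branch has already received."""

    def __init__(self) -> None:
        self._merged: Dict[str, Set[int]] = {}

    def record(self, target: str, revisions: Iterable[int]) -> None:
        """Note that these revisions have been merged into target."""
        self._merged.setdefault(target, set()).update(revisions)

    def pending(self, target: str, candidates: Iterable[int]) -> List[int]:
        """Return the candidate revisions target has not received yet."""
        done = self._merged.get(target, set())
        return sorted(r for r in candidates if r not in done)


log = MergeLog()
bugfix_revisions = [101, 102, 105]  # hypothetical commits on a bugfix branch

# Someone merged the first two fixes into one release branch last week.
log.record("release-1.1", [101, 102])

print(log.pending("release-1.1", bugfix_revisions))  # [105]
print(log.pending("release-2.0", bugfix_revisions))  # [101, 102, 105]
```

Without a record like this, the second merge into release-1.1 either re-applies revisions 101 and 102 or skips 105, and both mistakes are the "oops" stories mentioned above. Multiply that by every bugfix and every development branch and the manual approach falls over quickly.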
In fact, if branching were really cheap it might be better to create a branch per feature. This would let release managers wait until the last possible moment to merge changes: they'd look in the ticket tracking system, see which features are "ready for merge" and pull them in, or better yet have the feature developer do it, resolving conflicts first. Many of us are used to doing this with patch files, but the problem there is that an ordinary diff can't understand directory structures and how things in different places really are different versions of "the same" thing. And there's also the problem of the directory structure itself.
As the number of branches goes up, there will be a natural desire for peer-to-peer merging capabilities. Your peer has a bug? Pull his branch. If two people collaborate on a feature, why make them talk to a middleman to share changes? Especially when those changes may be broken or half-baked? Ever been annoyed by doing a Subversion update and getting a bunch of changes you didn't want yet? Worse, ever seen one that broke everything? What if you only pull in things you want, when you are ready and know what's coming? What if half the time you don't even have to do that because the other guy is pulling your stuff first? Now distributed SCM seems a lot more reasonable. What if you have multiple computers (home/work or PC/laptop or both)? Wouldn't it be nice to share changes with yourself, even if they don't even compile at the moment? Do it in one step instead of two. Oh, and now you and your coworker can collaborate outside of the office. So go to a cafe and get some work done together.
It looks like Git and Mercurial both have a compelling improvement to offer enterprise development shops. Git is slightly faster and has a more comprehensive command set, but Mercurial's commands will be more familiar to SVN/CVS users and Mercurial has good Windows support. Lots of big projects are using these tools: Linux uses Git, Java uses Mercurial. We're starting to see more medium and small shops use them too. Which is better? They're both good, and there's a tradeoff of power vs simplicity and speed vs portability. Pick the first option in either tradeoff and you should look at Git; pick the second and you should look at Mercurial.
Posted by at 3:02 AM in the internet, web, web 2.0 and beyond
