Dropping history with Mercurial

One feature we’ve wanted from hg for a while is the ability to drop history. It’s a natural thing to want, after all; with any sufficiently active repo, you’ll eventually need to drop history. For us, this is only an issue with our single most active tree, which weighs in at about 120k changesets and 2.3G. Given that there is no magic “drop history” button to press, what can you do? One approach would be to abandon your old history entirely, copying the state at tip to a brand new repo with no history at all. But this is problematic. You keep your history around for a reason. You need it to handle merges, and it’s nice to have when you try to understand the origin of a certain change in your tree.

Another approach is to create a new repo with a subset of your history. hg convert can help you do that. You can pick a base revision x, and hg convert will create a new repo that has revisions corresponding to x and all its descendants. Note that this new tree is, in hg terminology, unrelated to the original, which is to say, you can’t directly merge between them. This approach keeps a subset of your history around, which, if you pick x carefully, will let you do merges cleanly.

But there are problems here too. Cutting a new tree like this requires an all-at-once conversion from the full repos to the truncated ones. For a set of developers that are actively working on dozens of branches around the clock, this is hard to swallow.

We’ve found a solution to this problem that seems to get around all of these issues. We essentially maintain two worlds, a full world that has a complete history, and a truncated world that omits all history before a certain revision x, and we’ve built what we call a convert daemon that ferries patches back and forth between the two worlds.

The end result is that you can pull from and make changes to either the full or truncated worlds, so it just doesn’t matter that much where you make them. This is allowing us to migrate slowly to the truncated repos, without introducing communication barriers between those who have and haven’t made the jump. Plus, you can use the more efficient truncated world for day-to-day use, but can still go back to the full world if you want to dive into older history.

Building the Daemon

So how do you build a convert daemon? Our solution is based on hg convert, which does the grotty work of actually converting changesets from one world to the other. But you need more than that to create a bi-directional bridge.

The convert daemon is built on top of two repositories, which we’ll call full-convert and trunc-convert, corresponding to the full and truncated world respectively. These two repos are multi-headed, and the basic workflow is to push a revision to one repo, and ask for the daemon to convert it and push it to the other repo.

You can imagine the interface to the convert daemon including the following two functions:

(* Converts a revision from full-convert to trunc-convert. Returns None if
   the provided revision is not in full-convert. If it returns [Some rev],
   then that rev is available in the trunc-convert repo. This
   transformation drops history from the full world. *)
val forward_convert : revision -> revision option

(* Like [forward_convert], but for pushing from trunc-convert to
   full-convert. *)
val backward_convert : revision -> revision option

There are some invariants that should hold. First, for any single revision, multiple calls to forward_convert (or backward_convert) should always produce the same output revision. Also, running forward_convert and then backward_convert should be the identity. Note that the conversion mechanism in hg convert is not necessarily deterministic, so to make sure that this holds, you need a consistent revision map that keeps track of which revisions have been converted.
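To make the role of the revision map concrete, here is a minimal sketch of how such a map could enforce both invariants. Everything in it is an assumption for illustration: the `rev_map` type, the `record` helper, and the `convert` callback (a stand-in for whatever actually invokes hg convert) are hypothetical names, and a real daemon would need to persist the map rather than hold it in memory.

```ocaml
(* Hypothetical sketch of a consistent revision map; not the daemon's
   actual implementation. *)
type revision = string

type rev_map = {
  to_trunc : (revision, revision) Hashtbl.t; (* full -> truncated *)
  to_full : (revision, revision) Hashtbl.t;  (* truncated -> full *)
}

let create () = { to_trunc = Hashtbl.create 16; to_full = Hashtbl.create 16 }

(* Record a converted pair in both directions, so that converting back
   always lands on the revision we started from. *)
let record map ~full ~trunc =
  Hashtbl.replace map.to_trunc full trunc;
  Hashtbl.replace map.to_full trunc full

(* [convert] is an assumed callback standing in for the real conversion.
   It returns [None] when the revision is not in the source repo. It is
   only called for revisions the map has not seen before. *)
let forward_convert map ~convert rev =
  match Hashtbl.find_opt map.to_trunc rev with
  | Some _ as known -> known
  | None ->
    (match convert rev with
     | None -> None
     | Some trunc -> record map ~full:rev ~trunc; Some trunc)

let backward_convert map ~convert rev =
  match Hashtbl.find_opt map.to_full rev with
  | Some _ as known -> known
  | None ->
    (match convert rev with
     | None -> None
     | Some full -> record map ~full ~trunc:rev; Some full)
```

Because the map is consulted before the converter, a non-deterministic converter can only ever be run once per revision, and a round trip is answered from the map rather than by converting again.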

Once you have this core abstraction in mind, the implementation of the daemon is pretty straightforward. The next question is, how do you use this daemon to tie together the full and truncated worlds? For us, doing this requires figuring out the interplay between the compile daemon and the convert daemon.

Integrating with the compile daemon

As described in a post I put up a few years back, our development process depends critically on a compile daemon. A compile-daemon managed tree actually has two related repositories, a primary repo, and a staging repo. The primary repo is where you pull from to get a clean, compiling version of the tree. The staging repo is where you push your proposed changes for consideration for inclusion into the primary repo. Staging is multi-headed, meaning that you push changes to it without merging.

The compile daemon’s job is to grab heads out of staging and see if they can be brought into the primary repo. The compile daemon checks that a head merges cleanly, compiles it, and runs the unit tests. If everything passes, then the revision in question (along with the new merge node) is added to the primary tree. Otherwise, it is marked as rejected.
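The accept-or-reject step described above can be sketched roughly as follows. This is a simplified model, not the real compile daemon: the `checks` record and its three callbacks are hypothetical placeholders for the merge, build, and test steps, the primary repo is modelled as a plain list of revisions, and the merge node itself is not represented.

```ocaml
(* Hypothetical model of the compile daemon's decision for each head. *)
type revision = string

type verdict = Accepted of revision | Rejected of revision

(* The checks run on a staging head, in order. Each callback is an
   assumption standing in for real hg, build, and test invocations. *)
type checks = {
  merges_cleanly : revision -> bool;
  compiles : revision -> bool;
  tests_pass : revision -> bool;
}

(* [&&] short-circuits, so later checks are skipped once one fails. *)
let consider checks head =
  if checks.merges_cleanly head && checks.compiles head && checks.tests_pass head
  then Accepted head
  else Rejected head

(* Walk the staging heads in order. Returns the updated primary repo and
   the rejected heads, both with the most recent entry first. *)
let process_staging checks ~primary heads =
  List.fold_left
    (fun (primary, rejected) head ->
       match consider checks head with
       | Accepted rev -> (rev :: primary, rejected)
       | Rejected rev -> (primary, rev :: rejected))
    (primary, []) heads
```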

The compile daemon and the convert daemon work together. Notably, you have just one compile daemon between the full and truncated worlds, and the convert daemon is set up to route patches from both worlds through it. The following picture describes this merge flow.

[Figure: convert daemon layout]

The green arrow shows the actions of the compile daemon, bringing in patches from full-staging to full. (Note that we could just as well move the compile daemon to the truncated world, and indeed, it would probably be marginally more efficient, since cloning would be faster.) Clients are shown pulling from full and pushing to full-staging, and similarly pulling from trunc and pushing to trunc-staging.

The purple and blue arrows show the actions of the convert daemon. The convert daemon pulls heads from trunc-staging, converts them, and pushes them to full-staging. From full-staging, those patches will be considered for inclusion by the compile daemon. Then, whatever shows up in full will be pushed over to trunc by the convert daemon. It is pushed to trunc-staging as well, since it’s confusing to have things in trunc but not trunc-staging.
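As a rough illustration of that routing, the two directions can be modelled like this. It is only a sketch under stated assumptions: each repo is reduced to a set of revision names, the `repos` record and the `ferry_backward` and `ferry_forward` functions are hypothetical, and the converters are passed in as callbacks with the same shape as forward_convert and backward_convert.

```ocaml
(* Hypothetical model of the convert daemon's routing between repos. *)
module Revs = Set.Make (String)

type repos = {
  mutable full : Revs.t;
  mutable full_staging : Revs.t;
  mutable trunc : Revs.t;
  mutable trunc_staging : Revs.t;
}

(* Heads pushed to trunc-staging are converted and pushed to
   full-staging, where the compile daemon will consider them. *)
let ferry_backward repos ~backward_convert =
  Revs.iter
    (fun head ->
       match backward_convert head with
       | Some full_rev ->
         repos.full_staging <- Revs.add full_rev repos.full_staging
       | None -> ())
    repos.trunc_staging

(* Whatever has been accepted into full is converted and pushed to both
   trunc and trunc-staging, so trunc never holds a revision that
   trunc-staging lacks. *)
let ferry_forward repos ~forward_convert =
  Revs.iter
    (fun rev ->
       match forward_convert rev with
       | Some trunc_rev ->
         repos.trunc <- Revs.add trunc_rev repos.trunc;
         repos.trunc_staging <- Revs.add trunc_rev repos.trunc_staging
       | None -> ())
    repos.full
```

Note that nothing here moves a revision from full-staging into full; in this model that step belongs to the compile daemon alone.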

That’s basically the whole system. We’ve been using it gingerly for the last month, and it seems to be working quite solidly. I expect we will transition all of our development over to these new truncated trees in the next couple of months.

One interesting side effect of all this is that you can do other transformations on your tree using the convert daemon. If in your past people have committed absurdly large files, for example, the convert daemon can weed those out. It seems like there are a lot of potential applications for this kind of bidirectional bridge.