I just stumbled across a post from earlier this year by Graydon Hoare, of Rust fame. The post is about what he calls the “Not Rocket Science Rule”, which says that you should automatically maintain a repository that never fails its tests. The advantages of the NRS rule are pretty clear. By ensuring that you never break the build, you shield people from having to deal with bugs that could easily have been caught automatically.

There are a few ways of implementing the NRS rule. Graydon describes the now-defunct Monotone VCS as a system meant to implement this in a distributed way. But you can also write a simple set of scripts that handle it in a centralized manner. To do this, you create a build-bot that evaluates every pull request, merges (or rebases) it with the latest tip as necessary, and then makes sure that it builds cleanly and that all the tests pass before allowing it to become the new released tip.

Graydon notes that lots of people get this wrong. They have automated test suites, but the test suites are run on code after it’s integrated with the main branch, not before. This is a lot worse: it exposes people to buggy code, and, more subtly, it means that when things break, you have to scramble to figure out who broke the build so you can get them to fix it.

If the NRS rule is such a good idea, why isn’t it used more often? Part of the reason, I suspect, is the lack of good tools. Indeed, when we wanted to do this at Jane Street, we ended up building our own.

But I think there’s more to it than that. Another problem with the NRS rule is that it’s not clear how to make it scale. To see why, consider the naive scheme Graydon mentions for implementing NRS: for each pull request, you merge it with the tip, compile and build the result, run all the unit tests, and then, if all goes well, bless that as the new tip of the repo. This process is sequential, deciding on a single, linear order in which the merges happen, building and testing each proposed change along the way.
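
To make the naive scheme concrete, here is a minimal sketch of such a sequential build-bot in Python, driving git from the command line. The branch names, the list of pending requests, and the make build / make test commands are assumptions chosen for illustration, not a description of any particular setup.

```python
# Minimal sketch of a sequential NRS build-bot. Assumes a git checkout with a
# "main" branch as the released tip; branch names and the build/test commands
# are placeholders for illustration.
import subprocess

def run(*cmd):
    """Run a command, returning True if it exited successfully."""
    return subprocess.run(cmd).returncode == 0

def verify_and_release(candidate):
    """Merge a candidate branch with the current tip; release it only if the
    merged result builds cleanly and passes all the tests."""
    run("git", "checkout", "main")
    run("git", "pull", "--ff-only")                  # pick up the latest released tip
    run("git", "checkout", "-B", "trial", "main")    # throwaway branch for the trial merge
    if not run("git", "merge", "--no-edit", candidate):
        run("git", "merge", "--abort")
        return False                                  # merge conflict: reject
    if not (run("make", "build") and run("make", "test")):
        return False                                  # build or tests failed: reject
    # Everything passed, so the trial merge becomes the new released tip.
    run("git", "checkout", "main")
    run("git", "merge", "--ff-only", "trial")
    return run("git", "push", "origin", "main")

# The naive scheme: one candidate at a time, in a single linear order.
for branch in ["feature-a", "feature-b"]:
    print(branch, "released" if verify_and_release(branch) else "rejected")
```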

This means that if verifying a pull request takes m minutes, and you have n pull requests, releasing them all is going to take at least m * n minutes. At one point, this was a rather serious problem for us. We had lots of people constantly pushing small changes, each of which had to make its way through the build-bot. And some of these runs ended by rejecting the change, which meant that a single pull request might need multiple attempts to get released. To make matters worse, a full build of the tree was expensive, taking two hours and up.
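
To put illustrative numbers on that (these are for scale only, not precise measurements): with a two-hour build, m is 120, so a queue of just a dozen pull requests already implies m * n = 1440 minutes, a full day, before the last change lands, and every rejection-and-retry cycle pushes that out further.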

The end result is that things would get backed up, so that a change would sometimes take 24 hours to get through the build-bot. We knew we needed to fix this, and here are some of the ideas we tried out.

  • Simple speculation: If you have a high reject rate, you can increase your throughput by evaluating pull requests in parallel. The first one that succeeds becomes the next tip, and failing requests are rejected and sent back to their owners. Note that while each failure represents progress, multiple successes don’t, since the second successful pull request still has to be merged with the new tip and tested again before it can be integrated.

  • Merging speculation: Simple speculation only helps if you have lots of failed requests that need to be weeded out. If you want to speed things up further, you can speculate en masse by merging multiple requests together, and releasing them if they collectively build and pass all the tests. If the group fails, you don’t know for sure which of the requests was at fault, so you need to fall back to testing some requests individually to ensure forward progress (there’s a sketch of this fallback after the list).

  • Faster builds: Simply making your build faster helps a lot. We did a lot of work in this direction, including writing our own build system, Jenga, which sped up our builds by quite a bit. In addition to making from-scratch builds faster, we also worked to make incremental builds reliable. This made it possible for changes that only touched a small fraction of the tree to be verified very quickly.

  • Make fewer pull requests: This isn’t always possible, or even advisable, but other things being equal, a workflow with fewer proposed changes will get through the build-bot faster.
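
Here’s a rough sketch of the merging-speculation fallback mentioned above. The verify function stands in for “merge with the tip, build, and run the tests”; it’s a placeholder predicate, not a real API, and the batching policy is deliberately simplistic.

```python
# Sketch of merging speculation: try to release a whole batch at once, and if
# that fails, fall back to testing requests one at a time so that at least
# some forward progress is guaranteed. `verify` is a stand-in for "merge with
# the tip, build, and run the tests".
def release_batch(requests, verify):
    """Return the subset of requests that ends up released."""
    if verify(requests):
        return list(requests)             # the whole batch passed: release it as one
    accepted = []
    for request in requests:
        # The batch failed and we don't know which request was at fault, so
        # test each one individually on top of what has been accepted so far.
        if verify(accepted + [request]):
            accepted.append(request)
    return accepted

# Toy example: pretend request "b" is the one that breaks the build.
good = {"a", "c"}
print(release_batch(["a", "b", "c"], lambda reqs: set(reqs) <= good))   # ['a', 'c']
```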

Interestingly, a change in our workflow did massively reduce the number of requests made to the build-bot, which really did improve matters for us. This derived from a change in our approach to code review: we moved from a (slightly crazy, definitely unscalable) system where changes were submitted for integration into the main branch before review, with review of that branch done afterwards, to a system where a feature is kept on a separate branch until it is fully reviewed, and only integrated after that.

One side effect of this change is to batch up the integration requests, so that rather than integrating your changes bit by bit, you integrate them en masse when the feature is done. Our main repository went from accepting hundreds of requests a week to just thirty or forty.

The above tricks can tweak the constants, but don’t change the asymptotics. If the size of our development team grew by a factor of 10, or 100, these fixes notwithstanding, I would expect our approach to the NRS rule to break down.

A different approach to scaling up is to make the release process hierarchical. This is similar in spirit to the Linux kernel development model, where “lieutenants” are responsible for individual subsystems, each maintaining their own process for deciding which patches to integrate and making sure, via review and testing, that those patches are good enough to be migrated upstream.

Iron, our new code review and release management system, supports something in this vein called hierarchical features. Essentially, Iron allows you to organize your release process as a tree of release processes. By keeping the branching factor small, you reduce the amount of work that each release process needs to handle, and thus you can push more patches through the overall process. Effectively, release processes lower down in the tree batch together features, so that they can go through the central process in a single round through the build-bot.
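
As a rough illustration of the shape of this (not Iron’s actual interface, just a toy model with made-up names), a hierarchical release process can be modeled as a tree in which each node runs its own NRS-style check over the batched output of its children, so the node above it only ever sees one integration request per batch:

```python
# Toy model of hierarchical release processes. Each node batches up the
# features released by its children, runs its own NRS-style check on the
# batch, and passes the batch upward as a single unit. This is an
# illustration of the idea, not Iron's interface.
from dataclasses import dataclass, field

@dataclass
class ReleaseProcess:
    name: str
    children: list = field(default_factory=list)   # lower-level release processes
    pending: list = field(default_factory=list)    # features submitted directly here

    def release(self, verify):
        """Return the batch of features this process releases upward."""
        batch = list(self.pending)
        for child in self.children:
            batch.extend(child.release(verify))     # each child's batch arrives pre-tested
        if batch and verify(self.name, batch):      # this node's own build-bot run
            return batch
        return []                                   # nothing releasable this round

# With a small branching factor, the root sees a handful of batches rather
# than hundreds of individual requests.
teams = [ReleaseProcess("team-a", pending=["f1", "f2"]),
         ReleaseProcess("team-b", pending=["f3"])]
root = ReleaseProcess("main", children=teams)
print(root.release(lambda node, feats: True))       # ['f1', 'f2', 'f3']
```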

We already take advantage of this at a small scale, both by organizing long-running release processes, and by spinning up smaller release processes where an ad-hoc group can collaborate on an interconnected set of features, then merging them together to release them as a unit. Throughout, Iron carefully tracks the code review that was done.

The end result is that the not-rocket-science rule requires a bit more rocket science than it seems at first. Like many ideas, it gets trickier at scale.