Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and ease of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn't considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let's talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the steps taken to merge two heads A and B (sketched in code just after this list).

  • Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
  • Compute the patch from that base point to one of the two heads, say, A.
  • Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn't apply cleanly to B. The result of this process is the merge.
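
To make this concrete, here's a toy rendering of those steps in OCaml. The diff and apply functions are hypothetical stand-ins for the real patch machinery, and they work at whole-file granularity, so a "conflict" here just means the patch is being replayed against text that has changed:

type rev = { id : string; text : string }
type patch = { base_text : string; new_text : string }

(* step 2: compute the patch from the base point to head A *)
let diff ~from ~to_ = { base_text = from.text; new_text = to_.text }

(* step 3: replay the patch onto head B; a conflict appears when the
   patch was computed against text that B doesn't match *)
let apply patch ~onto =
  if String.equal onto.text patch.base_text
  then Ok { onto with text = patch.new_text }
  else Error "conflict: patch does not apply cleanly"

let merge ~gca ~a ~b = apply (diff ~from:gca ~to_:a) ~onto:b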

The above discussion oversimplifies the story by assuming there's a well-defined GCA, but this just isn't always true. To see why, consider a repository starting with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.
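
To see the MCA computation concretely, here's a small OCaml sketch that computes the set of maximal common ancestors over a parent DAG. The Hashtbl-based graph and string revision names are just scaffolding for the example:

module S = Set.Make (String)

(* all ancestors of [r], including [r] itself *)
let rec ancestors parents r =
  let ps = try Hashtbl.find parents r with Not_found -> [] in
  List.fold_left (fun acc p -> S.union acc (ancestors parents p)) (S.singleton r) ps

let maximal_common_ancestors parents a b =
  let anc r = ancestors parents r in
  let common = S.inter (anc a) (anc b) in
  (* drop any common ancestor that is a strict ancestor of another one *)
  let is_strict_ancestor r r' = (not (String.equal r r')) && S.mem r (anc r') in
  S.filter (fun r -> not (S.exists (is_strict_ancestor r) common)) common

(* the criss-cross history above: both A and B come back as MCAs *)
let () =
  let parents = Hashtbl.create 8 in
  List.iter (fun (r, ps) -> Hashtbl.add parents r ps)
    [ "A", ["R"]; "B", ["R"]; "C", ["A"; "B"]; "D", ["C"];
      "E", ["A"; "B"]; "F", ["E"] ];
  S.iter print_endline (maximal_common_ancestors parents "D" "F")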

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amount to principled solutions, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It's tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss-cross merges in the first place.

Avoiding the criss-cross merge

For those who haven't read my previous posts about how Iron approaches merges, I'll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

  • fe release moves changes from a child feature into its parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of the parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

  • fe rebase lets you merge a feature with its parent, effectively pulling changes from the parent feature into the child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

  • fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn't previously have one. e.g., calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, the quickcheck feature.

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.
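
The invariant itself is simple to state in code. Here's a hedged sketch (the types are mine, not Iron's):

type rev = string

type feature =
  { name : string   (* e.g. "jane/quickcheck" *)
  ; base : rev
  ; tip  : rev
  }

(* [gca] stands in for the real ancestry computation.  The server
   refuses any operation that would break this property, which is
   exactly what rules out criss-cross merges between a feature and
   its parent. *)
let invariant_holds ~gca ~(parent : feature) ~(child : feature) =
  String.equal (gca parent.tip child.tip) child.base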

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question, or, more precisely, that we built a consistent service. Though we didn't fully appreciate it at the outset, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature or rebasing the same feature simply isn't possible when there's a centralized monitor determining who goes first.

In retrospect, this shouldn't be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server gives you) can be used to exclude unwanted concurrent executions in a distributed system should surprise no one.

In the end, you can trace it all back to the CAP theorem: if you want progress while partitioned, you need to give up on consistency in some way. And criss-cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release that affect how information is shared between features require you to be connected. I think it's a small price to pay to never have to deal with a criss-cross merge.

Making making better

We spend a lot of time and effort on training new people, and it never stops for long. Right now our winter-intern class is ending; in five months we'll have a slew of new interns to get up to speed, and a few months after that we'll have an incoming class of new hires.

A big part of our new-hire training is OCaml Bootcamp, a month-long immersion program for non-dev hires (mostly trading, but some from other areas like research, systems, and accounting). We don't think everyone at Jane Street should be a full-on software developer, but writing code is such a useful way to get things done that it's worth teaching the basics to a broad set of people.

Teaching programming, especially to people who are not planning on becoming software developers, is an opportunity to reflect on how unnecessarily hard programming can be. There's a huge learning curve as you struggle to learn your way around the Unix command line, or figure out the key commands for controlling a 1970s-era text editor like Emacs or Vi, or puzzle through the semantics of a distributed version control system like Mercurial. And all of that is before you even start writing code!

To me, this serves as a reminder of the importance of good tools. The quality of your tools can increase or decrease the steepness of the learning curve, and they also affect your day-to-day efficiency after you've climbed up that hill.

Tools are easy to undervalue. Most of our development time is spent, as it should be, on our first order problems -- writing the code that powers the systems that let us trade. And the connection between better tools and better trading can seem tenuous.

But really it's not tenuous at all. If you spend all your time on first order problems, you'll discover you're not solving them as fast as you should be. Getting things done effectively requires optimizing your own productivity, and to do that, you have to spend some time sharpening your tools.

And we've done a fair bit of sharpening. One recent example is Iron, a new code review and release management system that we started using last summer. Last year, we also rolled out a new build system called Jenga, which greatly simplified and sped up the process of compiling our code. Plus, we switched to a new version of OCaml, which includes a big set of improvements, some of which were specifically aimed at improving our development process. And we funded some former interns to improve Merlin, a fantastic tool that provides IDE-like features like context-sensitive autocompletion in a way that can be easily integrated into multiple editors.

Jane Street is a pretty small shop --- we have fewer than 65 full-time developers --- but even at our modest scale, spending time on tools is a big win. But it's really about more than dev tools. Thinking about how to make the people around you more effective informs how you work more generally, changing how you design libraries, how you manage services, and how (and whether!) you write documentation.

And in addition to making for a more effective organization, it's also a more pleasant way to live.

13 Virtues

Very early on in his life, while on a lengthy voyage from London to Philadelphia, Ben Franklin created a system of thirteen virtues to live his life by. He spent the remainder of his days giving special focus to one virtue per week in a 13-week cycle, as well as noting the virtues he failed to live up to at the end of each day.

Over time he credited the system with making him more productive and more fulfilled.

My aspirations are not so lofty, but in the spirit of the new year, I present Ben's thirteen virtues as they relate to code review and discussion. Nothing here is meant to be taken as gospel, but together they give me a path towards the type of collaboration we value at Jane Street.

My simple hope is to waste less time, produce better code, and have fewer arguments over the next 12 months than I did over the last.

Temperance

Review thoroughly, but not to the point of exhaustion. Be mindful of the limits of review.

Silence

Say only things that clearly benefit the code; avoid trifling comments and tangents.

Order

Create the structure (files, modules and types) necessary to give every concept a clear place. Be suspicious of catchall modules.

Resolution

Respond to feedback and change requests quickly. Momentum is important.

Frugality

Don't waste people's time with frivolous review, careless comments, or code that isn't ready for review. Attention is expensive.

Industry

Prefer to respond with working code over additional commentary. Focus review on immediately productive outcomes instead of uncertain concerns.

Sincerity

Come to discussions with an innocent mind. Engage in code review with the clear goal of helping.

Justice

Weigh code decisions on the evidence at hand today, and not on personal preferences, prejudices, or obsolete past constraints.

Moderation

Avoid extremes in both style and approach. Incorporate strong views slowly.

Cleanliness

Spend time to lay out code in a clear and visually pleasing way. When reviewing, leave the code neater than you found it.

Tranquility

Don't become impassioned or incensed over trifles. Engage in all conversation with an open balanced tone and a sense of patience.

Chastity

Proliferate new ideas through the code base cautiously and be aware that even a good idea may not work in all places.

Humility

Take a modest view of your own contributions and feedback. Be unpretentious and respectful in your comments. Accept that you may be wrong.

Inspecting the Environment of a Running Process

Sometimes it's useful to be able to see the values of environment variables in running processes. We can use the following test program to see how well we can accomplish this:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        char *envstr;

        /* %ms asks scanf to allocate each whitespace-delimited string
         * for us (the older GNU spelling of this conversion was %as) */
        while (scanf("%ms", &envstr) == 1) {
                putenv(envstr);
        }
        return 0;
}

This program just reads strings from stdin and passes them on to putenv(3), so we have an easy way to modify our environment.

Now, let's run it with env -i to reset the environment to something pretty sparse:

[cperl@localhost ~]$ gcc -Wall t.c -o t
[cperl@localhost ~]$ env -i FOO=bar ./t

First, let's see what we can get out of /proc/{pid}/environ, as googling for this problem will undoubtedly point you in this direction (including ps eww, which reads /proc/{pid}/environ):

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

Great, so that looks like it's our answer!
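
If you'd rather do that read programmatically than through a shell pipeline, here's a minimal OCaml sketch (Linux-specific; it just splits the file on its NUL separators):

let environ_of_pid pid =
  let ic = open_in_bin (Printf.sprintf "/proc/%d/environ" pid) in
  let buf = Buffer.create 4096 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    let n = input ic chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  loop ();
  close_in ic;
  Buffer.contents buf
  |> String.split_on_char '\000'
  |> List.filter (fun s -> not (String.equal s ""))

let () =
  List.iter print_endline (environ_of_pid (int_of_string Sys.argv.(1)))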

Unfortunately, /proc/{pid}/environ only reflects the environment of the process when it started and does not reflect any calls that process might have made to putenv(3) or setenv(3) (you can experiment with the above program substituting in setenv(3) for putenv(3) and playing with overwrite to see what you get).

We can see that if we feed some data into our program, causing calls to putenv(3):

[cperl@localhost ~]$ env -i FOO=bar ./t
BAR=baz

And then check /proc/{pid}/environ again:

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

However, we can verify the data is really there if we attach with gdb and iterate over the environ(7) array directly:

[cperl@localhost ~]$ gdb ./t $(pgrep -x t)
...
(gdb) set $i = 0
(gdb) while (environ[$i] != 0)
 >print environ[$i++]
 >end
$1 = 0x7fffc8e42fec "FOO=bar"
$2 = 0x12d4080 "BAR=baz"

Unfortunately, I'm not aware of any other way to get this "dynamic environment info" (except for other ptrace-based solutions). Obviously attaching to production processes with gdb (or ptrace in general) isn't a great idea. Most of the time you'll probably be fine inspecting /proc/{pid}/environ and verifying (via source code inspection) that the process you care about doesn't make any calls to putenv(3) or setenv(3) for the variables whose values you are interested in.

If you have any better ideas about how to get this information, please share in the comments!

How to choose a teaching language

If you were teaching a programming course, what language would you teach it in?

I like this question because it has any number of good answers, each quite different from the other, and each emblematic of a different approach to what programming is about.

The first formal programming class I took was COS 217 at Princeton, taught by the excellent (and at the time, I thought, terrifying) Anne Rogers. The course was (and is) taught in C, and the intellectual approach of the class was to start from the machine. Instead of just learning to program in C, we learned about how the machines we were programming to worked. That was where I first encountered instruction counters, stack frames, registers and the memory hierarchy. It was exhilarating.

Where C encourages you to start with the machine, Scheme wants you to start at the mathematical foundations of computation. You don't need to know what the lambda calculus is to appreciate Scheme's slim core, and the way in which you can build a rich and vibrant computational world on top of it. That core is expressive enough that it makes it natural to introduce ideas that come up in multiple different languages, including functional and imperative techniques, object orientation, and logic programming.

The classic course in this vein is MIT's 6.001, also known as SICP, Structure and Interpretation of Computer Programs. Sadly, 6.001 has been retired at MIT, but the book lives on, and is a worthwhile read even if you took your last CS course years ago.

MIT replaced SICP with a course based on Python, and this reflects a broader trend. As was highlighted by an informal study by Philip Guo, lots of schools now teach Python, particularly for early introductory courses. I have mixed feelings about this choice. Python is a wonderfully friendly language, but that friendliness is bundled together with some problems.

This was made apparent to me in part by my experience with students who chose to code in their interviews in Python. In many ways, Python is the ideal interview language, since its concise and readable syntax makes the constraints of coding on the whiteboard more bearable. But what I saw was that students who learn Python often walk away with a rather rough model of the semantics of the language. You might be surprised at what fraction of students who have programmed extensively in Python can't guess how Python lists might be implemented, to say nothing of their ability to explain the semantics of language features like generators or decorators.

This isn't really a knock on Python. After all, there's something great about a tool that lets you get things done without fully understanding how it works. But in different ways, both Scheme and C encourage you to understand what's going on from the ground up, and there's a certain pedagogical power in that. All in, I think Python is a great choice for an early introductory course, particularly one meant for those who aren't going to end up as computer scientists or full-time programmers. But I'd be leery of using it outside of those circumstances.

A development that I'm personally rather encouraged by is the emergence of statically typed functional languages, ML in particular, as teaching tools. Over the last few years, I've had the pleasure of visiting and lecturing at a number of schools that teach using OCaml or SML, including Brown, Cornell, Penn, Princeton, CMU and Harvard.

ML has gained ground for good reasons. First, it shares much of Scheme's elegant intellectual foundations, even though its core isn't quite as charmingly minimalistic as Scheme's. But ML has a wider range than Scheme because it allows you to show students the role of types in programming. Despite that greater range, OCaml and SML are relatively simple languages, which matters even more for teaching than it does for everyday use.

The only choice I've seen a lot of that I can't reconcile myself to is Java. Java is of course a widely used industrial language, but that doesn't mean it's a good teaching language. In my view, a key requirement for a teaching language is simplicity, and all the other choices I mentioned are simple in one way or another: C is a minimal layer on top of the machine; Scheme and ML are based on simple mathematical models of computation; and Python is a language that people find simple to use.

Java isn't simple in any of these ways. It's not particularly easy to get started with, as indicated by all the things you need to tell students to ignore rather than understand. (Yeah, public static void main, I'm looking at you!) Nor does it have a simple and transparent execution model like C. And an elegant computational core calculus like the one at the heart of Scheme and ML is nowhere in sight. The only real advantage I see for Java is vocational, and that doesn't seem to me like argument enough.

The thing to consider when you're picking a language to teach is that you're not just picking a bit of infrastructure for your students to program with during the course. You're picking an intellectual frame through which they will see all the lessons you teach them. And you should pick that frame with care.

Interviewing At Jane Street

Welcome to our version of the seemingly obligatory post about technical interviews. This topic has been covered by a lot of people already, so I'm going to do my best to not repeat all of the copious advice already out there.

Like many companies, we are looking for extremely talented technical people, and we have a challenging interview process that we think does a good job of selecting people who will do well here.

That said, we know that we miss lots of good people too. Some of that is because of the awkwardness of the interview process itself: time is short, the questions are weirdly artificial, and, of course, people get nervous. It's made even worse by the fact that programming on a whiteboard, a web browser, or even just on a computer that isn't yours is a bit like playing classical guitar with mittens on. It can put even an accomplished person off their game.

Missing out on good people makes us sad.

That's what this post is for. We hope that by talking a bit about what we're looking for, the ways people do poorly, and how we think you might be able to prepare, that we'll reduce the mitten handicap - at least a bit.

What Are We Looking For?

From our perspective, the main thing we want to figure out when we interview someone is: are they someone we want to work with?

That seems obvious enough, but it's a point that can get lost in the puzzles and whiteboard coding of an interview. Really, we think of our interviews as little simulations of what it's like to work together with the candidate. And while at the time, it may seem to the candidate that the interview is all about solving the problem, it's really not. We're much more interested in learning about how you work than we are in whether you actually finish whatever problem we put in front of you.

It's not that technical skill is irrelevant --- far from it. But it's only part of the story. Just as important to us is whether we can have a productive and fun discussion with the candidate.

To that end, we try to avoid algorithm bingo and puzzles with clever "aha" solutions. We prefer more open-ended problems that have no single right answer, since they give us more freedom to work together, and to see how the candidates' thinking plays out.

That sounds nice enough, but it's a bit high-level and hand-wavy. So here's a more concrete list of suggestions that follow from our overall approach.

Be nice

The smartest, best person in the world won't get hired at Jane Street if they aren't courteous and pleasant to talk to most of the time. Almost nothing will end your interview faster than being rude, pushy, or obnoxious.

Remember, we are looking for someone to work with, not just someone who can win an argument.

Be clear

And by clear, we mean simple and to the point. Use words and examples that get the core idea across to the widest technical audience possible.

Avoid showy, highly technical or deeply obscure terms of art, especially if you don't fully understand them. In the best case we'll likely just ask exactly what you meant by "hylomorphism", which wastes precious time. In the worst case it will become clear that you should have said "metamorphism" instead, which is just embarrassing for everyone involved.

Know what you don't know

Naturally we like it when candidates are good at solving the problems we put in front of them. But just as important, perhaps more important, is their ability to think reasonably about their own level of understanding.

In other words, we really like people who can express appropriate levels of confidence: admitting ignorance when they're unsure, and speaking confidently when they have the basis to do so. At a basic level this means being willing to say, "I don't know" rather than guessing and hoping when we ask you about something you aren't familiar with.

Know your language

Code is a wonderful language for communicating complex ideas because it provides a clear, concise and unambiguous way of expressing them. But, like any foreign language, it takes a lot of time and practice to get really comfortable.

We need you to be comfortable with it, because we communicate ideas in code a lot.

Now, comfortable doesn't mean that you have to be the world's best coder, or that you need to have memorized your favorite algorithms book. It means that you should be able to read and write code in at least one language without constant access to reference materials for common things (a short example follows this list), such as:

  • Standard control structures (loops/if-then/etc.)
  • Function, module, class, type, etc. definitions
  • Common data types like arrays, lists, hash tables/maps/dictionaries
  • Exceptions and other error handling techniques
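
For OCaml, that baseline looks something like the following, written off the cuff with no documentation in reach:

type shape =
  | Circle of float
  | Rect of float * float

let area = function
  | Circle r -> Float.pi *. r *. r
  | Rect (w, h) -> w *. h

let () =
  (* common containers: lists, tuples, hash tables *)
  let shapes = [ "unit circle", Circle 1.0; "square", Rect (2.0, 2.0) ] in
  let areas = Hashtbl.create 8 in
  List.iter (fun (name, s) -> Hashtbl.replace areas name (area s)) shapes;
  (* standard control structures *)
  for i = 1 to List.length shapes do
    Printf.printf "shape %d of %d\n" i (List.length shapes)
  done;
  (* error handling without guesswork *)
  match Hashtbl.find_opt areas "square" with
  | Some a -> Printf.printf "square area = %.1f\n" a
  | None -> failwith "missing key"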

Also, pick coding tools you understand well. Don't pick a functional language to make us happy. We'd much prefer you use a language that you know well. Similarly, when picking which features of the language to use, pick the ones you understand best. We're not going to be impressed with your elegant use of Python decorators if you don't really understand the details of what they do.

In other words, pick a simple, clunky solution that you understand over a fancy, elegant one that you don't.

Remember CS 101

We've hired plenty of successful people who didn't have a traditional college background in CS, and we certainly don't require a master's or a PhD. That said, we need you to have a solid working knowledge of core computer science concepts, including:

  • Abstraction layers like functions, objects, and modules

  • Basic algorithms and data structures, including binary search, sorting, hashing, breadth/depth-first search, hashtables, binary trees and heaps.

  • Techniques for estimating CPU and memory costs, including big-O notation.

So if you can't for the life of you recall what amortized analysis is, and you can't nimbly bang out the code for a depth-first search, it's probably worth reviewing some of this material.
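
For instance, here's roughly the level at which you should be able to produce a depth-first search, sketched in OCaml over a Hashtbl-based adjacency list:

module Int_set = Set.Make (Int)

let dfs graph start =
  (* [graph] maps a node to its neighbors; unlisted nodes have none *)
  let neighbors n = try Hashtbl.find graph n with Not_found -> [] in
  let rec visit seen n =
    if Int_set.mem n seen then seen
    else List.fold_left visit (Int_set.add n seen) (neighbors n)
  in
  visit Int_set.empty start

let () =
  let g = Hashtbl.create 16 in
  List.iter (fun (n, ns) -> Hashtbl.add g n ns) [ 1, [ 2; 3 ]; 2, [ 4 ]; 3, [ 4 ] ];
  Int_set.iter (Printf.printf "visited %d\n") (dfs g 1)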

Think about real computers

Depending on your point of view it's either a sad or beautiful fact that the most elegant code can only run on top of the giant, complex, odd stack of parts and abstractions that is a real computer. Since we need programs to actually run, we need people who understand the inner workings of that behemoth.

Now, this doesn't mean that we quiz every candidate about deep kernel internals, or the differences between SDRAM vs SGRAM. But for some jobs in systems development we do expect a fair amount of detailed knowledge, and in general it's a big plus if in your thinking you can take into account things like cache effects, IO patterns, memory representations, and the capabilities of real CPUs.

What We Don't Look For

  • Perfection. Our questions are often designed to be open-ended enough that even the best people we've seen couldn't answer them fully in the time allotted. We want to keep the conversation going to learn everything we can, and we don't expect that you'll answer everything 100% perfectly.

  • We don't ask developers mental math, or math olympiad questions despite what you might have read online. Dev interviews are about programming.

  • We don't ask developers logic puzzles about pirates, people who only tell the truth, or which door the tiger is behind. Dev interviews are about programming.

How do people do poorly?

The most common issue is, of course, that some candidates just don't have the background for the job they are applying for. The solution to that is to learn more, and practice more. But there are a few other less obvious reasons that interviews with otherwise technically good people can go awry.

They're careless

One of the most common pieces of negative feedback from our interviewers is that the candidate was careless. This usually doesn't mean that the candidate didn't make progress, or didn't have good insights. It means that the candidate didn't think carefully and systematically about how their program might go wrong.

We care that you make progress, but we are rarely concerned about finishing a problem. In fact, many of the problems are designed to go on for far longer than the average interview length. It's better to step back and check your work carefully before you claim that it is finished than to rush.

They talk more than they code

Some candidates do a good job of talking through the problem and explaining their thinking, but never quite get around to concretely answering the question. We want to hear your ideas, but we also want to see you produce concrete solutions, which almost always involves writing code. There are lots of things that we can't learn about a candidate without seeing their code and talking through the details.

Take some time at the beginning to think about the solution and to talk about your plans, but make sure you start writing code - even if it isn't the code you would write if you had more time.

They don't generalize

We try to keep our interviews interactive, and we'll often stop candidates to ask about something they have just done, or to point out something that we think might be confusing or incorrect. We understand that we've seen these problems over and over again, and that you are coming to them fresh, so you shouldn't worry just because we've found a problem with your solution.

What you should do is generalize the advice. If we point out that you missed a case, consider other cases you might have missed. If we remind you of an invariant you forgot, find a way to protect yourself from making the mistake in other places in your code.

They say one thing and do another

We love it when a plan comes together, but it's extra hard to watch when a good plan falls apart on execution. If you hear a question, and discuss a plan of attack with your interviewer, do what you claim you will do. Don't change the plan in the middle, or drop it entirely in favor of a better idea without some discussion. You have a very limited amount of time to describe your solution in code, and executing a decent plan well is better than producing a Frankenstein's monster of 3 different plans that doesn't quite come to life.

If you do get partway through and start to lose faith, step back and talk about it. Explain exactly why you are concerned, and whether you think it might be fatally flawed at the core, or just not ideal. If there really is a fatal flaw and you've seen it, we'll help you get out of the jam, and we'll appreciate that you articulated it. If it's just not quite perfect we'll likely encourage you to continue.

So, what can you do to prepare?

This part is short and sweet. Build something - from scratch - in a language you like. Don't stop short. Build the whole thing.

Now, show it to the smartest people you know, get feedback, tear it down and build it again with what you've learned.

Repeat with a new problem.

We are looking for people to build real things with us, and practice really does make perfect.

What the interns have wrought: RPC_parallel and Core_profiler

We're in the midst of intern hiring season, and so we get a lot of questions about what it's like to be an intern at Jane Street. One of the things people most want to know is what kind of projects they might work on as an intern.

That's of course hard to answer precisely, since we don't yet know what next summer's intern projects are going to be. But you can get a sense by seeing some of what interns have done in the past. To that end, I thought I'd describe a couple of intern projects that were done this past summer.

Rpc_parallel

Rpc_parallel is a new library written by Todd Lubin (who will be returning to Jane Street full-time next fall), aimed at making parallel programming in OCaml simpler and safer.

Writing programs that take advantage of multiple cores, and even of multiple computers, is of course a necessity. In OCaml, such parallelism is typically achieved by writing multiple single-threaded programs that use message passing to coordinate.

We have lots of parallel message-passing systems, but while these share a lot of infrastructure (notably, the Async_rpc library), we've never really settled on a lightweight way to construct complete parallel programs. Instead, each such program tends to have its own little process coordination framework.

A while back, we wrote a library called Async_parallel (described here). Async_parallel does a lot of things right -- in particular, it makes it easy to spin up and manage new processes and distribute work between them.

But Async_parallel has a problem. It is based on OCaml's marshal facility, which allows you to serialize arbitrary data, including functions, between processes. This sounds great, but it has a dark side: marshalling arbitrary functions turns out to be error prone. In particular, in order to send a function over the wire, you also have to send a copy of any data that that function implicitly depends on.

Unfortunately, it's hard to know what kind of data is hidden behind a function, which can cause a few different forms of trouble: it might send much more data than you expect, it might fail unexpectedly if it hits one of the forms of data that can't be marshalled, or it might lead to crazy and hard-to-predict behavior if some of the data required by the function is meant to be mutable shared state.
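
Here's a toy demonstration of the first of those problems. The closure below needs only the array's length, but marshalling it drags the entire million-element array along (on a standard build, this prints a size in the megabytes):

let () =
  let big = Array.make 1_000_000 0 in
  let f () = Array.length big in
  (* [Marshal.Closures] permits serializing functions; the closure's
     environment, including [big], comes along for the ride *)
  let s = Marshal.to_string f [ Marshal.Closures ] in
  Printf.printf "serialized closure: %d bytes\n" (String.length s)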

Instead, we wanted a library that was more typeful and constrained in terms of what was sent over the wire. This pushed us towards a design where, at the cost of some extra verbosity, we explicitly declare the types of data that are sent. In exchange, we get a system that is easier to understand and debug.

One of the great things about Rpc_parallel is how fast it came together. Todd got it into a sufficiently good state by the middle of the summer that he was able to use it for his other projects (interns typically have at least two major projects over the course of the summer).

Rpc_parallel also benefitted from some world-hopping collaboration. Interns spend at least a week in an office other than their home office, and Todd ended up visiting Hong Kong. While there, he ended up spending a lot of time talking and working with Chengqi Song, who had a lot of experience with Async_parallel. Out of those discussions came a complete redesign and rewrite of the library, factoring out the core primitives for coordinating across multiple processes, and making the overall library simpler and more general.

By the end of the summer, a few other people had picked it up and started using it for other projects, and last week it was released open source, so you can take a look at it yourself on GitHub.

Core_profiler

Profiling is surprisingly hard, and as such it's perhaps unsurprising that there are lots of ways of doing it. If you want to understand the cost of an individual operation, you might want a micro-benchmarking library like Haskell's Criterion, or our own Core_bench. If you're trying to understand properties of a whole application, like which lines of code it's spending its time on, or how many cache-misses are occurring, you might want to use something like Linux's perf tools, which use CPU-level counters to efficiently gather profiling statistics from your program.

Another useful technique is for the programmer to modify the source to add explicit probes that keep track of when certain program points are reached, and write out that information to a log that can be analyzed after the fact.

Daniel Richman (who will be returning for an internship next summer) worked along with Roshan James (formerly an intern himself, now full-time) on a library called Core_profiler, which aims to make the use of such probes easy and efficient. Efficiency matters quite a bit, because if logging a probe takes more time than the thing you're trying to measure, you basically can't extract reliable data. Keeping the overhead small, therefore, is a must.

Accordingly, a lot of Daniel's time was spent thinking very carefully about how to write the probes in a way that would only minimally disrupt the execution of the program. He started with a simple but less efficient library, called Stats_reporting, which took about 60ns and two words of allocation per probe, and machined it down from there.

The first step was to avoid all allocation, which meant we could no longer use bin-prot, our standard binary serialization technique, since bin-prot requires that you allocate an OCaml object representing the data to be serialized. So they moved to using an internal library called Protogen for generating zero-allocation serialization code.

That brought us down to about 30ns, and zero allocations. We then decided to try out writing our own hand-rolled binary protocol, so we could have a yet-more-closely optimized binary layout. That brought us down to about 20-25ns. The next hack was to customize the binary format yet more by packing multiple values into a single, 63-bit OCaml integer. That, plus some cleverness to make sure that writes were word-aligned, brought the cost of a simple probe down to 12ns. In addition to the design changes, Daniel also spent a lot of time carefully examining the assembly code, to make sure that there were no surprises from the code generator.
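
To give a flavor of that last trick, here's an illustration of the approach (not Core_profiler's actual format): pack a small probe id together with a timestamp delta into one immediate 63-bit int, so recording a probe allocates nothing.

(* 15 bits of probe id in the high bits, 47 bits of delta below *)
let pack ~probe_id ~delta =
  assert (probe_id land 0x7fff = probe_id);
  (probe_id lsl 47) lor (delta land ((1 lsl 47) - 1))

let probe_id packed = packed lsr 47
let delta packed = packed land ((1 lsl 47) - 1)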

We're pretty happy with the end result. We think that Core_profiler probes are now a good bit cheaper than the dynamic probes you can insert using a system like DTrace, and are in any case the best tool we have for tracking the performance of even relatively quick code-paths. And, as a nice bonus to all this optimization, the offline processing got about twice as fast as a side effect of the runtime improvements.


There's more to be said about both of these projects, and about the many other projects that were done this summer. If you're interested in applying, you can do so here:

http://janestreet.com

You don't need to know anything about finance or functional programming to apply. But if you come, you'll learn a ton about both by the time you're done.

What is gained and lost with 63-bit integers?

Almost every programming language uses 64-bit integers on typical modern Intel machines. OCaml uses a special 63-bit representation. How does it affect OCaml?

OCaml int memory representation

Most of OCaml's types are represented in memory as a header followed by data. The header is a 64-bit integer containing the length of the data and a tag, which is a rough classification of the type. The only OCaml types that differ from this are ints and sometimes floats.

Floats normally have a header and data, the data being the value of the float itself. This representation is called "boxed". If a record field is a float, the record's data will actually contain a pointer to the float's data. The only exceptions are float arrays and records containing only floats, whose data contains the float values directly instead of pointers. This representation is called "unboxed".

Values of type int are never stored as header and data (boxed). An int x is stored as (x << 1) | 1, where << is left shift and | is bitwise or, so its least significant bit is always set. Pointers are word-aligned, so they never have this bit set, which is how ints and pointers are told apart. It is assumed that much of typical data consists of integers, so this representation is used to significantly improve performance:

  • there's no need to dereference a pointer when getting an int
  • no memory allocation is needed when creating ints
  • less work for the garbage collector
  • less memory fragmentation
  • no memory is needed for int headers

Distinguishing whether a value is an int or a pointer is as simple as testing x & 1, so this feature doesn't slow down the garbage collector, polymorphic hash, polymorphic compare, or whatever else structurally inspects data. One should note that this doesn't apply to the types int32 and int64, which are always boxed.
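
You can observe this representation from the OCaml side with the Obj module (fine for poking around, though not something to reach for in production code):

let () =
  assert (Obj.is_int (Obj.repr 42));           (* int: immediate, unboxed *)
  assert (Obj.is_block (Obj.repr 42L));        (* int64: boxed *)
  assert (Obj.is_block (Obj.repr 42l));        (* int32: boxed *)
  assert (Obj.is_block (Obj.repr "forty-two")) (* ordinary heap value *)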

Penalty

Having the extra bit comes with a price - arithmetic operations are more complicated. For example:

  • x + y is translated to CPU instructions x + y - 1
  • x * y is translated to CPU instructions (x >> 1) * (y - 1) + 1
  • x / y is translated to CPU instructions (((x >> 1) / (y >> 1)) << 1) + 1
  • x lsl y is translated to CPU instructions ((x - 1) << (y >> 1)) + 1

Sometimes this penalty is small or nonexistent. For instance there's no need to fix the bit in x + y - z. Only one bit fixing is needed for all five additions in x + y + z + w + u + v.
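
You can check this yourself: put the following in sum.ml, compile with "ocamlopt -S sum.ml", and the generated assembly contains a single constant adjustment for the whole chain, not one fix-up per addition.

let sum6 x y z w u v = x + y + z + w + u + v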

Another help is the Intel CPU instruction LEA, which can compute the sum of three integers in a single instruction, like x + y - 1. Unfortunately, the three-argument form of LEA became very slow in recent generations of CPUs, and Intel gives no indication that this will change.

This benchmark (test.ml) tries to estimate the difference in performance. The results from Sandy Bridge show about a 2x speed difference in arithmetic operations. The assembly can be examined by compiling with "ocamlopt -S test.ml".

operation         63-bit (ns)  64-bit (ns)  slowdown
add, independent  0.327588     0.121502     2.696156
add, dependent    0.328160     0.169375     1.937477
mul, independent  0.614824     0.328060     1.874120
mul, dependent    1.094343     0.328872     3.327565
lsl, independent  0.359828     0.166088     2.166489
lsl, dependent    0.762251     0.177468     4.295151
xor, independent  0.249350     0.122900     2.028886
xor, dependent    0.404255     0.170380     2.372668

Agner's instruction tables show that the difference is even bigger with later generations of CPUs. For instance, Haswell can do four integer adds per cycle versus one LEA.

Conclusion

The benefits of unboxed ints are amazing. On the other hand, arithmetic operations are significantly slower. How much do arithmetic operations affect an average program? Could we have a solution which would keep ints unboxed but have fast arithmetic operations?

Clearly Failing

The Parable Of The Perfect Connection

Every programmer in the Intertube-connected era eventually has to write, or at least use, an API for a network service - something like a database, a message queue, or a web service. And each and every one of them begins with the enthusiasm of the recently inducted as they realize that they can reach out their hand and control something else. And each and every one of them experiences that moment of frustration and anger when they realize that their buddy out in cyberspace is a bit of a flake.

Now, we aren't talking about a seriously unreliable friend. In fact, your buddy isn't really unreliable at all. He's there 99.9% of the time, and even when he's out for a quick coffee break he tends to come back quickly. Besides, you don't have any real control over him. He's maintained by some other people in a bunker far far away. Those people are confusing, hard to reach, and don't seem to care about your problems. So, you do what countless programmers have done in the past...

You write a loop.

let rec connect_until_success host_and_port =
  connect host_and_port
  >>= function
  | Ok t -> return t
  | Error _ ->
    (* wait a bit, then try again, forever *)
    after (sec 5.)
    >>= fun () ->
    connect_until_success host_and_port

Because you are feeling helpful you enshrine the loop in an API to help other people, because, after all, your buddy is pretty reliable, and it would be a shame if other people had to deal with all the nasty complexity that you've just programmed away.

There are a lot of classic twists and variations on this core storyline:

  • count the number of failures and give up after x tries (x is usually 3 or 1)

  • back off exponentially so you don't "hammer" the service

  • don't wait at all and actually hammer the service in a tight loop because latency is important

  • log the error, because someone will look at the logs carefully. Then retry.

  • keep careful track of the time of the last failure, and always retry, unless the last retry was "recent", because one blip makes sense but not two.

  • return an elaborate type that encompasses all possible failure modes, including the fact that we are retrying. Maybe deliver that information in a side channel stream of updates.

  • forget giving people a connect method at all. Just give them a query method and handle the pesky connection details away from prying eyes. You get bonus points if the API doesn't look like you can ever fail.

Hidden Failure Is Still Just Failure

Sadly, the problem isn't in the cleverness of the technical wizardry you use to cover up for your buddy, it's the fact that covering up failure is just another form of failing.

The connection failed. Not telling the world outside of your API is like hiding a bad grade from your parents. They might not catch you once or twice, but you still got the bad grade, and eventually they are going to notice that something is very very wrong - likely after things have really gone off the rails.

Which leads us to three useful principles of failure that apply to self-healing network connections, and most other failure besides.

Fail Quickly, Clearly, and Cleanly

When you design an API, or a system, or even a big complex collection of systems, and you think about how it should fail, make sure that the failure is:

  • Quick: Taking too long to fail is a cardinal sin. Don't retry a thousand times, don't get an hour deep into a computation only to realize that one of the config parameters is bad, and don't forget to add a timeout when the other side might never respond. The sooner you can tell the outside world that you have failed the sooner it can react.

  • Clear: Make sure that your failure behavior is clear, well documented, and can't be missed in a decently written program. It should be obvious from a read of the API and shouldn't require a dip into the underlying code to understand. Beyond that, don't mumble when you fail (I'm looking at you, errno in C). Similarly, don't go on about all the little nuances surrounding your failure with a 20-case variant response. Most API consumers only care about the binary state of failure in the code. The details are generally uninteresting outside of debug logs and human readable messages.

  • Clean: Clean up anything and everything you can after you fail, as aggressively as you can. That means close your file descriptors, free your memory, kill your child process. Work harder than normal to make the cleanup portion of your code simple and obviously correct. But still remember to be quick. Do your work after you tell everyone that you have failed if there is any chance that you won't succeed. Don't be that function/program/system that never responds again because it hung trying to clean up before it reported the error.

How Should It Look?

Something like the following API, comments and all.

This makes heavy use of some nice things from our publicly released libraries. If you aren't already familiar with them you can take a deeper look here.

If you want the TLDR version, you really only need to understand Deferred and Or_error to get the gist.

A Deferred is a value that will get filled in at some point in the future (these are sometimes called promises), and when you read it here it just means that the function doesn't return immediately - usually because some network communication needs to happen to get the result.

Or_error is a fancy way of saying, "this might work, or it might give you an error". Returning an Or_error forces the caller to check for an error case in a very clear and explicit way. It's our standard way in an API to indicate that a function might not succeed because, unlike a comment about an exception that might be thrown, or a special return value (like NULL), Or_error can't be missed.

So, if you see something like:

response Or_error.t Deferred.t

You can read it as, "this won't return immediately, and when it does it will either be an error, or a response".

type t

(** connect to the service, returning t or an error if the connection could not
    be established. *)
val connect : ?timeout:Time.Span.t -> ... -> t Or_error.t Deferred.t

(** a simple helper function that calls connect with the original parameters.
    The passed in t is always closed when reconnect is called.  Multiple calls
    to reconnect on the same t will result in multiple connections. *)
val reconnect : t -> t Or_error.t Deferred.t

(** connects to the service and runs the provided function if successful.
    If the connection fails or [f] raises, an Error is returned.  [close] is
    automatically called on [t] when [f] completes or raises. *)
val with_t
  :  ?timeout:Time.Span.t
  -> ...
  -> f:(t -> 'a Deferred.t)
  -> 'a Or_error.t Deferred.t

(** If timeout is not given it defaults to a sensible value. *)
val query : t -> ?timeout:Time.Span.t -> ... -> response Or_error.t Deferred.t

val query_exn : t -> ?timeout:Time.Span.t -> ... -> response Deferred.t

(** If timeout is not given it defaults to a sensible value.  The returned
    reader will be closed when the underlying connection is closed, either by
    choice or error.  It is a good idea for the update type to express the closed
    error to differentiate a normal close from an error close.  *)
val pipe_query
  :  t
  -> ?timeout:Time.Span.t
  -> ...
  -> update Pipe.Reader.t Or_error.t Deferred.t

val pipe_query_exn : t -> ?timeout:Time.Span.t -> ... -> update Pipe.Reader.t Deferred.t

(** close is idempotent and may be called many times.  It will never raise or
    block.  Once close has been called all future queries will return Error
    immediately.  A query in flight will return error as soon as possible. *)
val close : t -> unit

(** fulfilled when t is closed for any reason *)
val closed : t -> unit Deferred.t

(** closed is an error state.  Once a connection is in an error state it will
    never recover. *)
val state : t -> unit Or_error.t
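
And here's roughly what calling it looks like. Since the [...] parameters are elided above, [my_query] and [handle_response] here are hypothetical:

let run host_and_port =
  with_t ~timeout:(Time.Span.of_sec 10.) host_and_port
    ~f:(fun t -> query_exn t my_query)
  >>| function
  | Ok response -> handle_response response
  | Error err ->
    (* the failure is immediate, obvious, and can't be ignored *)
    eprintf "query failed: %s\n" (Error.to_string_hum err)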

Seriously, Never?

Up until now I've been making the case for try once, fail quickly and clearly, and I think that much, if not most of the time, it's the argument that should hold. But the world is a complex place. Sometimes things fail, and somebody somewhere has to try again. So where should that happen, and what should we consider when we start talking about retry logic?

How will this stack?

Loops stack poorly and lead to confusing non-linear behavior. This means that you should usually confine retry logic to a component near the bottom or the top of your stack of abstractions. Near the bottom is nice because, like TCP, everyone can rely on the behavior. Near the top is nice because you have the most knowledge of the whole system there and can tune the behavior appropriately. Most network service APIs are in the middle somewhere.

Can I opt out?

TCP sits on top of IP and provides a solid retry mechanism that works really well for most of the world, but it would be a mistake in design to only expose the TCP stack. If you are going to provide a self-healing connection/query system as part of your API, make sure to build and expose the low-level simple API too. This lets clients with needs you didn't anticipate interact in the way that they want.

Love shouldn't be forever

It's more likely to be a mistake to try forever than to retry once, or for a set period of time. It's one thing to protect a client against a transient failure, but when the transient error lasts for minutes or hours, it's probably time to give up.
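
If you do retry, bound it explicitly. Here's a sketch in the style of the loop from the top of the post, except that it gives up once a deadline passes and reports the failure:

let rec connect_until_deadline host_and_port ~deadline =
  connect host_and_port
  >>= function
  | Ok _ as ok -> return ok
  | Error _ as err ->
    if Time.( >= ) (Time.now ()) deadline
    then return err  (* the "transient" failure has become persistent *)
    else
      after (sec 5.)
      >>= fun () ->
      connect_until_deadline host_and_port ~deadline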

Your resource usage should be bounded

Loops, especially loops that create and clean up resources, have a tendency to consume more than their fair share. This is especially true when the loop is trying to cover for an error case, where things like resource cleanup might not work entirely as advertised. So, it's on the writer of a loop to test it heavily and to have strong bounds on how much CPU, memory, file handles, bound ports, etc. a single self-healing connection can take. Getting this right is hard, and you should be nervous about doing it quickly.

How bad is failure?

It's much easier to justify a looping retry if it's the only thing keeping a large complex system from crashing completely, and it's correspondingly harder to justify when it covers just one more case that any client needs to deal with anyway. For instance, a retry loop on my database connection might cleanly cover the occasional intermittent outage, but there are probably real reasons that the database might be out (network failure, bad credentials, maintenance window), and my program likely has to handle those cases well anyway.

Not all failure is created equal

Some failures justify a retry. Some failures don't. It's important in retry logic to avoid big try/with blocks that catch any and every error on the assumption that any query or connection will eventually succeed. Retrying because my connection closed is different from retrying a malformed query. Sadly you can't always tell the difference between the two cases, but that doesn't mean you shouldn't make an effort.

You still have to consider failure

You can use a retry loop to limit errors above a certain abstraction boundary, or to limit the impact of small glitches, but you can't recover gracefully from all of the errors all of the time. When you add a retry loop to your system at any level, stop to consider what should happen when the error is a real error and isn't transient. Who is going to see it? What should they do about it? What state will clients be in?

It's easier to solve a specific problem than a general one

It's much easier to come up with retry logic that makes sense for a particular application in a particular environment than it is to come up with retry logic that is generically good for all clients. This should push you to confine retry logic to clients/APIs that have a single well-considered role, and to keep it out of APIs that may be used in many different contexts.

Quick, Clear, and Clean still (mostly) apply

Even when you are considering retry logic, make sure you think about getting stuck (quick), getting debug information about your state to the outside world (clear), and keeping resource usage bounded (clean).

The ML Workshop looks fantastic

I'm a little biased, by being on the steering committee, but this year's ML workshop looks really interesting. Here's a link to the program:

http://okmij.org/ftp/ML/ML14.html

It has a bunch of interesting papers, including work on cleaning up and simplifying first-class modules, type-level module aliases, new approaches in type-directed test generation, a couple of papers on implicits (how they've worked out in Scala, and a proposed approach in OCaml), and more.

The OCaml users-and-developers meeting is also looking good, though for me there's a clear favorite among the presentations: OCaml Labs' announcement of version 1 of the OCaml Platform. I think that's a pretty important milestone for the language.

All told, it's shaping up to be a good ICFP.
