Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but makes for a less pleasant coding experience (and code that is trickier and harder to reason about.) So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap is incremental, meaning it can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.

Clocking

One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right, since if you run them when the minor heap is too small, then you'll end up accidentally promoting a rather large fraction of your minor allocations.
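
For reference, here's roughly what driving the collector by hand looks like against the stock runtime, using the standard Gc module. This is a minimal sketch; as noted above, a forced slice isn't credited by the collector's accounting, and it always begins with a minor collection.

  (* Explicitly driving the stock collector via the standard Gc module. *)
  let force_minor_collection () = Gc.minor ()

  let force_major_slice () =
    (* The argument is a requested slice size in words; 0 lets the runtime
       pick a size.  In the stock runtime this also triggers a minor
       collection first, and the work done here isn't subtracted from
       future automatic slices. *)
    ignore (Gc.major_slice 0 : int)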

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do a correspondingly large chunk of major collection work right away. All of that work does need to be done eventually, but it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record every phase of the collection process in detail: which phase (marking vs. sweeping) and sub-phase the collector is in, along with a collection of useful counters. This is available in the instrument branch.

Aging

Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, choosing either a minor collection or a major slice. A minor collection would only be chosen when the minor heap is bigger than some configured level, to avoid too much false promotion; but a major slice can be done at any time.
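
Here's a sketch of what such a job might look like. The appears_busy and minor_heap_is_large functions are hypothetical, application-supplied heuristics, and a real system would hang this off its own scheduler rather than a bare polling loop:

  (* A sketch of an application-level GC pacing job.  [appears_busy] and
     [minor_heap_is_large] are hypothetical heuristics supplied by the
     application. *)
  let rec gc_pacing_loop ~appears_busy ~minor_heap_is_large =
    if not (appears_busy ()) then begin
      (* Only run a minor collection when the minor heap has grown past a
         configured threshold, to limit false promotion; a major slice can
         run at any time. *)
      if minor_heap_is_large () then Gc.minor ()
      else ignore (Gc.major_slice 0 : int)
    end;
    Unix.sleepf 0.05;  (* check again in 50ms *)
    gc_pacing_loop ~appears_busy ~minor_heap_is_large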

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets to keep track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially pushing the bucket into negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
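
To make the bucketing concrete, here's a toy OCaml model of the two behaviors just described. The real logic lives in the runtime's C code; the names and structure here are purely illustrative.

  module Work_buckets = struct
    type t = { buckets : float array; mutable head : int }

    let create n = { buckets = Array.make n 0.; head = 0 }

    (* An application-forced slice is credited against the front bucket,
       potentially pushing it negative. *)
    let credit_forced_slice t work =
      t.buckets.(t.head) <- t.buckets.(t.head) -. work

    (* At each automatic slice: do the front bucket's work (up to
       [max_work]), then spread whatever is left over --- positive or
       negative --- uniformly across the next n buckets. *)
    let take_slice t ~max_work =
      let n = Array.length t.buckets in
      let wanted = t.buckets.(t.head) in
      let done_now = if wanted > 0. then min wanted max_work else 0. in
      let leftover = wanted -. done_now in
      t.buckets.(t.head) <- 0.;
      t.head <- (t.head + 1) mod n;
      for i = 0 to n - 1 do
        let j = (t.head + i) mod n in
        t.buckets.(j) <- t.buckets.(j) +. (leftover /. float_of_int n)
      done;
      done_now
  end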

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free lists, for a configurable set of sizes; e.g., one could have a different free list for blocks with 1, 2, 3 and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.
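
To give a sense of the intended structure, here's a toy OCaml model of size-segregated free lists. The real implementation is C code inside the runtime, so treat this purely as an illustration.

  module Segmented_free_lists = struct
    (* Stand-in for a real heap block. *)
    type block = { addr : int; size_in_words : int }

    type t =
      { small : block list array   (* index i holds blocks of i+1 words *)
      ; mutable large : block list (* fallback for bigger blocks *)
      }

    let create ~num_small_sizes =
      { small = Array.make num_small_sizes []; large = [] }

    let free t block =
      let n = Array.length t.small in
      if block.size_in_words <= n
      then t.small.(block.size_in_words - 1)
             <- block :: t.small.(block.size_in_words - 1)
      else t.large <- block :: t.large

    (* Allocating a small block is a constant-time pop from the matching
       list; only larger requests need to search the general list. *)
    let alloc_small t ~size_in_words =
      match t.small.(size_in_words - 1) with
      | [] -> None
      | block :: rest -> t.small.(size_in_words - 1) <- rest; Some block
  end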

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between allocating OCaml data and passing it back, the C code has to make sure that OCaml's garbage collector doesn't collect it, since the garbage collector can be triggered during the C call. There's an intricate mechanism which ensures this, part of which is the CAMLparam, CAMLlocal and CAMLreturn macros.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data, which can yield performance benefits, especially for short functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare for floats is not smart enough to avoid branch mispredictions. Moving the comparison to C speeds it up a little bit; "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}
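
As a quick usage sketch, here is what calling the external above looks like; the expected results follow from the formula in the stub, with the (f == f) - (g == g) terms ordering NaNs below ordinary floats.

  let () =
    assert (float_compare_noalloc 1.0 2.0 = -1);
    assert (float_compare_noalloc 2.0 1.0 = 1);
    assert (float_compare_noalloc nan 1.0 = -1)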

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be "double", the CAMLparam, CAMLlocal and CAMLreturn macros should be avoided, and the external declaration should include "float" as well as both a bytecode and a native-code implementation:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so "float" can't be used directly. Is it still possible to avoid boxing? The answer is yes: return the result as a float and convert it to an int on the OCaml side.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)

Why GADTs matter for performance

When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn't particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.

Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn't seem particularly relevant for the kind of systems programming that I think about.

But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we've found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.

The Problem of Polymorphism

I'd like to walk through a simple example that illustrates this aspect of GADTs, but first, a few words about OCaml's memory representation. OCaml's polymorphism is, in an important way, built on top of that memory representation. In particular, consider a simple polymorphic function like List.iter, which has the following type:

  val iter : 'a list -> f:('a -> unit) -> unit

The polymorphic type tells you that List.iter can operate on lists of any type, and in OCaml, this is achieved with a single compiled version of the code. This is possible because the memory representation of the elements of a list is uniform: you can always refer to an OCaml value in a single word, either as a pointer to a heap-allocated value, or as an immediate that fits inside that word.

That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There's actually some special magic in the compiler for float arrays, though this is probably more trouble than it's worth, as described by Alain Frisch here. But let's ignore float arrays for now.)

OCaml does have a tighter representation for byte arrays, called bytes. But it's a completely different type, and so building a general-purpose data structure that uses bytes when it would make sense, and ordinary arrays otherwise, is a little awkward.

Controlling memory representation without GADTs

Let's see what happens if we try to design (without GADTs) an array type that sometimes uses the general array representation and sometimes uses bytes.

You could imagine representing such a value using an ordinary variant.

  type 'a t =
    | Array of 'a array
    | Bytes of bytes

We could then implement each of the operations we want on our new array type, implementing each operation differently depending on the particular representation. Let's see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.

  > module Compact_array = struct

      type 'a t =
        | Array of 'a array
        | Bytes of bytes

      let of_bytes x : char t = Bytes x
      let of_array x = Array x

      let length = function
        | Array a -> Array.length a
        | Bytes s -> Bytes.length s

      let get t i =
        match t with
        | Array a -> a.(i)
        | Bytes s -> Bytes.get s i

      let set t i v =
        match t with
        | Array a -> a.(i) <- v
        | Bytes s -> Bytes.set s i v

    end;;

  module Compact_array :
    sig
      type 'a t = Array of 'a array | Bytes of bytes
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : char t -> int -> char
      val set : char t -> int -> char -> unit
    end

This seems pretty good at first glance, but the inferred types aren't quite what we want. In particular, get and set only work with Compact_arrays containing characters. Given how type inference works, that's not really all that surprising. Consider the code for get:

  let get t i =
    match t with
    | Array a -> Array.get a i
    | Bytes s -> Bytes.get s i

The OCaml compiler is looking for a single type to assign to the return value for all the cases of the match. Given that Bytes.get always returns a char, Compact_array.get will be restricted to only returning a char.

One way to work around this problem is to essentially implement what we want as a poor-man's object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here's how that looks.

  module Compact_array = struct

    type 'a t = { len: unit -> int
                ; get: int -> 'a
                ; set: int -> 'a -> unit
                }

    let of_bytes s =
      { len = (fun () -> Bytes.length s)
      ; get = (fun i -> Bytes.get s i)
      ; set = (fun i x -> Bytes.set s i x)
      }

    let of_array a =
      { len = (fun () -> Array.length a)
      ; get = (fun i -> Array.get a i)
      ; set = (fun i x -> Array.set a i x)
      }

    let length t = t.len ()
    let get t i = t.get i
    let set t i x = t.set i x

  end;;

  module Compact_array :
    sig
      type 'a t = {
        len : unit -> int;
        get : int -> 'a;
        set : int -> 'a -> unit;
      }
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : 'a t -> int -> 'a
      val set : 'a t -> int -> 'a -> unit
    end

This more or less solves the problem, but it's still not really the memory representation we want. In particular, we have to allocate three closures for each Compact_array.t, and this number of closures will only go up as we add more functions whose behavior depends on the underlying array.

GADTs to the rescue

Let's go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we're not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.

  type 'a t =
    | Array : 'a array -> 'a t
    | Bytes : bytes -> 'a t

The syntax of this declaration suggests thinking about variant constructors like Array or Bytes as functions from the constructor arguments to the type of the resulting value, with the thing to the right of the : roughly corresponding to the type signature of the constructor.

Note that for the Array constructor, the value of the type parameter 'a depends on the type of the argument:

  > Array [|1;2;3|];;
  - : int t = Array [|1; 2; 3|]
  > Array [|"one";"two";"three"|];;
  - : string t = Array [|"one"; "two"; "three"|]

But for the Bytes constructor, the type parameter 'a is left free.

  > Bytes "foo";;
  - : 'a t = Bytes "foo"

This is really the problematic case: for Bytes "foo", we'd like the parameter 'a to be char, since in the Bytes case, that's the element type of our array.

Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.

  type 'a t =
    | Array : 'a array -> 'a t
    | Bytes : bytes -> char t

Now, the Bytes constructor behaves as we'd like it to.

  > Bytes "foo";;
  - : char t = Bytes "foo"

Now let's see what happens when we try to write the length function.

  > let length t =
      match t with
      | Bytes b -> Bytes.length b
      | Array a -> Array.length a
    ;;
  val length : char t -> int = <fun>

Disappointingly, we're again stuck with a function that doesn't have the right type. In particular, the compiler has decided that this function can only operate on char t, when we want it to work for arrays of any type.

But the problem now is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml's type inference insists on giving a type variable a single, consistent instantiation within a given function. But in this case, we need a type variable that can take on different types in different branches of a match statement.

We can do this by creating a locally-abstract type el to represent the type parameter of t (and the element type), and annotating t accordingly.

  > let length (type el) (t : el t) =
      match t with
      | Bytes b -> Bytes.length b
      | Array a -> Array.length a
    ;;
  val length : 'a t -> int = <fun>

Now we see that we get the right type. We can push this approach through to get a complete implementation.

  > module Compact_array = struct

      type 'a t =
        | Array : 'a array -> 'a t
        | Bytes : bytes -> char t

      let of_bytes x = Bytes x
      let of_array x = Array x

      let length (type el) (t : el t) =
        match t with
        | Array a -> Array.length a
        | Bytes s -> Bytes.length s

      let get (type el) (t : el t) i : el =
        match t with
        | Array a -> Array.get a i
        | Bytes s -> Bytes.get s i

      let set (type el) (t : el t) i (v : el) =
        match t with
        | Array a -> Array.set a i v
        | Bytes s -> Bytes.set s i v

    end;;

  module Compact_array :
    sig
      type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : 'a t -> int -> 'a
      val set : 'a t -> int -> 'a -> unit
    end
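
Here's a small usage sketch to show the payoff: the same length and get work across both representations, and the element types come out right.

  let () =
    let chars = Compact_array.of_bytes (Bytes.of_string "hello") in
    let ints  = Compact_array.of_array [| 1; 2; 3 |] in
    (* [get] returns a [char] for the bytes-backed array... *)
    assert (Compact_array.get chars 0 = 'h');
    (* ...and an [int] for the array-backed one. *)
    assert (Compact_array.get ints 2 = 3);
    Printf.printf "%d %d\n"
      (Compact_array.length chars) (Compact_array.length ints)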

As I said at the beginning, this is really just an example of the more general theme. GADTs are about more than clever typed interpreters; they're a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high performance applications.

A lighter Core

We recently released a version of our open source libraries with a much anticipated change --- Async_kernel, the heart of the Async concurrent programming library, now depends only on Core_kernel rather than on Core.

This sounds like a dull, technical change, and it kind of is. But it's also part of a larger project to make our libraries more lightweight and portable, and so suitable for a wider array of users and applications.

We've actually been working on these issues for a while now, and this seems like a good time to review some of the changes we've made over the years, and what's still to come.

Reorganizing for portability

Core has always had dependencies on Unix, including OCaml's Unix library, as well as some other parts of the Unix environment, like the Unix timezone files. This has long been a problem for porting to Windows, but more recently, the issue has loomed for two other increasingly important platforms for OCaml: Javascript and Mirage.

To help fix this problem, in 2013 we released a library called Core_kernel, which is the portable subset of Core that avoids Unixisms as well as things like threads that don't match well with the Javascript and Mirage back-ends.

In the same vein, we refactored Async, our concurrent programming library, into a set of layers (similar to the layered design of Lwt) that both clarified the design and separated out the platform-specific bits. Async_kernel is the lowest-level and most portable piece, hosting the basic data structures and abstractions. Async_unix adds a Unix-specific scheduler, and Async_extra builds further OS-specific functionality on top.

Until recently, the fly in this ointment was that Async_kernel still depended on Core, rather than Core_kernel, because only Core had a time library. Making Async_kernel depend only on Core_kernel was a bigger project than you might imagine, in the end leading us to change Timing_wheel, a core data structure in Async and several other critical libraries at Jane Street, to use an integer representation of time instead of the float-based one from Core.

Already, some experiments are underway to take advantage of this change, including internal efforts to get Async working under Javascript, and external efforts to get cohttp's Async back-end to depend only on Async_kernel.

I'm hoping that yet more of this kind of work will follow.

Module Aliases

One long-running annoyance with OCaml is the lack of an effective namespace mechanism. For a long time, the only choice was OCaml's packed modules, which let you take a collection of modules and merge them together into one mega-module. Some kind of namespace mechanism is essential at scale, and so we used packed modules throughout our libraries.

Unfortunately, packed modules have serious downsides, both in terms of compilation time and executable sizes. We've been talking to people about this and looking for a solution for a long time. You can check out this epic thread on the platform list if you want to see some of the ensuing conversation.

A solution to this problem finally landed in OCaml 4.02, in the form of module aliases. I'll skip the detailed explanation (you can look here if you want to learn more), but the end result was great: our compilation times immediately went down by more than a factor of 3, and it gave us a path towards dropping packed modules altogether, thus reducing executable sizes and making incremental compilation massively more efficient.

The work on dropping packed modules has already landed internally, and will hopefully make it to the external release in a few months. The benefit to executable size is significant, with typical executables dropping in size by a factor of 2, but there is more to do. OCaml doesn't have aggressive dead code elimination, and that can lead to a lot of unnecessary stuff getting linked in. We're looking at some improvements we can make to cut down the dependency tree, but better dead code elimination at the compiler would really help.

Sharing basic types

Interoperability between Core and other OCaml libraries is generally pretty good: Core uses the same basic types (e.g., string, list, array, option) as other OCaml code, and that makes it pretty easy to mix and match libraries.

That said, there are some pain points. For example, Core uses a Result type (essentially, type ('a,'b) result = Ok of 'a | Error of 'b) quite routinely, and lots of other libraries use very similar types. Unfortunately, these libraries each have their own incompatible definitions.

The solution is to break out a simple type that the different libraries can share. After some discussion with the people behind some of the other libraries in question, I made a pull request to the compiler to add a result type to the stdlib.

This is a small thing, but small things matter. I hope that by paying attention to this kind of small issue, we can help keep interoperability between Core and the rest of the OCaml ecosystem smooth.

Eliminating camlp4

One concern I've heard raised about Core and Jane Street's other libraries is their reliance on camlp4. camlp4 is a somewhat divisive piece of infrastructure: it has long been the only decent way to do metaprogramming in OCaml, and as such has been enormously valuable; but it's also complex and somewhat unloved, and lots of people want to avoid it.

camlp4 also makes tooling a lot more complicated, since there's no single syntax to target. Dev tools like ocp-indent and the excellent merlin have some terrible hacks to support some of the most common camlp4 syntax extensions, but the situation is clearly untenable.

You do need camlp4 to build Core, but you don't need camlp4 to use it, and in practice, that's good enough for most use cases. But for people who want to avoid camlp4 entirely, it's still a nuisance. Moreover, while you don't need camlp4 to use Core, it is convenient. For example, a lot of Core's idioms work best when you provide s-expression serializers for your types, and the sexplib syntax extension is an awfully convenient way to generate those functions.

Our plan is to simply eliminate our dependency on camlp4 entirely over the next 6 months, by switching to using ppx and extension points, a new approach to metaprogramming in OCaml that, like module aliases, landed in 4.02. We're currently rewriting all of our syntax extensions, and building tools to automatically migrate the code that depends on camlp4. People who want to continue to use the old camlp4 extensions are welcome to continue doing so, but we're cutting our dependency on them.


Even at the end of all this, we don't expect that Core and Async will suit everyone --- that's a hard bar to cross for any software package. But we do hope that through these efforts, an ever wider set of developers will be able to take advantage of the work we've done.

Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and ease of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn't considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let's talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the basic steps that are taken to merge two heads A and B.

  • Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
  • Compute the patch from that base point to one of the two heads, say, A.
  • Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn't apply cleanly to B. The result of this process is the merge.
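
In code, the overall shape is something like the following sketch, where gca, diff and apply are hypothetical stand-ins for the real (per-file, much more involved) machinery:

  (* A schematic three-way merge: conflicts arise when the patch computed
     against the common ancestor doesn't apply cleanly to the other head. *)
  let merge ~gca ~diff ~apply a b =
    let base = gca a b in
    let patch = diff ~base ~head:a in
    apply ~patch ~onto:b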

The above discussion oversimplifies the story by assuming there's a well-defined GCA, but that just isn't always true. To see why, consider a repository starting with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amounts to a principled solution, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It's tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss-cross merges in the first place.

Avoiding the criss-cross merge

For those who haven't read my previous posts about how Iron approaches merges, I'll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

  • fe release moves changes from a child feature into its parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of the parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

  • fe rebase lets you merge a feature with its parents, effectively pulling changes from a parent feature into a child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

  • fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn't previously have one. e.g., calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, the quickcheck feature.

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question --- or, more precisely, that we built a consistent service. Though we discovered this more or less by chance, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature, or rebasing the same feature, simply isn't possible when there's a centralized monitor determining who goes first.

In retrospect, this shouldn't be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server gives you) can be used to exclude unwanted concurrent executions in a distributed system should surprise no one.

In the end, you can trace it all back to the CAP theorem: if you want progress while partitioned, you need to give up on consistency in some way. And criss-cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release, which affect how information is shared between features, require you to be connected. I think that's a small price to pay to never have to deal with a criss-cross merge.

Making making better

We spend a lot of time and effort on training new people, and it never stops for long. Right now our winter-intern class is ending; in five months we'll have a slew of new interns to get up to speed, and a few months after that we'll have an incoming class of new hires.

A big part of our new-hire training is OCaml Bootcamp, a month-long immersion program for non-dev hires (mostly trading, but some from other areas like research, systems, and accounting). We don't think everyone at Jane Street should be a full-on software developer, but writing code is such a useful way to get things done that it's worth teaching the basics to a broad set of people.

Teaching programming, especially to people who are not planning on becoming software developers, is an opportunity to reflect on how unnecessarily hard programming can be. There's a huge learning curve as you struggle to learn your way around the Unix command-line, or figure out the key commands for controlling a 1970's era text editor like Emacs or Vi, or puzzle through the semantics of a distributed version control system like Mercurial. And all of that is before you even start writing code!

To me, this serves as a reminder of the importance of good tools. The quality of your tools can increase or decrease the steepness of the learning curve, and they also affect your day-to-day efficiency after you've climbed up that hill.

Tools are easy to undervalue. Most of our development time is spent, as it should be, on our first order problems -- writing the code that powers the systems that let us trade. And the connection between better tools and better trading can seem tenuous.

But really it's not tenuous at all. If you spend all your time on first order problems, you'll discover you're not solving them as fast as you should be. Getting things done effectively requires optimizing your own productivity, and to do that, you have to spend some time sharpening your tools.

And we've done a fair bit of sharpening. One recent example is Iron, a new code review and release management system that we started using last summer. Last year, we also rolled out a new build system called Jenga, which greatly simplified and sped up the process of compiling our code. Plus, we switched to a new version of OCaml, which includes a big set of improvements, some of which were specifically aimed at improving our development process. And we funded some former interns to improve Merlin, a fantastic tool that provides IDE-like features like context-sensitive autocompletion in a way that can be easily integrated into multiple editors.

Jane Street is a pretty small shop --- we have fewer than 65 full time developers --- but even at our modest scale, spending time on tools is a big win. But it's really about more than dev tools. Thinking about how to make the people around you more effective informs how you work more generally, changing how you design libraries, how you manage services, and how (and whether!) you write documentation.

And in addition to making for a more effective organization, it's also a more pleasant way to live.

13 Virtues

Very early on in his life, while on a lengthy voyage from London to Philadelphia, Ben Franklin created a system of thirteen virtues to live his life by. He spent the remainder of his days giving special focus to one virtue per week in a 13-week cycle, as well as noting, at the end of each day, the virtues he had failed to live up to.

Over time he credited the system with making him more productive and more fulfilled.

My aspirations are not so lofty, but in the spirit of the new year, I present Ben's thirteen virtues as they relate to code review and discussion. Nothing here is meant to be taken as gospel, but together they give me a path towards the type of collaboration we value at Jane Street.

My simple hope is to waste less time, produce better code, and have fewer arguments over the next 12 months than I did over the last.

Temperance

Review thoroughly, but not to the point of exhaustion. Be mindful of the limits of review.

Silence

Say only things that clearly benefit the code; avoid trifling comments and tangents.

Order

Create the structure (files, modules and types) necessary to give every concept a clear place. Be suspicious of catchall modules.

Resolution

Respond to feedback and change requests quickly. Momentum is important.

Frugality

Don't waste people's time with frivolous review, careless comments, or code that isn't ready for review. Attention is expensive.

Industry

Prefer to respond with working code over additional commentary. Focus review on immediately productive outcomes instead of uncertain concerns.

Sincerity

Come to discussions with an innocent mind. Engage in code review with the clear goal of helping.

Justice

Weigh code decisions on the evidence at hand today, and not on personal preferences, prejudices, or obsolete past constraints.

Moderation

Avoid extremes in both style and approach. Incorporate strong views slowly.

Cleanliness

Spend time to lay out code in a clear and visually pleasing way. When reviewing, leave the code neater than you found it.

Tranquility

Don't become impassioned or incensed over trifles. Engage in all conversation with an open balanced tone and a sense of patience.

Chastity

Proliferate new ideas through the code base cautiously and be aware that even a good idea may not work in all places.

Humility

Take a modest view of your own contributions and feedback. Be unpretentious and respectful in your comments. Accept that you may be wrong.

How to choose a teaching language

If you were teaching a programming course, what language would you teach it in?

I like this question because it has any number of good answers, each quite different from the other, and each emblematic of a different approach to what programming is about.

The first formal programming class I took was COS 217 at Princeton, taught by the excellent (and at the time, I thought, terrifying) Anne Rogers. The course was (and is) taught in C, and the intellectual approach of the class was to start from the machine. Instead of just learning to program in C, we learned about how the machines we were programming to worked. That was where I first encountered instruction counters, stack frames, registers and the memory hierarchy. It was exhilarating.

Where C encourages you to start with the machine, Scheme wants you to start at the mathematical foundations of computation. You don't need to know what the lambda calculus is to appreciate Scheme's slim core, and the way in which you can build a rich and vibrant computational world on top of it. That core is expressive enough that it makes it natural to introduce ideas that come up in multiple different languages, including functional and imperative techniques, object orientation, and logic programming.

The classic course in this vein is MIT's 6.001, also known as SICP, The Structure and Interpretation of Computer Programs. Sadly, 6.001 has been retired at MIT, but the book lives on, and is a worthwhile read even if you took your last CS course years ago.

MIT replaced SICP with a course based on Python, and this reflects a broader trend. As was highlighted by an informal study by Philip Guo, lots of schools now teach Python, particularly for early introductory courses. I have mixed feelings about this choice. Python is a wonderfully friendly language, but that friendliness is bundled together with some problems.

This was made apparent to me in part by my experience with students who chose to code in their interviews in Python. In many ways, Python is the ideal interview language, since its concise and readable syntax makes the constraints of coding on the whiteboard more bearable. But what I saw was that students who learn Python often walk away with a rather rough model of the semantics of the language. You might be surprised at what fraction of students who have programmed extensively in Python can't guess how Python lists might be implemented, to say nothing of their ability to explain the semantics of language features like generators or decorators.

This isn't really a knock on Python. After all, there's something great about a tool that lets you get things done without fully understanding how it works. But in different ways, both Scheme and C encourage you to understand what's going on from the ground up, and there's a certain pedagogical power in that. All in, I think Python is a great choice for an early introductory course, particularly one meant for those who aren't going to end up as computer scientists or full-time programmers. But I'd be leery of using it outside of those circumstances.

A development that I'm personally rather encouraged by is the emergence of statically typed functional languages, ML in particular, as teaching tools. Over the last few years, I've had the pleasure of visiting and lecturing at a number of schools that teach using OCaml or SML, including Brown, Cornell, Penn, Princeton, CMU and Harvard.

ML has gained ground for good reasons. First, it shares much of Scheme's elegant intellectual foundations, even though its core isn't quite as charmingly minimalistic as Scheme's. But ML has a wider range than Scheme because it allows you to show students the role of types in programming. Despite that greater range, OCaml and SML are relatively simple languages, which matters even more for teaching than it does for everyday use.

The only choice I've seen a lot of that I can't reconcile myself to is Java. Java is of course a widely used industrial language, but that doesn't mean it's a good teaching language. In my view, a key requirement for a teaching language is simplicity, and all the other choices I mentioned are simple in one way or another: C is a minimal layer on top of the machine; Scheme and ML are based on simple mathematical models of computation; and Python is a language that people find simple to use.

Java isn't simple in any of these ways. It's not particularly easy to get started with, as indicated by all the things you need to tell students to ignore rather than understand. (Yeah, public static void main, I'm looking at you!) Nor does it have a simple and transparent execution model like C. And an elegant computational core calculus like the one at the heart of Scheme and ML is nowhere in sight. The only real advantage I see for Java is vocational, and that doesn't seem to me like argument enough.

The thing to consider when you're picking a language to teach is that you're not just picking a bit of infrastructure for your students to program with during the course. You're picking an intellectual frame through which they will see all the lessons you teach them. And you should pick that frame with care.

Interviewing At Jane Street

Welcome to our version of the seemingly obligatory post about technical interviews. This topic has been covered by a lot of people already, so I'm going to do my best to not repeat all of the copious advice already out there.

Like many companies, we are looking for extremely talented technical people, and we have a challenging interview process that we think does a good job of selecting people who will do well here.

That said, we know that we miss lots of good people too. Some of that is because of the awkwardness of the interview process itself: time is short, the questions are weirdly artificial, and, of course, people get nervous. It's made even worse by the fact that programming on a whiteboard, a web browser, or even just on a computer that isn't yours is a bit like playing classical guitar with mittens on. It can put even an accomplished person off their game.

Missing out on good people makes us sad.

That's what this post is for. We hope that by talking a bit about what we're looking for, the ways people do poorly, and how we think you might be able to prepare, we'll reduce the mitten handicap - at least a bit.

What Are We Looking For?

From our perspective, the main thing we want to figure out when we interview someone is: are they someone we want to work with?

That seems obvious enough, but it's a point that can get lost in the puzzles and whiteboard coding of an interview. Really, we think of our interviews as little simulations of what it's like to work together with the candidate. And while at the time, it may seem to the candidate that the interview is all about solving the problem, it's really not. We're much more interested in learning about how you work than we are in whether you actually finish whatever problem we put in front of you.

It's not that technical skill is irrelevant --- far from it. But it's only part of the story. Just as important to us is whether we can have a productive and fun discussion with the candidate.

To that end, we try to avoid algorithm bingo and puzzles with clever "aha" solutions. We prefer more open-ended problems that have no single right answer, since they give us more freedom to work together, and to see how the candidates' thinking plays out.

That sounds nice enough, but it's a bit high-level and hand-wavy. So here's a more concrete list of suggestions that follow from our overall approach.

Be nice

The smartest, best person in the world won't get hired at Jane Street if they aren't courteous and pleasant to talk to most of the time. Almost nothing will end your interview faster than being rude, pushy, or obnoxious.

Remember, we are looking for someone to work with, not just someone who can win an argument.

Be clear

And by clear, we mean simple and to the point. Use words and examples that get the core idea across to the widest technical audience possible.

Avoid showy, highly technical or deeply obscure terms of art, especially if you don't fully understand them. In the best case we'll likely just ask exactly what you meant by "hylomorphism", which wastes precious time. In the worst case it will become clear that you should have said "metamorphism" instead, which is just embarrassing for everyone involved.

Know what you don't know

Naturally we like it when candidates are good at solving the problems we put in front of them. But just as important, perhaps more important, is their ability to think reasonably about their own level of understanding.

In other words, we really like people who can express appropriate levels of confidence: admitting ignorance when they're unsure, and speaking confidently when they have the basis to do so. At a basic level this means being willing to say, "I don't know" rather than guessing and hoping when we ask you about something you aren't familiar with.

Know your language

Code is a wonderful language for communicating complex ideas because it provides a clear, concise and unambiguous way of expressing them. But, like any foreign language, it takes a lot of time and practice to get really comfortable.

We need you to be comfortable with it, because we communicate ideas in code a lot.

Now, comfortable doesn't mean that you have to be the world's best coder, or that you need to have memorized your favorite algorithms book. It means that you should be able to read and write code in at least one language without constant access to reference materials for common things, such as:

  • Standard control structures (loops/if-then/etc.)
  • Function, module, class, type, etc. definitions
  • Common data types like arrays, lists, hash tables/maps/dictionaries
  • Exceptions and other error handling techniques

Also, pick coding tools you understand well. Don't pick a functional language to make us happy. We'd much prefer you use a language that you know well. Similarly, when picking which features of the language to use, pick the ones you understand best. We're not going to be impressed with your elegant use of Python decorators if you don't really understand the details of what they do.

In other words, pick a simple, clunky solution that you understand over a fancy, elegant one that you don't.

Remember CS 101

We've hired plenty of successful people who didn't have a traditional college background in CS, and we certainly don't require a masters or a PhD. That said, we need you to have a solid working knowledge of core computer science concepts, including:

  • Abstraction layers like functions, objects, and modules

  • Basic algorithms and data structures, including binary search, sorting, hashing, breadth/depth first search, hashtables, binary trees and heaps.

  • Techniques for estimating CPU and memory costs, including big-O notation.

So if you can't for the life of you recall what amortized analysis is, and you can't nimbly bang out the code for a depth-first search, it's probably worth reviewing some of this material.
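
For instance, a basic depth-first search is the sort of thing we'd hope you could produce without much fuss; here's a rough OCaml sketch:

  (* Depth-first search over a graph given as an adjacency function,
     returning the nodes in visit order. *)
  let dfs ~neighbors ~start =
    let visited = Hashtbl.create 16 in
    let order = ref [] in
    let rec visit node =
      if not (Hashtbl.mem visited node) then begin
        Hashtbl.add visited node ();
        order := node :: !order;
        List.iter visit (neighbors node)
      end
    in
    visit start;
    List.rev !order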

Think about real computers

Depending on your point of view it's either a sad or beautiful fact that the most elegant code can only run on top of the giant, complex, odd stack of parts and abstractions that is a real computer. Since we need programs to actually run, we need people who understand the inner workings of that behemoth.

Now, this doesn't mean that we quiz every candidate about deep kernel internals, or the differences between SDRAM and SGRAM. But for some jobs in systems development we do expect a fair amount of detailed knowledge, and in general it's a big plus if your thinking can take into account things like cache effects, IO patterns, memory representations, and the capabilities of real CPUs.

What We Don't Look For

  • Perfection. Our questions are often designed to be open-ended enough that even the best people we've seen couldn't answer them fully in the time allotted. We want to keep the conversation going to learn everything we can, and we don't expect that you'll answer everything 100% perfectly.

  • We don't ask developers mental math questions or math olympiad questions, despite what you might have read online. Dev interviews are about programming.

  • We don't ask developers logic puzzles about pirates, people who only tell the truth, or which door the tiger is behind. Dev interviews are about programming.

How do people do poorly?

The most common issue is, of course, that some candidates just don't have the background for the job they are applying for. The solution to that is to learn more, and practice more. But there are a few other less obvious reasons that interviews with otherwise technically good people can go awry.

They're careless

One of the most common pieces of negative feedback from our interviewers is that the candidate was careless. This usually doesn't mean that the candidate didn't make progress, or didn't have good insights. It means that the candidate didn't think carefully and systematically about how their program might go wrong.

We care that you make progress, but we are rarely concerned about finishing a problem. In fact, many of the problems are designed to go on for far longer than the average interview length. It's better to step back and check your work carefully before you claim that it is finished than to rush.

They talk more than they code

Some candidates do a good job of talking through the problem and explaining their thinking, but never quite get around to concretely answering the question. We want to hear your ideas, but we also want to see you produce concrete solutions, which almost always involves writing code. There are lots of things that we can't learn about a candidate without seeing their code and talking through the details.

Take some time at the beginning to think about the solution and to talk about your plans, but make sure you start writing code - even if it isn't the code you would write if you had more time.

They don't generalize

We try to keep our interviews interactive, and we'll often stop candidates to ask about something they have just done, or to point out something that we think might be confusing or incorrect. We understand that we've seen these problems over and over again, and that you are coming to them fresh, so you shouldn't worry just because we've found a problem with your solution.

What you should do is generalize the advice. If we point out that you missed a case, consider other cases you might have missed. If we remind you of an invariant you forgot, find a way to protect yourself from making the mistake in other places in your code.

They say one thing and do another

We love it when a plan comes together, but it's extra hard to watch when a good plan falls apart on execution. If you hear a question, and discuss a plan of attack with your interviewer, do what you claim you will do. Don't change the plan in the middle, or drop it entirely in favor of a better idea without some discussion. You have a very limited amount of time to describe your solution in code, and executing a decent plan well is better than producing a Frankenstein's monster of three different plans that doesn't quite come to life.

If you do get partway through and start to lose faith, step back and talk about it. Explain exactly why you are concerned, and whether you think it might be fatally flawed at the core, or just not ideal. If there really is a fatal flaw and you've seen it, we'll help you get out of the jam, and we'll appreciate that you articulated it. If it's just not quite perfect, we'll likely encourage you to continue.

So, what can you do to prepare?

This part is short and sweet. Build something - from scratch - in a language you like. Don't stop short. Build the whole thing.

Now, show it to the smartest people you know, get feedback, tear it down and build it again with what you've learned.

Repeat with a new problem.

We are looking for people to build real things with us, and practice really does make perfect.

What the interns have wrought: RPC_parallel and Core_profiler

We're in the midst of intern hiring season, and so we get a lot of questions about what it's like to be an intern at Jane Street. One of the things people most want to know is what kind of projects they might work on as an intern.

That's of course hard to answer precisely, since we don't yet know what next summer's intern projects are going to be. But you can get a sense by seeing some of what interns have done in the past. To that end, I thought I'd describe a couple of intern projects that were done this past summer.

Rpc_parallel

Rpc_parallel is a new library written by Todd Lubin (who will be returning to Jane Street full-time next fall), aimed at making parallel programming in OCaml simpler and safer.

Writing programs that take advantage of multiple cores, and even of multiple computers, is of course a necessity. In OCaml, such parallelism is typically achieved by writing multiple single-threaded programs that use message passing to coordinate.

We have lots of parallel message-passing systems, but while these share a lot of infrastructure (notably, the Async_rpc library), we've never really settled on a lightweight way to construct complete parallel programs. Instead, each such program tends to have its own little process coordination framework.

A while back, we wrote a library called Async_parallel (described here). Async_parallel does a lot of things right -- in particular, it makes it easy to spin up and manage new processes and distribute work between them.

But Async_parallel has a problem. It is based on OCaml's marshal facility, which allows you to serialize arbitrary data, including functions, and send it between processes. This sounds great, but it has a dark side: marshalling arbitrary functions turns out to be error prone. In particular, in order to send a function over the wire, you also have to send a copy of any data that that function implicitly depends on.

Unfortunately, it's hard to know what kind of data is hidden behind a function, which can cause a few different forms of trouble: it might send much more data than you expect, it might fail unexpectedly if it hits one of the forms of data that can't be marshalled, or it might lead to crazy and hard-to-predict behavior if some of the data required by the function is meant to be mutable shared state.
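To make the problem concrete, here's a small sketch using the stdlib Marshal module directly (this is not Async_parallel code, and the values are made up for illustration). The closure being serialized silently drags its environment along with it:

    (* Marshalling a closure with the [Closures] flag also marshals whatever
       the closure captures: here the large [table] gets copied over the wire,
       and [counter] -- intended as shared mutable state -- becomes a private
       copy on the receiving side. *)
    let table = Array.make 1_000_000 0   (* large captured data *)
    let counter = ref 0                  (* mutable state the function depends on *)

    let f x =
      incr counter;
      x + table.(0)

    let serialized : string =
      Marshal.to_string f [Marshal.Closures]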

Instead, we wanted a library that was more typeful and constrained in terms of what was sent over the wire. This pushed us towards a design where, at the cost of some extra verbosity, we explicitly declare the types of the data being sent. In exchange, we get a system that is easier to understand and debug.
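As a rough illustration of the style (a hypothetical sketch, not Rpc_parallel's actual interface), the data crossing the process boundary gets an explicitly declared type, and what a worker exposes is a named, typed function over that data rather than an arbitrary closure:

    (* Hypothetical sketch of a typeful worker interface.  The point is that
       only values of the declared [query] and [response] types travel between
       processes, using explicitly chosen serializers -- no closures on the wire. *)
    type query    = { symbol : string; window : int }
    type response = { mean : float; stddev : float }

    module type Worker = sig
      (* Implemented in the worker process; the framework ships [query] and
         [response] values back and forth. *)
      val compute : query -> response
    end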

One of the great things about Rpc_parallel is how fast it came together. Todd got it into a sufficiently good state by the middle of the summer that he was able to use it for his other projects (interns typically have at least two major projects over the course of the summer).

Rpc_parallel also benefitted from some world-hopping collaboration. Interns spend at least a week in an office other than their home office, and Todd ended up visiting Hong Kong. While there, he ended up spending a lot of time talking and working with Chengqi Song, who had a lot of experience with Async_parallel. Out of those discussions came a complete redesign and rewrite of the library, factoring out the core primitives for coordinating across multiple processes, and making the overall library simpler and more general.

By the end of the summer, a few other people had picked it up and started using it for other projects, and last week it was released open source, so you can take a look at it yourself on GitHub.

Core_profiler

Profiling is surprisingly hard, and as such it's perhaps unsurprising that there are lots of ways of doing it. If you want to understand the cost of an individual operation, you might want a micro-benchmarking library like Haskell's Criterion, or our own Core_bench. If you're trying to understand properties of a whole application, like which lines of code it's spending its time on, or how many cache misses are occurring, you might want to use something like Linux's perf tools, which use CPU-level counters to efficiently gather profiling statistics from your program.

Another useful technique is for the programmer to modify the source to add explicit probes that keep track of when certain program points are reached, and write out that information to a log that can be analyzed after the fact.
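As a sketch of the idea (and only the idea -- this is not Core_profiler's interface, and a real probe library works hard to make the probe itself far cheaper than this), a probe just records an identifier and a timestamp, and the log gets written out for offline analysis:

    (* Naive probing: record (probe id, wall-clock time) pairs in memory and
       dump them to a file afterwards.  Requires the [unix] library for
       [Unix.gettimeofday]. *)
    let events : (int * float) list ref = ref []

    let probe ~id = events := (id, Unix.gettimeofday ()) :: !events

    let dump filename =
      let oc = open_out filename in
      List.iter (fun (id, t) -> Printf.fprintf oc "%d %.9f\n" id t)
        (List.rev !events);
      close_out oc

    let () =
      probe ~id:0;
      (* ... the code being measured ... *)
      probe ~id:1;
      dump "probes.log"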

Daniel Richman (who will be returning for an internship next summer) worked with Roshan James (formerly an intern himself, now full-time) on a library called Core_profiler, which aims to make the use of such probes easy and efficient. Efficiency matters quite a bit, because if logging a probe takes more time than the thing you're trying to measure, you basically can't extract reliable data. Keeping the overhead small, therefore, is a must.

Accordingly, a lot of Daniel's time was spent thinking very carefully about how to write the probes in a way that would only minimally disrupt the execution of the program. He started with a simple but less efficient library, called Stats_reporting, which took about 60ns and two words of allocation per probe, and machined it down from there.

The first step was to avoid all allocation, which meant we could no longer use bin-prot, our standard binary serialization technique, since bin-prot requires that you allocate an OCaml object representing the data to be serialized. So they moved to using an internal library called Protogen for generating zero-allocation serialization code.

That brought us down to about 30ns and zero allocations. We then decided to try writing our own hand-rolled binary protocol, so we could have a yet more tightly optimized binary layout. That brought us down to about 20-25ns. The next hack was to customize the binary format further by packing multiple values into a single 63-bit OCaml integer. That, plus some cleverness to make sure that writes were word-aligned, brought the cost of a simple probe down to 12ns. In addition to the design changes, Daniel also spent a lot of time carefully examining the assembly code, to make sure that there were no surprises from the code generator.
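To illustrate the packing trick (with made-up field widths -- this is not Core_profiler's actual layout), you can fit a small probe id and a time delta into one native OCaml int, which on a 64-bit machine gives you 63 usable bits, so a record is a single word and can be written with a single aligned store:

    (* Pack a 10-bit probe id and a time delta into a single OCaml int.
       Field widths here are illustrative only. *)
    let id_bits = 10
    let max_id = (1 lsl id_bits) - 1

    let pack ~id ~time_delta =
      assert (id >= 0 && id <= max_id);
      (time_delta lsl id_bits) lor id

    let unpack packed =
      let id = packed land max_id in
      let time_delta = packed lsr id_bits in
      (id, time_delta)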

We're pretty happy with the end result. We think that Core_profiler probes are now a good bit cheaper than the dynamic probes you can insert using a system like DTrace, and are in any case the best tool we have for tracking the performance of even relatively quick code-paths. And, as a nice bonus to all this optimization, the offline processing got about twice as fast as a side effect of the runtime improvements.


There's more to be said about both of these projects, and about the many other projects that were done this summer. If you're interested in applying, you can do so here:

http://janestreet.com

You don't need to know anything about finance or functional programming to apply. But if you come, you'll learn a ton about both by the time you're done.
