CPU Registers and OCaml

Even though registers are a low-level CPU concept, having some knowledge of them can help you write faster code. Simply put, a CPU register is storage for a single value. The CPU can keep data in memory, in cache, or in registers, and registers are usually much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables in registers as it can.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

let stats xs ys =
  let len = Array.length xs in
  if len <> Array.length ys then
    invalid_arg "Arrays not of the same length";
  let sx = ref 0 in
  let sy = ref 0 in
  let sxx = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let sax = ref 0 in
  let say = ref 0 in
  for i = 0 to len - 1 do
    let x = Array.unsafe_get xs i in
    let y = Array.unsafe_get ys i in
    let ax = abs x in
    let ay = abs y in
    let xx = x * x in
    let yy = y * y in
    let xy = x * y in
    sx := !sx + x;
    sy := !sy + y;
    sxx := !sxx + xx;
    syy := !syy + yy;
    sxy := !sxy + xy;
    sax := !sax + ax;
    say := !say + ay;
  done;
  !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines produces code 1.5-2x faster:

    let x = Array.unsafe_get xs i in
    sx := !sx + x;
    let xx = x * x in
    sxx := !sxx + xx;
    let ax = abs x in
    sax := !sax + ax;
    let y = Array.unsafe_get ys i in
    sy := !sy + y;
    let xy = x * y in
    sxy := !sxy + xy;
    let ay = abs y in
    say := !say + ay;
    let yy = y * y in
    syy := !syy + yy;

Why is that? The CPU has just a few registers (on x86-64):

  • 13 which can contain integers and pointers to arrays and records
  • 16 which can contain floating point numbers

If the code has more live variables than that, the OCaml compiler has to park the extra ones in memory; this parking is called spilling. Spilling may actually happen even with fewer variables, because some operations, like integer multiplication, use extra registers.

Therefore, it's good to keep the number of frequently used variables to 13 or fewer, or to rearrange the code so that fewer variables are live at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled to memory, because it is not known whether the called function will need those registers. That is often the largest penalty of calling a function, assuming the function is not inlined.

Let's change the previous function:

let stats xs ys =
  let sx  = ref 0 in
  let sax = ref 0 in
  let sxx = ref 0 in
  let sy  = ref 0 in
  let say = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let len = Array.length xs in
  if len <> Array.length ys then
    failwith (Printf.sprintf "Arrays not of the same length: %d %d"
      len (Array.length ys));
  for i = 0 to len - 1 do
    ... 

This produces 1.35x slower code simply because the OCaml compiler spills all of the variables around the call to Printf.sprintf. In each iteration, OCaml pulls sx from memory and stores it back.

The pity is that this is not actually necessary: OCaml only needs to pull sx from memory and store it back once, not in each iteration. This looks like something that could be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from its input parameters, from its closure environment, and from memory. Input parameters are normally passed in registers, so using them will generally be slightly faster than reading from the closure environment or from memory. This can come in handy in recursive functions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e =
  let len = Array.length a in
  let rec loop i =
    if i < len then begin
      if Array.unsafe_get a i = e then Some i
      else loop (i + 1)
    end else None
  in
  loop 0

The variables e and len are pulled from the closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e =
  let rec loop e len i =
    if i < len then begin
      if e = Array.unsafe_get a i then Some i
      else loop e len (i + 1)
    end else None
  in
  loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation

Reverse web proxy in ~50 lines of BASH

In the spirit of reinventing the wheel for fun, I hacked this together as a quick challenge to myself last week. It's a little rough around the edges, but I thought it was too cute not to share. If you have any bug fixes, please post them in the comments.

#!/bin/bash

set -e -u

function usage() {
  echo "USAGE: ${0} parent {port} {lower_bound_backend_port} {upper_bound_backend_port}"
}

[[ $# -lt 1 ]] && usage && exit 1

mode=${1};shift
case ${mode} in
  parent)
    PORT=${1};shift
    LOWER=${1};shift
    UPPER=${1};shift
    socat TCP-LISTEN:${PORT},fork,reuseaddr "EXEC:${0} child ${LOWER} ${UPPER}"
    ;;
  child)
    LOWER=${1};shift
    UPPER=${1};shift
    COUNT=0
    PORT=$(shuf -i ${LOWER}-${UPPER} -n 1)
    let "RANGE = UPPER - LOWER"
    SUCCESS=false
    while ((COUNT <= RANGE)) || ${SUCCESS}; do 
      set +e
      if socat STDIN TCP:127.0.0.1:${PORT},connect-timeout=2; then
        SUCCESS=true
        break
      else
        echo "unable to connect to port ${PORT}" >&2
      fi
      set -e
      let COUNT+=1
      let PORT+=1
      if ((PORT > UPPER )); then
        let 'PORT = LOWER'
      fi
    done
    if ! ${SUCCESS}; then
      echo "HTTP/1.1 500 Internal Server Error"
      echo
      echo "All REDACTED servers are down.  Please report to REDACTED@janestreet.com."
      exit 1
    fi
    ;;
  *)
    usage
    exit 1
    ;;
esac

Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but makes for a less pleasant coding experience (and code that is trickier and harder to reason about.) So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, which means that collection of the major heap can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do pretty well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.

Clocking

One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right, since if you run them when the minor heap is too small, then you'll end up accidentally promoting a rather large fraction of your minor allocations.

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record every phase of the collection process in quite a detailed way: the probes tell you the phase (marking vs sweeping) and the sub-phase, and keep track of a collection of useful counters. This is available in the instrument branch.

Aging

Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen in the case that the minor heap is bigger than some configured level, to avoid too much false promotion; but a major collection can be done at any time.
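As a rough illustration of what such a job might look like using nothing more than the stock Gc module, here is a minimal sketch (the busy-check is a hypothetical, application-specific heuristic, and note that in the stock runtime Gc.major_slice still forces a minor collection first; the low-latency branch is what relaxes that coupling):

let () =
  (* Hypothetical, application-specific busy heuristic; stubbed out here. *)
  let appears_busy () = false in
  (* Background job: wake up periodically and, if the application looks
     idle, do a slice of major GC work. *)
  let rec pacer () =
    Thread.delay 0.05;
    if not (appears_busy ()) then
      ignore (Gc.major_slice 0);    (* 0 = use the default slice size *)
    pacer ()
  in
  ignore (Thread.create pacer ())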

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets for keeping track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket to negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
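To make the bookkeeping concrete, here is a toy model of that accounting scheme in OCaml (an illustration of the idea only, not the runtime's actual code, which lives in C):

(* Pending major-GC work, tracked over the next [n] slices. *)
let n = 16
let buckets = Array.make n 0.

(* Newly measured work is spread uniformly over the next [n] slices. *)
let spread work =
  Array.iteri (fun i w -> buckets.(i) <- w +. (work /. float_of_int n)) buckets

(* An application-forced slice debits the front-most bucket, possibly
   driving it negative. *)
let note_forced_slice work_done =
  buckets.(0) <- buckets.(0) -. work_done

(* A scheduled slice does the front bucket's work if it is positive
   (assuming, for simplicity, that all of it gets done), then retires
   that bucket and spreads the leftover, positive or negative, over
   the following buckets. *)
let scheduled_slice_amount () =
  let todo = max 0. buckets.(0) in
  let leftover = buckets.(0) -. todo in
  Array.blit buckets 1 buckets 0 (n - 1);
  buckets.(n - 1) <- 0.;
  spread leftover;
  todo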

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks in the major heap to promote live objects into. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free lists, for a configurable set of sizes. For example, one could have a different free list for blocks with 1, 2, 3 and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.
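To make the idea concrete, here is a toy sketch of the lookup structure (purely illustrative; the real free lists live in the C runtime and hold raw heap blocks):

(* Size-segregated free lists: blocks of 1..num_small slots each get
   their own list; anything larger goes into a single catch-all list,
   which a real allocator would search first-fit. *)
type t = {
  small : int list array;    (* small.(k) holds free blocks of size k+1 *)
  mutable large : int list;
}

let create ~num_small = { small = Array.make num_small []; large = [] }

let free t ~size ~block =
  if size <= Array.length t.small
  then t.small.(size - 1) <- block :: t.small.(size - 1)
  else t.large <- block :: t.large

(* For a small request, finding a block of exactly the right size is a
   constant-time pop rather than a scan of one big free list. *)
let alloc_small t ~size =
  match t.small.(size - 1) with
  | [] -> None
  | block :: rest -> t.small.(size - 1) <- rest; Some block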

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between allocating OCaml data and passing it back, the C call has to make sure that OCaml's Garbage Collector doesn't collect it, as the Garbage Collector can be triggered during the C call. There's an intricate mechanism which ensures this, part of which are the CAMLparam, CAMLlocal and CAMLreturn macros.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare is not smart enough to avoid branch mispredictions for floats. Moving the comparison to C speeds it up a little bit; "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be double, CAMLparam, CAMLlocal and CAMLreturn should be avoided, and the primitive declaration should include "float" as well as both a bytecode and a native implementation:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so it's impossible to use "float" directly. Is it still possible to avoid boxing? The answer is yes: simply return the result as a float and convert it to an int on the OCaml side.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)

Why GADTs matter for performance

When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn't particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.

Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn't seem particularly relevant for the kind of systems programming that I think about.

But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we've found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.

The Problem of Polymorphism

I'd like to walk through a simple example that illustrates this aspect of GADTs, but first, a few words about OCaml's memory representation. OCaml's polymorphism depends in an important way on that memory representation. In particular, consider a simple polymorphic function like List.iter, which has the following type:

val iter: 'a list -> f:('a -> unit) -> unit

The polymorphic type tells you that List.iter can operate on lists of any type, and in OCaml, this is achieved with a single compiled version of the code. This is possible because the memory representation of the elements of a list is uniform: you can always refer to an OCaml value in a single word, either as a pointer to a heap-allocated value, or as an immediate that fits inside that word.

That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There's actually some special magic in the compiler for float arrays, though this is probably more trouble than it's worth, as described by Alain Frisch here. But let's ignore float arrays for now.)

OCaml does have a tighter representation for byte arrays, called bytes. But it's a completely different type, and so building a general purpose data structure that uses bytes when it would make sense, and ordinary arrays otherwise, is a little awkward.

Controlling memory representation without GADTs

Let's see what happens if we try to design (without GADTs) an array type that sometimes uses the general array representation and sometimes uses bytes.

You could imagine representing such a value using an ordinary variant.

type 'a t = | Array of 'a array
            | Bytes of bytes

We could then implement each of the operations we want on our new array type, implementing each operation differently depending on the particular representation. Let's see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.

> module Compact_array = struct

    type 'a t = | Array of 'a array
                | Bytes of bytes

    let of_bytes x : char t = Bytes x
    let of_array x = Array x

    let length = function
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get t i =
      match t with
      | Array a -> a.(i)
      | Bytes s -> Bytes.get s i

    let set t i v =
      match t with
      | Array a -> a.(i) <- v
      | Bytes s -> Bytes.set s i v

  end;;

module Compact_array :
  sig
    type 'a t = Array of 'a array | Bytes of bytes
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : char t -> int -> char
    val set : char t -> int -> char -> unit
  end

This seems pretty good at first glance, but the inferred types aren't quite what we want. In particular, get and set only work with Compact_arrays containing characters. If you think about how type inference works, it's not really all that surprising. Consider the code for get:

let get t i =
  match t with
  | Array a -> Array.get a i
  | Bytes s -> Bytes.get s i

The OCaml compiler is looking for a single type to assign to the return value for all the cases of the match. Given that Bytes.get always returns a char, Compact_array.get will be restricted to only returning a char.

One way to work around this problem is to essentially implement what we want as a poor-man's object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here's how that looks.

module Compact_array = struct

  type 'a t = { len: unit -> int
              ; get: int -> 'a
              ; set: int -> 'a -> unit
              }

  let of_bytes s =
    { len = (fun () -> Bytes.length s)
    ; get = (fun i -> Bytes.get s i)
    ; set = (fun i x -> Bytes.set s i x)
    }

  let of_array a =
    { len = (fun () -> Array.length a)
    ; get = (fun i -> Array.get a i)
    ; set = (fun i x -> Array.set a i x)
    }

  let length t = t.len ()
  let get t i = t.get i
  let set t i x = t.set i x

end;;

module Compact_array :
  sig
    type 'a t = {
      len : unit -> int;
      get : int -> 'a;
      set : int -> 'a -> unit;
    }
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

This more or less solves the problem, but it's still not really the memory representation we want. In particular, we have to allocate three closures for each Compact_array.t, and this number of closures will only go up as we add more functions whose behavior depends on the underlying array.

GADTs to the rescue

Let's go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we're not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> 'a t

The syntax of this declaration suggests thinking about a variant constructor like Array or Bytes as a function from the constructor's arguments to the type of the resulting value, with the thing to the right of the : roughly corresponding to the type signature of the constructor.

Note that for the Array constructor, the value of the type parameter 'a depends on the type of the argument:

> Array [|1;2;3|];;
- : int t = Array [|1; 2; 3|]
> Array [|"one";"two";"three"|];;
- : bytes t = Array [|"one"; "two"; "three"|]

But for the Bytes constructor, the type parameter 'a is still free.

> Bytes "foo";;
- : 'a t = Bytes "foo"

This is really the problematic case, because for Bytes "foo" we'd like the parameter 'a to be char, since in the Bytes case that's the element type of our array.

Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> char t

Now, the Bytes constructor behaves as we'd like it to.

> Bytes "foo";;
- : char t = Bytes "foo"

Now let's see what happens when we try to write the length function.

> let length t =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : char t -> int = <fun>

Disappointingly, we're again stuck with a function that doesn't have the right type. In particular, the compiler has decided that this function can only operate on char t, when we want it to work for arrays of any type.

The problem now is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml's type system will try to give each type variable a single type within a given function. But in this case, we need a type variable that can take on different types in different branches of a match statement.

We can do this by creating a locally-abstract type el to represent the type parameter of t (and the element type), and annotating t accordingly.

> let length (type el) (t:el t) =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : 'a t -> int = <fun>

Now we see that we get the right type. We can push this approach through to get a complete implementation.

> module Compact_array = struct

    type 'a t = | Array : 'a array -> 'a t
                | Bytes : bytes -> char t

    let of_bytes x = Bytes x
    let of_array x = Array x

    let length (type el) (t:el t) =
      match t with
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get (type el) (t:el t) i : el =
      match t with
      | Array a -> Array.get a i
      | Bytes s -> Bytes.get s i

    let set (type el) (t:el t) i (v:el) =
      match t with
      | Array a -> Array.set a i v
      | Bytes s -> Bytes.set s i v

  end;;

module Compact_array :
  sig
    type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

As I said at the beginning, this is really just an example of the more general theme. GADTs are about more than clever typed interpreters; they're a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high performance applications.

A lighter Core

We recently released a version of our open source libraries with a much anticipated change --- Async_kernel, the heart of the Async concurrent programming library, now depends only on Core_kernel rather than on Core.

This sounds like a dull, technical change, and it kind of is. But it's also part of a larger project to make our libraries more lightweight and portable, and so suitable for a wider array of users and applications.

We've actually been working on these issues for a while now, and this seems like a good time to review some of the changes we've made over the years, and what's still to come.

Reorganizing for portability

Core has always had dependencies on Unix, including OCaml's Unix library, as well as some other parts of the Unix environment, like the Unix timezone files. This has long been a problem for porting to Windows, but more recently, the issue has loomed for two other increasingly important platforms for OCaml: Javascript and Mirage.

To help fix this problem, in 2013 we released a library called Core_kernel, which is the portable subset of Core that avoids Unixisms as well as things like threads that don't match well with the Javascript and Mirage back-ends.

In the same vein, we refactored Async, our concurrent programming library, into a set of layers (modeled on the design of the similar Lwt library) that both clarified the design and separated out the platform-specific bits. Async_kernel is the lowest level and most portable piece, hosting the basic datastructures and abstractions. Async_unix adds a Unix-specific scheduler, and Async_extra builds further OS-specific functionality on top.

Until recently, the fly in this ointment was that Async_kernel still depended on Core, rather than Core_kernel, because only Core had a time library. Making Async_kernel only require Core_kernel was a bigger project than you might imagine, in the end leading us to change Timing_wheel, a core datastructure for Async and several other critical libraries at Jane Street, to use an integer representation of time instead of the float-based one from Core.

Already, some experiments are underway to take advantage of this change, including some internal efforts to get Async working under javascript, and external efforts to get cohttp's Async back-end to only depend on Async_kernel.

I'm hoping that yet more of this kind of work will follow.

Module Aliases

One long-running annoyance with OCaml is the lack of an effective namespace mechanism. For a long time, the only choice was OCaml's packed modules, which let you take a collection of modules and merge them together into one mega-module. Some kind of namespace mechanism is essential at scale, and so we used packed modules throughout our libraries.

Unfortunately, packed modules have serious downsides, both in terms of compilation time and executable sizes. We've been talking to people about this and looking for a solution for a long time. You can check out this epic thread on the platform list if you want to see some of the ensuing conversation.

A solution to this problem finally landed in OCaml 4.02, in the form of module aliases. I'll skip the detailed explanation (you can look here if you want to learn more), but the end result was great: our compilation times immediately went down by more than a factor of 3, and it gave us a path towards dropping packed modules altogether, thus reducing executable sizes and making incremental compilation massively more efficient.
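For readers who haven't seen the feature, here's a rough sketch of the idiom (module names here are made up):

(* mylib.ml -- a namespace built from module aliases rather than -pack.
   Compiled with -no-alias-deps, this file adds essentially no code of
   its own, and a client that refers to Mylib.Foo ends up depending
   only on the Mylib_foo compilation unit rather than on one giant
   packed module. *)
module Foo = Mylib_foo
module Bar = Mylib_bar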

The work on dropping packed modules has already landed internally, and will hopefully make it to the external release in a few months. The benefit to executable size is significant, with typical executables dropping in size by a factor of 2, but there is more to do. OCaml doesn't have aggressive dead code elimination, and that can lead to a lot of unnecessary stuff getting linked in. We're looking at some improvements we can make to cut down the dependency tree, but better dead code elimination at the compiler would really help.

Sharing basic types

Interoperability between Core and other OCaml libraries is generally pretty good: Core uses the same basic types (e.g., string, list, array, option) as other OCaml code, and that makes it pretty easy to mix and match libraries.

That said, there are some pain points. For example, Core uses a Result type (essentially, type ('a,'b) result = Ok of 'a | Error of 'b) quite routinely, and lots of other libraries use very similar types. Unfortunately, these libraries each have their own incompatible definitions.

The solution is to break out a simple type that the different libraries can share. After some discussion with the people behind some of the other libraries in question, I made a pull request to the compiler to add a result type to the stdlib.

This is a small thing, but small things matter. I hope that by paying attention to this kind of small issue, we can help keep interoperability between Core and the rest of the OCaml ecosystem smooth.

Eliminating camlp4

One concern I've heard raised about Core and Jane Street's other libraries is their reliance on camlp4. camlp4 is a somewhat divisive piece of infrastructure: it's long been the only decent way to do metaprogramming in OCaml, and as such has been enormously valuable; but it's also a complex and somewhat unloved piece of infrastructure that lots of people want to avoid.

camlp4 also makes tooling a lot more complicated, since there's no single syntax to target. Dev tools like ocp-indent and the excellent merlin have some terrible hacks to support some of the most common camlp4 syntax extensions, but the situation is clearly untenable.

You do need camlp4 to build Core, but you don't need camlp4 to use it, and in practice, that's good enough for most use cases. But for people who want to avoid camlp4 entirely, it's still a nuisance. Moreover, while you don't need camlp4 to use Core, it is convenient. For example, a lot of Core's idioms work best when you provide s-expression serializers for your types, and the sexplib syntax extension is an awfully convenient way to generate those functions.

Our plan is to simply eliminate our dependency on camlp4 entirely over the next 6 months, by switching to using ppx and extension points, a new approach to metaprogramming in OCaml that, like module aliases, landed in 4.02. We're currently rewriting all of our syntax extensions, and building tools to automatically migrate the code that depends on camlp4. People who want to continue to use the old camlp4 extensions are welcome to continue doing so, but we're cutting our dependency on them.
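To give a feel for the change, here's the same sexp-serializable type declared in the old camlp4 style and in the new ppx style (the two declarations below are alternatives, not meant to appear in the same file):

(* camlp4 (sexplib syntax extension): *)
type t = { symbol : string; price : float } with sexp

(* extension points / ppx (ppx_sexp_conv): *)
type t = { symbol : string; price : float } [@@deriving sexp]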


Even at the end of all this, we don't expect that Core and Async will suit everyone --- that's a hard bar to cross for any software package. But we do hope that through these efforts, an ever wider set of developers will be able to take advantage of the work we've done.

Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and ease of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn't considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let's talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the basic steps that are taken to merge two heads A and B.

  • Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
  • Compute the patch from that base point to one of the two heads, say, A.
  • Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn't apply cleanly to B. The result of this process is the merge.

The above discussion oversimplifies the story by assuming there's a well-defined GCA, but this just isn't always true. To see why, consider a repository starting with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.
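For the curious, here's a small OCaml sketch (purely illustrative) that computes the maximal common ancestors of two revisions in a toy encoding of the history above; running it on D and F yields both A and B:

module S = Set.Make (String)

(* [parents] maps each revision to the list of its parents. *)
let rec ancestors parents r =
  List.fold_left
    (fun acc p -> S.union acc (ancestors parents p))
    (S.singleton r)
    (try List.assoc r parents with Not_found -> [])

(* The MCAs are the common ancestors that are not strict ancestors of
   some other common ancestor. *)
let mcas parents a b =
  let common = S.inter (ancestors parents a) (ancestors parents b) in
  S.filter
    (fun r ->
       not (S.exists
              (fun r' -> r' <> r && S.mem r (ancestors parents r'))
              common))
    common

let () =
  let parents =
    [ "R", []; "A", ["R"]; "B", ["R"];
      "C", ["A"; "B"]; "D", ["C"];
      "E", ["A"; "B"]; "F", ["E"] ]
  in
  S.iter print_endline (mcas parents "D" "F")    (* prints A and B *)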

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amount to principled solutions, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It's tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss cross merges in the first place.

Avoiding the criss-cross merge

For those who haven't read my previous posts about how Iron approaches merges, I'll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

  • fe release, which moves changes from a child feature into a parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of the parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

  • fe rebase lets you merge a feature with its parents, effectively pulling changes from a parent feature into a child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

  • fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn't previously have one. e.g., calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, the quickcheck feature.

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question, or, more precisely, that we built a consistent service. Whether designed in from the start or discovered by chance, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature or rebasing the same feature simply isn't possible when there's a centralized monitor determining who goes first.

In retrospect, this shouldn't be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server does for you) can be used to exclude unwanted concurrent executions in a distributed system should surprise no one.

In the end, you can trace it all back to the CAP theorem: If you want progress while partitioned, you need to give up on consistency in some way. And criss cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release that affect how information is shared between features require you to be connected. I think it's a small price to pay to never have to deal with a criss-cross merge.

Making making better

We spend a lot of time and effort on training new people, and it never stops for long. Right now our winter-intern class is ending; in five months we'll have a slew of new interns to get up to speed, and a few months after that we'll have an incoming class of new hires.

A big part of our new-hire training is OCaml Bootcamp, a month-long immersion program for non-dev hires (mostly trading, but some from other areas like research, systems, and accounting). We don't think everyone at Jane Street should be a full-on software developer, but writing code is such a useful way to get things done that it's worth teaching the basics to a broad set of people.

Teaching programming, especially to people who are not planning on becoming software developers, is an opportunity to reflect on how unnecessarily hard programming can be. There's a huge learning curve as you struggle to learn your way around the Unix command-line, or figure out the key commands for controlling a 1970's era text editor like Emacs or Vi, or puzzle through the semantics of a distributed version control system like Mercurial. And all of that is before you even start writing code!

To me, this serves as a reminder of the importance of good tools. The quality of your tools can increase or decrease the steepness of the learning curve, and they also affect your day-to-day efficiency after you've climbed up that hill.

Tools are easy to undervalue. Most of our development time is spent, as it should be, on our first order problems -- writing the code that powers the systems that let us trade. And the connection between better tools and better trading can seem tenuous.

But really it's not tenuous at all. If you spend all your time on first order problems, you'll discover you're not solving them as fast as you should be. Getting things done effectively requires optimizing your own productivity, and to do that, you have to spend some time sharpening your tools.

And we've done a fair bit of sharpening. One recent example is Iron, a new code review and release management system that we started using last summer. Last year, we also rolled out a new build system called Jenga, which greatly simplified and sped up the process of compiling our code. Plus, we switched to a new version of OCaml, which includes a big set of improvements, some of which were specifically aimed at improving our development process. And we funded some former interns to improve Merlin, a fantastic tool that provides IDE-like features like context-sensitive autocompletion in a way that can be easily integrated into multiple editors.

Jane Street is a pretty small shop --- we have fewer than 65 full time developers --- but even at our modest scale, spending time on tools is a big win. But it's really about more than dev tools. Thinking about how to make the people around you more effective informs how you work more generally, changing how you design libraries, how you manage services, and how (and whether!) you write documentation.

And in addition to making for a more effective organization, it's also a more pleasant way to live.

13 Virtues

Very early on in his life, while on a lengthy voyage from London to Philadelphia, Ben Franklin created a system of thirteen virtues to live his life by. He spent the remainder of his days giving special focus to one virtue per week in a 13-week cycle, as well as noting the virtues he failed to live up to at the end of each day.

Over time he credited the system with making him more productive and more fulfilled.

My aspirations are not so lofty, but in the spirit of the new year, I present Ben's thirteen virtues as they relate to code review and discussion. Nothing here is meant to be taken as gospel, but together they give me a path towards the type of collaboration we value at Jane Street.

My simple hope is to waste less time, produce better code, and have fewer arguments over the next 12 months than I did over the last.

Temperance

Review thoroughly, but not to the point of exhaustion. Be mindful of the limits of review.

Silence

Say only things that clearly benefit the code; avoid trifling comments and tangents.

Order

Create the structure (files, modules and types) necessary to give every concept a clear place. Be suspicious of catchall modules.

Resolution

Respond to feedback and change requests quickly. Momentum is important.

Frugality

Don't waste people's time with frivolous review, careless comments, or code that isn't ready for review. Attention is expensive.

Industry

Prefer to respond with working code over additional commentary. Focus review on immediately productive outcomes instead of uncertain concerns.

Sincerity

Come to discussions with an innocent mind. Engage in code review with the clear goal of helping.

Justice

Weigh code decisions on the evidence at hand today, and not on personal preferences, prejudices, or obsolete past constraints.

Moderation

Avoid extremes in both style and approach. Incorporate strong views slowly.

Cleanliness

Spend time to lay out code in a clear and visually pleasing way. When reviewing, leave the code neater than you found it.

Tranquility

Don't become impassioned or incensed over trifles. Engage in all conversation with an open balanced tone and a sense of patience.

Chastity

Proliferate new ideas through the code base cautiously and be aware that even a good idea may not work in all places.

Humility

Take a modest view of your own contributions and feedback. Be unpretentious and respectful in your comments. Accept that you may be wrong.

Inspecting the Environment of a Running Process

Sometimes it's useful to be able to see the values of environment variables in running processes. We can use the following test program to see how well we can accomplish this:

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        int n;
        char *envstr;
        /* "%ms" asks scanf to allocate the string for us (POSIX 2008;
           older code used the equivalent GNU "%as" extension). */
        while ((n = scanf("%ms", &envstr)) != EOF) {
                if (n == 1)
                        putenv(envstr);
        }
        return 0;
}

This program just reads strings from stdin and then basically passes them on to putenv(3), so we have an easy way to modify our environment.

Now, let's run it with env -i to reset the environment to something pretty sparse:

[cperl@localhost ~]$ gcc -Wall t.c -o t
[cperl@localhost ~]$ env -i FOO=bar ./t

First, let's see what we can get out of /proc/{pid}/environ, as googling for this problem will undoubtedly point you in this direction (including ps eww, which reads /proc/{pid}/environ):

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

Great, so that looks like it's our answer!

Unfortunately, /proc/{pid}/environ only reflects the environment of the process when it started and does not reflect any calls that process might have made to putenv(3) or setenv(3) (you can experiment with the above program substituting in setenv(3) for putenv(3) and playing with overwrite to see what you get).

We can see that if we feed some data into our program, causing calls to putenv(3):

[cperl@localhost ~]$ env -i FOO=bar ./t
BAR=baz

And then check /proc/{pid}/environ again:

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

However, we can verify the data is really there if we attach with gdb and iterate over the environ(7) array directly:

[cperl@localhost ~]$ gdb ./t $(pgrep -x t)
...
(gdb) set $i = 0
(gdb) while (environ[$i] != 0)
 >print environ[$i++]
 >end
$1 = 0x7fffc8e42fec "FOO=bar"
$2 = 0x12d4080 "BAR=baz"

Unfortunately, I'm not aware of any other way to get this "dynamic environment info" (except for other ptrace-based solutions). Obviously attaching to production processes with gdb (or ptrace in general) isn't a great idea. Most of the time you'll probably be fine inspecting /proc/{pid}/environ and verifying (via source code inspection) that the process you care about doesn't make any calls to putenv(3) or setenv(3) for the variables whose values you are interested in.

If you have any better ideas about how to get this information, please share in the comments!
