No (functional) experience required

Jane Street is a serious functional programming shop. We use OCaml, a statically typed functional language, for almost everything, and we have what is probably the largest OCaml codebase anywhere.

This leads lots of people to think that they shouldn't even bother applying, under the assumption that we are only interested in hiring deep functional programming gurus. I think people get to this conclusion in part because they think of functional languages, especially those with fancy type systems, as arcane tools that can only be used effectively after deep study.

To the contrary, one of the reasons we started building production systems with OCaml was that it was relatively easy to understand, even for people with no formal CS background. Since then, we've had good experiences taking students with no functional experience at all and getting them to the point of being able to complete a project in just a few weeks. We also have a very successful "OCaml Bootcamp" program, where over four weeks, we train all of the incoming traders and many other non-developer hires on OCaml and our development tools and libraries. By the end, most of them are able to create useful applications.

All of this is to say that we don't go out of our way to hire people who are already familiar with functional programming. In practice, it's just not that hard for strong programmers to pick it up after they start.

That said, a solid majority of the developers we hire do come in with functional programming experience --- but that's because of their preferences, not ours. Programmers with an interest in functional languages have an extra reason to want to work here, and so we get an unusually high number of good applicants from that pool.

There's a more general lesson here: using well-loved tools is a good way of attracting (and retaining) great developers.

Introducing Incremental

I'm pleased to announce the release of Incremental (well-commented mli here), a powerful library for building self-adjusting computations, i.e., computations that can be updated efficiently when their inputs change.

At its simplest, you can think of a self-adjusting computation as a fancy spreadsheet. In a spreadsheet, each cell contains either simple data, or an equation that describes how the value in this cell should be derived from values in other cells. Collectively, this amounts to a graph-structured computation, and one of the critical optimizations in Excel is that when some of the cells change, Excel only recomputes the parts of the graph that depend on those changed cells.

What makes self-adjusting computation (or SAC) different from a spreadsheet is its dynamism. The structure of the computational graph in an SAC can change at runtime, in response to the changing input data.

This dynamism gives you quite a bit of flexibility, which can be used in different ways. Here are a few.

On-line combinatorial algorithms. Incremental is based on work by Umut Acar et al. on self-adjusting computation (that's where the term comes from); there, they were mostly interested in building efficient on-line versions of various combinatorial algorithms. In many cases, they could match the asymptotic complexity of custom on-line algorithms by fairly simple incrementalizations of all-at-once algorithms.

Incremental GUI construction. One simple and natural way to model a GUI application is to structure your display as a function that generates a view from some more abstract data model.

Having a function that constructs a view from scratch at every iteration is simple, but prohibitively expensive. But if you can write this function so that it produces an incrementalized computation, then you have a solution that is both simple and efficient. We've used this technique in a number of our UIs, to good effect.

This might remind you of how functional reactive programming (FRP) is used for construction of GUIs in languages like Elm. SAC and FRP have different semantics --- FRP is mostly concerned with time-like computations, and SAC is mostly about optimizing DAG-structured computations --- but they are nonetheless closely related, especially at the implementation level. You can see my post here for a description of the broader conceptual landscape that includes both FRP and SAC.

Configurable computations. An example that comes from our own work is risk calculations. Calculating measures of risk of a portfolio involves combining data from a complex collection of interdependent models. Each of these models is dependent both on live data from the markets, and on configurations determined by users.

A config change could merely tweak a coefficient, or it could change the overall structure of the computation, say by changing the list of factors used by a given model. Incremental allows you to build a computation that can update efficiently in response both to simple data changes and to more structural config changes, in one unified framework.

A taste of Incremental

It's hard to give a compelling example of Incremental in action in just a few lines of code, because what makes Incremental really useful is how it helps you build large and complex computations. Nonetheless, small examples can give you a sense of how the library works.

To that end, let's walk through a few small examples. To begin, we need to instantiate the Incremental functor.

open Core.Std
module Inc = Incremental_lib.Incremental.Make ()

Each instance thus generated is its own independent computational world. The Incremental functor is generative, meaning it mints fresh types each time it's applied, which prevents values from different incremental worlds from being mixed accidentally.
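
To make this concrete, here's a small sketch (hypothetical code, not from the library's documentation) of what the generativity buys you; Inc_a and Inc_b are two independently created instances:

module Inc_a = Incremental_lib.Incremental.Make ()
module Inc_b = Incremental_lib.Incremental.Make ()

let x = Inc_a.Var.watch (Inc_a.Var.create 1.)

(* [Inc_b.observe x] is rejected by the type checker: [x] has type
   [float Inc_a.t], but [Inc_b.observe] expects a [float Inc_b.t]. *)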

An Incremental computation always starts at its variables. Modifications to variables are how updates to input data are communicated to Incremental.

Let's write down a few variables corresponding to the dimensions of a rectangular prism.

module Var = Inc.Var

(* dimensions of a rectangular prism *)
let width_v = Var.create 3.
let depth_v = Var.create 5.
let height_v = Var.create 4.

We can use Var.watch to get the (trivial) incremental computations associated with each variable.

let width = Var.watch width_v
let depth = Var.watch depth_v
let height = Var.watch height_v

The following is an incremental computation of the base area of the prism, and of the volume.

let base_area =
  Inc.map2 width depth ~f:( *. )
let volume =
  Inc.map2 base_area height ~f:( *. )

In order to get information out of an incremental computation, we need to explicitly mark which nodes we want data from by creating observer nodes. Because it knows which nodes are observed, the framework can track what parts of the computation are still necessary to the results.

let base_area_obs = Inc.observe base_area
let volume_obs = Inc.observe volume

In order to force the computation to run, we need to explicitly call Inc.stabilize. Here's some code that uses stabilize to run the computation and then gets the information from the observers.

let () =
  let v = Inc.Observer.value_exn in
  let display s =
    printf "%20s: base area: %F; volume: %F\n"
      s (v base_area_obs) (v volume_obs)
  in
  Inc.stabilize ();
  display "1st stabilize";
  Var.set height_v 10.;
  display "after set height";
  Inc.stabilize ();
  display "2nd stabilize"

If we run this, we'll see the following output:

       1st stabilize: base area: 15.; volume: 60.
    after set height: base area: 15.; volume: 60.
       2nd stabilize: base area: 15.; volume: 150.

Note that setting the height isn't enough to change the observed values; we need a stabilization to make that happen.

That's a fairly trivial computation, and there certainly isn't much to incrementalize. Let's try something a little more complicated: a function for merging together an array of incrementals, using some commutative and associative operator like addition or max.

let rec merge ar ~f =
  if Array.length ar <= 1 then ar.(0)
  else
    let len = Array.length ar in
    let len' = len / 2 + len % 2 in
    let ar' =
      Array.init len' ~f:(fun i ->
        if i * 2 + 1 >= len then ar.(i*2)
        else Inc.map2 ar.(i*2) ar.(i*2+1) ~f)
    in
    merge ar' ~f;;

Because this is done using a binary tree as the dependency graph, the complexity of updating an element is log(n), where n is the size of the array. We can use this, for example, for computing an average:

let average ar =
  let sum = merge ar ~f:(+.) in
  Inc.map sum ~f:(fun s -> s /. float (Array.length ar))

This works, but we can do better, performance-wise at least, if our merge operation has an inverse. In that case, maintaining the sum can in principle be done in constant time, by first removing the old value and then adding in the new one. Incremental has a function for taking advantage of this structure.

let sum ar =
  Inc.unordered_array_fold ~f:(+.) ~f_inverse:(-.) ar;;

Now, let's say we want to do something a little more dynamic: in particular, what if we want to compute the average of a prefix of the given array? For that, we need to use the bind function, which allows us to produce new incremental nodes within an incremental computation.

let average_of_prefix ar length =
  Inc.bind length (fun length ->
    average (Array.init length ~f:(fun i -> ar.(i))))

The type of this function is float Inc.t array -> int Inc.t -> float Inc.t, so the length of the prefix is a fully fledged part of the incremental computation. As a result, the dependency structure of this computation changes dynamically, e.g., if the value of length is 7, then the computation only depends on length and the first 7 elements of the array.
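
As a quick usage sketch (hypothetical code, reusing the Inc and Var modules from above; the names data and length_v are made up for the example), here's how you might observe such a prefix average and then change the prefix length at runtime:

let () =
  let data = Array.init 10 ~f:(fun i -> Var.create (float i)) in
  let length_v = Var.create 3 in
  let avg = average_of_prefix (Array.map data ~f:Var.watch) (Var.watch length_v) in
  let avg_obs = Inc.observe avg in
  Inc.stabilize ();
  printf "average of first 3: %F\n" (Inc.Observer.value_exn avg_obs);
  (* Changing [length_v] rewires the dependency graph itself: the computation
     now depends on the first 7 elements rather than the first 3. *)
  Var.set length_v 7;
  Inc.stabilize ();
  printf "average of first 7: %F\n" (Inc.Observer.value_exn avg_obs)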

Hopefully this gives you enough of a sense of what Incremental is about to start thinking about where it might be useful for you. Note that the overhead of Incremental is not inconsiderable --- on my laptop, firing a single node takes on the order of 30ns, which is far more than, say, summing a couple of numbers together. Incremental tends to be useful when the computation that is put into a single node is large relative to that overhead, or when the computational graph is large relative to the sub-graph that needs to be recomputed. Our experience has been that there are plenty of applications in this space that can benefit from Incremental.

Converting a code base from camlp4 to ppx

As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the source code of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.

For instance, to translate code using pa_ounit and pa_test, we have to rewrite:

TEST = <:test_result< int >> ~expect:42 (f x)

to:

let%test _ = [%test_result: int] ~expect:42 (f x)

For small to medium projects, it is enough to just take a couple of hours and translate the source code by hand. But at Jane Street, where we have a huge OCaml code base making extensive use of camlp4, that is simply not realistic. So we needed a tool to do it for us.

Writing a tool to automatically convert the syntax

Since the output of such a tool has to be accepted as the new source code that is committed to our repository, it must preserve the layout of the original file as much as possible and, of course, keep the comments. This means that any approach using an AST pretty-printer would be extremely complex.

The path we chose is to textually replace the foreign syntaxes in the original file with the new ones. One could imagine doing that with a tool such as sed, awk, or perl; however, doing it properly would be tedious, and it would be pretty much impossible to be 100% sure it would never translate things it is not supposed to. Plus, writing perl is not as fun as writing OCaml programs.

Instead, there is an easy way to find the foreign syntaxes: using camlp4 itself. To substitute the text of the foreign syntaxes, the only thing we need to know is their location in the original file, and camlp4 can help us with that.

Writing dummy camlp4 syntax extensions

The idea is to write, for each camlp4 syntax extension, a dummy one that defines the same grammar productions as the real one, but that, instead of generating code, simply records substitutions at certain locations.

Then we do the following:

  • parse a file with camlp4 and our dummy syntax extensions
  • apply all the recorded substitutions to the original file

This approach has the advantage of interpreting the original file in exactly the same way as our regular syntax extensions do, giving us good confidence that we did not change the syntactic constructions by mistake.

To do so we define this API:

(** [replace loc repl] records a text substitution that replaces the
    portion of text pointed to by [loc] with [repl]. *)
val replace : Loc.t -> string -> unit
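
For illustration, here's a rough sketch (not the actual tool's internals) of how such recorded substitutions could be applied to the original file contents, once each Loc.t has been reduced to a pair of character offsets:

(* Apply (start, stop, replacement) substitutions to [original], assuming they
   don't overlap.  Applying them from right to left keeps earlier offsets valid. *)
let apply_substitutions original subs =
  let subs = List.sort (fun (s1, _, _) (s2, _, _) -> compare s2 s1) subs in
  List.fold_left
    (fun text (start, stop, repl) ->
      String.sub text 0 start
      ^ repl
      ^ String.sub text stop (String.length text - stop))
    original subs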

Then writing a dummy camlp4 syntax extension is pretty easy. For instance, for a subset of pa_ounit:

EXTEND Gram
  GLOBAL: str_item;

  test: [[ "TEST" -> replace _loc "let%test" ]];

  name_equal:
    [[ `STRING _; "=" -> ()
     |            "=" -> replace _loc "_ ="
     ]];

  str_item:
    [[ test; name_equal; expr -> <:str_item< >>
    ]];
END

On the fly conversion and diffing the generated code

Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:

  • for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
  • we processed the original file using camlp4 and the translated one using our ppx rewriters
  • in both cases we saved the output of -dparsetree (a human-readable version of the internal OCaml AST) and -dsource (pretty-printed code)
  • we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource

This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.

We have quite a lot of code in our camlp4 syntax extensions, and converting it to ppx was a long, mechanical, and thus error-prone job. Given that, this diffing turned out to be really helpful for finding errors.

The camlp4 syntax is not quite the OCaml syntax

While using this, we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:

let _ x = x

let f l = List.map l ~f:fun x -> x + 1

These were quite easy to fix automatically as well, using camlp4-to-ppx.

Github repo and extension

We published a slightly modified version of this tool on GitHub.

The method we used doesn't work out of the box with all syntax extensions. For instance, to convert code using lwt.syntax, some more work needs to be done on camlp4-to-ppx. But it is a good starting point.

CPU Registers and OCaml

Even though registers are a low-level CPU concept, having some knowledge about them can help write faster code. Simply put, a CPU register is storage for a single variable. The CPU can keep data in memory, in cache, or in registers, and registers are often much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables as it can in registers.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

exception Not_of_same_length

let stats xs ys =
  let len = Array.length xs in
  if len <> Array.length ys then
    raise Not_of_same_length;
  let sx = ref 0 in
  let sy = ref 0 in
  let sxx = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let sax = ref 0 in
  let say = ref 0 in
  for i = 0 to len - 1 do
    let x = Array.unsafe_get xs i in
    let y = Array.unsafe_get ys i in
    let ax = abs x in
    let ay = abs y in
    let xx = x * x in
    let yy = y * y in
    let xy = x * y in
    sx := !sx + x;
    sy := !sy + y;
    sxx := !sxx + xx;
    syy := !syy + yy;
    sxy := !sxy + xy;
    sax := !sax + ax;
    say := !say + ay;
  done;
  !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines produces code 1.5-2x faster:

    let x = Array.unsafe_get xs i in
    sx := !sx + x;
    let xx = x * x in
    sxx := !sxx + xx;
    let ax = abs x in
    sax := !sax + ax;
    let y = Array.unsafe_get ys i in
    sy := !sy + y;
    let xy = x * y in
    sxy := !sxy + xy;
    let ay = abs y in
    say := !say + ay;
    let yy = y * y in
    syy := !syy + yy;

Why is that? The CPU has just a few registers:

  • 13 which can contain integers and pointers to arrays and records
  • 16 which can contain floating point numbers

If the code has more than that many variables, the OCaml compiler has to park the extra variables in memory, and this parking is called spilling. Actually, spilling may happen even when there are fewer variables because, for example, some operations like integer multiplication use extra registers.

Therefore, it's good to try to keep the number of frequently used variables to 13 or fewer, or to rearrange the code so that fewer variables are used at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled because it is not known whether the called function will need those registers. That can often be the largest penalty when calling a function, assuming the function is not inlined.

Let's change the previous function:

let stats xs ys =
  let sx  = ref 0 in
  let sax = ref 0 in
  let sxx = ref 0 in
  let sy  = ref 0 in
  let say = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let len = Array.length xs in
  if len <> Array.length ys then
    failwith (Printf.sprintf "Arrays not of the same length: %d %d"
      len (Array.length ys));
  for i = 0 to len - 1 do
    ... 

This produces 1.35x slower code simply because the OCaml compiler spills all of the variables because of the call to Printf.sprintf. In each iteration, OCaml will pull sx from memory and store it back.

It's a pity, because this is actually not necessary: OCaml would only need to pull sx from memory and store it back once, not in each iteration. It looks like that could be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from its input parameters, from its closure environment, and from memory. Input parameters are normally stored in registers, so using input parameters will generally be slightly faster than using the closure environment or memory. This can come in handy in recursive functions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e =
  let len = Array.length a in
  let rec loop i =
    if i < len then begin
      if Array.unsafe_get a i = e then Some i
      else loop (i + 1)
    end else None
  in
  loop 0

The variables e and len are pulled from the closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e =
  let rec loop e len i =
    if i < len then begin
      if e = Array.unsafe_get a i then Some i
      else loop e len (i + 1)
    end else None
  in
  loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation

Reverse web proxy in ~50 lines of BASH

In the spirit of reinventing the wheel for fun, I hacked this together as a quick challenge to myself last week. It's a little rough around the edges, but I thought it was too cute not to share. If you have any bug fixes, please post them in the comments.

#!/bin/bash

set -e -u

function usage() {
  echo "USAGE: ${0} parent {port} {lower_bound_backend_port} {upper_bound_backend_port}"
}

[[ $# -lt 1 ]] && usage && exit 1

mode=${1};shift
case ${mode} in
  parent)
    PORT=${1};shift
    LOWER=${1};shift
    UPPER=${1};shift
    socat TCP-LISTEN:${PORT},fork,reuseaddr "EXEC:${0} child ${LOWER} ${UPPER}"
    ;;
  child)
    LOWER=${1};shift
    UPPER=${1};shift
    COUNT=0
    PORT=$(shuf -i ${LOWER}-${UPPER} -n 1)
    let "RANGE = UPPER - LOWER"
    SUCCESS=false
    while ((COUNT <= RANGE)) && ! ${SUCCESS}; do
      set +e
      if socat STDIN TCP:127.0.0.1:${PORT},connect-timeout=2; then
        SUCCESS=true
        break
      else
        echo "unable to connect to port ${PORT}" >&2
      fi
      set -e
      let COUNT+=1
      let PORT+=1
      if ((PORT > UPPER )); then
        let 'PORT = LOWER'
      fi
    done
    if ! ${SUCCESS}; then
      echo "HTTP/1.1 500 Internal Server Error"
      echo
      echo "All REDACTED servers are down.  Please report to REDACTED@janestreet.com."
      exit 1
    fi
    ;;
  *)
    usage
    exit 1
    ;;
esac

Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but it makes for a less pleasant coding experience (and code that is trickier and harder to reason about). So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, which means that collection of the major heap can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do pretty well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.

Clocking

One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right, since if you run them when the minor heap is too small, then you'll end up accidentally promoting a rather large fraction of your minor allocations.

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record every phase of the collection process in quite a detailed way: it tells you the phase (marking vs. sweeping) and the sub-phase, and keeps track of a collection of useful counters. This is available in the instrument branch.

Aging

Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen in the case that the minor heap is bigger than some configured level, to avoid too much false promotion; but a major collection can be done at any time.
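
Here's a minimal sketch of such a job (not the actual code we use), written against the stock Gc and Unix modules; is_busy and minor_heap_nearly_full stand in for whatever application-specific heuristics you'd plug in:

(* Periodically wake up and, if the application looks idle, do some GC work. *)
let gc_background_loop ~is_busy ~minor_heap_nearly_full =
  while true do
    Unix.sleepf 0.05;   (* wake up every 50ms *)
    if not (is_busy ()) then
      if minor_heap_nearly_full () then Gc.minor ()
      else
        (* [0] means "use the automatically computed slice size".  Note that in
           the stock runtime this also forces a minor collection, as discussed
           above; the low-latency branch lets you separate the two. *)
        ignore (Gc.major_slice 0)
  done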

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets for keeping track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket to negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
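
Here's a toy model of that accounting scheme (purely illustrative; the real logic lives in the runtime and differs in detail):

(* [n] buckets of pending major-collection work, kept in a circular buffer. *)
type work_buckets =
  { buckets : float array
  ; mutable head : int  (* index of the front-most bucket *)
  }

let create n = { buckets = Array.make n 0.; head = 0 }

(* Work discovered at a minor collection is charged to the front bucket. *)
let add_work t w = t.buckets.(t.head) <- t.buckets.(t.head) +. w

(* An application-forced slice takes work away from the front bucket,
   potentially driving it negative. *)
let credit_forced_slice t w = add_work t (-. w)

(* At each automatic slice, do the front bucket's work if it's positive (up to
   [max_work]), then spread the leftover (positive or negative) uniformly over
   the next [n] buckets. *)
let next_slice t ~max_work =
  let n = Array.length t.buckets in
  let pending = t.buckets.(t.head) in
  let work_done = if pending > 0. then min pending max_work else 0. in
  let leftover = pending -. work_done in
  t.buckets.(t.head) <- 0.;
  t.head <- (t.head + 1) mod n;
  for i = 0 to n - 1 do
    let j = (t.head + i) mod n in
    t.buckets.(j) <- t.buckets.(j) +. leftover /. float_of_int n
  done;
  work_done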

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free lists, for a configurable set of sizes; e.g., one could have a different free list for blocks with 1, 2, 3, and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between the C call allocating OCaml data and passing it back, it has to make sure that OCaml's Garbage Collector doesn't collect it, as the Garbage Collector can be triggered during the C call. There's an intricate mechanism that ensures this, part of which are the CAMLparam, CAMLlocal and CAMLreturn macros.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare is not smart enough to avoid branch mispredictions for floats. Moving the comparison to C speeds it up a little bit. "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be "double", CAMLparam, CAMLlocal and CAMLreturn should be avoided, and the primitive should be declared with "float" and given both bytecode and native implementations:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so it's impossible to use "float". Is it still possible to avoid boxing? The answer is yes: have the C function return a float, and convert it to an int on the OCaml side.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)

Why GADTs matter for performance

When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn't particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.

Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn't seem particularly relevant for the kind of systems programming that I think about.

But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we've found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.

The Problem of Polymorphism

I'd like to walk through a simple example that illustrates this aspect of GADTs, but first, a few words about OCaml's memory representation. OCaml's polymorphism is in an important way built on that memory representation. In particular, consider a simple polymorphic function like List.iter, which has the following type:

val iter: 'a list -> f:('a -> unit) -> unit

The polymorphic type tells you that List.iter can operate on lists of any type, and in OCaml, this is achieved with a single compiled version of the code. This is possible because the memory representation of the elements of a list is uniform: you can always refer to an OCaml value in a single word, either as a pointer to a heap-allocated value, or as an immediate that fits inside that word.

That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There's actually some special magic in the compiler for float arrays, though this is probably more trouble than it's worth, as described by Alain Frisch here. But let's ignore float arrays for now.)

OCaml does have a tighter representation for byte arrays, called bytes. But it's a completely different type, and so building a general-purpose data structure that uses bytes when it would make sense, and ordinary arrays otherwise, is a little awkward.

Controlling memory representation without GADTs

Let's see what happens if we try to design (without GADTs) an array type that sometimes uses the general array representation and sometimes uses bytes.

You could imagine representing such a value using an ordinary variant.

type 'a t = | Array of 'a array
            | Bytes of bytes

We could then implement each of the operations we want on our new array type, implementing each operation differently depending on the particular representation. Let's see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.

> module Compact_array = struct

    type 'a t = | Array of 'a array
                | Bytes of bytes

    let of_bytes x : char t = Bytes x
    let of_array x = Array x

    let length = function
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get t i =
      match t with
      | Array a -> a.(i)
      | Bytes s -> s.[i]

    let set t i v =
      match t with
      | Array a -> a.(i) <- v
      | Bytes s -> s.[i] <- v

  end;;

module Compact_array :
  sig
    type 'a t = Array of 'a array | Bytes of bytes
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : char t -> int -> char
    val set : char t -> int -> char -> unit
  end

This seems pretty good at first glance, but the inferred types aren't quite what we want. In particular, get and set only work with Compact_arrays containing characters. If you think about how type inference works, it's not really all that surprising. Consider the code for get:

let get t i =
  match t with
  | Array a -> Array.get a i
  | Bytes s -> Bytes.get s i

The OCaml compiler is looking for a single type to assign to the return value for all the cases of the match. Given that Bytes.get always returns a char, Compact_array.get will be restricted to only returning a char.

One way to work around this problem is to essentially implement what we want as a poor-man's object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here's how that looks.

module Compact_array = struct

  type 'a t = { len: unit -> int
              ; get: int -> 'a
              ; set: int -> 'a -> unit
              }

  let of_string s =
    { len = (fun () -> String.length s)
    ; get = (fun i -> String.get s i)
    ; set = (fun i x -> String.set s i x)
    }

  let of_array a =
    { len = (fun () -> Array.length a)
    ; get = (fun i -> Array.get a i)
    ; set = (fun i x -> Array.set a i x)
    }

  let length t = t.len ()
  let get t i = t.get i
  let set t i x = t.set i x

end;;

module Compact_array :
  sig
    type 'a t = {
      len : unit -> int;
      get : int -> 'a;
      set : int -> 'a -> unit;
    }
    val of_string : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

This more or less solves the problem, but it's still not really the memory representation we want. In particular, we have to allocate three closures for each Compact_array.t, and this number of closures will only go up as we add more functions whose behavior depends on the underlying array.

GADTs to the rescue

Let's go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we're not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> 'a t

The syntax of this declaration suggests thinking about variant constructors like Array or Bytes as functions from the constructor arguments to the type of the resulting value, with the thing to the right of the : roughly corresponding to the type signature of the constructor.

Note that for the Array constructor, the value of the type parameter 'a depends on the type of the argument:

> Array [|1;2;3|];;
- : int t = Array [|1; 2; 3|]
> Array [|"one";"two";"three"|];;
- : string t = Array [|"one"; "two"; "three"|]

But for the Bytes constructor, the type parameter 'a is left free.

  1. > Bytes "foo";;
  2. - : 'a t = Bytes "foo"

This is really the problematic case, because for Bytes "foo" we'd like the parameter 'a to be char, since in the Bytes case that's the element type of our array.

Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> char t

Now, the Bytes constructor behaves as we'd like it to.

  1. > Bytes "foo";;
  2. - : char t = Bytes "foo"

Now let's see what happens when we try to write the length function.

> let length t =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : char t -> int = <fun>

Disappointingly, we're again stuck with a function that doesn't have the right type. In particular, the compiler has decided that this function can only operate on char t, when we want it to work for arrays of any type.

But the problem now is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml's type inference will try to assign a single type to a type variable across the whole function. But in this case, we need a type variable that can stand for different types in different branches of the match statement.

We can do this by creating a locally-abstract type el to represent the type parameter of t (and the element type), and annotating t accordingly.

> let length (type el) (t:el t) =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : 'a t -> int = <fun>

Now we see that we get the right type. We can push this approach through to get a complete implementation.

> module Compact_array = struct

    type 'a t = | Array : 'a array -> 'a t
                | Bytes : bytes -> char t

    let of_bytes x = Bytes x
    let of_array x = Array x

    let length (type el) (t:el t) =
      match t with
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get (type el) (t:el t) i : el =
      match t with
      | Array a -> Array.get a i
      | Bytes s -> Bytes.get s i

    let set (type el) (t:el t) i (v:el) =
      match t with
      | Array a -> Array.set a i v
      | Bytes s -> Bytes.set s i v

  end;;

module Compact_array :
  sig
    type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

As I said at the beginning, this is really just an example of the more general theme. GADTs are about more than clever typed interpreters; they're a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high performance applications.

A lighter Core

We recently released a version of our open source libraries with a much anticipated change --- Async_kernel, the heart of the Async concurrent programming library, now depends only on Core_kernel rather than on Core.

This sounds like a dull, technical change, and it kind of is. But it's also part of a larger project to make our libraries more lightweight and portable, and so suitable for a wider array of users and applications.

We've actually been working on these issues for a while now, and this seems like a good time to review some of the changes we've made over the years, and what's still to come.

Reorganizing for portability

Core has always had dependencies on Unix, including OCaml's Unix library, as well as some other parts of the Unix environment, like the Unix timezone files. This has long been a problem for porting to Windows, but more recently, the issue has loomed for two other increasingly important platforms for OCaml: Javascript and Mirage.

To help fix this problem, in 2013 we released a library called Core_kernel, which is the portable subset of Core that avoids Unixisms as well as things like threads that don't match well with the Javascript and Mirage back-ends.

In the same vein, we refactored Async, our concurrent programming library, into a set of layers (modeled on the design of the similar Lwt library) that both clarified the design and separated out the platform-specific bits. Async_kernel is the lowest-level and most portable piece, hosting the basic datastructures and abstractions. Async_unix adds a Unix-specific scheduler, and Async_extra builds further OS-specific functionality on top.

Until recently, the fly in this ointment was that Async_kernel still depended on Core, rather than Core_kernel, because only Core had a time library. Making Async_kernel only require Core_kernel was a bigger project than you might imagine, in the end leading us to change Timing_wheel, a core datastructure for Async and several other critical libraries at Jane Street, to use an integer representation of time instead of the float-based one from Core.

Already, some experiments are underway to take advantage of this change, including some internal efforts to get Async working under Javascript, and external efforts to get cohttp's Async back-end to depend only on Async_kernel.

I'm hoping that yet more of this kind of work will follow.

Module Aliases

One long-running annoyance with OCaml is the lack of an effective namespace mechanism. For a long time, the only choice was OCaml's packed modules, which let you take a collection of modules and merge them together into one mega-module. Some kind of namespace mechanism is essential at scale, and so we used packed modules throughout our libraries.

Unfortunately, packed modules have serious downsides, both in terms of compilation time and executable sizes. We've been talking to people about this and looking for a solution for a long time. You can check out this epic thread on the platform list if you want to see some of the ensuing conversation.

A solution to this problem finally landed in OCaml 4.02, in the form of module aliases. I'll skip the detailed explanation (you can look here if you want to learn more), but the end result was great: our compilation times immediately went down by more than a factor of 3, and it gave us a path towards dropping packed modules altogether, thus reducing executable sizes and making incremental compilation massively more efficient.
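
For those who haven't seen the feature, here's a tiny sketch of the idiom with made-up module names: instead of -pack'ing Mylib__Foo and Mylib__Bar into a single Mylib mega-module, you write a small namespace module consisting only of aliases and compile it with -no-alias-deps:

(* mylib.ml: the namespace module.  Each line is an alias, not a copy, so
   clients refer to [Mylib.Foo] but only the members they actually use become
   real dependencies. *)
module Foo = Mylib__Foo
module Bar = Mylib__Bar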

The work on dropping packed modules has already landed internally, and will hopefully make it to the external release in a few months. The benefit to executable size is significant, with typical executables dropping in size by a factor of 2, but there is more to do. OCaml doesn't have aggressive dead code elimination, and that can lead to a lot of unnecessary stuff getting linked in. We're looking at some improvements we can make to cut down the dependency tree, but better dead code elimination at the compiler would really help.

Sharing basic types

Interoperability between Core and other OCaml libraries is generally pretty good: Core uses the same basic types (e.g., string, list, array, option) as other OCaml code, and that makes it pretty easy to mix and match libraries.

That said, there are some pain points. For example, Core uses a Result type (essentially, type ('a,'b) result = Ok of 'a | Error of 'b) quite routinely, and lots of other libraries use very similar types. Unfortunately, these libraries each have their own incompatible definitions.

The solution is to break out a simple type that the different libraries can share. After some discussion with the people behind some of the other libraries in question, I made a pull request to the compiler to add a result type to the stdlib.

This is a small thing, but small things matter. I hope that by paying attention to this kind of small issue, we can help keep interoperability between Core and the rest of the OCaml ecosystem smooth.

Eliminating camlp4

One concern I've heard raised about Core and Jane Street's other libraries is their reliance on camlp4. camlp4 is a somewhat divisive piece of infrastructure: it's long been the only decent way to do metaprogramming in OCaml, and as such has been enormously valuable; but it's also a complex and somewhat unloved piece of infrastructure that lots of people want to avoid.

camlp4 also makes tooling a lot more complicated, since there's no single syntax to target. Dev tools like ocp-indent and the excellent merlin have some terrible hacks to support some of the most common camlp4 syntax extensions, but the situation is clearly untenable.

You do need camlp4 to build Core, but you don't need camlp4 to use it, and in practice, that's good enough for most use cases. But for people who want to avoid camlp4 entirely, it's still a nuisance. Moreover, while you don't need camlp4 to use Core, it is convenient. For example, a lot of Core's idioms work best when you provide s-expression serializers for your types, and the sexplib syntax extension is an awfully convenient way to generate those functions.

Our plan is to simply eliminate our dependency on camlp4 entirely over the next 6 months, by switching to using ppx and extension points, a new approach to metaprogramming in OCaml that, like module aliases, landed in 4.02. We're currently rewriting all of our syntax extensions, and building tools to automatically migrate the code that depends on camlp4. People who want to continue to use the old camlp4 extensions are welcome to continue doing so, but we're cutting our dependency on them.


Even at the end of all this, we don't expect that Core and Async will suit everyone --- that's a hard bar to cross for any software package. But we do hope that through these efforts, an ever wider set of developers will be able to take advantage of the work we've done.

Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and ease of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn't considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let's talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the basic steps that are taken to merge two heads A and B.

  • Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
  • Compute the patch from that base point to one of the two heads, say, A.
  • Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn't apply cleanly to B. The result of this process is the merge.

The above discussion oversimplifies the story by assuming there's a well-defined GCA, but this just isn't always true. To see why, consider a repository starting with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.
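
To make "multiple maximal common ancestors" concrete, here's a small, purely illustrative OCaml sketch (nothing like what hg or git actually do) that computes the MCAs of two revisions from a table mapping each revision to its parents; on the history above, the MCAs of D and F come out as A and B:

module S = Set.Make (String)

(* All strict ancestors of [rev], given a hashtable from revision to parents. *)
let rec ancestors parents rev =
  let ps = match Hashtbl.find_opt parents rev with Some l -> l | None -> [] in
  List.fold_left (fun acc p -> S.union acc (S.add p (ancestors parents p))) S.empty ps

(* Common ancestors of [a] and [b] that are not themselves strict ancestors of
   another common ancestor. *)
let maximal_common_ancestors parents a b =
  let anc r = S.add r (ancestors parents r) in
  let common = S.inter (anc a) (anc b) in
  S.filter (fun r -> not (S.exists (fun r' -> S.mem r (ancestors parents r')) common)) common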

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amounts to a principled solution, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It's tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss-cross merges in the first place.

Avoiding the criss-cross merge

For those who haven't read my previous posts about how Iron approaches merges, I'll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

  • fe release moves changes from a child feature into a parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of the parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

  • fe rebase lets you merge a feature with its parent, effectively pulling changes from the parent feature into the child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

  • fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn't previously have one. For example, calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, the quickcheck feature.

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question, or, more precisely, that we built a consistent service.

Though we discovered it somewhat by chance, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature or rebasing the same feature simply isn't possible when there's a centralized monitor determining who goes first.

In retrospect, this shouldn't be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server does for you) can be used to exclude unwanted concurrent executions in a distributed system should surprise no one.

In the end, you can trace it all back to the CAP theorem: If you want progress while partitioned, you need to give up on consistency in some way. And criss-cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release that affect how information is shared between features require you to be connected. I think it's a small price to pay to never have to deal with a criss-cross merge.
