Quickcheck for Core

Automated testing is a powerful tool for finding bugs and specifying correctness properties of code. Haskell's Quickcheck library is the most well-known automated testing library, based on over 15 years of research into how to write property-based tests, generate useful sources of inputs, and report manageable counterexamples. Jane Street's Core library has not had anything comparable up until now; version 113.00 of Core finally has a version of Quickcheck, integrating automated testing with our other facilities like s-expression reporting for counterexample values, and support for asynchronous tests using Async.


There are at least two other Quickcheck implementations on OPAM. Why write implementation N+1?

Before working at Jane Street, I did some research relating to randomized testing (Eastlund 2009, Klein et al. 2012). In both cases, users of the software packages involved found it easy to write tests that use random data, but hard to write random distributions that actually produce values that are useful to test.

This Quickcheck clone started as an attempt to address those concerns by building tools for tuning random distributions of values. One way I've done that is by building in tools for "remembering" sets of chosen values; for example, random tests won't repeat values, and it is easy to build distributions that generate sets or sorted lists and maintain uniqueness. This design is still in its early stages; I don't know if these are the sort of tools that will be needed most, and I'm eager to get feedback from more users. The library has also evolved to integrate Quickcheck-style testing with the Jane Street Core and Async libraries. Over time I hope this will also produce useful random distributions for more of the Core types, so they will be easily testable.


For those unfamiliar with Haskell's Quickcheck, the library allows users to write tests of some property of a function, then test that property on many automatically-generated input values. For example, we might want to test that the optimized implementation of list-append in Core is associative:

TEST_UNIT "associativity of list append" =
  Quickcheck.test Quickcheck.Generator.(tuple3 (list int) (list int) (list int))
    ~sexp_of:<:sexp_of< int list * int list * int list >>
    ~f:(fun (xs, ys, zs) ->
      <:test_eq< int list >>
        (List.append xs (List.append ys zs))
        (List.append (List.append xs ys) zs))

The test above randomly generates three lists of integers, appends them together two different ways, and tests that the results are equal. This process is repeated with different randomly chosen lists each time, until an error is reported or the default trial count is reached. Let's break down the parts of the code here.

  • TEST_UNIT is camlp4 syntax for unit tests.
  • Quickcheck.test is the main entry point for running a test using Quickcheck. It takes two required arguments. The first is a generator, specifying the probability distribution of values to choose from when generating inputs for the test. The second, ~f, is a function that consumes the generated input values and runs a test. The function returns () if successful and raises an exception otherwise.
  • Quickcheck.Generator.(tuple3 (list int) (list int) (list int)) constructs the generator that we want to use here. Most of the functions in the Quickcheck.Generator module are named after the types they generate; here, default probability distributions for lists of ints, combined using tuple3.
  • We provide the optional named argument ~sexp_of to Quickcheck.test. This argument is used to render the first generated value that triggers an error. The <:sexp_of< ... >> expression is camlp4 syntax for the default sexp conversion for a type.
  • The final argument to Quickcheck.test is a function that takes the tuples of lists produced by our generator, appends them two different ways, and compares the output. <:test_eq< ... >> is camlp4 syntax for an equality test.

The example above uses the s-expression conversions and camlp4 syntax extensions that are common in Jane Street's libraries, but these things are not necessary for using Quickcheck. Quickcheck.test just needs a generator built from the functions in Quickcheck.Generator and a function that raises an exception on failure, and it will return () if successful or raise an exception describing the nature of the failure if not.
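
For instance, here's what the same test might look like with no syntax extensions at all; this is a sketch using only Quickcheck.test, the default generators, and ordinary OCaml (polymorphic equality is fine for lists of ints):

let () =
  Quickcheck.test
    Quickcheck.Generator.(tuple3 (list int) (list int) (list int))
    ~f:(fun (xs, ys, zs) ->
      let l = List.append xs (List.append ys zs) in
      let r = List.append (List.append xs ys) zs in
      (* raise on failure; Quickcheck reports the exception *)
      if l <> r then failwith "list append is not associative")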


Generators

The primary data structure used by Quickcheck is the generator, or 'a Quickcheck.Generator.t. This corresponds to an implementation of the Arbitrary type class in Haskell's Quickcheck. Primarily, a generator represents a random distribution of values of type 'a, although in our implementation there is a little more metadata besides that under the hood. The Quickcheck.Generator module provides default distributions of several types, and tools for creating more distributions or customizing the existing ones.

In our example above, we generated three lists of integers using the following expression.

Quickcheck.Generator.(tuple3 (list int) (list int) (list int))

Looking at the implementation of Core.Std.List.append, we can see that the implementation works in chunks of 5 elements, and changes behavior after 1000 chunks. So we might want to change our generator to make sure we get lists of the lengths we want to test.

let open Quickcheck.Generator in
let list_int = list int ~length:(`Between_inclusive (4900, 5100)) in
tuple3 list_int list_int list_int

Some experimentation might show us that this still doesn't hit the list lengths we want as often as we want. The Quickcheck.Generator.int_between function, however, is documented as stressing boundary conditions, so we should be able to use it to get values at the upper and lower ends of the range we want. Here, it helps us that generators form a monad: if we combine generators using monadic bind, we get a weighted composition of their probability distributions. We can use that to first generate lengths for our lists, then use those randomly-generated lengths to build generators for the lists themselves.

let open Quickcheck.Generator in
let list_int =
  int_between ~lower_bound:(Incl 4900) ~upper_bound:(Incl 5100)
  >>= fun len ->
  list int ~length:(`Exactly len)
in
tuple3 list_int list_int list_int

Now we have a generator for three lists of integers, each list with a length between 4900 and 5100 inclusive, weighted toward the ends of that range. This is probably sufficient for our purposes. But if we want to go further, and decide that we need a very specific probability distribution, we can build one from scratch ourselves. Here is a rather idiosyncratic example that demonstrates the tools available in Quickcheck.Generator.

let open Quickcheck.Generator in
let rec ranges_of_five_between lower upper =
  if upper - lower < 10
  then of_list (List.range lower upper ~start:`inclusive ~stop:`inclusive)
  else weighted_union
         [ 5., singleton (lower + 0)
         ; 4., singleton (lower + 1)
         ; 3., singleton (lower + 2)
         ; 2., singleton (lower + 3)
         ; 1., singleton (lower + 4)
         ; 1., of_fun (fun () -> ranges_of_five_between (lower + 5) (upper - 5))
         ; 1., singleton (upper - 4)
         ; 2., singleton (upper - 3)
         ; 3., singleton (upper - 2)
         ; 4., singleton (upper - 1)
         ; 5., singleton (upper - 0)
         ]
in
let list_int =
  ranges_of_five_between 4900 5100
  >>= fun len ->
  list int ~length:(`Exactly len)
in
tuple3 list_int list_int list_int

This example uses a few more functions from Quickcheck.Generator. The of_list function takes a list of values and produces a generator that makes a uniform choice among them. weighted_union creates a probability distribution representing a weighted choice among the probability distributions of the associated sub-generators. singleton produces constant-valued generators, and of_fun produces a lazily-evaluated (but not memoized) generator. (Memoizing during random testing causes some unfortunate space leaks; it is important to be able to release resources after a batch of tests.) While this peculiar generator is probably not of practical use, it shows that when we need to, we can dig down into the interface and build whatever probability distribution we want.

Of course, it is also useful to construct generators for new types.

type bst = Leaf | Node of bst * int * bst

let gen : bst Quickcheck.Generator.t =
  let open Quickcheck.Generator in
  recursive (fun self ->
    let node =
      self >>= fun l ->
      int  >>= fun k ->
      self >>| fun r ->
      Node (l, k, r)
    in
    union [ singleton Leaf; node ])

The function Quickcheck.Generator.recursive is a fixed-point generator that helps build simply-recursive generators that need to invoke themselves and don't have additional arguments. The function union is like weighted_union, but for uniform choice.
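
As a quick usage sketch (my own example, not from the library), we can feed gen straight to Quickcheck.test to check a structural invariant that every such tree satisfies, namely that the number of leaves is one more than the number of nodes:

let rec counts = function
  | Leaf -> (1, 0)
  | Node (l, _, r) ->
    let (leaves_l, nodes_l) = counts l in
    let (leaves_r, nodes_r) = counts r in
    (leaves_l + leaves_r, nodes_l + nodes_r + 1)

let () =
  Quickcheck.test gen ~f:(fun t ->
    let (leaves, nodes) = counts t in
    if leaves <> nodes + 1 then failwith "leaf/node invariant violated")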


Observers

In Haskell's Quickcheck, there is a duality between the type class Arbitrary for generating random values and Coarbitrary for observing inputs to random functions. Our version of Quickcheck mirrors Generator with Observer. Most tests using Quickcheck do not need an observer, but if you want to generate a random input for a higher-order function, you will need an observer for the function's input type.

TEST_UNIT "function composition" =
  let open Quickcheck.Generator in
  Quickcheck.test
    (tuple3
       (fn Quickcheck.Observer.int char)
       (fn Quickcheck.Observer.string int)
       string)
    ~f:(fun (f, g, x) ->
      <:test_eq< char >>
        ((Fn.compose f g) x)
        (f (g x)))

Here, Quickcheck.Generator.fn creates a generator for functions. It takes two arguments: an observer for the function's input type and a generator for the function's output type.

Think of an observer as a "generator of decision trees". For instance, Quickcheck.Observer.int might randomly generate any of the following decision trees:


 x>0
 / \
?   ?

   x>17
   /  \
 x>2   ?
 / \
?   ?

These decision trees control how a randomly-generated function will use its input. The generator for the function's output is used to fill each of the ?s in with a concrete value. The result is a family of functions operating on the appropriate types, making randomly-chosen observations on the input and producing randomly-chosen outputs based on those observations.

If you need to build an observer for a custom type, there are tools for that as well.

type bst = Leaf | Node of bst * int * bst

let obs : bst Quickcheck.Observer.t =
  let open Quickcheck.Observer in
  recursive (fun self ->
    unmap (variant2 unit (tuple3 self int self))
      ~f:(function
        | Leaf -> `A ()
        | Node (l, k, r) -> `B (l, k, r))
      ~f_sexp:(fun () -> Atom "variant2_of_bst"))

As with generators, there is a fixed point function Quickcheck.Observer.recursive that helps for simply-recursive types. The function unmap transforms an input of some new type into an input for which we already have an observer. Variant types can be transformed to polymorphic variants, which have default observers variant2 through variant6. Records and constructor arguments can be transformed to tuples, which have default observers tuple2 through tuple6.
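
For instance, here is a sketch for a record type, following the same pattern; the point type and the sexp label are made up for illustration:

type point = { x : int; y : int }

let point_obs : point Quickcheck.Observer.t =
  let open Quickcheck.Observer in
  (* observe a record by transforming it into a tuple *)
  unmap (tuple2 int int)
    ~f:(fun { x; y } -> (x, y))
    ~f_sexp:(fun () -> Atom "tuple2_of_point")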

Work in Progress

Our OCaml adaptation of Quickcheck is new and still evolving. We already have some changes to the library internally which will be released over time, such as moving default generators and observers out of the Quickcheck module and into the modules for each type. For example, Quickcheck.Generator.int becomes Int.gen.

There are still some pragmatic lessons to learn about how best to use our formulation of the library, how to calibrate our default distributions, and what other distributions we might want to provide. As always, we hope to get feedback from anyone who tries out this library so that we can improve it.

Happy testing!

rsync rounds timestamps to the nearest second

I'm not sure how I've managed to use rsync for so many years without ever noticing this, but hey, you learn something new every day!

[cperl@localhost ~]$ rpm -q --qf '%{name}-%{version}-%{release}\n' rsync

[cperl@localhost ~]$ touch foo
[cperl@localhost ~]$ stat --format "%y" foo
2015-09-24 14:07:05.349357260 -0400
[cperl@localhost ~]$ rsync -a foo bar
[cperl@localhost ~]$ stat --format "%y" bar
2015-09-24 14:07:05.000000000 -0400
[cperl@localhost ~]$ cp -a foo baz
[cperl@localhost ~]$ stat --format "%y" baz
2015-09-24 14:07:05.349357260 -0400

No (functional) experience required

Jane Street is a serious functional programming shop. We use OCaml, a statically typed functional language, for almost everything and have what is probably the largest OCaml codebase anywhere.

This leads lots of people to think that they shouldn't even bother applying, under the assumption that we are only interested in hiring deep functional programming gurus. I think people get to this conclusion in part because they think of functional languages, especially those with fancy type systems, as arcane tools that can only be used effectively after deep study.

To the contrary, one of the reasons we started building production systems with OCaml was that it was relatively easy to understand, even for people with no formal CS background. Since then, we've had good experiences taking students with no functional experience at all and getting them to the point of being able to complete a project in just a few weeks. We also have a very successful "OCaml Bootcamp" program, where over four weeks, we train all of the incoming traders and many other non-developer hires on OCaml and our development tools and libraries. By the end, most of them are able to create useful applications.

All of this is to say that we don't go out of our way to hire people who are already familiar with functional programming. In practice, it's just not that hard for strong programmers to pick it up after they start.

That said, a solid majority of the developers we hire do come in with functional programming experience --- but that's because of their preferences, not ours. Programmers with an interest in functional languages have an extra reason to want to work here, and so we get an unusually high number of good applicants from that pool.

There's a more general lesson here: using well-loved tools is a good way of attracting (and retaining) great developers.

Introducing Incremental

I'm pleased to announce the release of Incremental (well-commented mli here), a powerful library for building self-adjusting computations, i.e., computations that can be updated efficiently when their inputs change.

At its simplest, you can think of a self-adjusting computation as a fancy spreadsheet. In a spreadsheet, each cell contains either simple data, or an equation that describes how the value in this cell should be derived from values in other cells. Collectively, this amounts to a graph-structured computation, and one of the critical optimizations in Excel is that when some of the cells change, Excel only recomputes the parts of the graph that depend on those changed cells.

What makes self-adjusting computation (or SAC) different from a spreadsheet is its dynamism. The structure of the computational graph in an SAC can change at runtime, in response to the changing input data.

This dynamism gives you quite a bit of flexibility, which can be used in different ways. Here are a few.

On-line combinatorial algorithms. Incremental is based on work by Umut Acar et al. on self-adjusting computations (that's where the term comes from), and there, they were mostly interested in building efficient on-line versions of various combinatorial algorithms. In many cases, they could match the asymptotic complexity of custom on-line algorithms by fairly simple incrementalizations of all-at-once algorithms.

Incremental GUI construction. One simple and natural way to model a GUI application is to structure your display as a function that generates a view from some more abstract data model.

Having a function that constructs a view from scratch at every iteration is simple, but prohibitively expensive. But if you can write this function so that it produces an incrementalized computation, then you have a solution that is both simple and efficient. We've used this technique in a number of our UIs, to good effect.

This might remind you of how functional reactive programming (FRP) is used for construction of GUIs in languages like Elm. SAC and FRP have different semantics --- FRP is mostly concerned with time-like computations, and SAC is mostly about optimizing DAG-structured computations --- but they are nonetheless closely related, especially at the implementation level. You can see my post here for a description of the broader conceptual landscape that includes both FRP and SAC.

Configurable computations. An example that comes from our own work is risk calculations. Calculating measures of risk of a portfolio involves combining data from a complex collection of interdependent models. Each of these models is dependent both on live data from the markets, and on configurations determined by users.

A config change could merely tweak a coefficient, or it could change the overall structure of the computation, say by changing the list of factors used by a given model. Incremental allows you to build a computation that can update efficiently in response to both simple data changes as well as more structural config changes, in one unified framework.

A taste of Incremental

It's hard to give a compelling example of Incremental in action in just a few lines of code, because what makes Incremental really useful is how it helps you build large and complex computations. Nonetheless, small examples can give you a sense of how the library works.

To that end, let's walk through a few small examples. To begin, we need to instantiate the Incremental functor.

open Core.Std
module Inc = Incremental_lib.Incremental.Make ()

Each instance thus generated is its own independent computational world. The Incremental functor is generative, meaning it mints fresh types each time it's applied, which prevents values from different incremental worlds from being mixed accidentally.
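
As a quick sketch of what generativity buys us (my example, not from the library's documentation): applying the functor twice mints two incompatible incremental types, so the type checker rejects programs that mix them.

module Inc1 = Incremental_lib.Incremental.Make ()
module Inc2 = Incremental_lib.Incremental.Make ()

let x1 = Inc1.Var.watch (Inc1.Var.create 1.)
let x2 = Inc2.Var.watch (Inc2.Var.create 2.)

(* This would be rejected:
     let sum = Inc1.map2 x1 x2 ~f:(+.)
   because x2 has type float Inc2.t, not float Inc1.t. *)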

An Incremental computation always starts at its variables. Modifications to variables are how updates to input data are communicated to Incremental.

Let's write down a few variables corresponding to the dimensions of a rectangular prism.

module Var = Inc.Var

(* dimensions of a rectangular prism *)
let width_v  = Var.create 3.
let depth_v  = Var.create 5.
let height_v = Var.create 4.

We can use Var.watch to get the (trivial) incremental computations associated with each variable.

let width  = Var.watch width_v
let depth  = Var.watch depth_v
let height = Var.watch height_v

The following is an incremental computation of the base area of the prism, and of the volume.

let base_area = Inc.map2 width depth ~f:( *. )
let volume    = Inc.map2 base_area height ~f:( *. )

In order to get information out of an incremental computation, we need to explicitly mark which nodes we want data from by creating observer nodes. Because it knows which nodes are observed, the framework can track what parts of the computation are still necessary to the results.

let base_area_obs = Inc.observe base_area
let volume_obs    = Inc.observe volume

In order to force the computation to run, we need to explicitly call Inc.stabilize. Here's some code that uses stabilize to run the computation and then gets the information from the observers.

let () =
  let v = Inc.Observer.value_exn in
  let display s =
    printf "%20s: base area: %F; volume: %F\n"
      s (v base_area_obs) (v volume_obs)
  in
  Inc.stabilize ();
  display "1st stabilize";
  Var.set height_v 10.;
  display "after set height";
  Inc.stabilize ();
  display "2nd stabilize"

If we run this, we'll see the following output:

       1st stabilize: base area: 15.; volume: 60.
    after set height: base area: 15.; volume: 60.
       2nd stabilize: base area: 15.; volume: 150.

Note that setting the height isn't enough to change the observed values; we need a stabilization to make that happen.

That's a fairly trivial computation, and there certainly isn't much to incrementalize. Let's try something a little more complicated: a function for merging together an array of incrementals, using some commutative and associative operator like addition or max.

let rec merge ar ~f =
  if Array.length ar <= 1 then ar.(0)
  else
    let len = Array.length ar in
    let len' = len / 2 + len % 2 in
    let ar' =
      Array.init len' ~f:(fun i ->
        if i * 2 + 1 >= len then ar.(i*2)
        else Inc.map2 ar.(i*2) ar.(i*2+1) ~f)
    in
    merge ar' ~f

Because this is done using a binary tree as the dependency graph, the complexity of updating an element is log(n), where n is the size of the array. We can use this, for example, to compute an average:

let average ar =
  let sum = merge ar ~f:(+.) in
  Inc.map sum ~f:(fun s -> s /. float (Array.length ar))

This works, but we can do better performance-wise, at least if our merge operation has an inverse. In that case, maintaining the sum can in principle be done in constant time, by first subtracting out the old value and then adding in the new one. Incremental has a function for taking advantage of this structure.

let sum ar =
  Inc.unordered_array_fold ~f:(+.) ~f_inverse:(-.) ar

Now, let's say we want to do something a little more dynamic: in particular, what if we want to compute the average of a prefix of the given array? For that, we need to use the bind function, which allows us to produce new incremental nodes within an incremental computation.

let average_of_prefix ar length =
  Inc.bind length (fun length ->
    average (Array.init length ~f:(fun i -> ar.(i))))

The type of this function is float Inc.t array -> int Inc.t -> float Inc.t, so the length of the prefix is a fully fledged part of the incremental computation. As a result, the dependency structure of this computation changes dynamically, e.g., if the value of length is 7, then the computation only depends on length and the first 7 elements of the array.
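
Here's a small usage sketch under the definitions above (my own example): observe the prefix average, then change the prefix length and restabilize.

let () =
  let vars = Array.init 10 ~f:(fun i -> Inc.Var.create (float i)) in
  let inputs = Array.map vars ~f:Inc.Var.watch in
  let length_v = Inc.Var.create 3 in
  let avg_obs = Inc.observe (average_of_prefix inputs (Inc.Var.watch length_v)) in
  Inc.stabilize ();
  printf "%F\n" (Inc.Observer.value_exn avg_obs);  (* (0. +. 1. +. 2.) /. 3. = 1. *)
  Inc.Var.set length_v 7;
  Inc.stabilize ();
  printf "%F\n" (Inc.Observer.value_exn avg_obs)   (* 21. /. 7. = 3. *)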

Hopefully this gives you enough of a sense of what Incremental is about to start thinking about where it might be useful for you. Note that the overhead of incremental is not inconsiderable --- on my laptop, firing a single node takes on the order of 30ns, which is far more than, say, summing numbers together. Incremental tends to be useful when the computation that is put into a single node is large relative to that overhead, or when the computational graph is large relative to the sub-graph that needs to be recomputed. Our experience has been that there are plenty of applications in this space that can benefit from Incremental.

Converting a code base from camlp4 to ppx

As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the code source of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.

For instance, to translate code using pa_ounit and pa_test, we have to rewrite:

TEST = <:test_result< int >> ~expect:42 (f x)

into:

let%test _ = [%test_result: int] ~expect:42 (f x)

For small to medium projects it is enough to just take a couple of hours and translate the source code by hand. But at Jane Street, where we have a huge OCaml code base making extensive use of camlp4, that is simply not realistic. So we needed a tool to do it for us.

Writing a tool to automatically convert the syntax

Since the output of such a tool has to be accepted as the new source code that is committed in our repository, it must preserve the layout of the original file as much as possible and, of course, keep the comments. This means that any approach using an AST pretty-printer would be extremely complex.

The path we chose is to textually substitute the foreign syntaxes in the original file with the new ones. One could imagine doing that with a tool such as sed, awk, or perl; however, doing it properly would be tedious, and it would be pretty much impossible to be 100% sure it would never translate things it is not supposed to. Plus, writing perl is not as fun as writing OCaml programs.

Instead there is an easy way to find the foreign syntaxes: using camlp4 itself. To substitute the text of foreign syntaxes, the only thing we need to know is their location in the original file, and camlp4 can help us with that.

Writing dummy camlp4 syntax extensions

The idea is to write, for each camlp4 syntax extension, a dummy one that defines the same grammar productions as the real one, but instead of generating code it simply records substitutions at certain locations.

Then we do the following:

  • parse a file with camlp4 and our dummy syntax extensions
  • apply all the recorded substitutions to the original file

This approach has the advantage of interpreting the original file in the exact same way as our regular syntax extensions, giving us good confidence that we did not change the syntactic constructions by mistake.

To do so we define this API:

(** [replace loc repl] records a text substitution that replaces the
    portion of text pointed by [loc] by [repl]. *)
val replace : Loc.t -> string -> unit
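
For intuition, here is a sketch (not the actual tool) of how recorded substitutions might be applied at the end; it represents each location as a pair of byte offsets rather than a camlp4 Loc.t:

(* [apply_substitutions original subs] splices each (start, stop, replacement)
   triple into [original], assuming the substitutions do not overlap. *)
let apply_substitutions original subs =
  let subs = List.sort (fun (s1, _, _) (s2, _, _) -> compare s1 s2) subs in
  let buf = Buffer.create (String.length original) in
  let pos =
    List.fold_left
      (fun pos (start, stop, replacement) ->
        (* copy the untouched text, then the replacement *)
        Buffer.add_substring buf original pos (start - pos);
        Buffer.add_string buf replacement;
        stop)
      0 subs
  in
  Buffer.add_substring buf original pos (String.length original - pos);
  Buffer.contents buf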

Then writing a dummy camlp4 syntax extension is pretty easy. For instance for a subset of pa_ounit:

  EXTEND Gram
    GLOBAL: str_item;

    test: [[ "TEST" -> replace _loc "let%test" ]];

    name_equal:
      [[ `STRING _; "=" -> ()
       |            "=" -> replace _loc "_ ="
       ]];

    str_item:
      [[ test; name_equal; expr -> <:str_item< >> ]];
  END

On the fly conversion and diffing the generated code

Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:

  • for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
  • we processed the original file using camlp4 and the translated one using our ppx rewriters
  • in both cases we saved the output of -dparsetree (a human-readable version of the internal OCaml AST) and -dsource (pretty-printed code)
  • we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource

This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.

We have quite a lot of code in our camlp4 syntax extensions, and converting them to ppx was a long mechanical job, and so quite error-prone. Given that, this diffing turned out to be really helpful for finding errors.

The Camlp4 Syntax is not quite the OCaml syntax

While using this we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:

let _ x = x

let f l = List.map l ~f:fun x -> x + 1

These were quite easy to fix automatically as well using camlp4-to-ppx.

Github repo and extension

We published a slightly modified version of this tool on github.

The method we used doesn't work out of the box with all syntax extensions. For instance to convert code using lwt.syntax some more work needs to be done on camlp4-to-ppx. But it is a good starting point.

CPU Registers and OCaml

Even though registers are a low-level CPU concept, having some knowledge about them can help write faster code. Simply put, a CPU register is storage for a single variable. The CPU can keep data in memory, in cache, or in registers, and registers are often much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables as it can in registers.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

exception Not_of_same_length

let stats xs ys =
  let len = Array.length xs in
  if len <> Array.length ys then
    raise Not_of_same_length;
  let sx = ref 0 in
  let sy = ref 0 in
  let sxx = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let sax = ref 0 in
  let say = ref 0 in
  for i = 0 to len - 1 do
    let x = Array.unsafe_get xs i in
    let y = Array.unsafe_get ys i in
    let ax = abs x in
    let ay = abs y in
    let xx = x * x in
    let yy = y * y in
    let xy = x * y in
    sx := !sx + x;
    sy := !sy + y;
    sxx := !sxx + xx;
    syy := !syy + yy;
    sxy := !sxy + xy;
    sax := !sax + ax;
    say := !say + ay
  done;
  !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines in the loop body produces code 1.5-2x faster:

    let x = Array.unsafe_get xs i in
    sx := !sx + x;
    let xx = x * x in
    sxx := !sxx + xx;
    let ax = abs x in
    sax := !sax + ax;
    let y = Array.unsafe_get ys i in
    sy := !sy + y;
    let xy = x * y in
    sxy := !sxy + xy;
    let ay = abs y in
    say := !say + ay;
    let yy = y * y in
    syy := !syy + yy

Why is that? The CPU has just a few registers:

  • 13 which can contain integers and pointers to arrays and records
  • 16 which can contain floating point numbers

If the code has more than that many variables, the OCaml compiler has to park the extra variables in memory, and this parking is called spilling. Actually, spilling may happen even when there are fewer variables, because, for example, some operations like integer multiplication use extra registers.

Therefore, it's good to try to keep the number of frequently used variables to 13 or less, or to rearrange the code so that fewer variables are used at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled because it is not known whether the called function will need those registers. That can often be the largest penalty when calling a function, assuming the function is not inlined.

Let's change the previous function:

let stats xs ys =
  let sx  = ref 0 in
  let sax = ref 0 in
  let sxx = ref 0 in
  let sy  = ref 0 in
  let say = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let len = Array.length xs in
  if len <> Array.length ys then
    failwith (Printf.sprintf "Arrays not of the same length: %d %d"
      len (Array.length ys));
  for i = 0 to len - 1 do
    (* ... same loop body as before ... *)

This produces 1.35x slower code simply because the OCaml compiler will spill all of the variables because of Printf.sprintf. In each iteration, OCaml will pull sx from memory and store it back.

It's a pity, because this is actually not necessary: OCaml only has to pull sx from memory and store it back once, not in each iteration. It looks like this could be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from input parameters, from closure environment and from memory. Input parameters are normally stored in registers. Therefore, using input parameters will generally be slightly faster than the closure environment and memory. This can come in handy in recursions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e =
  let len = Array.length a in
  let rec loop i =
    if i < len then begin
      if Array.unsafe_get a i = e then Some i
      else loop (i + 1)
    end else None
  in
  loop 0

The variables e and len are pulled from closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e =
  let rec loop e len i =
    if i < len then begin
      if e = Array.unsafe_get a i then Some i
      else loop e len (i + 1)
    end else None
  in
  loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation

Reverse web proxy in ~50 lines of BASH

In the spirit of reinventing the wheel for fun, I hacked this together as a quick challenge to myself last week. It's a little rough around the edges, but I thought it was too cute not to share. If you have any bug fixes, please post them in the comments.


#!/bin/bash

set -e -u

function usage() {
  echo "USAGE: ${0} parent {port} {lower_bound_backend_port} {upper_bound_backend_port}"
}

[[ $# < 1 ]] && usage && exit 1

mode=${1}

case ${mode} in
  parent)
    PORT=${2}
    LOWER=${3}
    UPPER=${4}
    socat TCP-LISTEN:${PORT},fork,reuseaddr "EXEC:${0} child ${LOWER} ${UPPER}"
    ;;
  child)
    LOWER=${2}
    UPPER=${3}
    PORT=$(shuf -i ${LOWER}-${UPPER} -n 1)
    let "RANGE = UPPER - LOWER"
    COUNT=0
    SUCCESS=false
    while ((COUNT <= RANGE)) && ! ${SUCCESS}; do
      set +e
      if socat STDIN TCP:${PORT},connect-timeout=2; then
        SUCCESS=true
      else
        echo "unable to connect to port ${PORT}" >&2
      fi
      set -e
      let COUNT+=1
      let PORT+=1
      if ((PORT > UPPER)); then
        let 'PORT = LOWER'
      fi
    done
    if ! ${SUCCESS}; then
      echo "HTTP/1.1 500 Internal Server Error"
      echo
      echo "All REDACTED servers are down.  Please report to REDACTED@janestreet.com."
      exit 1
    fi
    ;;
  *)
    usage
    exit 1
    ;;
esac

Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but makes for a less pleasant coding experience (and code that is trickier and harder to reason about.) So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, which means that collection of the major heap can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do pretty well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.


Clocking by allocations

One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right, since if you run them when the minor heap is too small, then you'll end up accidentally promoting a rather large fraction of your minor allocations.

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record every phase of the collection process in quite a detailed way: which phase (marking vs sweeping) and sub-phase the collector is in, along with a collection of useful counters. This is available in the instrument branch.


Aging

Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen in the case that the minor heap is bigger than some configured level, to avoid too much false promotion; but a major collection can be done at any time.

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets for keeping track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket to negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
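
Here's a toy OCaml sketch of that accounting scheme, reconstructed from the description above (the real implementation lives in the C runtime and surely differs in detail):

let n = 8
let buckets = Array.make n 0.
let head = ref 0

(* Spread an amount of work uniformly over the next [n] buckets. *)
let spread work =
  let per_bucket = work /. float_of_int n in
  for i = 0 to n - 1 do
    let j = (!head + i) mod n in
    buckets.(j) <- buckets.(j) +. per_bucket
  done

(* Promotions schedule work over the next [n] slices. *)
let record_promotion_work work = spread work

(* An application-forced slice debits the front bucket, which may go
   negative. *)
let record_forced_slice work =
  buckets.(!head) <- buckets.(!head) -. work

(* At a scheduled slice, do up to [max_slice] of the front bucket's work,
   then spread the leftover (positive or negative) forward. *)
let next_slice ~max_slice =
  let w = buckets.(!head) in
  buckets.(!head) <- 0.;
  head := (!head + 1) mod n;
  let todo = if w > 0. then min w max_slice else 0. in
  spread (w -. todo);
  todo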

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free-lists, for a configurable set of sizes. e.g., one could have a different free list for blocks with 1, 2, 3 and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between the C call allocating OCaml data and passing it back, it has to make sure that OCaml's Garbage Collector doesn't collect it, as the Garbage Collector can be triggered during the C call. There's an intricate mechanism which assures that, part of which are CAMLparam, CAMLlocal and CAMLreturn.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare is not smart enough to avoid branch mispredictions for floats. Moving the comparison to C speeds it up a little bit; "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be "double", CAMLparam, CAMLlocal and CAMLreturn should be avoided, and the primitive declaration should include "float" and both an interpreted (bytecode) and a compiled (native) implementation:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so it's impossible to use "float". Is it still possible to avoid boxing? The answer is yes: simply convert the float to an int.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)
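
As a quick sanity check of that wrapper (a sketch; the comparison result round-trips through a float, which is exact for these small integers):

let () =
  assert (float_compare_float 1.0 2.0 = -1);
  assert (float_compare_float 2.0 1.0 = 1);
  assert (float_compare_float 1.0 1.0 = 0);
  (* nan compares below any number under this scheme *)
  assert (float_compare_float nan 1.0 = -1)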

Why GADTs matter for performance

When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn't particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.

Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn't seem particularly relevant for the kind of systems programming that I think about.

But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we've found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.

The Problem of Polymorphism

I'd like to walk through a simple example that illustrates this aspect of GADTs, but first, a few words about OCaml's memory representation. OCaml's polymorphism is in an important way built on top of that memory representation. In particular, consider a simple polymorphic function like List.iter, which has the following type:

val iter : 'a list -> f:('a -> unit) -> unit

The polymorphic type tells you that List.iter can operate on lists of any type, and in OCaml, this is achieved with a single compiled version of the code. This is possible because the memory representation of the elements of a list are uniform: you can always refer to an OCaml value in a single word, either as a pointer to a heap-allocated value, or as an immediate that fits inside that word.

That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There's actually some special magic in the compiler for float arrays, though this is probably more trouble than it's worth, as described by Alain Frisch here. But let's ignore float arrays for now.)
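
As a small illustration (a sketch using the stdlib's Obj module, where Obj.size reports a heap block's size in words), an array of characters and an array of ints use one word per element either way:

let () =
  let chars = [| 'a'; 'b'; 'c' |] in
  let ints  = [| max_int; 0; min_int |] in
  (* both print 3: one word per element, whatever the element type *)
  Printf.printf "%d %d\n" (Obj.size (Obj.repr chars)) (Obj.size (Obj.repr ints))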

OCaml does have a tighter representation for byte arrays, called bytes. But it's a completely different type, and so building a general-purpose data structure that uses bytes when it would make sense, and ordinary arrays otherwise, is a little awkward.

Controlling memory representation without GADTs

Let's see what happens if we try to design (without GADTs) an array type that sometimes uses the general array representation and sometimes uses bytes.

You could imagine representing such a value using an ordinary variant.

type 'a t = | Array of 'a array
            | Bytes of bytes

We could then implement each of the operations we want on our new array type, implementing each operation differently depending on the particular representation. Let's see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.

> module Compact_array = struct

    type 'a t = | Array of 'a array
                | Bytes of bytes

    let of_bytes x : char t = Bytes x
    let of_array x = Array x

    let length = function
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get t i =
      match t with
      | Array a -> a.(i)
      | Bytes s -> s.[i]

    let set t i v =
      match t with
      | Array a -> a.(i) <- v
      | Bytes s -> s.[i] <- v

  end;;

module Compact_array :
  sig
    type 'a t = Array of 'a array | Bytes of bytes
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : char t -> int -> char
    val set : char t -> int -> char -> unit
  end

This seems pretty good at first glance, but the inferred types aren't quite what we want. In particular, get and set only work with Compact_arrays containing characters. If you think about how type inference works, it's not really all that surprising. If you think about the code for get:

let get t i =
  match t with
  | Array a -> Array.get a i
  | Bytes s -> Bytes.get s i

The OCaml compiler is looking for a single type to assign to the return value for all the cases of the match. Given that Bytes.get always returns a char, Compact_array.get will be restricted to only returning a char.

One way to work around this problem is to essentially implement what we want as a poor-man's object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here's how that looks.

> module Compact_array = struct

    type 'a t = { len: unit -> int
                ; get: int -> 'a
                ; set: int -> 'a -> unit
                }

    let of_bytes s =
      { len = (fun () -> Bytes.length s)
      ; get = (fun i -> Bytes.get s i)
      ; set = (fun i x -> Bytes.set s i x)
      }

    let of_array a =
      { len = (fun () -> Array.length a)
      ; get = (fun i -> Array.get a i)
      ; set = (fun i x -> Array.set a i x)
      }

    let length t = t.len ()
    let get t i = t.get i
    let set t i x = t.set i x

  end;;

module Compact_array :
  sig
    type 'a t = {
      len : unit -> int;
      get : int -> 'a;
      set : int -> 'a -> unit;
    }
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

This more or less solves the problem, but it's still not really the memory representation we want. In particular, we have to allocate three closures for each Compact_array.t, and this number of closures will only go up as we add more functions whose behavior depends on the underlying array.

GADTs to the rescue

Let's go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we're not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> 'a t

The syntax of this declaration suggests thinking about variant constructors like Array or Bytes as functions from the constructor arguments to the type of the resulting value, with the thing to the right of the : roughly corresponding to the type signature of the constructor.

Note that for the Array constructor, the value of the type parameter 'a depends on the type of the argument:

> Array [|1;2;3|];;
- : int t = Array [|1; 2; 3|]
> Array [|"one";"two";"three"|];;
- : string t = Array [|"one"; "two"; "three"|]

But for the Bytes constructor, the type 'a in the type is still free.

> Bytes "foo";;
- : 'a t = Bytes "foo"

This is really the problematic case, because for Bytes "foo" we'd like the parameter 'a to be char, since in the Bytes case, that's what the element type of our array is.

Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.

type 'a t = | Array : 'a array -> 'a t
            | Bytes : bytes -> char t

Now, the Bytes constructor behaves as we'd like it to.

> Bytes "foo";;
- : char t = Bytes "foo"

Now let's see what happens when we try to write the length function.

> let length t =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : char t -> int = <fun>

Disappointingly, we're again stuck with a function that doesn't have the right type. In particular, the compiler has decided that this function can only operate on char t, when we want it to work for arrays of any type.

But the problem now is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml's type checker will try to assign a single type to a type variable across a given function. But in this case, we need a type variable which might correspond to different types in different branches of a match statement.

We can do this by creating a locally-abstract type el to represent the type parameter of t (and the element type), and annotating t accordingly.

> let length (type el) (t : el t) =
    match t with
    | Bytes b -> Bytes.length b
    | Array a -> Array.length a
  ;;
val length : 'a t -> int = <fun>

Now we see that we get the right type. We can push this approach through to get a complete implementation.

> module Compact_array = struct

    type 'a t = | Array : 'a array -> 'a t
                | Bytes : bytes -> char t

    let of_bytes x = Bytes x
    let of_array x = Array x

    let length (type el) (t : el t) =
      match t with
      | Array a -> Array.length a
      | Bytes s -> Bytes.length s

    let get (type el) (t : el t) i : el =
      match t with
      | Array a -> Array.get a i
      | Bytes s -> Bytes.get s i

    let set (type el) (t : el t) i (v : el) =
      match t with
      | Array a -> Array.set a i v
      | Bytes s -> Bytes.set s i v

  end;;

module Compact_array :
  sig
    type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
    val of_bytes : bytes -> char t
    val of_array : 'a array -> 'a t
    val length : 'a t -> int
    val get : 'a t -> int -> 'a
    val set : 'a t -> int -> 'a -> unit
  end

As I said at the beginning, this is really just an example of the more general theme. GADTs are about more than clever typed interpreters; they're a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high performance applications.