A solution to the ppx versioning problem

Ppx is a preprocessing system for OCaml in which one maps over the OCaml abstract syntax tree (AST), interpreting special syntax fragments in order to generate code.

Ppx rewriters get to work on the same AST definition as the compiler, which has many advantages:

  • The AST corresponds (almost) exactly to the OCaml language. This is not completely true as the AST can represent programs that you can't write, but it's quite close.

  • Given that the compiler and pre-processor agree on the data-type, they can communicate with each other using the unsafe [Marshal] module, which is a relatively cheap and fast way of serializing and deserializing OCaml values.

  • Finally, the biggest advantage for the user is that the locations in the original code are exactly preserved, which is a requirement for getting usable error messages. This is not so great for the generated code, as the best one can do is reuse some locations from the original source code and hope for the best. In practice the user sometimes gets nonsensical errors, but this is a commonly accepted trade-off.

There is however one drawback to all this: the compiler AST is not stable, and code using it is at the mercy of its evolution. We got lucky with the 4.04 release of OCaml, but the 4.03 one was quite disruptive. Even before releases, whenever the AST definition changes during a development cycle, many ppx rewriters might not be usable for a while, which makes testing a lot harder.

Several ideas have been flying around, such as adding a layer to convert between different versions of the AST. While this would work, it has the drawback that you need this layer for every variant of the AST. And when you want to make a patch modifying the AST, you'll need to do the extra work of updating this layer first.

In this blog post we show how we managed to solve the ppx compatibility problem in a way that improves the user experience and lets us produce releases that don't depend on ppx rewriters at all.

We did this work while working on Base, our upcoming standard library. In the end, it's likely we'll use only the third of the methods described below for Base, while the others will be used to improve the user experience with the rest of our packages.

What do other code generators do?

Ppx is not the first system for generating code that is a mix of user-written code and machine-generated code. A typical class of generators that get it right, i.e. that preserve locations and are independent of the AST definition, are lexer/parser generators, and not only the ones distributed with the compiler.

Let's take the example of lexer generators (parser generators work basically the same way). The user writes a series of rules, each consisting of a regular expression and an action to take if the input matches it:

rule token = parse
| "+"  { PLUS  }
| "-"  { MINUS }
| '0'..'9'+ as s { INT (int_of_string s) }
| " "* { token lexbuf (* skip blanks *) }

This code is written in a .mll file and the generator then produces a .ml file with code for the lexing engine interleaved with the user written actions.

In order to keep the locations of the user-written code pointing to the right place in the .mll file, the generator produces:

# 42 "lexer.mll"
         token lexbuf (* skip blanks *)

The OCaml compiler interprets the line starting with a # and updates its current location to point to the beginning of line 42 in file lexer.mll. This is called a line directive.

To go back into the generated code, the lexer generator produces:

# 100 "lexer.ml"

Where 100 corresponds to the real line number in lexer.ml.

With this method, when there is an error in the user-written code, it points to the lexer.mll file, while when there is an error in the generated code it points to the lexer.ml file. Even if the generated code might not be particularly easy to understand, at least you get to see the real code the compiler chokes on.

Another big advantage is that when using a debugger, you can follow the execution through the generated code.

Can we do the same for ppx?

At first glance, it seems that ppx rewriters work in a very different way, but the result is the same: only parts of the file are generated and the rest is taken as-is from what the user wrote. In fact, compared to the lexer case, most of the resulting code is user-written.

There is however some work to do to get the same result as with lexer generators. First you have to distinguish the generated code from the user code.

If you take a ppx rewriter as a black box, then the only way is to apply some kind of tree diff between the input and the output. In our ppx framework however, we know exactly what fragments of the AST are rewritten by plugins and we know the rewriting is always local. This makes the job a lot simpler and probably faster as well, so we chose to take advantage of this information.

The method

It works this way: while mapping the AST, we collect all the fragments of generated code along with the location of the code they replace in the original file. At the end we sort them in the order of the file and make sure there is no overlap. Every fragment is pretty-printed to a string.

What we end up with is a list of text substitutions: beginning position, end position, replacement text. The next step is to simply apply these substitutions to the original file. If you read the blog post about how we switched from camlp4 to ppx, you'll notice the resemblance here.
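To make the idea concrete, here is a minimal sketch of what such a substitution list and its application could look like. The names are made up and positions are simplified to plain byte offsets; this is not the actual ppx_driver code.

(* A substitution records the region of the original file to replace and the
   pretty-printed replacement text. *)
type substitution =
  { start_ofs : int
  ; end_ofs   : int
  ; repl      : string
  }

(* Apply substitutions, assumed sorted by [start_ofs] and non-overlapping,
   to the contents of the original file. *)
let apply_substitutions original subs =
  let buf = Buffer.create (String.length original) in
  let last =
    List.fold_left
      (fun pos { start_ofs; end_ofs; repl } ->
         (* Copy the untouched text up to the substitution, then the
            replacement text, and continue after the replaced region. *)
         Buffer.add_substring buf original pos (start_ofs - pos);
         Buffer.add_string buf repl;
         end_ofs)
      0 subs
  in
  Buffer.add_substring buf original last (String.length original - last);
  Buffer.contents buf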

This is what the transformation looks like:

(* ----- input ----- *)
type t = int [@@deriving sexp_of]

let f x = x + 1

let g x = [%sexp_of: t] x

(* ----- output ----- *)
# 1 "foo.ml"
type t = int [@@deriving sexp_of]
# 4 "foo.ml.pp"
let sexp_of_t = sexp_of_int
#2 "foo.ml"

let f x = x + 1

let g x =
# 11 "foo.ml.pp"
sexp_of_t
# 5 "foo.ml"
                        x

The result for [@@deriving sexp_of] is not bad at all. For code rewritten inside expressions, the result is not as good, given that it breaks those expressions up. But given that extensions are often sparse in our source files, this is still acceptable.

This mode can be selected with a ppx_driver-based rewriter by passing the flag -reconcile.

Solving the compatibility problem

With this mode, one can first generate a .ml.pp file and feed that to the compiler. Given that the concrete syntax of the language breaks much less often than the internal AST definition, a working ppx is likely to work for a very long time.

We'll soon start releasing a separate package that snapshots one version of the lexer/parser/AST/printer of the OCaml compiler. This package will have its own release schedule and will typically be updated soon after each release of OCaml. This will give time for ppx authors to upgrade their code when it breaks, while still allowing people to try out the new compiler with their favorite packages.

Mode for release tarballs

In addition to the mode described above, ppx_driver has a second mode, -reconcile-with-comments, where the result is similar to the one with line directives except that the generated code is enclosed with comment markers:

type t = int [@@deriving sexp_of]
(* GENERATED CODE BEGIN *)
let sexp_of_t = sexp_of_int
(* GENERATED CODE END *)

let f x = x + 1

let g x =
(* GENERATED CODE BEGIN *)
sexp_of_t
(* GENERATED CODE END *)
                        x

This mode is intended for release tarballs. One can replace all the files in-place by the pre-processed version using -reconcile-with-comments. The result is readable and has the big advantage that you don't need to depend on the ppx rewriter, which means the package is faster to install for users.

Jane Street packages will eventually move to this scheme, either for the next stable release or the one after that. One technical issue with this method is that to take full advantage of it, the runtime support libraries of the various ppx rewriters must be installable without the rewriter itself. Splitting the packages in opam is fine, but splitting the repository is not desirable, as often both components make strong assumptions about each other.

For Jane Street packages, we'll need to update our release system so that it supports generating two opam packages from one repository.

ppx as a verification tool only

While these new methods improve the ppx story in general, for Base we wanted to go even further and allow users to build Base without the need for ppx at all, both for the release and for the development versions, not only to cut down the dependencies, but also to provide a better experience in general. For instance if you are working on a patched compiler and need the development version of Base, you shouldn't need all the ppx rewriters, which might not work for some reason.

We explored various bootstrap stories, and while they worked they were not very nice, especially for such an important library. Its development and build processes should be straightforward.

We even looked into not using ppx at all. While this is OK for many ppx rewriters that are mostly syntactic sugar, it is more problematic for [@@deriving ...]. It's not so much that the code is hard to write by hand, as most data structures in Base are either simple datatypes or require hand-written combinators anyway, but it is a pain to review. This code is very mechanical, and you have to make sure that the constant strings correspond to the constructor/field names and other things where the machine can do much better than a human.

In the end we found a solution to keep the best of both worlds, i.e. being able to build the original source code without pre-processing and avoid having to write and review this boilerplate code.

The idea is to use ppx in the same way that we write expect tests; the tool only checks that what comes after the type definition corresponds to what the rewriters derive from it. In case of mismatch it produces a .corrected file, just like expect tests.

We are currently experimenting with this method for Base. It's possible that we'll have some marker to delimit the end of the generated code. In the end the code could look like this:

type t = A | B [@@deriving sexp_of]

let sexp_of_t = function
  | A -> Sexp.Atom "A"
  | B -> Sexp.Atom "B"

[@@@end_of_derived_code]

Given that the compiler ignores attributes it doesn't understand, this code compiles just fine without any pre-processing.

When running the ppx rewriter in this expect mode, the generated AST is matched against the source AST without taking locations into account, so that means you can reformat the code as you wish and even add comments.

The challenge now is to update our ppx rewriters so that they produce code that we are happy to show. Until now we didn't focus too much on that, but we have a good idea about how to do it. The plan is to move more of the logic of the various deriving systems into proper functions instead of generating more code. Note that this is an improvement in general, as proper functions are a lot easier to understand and maintain than code generators.

Conclusion

In this blog post we described a simple and clean method to decouple ppx rewriters from the release schedule of the compiler. This method has the advantage that once a ppx rewriter is written, it is likely to work for a long time, and especially to work out of the box with development compilers.

Moreover, this method is better for users, as errors point to the real code the compiler sees, and when debugging they can follow the execution through generated code without trouble.

All this is currently implemented in ppx_core/ppx_driver. Our github repositories haven't been updated in a while, as the Base refactoring disrupted our public release process quite a bit. These new features should be published in the coming weeks and will be part of the next stable release of our packages, planned for the beginning of December.

Converting a code base from camlp4 to ppx

As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the source code of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.

For instance to translate code using pa_ounit and pa_test, we have to rewrite:

TEST = <:test_result< int >> ~expect:42 (f x)

to:

let%test _ = [%test_result: int] ~expect:42 (f x)

For small to medium projects it is enough to just take a couple of hours to translate the source code by hand. But at Jane Street, where we have a huge OCaml code base making extensive use of camlp4, that is simply not realistic. So we needed a tool to do it for us.

Writing a tool to automatically convert the syntax

Since the output of such a tool has to be accepted as the new source code committed in our repository, it must preserve the layout of the original file as much as possible and of course keep the comments. This means that any approach using an AST pretty-printer would be extremely complex.

The path we chose is to textually substitute the foreign syntaxes in the original file with the new ones. One could imagine doing that with a tool such as sed, awk or perl; however doing it properly would be tedious, and it would be pretty much impossible to be 100% sure it would never translate things it is not supposed to. Plus, writing perl is not as fun as writing OCaml programs.

Instead there is an easy way to find the foreign syntaxes: using camlp4 itself. To substitute the text of foreign syntaxes the only thing we need to know is their location in the original file, and camlp4 can help us with that.

Writing dummy camlp4 syntax extensions

The idea is to write, for each camlp4 syntax extension, a dummy one that defines the same grammar productions as the real one, but instead of generating code simply records substitutions at certain locations.

Then we do the following:

  • parse a file with camlp4 and our dummy syntax extensions
  • apply all the recorded substitutions to the original file

This approach has the advantage of interpreting the original file in the exact same way as our regular syntax extensions, giving us good confidence that we did not change the syntactic constructions by mistake.

To do so we define this API:

(** [replace loc repl] records a text substitution that replaces the
    portion of text pointed to by [loc] with [repl]. *)
val replace : Loc.t -> string -> unit

Then writing a dummy camlp4 syntax extension is pretty easy. For instance for a subset of pa_ounit:

EXTEND Gram
  GLOBAL: str_item;

  test: [[ "TEST" -> replace _loc "let%test" ]];

  name_equal:
    [[ `STRING _; "=" -> ()
     |            "=" -> replace _loc "_ ="
     ]];

  str_item:
    [[ test; name_equal; expr -> <:str_item< >>
    ]];
END

On the fly conversion and diffing the generated code

Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:

  • for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
  • we processed the original file using camlp4 and the translated one using our ppx rewriters
  • in both cases we saved the output of -dparsetree (human-readable version of the internal OCaml AST) and -dsource (pretty-printed code)
  • we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource

This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.

We have quite a lot of code in our camlp4 syntax extensions, and converting them to ppx was a long, mechanical and thus error-prone job, so this diffing turned out to be really helpful for finding errors.

The Camlp4 Syntax is not quite the OCaml syntax

While using this we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:

let _ x = x

let f l = List.map l ~f:fun x -> x + 1

These were quite easy to fix automatically as well using camlp4-to-ppx.

Github repo and extension

We published a slightly modified version of this tool on github.

The method we used doesn't work out of the box with all syntax extensions. For instance to convert code using lwt.syntax some more work needs to be done on camlp4-to-ppx. But it is a good starting point.

CPU Registers and OCaml

Even though registers are a low-level CPU concept, having some knowledge about them can help write faster code. Simply put, a CPU register is storage for a single variable. The CPU can keep data in memory, in cache or in registers, and registers are often much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables as it can in registers.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

exception Not_of_same_length

let stats xs ys =
  let len = Array.length xs in
  if len <> Array.length ys then
    raise Not_of_same_length;
  let sx = ref 0 in
  let sy = ref 0 in
  let sxx = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let sax = ref 0 in
  let say = ref 0 in
  for i = 0 to len - 1 do
    let x = Array.unsafe_get xs i in
    let y = Array.unsafe_get ys i in
    let ax = abs x in
    let ay = abs y in
    let xx = x * x in
    let yy = y * y in
    let xy = x * y in
    sx := !sx + x;
    sy := !sy + y;
    sxx := !sxx + xx;
    syy := !syy + yy;
    sxy := !sxy + xy;
    sax := !sax + ax;
    say := !say + ay;
  done;
  !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines produces code 1.5-2x faster:

    let x = Array.unsafe_get xs i in
    sx := !sx + x;
    let xx = x * x in
    sxx := !sxx + xx;
    let ax = abs x in
    sax := !sax + ax;
    let y = Array.unsafe_get ys i in
    sy := !sy + y;
    let xy = x * y in
    sxy := !sxy + xy;
    let ay = abs y in
    say := !say + ay;
    let yy = y * y in
    syy := !syy + yy;

Why is that? The CPU has just a few registers:

  • 13 which can contain integers and pointers to arrays and records
  • 16 which can contain floating point numbers

If the code has more than that many variables, the OCaml compiler has to park the extra variables in memory, and this parking is called spilling. Actually, spilling may happen even when there are fewer variables, because, for example, some operations like integer multiplication use extra registers.

Therefore, it's good to try to keep the number of frequently used variables to 13 or less, or to rearrange the code so that fewer variables are used at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled because it is not known whether the called function will need those registers. That can often be the largest penalty when calling a function, assuming the function is not inlined.

Let's change the previous function:

let stats xs ys =
  let sx  = ref 0 in
  let sax = ref 0 in
  let sxx = ref 0 in
  let sy  = ref 0 in
  let say = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let len = Array.length xs in
  if len <> Array.length ys then
    failwith (Printf.sprintf "Arrays not of the same length: %d %d"
      len (Array.length ys));
  for i = 0 to len - 1 do
    ... 

This produces 1.35x slower code simply because the OCaml compiler spills all of the variables because of the call to Printf.sprintf. In each iteration, OCaml will pull sx from memory and store it back.

It's a pity, because this is actually not necessary: OCaml only needs to pull sx from memory and store it back once, not in each iteration. It looks like this could be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from input parameters, from the closure environment and from memory. Input parameters are normally stored in registers. Therefore, using input parameters will generally be slightly faster than using the closure environment or memory. This can come in handy in recursive functions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e =
  let len = Array.length a in
  let rec loop i =
    if i < len then begin
      if Array.unsafe_get a i = e then Some i
      else loop (i + 1)
    end else None
  in
  loop 0

The variables e and len are pulled from the closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e =
  let rec loop e len i =
    if i < len then begin
      if e = Array.unsafe_get a i then Some i
      else loop e len (i + 1)
    end else None
  in
  loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between the C call allocating OCaml data and passing it back, it has to make sure that OCaml's Garbage Collector doesn't collect it, as the Garbage Collector can be triggered during the C call. There's an intricate mechanism which ensures this, part of which is the set of macros CAMLparam, CAMLlocal and CAMLreturn.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare is not smart enough to avoid branch mispredictions for floats. Moving the comparison to C speeds it up a little bit. "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
{
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));
}

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
{
  CAMLparam2(x, y);
  CAMLlocal1(v);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);
  CAMLreturn(v);
}

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, the C function's arguments and return type should be "double", CAMLparam, CAMLlocal and CAMLreturn should be avoided, and the external declaration should include "float" as well as both bytecode and native implementations:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
{
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;
}

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns an integer, so it's impossible to use "float". Is it still possible to avoid boxing? The answer is yes: simply convert the result from float to int.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
{
  return (f > g) - (f < g) + (f == f) - (g == g);
}

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)

What is gained and lost with 63-bit integers?

Almost every programming language uses 64-bit integers on typical modern Intel machines. OCaml uses a special 63-bit representation. How does it affect OCaml?

OCaml int memory representation

Most of OCaml's types are represented in memory as a header followed by data. The header is a 64-bit integer containing the length of the data and a tag. The tag is a rough classification of the type. The only OCaml types which differ from this are ints and sometimes floats.

Floats normally have a header and data, the data being the value of the float itself. This representation is called "boxed". If a record's field is a float, the record's data will actually contain a pointer to the float data. The only exceptions are records containing only floats and float arrays, whose data contains the float values themselves instead of pointers. This representation is called "unboxed".
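As a small illustration, these representations can be observed directly with the Obj module; this is just a sketch, assuming the standard runtime of that era:

type point = { x : float; y : float }

let () =
  (* A lone float is a boxed block with Double_tag. *)
  assert (Obj.tag (Obj.repr 3.14) = Obj.double_tag);
  (* Float arrays and all-float records use the flat Double_array_tag
     representation: the float values are stored unboxed in the block. *)
  assert (Obj.tag (Obj.repr [| 1.0; 2.0 |]) = Obj.double_array_tag);
  assert (Obj.tag (Obj.repr { x = 1.0; y = 2.0 }) = Obj.double_array_tag)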

Values of type int are never stored as a header and data (boxed). An int x is stored as (x << 1) | 1, where << is left shift and | is bitwise or, hence its least significant bit is always set. Pointers are word-aligned, so they never have this bit set, which is how ints and pointers are distinguished. It is assumed that much of typical data consists of integers, so this is done to significantly improve performance:

  • there's no need to dereference a pointer when getting an int
  • no memory allocation is needed when creating ints
  • less work for the garbage collector
  • less memory fragmentation
  • no memory is needed for int headers

Distinguishing whether a value is an int or a pointer is as simple as testing x & 1, so this feature doesn't slow down the garbage collector, polymorphic hash, polymorphic compare, or whatever else structurally inspects data. One should note that this doesn't apply to the types int32 and int64, which are always boxed.
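A quick way to see this distinction from OCaml itself is, again, the Obj module (purely illustrative):

let () =
  (* An int is stored immediately: no block, no header. *)
  assert (Obj.is_int (Obj.repr 42));
  (* int64 values (and strings, records, ...) are boxed heap blocks. *)
  assert (Obj.is_block (Obj.repr 42L));
  assert (Obj.is_block (Obj.repr "hello"))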

Penalty

Having the extra bit comes with a price: arithmetic operations are more complicated (a small sanity check of these identities follows the list). For example

  • x + y is translated to CPU instructions x + y - 1
  • x * y is translated to CPU instructions (x >> 1) * (y - 1) + 1
  • x / y is translated to CPU instructions (((x >> 1) / (y >> 1)) << 1) + 1
  • x lsl y is translated to CPU instructions ((x - 1) << (y >> 1)) + 1
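These identities can be checked directly in OCaml by computing on the tagged representation by hand; this is only a sanity check of the formulas above, not how the compiler implements them:

(* tag x builds the runtime representation of the int x: (x << 1) | 1. *)
let tag x = (x lsl 1) lor 1

let () =
  let x = 17 and y = 5 in
  assert (tag (x + y)   = tag x + tag y - 1);
  assert (tag (x * y)   = (tag x asr 1) * (tag y - 1) + 1);
  assert (tag (x / y)   = (((tag x asr 1) / (tag y asr 1)) lsl 1) + 1);
  assert (tag (x lsl y) = ((tag x - 1) lsl (tag y asr 1)) + 1)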

Sometimes this penalty is small or nonexistent. For instance there's no need to fix the bit in x + y - z. Only one bit fixing is needed for all five additions in x + y + z + w + u + v.

Another help is the Intel CPU instruction LEA, which can compute a sum like x + y - 1 (two registers plus a constant) in a single instruction. Unfortunately, LEA has become very slow in recent generations of CPUs, and Intel doesn't suggest this will change.

This benchmark (test.ml) tries to estimate the difference in performance. The results from Sandy Bridge show about a 2x speed difference in arithmetic operations. The assembly can be examined by compiling with "ocamlopt -S test.ml".

speed(ns)       63-bit   64-bit   slowdown
add independent 0.327588 0.121502 2.696156
add   dependent 0.328160 0.169375 1.937477
mul independent 0.614824 0.328060 1.874120
mul   dependent 1.094343 0.328872 3.327565
lsl independent 0.359828 0.166088 2.166489
lsl   dependent 0.762251 0.177468 4.295151
xor independent 0.249350 0.122900 2.028886
xor   dependent 0.404255 0.170380 2.372668

Agner's instruction tables show that the difference is even bigger with later generations of CPUs. For instance, Haswell can do four integer adds per cycle versus one LEA.

Conclusion

The benefits of unboxed ints are amazing. On the other hand, arithmetic operations are significantly slower. How much do arithmetic operations affect an average program? Could we have a solution which would keep ints unboxed but have fast arithmetic operations?

Async Parallel

Background

Parallel is a library for spawning processes on a cluster of machines, and passing typed messages between them. The aim is to make using other processes as easy as possible. Parallel was built to take advantage of multicore computers in OCaml, which can't use threads for parallelism due to its non-reentrant runtime.

Introduction

Parallel is built on top of Async, an OCaml library that provides cooperative concurrency. So what do we want an async interface to parallel computations to look like? Well, since a Deferred already captures the concept of some action starting now and returning a future result, we just want a function to run that action in parallel, like this:

val run
  :  ?where:[ `Local | `On of string | `F of (unit -> string) ]
  -> thunk:(unit -> 'a Deferred.t)
  -> 'a Deferred.t

So what exactly does run do? (A small usage sketch follows the list.)

  • run creates a new process on the machine specified by [where]
  • run starts [thunk] in that process
  • run waits for [thunk] to finish and returns its result to the caller
  • [thunk] may also call run if it wants to
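To make that concrete, here is a hedged sketch of what a call to run could look like, following the signature above; the host name and the workload are made up:

open Core.Std
open Async.Std
open Parallel.Std

(* Sum a list in a worker process on "host0" and use the result locally. *)
let remote_sum () =
  Parallel.run ~where:(`On "host0") ~thunk:(fun () ->
    return (List.fold [1; 2; 3] ~init:0 ~f:(+)))
  >>| fun sum ->
  printf "sum computed remotely: %d\n" sum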

The above function is actually ALL we need; we can build anything we want on top of it. We could, for example, pass a closure that returns Deferred.never and start up two-way communication with that new process, for example via an address we agree on before the closure is sent to the other process. In practice Parallel provides a few more things.

val spawn
  :  ?where:[ `Local | `On of string | `F of (unit -> string) ]
  -> (('a, 'b) Hub.t -> 'c Deferred.t)
  -> (('a, 'b) Channel.t * ('c, string) Result.t Deferred.t) Deferred.t

Spawn is run's more featureful cousin. The closure is started up in another process and passed a typed Hub, and the caller is given the closure's result and a typed Channel that is connected to that Hub. A Channel is a connection to a Hub, and a Hub may have many channels connected to it at any given time. A very important feature of Channels is that they are mobile: they can be passed between processes, and they remain connected to their Hub. They can either be passed by lexical capture prior to spawn or run, or they can be sent over a Channel. Let's see a small example:

Ping Pong

open Core.Std
open Async.Std
open Parallel.Std

let worker h =
  Pipe.iter_without_pushback (Hub.listen_simple h) ~f:(fun (id, `Ping) ->
    Hub.send h id `Pong)
  >>| fun () -> `Done

let main () =
  Parallel.spawn ~where:Parallel.random worker >>> fun (c, _res) ->
  let rec loop () =
    Channel.write c `Ping;
    Channel.read c >>> fun `Pong ->
    Clock.after (sec 1.) >>> loop
  in
  loop ();
  Clock.after (sec 60.) >>> fun () -> Shutdown.shutdown 0

let () =
  Parallel.init ~cluster:
    { Cluster.master_machine = Unix.gethostname ()
    ; worker_machines = ["host0"; "host1"] } ();
  main ();
  never_returns (Scheduler.go ())

Let's take this program apart. The function worker is the worker process, which will listen to its hub for Ping messages (id is the name of the client that sent the message) and respond to the sender with a Pong message. Meanwhile, the main function will start up one process using spawn and have it run the worker function. It will then write a Ping message to that process every second, and read the returned Pong message. After 60 seconds the main function will call shutdown. The toplevel action at the bottom first calls Parallel.init, defining the cluster of three machines (the master, host0 and host1). Notice that main's Parallel.spawn is given a ~where argument that will randomly pick one of the machines for the worker to run on. It then starts main, and finally the async scheduler.

The most important thing to say about this program is that, to the compiler, it looks just like any other program. So the type checker will check the types, and as a result it will enforce the protocol, to the extent possible. The other important thing is that when main exits, after the shutdown call, the framework will ensure that all the worker processes (on all machines) are killed as well.

Implementation Notes and Gotchas

There are three kinds of processes involved in a program that uses Parallel:

  • the main process
  • the master process
  • worker processes

Parallel dynamically creates a worker process to service each call to [run].

The OS process tree looks like:

| main
|    master
|      worker1
|      ...
|      workerN

As far as the OS is concerned, all workers are children of the master. However, from the perspective of Parallel, the topology is more structured. Each worker process is created on behalf of its "owner" process, which is either the main process or another worker process. One can think of the main and worker processes as arranged in a tree different than the OS tree, in which there is an edge from each process to its owner (the main process has no owner).

Parallel uses OCaml's [Marshal] library to serialize OCaml values to and from strings so that they can be sent over unix sockets between processes. For example, the [f] supplied to [run] is marshalled and sent from the process that calls [run] to the worker process that will run [f]. Most, but not all, values can be marshaled. Examples of values that can't be marshaled include C-allocated abstract tagged values, and custom blocks with no serialize/deserialize method.
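As a tiny illustration of the mechanism (not Parallel's actual code), here is the kind of Marshal round-trip involved; marshaling a closure requires the Closures flag and only works between processes running the same executable, which forking guarantees:

let () =
  (* Serialize a closure to a string, as if sending it over a socket... *)
  let s = Marshal.to_string (fun x -> x + 1) [Marshal.Closures] in
  (* ...and read it back on the other side. *)
  let f : int -> int = Marshal.from_string s 0 in
  assert (f 41 = 42)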

The main process and all worker processes have a socket connected to the master process. The master process's sole job is to service requests that are sent to these sockets, which can ask it to create a new worker process. As the master process receives requests, it does what each request asks, and then sends a response back via the socket to the client that made the request.

Each worker process has a socket connected to its owner process. This socket initially receives the [f] that the worker is to run, and is ultimately used to send the result back from the worker to the owner.

Here are the steps involved in implementing [run f]. There are three processes involved.

  • R = the process calling [run]
  • M = the master process
  • W = the worker process running the task

The steps are:

  1. R asks M to create W
  2. M forks W
  3. M tells R about W
  4. R sends [f] to W to run
  5. W runs [f]
  6. W sends the result of [f] to R
  7. M notices W has exited, and cleans up

When there are multiple machines in a cluster, each machine has a master process, and all the workers know about all master processes. When a worker wants to run on machine M, it looks up the address of that machine's master process in its table before performing step 1; everything after that is exactly the same as in the steps above.

Channel Passing

When a channel is passed from one process to another, the open socket is not actually passed. The API makes this pretty transparent (any API call will reconnect the channel), but it is useful to be aware of what is really going on, because if you aren't you may create a race condition. For example, if I spawn a worker connected to a hub I have, and then immediately send something, it may or may not arrive, because the worker may not have had time to connect and receive it. A better strategy is to wait for the worker to say hello, and then send the data.
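Here is a hedged sketch of that strategy, reusing the API shown in the Ping Pong example; the `Hello and `Data constructors are made up, and error handling is omitted:

(* [hub] is owned by this process; [channel] is a channel to it that the
   worker captures and reconnects on its side. *)
let with_hello hub channel =
  don't_wait_for
    (Parallel.run ~where:Parallel.random ~thunk:(fun () ->
       (* Runs in the worker: announce ourselves, then wait for data. *)
       Channel.write channel `Hello;
       Channel.read channel >>| fun (`Data _) -> ()));
  (* Back in the owner: don't send anything until the worker has said hello. *)
  Pipe.read (Hub.listen_simple hub) >>| function
  | `Eof -> failwith "worker never connected"
  | `Ok (id, `Hello) -> Hub.send hub id (`Data 42)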

Channel passing also means that though you created only one channel from a given hub, you can end up with as many connections (client ids) as workers who got hold of that channel. You can address them all individually, or you can always use send_to_all if you really want to model a hub as a kind of shared bus.

Stdout and Stderr

stdout and stderr will be forwarded back to the master machine. This can cause some interleaving if you print a lot of messages, but generally works reasonably well. So printf debugging can work normally, even in a parallel program spanning multiple machines.

Some things to avoid marshaling

Monitor.t, Pcre.regexp, Writer.t, Reader.t, and similar kinds of objects shouldn't be depended upon to marshal correctly. Pcre.regexp is a custom block, and doesn't implement marshal/unmarshal, so it won't work. Monitor.t, Writer.t, and Reader.t, because of their complex nature, generally tow the entire async scheduler along with them, and because of that they will fail if any job on the scheduler queue has a custom object (e.g. regexp, or other C object) that can't be marshaled. You also can't marshal functions you've dynamically loaded (e.g. with ocaml plugin, though I hear this will be fixed soonish).

Processes don't share memory!

The library can make it look easy to create and use a process on some other machine, maybe halfway around the world, but even still it is another process. All the normal boundaries associated with that apply, so you can't expect global variables you set in one worker process to affect another. For a large parallel program that is a good thing.

Shared things

Because of the way Parallel works, with the master process being an image of a very early state of your program and workers forked from the master, it is usually not possible to share big static things in the way one might do in C using fork. Moreover, it isn't necessarily the win you might think: if you know how unix only copies pages on write when a process forks, you might expect it to be a win. But the garbage collector ruins that completely, because as it scans it writes to EVERY page, causing a copy-on-write fault that copies the page, so you end up with a non-shared copy of that big static thing in every process anyway. The best you can probably do is have one process own it and expose it with a query interface. Moreover, if you're running on multiple machines that IS the best you can do, so you may as well get used to it.

Why Not Just Fork!?

The unix savvy among you may ask: what the heck are you doing with master processes and closure passing, just fork! Oh how that would make life easier, but alas, it really isn't possible. Why? You can't write async without threads, because the Unix API doesn't provide an asynchronous system call for every operation, meaning if you need to do something that might block, you must do it in a thread. And the list of stuff that might block is long and crippling. Want to read from a file without blocking out for SECONDS? Sorry! Not without a thread you don't. But once you've started a thread, all bets are off if you fork. POSIX actually doesn't even say anything about what happens to threads in a process that forks (besides saying they don't think it's a good idea to do so). In every sane OS, only the forking thread continues in the child; all the other threads are dead. OK, fine you say, let them die. But their mutexes, semaphores and condition variables are in whatever state they were in the moment you forked, that is to say, any state at all. Unfortunately this means that having created a thread that does anything meaningful (e.g. calls into libc), if you fork, all bets are off as to what happens in that child process. A dead thread may, for example, hold the lock around the C heap, which would mean that any call into libc would deadlock trying to allocate memory (oops), and that'd ruin your day. Trust me, if Parallel could work in some simpler way we'd adopt it quickly!

Say I Want To Have a Giant Shared Matrix

The parallelism model implemented is strictly message passing, shared memory isn't implemented, but there are various hacks you could use to make this work (e.g. implement it yourself). Bigarray already allows mmaping files, so in theory even a cluster of machines could all mmap a giant file and use/mutate it.

Why Can't I Use Async Before Parallel.init?

By default Parallel.init does a check that you haven't created any threads, and that you haven't made any use of async. The threads check is mandatory, but the async check can be turned off by setting [fail_if_async_has_been_initialized] to false. Why is this check the default? Well, in general you can't initialize async libraries before calling Parallel.init and expect them to work in the child process. The reason is that the async scheduler is thrown away in the child process before calling Scheduler.go. This is out of necessity: there is no way we can know what state the scheduler is in at the moment we fork, and it would be quite unfortunate if it were in a bad state, or worse, if there were jobs on the queue that get run in all workers as soon as they call Scheduler.go. But as a result of this, any asyncy thing you create before Parallel.init won't work in worker processes. For example, say you initialize the log module before Parallel.init expecting to use it in the workers. It won't work, since all of its state (loops, writers, etc) is invalid in the worker processes. The check exists to make sure people are aware of this, and to make sure it only happens if they really know it's ok.

What CWD Will Worker Machine Processes Have?

If the CWD of the master exists on the worker machine, and you have permission to enter it, then parallel will switch to that directory before starting the master process, otherwise it will chdir to /.

RWO tidbits: the runtime

This is my favorite tweet about Real World OCaml.


It is indeed pretty rare for a language introduction to spend this much time on the runtime. One reason we included it in RWO is that OCaml's simple and efficient runtime is one of its real strengths - it makes OCaml simple to reason about from a performance perspective, and simple to use in a wide variety of contexts. Also, critically, the runtime is simple enough to explain! Even though it's one of my favorite parts of the book, I had very little to do with it. Anil wrote most of it, with some critical help from Jeremy Yallop (who worked on the ctypes library featured in Chapter 19) and Stephen Weeks (whose notes formed the basis of Chapters 20 and 21).

In any case, if you're interested in how OCaml represents values, how the C interface works, or how a simple generational GC works, you should check out Part III.

The making of Real World OCaml

It's taken a good long while, but Real World OCaml is finally done. You can read it for free at http://realworldocaml.org, or buy a hardcopy or an ebook version.

The idea for the book was born in a bar in Tokyo in 2011. After a talk-filled day at ICFP, a few of us (Ashish Agarwal, Marius Eriksen, Anil, and myself) went out drinking. We were all bellyaching over the lack of a high-quality OCaml book in English, and, emboldened by the Guinness and egged on by Ashish and Marius, Anil and I decided that we were just going to have to go ahead and write the book ourselves. We brought Jason into the project soon after.

From the very beginning, Anil and I had different visions for the book. From my perspective, it was a simple proposition: there was no good book available, and we were going to write one. The goal was to document the state of the art so as to make it more accessible.

Anil's vision was grander. He argued that writing a book is an opportunity to discover all of the things that are just too embarrassing to explain. And, once discovered, you needed to find a way to get all of those things fixed before the book was published. It was Anil's grander vision that we followed, and to good effect. It's not that we fixed the problems ourselves. We solved some problems directly, but by and large, we ended up getting help from those who were better positioned than us to fix the problems at hand.

Here are a few examples of pieces that came together over those two years.

  • OPAM. The core work on OPAM was done by Thomas Gazagnaire at OCamlPro (funded by Jane Street), with Anil collaborating closely. This is probably the biggest improvement of the bunch. It's hard to overstate how transformational OPAM is. Before OPAM, installing OCaml libraries was a complex chore. Now it's a joy.
  • The Core release process. We decided early on to base the book on Core, Jane Street's alternative to OCaml's standard library. But the release process around Core was too monolithic and too hard for external collaborators. We reorganized the process completely, moving to weekly releases to github, with the repos broken out into manageable components. Most of this work was done by Jeremie Dimino and Yury Sulsky at Jane Street.
  • Ctypes. Anil was in charge of writing the section on C bindings, and he decided that the standard way was just too painful to explain. That was the germ of ctypes, a library, written by Jeremy Yallop, that lets you build C bindings entirely from within OCaml, with an easy and simple interface.
  • Short paths. One of the real pains associated with working with Core is the heavy use that Core makes of the module system. OCaml's error messages, unfortunately, do not cope with this terribly well, leading to absurd types in error messages, so you might see a type rendered as Core.Std.Int.Map.Key.t Core.Std.Option.Monad_infix where it could just as well have been rendered as int option. We worked with Jacques Garrigue to implement a better heuristic for picking type names (suggested by Stephen Weeks, who implemented the same heuristic for Mlton). This is now available in the 4.01 version of the compiler, via the -short-paths patch.

This period also saw the founding of OCaml Labs, a lab at Cambridge University devoted to improving OCaml as a platform, with Anil at the head.

And we're not done. OCaml Labs and OCamlPro are still working on improving OPAM and building out a curated OCaml Platform that's built on top of OPAM. There's ongoing work on improving documentation generation so we can have better online API docs. Jacques Garrigue and Leo White are working on adding module aliases to OCaml which will greatly improve compilation time for libraries like Core that need to manage large namespaces.

And there are more projects that are improving OCaml and its surrounding infrastructure that have no direct connection to Real World OCaml, like:

  • Frederic Bour and Thomas Refis' work on Merlin, a tool for providing IDE-like tooling from within editors like VI and Emacs.
  • Mark Shinwell's work on improving GDB support for OCaml.
  • Fabrice le Fessant's work improving OCaml's interactions with performance tools like perf.
  • Work that Ashish Agarwal, Christophe Troestler, Esther Baruk, and now many others, have poured into improving the http://ocaml.org website.

Real World OCaml is of course a personal milestone for Jason, Anil and myself (and for Leo White, Jeremy Yallop and Stephen Weeks, who made some key contributions to the text). But viewed from a broader perspective, it's just one part of the increasing swirl of activity in the OCaml community.

OCaml has been a great language for a very long time. Now, we're growing a social and technological infrastructure around it to match.

Patch review vs diff review, revisited

I've been thinking about code review a lot recently.

Code review is a key part of our dev process, and has been from the beginning. From our perspective, code is more than a way of getting things done, it's a way of expressing your intent. Code that's easy to read and understand is likely to be more robust, more flexible, and critically, safer. And we care about safety a lot, for obvious reasons.

But the importance of code review doesn't mean that we've always done a good job of organizing it. I'll talk a bit more about how we used to do code review, how we do it now, and the impact that those changes have had.

The bad old world

Our old code review process was what you might call batch-oriented. We'd prepare a set of changes for a release, and then, after someone gave it a quick look-over, combine these changes together in a branch. We'd then read over these changes very carefully, with multiple people reading each file, making comments, requesting changes, and fixing those changes, until the code was in a releasable state.

This was a big and expensive process, involving many people, and quite a lot of work and coordination. Given the time it took, we focused our code review on our so-called critical path systems, i.e., the ones that are involved in sending orders to the market.

The management task was complex enough that we wrote a tool called cr for managing and tracking the reading of these diffs, parceling out responsibility for different files to different people. We've actually blogged about this before, here and here.

Batch-oriented review worked well when we and our codebase were smaller, but it did not scale. By combining multiple changes into a single branch, you were stuck reading a collection of unrelated changes, and the breadth of the changes made fitting it all into your head harder. Even worse, when you throw a bunch of changes together, some are going to take longer than others, so the release is blocked until the slowest change gets beaten into shape.

The end result is that, while we found code review to be indispensable in creating high quality code that we felt comfortable connecting to the markets, the overhead of that review kept on getting higher and higher as we grew.

We needed a better solution.

Feature-based review

Another approach to code review, and a more common one, is patch-based review. In patch review, someone proposes a change to the current release in the form of a patch, and it is the patch itself that is reviewed. Once it passes review, it can be applied to the tree. Patch-based review is great in that it gives you independence: one patch taking a while doesn't block other patches from getting out.

We avoided patch-based review initially because we were worried about the complexities of dealing with conflicting patches. Indeed, one issue with patch-based review is that the state of the tree when the patch is reviewed is likely not the state of the tree when the patch is applied. Even when this doesn't lead to textual conflicts, this should leave you a little nervous, since a patch that is correct against one version of the tree is not necessarily correct against a changed tree.

And then, what do you do when there's a conflict and the patch no longer applies cleanly? You can rebase the patch against the new state of the tree, and then re-review the patch from scratch. But humans are really bad at investing mental energy in boring work, and carefully re-reviewing a patch you've already mostly read is deadly dull.

Moreover, when do you decide that there's a conflict? When dealing with patches that involve file moves and renames, even deciding what it means for a patch written previously to still apply cleanly is a tricky question.

Also, squaring patch-based review with a git or hg-based workflow can be tricky. There's something quite nice about the github-style pull-request workflow; but the semantics of merging are pretty tricky, and you need to be careful that what you read corresponds with the actual changes that are made by the merge.

For all the problems, the virtues of patch-based review are clear, and so about six months ago we started a project to revamp our cr tool to make it suitable for doing patch-like review. The new version of cr is now organized around what we call features, which are essentially hg bookmarks (similar to git branches) augmented with some associated metadata. This metadata includes

  • An English description of the change
  • A base-revision that the changes should be read against
  • An owner
  • A collection (usually just one other than the owner) of full-feature reviewers.

The workflow for a developer goes something like this:

  • Create a new feature by running cr feature create. You'll select a name for the feature and write the initial description. The base-revision will automatically be chosen as the most recent release.
  • Write the code, using hg in the ordinary way, making commits as you go and pushing the result to a shared, multi-headed repo that has all of the features people are working on.
  • When you think the code is in a good state, get the feature enabled for review. At this point, you'll need to get a full-feature reviewer selected. It's this person's job to read every change that goes into this feature.
  • The full feature reviewer then reads the diff from the base-revision to the tip, adding comments, requesting fixes, and reading diffs forward until they're happy, at which point, it's seconded.
  • Once it's seconded, the feature is enabled for review by anyone who is signed up for review for the specific files you touched. How many file reviewers there are depends on the nature of the project. In our most safety-critical systems, every file has three reviewers. In some other systems, there are no file reviewers at all.

The remaining work needs to be done by the release manager. A release manager can create a new release based on a set of features that:

  • are fully reviewed, and have no outstanding reviewer complaints to be resolved.
  • compile cleanly on their own and pass their automated tests
  • have as their base revision the previous release
  • can be merged together cleanly

Checking that things "can be merged together cleanly" is actually tricky, since you can't just trust hg's notion of a merge. cr has its own merge logic that is more conservative than what hg and git do. The biggest worry with hg is that it tries to guess at a reasonable base-point for the 3-way merge (usually the greatest common ancestor of the two heads to be merged). Usually this works well, but it's easy to construct crazy cases where on one branch you make changes that are just silently dropped in the merge. There is also some rather surprising behavior that can come into play when files are moved, copied or deleted as part of the mix.

cr, on the other hand, will always choose the base-point of the features to be merged as the base-point for the 3-way merge. This way, the diffs that are reviewed are also the diffs that are used for constructing the merged node. Also, cr has some extra sanity conditions on what merges it's willing to try. This all greatly reduces the scope for surprising results popping out the other side of a merge.

If the base-revision of a given feature is against an older release, then you need to rebase the review before it can be released, i.e., update the base-revision to the most recent release. Among other things, this requires you to merge your changes with the tip. If there are conflicts, you then either need to review the resolution of the conflicts, or you simply need to reread the diffs from scratch. The last bit is pretty rare, but it's an important escape hatch.

How'd it go?

The new code-review process has had a dramatic effect on our world. The review process for our main repository used to take anywhere from a couple of weeks to three months to complete a cycle. Today, those releases go out every week, like clockwork. Everyone knows that if they can get their feature cleaned up and ready, they can always get it out that week. Indeed, if you're following our open-source releases on github, you'll see that new packages have shown up once a week for the last 16 weeks.

Feature-based review has led to a significant increase in the rate of change of our critical-path systems. Code review is now considerably less painful, and most importantly, it's easier than ever to say no to a feature that isn't ready to go. In old-style batch review, there was a lot of pressure to not hold up a release polishing some small bit, which sometimes led you to release code that wasn't really ready. Now, that problem has largely vanished.

The barrier to entry for people who want to contribute to critical path systems has also been lowered. This has also contributed to us being able to get projects out the door faster.

But the most striking result I've seen is from our post-trade group, which operates outside of the review process used for the critical-path systems. The post-trade team is responsible for our infrastructure that handles everything after a transaction is done, like tracking the clearing and settlement of our trades or managing our trading books and records.

Post-trade has historically had a more relaxed approach to code review --- they do it, but not on all parts of the system, and not in a particularly strict way. In the last few months, however, they switched over to using the new feature-based workflow, and even though they're doing a lot more code review (which takes serious time), their overall process has become faster and more efficient. We think that's largely due to having a well-managed workflow for managing and merging independent features, quite apart from the benefits of the review itself.

I'm now pushing to get feature-based review adopted throughout the firm. Obviously, not all code needs to be scrutinized to the same level --- having three reviewers for every file is sometimes sensible, sometimes overkill --- but ensuring that no change can get in unless one other person reads it and thinks it's reasonable is a good baseline rule. Review has a lot of benefits: it improves the quality of the code, gives you a great opportunity for training, and helps spread knowledge. Those benefits make sense everywhere we have people programming.

Maybe the biggest lesson in this for me is the importance of thinking through your processes, focusing on the biggest bottlenecks, and doing what you can to fix them.

Hackerschool tutorial at Jane Street


We just had a really fun tutorial on OCaml, Core and Async for about 20 people from the current Hacker School batch. The tutorial was in part based on an early cut of Real World OCaml, and in part based on the examples in the Core Hello World repo that has a few examples of Core and Async in action.

A few weeks ago, we gave out a small, early snapshot of the first few chapters of RWO, and some instructions for installing OPAM so you could get the appropriate versions of OCaml, Core and Async installed, in the hopes of getting people prepared for the tutorial. A few people also started working on the 99 problems in OCaml from the ocaml.org website.


The tutorial itself lasted about 4 1/2 hours, and we covered a lot of territory in a short period of time. I'm sure not everyone fully understood everything, but enough good questions were asked that I think people were getting the basic ideas.

In any case, it was a blast from our perspective, and I hope we can do it again.

"...you can write really pretty OCaml code..."

I got a bunch of feedback from the Hacker School folk, including the following from Alan O'Donnel, one of the staff members.

I think the most eye-opening part of the seminar was seeing Command; it made me realize two things: one, you can write really pretty OCaml code, and two, you guys write really pretty OCaml code! My (totally unfounded) image of OCaml (the language plus ecosystem) was something like a mix of Haskell and Erlang--some nice type stuff, but kind of ugly looking and with kind of yucky libraries. I was genuinely surprised that the Core and Async code we looked at was so pretty; it feels really well-designed and elegant, and it makes me want to write OCaml code.

Maybe we need to write a blog-post on how to use Command...
