Incremental computation and the web

I've recently been thinking about the world of JavaScript and web applications. That's odd for me, since I know almost nothing about the web. Indeed, Jane Street's use of web technologies is quite minimal -- nearly all of our user interfaces are text based, and all told we've been pretty happy with that.

But there are real limitations to console apps, and if you need something richer that's cross-platform, the web is pretty appealing. For us it's made yet more appealing by the fact that OCaml, our language of choice, compiles into JavaScript via js_of_ocaml.

So recently, when a few people internally got interested in trying out JavaScript-based UIs, I dug in a little to try to understand the landscape, and help us figure out what approach to take.

Virtual DOM, all at once

I started by trying to understand more about the approaches taken in the wider world for building JavaScript apps. One idea the struck me as particularly interesting was virtual DOM. Virtual DOM showed up first in Facebook's React, but has since inspired other implementations, like Matt Esch's virtual-dom library, which in turn is the basis of the Mercury web framework and Elm's Html library. Other libraries have wrapped and extended React, like ClojureScript's Om framework.

To understand the appeal of virtual DOM, you first need to understand what the world looks like without it. JavaScript applications in the browser are fundamentally tied to the DOM, which is the tree of objects that reflects the structure of the HTML that the page is built from. The DOM is wired into the browser, and mutating the DOM is how you change what's shown on the screen, a necessity for a dynamic web app.

But working directly with DOM can be awkward. For one thing, it encourages you to write your display logic twice: once to create the initial state of your page, and then again for the code that updates the DOM in response to external events.

But it's worse than just having to write your logic twice: the second time is also trickier. That's because, for performance reasons, you need to minimize changes to the DOM, since those changes can cause expensive reflows and the like. Doing this kind of change minimization by hand can be pretty painful.

The goal of virtual DOM is to let you write your display logic just once, and to do so in a convenient, straight-ahead style. The idea is simple: instead of modifying the DOM directly, you create immutable trees that represent the DOM you'd like to have, a virtual DOM. Every time your state changes, you recompute the virtual DOM, and then diff the new virtual DOM against the old. The resulting patch can then be applied to the actual DOM, which minimizes changes to the DOM proper.

This makes it easier to express your display logic cleanly. Elm uses virtual DOM as part of the Elm architecture. In that approach, the state of the application is kept in a model value, which abstractly represents the state of the application, omitting presentation details. A separate view function is used to convert the model into a virtual DOM tree that describes the page that should be shown to the user.

The application is made dynamic by adding an action type which summarizes the kinds of changes that can be made to the model. Actions are enqueued either from callbacks embedded in the virtual DOM, or by other jobs that communicate with the outside world via web requests and the like. When the action is applied to the model, a DOM update is done by recomputing the virtual DOM, and then diffing and patching the real DOM accordingly.

Virtual DOM, incrementally

As described, the above approach involves computing the virtual DOM from scratch after every action. This is done even if the change to the DOM implied by the action is small, which is the common case. Essentially, every key press and mouse click causes the entire virtual DOM to be recomputed.

In a world where DOM updates are the only expense that matters, this isn't so bad. And for sufficiently small web applications, that's almost right. But once you're creating large, dynamic UIs, this simple story falls apart, and the cost of recreating the virtual DOM every time matters.

That's why in all of these virtual DOM APIs and frameworks, there's some form of incrementalization built in, a way to avoid paying the full cost of rebuilding the virtual DOM when the logical changes are small.

In React, for example, the state is organized into a set of hierarchical components, each with its own render function. These components are structured to match the structure of the HTML that they generate, with the idea that you'll only have to re-render the few components whose input data has actually changed. React effectively memoizes the render computation at the component level.

Elm, rather than tying the incrementalization directly to a framework-level notion of component, lets you introduce memoization in the construction of individual virtual DOM nodes. To do this, Elm's Html module exposes a set of "lazy" functions with roughly these signatures (shown with OCaml syntax).

  1. val lazy1 : ('a -> Html.t) -> 'a -> Html.t
  2. val lazy2 : ('a -> 'b -> Html.t) -> 'a -> 'b -> Html.t
  3. val lazy3 : ('a -> 'b -> 'c -> Html.t) -> 'a -> 'b -> 'c -> Html.t

Here, the first argument is the render function, and the remaining arguments are the values to be passed to the render function.

The idea is that a call to one of these lazy functions won't call the render function immediately. Instead, it creates a special node that stores the render function and its arguments for later. The render function is only called as part of the process of diffing two virtual DOM trees. When the diff gets to the point of comparing two such nodes, it first compares the things that the node was built from, i.e., the render function and its arguments. If they're the same, then the diff is empty. If they differ, then the render function is run to compute more of the tree, and the diffing process continues from there.

It's worth noting that forcing the render function for a given node to run will create more of the virtual DOM tree, but it won't necessarily build everything below that node. In particular, the tree that's created may contain yet more lazy nodes, which won't be forced until the diffing process gets to them.

By making enough nodes lazy in this way, you can incrementalize the computation of the entire virtual dom tree, only forcing the recomputation of parts of the virtual dom that could have changed given the changes in the underlying data model.

Elm's approach has some limitations. While it doesn't limit memoization to a particular notion of a component, it does tie it to nodes in the DOM tree. This can be limiting, since it prevents you from sharing other parts of the computation that don't result concretely in DOM nodes.

It's also a little anti-modular, in that you basically need to call your lazy function on simple values and top-level functions, so ordinary functional programming modularization techniques, which often rely on passing around closures, don't work as well as you'd hope.

Beyond virtual DOM

Virtual DOM isn't the only approach to simplifying the process of programming the DOM. Another example I ran across is Mike Bostock's amazing D3 library. D3 has some of the same goals as virtual DOM, in that it aims to provide a nice way to construct complex web pages based on some more abstract data model. Like virtual DOM, D3's approach lets you specify the view logic once, while producing a view that responds efficiently to changing data. D3 is doing this in the service of data visualization, but the approach it takes is not limited to that domain.

Where virtual DOM encourages you to think of your view calculation as an all at once affair, D3 makes you think about incrementalization explicitly where it matters. In particular, when you specify how the view changes in response to data, you do so by explicitly specifying what happens in three cases: enter, update, and exit. The enter case corresponds to new data points arriving, update corresponds to data points that are changing, and exit corresponds to data being removed.

These transformations are specified using a spiffed up version of the DOM selectors API, which lets you can select a collection of nodes by stating conditions that those nodes satisfy. You can then specify ways of transforming those nodes, and, somewhat surprisingly, specify the creation of nodes that don't exist yet. This is done using the append operation, and is all part of what's is called data binding in the D3 world.

If this sounds confusing, well, I found it confusing too. But the D3 approach has some good things going for it. For one thing, it gives you a natural way of thinking about animations, since you can specify simple animations to run on the enter/exit/update actions, which is more awkward in virtual DOM based approaches.

To borrow an analogy from circuits, virtual DOM is level-triggered, meaning the view depends only on the current value of the state; but D3 is edge-triggered, meaning that the display logic can depend on the precise transition that's occurring. This is a real difference in the models, but I'm not sure how important it is in practice.

To some degree, you can get around this issue on the virtual DOM side by expressing more time-dependent information in the model. Also, you can add edge-triggered events on top of your virtual DOM, which React does. That said, it's not as front-and-center in the Virtual DOM API as it is with D3, where edge-triggered animations are an easy and natural part of the design.

Incrementality everywhere

Given that incrementality seems to show up in one form or another in all of these web frameworks, it's rather striking how rarely it's talked about. Certainly, when discussing virtual DOM, people tend to focus on the simplicity of just blindly generating your virtual DOM and letting the diff algorithm sort out the problems. The subtleties of incrementalization are left as a footnote.

That's understandable, since for many applications you can get away without worrying about incrementalizing the computation of the virtual DOM. But it's worth paying attention to nonetheless, since more complex UIs need incrementalization, and the incrementalization strategy affects the design of a UI framework quite deeply.

The other benefit of thinking about incrementalization as a first class part of the design is it can lead you in new directions. In that vein, I've been experimenting with using self-adjusting computations, as embodied by our Incremental library, as another approach to incrementalizing computation of the virtual DOM.

Self-adjusting computations is a general purpose approach to building efficient on-line computations developed by Umut Acar in his dissertation. Thinking about Incremental in the context of GUI development has lead us to some new ideas about how to build efficient JavaScript GUIs, and some new ideas about how Incremental itself should work. I hope to write more about that in an upcoming post.

(You can check out the next post here.)


Most of what I've written here comes from talking to people who know a lot more about the web than I do, and I wanted to thank them. I had some very enlightening conversations with Jordan Walke about React's design and history. I've also talked a bunch to Evan Czaplicki about Elm, and I'm indebted to Spiros Eliopoulos, who helped me learn a bit about D3, in part through his ocaml-d3 bindings. Also thanks to Hugo Heuzard and Izzy Meckler for writing a bunch of useful code and helping me learn about js_of_ocaml and more generally about various details of JavaScript and modern browsers.

Why OCaml?

Here's a post from a talk I gave this last summer during our internship program about why we use OCaml. It spends a lot of time on how OCaml fits into the space of programming language designs, and why we think OCaml is in a real sweet spot in that design space, especially for the kind of work we do at Jane Street.

Warning: it's a very informal talk, with lots of questions and answers from the audience, not all of which are clearly audible, for which I apologize. Still, I hope people will get something out of it.

Testing with expectations

Testing is important, and it's hard to get people to do as much of it as they should. Testing tools matter because the smoother the process is, the more tests people will write.

Especially in the functional programming world, most of the talk about testing tools is focused on tools for property-based testing, like the various and sundry quickcheck-style systems. These are great, but sometimes, you don't want to write down properties --- what you want is to write your tests in terms of simple, concrete scenarios.

We've recently added support for what we call expect tests, a kind of test optimized for this kind testing. Expect tests allow you to write test scenarios without having to manually write out the output generated by the code you're testing. Instead, that output is captured and recorded automatically for you, in a way that makes it easy to integrate into the source of the test.

Our expect tests were inspired by Mercurial's unified test format. Unified tests are designed for testing command-line tools like hg, and so are specialized to the shell.

Here's an example using cram, an implementation of the unified test idea that is independent of Mercurial. Let's say we want to test the UNIX sort command. We might start by writing a test file, simple.t, that looks like this.

Dump some lines into a file

  $ cat > colors << HERE
  > red
  > yellow
  > green
  > HERE

sort the file and dump to stdout

  $ sort colors

If you then run cram on this, cram will show you that the test failed by showing you a diff.

expect-test $ cram simple.t
--- simple.t
+++ simple.t.err
@@ -10,5 +10,11 @@
 sort the file and dump to stdout

   $ sort colors
+  green
+  red
+  yellow

It also creates a new file, simple.t.err, which contains the output of running the script, intermixed with the script itself. You can accept the new version just by moving the err file over the original.

mv simple.t.err simple.t

If you run cram now, you'll see that the tests pass.

$ cram simple.t
# Ran 1 tests, 0 skipped, 0 failed.

If we break the tests somehow, then the diff will show us exactly what failed. For example, if we replace sort with cat, here's what Cram will show us:

bash-3.2$ cram simple.t 
--- simple.t
+++ simple.t.err
@@ -10,7 +10,7 @@
 sort the file and dump to stdout

   $ cat colors
-  green
+  green

# Ran 1 tests, 0 skipped, 1 failed.

Note how good the diff is for seeing how your test failed.

Starting with the development of Iron last year, we started using cram tests pretty extensively for command-line programs. We found it to be a very productive idiom, but it's pretty awkward to apply outside of the command-line domain. That's why we started thinking about how to get the benefits of cram, but in OCaml.

Breaking out of the shell

Unified tests are great for three reasons:

  • they let you combine the scenario and the output of that scenario (and comments) into one readable file
  • they help you construct the file automatically
  • they display test failures as easy-to-interpret diffs.

None of these advantages is tied to using the shell. To bring this to OCaml, though, we needed to figure out a reasonable way of embedding these tests in an OCaml program, without breaking all of the tooling. We did this by leveraging OCaml's annotations, which let us get the data we need in place without breaking from the ordinary syntax of an OCaml program. That means that tools like merlin and ocp-indent and editor modes like tuareg will work without incident.

We can write the OCaml analogue of our cram test by creating the following file, named

  1. open Core.Std
  3. let%expect_test "simple sort" =
  4. let sorted = List.sort ["red";"yellow";"green"] in
  5. [%sexp_of: string list] sorted |> Sexp.to_string_hum |> print_endline;
  6. [%expect {| |}]

Here, the let%expect_test introduces a new test, and registers it with our inline test framework. [%expect {| |}] introduces a section where output is captured, and multiple such declarations can go in a single test.

Since we haven't actually filled in the output, running the test will fail. Here's the diff it would show.

open Core.Std

  let%expect_test "simple sort" =
    let sorted = List.sort ["red";"yellow";"green"] in
    [%sexp_of: string list] sorted |> Sexp.to_string_hum |> print_endline;
-   [%expect {| |}]
+   [%expect {| (green red yellow) |}]

As with cram, a new file will have been generated, in this case called, containing the updated test file. As with cram, you can accept the new test results by just copying the generated file over the original.

The above example is simple, but expect tests really shine when you start doing bigger and more complicated scenarios. And the ability to do this in ordinary OCaml code means you can use it for a much wider set of applications.

The source hasn't been released yet, but it will come out as part of our ordinary public release process, and we hope others will give it a spin when it does come out.

Quickcheck for Core

Automated testing is a powerful tool for finding bugs and specifying correctness properties of code. Haskell's Quickcheck library is the most well-known automated testing library, based on over 15 years of research into how to write property-base tests, generate useful sources of inputs, and report manageable counterexamples. Jane Street's Core library has not had anything comparable up until now; version 113.00 of Core finally has a version of Quickcheck, integrating automated testing with our other facilities like s-expression reporting for counterexample values, and support for asynchronous tests using Async.


There are at least two other Quickcheck implementations on OPAM. Why write implementation N+1?

Before working at Jane Street, I did some research relating to randomized testing (Eastlund 2009, Klein et al. 2012). In both cases, users of the software packages involved found it easy to write tests that use random data, but hard to write random distributions that actually produce values that are useful to test.

This Quickcheck clone started as an attempt to address those concerns by building tools for tuning random distributions of values. One way I've done that is building in tools for "remembering" sets of chosen values; for example, random tests won't repeat values, and it is easy to build distributions that generate sets or sorted lists and maintain uniqueness. This design is still in early stages, I don't know if these are the sort of tools that will be needed most, and I'm eager to get feedback from more users. The library has also evolved to integrate Quickcheck-style testing with the Jane Street Core and Async libraries. Over time I hope this will also produce useful random distributions for more of the Core types, so they will be easily testable.


For those unfamiliar with Haskell's Quickcheck, the library allows users to write tests of some property of a function, then test that property on many automatically-generated input values. For example, we might want to test that the optimized implementation of list-append in Core is associative:

  1. TEST_UNIT "associativity of list append" =
  2. Quickcheck.test Quickcheck.Generator.(tuple3 (list int) (list int) (list int))
  3. ~sexp_of:<:sexp_of>
  4. ~f:(fun (xs, ys, zs) ->
  5. <:test_eq>
  6. (List.append xs (List.append ys zs))
  7. (List.append (List.append xs ys) zs))

The test above randomly generates three lists of integers, appends them together two different ways, and tests that the results are equal. This process is repeated with different randomly chosen lists each time, until an error is reported or the default trial count is reached. Let's break down the parts of the code here.

  • TEST_UNIT is a camlp4 syntax for unit tests.
  • Quickcheck.test is the main entry point for running a test using Quickcheck. It takes two required, unnamed arguments. The first is a generator, specifying the probability distribution of values to choose from when generating inputs for the test. The second is a function that consumes the generated input values and runs a test. The function returns () if successful and raises an exception otherwise.
  • Quickcheck.Generator.(tuple3 (list int) (list int) (list int)) constructs the generator that we want to use here. Most of the functions in the Quickcheck.Generator module are named after the types they generate; here, default probability distributions for lists of ints, combined using tuple3.
  • We provide the optional named argument ~sexp_of to Quickcheck.test. This argument is used to render the first generated value that triggers an error. The <:sexp_of< ... >> expression is camlp4 syntax for the default sexp conversion for a type.
  • The final argument to Quickcheck.test is a function that takes the tuples of lists produced by our generator, appends them two different ways, and compares the output. <:test_eq< ... >> is camlp4 syntax for an equality test.

The example above uses the s-expression conversions and camlp4 syntax extensions that are common in Jane Street's libraries, but these things are not necessary for using Quickcheck. Quickcheck.test just needs a generator built from the functions in Quickcheck.Generator and a function that raises an exception on failure, and it will return () if successful or raise an exception describing the nature of the failure if not.


The primary data structure used by Quickcheck is the generator, or 'a Quickcheck.Generator.t. This corresponds to an implementation of the Arbitrary type class in Haskell's Quickcheck. Primarily, a generator represents a random distribution of values of type 'a, although in our implementation there is a little more metadata besides that under the hood. The Quickcheck.Generator module provides default distributions of several types, and tools for creating more distributions or customizing the existing ones.

In our example above, we generated three lists of integers using the following expression.

  1. Quickcheck.Generator.(tuple3 (list int) (list int) (list int))

Looking at the implementation of Core.Std.List.append, we can see that the implementation works in chunks of 5 elements, and changes behavior after 1000 chunks. So we might want to change our generator to make sure we get lists of the lengths we want to test.

  1. let open Quickcheck.Generator in
  2. let list_int = list int ~length:(`Between_inclusive (4900,5100)) in
  3. tuple3 list_int list_int list_int

Some experimentation might show us that this still doesn't hit the list lengths we want, as often as we want. The [Quickcheck.Generator.int_between] function, however, is documented as stressing boundary conditions, so we should be able to use it to get values at the upper and lower ends of the range we want. Here, it helps us that generators form a monad. If we combine generators using monadic bind, we get a weighted composition of their probability distributions. We can use that to first generate lengths for our lists, then use those randomly-generated lengths to build generators for the lists themselves.

  1. let open Quickcheck.Generator in
  2. let list_int =
  3. int_between ~lower_bound:(Incl 4900) ~upper_bound:(Incl 5100)
  4. >>= fun len ->
  5. list int ~length:(`Exactly len)
  6. in
  7. tuple3 list_int list_int list_int

Now we have a generator for three lists of integers, each list with a length between 4900 and 5100 inclusive, weighted toward the ends of that range. This is probably sufficient for our purposes. But if we want to go further down, if we decide that we need a very specific probability distribution, we can build one from scratch ourselves. Here is a rather idiosyncratic example that demonstrates the tools available in Quickcheck.Generator.

  1. let open Quickcheck.Generator in
  2. let rec ranges_of_five_between lower upper =
  3. if upper - lower < 10
  4. then of_list (List.range lower upper ~start:`inclusive ~stop:`inclusive)
  5. else weighted_union
  6. [ 5., singleton (lower + 0)
  7. ; 4., singleton (lower + 1)
  8. ; 3., singleton (lower + 2)
  9. ; 2., singleton (lower + 3)
  10. ; 1., singleton (lower + 4)
  11. ; 1., of_fun (fun () -> ranges_of_five_between (lower + 5) (upper - 5))
  12. ; 1., singleton (upper - 4)
  13. ; 2., singleton (upper - 3)
  14. ; 3., singleton (upper - 2)
  15. ; 4., singleton (upper - 1)
  16. ; 5., singleton (upper - 0)
  17. ]
  18. in
  19. let list_int =
  20. ranges_of_five_between 4900 5100
  21. >>= fun len ->
  22. list int ~length:(`Exactly len)
  23. in
  24. tuple3 list_int list_int list_int

This example uses a few more functions from Quickcheck.Generator. The of_list function takes a list of values and produces a generator that makes a uniform choice among them. weighted_union creates a probability distribution representing a weighted choice among the probability distributions of the associated sub-generators. singleton produces constant-valued generators, and of_fun produces a lazily-evaluated (but not memoized) generator. (Memoizing during random testing causes some unfortunate space leaks, it is important to be able to release resources after a batch of tests.) While this peculiar generator is probably not of practical use, it shows that when we need to, we can dig down into the interface and build whatever probability distribution we want.

Of course, it is also useful to construct generators for new types.

  1. type bst = Leaf | Node of bst * int * bst
  2. let gen : bst Quickcheck.Generator.t =
  3. let open Quickcheck.Generator in
  4. recursive (fun self ->
  5. let node =
  6. self >>= fun l ->
  7. int >>= fun k ->
  8. self >>| fun r ->
  9. Node (l, k, r)
  10. in
  11. union [ singleton Leaf; node ])

The function Quickcheck.Generator.recursive is a fixed-point generator that helps build simply-recursive generators that need to invoke themselves and don't have additional arguments. The function union is like weighted_union, but for uniform choice.


In Haskell's Quickcheck, there is a duality between the type class Arbitrary for generating random values and Coarbitrary for observing inputs to random functions. Our version of Quickcheck mirrors Generator with Observer. Most tests using Quickcheck do not need an observer, but if you want to generate a random input for a higher-order function, you will need an observer for the function's input type.

  1. TEST_UNIT "function composition" =
  2. let open Quickcheck.Generator in
  3. Quickcheck.test
  4. (tuple3
  5. (fn char)
  6. (fn Quickcheck.Observer.string int)
  7. string)
  8. ~f:(fun (f, g, x) ->
  9. <:test_eq< char >>
  10. ((Fn.compose f g) x)
  11. (f (g x)))

Here, Quickcheck.Generator.fn creates a generator for functions. It takes two arguments: an observer for the function's input type and a generator for the function's output type.

Think of an observer as a "generator of decision trees". For instance, might randomly generate any of the following decision trees:


 / \
?   ?

  / \
 x>2 ?
 / \
?   ?

These decision trees control how a randomly-generated function will use its input. The generator for the function's output is used to fill each of the ?s in with a concrete value. The result is a family of functions operating on the appropriate types, making randomly-chosen observations on the input and producing randomly-chosen outputs based on those observations.

If you need to build an observer for a custom type, there are tools for that as well.

  1. type bst = Leaf | Node of bst * int * bst
  2. let obs : bst Quickcheck.Observer.t =
  3. let open Quickcheck.Observer in
  4. recursive (fun self ->
  5. unmap (variant2 unit (tuple3 self int self))
  6. ~f:(function
  7. | Leaf -> `A ()
  8. | Node (l, k, r) ->`B (l, k, r))
  9. ~f_sexp:(fun () -> Atom "variant2_of_bst"))

As with generators, there is a fixed point function Quickcheck.Observer.recursive that helps for simply-recursive types. The function unmap transforms an input of some new type into an input for which we already have an observer. Variant types can be transformed to polymorphic variants, which have default observers variant2 through variant6. Records and constructor arguments can be transformed to tuples, which have default observers tuple2 through tuple6.

Work in Progress

Our OCaml adaptation of Quickcheck is new and still evolving. We already have some changes to the library internally which will be released over time, such as moving default generators and observers out of the Quickcheck module and into the modules for each type. For example, becomes Int.gen.

There are still some pragmatic lessons to learn about how best to use our formulation of the library, how to calibrate our default distributions, and what other distributions we might want to provide. As always, we hope to get feedback from anyone who tries out this library so that we can improve it.

Happy testing!

No (functional) experience required

Jane Street is a serious functional programming shop. We use OCaml, a statically typed functional language for almost everything and have what is probably the largest OCaml codebase anywhere.

This leads lots of people to think that they shouldn't even bother applying, under the assumption that we are only interested in hiring deep functional programming gurus. I think people get to this conclusion in part because they think of functional languages, especially those with fancy type systems, as arcane tools that can only be used effectively after deep study.

To the contrary, one of the reasons we started building production systems with OCaml was that it was relatively easy to understand, even for people with no formal CS background. Since then, we've had good experiences taking students with no functional experience at all and getting them to the point of being able to complete a project in just a few weeks. We also have a very successful "OCaml Bootcamp" program, where over four weeks, we train all of the incoming traders and many other non-developer hires on OCaml and our development tools and libraries. By the end, most of them are able to create useful applications.

All of this is to say that we don't go out of our way to hire people who are already familiar with functional programming. In practice, it's just not that hard for strong programmers to pick it up after they start.

That said, a solid majority of the developers we hire do come in with functional programming experience --- but that's because of their preferences, not ours. Programmers with an interest in functional languages have an extra reason to want to work here, and so we get an unusually high number of good applicants from that pool.

There's a more general lesson here: using well-loved tools is a good way of attracting (and retaining) great developers.

Introducing Incremental

I'm pleased to announce the release of Incremental (well commented mli here), a powerful library for building self-adjusting computations, i.e., computations that can be updated efficiently when their inputs change.

At its simplest, you can think of a self-adjusting computation as a fancy spreadsheet. In a spreadsheet, each cell contains either simple data, or an equation that describes how the value in this cell should be derived from values in other cells. Collectively, this amounts to a graph-structured computation, and one of the critical optimizations in Excel is that when some of the cells change, Excel only recomputes the parts of the graph that depend on those changed cells.

What makes self-adjusting computation (or SAC) different from a spreadsheet is its dynamism. The structure of the computational graph in an SAC can change at runtime, in response to the changing input data.

This dynamism gives you quite a bit of flexibility, which can be used in different ways. Here are a few.

On-line combinatorial algorithms. Incremental is based on work by Umut Acar et. al.., on self adjusting computations (that's where the term comes from), and there, they were mostly interested in building efficient on-line versions of various combinatoral algorithms. In many cases, they could match the asymptotic complexity of custom on-line algorithms by fairly simple incrementalizations of all-at-once algorithms.

Incremental GUI construction. One simple and natural way to model a GUI application is to structure your display as a function that geneates a view from some more abstract data model.

Having a function that constructs a view from scratch at every iteration is simple, but prohibitively expensive. But if you can write this function so that it produces an incrementalized computation, then you have a solution that is both simple and efficient. We've used this technique in a number of our UIs, to good effect.

This might remind you of how functional reactive programming (FRP) is used for construction of GUIs in languages like Elm. SAC and FRP have different semantics --- FRP is mostly concerned with time-like computations, and SAC is mostly about optimizing DAG-structured computations --- but they are nonetheless closely related, especially at the implementation level. You can see my post here for a description of the broader conceptual landscape that includes both FRP and SAC.

Configurable computations. An example that comes from our own work is risk calculations. Calculating measures of risk of a portfolio involves combining data from a complex collection of interdependent models. Each of these models is dependent both on live data from the markets, and on configurations determined by users.

A config change could merely tweak a coefficient, or it could change the overall structure of the computation, say by changing the list of factors used by a given model. Incremental allows you to build a computation that can update efficiently in response to both simple data changes as well as more structural config changes, in one unified framework.

A taste of Incremental

It's hard to give a compelling example of Incremental in action in just a few lines of code, because what makes Incremental really useful is how it helps you build large and complex computations. Nonetheless, small examples can give you a sense of how the library works.

To that end, let's walk through a few small examples. To begin, we need to instantiate the Incremental functor.

  1. open Core.Std
  2. module Inc = Incremental_lib.Incremental.Make ()

Each instance thus generated is its own independent computational world. The Incremental functor is generative, meaning it mints fresh types each time it's applied, which prevents values from different incremental worlds from being mixed accidentally.

An Incremental computation always starts at its variables. Modifications to variables are how updates to input data are communicated to Incremental.

Let's write down a few variables corresponding to the dimensions of a rectangular prism.

  1. module Var = Inc.Var
  3. (* dimensions of a rectangular prism *)
  4. let width_v = Var.create 3.
  5. let depth_v = Var.create 5.
  6. let height_v = Var.create 4.

We can use to get the (trivial) incremental computions associated with each variable.

  1. let width = width_v
  2. let depth = depth_v
  3. let height = height_v

The following is an incremental computation of the base area of the prism, and of the volume.

  1. let base_area =
  2. Inc.map2 width depth ~f:( *. )
  3. let volume =
  4. Inc.map2 base_area height ~f:( *. )

In order to get information out of an incremental computation, we need to explicitly mark which nodes we want data from by creating observer nodes. Because it knows which nodes are observed, the framework can track what parts of the computation are still necessary to the results.

  1. let base_area_obs = Inc.observe base_area
  2. let volume_obs = Inc.observe volume

In order to force the computation to run, we need to explicitly call Inc.stabilize. Here's some code that uses stabilize to run the computation and then gets the information from the observers.

  1. let () =
  2. let v = Inc.Observer.value_exn in
  3. let display s =
  4. printf "%20s: base area: %F; volume: %F\n"
  5. s (v base_area_obs) (v volume_obs)
  6. in
  7. Inc.stabilize ();
  8. display "1st stabilize";
  9. Var.set height_v 10.;
  10. display "after set height";
  11. Inc.stabilize ();
  12. display "2nd stabilize"

If we run this, we'll se the following output:

1st stabilize: base area: 25.; volume: 125.
    after set height: base area: 25.; volume: 125.
       2nd stabilize: base area: 25.; volume: 250.

Note that setting the height isn't enough to change the observed values; we need a stabilization to make that happen.

That's a fairly trivial computation, and there certainly isn't much to incrementalize. Let's try something a little more complicated: a function for merging together an array of incrementals, using some commutative and associative operator like addition or max.

  1. let rec merge ar ~f =
  2. if Array.length ar <= 1 then ar.(0)
  3. else
  4. let len = Array.length ar in
  5. let len' = len / 2 + len % 2 in
  6. let ar' =
  7. Array.init len' ~f:(fun i ->
  8. if i * 2 + 1 >= len then ar.(i*2)
  9. else Inc.map2 ar.(i*2) ar.(i*2+1) ~f)
  10. in
  11. merge ar' ~f;;

Because this is done using a binary tree as the dependency graph, the complexity of updating an element is log(n), where n is the size of the array. We can use this for, computing an average:

  1. let average ar =
  2. let sum = merge ar ~f:(+.) in
  3. sum ~f:(fun s -> s /. float (Array.length ar))

This works, but we can do better performance-wise, at least, if our merge operation has an inverse. In that case, maintaining the sum can in principle be done on constant time, by first, removing the old value before adding in the new. Incremental has a function for taking advantage of this structure.

  1. let sum ar =
  2. Inc.unordered_array_fold ~f:(+.) ~f_inverse:(-.) ar;;

Now, let's say we want to do something a little more dynamic: in particular, what if we want to compute the average of a prefix of the given array? For that, we need to use the bind function, which allows us to produce new incremental nodes within an incremental computation.

  1. let average_of_prefix ar length =
  2. Inc.bind length (fun length ->
  3. average (Array.init length ~f:(fun i -> ar.(i))))

The type of this function is float Inc.t array -> int Inc.t -> float Inc.t, so the length of the prefix is a fully fledged part of the incremental computation. As a result, the dependency structure of this computation changes dynamically, e.g., if the value of length is 7, then the computation only depends on length and the first 7 elements of the array.

Hopefully this gives you enough of a sense of what Incremental is about to start thinking about where it might be useful for you. Note that the overhead of incremental is not inconsiderable --- on my laptop, firing a single node takes on the order of 30ns, which is far more than, say, summing numbers together. Incremental tends to be useful when the computation that is put into a single node is large relative to that overhead, or when the computational graph is large relative to the sub-graph that needs to be recomputed. Our experience has been that there are plenty of applications in this space that can benefit from Incremental.

Converting a code base from camlp4 to ppx

As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the code source of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.

For instance to translate code using pa_ounit and pa_test, we have to rewrite:

TEST = <:test_result< int >> ~expect:42 (f x)


let%test _ = [%test_result: int] ~expect:42 (f x)

For small to medium projects it is enough to just take a couple hours to translate the source code by hand. But at Jane Street where we have a huge OCaml code base making extensive use of camlp4, it is simply not realistic. So we needed a tool to do that for us.

Writing a tool to automatically convert the syntax

Since the output of such as tool has to be accepted as the new code source that is committed in our repository, it must preserve the layout of the original file as much as possible and of course keep the comments. This mean that any approach using an AST pretty-printer would be extremely complex.

The path we choosed is to textually substitute the foreign syntaxes in the original file for the new ones. One could imagine doing that with a tool such as sed, awk, perl, ... however doing it properly would be fastidious and it would be pretty-much impossible to be 100% sure it would never translate things it is not supposed to. Plus writing perl is not as fun as writing OCaml programs.

Instead there is an easy way to find the foreign syntaxes: using camlp4 itself. To subsitute the text of foreign syntaxes the only thing we need to know is their location in the original file, and camlp4 can help us with that.

Writing dummy camlp4 syntax extensions

The idea is to write for each camlp4 syntax extension a dummy one that define the same grammar productions as the real one, but instead of generating code it simply record substitutions at certain locations.

Then we do the following:

  • parse a file with camlp4 and our dummy syntax extensions
  • apply all the recorded substitutions to the original file

This approach has the advantage of interpreting the original file in the exact same way as our regular syntax extensions. Giving us good confidence we did not change the syntactic constructions by mistake.

To do so we define this API:

(** [replace loc repl] records a text substitution that replaces the
    portion of text pointed by [loc] by [repl]. *)
val replace : Loc.t -> string -> unit

Then writing a dummy camlp4 syntax extension is pretty easy. For instance for a subset of pa_ounit:

  GLOBAL: str_item;

  test: [[ "TEST" -> replace _loc "let%test" ]];

    [[ `STRING _; "=" -> ()
     |            "=" -> replace _loc "_ ="

    [[ test; name_equal; expr -> <:str_item< >>

On the fly conversion and diffing the generated code

Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:

  • for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
  • we processed the original file using camlp4 and the translated one using our ppx rewriters
  • in both case we saved the output of -dparsetree (human-readable version of the internal OCaml AST) and -dsource (pretty-printed code)
  • we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource

This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.

We have quite a lot of code in our camlp4 syntax extensions and converting them to ppx was a long mechanical job and so quite error-prone. Given that this diffing turned out to be really helpful to find errors.

The Camlp4 Syntax is not quite the OCaml syntax

While using this we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:

let _ x = x

let f l = l ~f:fun x -> x + 1

These where quite easy to fix automatically as well using camlp4-to-ppx.

Github repo and extension

We published a slightly modified version of this tool on github.

The method we used doesn't work out of the box with all syntax extensions. For instance to convert code using lwt.syntax some more work needs to be done on camlp4-to-ppx. But it is a good starting point.

CPU Registers and OCaml

Even though registers are a low-level CPU concept, having some knowledge about them can help write faster code. Simply put, a CPU register is a storage for a single variable. CPU can keep data in memory or cache or in registers and registers are often much faster. Furthermore, some operations are possible only when the data is in registers. Hence, the OCaml compiler tries to keep as many variables as it can in the registers.

Code with more than 13 variables is slow?

Consider this primitive statistics computation:

let stats xs ys =
  let len = Array.length xs in
  if len Array.length ys then
    raise not_of_same_length;
  let sx = ref 0 in
  let sy = ref 0 in
  let sxx = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let sax = ref 0 in
  let say = ref 0 in
  for i = 0 to len - 1 do
    let x = Array.unsafe_get xs i in
    let y = Array.unsafe_get ys i in
    let ax = abs x in
    let ay = abs y in
    let xx = x * x in
    let yy = y * y in
    let xy = x * y in
    sx := !sx + x;
    sy := !sy + y;
    sxx := !sxx + xx;
    syy := !syy + yy;
    sxy := !sxy + xy;
    sax := !sax + ax;
    say := !say + ay;
  !sx, !sy, !sax, !say, !sxx, !syy, !sxy

Rearranging just a few lines produces code 1.5-2x faster:

let x = Array.unsafe_get xs i in
    sx := !sx + x;
    let xx = x * x in
    sxx := !sxx + xx;
    let ax = abs x in
    sax := !sax + ax;
    let y = Array.unsafe_get ys i in
    sy := !sy + y;
    let xy = x * y in
    sxy := !sxy + xy;
    let ay = abs y in
    say := !say + ay;
    let yy = y * y in
    syy := !syy + yy;

Why is that? CPU has just a few registers:

  • 13 which can contain integers and pointers to arrays and records
  • 16 which can contain floating point numbers

If the code has more than that many variables, OCaml compiler has to park the extra variables in memory and this parking is called spilling. Actually, spilling may happen even when there are less variables, because for example some operations like integer multiplication use extra registers.

Therefore, it's good to try to keep the number of frequently used variables to 13 or less, or to rearrange the code so that fewer variables are used at the same time.

The OCaml compiler can show spilled variables when called with the option -dreload.

Calling a single function makes subsequent code slower?

If a function is called, all of the active registers are spilled because it is not known whether the called function will need those registers. That can often be the largest penalty when calling a function, assuming the function is not inlined.

Let's change the previous function:

let stats xs ys =
  let sx  = ref 0 in
  let sax = ref 0 in
  let sxx = ref 0 in
  let sy  = ref 0 in
  let say = ref 0 in
  let syy = ref 0 in
  let sxy = ref 0 in
  let len = Array.length xs in
  if len <> Array.length ys then
    failwith (Printf.sprintf "Arrays not of the same length: %d %d"
      len (Array.length ys));
  for i = 0 to len - 1 do

This produces 1.35x slower code simply because OCaml compiler will spill all of the variables because of Printf.sprintf. In each iteration, OCaml will pull sx from the memory and store it back.

It's a pity that this is actually not necessary. OCaml has to pull sx from the memory and store it back just once, not in each iteration. Looks like that can be improved in the OCaml compiler.

Recursive functions with more parameters are faster?

A function can get data from input parameters, from closure environment and from memory. Input parameters are normally stored in registers. Therefore, using input parameters will generally be slightly faster than the closure environment and memory. This can come in handy in recursions.

Take for example this function which finds the index of the given integer in the array:

let index (a : int array) e =
  let len = Array.length a in
  let rec loop i =
    if i < len then begin
      if Array.unsafe_get a i = e then Some i
      else loop (i + 1)
    end else None
  loop 0

The variables e and len are pulled from closure environment in each iteration. Moving them to input parameters speeds up the function 1.15-1.33 times:

let index (a : int array) e =
  let rec loop e len i =
    if i < len then begin
      if e = Array.unsafe_get a i then Some i
      else loop e len (i + 1)
    end else None
loop e (Array.length a) 0

Further reading

Processor Register

Register Allocation

Building a lower-latency GC

We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.

The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.

One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.

This works, but makes for a less pleasant coding experience (and code that is trickier and harder to reason about.) So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.

What are the problems?

OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, which means that collection of the major heap can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do pretty well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Still, there are problems with OCaml's collector.

No profiling

There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.

False promotions

OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.


One fundamental issues with the stock runtime is that the collector is clocked in terms of minor allocations --- ignoring, critically, the amount of time that has gone by.

This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.

One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.

Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right, since if you run them when the minor heap is too small, then you'll end up accidentally promoting a rather large fraction of your minor allocations.

Insufficient incrementality

While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.

Immediate accounting

The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.

But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.

Making it better

Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.

Better profiling

A set of probes was added to the GC, allowing us to record in a quite detailed way every phase of the collection process. This is quite detailed, telling you the phase (marking vs sweeping) and the sub-phase, as well as keeping track of a collection of useful counters. This is available in the instrument branch.


Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.

Better incrementalization

Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.

Separating major slices from minor collections

In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.

The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen in the case that the minor heap is bigger than some configured level, to avoid too much false promotion; but a major collection can be done at any time.

Smoothed work-accounting

Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.

The runtime also uses these buckets for keeping track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket to negative territory.

When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.

Segmented free lists

A big part of the cost of minor collections is the cost of finding free blocks. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free-lists, for a configurable set of sizes. e.g., one could have a different free list for blocks with 1, 2, 3 and 4 slots.

This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.

How is it going?

This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.

That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.

This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.

A call for help

The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.

So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!

Faster OCaml to C calls

The official OCaml documentation "Interfacing C with OCaml" doesn't document some interesting performance features.

C functions with no OCaml allocation

A C call can allocate OCaml data and pass it back to OCaml, for example using caml_copy_string(s). Between the C call allocating OCaml data and passing it back, it has to make sure that OCaml's Garbage Collector doesn't collect it, as the Garbage Collector can be triggered during the C call. There's an intricate mechanism which assures that, part of which are CAMLparam, CAMLlocal and CAMLreturn.

This mechanism can be bypassed if the C call is not going to allocate any OCaml data. This can yield performance benefits especially in shorter functions. To bypass it, CAMLparam, CAMLlocal and CAMLreturn should not be used and the primitive should be declared with "noalloc".

For example, OCaml's compare is not smart to avoid branch mispredictions for floats. Moving comparison to C speeds it up a little bit. "noalloc" speeds it up a lot.

float compare            8.93 ns
float_compare_c          7.88 ns
float_compare_c_noalloc  5.32 ns

external float_compare_noalloc : float -> float -> int =
  "float_compare_noalloc_stub" "noalloc"

CAMLprim value float_compare_noalloc_stub(value vf, value vg)
  double f = Double_val(vf);
  double g = Double_val(vg);
  return Val_int((f > g) - (f < g) + (f == f) - (g == g));

C functions with float arguments and float return

Since C code must use boxed OCaml floats, any unboxed float must be boxed prior to the C call. This is not cheap, especially for fast functions. This boxing can be avoided if the C call accepts and returns only floats.

For example, float_min can be replaced with a single CPU instruction. Unfortunately, the C implementation is much slower because of boxing floats:

float_min                  6.09 ns
float_min_c               15.92 ns

let float_min (x : float) y = if x < y then x else y

CAMLprim value float_min_stub(value x, value y)
  CAMLparam2(x, y);
  double z = Double_val(y);

  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(Double_val(x)));
  v = caml_copy_double(z);

external float_min_c : float -> float -> float = "float_min_stub"

To avoid boxing, C function's arguments and return should be "double", CAMLparam, CAMLlocal and CAMLreturn should be avoided and the primitive should include "float" and both interpreted and compiled implementation:

float_min_c_float          1.95 ns

external float_min_inan_c_float : float -> float -> float
  = "float_min_inan_float_bytecode" "float_min_inan_float_stub" "float"

CAMLprim double float_min_inan_float_stub(double x, double y)
  double z = y;
  __asm__ ("minsd %1, %0;" : "+&x"(z) : "x"(x));
  return z;

C functions with float arguments and non-float return

We might be able to further speed up float_compare_noalloc if we avoided boxing. Alas, that function returns integer so it's impossible to use "float". Is it still possible to avoid boxing? The answer is yes, by simply converting float to int.

float_compare_c_float    3.73 ns

CAMLprim double float_compare_float_stub(double f, double g)
  return (f > g) - (f < g) + (f == f) - (g == g);

external float_compare_float : float -> float -> float
  = "float_compare_float_bytecode" "float_compare_float_stub" "float"
let float_compare_float x y = int_of_float (float_compare_float x y)