A better inliner for OCaml, and why it matters

OCaml 4.03 is branched and a first release candidate is imminent, so it seems like a good time to take stock of what’s coming.

This post will focus on just one of those features: Flambda, a new IR (intermediate representation) in the depths of the compiler designed to allow for better inlining, paired with a set of optimizations leveraging that IR.

Why inlining matters

If your expectations about inlining come from a language like C, you might not be all that excited by Flambda. In C, after all, the benefits of inlining are relatively small, mostly allowing one to avoid function call overhead. That’s useful, but has limited impact.

In a language like OCaml, inlining is about more than just function call overhead. That’s because inlining grows the amount of code that the optimizer can look at a given point, and that makes other optimizations more effective. And there are simply more optimizations available in a language like OCaml than in one like C. In particular, many allocations can be optimized away, which can have a rather big effect on the performance of a program.

To get a sense of how we can eliminate allocations, consider this simple example.

let rec fold l ~init ~f =
  match l with
  | [] -> init
  | hd :: tl -> fold tl ~init:(f init hd ) ~f

let pow_sum l n = 
  fold l ~init:0 ~f:(fun x y -> pow x n + pow y n)

In the current compiler, every invocation of pow_sum will allocate a closure for the function f. With Flambda, this code should be allocation free, since the fold can be inlined into pow_sum, f can be inlined into the body of the now inlined fold, and the fold can be specialized by adding an extra argument, thus making the closure allocation unnecessary.

Here’s another example. Imagine you wanted to sum the first and last elements of a list. You could do this using functions in the list module and a simple pattern match, as follows.

let first_plus_last =
  match List.hd l, List.last l with
  | None, _ | _, None -> None
  | Some x, Some y -> Some (x + y)

Without any inlining, this function creates three heap-allocated values, which are the various options being returned.

Now, let’s consider what happens if we rewrite this code using the option monad.

let (>>=) o f =
  match o with
  | None -> None
  | Some x -> f x

let first_plus_last l =
  List.hd l   >>= fun x ->
  List.last l >>= fun y ->
  Some (x + y)

Or, equivalently, using our ppx_let syntax:

let first_plus_last l =
  let%bind x = List.hd l in
  let%bind y = List.last l in
  Some (x + y)

Without any inlining, this implementation will create five heap allocated values: one for each of the closures on the right hand side of the bind, one for the options returned from List.hd and List.tl respectively, and once for the final Some.

It’s of course possible to write a version that only allocates one object, for the return value.

let first_plus_last l =
  match l with
  | [] -> None
  | x::_  ->
    let rec last_exn = function
      | [] -> assert false
      | [y] -> y
      | _ :: tl -> last_exn tl
    in
    Some (x + last_exn l)

But this is obviously uglier and harder to read than either of the earlier versions.

With Flambda (and the appropriate flags, i.e., (-O3 -unbox-closures), these examples all allocate the same amount. That’s because, once the inlining is done and all the code is in one place, the compiler can do things like observe that the option returned from List.hd is immediately deconstructed and is never used for anything else. As a result, the allocation of the options can be removed entirely.

This example highlights why these kinds of compiler optimizations are valuable. It’s not that they’re strictly necessary for good performance – usually, pretty good performance can be achieved with sufficiently ugly code. But what good optimizations can do for us is let us write simpler, easier to understand code, without sacrificing performance.

A basis for future optimizations

In addition to the optimizations that will be there when 4.03 lands, Flambda will also make it easier to build new optimizations. One example comes from a project that was done last summer by Will Crichton, one of our interns. The aim of the project was to make the kind of allocation removal I described above more predictable.

From how I described it above, it might seem like Flambda’s ability to remove allocations relies on the ability to inline the allocating functions in their entirety. That would be problematic, because what is and isn’t inlined can be a bit unpredictable. That’s because there have to be some heuristic thresholds, since inlining everything can lead to excessively large executables.

As such, depending on inlining makes it harder to predict the allocation behavior of your code, since it now hinges on the heuristics for determining when a function should be inlined. The situation isn’t quite so grim – some of the optimizations that come along with Flambda (like -unbox-closures) don’t depend so critically on inlining. But still, many of Flambda’s optimizations do depend on inlining, and as such whether they hit becomes harder to predict.

But we can make this better. The specific project the Will worked on had to do with allocations that come from returning small objects that are immediately deconstructed. With Flambda as it is today, such allocations are only removed if the both the construction and deconstruction of the value are together, which often requires inlining. Will’s project made this more robust by changing the calling conventions in Flambda, specifically by adding a first class multi-argument return to the Flambda IR.

With a multi-argument return in place, a function that returns a small tuple, say, could instead be compiled to return the components of the tuple as individual values to be passed back on the stack. Then, this function can be wrapped with a small function that picks those values off the stack and allocates the necessary object.

This small wrapper function is by design small enough to be inlined reliably. Once that inlining is done, both the construction and deconstruction of the value will be visible to the optimizer, and they can be collapsed. This should work whether or not the main function is inlined.

This is really just one example. Our hope and expectation is that Flambda will be the basis of many improvements in the compilation of OCaml over the next few years.

Impact

The performance impact we’ve seen from our experiments with Flambda seem pretty promising so far. On real world applications we’ve tested, it’s pretty normal to see allocations reduced by 20-30%. We’ve found similarly sized improvements in application latency as well. And we think that these numbers will improve yet more as more optimizations are added on top of Flambda.

Beyond improving existing programs, we already are seeing how Flambda allows us to write prettier and cleaner code without compromising on performance. For example, Flambda allows us to make freer use of OCaml’s functors, which are a great abstraction tool, but one that imposed significant costs pre-Flambda. With Flambda, functors can simply be inlined away.

Flambda makes things easier in other ways too. For example, we’re in the processing of developing a new internal protocol for zero-allocation messaging, which requires a lot of code generation, via PPX. Being able to rely on Flambda greatly simplifies that code generation, since it lets us write a fairly naive code generator, which generates efficient code because of Flambda. We can write code that computes offsets into the message naively, and Flambda, via a combination of inlining and constant folding, moves the computation of these offsets to compile time, so the runtime field access becomes a simple lookup.

Even if you’re perfectly happy with OCaml’s performance as is, Flambda is still an exciting change. That’s because various upcoming language improvements like modular implicits (a feature that brings some of the same benefits as Haskell’s typeclasses) will only really perform acceptably well with a good inliner in place. So in addition to making your current programs run faster, Flambda also enables new abstractions that can make your future programs easier to read and write.

One downside of all this is that it reduces the predictability of OCaml’s performance, which is an important property of the language. Part of how we hope to address this is by improving Flambda’s heuristics to be more predictable, but that’s unlikely to be enough on its own. That’s why OCaml 4.03 comes with new annotations that let the programmer require or prohibit inlining for a particular function.

Over time, we hope to find a good balance, where the heuristics do a pretty good and pretty predictable job by default, but where we give programmers hooks to control things more precisely where it matters.

Thanks

Getting Flambda this far has been a big project, and a lot of people were involved. OCamlPro did a lot of the heavy lifting, with Pierre Chambart writing most of the compiler code, and Louis Gesbert and Michel Laporte working on benchmarking. Jane Street jumped in too, with Mark Shinwell and Leo White contributing mightily to the code and design as part of the review and stabilization process, and Jeremie Dimino helping with testing. And a number of other people on the core team (notably Alain Frisch, Gabriel Scherer, Runhang Li, Damien Doligez) contributed in a variety of ways to the final product. Thanks to all involved!