The dangers of being too partial

This article deals with some not well-known dark corners of the OCaml compiler and how to get around them to produce more efficient code. The bottom line is that you should avoid using partial applications and instead prefer eta-expanding your functions to the maximum. To understand why, let’s compare the performance of the following definitions:

let f a b c = a + b + c
let g1 a l = List.fold_left ~f:(f a) ~init:0 l
let g2 a l = List.fold_left ~f:(fun b -> f a b) ~init:0 l
let g3 a l = List.fold_left ~f:(fun b c -> f a b c) ~init:0 l

Those three versions are all semantically equivalent (as long as f does not perform mutations between arguments): they all return the sum of the elements in l plus a for each element. Yet, they have quite different execution times (n is the length of the list l, please ignore the last 3 columns for the moment):

n	g1	g2	g3	g4	g5	g6
0	0.18	0.11	0.12	0.14	0.12	0.10
1	0.39	0.43	0.22	0.25	0.35	0.24
5	1.16	1.78	0.64	0.62	1.54	0.23
10	2.16	3.49	1.11	1.14	2.73	0.36
100	20.32	33.69	10.19	10.25	24.45	2.85

So, what is going on here? The answer is in the way the OCaml compiler optimizes calls to functions with multiple arguments. The lambda-calculus interpretation of f is of a function that takes one argument a, returns a closure (i.e. a function plus an environment containing the value of a), which in turn takes one argument b, etc. Applying f to three arguments would then amount to applying them one by one, creating intermediate closures in the process.

For efficiency reasons, most functional languages compile f as a function of arity 3, i.e. as a function optimized to take 3 arguments all at once. Most of the time, functions are called with all their arguments anyway, so this optimization is a good thing. In essence, this is similar (but not quite equivalent) to the following non-currified version of f:

let f_non_currified (a,b,c) = a + b + c

Obviously, the currified version of f is strictly more expressive since it allows partial applications and it avoids dealing with the allocation of the tuples (a,b,c), using three CPU registers or the stack instead. Still, there needs to be a mechanism to deal with applications where the numbers of arguments do not match: inside List.fold_left, there is a call to the argument ~f with two arguments (the list element and the accumulated result). If the arity of ~f is not 2, some generic functions caml_applyN and caml_curryN come into play to decompose the applications into simpler steps, potentially creating the intermediate closures we talked about. You might have seen those functions show up in the output of gprof sometimes, but not many people are actually fully aware of this internal mechanism (if someone knows of a good reference on this subject, please report it as I could not find any). If you are curious about their definitions, you can look at them in the (undocumented) output of ocamlopt -dcmm.

Let’s go back to our example and consider g3 first. One closure (the argument to List.fold_left) is initially allocated, but since this function is of arity 2 (b and c), all the calls from List.fold_left are direct and require no allocation.

One the other hand, g1 also allocates a closure initially, but it is of a different nature: it is a call to caml_curry3 with f and a as arguments. This function creates a closure of arity 1, simply waiting for the next argument, so that when List.fold_left calls it, there is an arity mismatch. We then end up passing the first argument b first, creating a new closure and finally passing in the second argument c. In the end, we allocate one extra closure per element in the list (and execute more instructions), which is why g1 is twice slower than g3.

g2 is even worse: the initial allocation creates a closure of arity 1 (waiting for the argument b). There is a first mismatch when List.fold_left calls it, as with g1, so we go into slow mode and pass in b only. But then there is a second mismatch inside the code of the local function when we call f (of arity 3) with only the two arguments a and b. Arguments are passed in one-by-one and we create two intermediate closures in the process. Finally, c which was put aside from List.fold_left is applied in a final step. In the end, g2 allocates two closures per element in the list and is three times slower than g3.

Those interpretations can be checked by measuring the number of allocated words with (Gc.stat ()).Gc.minor_words before and after a function call (I thank Stephen Weeks for this idea). The following results confirm how much GC pressure is created by g1 and g2 compared to g3 (a closure typically takes 4 or 5 words in our example):

g1	g2	g3	g4	g5	g6
`5 + 5n`	`4 + 10n`	5	5	`5 + 5n`	5

It is worth noting that the actual cost does not come so much from the allocating part (it is just a few extra instructions after all), but from the GC: reclaiming the memory back is quite expensive and this cost is amortized over all allocations. Reducing the total amount of allocation will automatically reduce the amount of GC pressure. There is an old page written by Pierre Weiss that basically says that a block allocation is roughly equivalent to 100 additions on average (taking into account the cost of reclaiming the memory by the GC). Those results are outdated so they probably do not apply anymore, but they give the right order of magnitude.

A different style

If we really wanted to use a partial application as in g1, we would need to tell the compiler that f is a function that takes one argument a and return a function taking two arguments. Ideally, we would like to write one of the following:

let f2 a = fun b c -> a + b + c
let f2 = fun a -> fun b c -> a + b + c

Sadly, the OCaml parser still interprets those definitions as a function of arity 3. We have to write the following cumbersome version to make it work:

let f2 a = (); fun b c -> a + b + c
let g4 a l = List.fold_left ~f:(f2 a) ~init:0 l

(or any other no-op operation instead of () ). This does indeed solve the problem: if you look at the timings above, you will see that g4 is indeed about as fast as g3, and for good reason since they now do essentially the same operations and allocations.

But what if we accidentally use f2 combined with the eta-expanded style of g3 ?

let g5 a l = List.fold_left ~f:(fun b c -> f2 a b c) ~init:0 l

As you can see in the timings, things get much worse again, since for every list element, a closure will be created from the application of f2 to a.

Whether to go with g3 or g4 is a matter of style: if you know the arguments are always going to be decomposed in the same way, you can use g4. But if you use your function sometimes partially, sometimes with all its arguments, or with different levels of partial application, you have to resort to g3. Since this style is also optimal in any scenario, it is probably the right choice in most situations.

As an aside note, there is another major optimization down the road. Instead of calling the generic List.fold_left (or any other such function), we could redefine a local recursive function that calls f directly:

let g6 a l =
  let rec fold_left accu = function
  | [] -> accu
  | (x::r) -> fold_left (f a accu x) r
  in
  fold_left 0 l

As you can see from the timings, this would give you another factor of 3 in this particular case (mostly because of inlining, more efficient loops and direct branching instead of functions calls). It might be possible to perform this syntactic transformation automatically through a Camlp4 extension, however we have not tried it yet (and one should consider the impact on code size and instruction caching in the CPU).

Inlining

Another nasty effect of partial applications is that they prevent inlining. It is a common style to define small generic functions and make specialized versions through partial application. Here is a somewhat extreme example:

let mult_and_add m c x = m * x + c
let double_and_add_v1 = mult_and_add 2
let double_and_add_v2 c x = mult_and_add 2 c x

If you look at the assembly code generated for those functions (with ocamlopt -S), you will see that the second version gets compiled as only two instructions, does not require any allocation and is likely to be inlined further at calling sites. The first version on the other hand creates one closure from the partial application at startup, another one on every call (!) and is composed of many more instructions. I ran some benchmarks and found a speedup factor of about 15 between the two versions.

Yet another trick

This is a somewhat related trick that applies whenever you have the following (quite common) scheme:

List.iter ~f:(fun x -> do_something ... x ...) l

(in most cases with List.iter, List.map, List.fold_left, etc). Every time this code gets executed, a new closure gets allocated for the local function before the call to List.iter. If the argument list l is going to be empty a large fraction of the time, the closure will be useless most of the time. This can get quite expensive in the long run and can be avoided with the following trick:

if l <> [] then List.iter ~f:(fun x -> do_something ... x ...) l

Similarly, if you know that l is likely to contain only 0 or 1 element, you can specialize even further:

match l with
| [] -> ()
| [x] -> do_something ... x ...
| _ -> List.iter ~f:(fun x -> do_something ... x ...) l

You will save the closure allocation in most cases and the function call in the 1-element case is a good candidate for inlining.

Note that this trick does not apply when the local function does not have any free variable (i.e. when it does not reference any value outside the scope of the local function). In this case, no closure is necessary and the function is fully allocated at compile time.

Does it really matter ?

It really depends what kind of code you are writing. If your application is very small, runs in a short period of time or does not do enough allocation to ever trigger the GC, then you probably don’t care. But for applications with high-performance requirements or for central libraries that are used everywhere, it can make a big difference. Here at Jane Street, we achieved significant speedups with those simple transformations on one of our biggest and most speed sensitive application.

I have to agree that eta-expanding all the functions in my code is intellectually less satisfying than using higher-order code and partial applications. Readability can also get impacted, although there are situations where it benefits from the change, making it more explicit what the different arguments are.