We've been doing a bunch of work recently on improving the responsiveness of OCaml's garbage collector. I thought it would be worth discussing these developments publicly to see if there was any useful feedback to be had on the ideas that we're investigating.
The basic problem is a well-known one: GCs can introduce unpredictable pauses into your application, and depending on how your GC is configured, these pauses can be quite long. Unpredictable latencies are a problem in a wide variety of applications, from trading systems to web stacks.
One approach people often take is to avoid using the allocator altogether: pool all your objects, and never allocate anything else. You can even keep many of your pooled objects outside of the heap.
This works, but makes for a less pleasant coding experience (and for code that is trickier and harder to reason about). So while pooling is a valuable technique, we'd like to have a GC that lets you run with low latencies without sacrificing the ability to allocate.
What are the problems?
OCaml's garbage collector is already pretty good from a latency perspective. Collection of the major heap in OCaml is incremental, which means that collection of the major heap can be done in small slices spread out over time, so no single transaction need experience the full latency of walking the major heap. Also, collection of the minor heap is pretty fast, and OCaml programs tend to do pretty well with a relatively small minor heap --- typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.
Still, there are problems with OCaml's collector.
There's no good way in the stock runtime to see how long the different parts of collection take, and that makes it hard to optimize.
OCaml's generational collector is very simple: objects are typically allocated first on the minor heap, where the work is effectively three inlined instructions to bump a pointer and check whether you've hit the end. When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.
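To make the bump-and-check allocation concrete, here is a toy model in OCaml. It is only a sketch: the type and function names are made up for illustration, sizes are counted in words, and the "collection" simply resets the pointer, whereas the real runtime would walk the heap and promote live objects.

```ocaml
(* Toy model of a minor heap. All names here are hypothetical. *)
type minor_heap = {
  size : int;                 (* capacity in words *)
  mutable next : int;         (* offset of the next free word *)
  mutable collections : int;  (* how many minor collections have run *)
}

let create size = { size; next = 0; collections = 0 }

(* Stand-in for a minor collection: a real collector would find the live
   objects and copy them to the major heap before reusing the space. *)
let minor_collect h =
  h.collections <- h.collections + 1;
  h.next <- 0

(* Allocate [words] words: check the limit, then bump the pointer.
   Returns the offset of the new block. *)
let alloc h words =
  if words > h.size then invalid_arg "alloc: block larger than minor heap";
  if h.next + words > h.size then minor_collect h;
  let addr = h.next in
  h.next <- h.next + words;
  addr
```

The fast path is just the comparison and the addition; everything expensive happens only when the limit check fails.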
In a typical functional workload, most of your allocations are short-lived, and so most of the minor heap is dead by the time you do the minor collection, so the walk of the minor heap can be quite cheap. But there's always a small number of false promotions, objects that would have become unreachable shortly, but were promoted because the minor collection came at an inconvenient time.
One fundamental issue with the stock runtime is that the collector is clocked in terms of minor allocations, ignoring, critically, the amount of time that has gone by.
This clocking makes sense for many applications, but if you're building a server that needs to respond to bursty traffic with low and predictable latencies, this is the opposite of what you want. Really, what you'd prefer to do is to defer GC work when you're busy, instead scheduling it at times when the application would otherwise be idle.
One solution here is to allow the application to drive the scheduling of the GC, but the runtime in its current form doesn't really support doing this. In particular, while you can choose to explicitly run a major slice, the collector accounting doesn't take note of the work that has been done that way, so the major collector works just as hard as it did previously.
Furthermore, the major slice always forces a minor collection. But running minor collections all the time is problematic in its own right: if you run one when little has been allocated on the minor heap since the last collection, many of the objects on it have not yet had time to become unreachable, and you'll end up accidentally promoting a rather large fraction of your minor allocations.
While the major collector is mostly incremental, not everything about it runs incrementally. In particular, when the major collector hits an array, it walks the array all at once. This is problematic if you start using large arrays, which does happen when one is using pooling techniques. Similarly, the collector is not incremental when it comes to scanning GC roots.
The stock runtime decides how big of a major slice to do based on how much was promoted to the major heap in the last minor collection. This is part of a heuristic that is meant to make sure that the collector keeps up with the rate of allocation, without running needlessly when the application isn't promoting much.
But the accounting is in some sense too immediate: if you do a lot of promotion in a given cycle, you're forced to do the collection immediately. While all that work needs to be done, it's not clear that it needs to be done immediately. Again, for responsive systems, it's often better to push off work until after the busy times.
Making it better
Happily, Damien Doligez, the author of OCaml's GC, has been visiting with us for the last few months, and has been doing a lot of good work to improve the runtime, and in particular to address the concerns raised above. Here's the summary of the changes made thus far.
A set of probes was added to the GC, allowing us to record every phase of the collection process in detail. The probes report the phase (marking vs. sweeping) and the sub-phase, and keep track of a collection of useful counters. This is available in the instrument branch.
Damien has also implemented aging in the minor heap. Aging is a technique whereby objects stay in the minor heap for several minor collections before being promoted to the major heap. The goal of aging is to reduce the amount of false promotion.
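As a rough illustration of how aging cuts down on false promotion, here is a toy model in OCaml. It is a sketch under assumptions, not the runtime's implementation: objects are just records with an age counter, liveness is supplied by a caller-provided predicate, and `promote_age` is a hypothetical parameter (setting it to 1 mimics promoting on the first survived collection).

```ocaml
(* Toy object: an identifier plus the number of minor collections survived. *)
type obj = { id : int; mutable age : int }

(* One minor collection over [young].
   [is_live id] says whether the object is still reachable.
   Returns (objects that stay in the minor heap, objects promoted). *)
let minor_collect ~promote_age ~is_live young =
  let survivors = List.filter (fun o -> is_live o.id) young in
  List.iter (fun o -> o.age <- o.age + 1) survivors;
  List.partition (fun o -> o.age < promote_age) survivors
```

With `promote_age` set to 2, an object that is live at one collection but dead by the next is dropped without ever reaching the major heap. The price is that survivors are examined more than once, which is the extra minor-collection cost that aging trades against promotion.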
Several of the stages of the collector have been made interruptible, including scanning of arrays and of the roots. The effect here is to reduce the worst-case delays imposed by the collector. This is in the low-latency branch.
Separating major slices from minor collections
In the stock runtime, major slices and minor collections are always done together. In the low-latency branch, you can run one without the other, and you can basically run them at any time. This has a couple of advantages --- one is that it's essentially another form of incrementalization, allowing you to do less work per GC pause.
The other is that it gives you more freedom to schedule collections when you want to. One way we're looking at using this is to have an application-level job that wakes up periodically, and does a heuristic check to see if the system appears busy. If it doesn't, then it schedules some GC work, and it may choose to do either a minor collection or a major slice. A minor collection would only be chosen when the amount allocated on the minor heap exceeds some configured level, to avoid too much false promotion; but a major slice can be done at any time.
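The periodic job described above might look roughly like the following sketch. The policy function and its parameters (`busy`, `minor_words_used`, `minor_threshold`) are hypothetical names chosen for illustration; how to detect busyness and how to measure minor-heap usage are left to the application. `Gc.minor` and `Gc.major_slice` are the standard library calls; note that on a stock runtime `Gc.major_slice` may also trigger a minor collection, so the separation only really holds on a runtime that decouples the two.

```ocaml
type action = Nothing | Minor_collection | Major_slice

(* Pure policy: decide what GC work, if any, to do on this wake-up.
   This particular ordering of the checks is an assumption. *)
let choose ~busy ~minor_words_used ~minor_threshold =
  if busy then Nothing
  else if minor_words_used >= minor_threshold then Minor_collection
  else Major_slice

(* Carry out the chosen action using the standard Gc module. *)
let perform = function
  | Nothing -> ()
  | Minor_collection -> Gc.minor ()
  | Major_slice ->
    (* Passing 0 lets the runtime pick the slice size. *)
    ignore (Gc.major_slice 0)
```

Keeping the decision separate from the side effect makes the policy easy to test and to tune without touching the code that actually calls into the GC.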
Instead of just keeping track of the amount of work that needs to be done in the next major slice, the GC in the low-latency branch tracks work that must be done over the next n major slices, by keeping these numbers in a circular buffer.
The runtime also uses these buckets for keeping track of extra work that has been done by application-forced major slices. A forced major slice takes work away from the front-most bucket, potentially bringing the bucket to negative territory.
When the runtime checks if it needs to do a major slice, it looks at the first bucket. If it's got a positive amount of work in it, then that work is done in that slice, if possible. Whatever is left over (which may be positive or negative) is spread out uniformly over the next n buckets.
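The bucket scheme above can be sketched in a few lines of OCaml. This is a simplified model, not the runtime's code: work is an abstract float, the per-slice `budget` stands in for "if possible", and spreading newly arrived work evenly across all buckets is an assumption on my part. All names are hypothetical.

```ocaml
type buckets = {
  work : float array;   (* pending work per upcoming slice; may go negative *)
  mutable front : int;  (* index of the bucket for the next slice *)
}

let create n =
  if n <= 0 then invalid_arg "create: need at least one bucket";
  { work = Array.make n 0.0; front = 0 }

(* Spread [amount] (positive or negative) uniformly over all n buckets. *)
let spread b amount =
  let share = amount /. float_of_int (Array.length b.work) in
  Array.iteri (fun i w -> b.work.(i) <- w +. share) b.work

(* Assumption: newly required work is spread over the next n slices. *)
let add_work = spread

(* An application-forced slice did [amount] of work: take it away from
   the front bucket, possibly driving that bucket negative. *)
let credit_forced b amount =
  b.work.(b.front) <- b.work.(b.front) -. amount

(* Automatic slice: do the front bucket's work if it is positive (up to
   [budget]), then spread whatever is left over the next n buckets.
   Returns the amount of work done in this slice. *)
let next_slice b ~budget =
  let pending = b.work.(b.front) in
  let todo = if pending > 0.0 then Float.min pending budget else 0.0 in
  b.work.(b.front) <- 0.0;
  b.front <- (b.front + 1) mod Array.length b.work;
  spread b (pending -. todo);
  todo
```

Because a negative leftover is spread just like a positive one, work done early by a forced slice lowers the load of several future slices rather than being forgotten.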
Segmented free lists
A big part of the cost of minor collections is the cost of finding free blocks in the major heap for the objects being promoted. One observation is that in many OCaml applications, block sizes are quite small. One way of taking advantage of this is to have a set of size-segregated free lists, for a configurable set of sizes. For example, one could have a different free list for blocks with 1, 2, 3 and 4 slots.
This is still ongoing (read: not working yet), but it will show up in the multi-free-list branch eventually.
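To illustrate the size-segregated idea, here is a minimal sketch in OCaml. It is not the code from the branch: blocks are modelled as integer addresses, the fallback for other sizes is a plain first-fit list, and all names are hypothetical.

```ocaml
type free_lists = {
  small : int list array;
  (* small.(i) holds addresses of free blocks with exactly i + 1 slots *)
  mutable general : (int * int) list;
  (* (address, size) pairs for every other size *)
}

let create ~max_small = { small = Array.make max_small []; general = [] }

let is_small t size = size >= 1 && size <= Array.length t.small

(* Return a block to the appropriate free list. *)
let free t ~addr ~size =
  if is_small t size
  then t.small.(size - 1) <- addr :: t.small.(size - 1)
  else t.general <- (addr, size) :: t.general

(* First-fit search of the general list. A real allocator would split
   the block and keep the remainder; this sketch hands out the whole block. *)
let take_first_fit t size =
  let rec go skipped = function
    | [] -> None
    | (addr, s) :: rest when s >= size ->
      t.general <- List.rev_append skipped rest;
      Some addr
    | block :: rest -> go (block :: skipped) rest
  in
  go [] t.general

(* Small sizes are served in constant time from their own list;
   everything else falls back to the search. *)
let alloc t ~size =
  if is_small t size then begin
    match t.small.(size - 1) with
    | addr :: rest ->
      t.small.(size - 1) <- rest;
      Some addr
    | [] -> take_first_fit t size
  end else take_first_fit t size
```

The point of the design is that the common case, a small block of a popular size, never has to walk a list looking for a fit.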
How is it going?
This is all very much a work in progress, but the results so far have been quite promising. By using a version of the compiler with most of these changes and with an application-driven job that forces major slices in quiet times, we were able to reduce tail latencies by a factor of 3 in a real production application. That's pretty good considering that we've done essentially no parameter tuning at this point.
That said, some of the results are less promising. We were somewhat disappointed to see that when doing more traditional batch jobs, aging didn't provide a significant improvement in overall compute time. It seems like in many applications, aging saves some on promotion, but the minor collection itself gets a little more expensive, and these seem to nearly cancel out.
This seems especially surprising given that aging is present in most GCs, including those for Java's HotSpot, the .NET CLR, and GHC. Given that everyone seems to use aging, I would have expected aging to have a quite noticeable benefit for lots of workloads, not just carefully tuned packet processors.
A call for help
The progress we've made so far is quite promising, but a lot of things are still up in the air. The reason that I wanted to post about it now is that I was hoping to hear feedback from others who have experience dealing with similar issues in other languages.
So, if you have thoughts about the techniques we've tried for making OCaml's runtime more responsive, or suggestions for other techniques we should consider, please comment!