With the latest release of Core, I've had occasion to think about how our libraries differ from INRIA's. One difference that's been there from the very beginning is in the List module: INRIA's list functions are not tail-recursive, and ours are. The obvious reason to prefer tail-recursive solutions is that they are able to handle lists of arbitrary length, but they have a downside as well. INRIA's list implementation is not tail recursive largely because of performance, as described by Xavier

here. All that heap allocation you get by mapping and then reversing the list really costs.

The key tradeoff that Xavier points to is the choice between running fast on small-to-medium lists, and running at all on long lists. A few different people have opined that lists don't make sense for large datasets anyway, which argues in favor of the choice made in the standard library of using the tail-recursive version. But I've never bought this argument. It's hard to predict what your code is going to be used for, and you don't want to have landmines where your code simply blows up (rather than degrading gracefully in performance) when run on an unexpectedly large dataset. Using a non-tail-recursive list implementation leads to such brittle behavior.

So, it occurred to me, can we have the best of both worlds? As Xavier points out, there are "magic" implementation that you can do that use the dreaded Obj.magic, but Obj.magic is notoriously difficult to reason about, and is really the same thing as hacking the compiler. Among other things, it leaves you with no guarantees when new versions of the compiler are released. ExtLib takes this approach, but it's never been something we've been comfortable with.

But what if we write a version that detects when we're dealing with a big list, and dynamically switches implementations? Here's a simple version.

  1. open Core.Std
  3. let rec count_map ~f l ctr =
  4. match l with
  5. | [] -> []
  6. | hd :: tl -> f hd ::
  7. (if ctr < 5000 then count_map ~f tl (ctr + 1)
  8. else List.map ~f tl)
  10. let map ~f l = count_map ~f l 0

This works a lot better. It's a little bit slower than the standard List.map for small lists, and about the same as the tail-recursive List.map for large lists. But we can do better still. There are two more optimizations I played with. The first is to do a little loop unrolling on the recursion, and the second is to to deal with the large list case going through arrays, as suggested in a post by Christophe Troestler. Here's the resulting code:

  1. open Core.Std
  3. let list_array_map ~f l =
  4. Array.to_list (Array.map ~f (Array.of_list l))
  6. let rec count_map ~f l ctr =
  7. match l with
  8. | [] -> []
  9. | [x] -> [f x]
  10. | [x;y] -> [f x; f y]
  11. | [x;y;z] -> [f x; f y; f z]
  12. | x :: y :: z :: w :: tl ->
  13. f x :: f y :: f z :: f w ::
  14. (if ctr > 500 then list_array_map ~f tl
  15. else count_map ~f tl (ctr + 1))
  17. let map ~f l = count_map ~f l 0

This implementation does better still. It's actually faster than the standard implementation on short lists, and only a little slower on long lists. Here are some very rough benchmarks (done on an x86-64 box). Here, the mean and standard deviations are of the ratio of the implementation versus the implementation in the standard library. "core" is the implementation currently in Core, and "fast" is the above implementation.

## list length 0 ##
core: mean 1.648838, nstdev 0.043502
fast: mean 0.717259, nstdev 0.043177
## list length 1 ##
core: mean 2.113085, nstdev 0.075585
fast: mean 0.596140, nstdev 0.049489
## list length 2 ##
core: mean 1.989603, nstdev 0.044707
fast: mean 0.636450, nstdev 0.003376
## list length 5 ##
core: mean 2.003528, nstdev 0.043638
fast: mean 0.821950, nstdev 0.024802
## list length 10 ##
core: mean 1.428904, nstdev 0.016536
fast: mean 0.729491, nstdev 0.018766
## list length 20 ##
core: mean 1.443628, nstdev 0.062703
fast: mean 0.743741, nstdev 0.018308
## list length 100 ##
core: mean 1.301089, nstdev 0.019097
fast: mean 0.898968, nstdev 0.017839
## list length 1000 ##
core: mean 1.719725, nstdev 0.025758
fast: mean 0.950799, nstdev 0.018624
## list length 25000 ##
core: mean 1.721275, nstdev 0.044541
fast: mean 1.188690, nstdev 0.031437

The performance improvement seems worth the ugly code given how common map is. I suspect some hack like this will make it's way to a future version of Core.