Why GADTs matter for performance

When GADTs (Generalized Algebraic Data Types) landed in OCaml, I wasn't particularly happy about it. I assumed that it was the kind of nonsense you get when you let compiler writers design your programming language.

Which is to say that the standard GADT examples all seem to be about the kinds of things that compiler writers do, like embed domain-specific languages or build typed abstract-syntax trees. But it didn't seem particularly relevant for the kind of systems programming that I think about.

But it became apparent pretty quickly that I was wrong. In particular, since GADTs have landed, at Jane Street we've found lots of examples where GADTs are important for performance, of all things. The theme in these examples is that GADTs enable you to tweak your memory representations in ways that would otherwise be painful or impossible to do safely in OCaml.

The Problem of Polymorphism

I'd like to walk through a simple example that illustrates this aspect of GADTs, but first, a few words about OCaml's memory representation. OCaml's polymorphism depends in an important way on that memory representation. In particular, consider a simple polymorphic function like List.iter, which has the following type:

  val iter : 'a list -> f:('a -> unit) -> unit

The polymorphic type tells you that List.iter can operate on lists of any type, and in OCaml, this is achieved with a single compiled version of the code. This is possible because the memory representation of the elements of a list is uniform: you can always refer to an OCaml value in a single word, either as a pointer to a heap-allocated value, or as an immediate that fits inside that word.

That means that some OCaml datatypes are less efficient space-wise than you might imagine. Arrays, for example, take the same amount of space per element whether those elements are bytes, 32-bit ints, or 64-bit ints. (There's actually some special magic in the compiler for float arrays, though this is probably more trouble than it's worth, as described by Alain Frisch here. But let's ignore float arrays for now.)

OCaml does have a tighter representation for byte arrays, called bytes. But it's a completely different type, and so building a general-purpose data structure that uses bytes when it makes sense, and ordinary arrays otherwise, is a little awkward.
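
You can see the uniform representation directly using the Obj module (an illustrative peek only; Obj is unsafe and has no place in production code). The word counts in the comments assume a 64-bit machine.

  let () =
    let ints = [| 1; 2; 3; 4; 5; 6; 7; 8 |] in
    let b = Bytes.make 8 'x' in
    (* Obj.size reports the size of a heap block's data, in words *)
    Printf.printf "int array: %d words; bytes: %d words\n"
      (Obj.size (Obj.repr ints)) (* 8: one word per element *)
      (Obj.size (Obj.repr b))    (* 2: 8 chars packed tightly, plus padding *)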

Controlling memory representation without GADTs

Let's see what happens if we try to design (without GADTs) an array type that sometimes uses the general array representation and sometimes uses bytes.

You could imagine representing such a value using an ordinary variant.

  type 'a t = | Array of 'a array
              | Bytes of bytes

We could then implement each of the operations we want on our new array type, writing each one differently depending on the underlying representation. Let's see what happens if we just take this idea and run with it, implementing all the required functions in the most straightforward way.

  > module Compact_array = struct

      type 'a t = | Array of 'a array
                  | Bytes of bytes

      let of_bytes x : char t = Bytes x
      let of_array x = Array x

      let length = function
        | Array a -> Array.length a
        | Bytes s -> Bytes.length s

      let get t i =
        match t with
        | Array a -> Array.get a i
        | Bytes s -> Bytes.get s i

      let set t i v =
        match t with
        | Array a -> Array.set a i v
        | Bytes s -> Bytes.set s i v

    end;;

  module Compact_array :
    sig
      type 'a t = Array of 'a array | Bytes of bytes
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : char t -> int -> char
      val set : char t -> int -> char -> unit
    end

This seems pretty good at first glance, but the inferred types aren't quite what we want. In particular, get and set only work with Compact_arrays containing characters. Given how type inference works, this isn't really all that surprising. Consider the code for get:

  let get t i =
    match t with
    | Array a -> Array.get a i
    | Bytes s -> Bytes.get s i

The OCaml compiler is looking for a single type to assign to the return value for all the cases of the match. Given that Bytes.get always returns a char, Compact_array.get will be restricted to only returning a char.

One way to work around this problem is to essentially implement what we want as a poor-man's object. Here, we just write the code separately for the different cases, and stuff those functions into a record full of closures. Here's how that looks.

  module Compact_array = struct

    type 'a t = { len : unit -> int
                ; get : int -> 'a
                ; set : int -> 'a -> unit
                }

    let of_bytes s =
      { len = (fun () -> Bytes.length s)
      ; get = (fun i -> Bytes.get s i)
      ; set = (fun i x -> Bytes.set s i x)
      }

    let of_array a =
      { len = (fun () -> Array.length a)
      ; get = (fun i -> Array.get a i)
      ; set = (fun i x -> Array.set a i x)
      }

    let length t = t.len ()
    let get t i = t.get i
    let set t i x = t.set i x

  end;;
  module Compact_array :
    sig
      type 'a t = {
        len : unit -> int;
        get : int -> 'a;
        set : int -> 'a -> unit;
      }
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : 'a t -> int -> 'a
      val set : 'a t -> int -> 'a -> unit
    end

This more or less solves the problem, but it's still not really the memory representation we want. In particular, we have to allocate three closures for each Compact_array.t, and this number of closures will only go up as we add more functions whose behavior depends on the underlying array.

GADTs to the rescue

Let's go back to our failed variant-based implementation, but rewrite it using the GADT syntax. Note that we're not trying to change the types at all this time, just rewriting the same type we had before in the language of GADTs.

  type 'a t = | Array : 'a array -> 'a t
              | Bytes : bytes -> 'a t

The syntax of this declaration suggests thinking about variant constructors like Array or Bytes as functions from the constructor arguments to the type of the resulting value, with the thing to the right of the : roughly corresponding to the type signature of the constructor.

Note that for the Array constructor, the value of the type parameter 'a depends on the type of the argument:

  > Array [|1;2;3|];;
  - : int t = Array [|1; 2; 3|]
  > Array [|"one";"two";"three"|];;
  - : string t = Array [|"one"; "two"; "three"|]

But for the Bytes constructor, the type parameter 'a remains free.

  > Bytes "foo";;
  - : 'a t = Bytes "foo"

This is really the problematic case, because for Bytes "foo" we'd like the parameter 'a to be char, since in the Bytes case that's the element type of our array.

Because GADTs give us the ability to specify the type on the right-hand side of the arrow, we can get that.

  type 'a t = | Array : 'a array -> 'a t
              | Bytes : bytes -> char t

Now, the Bytes constructor behaves as we'd like it to.

  > Bytes "foo";;
  - : char t = Bytes "foo"

Now let's see what happens when we try to write the length function.

  > let length t =
      match t with
      | Bytes b -> Bytes.length b
      | Array a -> Array.length a
    ;;
  val length : char t -> int = <fun>

Disappointingly, we're again stuck with a function that doesn't have the right type. In particular, the compiler has decided that this function can only operate on char t, when we want it to work for arrays of any type.

The problem is that type inference in the presence of GADTs is difficult, and the compiler needs a little help. Roughly speaking, without some hints, OCaml's type checker will insist that a type variable take on a single value within a given function. But in this case, we need a type variable that can take on different values in different branches of the match statement.

We can do this by creating a locally-abstract type el to represent the type parameter of t (and the element type), and annotating t accordingly.

  > let length (type el) (t : el t) =
      match t with
      | Bytes b -> Bytes.length b
      | Array a -> Array.length a
    ;;
  val length : 'a t -> int = <fun>

Now we see that we get the right type. We can push this approach through to get a complete implementation.

  > module Compact_array = struct

      type 'a t = | Array : 'a array -> 'a t
                  | Bytes : bytes -> char t

      let of_bytes x = Bytes x
      let of_array x = Array x

      let length (type el) (t : el t) =
        match t with
        | Array a -> Array.length a
        | Bytes s -> Bytes.length s

      let get (type el) (t : el t) i : el =
        match t with
        | Array a -> Array.get a i
        | Bytes s -> Bytes.get s i

      let set (type el) (t : el t) i (v : el) =
        match t with
        | Array a -> Array.set a i v
        | Bytes s -> Bytes.set s i v

    end;;
  module Compact_array :
    sig
      type 'a t = Array : 'a array -> 'a t | Bytes : bytes -> char t
      val of_bytes : bytes -> char t
      val of_array : 'a array -> 'a t
      val length : 'a t -> int
      val get : 'a t -> int -> 'a
      val set : 'a t -> int -> 'a -> unit
    end
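
Here's a small usage sketch of the result (sum_lengths and the values below are just for illustration): the interface stays polymorphic, while each value gets the representation appropriate to it.

  let sum_lengths a b = Compact_array.length a + Compact_array.length b

  let () =
    let chars = Compact_array.of_bytes (Bytes.of_string "abc") in
    let ints  = Compact_array.of_array [| 1; 2; 3 |] in
    Printf.printf "%d %c %d\n"
      (sum_lengths chars ints)    (* length works on both: 6 *)
      (Compact_array.get chars 0) (* comes back as a char: 'a' *)
      (Compact_array.get ints 2)  (* comes back as an int: 3 *)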

As I said at the beginning, this is really just one instance of a more general theme. GADTs are about more than clever typed interpreters; they're a powerful mechanism for building easy-to-use abstractions that give you more precise control of your memory representation. And getting the right memory representation is often critical for building high-performance applications.

A lighter Core

We recently released a version of our open source libraries with a much anticipated change --- Async_kernel, the heart of the Async concurrent programming library, now depends only on Core_kernel rather than on Core.

This sounds like a dull, technical change, and it kind of is. But it's also part of a larger project to make our libraries more lightweight and portable, and so suitable for a wider array of users and applications.

We've actually been working on these issues for a while now, and this seems like a good time to review some of the changes we've made over the years, and what's still to come.

Reorganizing for portability

Core has always had dependencies on Unix, including OCaml's Unix library, as well as some other parts of the Unix environment, like the Unix timezone files. This has long been a problem for porting to Windows, but more recently, the issue has loomed for two other increasingly important platforms for OCaml: Javascript and Mirage.

To help fix this problem, in 2013 we released a library called Core_kernel, which is the portable subset of Core that avoids Unixisms as well as things like threads that don't match well with the Javascript and Mirage back-ends.

In the same vein, we refactored Async, our concurrent programming library, into a set of layers (modeled on the design of the similar Lwt library) that both clarified the design and separated out the platform-specific bits. Async_kernel is the lowest-level and most portable piece, hosting the basic datastructures and abstractions. Async_unix adds a Unix-specific scheduler, and Async_extra builds further OS-specific functionality on top.

Until recently, the fly in this ointment was that Async_kernel still depended on Core, rather than Core_kernel, because only Core had a time library. Making Async_kernel only require Core_kernel was a bigger project than you might imagine, in the end leading us to change Timing_wheel, a core datastructure for Async and several other critical libraries at Jane Street, to use an integer representation of time instead of the float-based one from Core.

Already, some experiments are underway to take advantage of this change, including some internal efforts to get Async working under Javascript, and external efforts to get cohttp's Async back-end to only depend on Async_kernel.

I'm hoping that yet more of this kind of work will follow.

Module Aliases

One long-running annoyance with OCaml is the lack of an effective namespace mechanism. For a long time, the only choice was OCaml's packed modules, which let you take a collection of modules and merge them together into one mega-module. Some kind of namespace mechanism is essential at scale, and so we used packed modules throughout our libraries.

Unfortunately, packed modules have serious downsides, both in terms of compilation time and executable sizes. We've been talking to people about this and looking for a solution for a long time. You can check out this epic thread on the platform list if you want to see some of the ensuing conversation.

A solution to this problem finally landed in OCaml 4.02, in the form of module aliases. I'll skip the detailed explanation (you can look here if you want to learn more), but the end result was great: our compilation times immediately went down by more than a factor of 3, and it gave us a path towards dropping packed modules altogether, thus reducing executable sizes and making incremental compilation massively more efficient.
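
To give a flavor of the mechanism, here's a minimal sketch of the pattern (all module names here are hypothetical). A library exposes a single namespace module that consists of nothing but aliases; compiled with -no-alias-deps, a client's reference to Mylib.Foo doesn't force Mylib_bar to be built or linked.

  (* mylib.ml: the namespace module, containing only aliases *)
  module Foo = Mylib_foo
  module Bar = Mylib_bar

Client code then refers to things as Mylib.Foo.f, getting namespacing without the compile-time and link-time costs of packs.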

The work on dropping packed modules has already landed internally, and will hopefully make it to the external release in a few months. The benefit to executable size is significant, with typical executables dropping in size by a factor of 2, but there is more to do. OCaml doesn't have aggressive dead code elimination, and that can lead to a lot of unnecessary stuff getting linked in. We're looking at some improvements we can make to cut down the dependency tree, but better dead code elimination at the compiler would really help.

Sharing basic types

Interoperability between Core and other OCaml libraries is generally pretty good: Core uses the same basic types (e.g., string, list, array, option) as other OCaml code, and that makes it pretty easy to mix and match libraries.

That said, there are some pain points. For example, Core uses a Result type (essentially, type ('a,'b) result = Ok of 'a | Error of 'b) quite routinely, and lots of other libraries use very similar types. Unfortunately, these libraries each have their own incompatible definitions.

The solution is to break out a simple type that the different libraries can share. After some discussion with the people behind some of the other libraries in question, I made a pull request to the compiler to add a result type to the stdlib.
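
The type itself is tiny, and having one canonical definition means that code like the following sketch (safe_div is just for illustration) composes across libraries without conversion shims.

  (* the shared type, now in the stdlib *)
  type ('a, 'b) result = Ok of 'a | Error of 'b

  let safe_div x y =
    if y = 0 then Error "division by zero" else Ok (x / y)

  let () =
    match safe_div 10 2 with
    | Ok q -> Printf.printf "10 / 2 = %d\n" q
    | Error msg -> prerr_endline msg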

This is a small thing, but small things matter. I hope that by paying attention to this kind of small issue, we can help keep interoperability between Core and the rest of the OCaml ecosystem smooth.

Eliminating camlp4

One concern I've heard raised about Core and Jane Street's other libraries is their reliance on camlp4. camlp4 is a somewhat divisive piece of infrastructure: it's long been the only decent way to do metaprogramming in OCaml, and as such has been enormously valuable; but it's also complex and somewhat unloved, and lots of people want to avoid it.

camlp4 also makes tooling a lot more complicated, since there's no single syntax to target. Dev tools like ocp-indent and the excellent merlin have some terrible hacks to support some of the most common camlp4 syntax extensions, but the situation is clearly untenable.

You do need camlp4 to build Core, but you don't need camlp4 to use it, and in practice, that's good enough for most use cases. But for people who want to avoid camlp4 entirely, it's still a nuisance. Moreover, while you don't need camlp4 to use Core, it is convenient. For example, a lot of Core's idioms work best when you provide s-expression serializers for your types, and the sexplib syntax extension is an awfully convenient way to generate those functions.

Our plan is to simply eliminate our dependency on camlp4 entirely over the next 6 months, by switching to using ppx and extension points, a new approach to metaprogramming in OCaml that, like module aliases, landed in 4.02. We're currently rewriting all of our syntax extensions, and building tools to automatically migrate the code that depends on camlp4. People who want to continue to use the old camlp4 extensions are welcome to continue doing so, but we're cutting our dependency on them.


Even at the end of all this, we don't expect that Core and Async will suit everyone --- that's a hard bar to cross for any software package. But we do hope that through these efforts, an ever wider set of developers will be able to take advantage of the work we've done.

Centralizing distributed version control, revisited

7 years ago, I wrote a blog post about how we at Jane Street were using our distributed version control system (hg, though the story would be the same for git) in a partially centralized way. Essentially, we built a centralized repo and a continuous integration system whose job was to merge in new changesets. The key responsibility of this system was to make sure that a change was rejected unless it merged, compiled and tested cleanly.

This half-distributed, half-centralized approach let us enjoy the benefits of a DVCS, while still getting the coherence and ease of sharing that comes from having a central authoritative source.

Since then, our development tools have changed a lot, including the arrival of a new code review and release management system called Iron. In writing Iron we discovered that centralization was valuable in ways we hadn't considered before. In particular, despite the fact that good support for merging is central to a DVCS, centralization is actually a critical ingredient to making merges work better.

To understand how centralization can help, let's talk about one reason why merging is a fraught process to begin with.

The criss-cross merge

The basic approach to merging in a DVCS like hg or git is pretty simple. Here are the basic steps that are taken to merge two heads A and B.

  • Find the greatest common ancestor (GCA(A,B)) of the heads to be merged.
  • Compute the patch from that base point to one of the two heads, say, A.
  • Take the patch you just computed, and apply it to B. Conflicts appear when the patch, which was actually based on GCA(A,B), doesn't apply cleanly to B. The result of this process is the merge.

The above discussion oversimplifies the story by assuming there's a well-defined GCA, but this just isn't always true. To see why, consider a repository starting with a root revision R, and two revisions made independently on top of R.

  A
 /
R
 \
  B

Now, imagine that two different developers each concurrently decide to merge the heads A and B and do some further development. Note that in both of the cases shown below, the GCA for the merge between A and B is R.

Developer 1                Developer 2

  A---C--D                    A     
 /   /                       / \        
R   /                       R   \       
 \ /                         \   \  
  B                           B---E--F

Now, if we bring these two separate histories together into one repo, we have something like this.

  A---C--D
 / \ /
R   \
 \ / \
  B---E--F

Now, what happens if we want to merge D and F? In particular, what is GCA(D,F)? Both A and B are common ancestors, but neither one is greater than the other. In this case, there are in some sense two different GCAs, or, more precisely, there are multiple maximal common ancestors, or MCAs. This case is often described as a criss-cross merge, and is the source of much wailing and gnashing of teeth among developers and users of DVCSs.
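
To make this concrete, here's a toy OCaml sketch (nothing like hg or git's actual implementation) that computes the maximal common ancestors of two revisions in a history DAG. Run on the graph above, it reports both A and B as MCAs of D and F.

  module S = Set.Make (String)

  (* all ancestors of [rev], including [rev] itself *)
  let rec ancestors parents rev =
    List.fold_left
      (fun acc p -> S.union acc (ancestors parents p))
      (S.singleton rev) (parents rev)

  let maximal_common_ancestors parents a b =
    let common = S.inter (ancestors parents a) (ancestors parents b) in
    (* drop any common ancestor that is a strict ancestor of another *)
    S.filter
      (fun r ->
        not (S.exists (fun r' -> r' <> r && S.mem r (ancestors parents r'))
               common))
      common

  let parents = function
    | "A" | "B" -> [ "R" ]
    | "C" | "E" -> [ "A"; "B" ]
    | "D" -> [ "C" ]
    | "F" -> [ "E" ]
    | _ -> []

  (* S.elements (maximal_common_ancestors parents "D" "F") = ["A"; "B"] *)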

git and hg have different ways of dealing with the case of multiple MCAs. By default, hg just picks one of the MCAs arbitrarily and does the merge based on that. Given that different choices of the merge base will lead to different results, making that choice arbitrarily is pretty disturbing.

git, on the other hand, has a strategy called recursive merge that repeatedly merges together the MCAs, and then uses that merged MCA as the basis for computing the diffs to A and B on which the final merge will be based. And hg has a new strategy called bid merge that is willing to make different choices as to the GCA to use on a file by file basis.

None of these approaches amount to principled solutions, and while they work better in some cases and worse in others, they all sometimes lead to bad results. It's tempting to look for a way out of this conundrum altogether, by avoiding the possibility of criss-cross merges in the first place.

Avoiding the criss-cross merge

For those who haven't read my previous posts about how Iron approaches merges, I'll describe it briefly here. Iron organizes its branches into a hierarchy: every repository has a root feature, and that feature can have children, and those can have children as well. Thus, our main repository, called Jane, has a root feature called jane, and one can develop changes to jane in child features, such as jane/Core.Applicative or jane/quickcheck.

Critically, the merging of features is constrained. Note that in Iron, every feature is defined by its base and tip revision, where the diff between those two revisions is effectively the contents of the feature. Here are some of the key operations allowed on features.

  • fe release moves changes from a child feature into a parent. This can only be done once the child feature is fully merged with its parent, and has the effect of setting the tip of the parent to be the tip of the child, and typically deleting the child.

As an example, if the jane/quickcheck feature is based at the current tip of jane (and is fully reviewed, and all its tests pass), then calling fe release jane/quickcheck will move the tip of jane forward to be equal to the tip of jane/quickcheck, and will delete jane/quickcheck.

  • fe rebase lets you merge a feature with its parent, effectively pulling changes from the parent feature into the child. This has the effect of changing the base of the feature to be the tip of its parent, and the tip of the feature to be the result of the merge.

So, if other features have been released into jane since the jane/Core.Applicative feature was created, then the base of jane/Core.Applicative will no longer be the tip of jane. Calling fe rebase jane/Core.Applicative will merge the tip of jane/Core.Applicative with the tip of jane, and will set the base of jane/Core.Applicative to the tip of jane.

  • fe rename, which in addition to allowing you to simply change the name of a feature, also lets you introduce a parent-child relationship between features that didn't previously have one. e.g., calling fe rename jane/Core.Applicative jane/quickcheck/Core.Applicative causes the Core.Applicative feature to become a child of, and so be able to depend on the changes in, the quickcheck feature.
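
Schematically, the operations above do something like the following (a toy model, not Iron's actual code; merge stands in for a real three-way merge against the recorded base):

  type rev = string
  type feature = { mutable base : rev; mutable tip : rev }

  (* stand-in for a real three-way merge that uses [gca] as the base *)
  let merge ~gca:_ a b = Printf.sprintf "merge(%s,%s)" a b

  (* fe rebase: pull the parent's changes into the child *)
  let rebase ~parent child =
    child.tip <- merge ~gca:child.base child.tip parent.tip;
    child.base <- parent.tip

  (* fe release: push the child's changes into the parent *)
  let release ~parent child =
    assert (child.base = parent.tip); (* child must be fully up to date *)
    parent.tip <- child.tip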

All of these operations are implemented against a single, centralized server which keeps track of the state of all our features. This centralization lets Iron enforce some useful invariants along the way, critically, that the GCA of a feature and its parent is well defined, and is equal to the base of the feature. This simple property turns out to outlaw criss-cross merges, which avoids all of the mess we described earlier.

The happy outcome turns out to depend critically on the fact that we built a central server that could enforce the invariants in question, or, more precisely, that we built a consistent service. Though we discovered this more or less by chance, the existence of the central server is key to enforcing the necessary invariant. In particular, the scenario of two different users concurrently releasing into the same feature or rebasing the same feature simply isn't possible when there's a centralized monitor determining who goes first.

In retrospect, this shouldn't be too surprising. The criss-cross merge is really a result of concurrency, and the idea that introducing a lock (which is what a centralized server gives you) can be used to exclude unwanted concurrent executions in a distributed system should surprise no one.

In the end, you can trace it all back to the CAP theorem: if you want progress while partitioned, you need to give up on consistency in some way. And criss-cross merges are caused by a kind of inconsistency.

Centralization obviously has downsides, but I think Iron picks a nice point along the spectrum here: writing code is totally doable while disconnected, but operations like rebase and release that affect how information is shared between features require you to be connected. I think it's a small price to pay to never have to deal with a criss-cross merge.

Making making better

We spend a lot of time and effort on training new people, and it never stops for long. Right now our winter-intern class is ending; in five months we'll have a slew of new interns to get up to speed, and a few months after that we'll have an incoming class of new hires.

A big part of our new-hire training is OCaml Bootcamp, a month-long immersion program for non-dev hires (mostly trading, but some from other areas like research, systems, and accounting). We don't think everyone at Jane Street should be a full-on software developer, but writing code is such a useful way to get things done that it's worth teaching the basics to a broad set of people.

Teaching programming, especially to people who are not planning on becoming software developers, is an opportunity to reflect on how unnecessarily hard programming can be. There's a huge learning curve as you struggle to learn your way around the Unix command line, or figure out the key commands for controlling a 1970s-era text editor like Emacs or Vi, or puzzle through the semantics of a distributed version control system like Mercurial. And all of that is before you even start writing code!

To me, this serves as a reminder of the importance of good tools. The quality of your tools can increase or decrease the steepness of the learning curve, and they also affect your day-to-day efficiency after you've climbed up that hill.

Tools are easy to undervalue. Most of our development time is spent, as it should be, on our first order problems -- writing the code that powers the systems that let us trade. And the connection between better tools and better trading can seem tenuous.

But really it's not tenuous at all. If you spend all your time on first order problems, you'll discover you're not solving them as fast as you should be. Getting things done effectively requires optimizing your own productivity, and to do that, you have to spend some time sharpening your tools.

And we've done a fair bit of sharpening. One recent example is Iron, a new code review and release management system that we started using last summer. Last year, we also rolled out a new build system called Jenga, which greatly simplified and sped up the process of compiling our code. Plus, we switched to a new version of OCaml, which includes a big set of improvements, some of which were specifically aimed at improving our development process. And we funded some former interns to improve Merlin, a fantastic tool that provides IDE-like features like context-sensitive autocompletion in a way that can be easily integrated into multiple editors.

Jane Street is a pretty small shop --- we have fewer than 65 full-time developers --- but even at our modest scale, spending time on tools is a big win. But it's really about more than dev tools. Thinking about how to make the people around you more effective informs how you work more generally, changing how you design libraries, how you manage services, and how (and whether!) you write documentation.

And in addition to making for a more effective organization, it's also a more pleasant way to live.

13 Virtues

Very early on in his life, while on a lengthy voyage from London to Philadelphia, Ben Franklin created a system of thirteen virtues to live his life by. He spent the remainder of his days giving special focus to one virtue per week in a 13-week cycle, as well as noting the virtues he had failed to live up to at the end of each day.

Over time he credited the system with making him more productive and more fulfilled.

My aspirations are not so lofty, but in the spirit of the new year, I present Ben's thirteen virtues as they relate to code review and discussion. Nothing here is meant to be taken as gospel, but together they give me a path towards the type of collaboration we value at Jane Street.

My simple hope is to waste less time, produce better code, and have fewer arguments over the next 12 months than I did over the last.

Temperance

Review thoroughly, but not to the point of exhaustion. Be mindful of the limits of review.

Silence

Say only things that clearly benefit the code; avoid trifling comments and tangents.

Order

Create the structure (files, modules and types) necessary to give every concept a clear place. Be suspicious of catchall modules.

Resolution

Respond to feedback and change requests quickly. Momentum is important.

Frugality

Don't waste people's time with frivolous review, careless comments, or code that isn't ready for review. Attention is expensive.

Industry

Prefer to respond with working code over additional commentary. Focus review on immediately productive outcomes instead of uncertain concerns.

Sincerity

Come to discussions with an innocent mind. Engage in code review with the clear goal of helping.

Justice

Weigh code decisions on the evidence at hand today, and not on personal preferences, prejudices, or obsolete past constraints.

Moderation

Avoid extremes in both style and approach. Incorporate strong views slowly.

Cleanliness

Spend time to lay out code in a clear and visually pleasing way. When reviewing, leave the code neater than you found it.

Tranquility

Don't become impassioned or incensed over trifles. Engage in all conversation with an open balanced tone and a sense of patience.

Chastity

Proliferate new ideas through the code base cautiously and be aware that even a good idea may not work in all places.

Humility

Take a modest view of your own contributions and feedback. Be unpretentious and respectful in your comments. Accept that you may be wrong.

Inspecting the Environment of a Running Process

Sometimes it's useful to be able to see the values of environment variables in running processes. We can use the following test program to see how well we can accomplish this:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        char *envstr;
        /* "%ms" asks scanf to allocate the string for us
           (POSIX.1-2008; older glibc spelled this "%as") */
        while (scanf("%ms", &envstr) == 1) {
                putenv(envstr);
        }
        return 0;
}

This program just reads strings from stdin and passes them on to putenv(3), so we have an easy way to modify its environment.

Now, let's run it with env -i to reset the environment to something pretty sparse:

[cperl@localhost ~]$ gcc -Wall t.c -o t
[cperl@localhost ~]$ env -i FOO=bar ./t

First, let's see what we can get out of /proc/{pid}/environ, as googling for this problem will undoubtedly point you in this direction (including ps eww, which reads /proc/{pid}/environ):

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

Great, so that looks like it's our answer!

Unfortunately, /proc/{pid}/environ only reflects the environment of the process when it started and does not reflect any calls that process might have made to putenv(3) or setenv(3) (you can experiment with the above program substituting in setenv(3) for putenv(3) and playing with overwrite to see what you get).

We can see that if we feed some data into our program, causing calls to putenv(3):

[cperl@localhost ~]$ env -i FOO=bar ./t
BAR=baz

And then check /proc/{pid}/environ again:

[cperl@localhost ~]$ cat /proc/$(pgrep -x t)/environ | xargs -r0 -n1 echo
FOO=bar

However, we can verify the data is really there if we attach with gdb and iterate over the environ(7) array directly:

[cperl@localhost ~]$ gdb ./t $(pgrep -x t)
...
(gdb) set $i = 0
(gdb) while (environ[$i] != 0)
 >print environ[$i++]
 >end
$1 = 0x7fffc8e42fec "FOO=bar"
$2 = 0x12d4080 "BAR=baz"

Unfortunately, I'm not aware of any other way to get this "dynamic environment info" (except for other ptrace-based solutions). Obviously attaching to production processes with gdb (or ptrace in general) isn't a great idea. Most of the time you'll probably be fine inspecting /proc/{pid}/environ and verifying (via source code inspection) that the process you care about doesn't make any calls to putenv(3) or setenv(3) for the variables whose values you are interested in.

If you have any better ideas about how to get this information, please share in the comments!

How to choose a teaching language

If you were teaching a programming course, what language would you teach it in?

I like this question because it has any number of good answers, each quite different from the other, and each emblematic of a different approach to what programming is about.

The first formal programming class I took was COS 217 at Princeton, taught by the excellent (and at the time, I thought, terrifying) Anne Rogers. The course was (and is) taught in C, and the intellectual approach of the class was to start from the machine. Instead of just learning to program in C, we learned about how the machines we were programming to worked. That was where I first encountered instruction counters, stack frames, registers and the memory hierarchy. It was exhilarating.

Where C encourages you to start with the machine, Scheme wants you to start at the mathematical foundations of computation. You don't need to know what the lambda calculus is to appreciate Scheme's slim core, and the way in which you can build a rich and vibrant computational world on top of it. That core is expressive enough that it makes it natural to introduce ideas that come up in multiple different languages, including functional and imperative techniques, object orientation, and logic programming.

The classic course in this vein is MIT's 6.001, also known as SICP, Structure and Interpretation of Computer Programs. Sadly, 6.001 has been retired at MIT, but the book lives on, and is a worthwhile read even if you took your last CS course years ago.

MIT replaced SICP with a course based on Python, and this reflects a broader trend. As was highlighted by an informal study by Philip Guo, lots of schools now teach Python, particularly for early introductory courses. I have mixed feelings about this choice. Python is a wonderfully friendly language, but that friendliness is bundled together with some problems.

This was made apparent to me in part by my experience with students who chose to code in their interviews in Python. In many ways, Python is the ideal interview language, since its concise and readable syntax makes the constraints of coding on the whiteboard more bearable. But what I saw was that students who learn Python often walk away with a rather rough model of the semantics of the language. You might be surprised at what fraction of students who have programmed extensively in Python can't guess how Python lists might be implemented, to say nothing of their ability to explain the semantics of language features like generators or decorators.

This isn't really a knock on Python. After all, there's something great about a tool that lets you get things done without fully understanding how it works. But in different ways, both Scheme and C encourage you to understand what's going on from the ground up, and there's a certain pedagogical power in that. All in, I think Python is a great choice for an early introductory course, particularly one meant for those who aren't going to end up as computer scientists or full-time programmers. But I'd be leery of using it outside of those circumstances.

A development that I'm personally rather encouraged by is the emergence of statically typed functional languages, ML in particular, as teaching tools. Over the last few years, I've had the pleasure of visiting and lecturing at a number of schools that teach using OCaml or SML, including Brown, Cornell, Penn, Princeton, CMU and Harvard.

ML has gained ground for good reasons. First, it shares much of Scheme's elegant intellectual foundations, even though its core isn't quite as charmingly minimalistic as Scheme's. But ML has a wider range than Scheme because it allows you to show students the role of types in programming. Despite that greater range, OCaml and SML are relatively simple languages, which matters even more for teaching than it does for everyday use.

The only choice I've seen a lot of that I can't reconcile myself to is Java. Java is of course a widely used industrial language, but that doesn't mean it's a good teaching language. In my view, a key requirement for a teaching language is simplicity, and all the other choices I mentioned are simple in one way or another: C is a minimal layer on top of the machine; Scheme and ML are based on simple mathematical models of computation; and Python is a language that people find simple to use.

Java isn't simple in any of these ways. It's not particularly easy to get started with, as indicated by all the things you need to tell students to ignore rather than understand. (Yeah, public static void main, I'm looking at you!) Nor does it have a simple and transparent execution model like C. And an elegant computational core calculus like the one at the heart of Scheme and ML is nowhere in sight. The only real advantage I see for Java is vocational, and that doesn't seem to me like argument enough.

The thing to consider when you're picking a language to teach is that you're not just picking a bit of infrastructure for your students to program with during the course. You're picking an intellectual frame through which they will see all the lessons you teach them. And you should pick that frame with care.

Interviewing At Jane Street

Welcome to our version of the seemingly obligatory post about technical interviews. This topic has been covered by a lot of people already, so I'm going to do my best to not repeat all of the copious advice already out there.

Like many companies, we are looking for extremely talented technical people, and we have a challenging interview process that we think does a good job of selecting people who will do well here.

That said, we know that we miss lots of good people too. Some of that is because of the awkwardness of the interview process itself: time is short, the questions are weirdly artificial, and, of course, people get nervous. It's made even worse by the fact that programming on a whiteboard, a web browser, or even just on a computer that isn't yours is a bit like playing classical guitar with mittens on. It can put even an accomplished person off their game.

Missing out on good people makes us sad.

That's what this post is for. We hope that by talking a bit about what we're looking for, the ways people do poorly, and how we think you might be able to prepare, we'll reduce the mitten handicap - at least a bit.

What Are We Looking For?

From our perspective, the main thing we want to figure out when we interview someone is: are they someone we want to work with?

That seems obvious enough, but it's a point that can get lost in the puzzles and whiteboard coding of an interview. Really, we think of our interviews as little simulations of what it's like to work together with the candidate. And while at the time, it may seem to the candidate that the interview is all about solving the problem, it's really not. We're much more interested in learning about how you work than we are in whether you actually finish whatever problem we put in front of you.

It's not that technical skill is irrelevant --- far from it. But it's only part of the story. Just as important to us is whether we can have a productive and fun discussion with the candidate.

To that end, we try to avoid algorithm bingo and puzzles with clever "aha" solutions. We prefer more open-ended problems that have no single right answer, since they give us more freedom to work together, and to see how the candidates' thinking plays out.

That sounds nice enough, but it's a bit high-level and hand-wavy. So here's a more concrete list of suggestions that follow from our overall approach.

Be nice

The smartest, best person in the world won't get hired at Jane Street if they aren't courteous and pleasant to talk to most of the time. Almost nothing will end your interview faster than being rude, pushy, or obnoxious.

Remember, we are looking for someone to work with, not just someone who can win an argument.

Be clear

And by clear, we mean simple and to the point. Use words and examples that get the core idea across to the widest technical audience possible.

Avoid showy, highly technical or deeply obscure terms of art, especially if you don't fully understand them. In the best case we'll likely just ask exactly what you meant by "hylomorphism", which wastes precious time. In the worst case it will become clear that you should have said "metamorphism" instead, which is just embarrassing for everyone involved.

Know what you don't know

Naturally we like it when candidates are good at solving the problems we put in front of them. But just as important, perhaps more important, is their ability to think reasonably about their own level of understanding.

In other words, we really like people who can express appropriate levels of confidence: admitting ignorance when they're unsure, and speaking confidently when they have the basis to do so. At a basic level this means being willing to say, "I don't know" rather than guessing and hoping when we ask you about something you aren't familiar with.

Know your language

Code is a wonderful language for communicating complex ideas because it provides a clear, concise and unambiguous way of expressing them. But, like any foreign language, it takes a lot of time and practice to get really comfortable.

We need you to be comfortable with it, because we communicate ideas in code a lot.

Now, comfortable doesn't mean that you have to be the world's best coder, or that you need to have memorized your favorite algorithms book. It means that you should be able to read and write code in at least one language without constant access to reference materials for common things, such as:

  • Standard control structures (loops/if-then/etc.)
  • Function, module, class, type, etc. definitions
  • Common data types like arrays, lists, hash tables/maps/dictionaries
  • Exceptions and other error handling techniques

Also, pick coding tools you understand well. Don't pick a functional language to make us happy. We'd much prefer you use a language that you know well. Similarly, when picking which features of the language to use, pick the ones you understand best. We're not going to be impressed with your elegant use of Python decorators if you don't really understand the details of what they do.

In other words, pick a simple, clunky solution that you understand over a fancy, elegant one that you don't.

Remember CS 101

We've hired plenty of successful people who didn't have a traditional college background in CS, and we certainly don't require a masters or a PhD. That said, we need you to have a solid working knowledge of core computer science concepts, including:

  • Abstraction layers like functions, objects, and modules

  • Basic algorithms and data structures, including binary search, sorting, hashing, breadth/depth first search, hashtables, binary trees and heaps.

  • Techniques for estimating CPU and memory costs, including big-O notation.

So if you can't for the life of you recall what amortized analysis is, and you can't nimbly bang out the code for a depth-first search, it's probably worth reviewing some of this material.

Think about real computers

Depending on your point of view it's either a sad or beautiful fact that the most elegant code can only run on top of the giant, complex, odd stack of parts and abstractions that is a real computer. Since we need programs to actually run, we need people who understand the inner workings of that behemoth.

Now, this doesn't mean that we quiz every candidate about deep kernel internals, or the differences between SDRAM vs SGRAM. But for some jobs in systems development we do expect a fair amount of detailed knowledge, and in general it's a big plus if in your thinking you can take into account things like cache effects, IO patterns, memory representations, and the capabilities of real CPUs.

What We Don't Look For

  • Perfection. Our questions are often designed to be open-ended enough that even the best people we've seen couldn't answer them fully in the time allotted. We want to keep the conversation going to learn everything we can, and we don't expect that you'll answer everything 100% perfectly.

  • We don't ask developers mental-math questions or math-olympiad questions, despite what you might have read online. Dev interviews are about programming.

  • We don't ask developers logic puzzles about pirates, people who only tell the truth, or which door the tiger is behind. Dev interviews are about programming.

How do people do poorly?

The most common issue is, of course, that some candidates just don't have the background for the job they are applying for. The solution to that is to learn more, and practice more. But there are a few other less obvious reasons that interviews with otherwise technically good people can go awry.

They're careless

One of the most common pieces of negative feedback from our interviewers is that the candidate was careless. This usually doesn't mean that the candidate didn't make progress, or didn't have good insights. It means that the candidate didn't think carefully and systematically about how their program might go wrong.

We care that you make progress, but we are rarely concerned about finishing a problem. In fact, many of the problems are designed to go on for far longer than the average interview length. It's better to step back and check your work carefully before you claim that it is finished than to rush.

They talk more than they code

Some candidates do a good job of talking through the problem and explaining their thinking, but never quite get around to concretely answering the question. We want to hear your ideas, but we also want to see you produce concrete solutions, which almost always involves writing code. There are lots of things that we can't learn about a candidate without seeing their code and talking through the details.

Take some time at the beginning to think about the solution and to talk about your plans, but make sure you start writing code - even if it isn't the code you would write if you had more time.

They don't generalize

We try to keep our interviews interactive, and we'll often stop candidates to ask about something they have just done, or to point out something that we think might be confusing or incorrect. We understand that we've seen these problems over and over again, and that you are coming to them fresh, so you shouldn't worry just because we've found a problem with your solution.

What you should do is generalize the advice. If we point out that you missed a case, consider other cases you might have missed. If we remind you of an invariant you forgot, find a way to protect yourself from making the mistake in other places in your code.

They say one thing and do another

We love it when a plan comes together, but it's extra hard to watch when a good plan falls apart on execution. If you hear a question, and discuss a plan of attack with your interviewer, do what you claim you will do. Don't change the plan in the middle, or drop it entirely in favor of a better idea without some discussion. You have a very limited amount of time to describe your solution in code, and executing a decent plan well is better than producing a Frankenstein's monster of 3 different plans that doesn't quite come to life.

If you do get partway through and start to lose faith, step back and talk about it. Explain exactly why you are concerned, and whether you think it might be fatally flawed at the core, or just not ideal. If there really is a fatal flaw and you've seen it, we'll help you get out of the jam, and we'll appreciate that you articulated it. If it's just not quite perfect, we'll likely encourage you to continue.

So, what can you do to prepare?

This part is short and sweet. Build something - from scratch - in a language you like. Don't stop short. Build the whole thing.

Now, show it to the smartest people you know, get feedback, tear it down and build it again with what you've learned.

Repeat with a new problem.

We are looking for people to build real things with us, and practice really does make perfect.

What the interns have wrought: RPC_parallel and Core_profiler

We're in the midst of intern hiring season, and so we get a lot of questions about what it's like to be an intern at Jane Street. One of the things people most want to know is what kind of projects they might work on as an intern.

That's of course hard to answer precisely, since we don't yet know what next summer's intern projects are going to be. But you can get a sense by seeing some of what interns have done in the past. To that end, I thought I'd describe a couple of intern projects that were done this past summer.

Rpc_parallel

Rpc_parallel is a new library written by Todd Lubin (who will be returning to Jane Street full-time next fall), aimed at making parallel programming in OCaml simpler and safer.

Writing programs that take advantage of multiple cores, and even of multiple computers, is of course a necessity. In OCaml, such parallelism is typically achieved by writing multiple single-threaded programs that use message passing to coordinate.

We have lots of parallel message-passing systems, but while these share a lot of infrastructure (notably, the Async_rpc library), we've never really settled on a lightweight way to construct complete parallel programs. Instead, each such program tends to have its own little process coordination framework.

A while back, we wrote a library called Async_parallel (described here). Async_parallel does a lot of things right -- in particular, it makes it easy to spin up and manage new processes and distribute work between them.

But Async_parallel has a problem. It is based on OCaml's marshal facility, which allows you to serialize arbitrary data, including functions, between processes. This sounds great, but it has a dark side: marshalling arbitrary functions turns out to be error prone. In particular, in order to send a function over the wire, you also have to send a copy of any data that that function implicitly depends on.

Unfortunately, it's hard to know what kind of data is hidden behind a function, which can cause a few different forms of trouble: it might send much more data than you expect, it might fail unexpectedly if it hits one of the forms of data that can't be marshalled, or it might lead to crazy and hard-to-predict behavior if some of the data required by the function is meant to be mutable shared state.

Instead, we wanted a library that was more typeful and constrained in terms of what was sent over the wire. This pushed us towards a design where, at the cost of some extra verbosity, we explicitly declare the types of data that is sent. In exchange, we get a system that is easier to understand and debug.

One of the great things about Rpc_parallel is how fast it came together. Todd got it into a sufficiently good state by the middle of the summer that he was able to use it for his other projects (interns typically have at least two major projects over the course of the summer).

Rpc_parallel also benefitted from some world-hopping collaboration. Interns spend at least a week in an office other than their home office, and Todd ended up visiting Hong Kong. While there, he ended up spending a lot of time talking and working with Chengqi Song, who had a lot of experience with Async_parallel. Out of those discussions came a complete redesign and rewrite of the library, factoring out the core primitives for coordinating across multiple processes, and making the overall library simpler and more general.

By the end of the summer, a few other people had picked it up and started using it for other projects, and last week it was released open source, so you can take a look at it yourself on github.

Core_profiler

Profiling is surprisingly hard, and as such it's perhaps unsurprising that there are lots of ways of doing it. If you want to understand the cost of an individual operation, you might want a micro-benchmarking library like Haskell's Criterion, or our own Core_bench. If you're trying to understand properties of a whole application, like which lines of code it's spending its time on, or how many cache-misses are occurring, you might want to use something like Linux's perf tools, which use CPU-level counters to efficiently gather profiling statistics from your program.

Another useful technique is for the programmer to modify the source to add explicit probes that keep track of when certain program points are reached, and write out that information to a log that can be analyzed after the fact.

Daniel Richman (who will be returning for an internship next summer) worked along with Roshan James (formerly an intern himself, now full-time) on a library called Core_profiler, which aims to make the use of such probes easy and efficient. Efficiency matters quite a bit, because if logging a probe takes more time than the thing you're trying to measure, you basically can't extract reliable data. Keeping the overhead small, therefore, is a must.

Accordingly, a lot of Daniel's time was spent thinking very carefully about how to write the probes in a way that would only minimally disrupt the execution of the program. He began with a simple but less efficient library, called Stats_reporting, which took about 60ns and two words of allocation per probe, and machined it down from there.

The first step was to avoid all allocation, which meant we could no longer use bin-prot, our standard binary serialization technique, since bin-prot requires that you allocate an OCaml object representing the data to be serialized. So they moved to using an internal library called Protogen for generating zero-allocation serialization code.

That brought us down to about 30ns, and zero allocations. We then decided to try out writing our own hand-rolled binary protocol, so we could have a yet-more-closely optimized binary layout. That brought us down to about 20-25ns. The next hack was to customize the binary format yet more by packing multiple values into a single, 63-bit OCaml integer. That, plus some cleverness to make sure that writes were word-aligned, brought the cost of a simple probe down to 12ns. In addition to the design changes, Daniel also spent a lot of time carefully examining the assembly code, to make sure that there were no surprises from the code generator.
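
To give a flavor of that last hack, here's a sketch of this kind of bit-packing (the field layout is made up for illustration, not Core_profiler's actual format):

  (* pack a probe id and a timestamp delta into one unboxed OCaml int:
     10 bits of id, 53 bits of delta (a hypothetical layout) *)
  let pack ~probe_id ~delta_ns =
    (probe_id lsl 53) lor (delta_ns land ((1 lsl 53) - 1))

  let unpack word = (word lsr 53, word land ((1 lsl 53) - 1))

  let () =
    let w = pack ~probe_id:7 ~delta_ns:123_456 in
    assert (unpack w = (7, 123_456))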

We're pretty happy with the end result. We think that Core_profiler probes are now a good bit cheaper than the dynamic probes you can insert using a system like DTrace, and are in any case the best tool we have for tracking the performance of even relatively quick code-paths. And, as a nice bonus to all this optimization, the offline processing got about twice as fast as a side effect of the runtime improvements.


There's more to be said about both of these projects, and about the many other projects that were done this summer. If you're interested in applying, you can do so here:

http://janestreet.com

You don't need to know anything about finance or functional programming to apply. But if you come, you'll learn a ton about both by the time you're done.

What is gained and lost with 63-bit integers?

Almost every programming language uses 64-bit integers on typical modern Intel machines. OCaml uses a special 63-bit representation. How does this affect OCaml?

OCaml int memory representation

Most OCaml types are represented in memory as a header followed by data. The header is a 64-bit integer containing the length of the data and a tag, which is a rough classification of the type. The only OCaml types that differ from this are ints and, sometimes, floats.

Floats normally have a header and data, the data being the value of the float itself. This representation is called "boxed". If one of a record's fields is a float, the record's data will actually contain a pointer to the float data. The only exceptions are records containing only floats, and float arrays: their data contains the float values themselves instead of pointers. This representation is called "unboxed".

Values of type int are never stored boxed, as header plus data. An int x is stored as (x << 1) | 1, where << is left shift and | is bitwise or, so its least significant bit is always set. Pointers are word-aligned, so they never have this bit set, which is how ints and pointers are distinguished. Much of typical data consists of integers, so this representation significantly improves performance:

  • there's no need to dereference a pointer when getting an int
  • no memory allocation is needed when creating ints
  • less work for the garbage collector
  • less memory fragmentation
  • no memory is needed for int headers

Distinguishing whether a value is an int or a pointer is as simple as testing x & 1, so this feature doesn't slow down the garbage collector, polymorphic hash, polymorphic compare, or anything else that structurally inspects data. One should note that this doesn't apply to the types int32 and int64, which are always boxed.

Penalty

Having the extra bit comes with a price: arithmetic operations are more complicated. For example (x and y here denote the tagged machine words):

  • x + y is translated to CPU instructions x + y - 1
  • x * y is translated to CPU instructions (x >> 1) * (y - 1) + 1
  • x / y is translated to CPU instructions (((x >> 1) / (y >> 1)) << 1) + 1
  • x lsl y is translated to CPU instructions ((x - 1) << (y >> 1)) + 1

Sometimes this penalty is small or nonexistent. For instance there's no need to fix the bit in x + y - z. Only one bit fixing is needed for all five additions in x + y + z + w + u + v.
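
These identities are easy to check directly (a small sketch; repr and untag are helpers that mimic the compiler's tagging, since OCaml's own ints are already tagged under the hood):

  let repr x = (x lsl 1) lor 1 (* tag: x -> 2x + 1 *)
  let untag r = r asr 1        (* untag: 2x + 1 -> x *)

  let () =
    let x, y, z = 3, 4, 5 in
    (* addition needs one fix-up... *)
    assert (repr (x + y) = repr x + repr y - 1);
    (* ...but in x + y - z the fix-ups cancel entirely *)
    assert (repr (x + y - z) = repr x + repr y - repr z);
    (* multiplication, in tagged terms *)
    assert (repr (x * y) = (repr x asr 1) * (repr y - 1) + 1);
    assert (untag (repr x) = x)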

Another help is the Intel CPU instruction LEA, which can compute the sum of three integers with a single instruction, like x + y - 1. Unfortunately, LEA became very slow in recent generations of CPUs, and Intel hasn't suggested this will change.

This benchmark (test.ml) tries to estimate the difference in performance. The results from Sandy Bridge show about a 2x speed difference in arithmetic operations. The assembly can be examined by compiling with "ocamlopt -S test.ml".

speed(ns)       63-bit   64-bit   slowdown
add independent 0.327588 0.121502 2.696156
add   dependent 0.328160 0.169375 1.937477
mul independent 0.614824 0.328060 1.874120
mul   dependent 1.094343 0.328872 3.327565
lsl independent 0.359828 0.166088 2.166489
lsl   dependent 0.762251 0.177468 4.295151
xor independent 0.249350 0.122900 2.028886
xor   dependent 0.404255 0.170380 2.372668

Agner's instruction tables show that the difference is even bigger with later generations of CPUs. For instance, Haswell can do four integer adds per cycle versus one LEA.

Conclusion

The benefits of unboxed ints are amazing. On the other hand, arithmetic operations are significantly slower. How much do arithmetic operations affect an average program? Could we have a solution which would keep ints unboxed but have fast arithmetic operations?
