OCaml Annoyance #23: type declarations are implicitly recursive

Unlike let declarations, type declarations in OCaml are automatically recursive. This seems harmless at first, but it actually causes more trouble than it's worth. To see why, let's look at an example. Here's a simple signature that uses nested modules and that adopts the reasonable convention of using t for the primary type of a module.

  1. module Thing : sig
  2. type t
  4. module Collection : sig
  5. type t
  6. end
  8. val create : string -> t
  9. val collect : t list -> Collection.t
  10. end

Unfortunately, implementing this is made more painful by the fact that type declarations are recursive by default. If you do the obvious thing:

  1. module Thing = struct
  2. type t = string
  4. module Collection = struct
  5. type t = t list
  6. end
  8. let create x = x
  9. let collect xs = xs
  10. end

You get the following error:

File "type_is_implicitly_rec.ml", line 5, characters 9-20:
The type abbreviation t is cyclic

You can fix this by introducing an extra dummy type definition in between to break the recursion:

  1. module Thing = struct
  2. type t = string
  4. module Collection = struct
  5. type z = t
  6. type t = z list
  7. end
  9. let create x = x
  10. let collect xs = xs
  11. end

This works, but it's ugly and kind of confusing.

The ML sweet spot

I just got back from visiting Northeastern and Harvard where I yet again flogged a version of my POPL talk. Olin Shivers was my host at Northeastern and Greg Morrisett at Harvard. It was a bit of a rushed visit, but a lot of fun nonetheless.

Both Greg and Olin are very interested in making the next big jump in programming languages, and they both think that that next step will require better ways of reasoning statically about programs. I think they're dead-on in terms of what the right direction to go is, but I think they've got their work cut out for them. It will be hard to beat ML because ML sits in a kind of sweet spot; make it a little bit better in one aspect, and you give something up in another. I can think of two ways in which this is true. The first has to do with the expressiveness of the type system. If you make the type system much more expressive, you generally need to give up a lot on the type-inference side and in the quality and comprehensibility of error messages. You can actually see this at work in OCaml. Some of the more advanced features, like polymorphic variants and objects, do indeed suffer from more obscure error messages and worse type inference than the rest of the language. I think polymorphic variants in particular are worth the trouble, but it just underlines the fact that adding to the Hindley-Milner type system is tricky. And compared to some of the things Olin and Greg and others in the PL community are thinking about, OCaml's extensions to HM are pretty tame.

The second way in which ML is in a sweet spot has to do with the execution strategy. In ML you can write code that is reasonably "declarative" (which I think mostly means that you can express your logic cleanly and concisely), and at the same time there is a straight-ahead execution strategy that allows you to run the code reasonably efficiently. You can make things a little more declarative, but you often give up quite a bit on the predictability of the performance you get out of your code. There are plenty of other languages where you can see this tradeoff play out. Haskell's laziness allows for a more declarative style at the expense of making it harder to think about space complexity (and indeed, Haskell's designers are aware of this. Simon Peyton Jones like to say that he thinks the next ML should be pure and the next Haskell should be strict). SQL and Prolog make it possible to specify queries without worrying too much about how the search for an answer is to be performed, at the expense that both time and space complexity become hard to reason about.

ML is hardly perfect. But it's such a nice compromise that it's going to take a real leap forward on the research side to definitively surpass it.

Bind without tears

One of the annoyances of using monads in OCaml is that the syntax is awkward. You can see why if you look at a simple example. Assume that you're using the usual option monad. If we define >>= to be the bind operator, you might end up with a piece of code that looks like this:

  1. let f x_opt y_opt z_opt =
  2. x_opt >>= (fun x ->
  3. y_opt >>= (fun y ->
  4. z_opt >>= (fun z ->
  5. return (x + y + z))))

This is awful for two reasons: the indentation is absurdly deep, and secondly, and there are too many closing parens at the end. One solution to this is

pa_monad, a camlp4 syntax extension that lets you write the same piece of code like this:

  1. let f x_opt y_opt z_opt =
  2. perform
  3. x <-- x_opt;
  4. y <-- y_opt;
  5. z <-- z_opt;
  6. return (x + y + z)

This is much less painful to look at, but introducing new syntax has a cost, and it's worth seeing if we can make things more livable without resorting to a syntax extension. We can make our original code less eye-gougingly awful by just changing the indentation:

  1. let f x_opt y_opt z_opt =
  2. x_opt >>= (fun x ->
  3. y_opt >>= (fun y ->
  4. z_opt >>= (fun z ->
  5. return (x + y + z))))

That still leaves the parens. But as it happens, the parens can simply be dropped! This was passed on to me by Adam Chlipala, who got it in turn from Xavier Leroy at POPL this year. Our code can thus be written as follows:

  1. let f x_opt y_opt z_opt =
  2. x_opt >>= fun x ->
  3. y_opt >>= fun y ->
  4. z_opt >>= fun z ->
  5. return (x + y + z)

This seems clean enough to actually use in practice. Now all I need is to coerce tuareg into indenting my code this way, and I'll be happy...

Variable-argument functions

Here's another puzzle:

Is it possible in OCaml to define a variable-argument function? For example, can one define a function f and values a and z such that the following assertions hold:

  2. assert (f z = 0);
  3. assert (f a z = 1);
  4. assert (f a a z = 2);
  5. assert (f a a a z = 3);
  6. ...

Once you've got that, how about generalizing it to a variable-argument sum, i.e. define

f, a, and z such that:

  1. assert (f (a i1) (a i2) ... (a ik) z = i1 + i2 + ... + ik);

Or, if you want to eliminate the parens, define an f, a, and z such that:

  1. assert (f a i1 a i2 ... a ik z = i1 + i2 + ... + ik);

Typing RPCs

At Jane Street, we end up writing lots of messaging protocols, and many of these protocols end up being simple RPC-style protocols, i.e., protocols with a client and a server, where communication is done in a simple query/response style.

I've always found the writing of these protocols rather unsatisfying, because I could never find a clean way of writing down the types. In the following, I'd like to describe some nice tricks I've learned recently for specifying these protocols more cleanly.

A Simple Example

I'll start with a concrete example: a set of RPCs for accessing a remote filesystem. Here are the signatures for a set of functions that we want to make available via RPC.

  1. type path = Path of string list with sexp
  2. type 'a result = Ok of 'a | Error of string with sexp
  4. val listdir : path -> string list result
  5. val read_file : path -> string result
  6. val move : path * path -> unit result
  7. val put_file : path * string -> unit result
  8. val file_size : path -> int result
  9. val file_exists : path -> bool

The with sexp appended to the end of the type definitions comes from the Jane Street's publicly available sexplib macros. These macros generate functions for converting values to and from s-expressions. This is fantastically helpful for writing messaging protocols, since it gives you a simple hassle-free mechanism for serializing values over the wire. (Unfortunately, s-expression generation is not the fastest thing in the world, which is why we've written a set of binary serialization macros for high-performance messaging applications, which we intend to release.)

The usual next step would be to build up two types, one for the queries and one for the responses. Here's what you might write down to support the functions shown above.

  1. module Request = struct
  2. type t = | Listdir of path
  3. | Read_file of path
  4. | Move of path * path
  5. | Put_file of path * string
  6. | File_size of path
  7. | File_exists of path
  8. with sexp
  9. end
  11. module Response = struct
  12. type t = | Ok
  13. | Error of string
  14. | File_size of int
  15. | Contents of string list
  16. | File_exists of bool
  17. with sexp
  18. end

In some ways, this is great. The types are simple to write down and understand, and you get the wire protocol virtually for free from the s-expression converters. And both the server and the client code are pretty easy to write. Let's look at how that code might look.

First, let's assume we have functions for sending and receiving s-expressions over some connection object, with the following signature:

  1. val send : conn -> Sexp.t -> unit
  2. val recv : conn -> Sexp.t

Then the server code should look something like this:

  1. let handle_query conn =
  2. let module Q = Query in
  3. let module R = Response in
  4. let msg = Q.t_of_sexp (recv conn) in
  5. let resp =
  6. match query with
  7. | Q.Listdir path ->
  8. begin match listdir path with
  9. | Ok x -> R.Contents x
  10. | Error s -> R.Error s
  11. end
  12. | Q.Read_file path ->
  13. .
  14. .
  15. .
  16. in
  17. send (R.sexp_of_t resp)

And the client code could look like something this:

  1. let rpc_listdir conn path =
  2. let module Q = Query in
  3. let module R = Response in
  4. send conn (Q.sexp_of_t (Q.Listdir path));
  5. match R.t_of_sexp (recv conn) with
  6. | R.Contents x -> Ok x
  7. | R.Error s -> Error s
  8. | _ -> assert false

Unfortunately, to make this all work, you've been forced to turn your type definitions sideways: rather than specifying for each RPC a pair of a request type and a response type, as you do in the specification of ordinary function type, you have to specify all the requests and all the responses at once. And there's nothing in the types tying the two sides together. This means that there is no consistency check between the server code and the client code. In particular, the server code could receive a File_size query and return Contents, or Ok, when really it should only be returning either a File_size or Error, and you would only catch it at runtime.

Specifying RPCs with Embeddings

But all is not lost! With just a little bit of infrastructure, we can specify our protocol in a way that ties together the client and server pieces. The first thing we need is something that we're going to call an embedding, but which you might see referred to elsewhere as an embedding-projection pair. An embedding is basically a pair of functions, one for converting values of a given type into some universal type, and the other for converting back from the universal type. (For another take on universal types, take a look at this post from Steven). The universal type we'll use is S-expressions:

  1. type 'a embedding = { inj: 'a -> Sexp.t;
  2. prj: Sexp.t -> 'a; }

It's worth noting that the projection function is always going to be partial, meaning it will fail on some inputs. In this case, we'll encode that partiality with exceptions, since our s-expression macro library generates conversion functions that throw exceptions when a value doesn't parse. But it's often better to explicitly encode the partiality in the return type of the projection function.

We can now write up a type that specifies the type of the RPC from which we can derive both the client and server code.

  1. module RPC = struct
  2. type ('a,'b) t = {
  3. tag: string;
  4. query: 'a embedding;
  5. resp: 'b embedding;
  6. }
  7. end

Here's a how you could write the RPC.t corresponding to the listdir function:

  1. module RPC_specs = struct
  2. type listdir_resp = string list result with sexp
  3. let listdir = { RPC.
  4. tag = "listdir";
  5. query = { inj = sexp_of_path;
  6. prj = path_of_sexp; };
  7. resp = { inj = sexp_of_listdir_resp;
  8. prj = listdir_resp_of_sexp; };
  9. }
  11. .
  12. .
  13. .
  15. end

One slightly annoying aspect of the above code is that we had to define the type listdir_resp purely for the purpose of getting the corresponding s-expression converters. At some point, we should do a post on type-indexed values to explain how one could get around the need for such a declaration.

Note that the above specifies the interface, but not actually the function used to implement the RPC on the server side. The embeddings basically specify the types of the requests and responses, and the tag is used to distinguish different RPCs on the wire.

As you may have noticed, an ('a,'b) RPC.t corresponds to a function of type 'a -> 'b. We can put this correspondence to work by writing a function that takes an ('a,'b) RPC.t and an ordinary function of type

  1. 'a -> 'b

and produces an RPC handler. We'll write down a simple implementation below.

  1. type full_query = string * Sexp.t with sexp
  2. (* The first part is the tag, the second half is the s-expression for the arguments to the query. We only declare this type to get the s-expression converters *)
  4. module Handler : sig
  5. type t
  6. val implement : ('a,'b) RPC.t -> ('a -> 'b) -> t
  7. val handle : t list -> Sexp.t -> Sexp.t
  8. end
  9. =
  10. struct
  11. type t = { tag: string;
  12. handle: Sexp.t -> Sexp.t; }
  14. let implement rpc f =
  15. { tag = rpc.RPC.tag;
  16. handle = (fun sexp ->
  17. let query = rpc.RPC.query.prj sexp in
  18. rpc.RPC.resp.inj (f query)); }
  20. let handle handlers sexp =
  21. let (tag,query_sexp) = full_query_of_sexp sexp in
  22. let handler = List.find ~f:(fun x -> x.tag = tag) handlers in
  23. handler.handle query_sexp
  24. end

Using the RPC.t's we started writing as part of the RPC_specs module, we can now write the server as follows:

  1. let handle_query conn =
  2. let query = recv conn in
  3. let resp =
  4. Handler.handle [ Handler.implement RPC_specs.listdir listdir;
  5. Handler.implement RPC_specs.read_file read_file;
  6. Handler.implement RPC_specs.move move;
  7. Handler.implement RPC_specs.put_file put_file;
  8. Handler.implement RPC_specs.file_size file_size;]
  9. query
  10. in
  11. send conn resp

And we can implement the client side just as easily.

  1. let query rpc conn x =
  2. let query_sexp = rpc.RPC.query.inj x in
  3. send (sexp_of_full_query (rpc.RPC.tag,query_sexp));
  4. rpc.RPC.resp.prj (recv conn)
  6. module Client : sig
  7. val listdir : path -> string list result
  8. val read_file : path -> string result
  9. val move : path * path -> unit result
  10. val put_file : path * string -> unit result
  11. val file_size : path -> int result
  12. val file_exists : path -> bool
  13. end
  14. =
  15. struct
  16. let listdir = query RPC_specs.listdir
  17. let read_file = query RPC_specs.read_file
  18. let move = query RPC_specs.move
  19. let put_file = query RPC_specs.put_file
  20. let file_size = query RPC_specs.file_size
  21. let file_exists = query RPC_specs.file_exists
  22. end

Pleasantly, the signature of the Client module is exactly the same as the signature of the underlying functions we're exposing via RPC.

To be clear, this is far from a complete implementation --- particularly notable is the weak error handling, and we haven't said anything about how to deal with versioning of the protocol. But even though the implementation we've sketched out is a toy, we think this approach scales well to a full implementation.

There are still some problems. Although we've added static checks for some errors, we've eliminated some others. For instance, it's now possible for the user to specify multiple RPC.t's with the same tag, and there's no guarantee that the server has exhaustively implemented all of expected RPC.t's. I'm not aware of a clean way of getting all of these static checks working cleanly together in the same implementation.

Using let module for matching

In OCaml, referring to constructors defined in other modules can be somewhat awkward. Suppose we have a module like the following.

  1. module Example = struct
  2. type t = Foo | Bar | Baz
  3. end

To write a function that pattern matches on values of type Example.t we could directly refer to the variants as follows.

  1. let f e =
  2. match e with
  3. | Example.Foo -> ...
  4. | Example.Bar -> ...
  5. | Example.Baz -> ...

That is pretty verbose. We could alleviate the problem by opening Example.

  1. open Example
  2. let f e = match e with
  3. | Foo -> ...
  4. | Bar -> ...
  5. | Baz -> ...

That is nicer to look at, but the open potentially brings a lot of things into scope (and not just for f, but for the rest of the file). Using open is generally bad style because it makes it hard for a reader to connect definitions with uses. The open would be less problematic if we could reduce its scope. We can do that by using a local module.

  1. let f e =
  2. let module M = struct
  3. open Example
  4. let res =
  5. match e with
  6. | Foo -> ...
  7. | Bar -> ...
  8. | Baz -> ...
  9. end in
  10. M.res

That's pretty verbose too. The approach we've settled on at Jane Street is to use let module to rebind the module to a short name, thereby making the code concise and avoiding the open entirely.

  1. let f e =
  2. let module E = Example in
  3. match e with
  4. | E.Foo -> ...
  5. | E.Bar -> ...
  6. | E.Baz -> ...

Extracting an exception from a module

The Unix module defines the Unix_error exception constructor.

  1. module Unix : sig
  2. exception Unix_error of error * string * string
  3. ...
  4. end

Suppose you want to create your own My_unix module that defines some Unix utility functions and exports the same Unix_error. How would you do it? You can't redeclare

Unix_error, since that would make a new constructor, which won't match Unix.Unix_error.

  1. module My_unix = struct
  2. exception Unix_error of error * string * string (* a new exception *)
  3. ... my unix functions ...
  4. end

You could include the whole Unix module, but that pollutes the namespace of My_unix unnecessarily.

  1. module My_unix = struct
  2. include Unix
  4. ... my unix functions ...
  5. end

A trick to bring just the exception constructor you want into scope is to use a constrained include of the form include (M : sig ... end).

  1. module My_unix = struct
  2. include (Unix : sig exception Unix_error of Unix.error * string * string end)
  4. ... my unix functions ...
  5. end

This does require duplicating the exception declaration in the signature, but the type checker will of course guarantee that the declaration you write matches the original, so there is no real chance for error.

A universal type?

Is it possible in OCaml to implement a universal type, into which any other type can be embedded? More concretely, is possible to implement the following signature?

  1. module type Univ = sig
  2. type t
  3. val embed: unit -> ('a -> t) * (t -> 'a option)
  4. end

The idea is that t is the universal type and that embed () returns a pair (inj, prj), which inject to and project from the universal type. Projection is partial (returns an option) because injection is not surjective.

Here is an example of how to use `Univ'.

  1. module Test (U : Univ) = struct
  2. let (of_int, to_int) = U.embed ()
  3. let (of_string, to_string) = U.embed ()
  4. let r : U.t ref = ref (of_int 13)
  5. let () = begin
  6. assert (to_int !r = Some 13);
  7. assert (to_string !r = None);
  8. r := of_string "foo";
  9. assert (to_int !r = None);
  10. assert (to_string !r = Some "foo");
  11. end
  12. end

Try it for yourself and see if you can implement module Univ : Univ so that Test (Univ) passes. No Obj.magic or other unsafe features allowed!

Talking at Penn

I just got back from an enjoyable visit at Penn. I gave a version of my POPL talk for an audience consisting in large part of students taking Benjamin Pierce's advanced programming class, which is being done in Haskell with a little bit of ML. I also got a chance to chat with some of the PL faculty and grad students and to hear what people are up to on the research front.

It was a fun afternoon. I hope among other things that it stirs up some more interest (and proposals) for this year's OCaml Summer Project.

I also spoke with Benjamin about the evolution of their intro programming course. A few years back they were teaching it in OCaml. Then, for all sorts of perfectly understandable reasons, they ended up moving the course to Java. This despite the fact that Benjamin's feeling was that the students ended up better prepared for thinking about Java when the intro course was focused more on OCaml.

It all makes sense, but it is still too bad to see one of the few places in the US really teaching functional programming as an early part of the curriculum give up on it. Maybe it will get resurrected at some point. That said, Benjamin also pointed out that there are advantages to teaching ML or Haskell to an advanced programming class, in that you get to hit the students with it when they're really ready to appreciate the power of the approach. It certainly seems like they put together a good set of students this year.

HOWTO: Static access control using phantom types

We thought that phantom types would be an appropriate topic for our first real post because they are a good example of a powerful and useful feature of OCaml that is little used in practice.

In this post, I'll cover a fairly simple use of phantom types: enforcing a capability-style access-control policy. In particular, I'll describe how you can create easy to use read-only handles to a mutable data structure. We'll explore this using the example of an int ref. The int ref is a toy example, but the same approach can be used for more realistic cases, such as a string library or a database interface.

We'll start by implementing an int ref module on top of OCaml's built-in ref.

  1. module Ref : sig
  2. type t
  3. val create : int -> t
  4. val set : t -> int -> unit
  5. val get : t -> int
  6. end
  7. =
  8. struct
  9. type t = int ref
  10. let create x = ref x
  11. let set t x = t := x
  12. let get t = !t
  13. end

The simplest way of getting a read-only handle is to create another module with a different, more constrained signature.

  1. module RORef : sig
  2. type t
  3. val import : Ref.t -> t
  4. val get : t-> int
  5. end
  6. =
  7. struct
  8. type t = Ref.t
  9. let import x = x
  10. let get = Ref.get
  11. end

An RORef.t is just a Ref.t underneath, but the signature hides that fact by making the RORef.t abstract. Note that there is a function for converting Ref.t's to RORef.t's (import), but not the other way around. This gives you a way to create the read-only handle, but prevents someone with such a handle from recovering the underlying read-write handle. The downside to this approach is that it is impossible to write code that is polymorphic over Ref.t's and RORef.t's, even if that code only uses the functionality common to both, i.e., if it only reads.

A better solution is to use a phantom type to encode the access control rights associated with a particular value. But what is a phantom type? The definition unfortunately makes it sound more complicated than it is. A phantom type is a type that is used as a parameter to another type (like the int in int list), but which is unused in the actual definition (as in type 'a t = int). The fact that the phantom parameter is unused gives you the freedom to use it to encode additional information about your types, which you can then convince the type checker to keep track of for you. Since the phantom type isn't really part of the definition of the type, it has no effect on code-generation and so is completely free at runtime. The way in which you convince the type-checker to track the information you're interested in is by constraining the appearance of the phantom types using a signature.

It's easier to understand once you look at an example.

  1. type readonly
  2. type readwrite
  4. module PRef : sig
  5. type 'a t
  6. val create : int -> readwrite t
  7. val set : readwrite t -> int -> unit
  8. val get : 'a t -> int
  9. val readonly : 'a t -> readonly t
  10. end
  11. =
  12. struct
  13. type 'a t = Ref.t
  14. let create = Ref.create
  15. let set = Ref.set
  16. let get = Ref.get
  17. let readonly x = x
  18. end

In the above code, the phantom type tells you what your permissions are. A readwrite PRef.t can read and write, and a readonly PRef.t can only read. Note that the get function doesn't pay any attention to the phantom type, which is why get can be used with both readwrite and readonly PRef.t's. The only function that can modify a ref is get, and that requires a readwrite PRef.t.

Note that the types readonly and readwrite have no definitions. They look like the declaration of an abstract type, except that these definitions are not in a signature. They're actually examples of uninhabited types, i.e., types without associated values. The lack of values presents no problems here, since we're using the types only as tags.

The great thing about this approach is how seamlessly it works in practice. The user of the library can write things in a natural style, and the type system propagates the access-control constraints as you would expect. For example, the following definitions

  1. let sumrefs reflist =
  2. List.fold_left (+) 0 (List.map PRef.get reflist)
  4. let increfs reflist =
  5. List.iter (fun r -> PRef.set r (PRef.get r + 1)) reflist

will be given the following inferred types

  2. val sumrefs : 'a PRef.t list -> int
  3. val increfs : readwrite PRef.t list -> unit

In other words, the first function, which only reads, can operate on any kind of ref, and the second, which mutates the refs, requires a readwrite ref.

There is one problem with the access control policy we implemented above, which is that there is no clean way of guaranteeing that a given value is immutable. In particular, even if a given value is readonly, it doesn't preclude the existence of another readwrite handle to the same object somewhere else in the program. (Obviously, immutable int refs are not a particularly compelling application, but having both mutable and immutable versions makes sense for more complicated data types, such as string or arrays).

But we can get immutable values as well by making the phantom types just slightly more complicated.

  1. type readonly
  2. type readwrite
  3. type immutable
  5. module IRef : sig
  6. type 'a t
  7. val create_immutable : int -> immutable t
  8. val create_readwrite : int -> readwrite t
  9. val readonly : 'a t -> readonly t
  10. val set : readwrite t -> int -> unit
  11. val get : 'a t -> int
  12. end
  13. =
  14. struct
  15. type 'a t = Ref.t
  16. let create_immutable = Ref.create
  17. let create_readwrite = Ref.create
  18. let readonly x = x
  19. let set = Ref.set
  20. let get = Ref.get
  21. end

Importantly, there's no way for an IRef.t to become immutable. It must be immutable from birth.

Extra credit: Making it more polymorphic

One thing that's notable about the IRef signature is that there is no way of creating an actual polymorphic IRef.t. The two creation functions both create values with specific tags, immutable or readwrite. These specialized create functions aren't strictly necessary, though. We could have instead written IRef with the following signature.

  1. sig
  2. type 'a t
  3. val create : int -> 'a t
  4. val set : readwrite t -> int -> unit
  5. val get : 'a t -> int
  6. val readonly : 'a t -> readonly t
  7. end

The user can force the creation of an immutable or readwrite Ref by adding a constraint. So, you could get the effect of

  1. let r = IRef.create_immutable 3

by instead writing

  1. let r = (IRef.create 3 : immutable IRef.t)

The advantage of the polymorphic create function is straightforward: it allows you to write functions that are more polymorphic, and therefore more flexible. For instance, you could write a single function that could create, depending on context, an array of readwrite refs, an array of readonly refs, or an array of immutable refs.

The downside is that it may require more type annotations when you do want to be explicit about the permissions, and it also allows some weird types to come up. In particular, you can create an IRef.t with any phantom parameter you want! Nothing stops you from creating a string IRef.t, even though string doesn't make any sense as an access-control tag. Interestingly, the signature doesn't actually make any reference to the immutable type, and in fact, using any phantom parameter other than readonly and readwrite makes the ref immutable. The access control restrictions still work in roughly the way you would expect, but it is still a bit harder to think about than the original signature.