A solution to the ppx versioning problem

Ppx is a preprocessing system for OCaml where one maps over the OCaml abstract syntax tree (AST) to interpret some special syntax fragments to generate code.

Ppx rewriters get to work on the same AST definition as the compiler, which has many advantages:

  • The AST corresponds (almost) exactly to the OCaml language. This is not completely true as the AST can represent programs that you can't write, but it's quite close.

  • Given that the compiler and pre-processor agree on the data-type, they can communicate between each other using the unsafe [Marshal] module, which is a relatively cheap and fast way of serializing and deserializing OCaml values.

  • Finally, the biggest advantage for the user is that the locations in the original code are exactly preserved, which is a requirement to get usable error messages. This is not so great for the generated code, as the best one can do is reuse some locations from the original source code and hope for the best. In practice the user sometimes gets non-sensical errors, but this is a commonly accepted trade-off.

There is however one drawback to all this, the compiler AST is not stable and code using it is at the mercy of its evolution. We got lucky with the 4.04 release of OCaml but the 4.03 one was quite disruptive. Even before releases, whenever the AST definition changes during a development cycle, many ppx might not be usable for a while, which make testing a lot harder.

Several ideas have been flying around, such as adding a layer to convert between different versions of the AST. While this would work, it has the drawback that you need this layer for every variant of the AST. And when you want to make a patch modifying the AST, you'll need to do the extra work of updating this layer first.

In this blog post we show how we managed to solve the ppx compatiblity problem in a way that improves the user experience and lets us produce releases that don't depend on ppx rewriters at all.

We did this work while working on Base, our upcoming standard library. In the end, it's likely we'll use only the third of the methods described below for Base, while the others will be used to improve the user experience with the rest of our packages.

What do other code generators do?

Ppx is not the first system for generating code that is a mix of user-written code and machine generated code. A typical class of generators that get it right, i.e. that preserve locations and are independant from the AST definition, are lexer/parser generators, and not only the ones distributed with the compiler.

Let's take the example of lexer generators (parser generators work basically the same way). The users write a series of rules, consisting of a rular expression and an action to take if the input matches it:

rule token = parse
| "+"  { PLUS  }
| "-"  { MINUS }
| '0'..'9'+ as s { INT (int_of_string s) }
| " "* { token lexbuf (* skip blanks *) }

This code is written in a .mll file and the generator then produces a .ml file with code for the lexing engine interleaved with the user written actions.

In order to keep the locations of the user-written code pointing to the right place in the .mll file, the generator produces:

# 42 "lexer.mll"
         token lexbuf (* skip blanks *)

The OCaml compiler interprets the line starting with a # and updates its current location to point to the begining of line 42 in file lexer.mll. This is called a line directive.

To go back into the generated code, the lexer generator produces:

# 100 "lexer.ml"

Where 100 correspond to the real line number in lexer.ml.

With this method, when there is an error in the user-written code, it points to the lexer.mll file, while when there is an error in the generated code it points to the lexer.ml file. Even if the generated code might not be particularly easy to understand, at least you get to see the real code the compiler chokes on.

Another big advantage is that when using a debugger, you can follow the execution through the generated code.

Can we do the same for ppx?

At first glance, it seems that ppx rewriters work in a very different way, but the result is the same: only parts of the file is generated and the rest is taken as if from what the user wrote. In fact, compared to the lexer case, most of the resulting code is user written.

There is however some work to do to get the same result as with lexer generators. First you have to distinguish the generated code from the user code.

If you take a ppx rewriter as a black box, then the only way is to apply some kind of tree diff between the input and the output. In our ppx framework however, we know exactly what fragments of the AST are rewritten by plugins and we know the rewriting is always local. This makes the job a lot simpler and probably faster as well, so we chose to take advantage of this information.

The method

It works this way: while mapping the AST, we collect all the fragments of generated code with the location of the code they replace in the original file. At the end we sort them in the order of the file and make sure there is no overlap. Every fragment is pretty printer to a string.

What we end up with is a list of text substitutions: beginning position, end position, replacemen text. The next step is to simply apply these substitutions to the original file. If you read the bog post about how we switched from camlp4 to ppx, you'll notice the resemblance here.

This is what the transformation looks like:

(* ----- input ----- *)
type t = int [@@deriving sexp_of]

let f x = x + 1

let g x = [%sexp_of: t] x

(* ----- output ----- *)
# 1 "foo.ml"
type t = int [@@deriving sexp_of]
# 4 "foo.ml.pp"
let sexp_of_t = sexp_of_int
#2 "foo.ml"

let f x = x + 1

let g x =
# 11 "foo.ml.pp"
sexp_of_t
# 5 "foo.ml"
                        x

The result for [@@deriving sexp_of] is not bad at all. For code rewritten inside expressions, the result is not as good given that it breaks those expressions up. But given than extensions are often sparse in our source files, this is still acceptable.

This mode can be selected with ppx_driver based rewriter by passing the flag -reconcile.

Solving the compatiblity problem

With this mode, one can first generate a .ml.pp file and feed that to the compiler. Given that the concrete syntax of the language breaks much less often than the internal AST definition, a working ppx is likely to work for a very long time.

We'll soon start releasing a separate package that snapshots one version of the lexer/parser/AST/printer of the OCaml compiler. This package will have its own release schedule and will typically be updated soon after each relase of OCaml. This will give time for ppx authors to upgrade their code when it breaks while still allowing people to try out the new compiler with their favorite packages.

Mode for release tarballs

In addition to the mode described above, ppx_driver has a second mode -reconcile-with-comments where the result is similar to the one with line directives expect than the generated code is enclosed with comment markers:

type t = int [@@deriving sexp_of]
(* GENERATED CODE BEGIN *)
let sexp_of_t = sexp_of_int
(* GENERATED CODE END *)

let f x = x + 1

let g x =
(* GENERATED CODE BEGIN *)
sexp_of_t
(* GENERATED CODE END *)
                        x

This mode is intended for release tarballs. One can replace all the files in-place by the pre-processed version using =-reconcile-with-comments=. The result is readable and has the big advantage that you you don't need to depend on the ppx rewriter, which means the package is faster to install for users.

Jane Street packages will eventually move to this scheme, either for the next stable release or the one after that. One technical issue with this method is that to take full advantage of it, the runtime support libraries of the various ppx rewriters must be installable without the rewriter itself. Splitting the packages in opam is fine but splitting the repository is not desirable as often both components make strong assumption about the other.

For Jane Street packages, we'll need to update our release system so that it supports generating two opam packages from one repository.

ppx as a verfication tool only

While these new methods improve the ppx story in general, for Base we wanted to go even further and allow users to build Base without the need for ppx at all, both for the release and for the development versions. Not only to cut down the dependencies, but also to provide a better experience in general. For instance if you are working on a patched compiler and need the development version of Base, you shouldn't need all the ppx rewriters that might not work for some reason.

We explored various bootstrap story, and while they worked they were not very nice, especially for such an important library. Its development and build processes should be straightforward.

We even looked into not using ppx at all. While this is OK for many ppx rewriters that are mostly syntactic sugars, it is more problematic for [@@deriving ...]. It's not so much that the code is hard to write by hand, most data-structures in Base are either simple datatypes or require and written combinators anyway, but it is a pain to review. This code is very mechanical and you have to make sure that the constant strings correspond to the constructor/field names and other things where the machine can do much better than a human.

In the end we found a solution to keep the best of both worlds, /i.e./ being able to build the original source code without pre-processing and avoid having to write and review this boilerplate code.

The idea is to use ppx in the same way that we write expect tests; the tool only checks that what's comes after the type-definition correspond to what the rewriters derive from it. In case of mismatch it produces a .corrected file just like expect tests.

We are currently experimenting with this method for Base. It's possible that we'll have some marker to delimit the end of the generated code. In the end the code could look like this:

type t = A | B [@@deriving sexp_of]

let sexp_of_t = function
  | A -> Sexp.Atom "A"
  | B -> Sexp.Atom "B"

[@@@end_of_derived_code]

Given that the compiler ignores attributes it doesn't understand, this code compiles just fine without any pre-processing.

When running the ppx rewriter in this expect mode, the generated AST is matched against the source AST without taking locations into account, so that mean that you can reformat the code as you wish and even add comments.

The challenge now is to update our ppx rewriters so that they produce code that we are happy to show. Until now we didn't focus too much on that, but we have a good idea about how to do it. The plan is to move more of the logic of the various deriving system into proper functions instead of generating more code. Note that this is an improvement in general as proper functions are a lot easier to understand and maintain than code generators.

Conclusion

In this blog post we described a simple and clean method to decouple ppx rewriters from the release schedule of the compiler. This method has the advantage that once the ppx is written is likely to work for a long time and especially to work out of the box with development compilers.

Moreover, this method has is better for users as errors point to the real code the compiler sees and when debugging they can follow the execution through generated code without trouble.

All this is currently implemented in ppx_core/ppx_driver. Our github repositories haven't been updated in a while has the Base refactoring disrupted our public release process quite a bit. These new features should be published in the coming weeks and we'll be part of the next stable release of our packages, planned for the beginning of December.

Converting a code base from camlp4 to ppx

As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the code source of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.

For instance to translate code using pa_ounit and pa_test, we have to rewrite:

TEST = <:test_result< int >> ~expect:42 (f x)

to:

let%test _ = [%test_result: int] ~expect:42 (f x)

For small to medium projects it is enough to just take a couple hours to translate the source code by hand. But at Jane Street where we have a huge OCaml code base making extensive use of camlp4, it is simply not realistic. So we needed a tool to do that for us.

Writing a tool to automatically convert the syntax

Since the output of such as tool has to be accepted as the new code source that is committed in our repository, it must preserve the layout of the original file as much as possible and of course keep the comments. This mean that any approach using an AST pretty-printer would be extremely complex.

The path we choosed is to textually substitute the foreign syntaxes in the original file for the new ones. One could imagine doing that with a tool such as sed, awk, perl, ... however doing it properly would be fastidious and it would be pretty-much impossible to be 100% sure it would never translate things it is not supposed to. Plus writing perl is not as fun as writing OCaml programs.

Instead there is an easy way to find the foreign syntaxes: using camlp4 itself. To subsitute the text of foreign syntaxes the only thing we need to know is their location in the original file, and camlp4 can help us with that.

Writing dummy camlp4 syntax extensions

The idea is to write for each camlp4 syntax extension a dummy one that define the same grammar productions as the real one, but instead of generating code it simply record substitutions at certain locations.

Then we do the following:

  • parse a file with camlp4 and our dummy syntax extensions
  • apply all the recorded substitutions to the original file

This approach has the advantage of interpreting the original file in the exact same way as our regular syntax extensions. Giving us good confidence we did not change the syntactic constructions by mistake.

To do so we define this API:

(** [replace loc repl] records a text substitution that replaces the
    portion of text pointed by [loc] by [repl]. *)
val replace : Loc.t -> string -> unit

Then writing a dummy camlp4 syntax extension is pretty easy. For instance for a subset of pa_ounit:

EXTEND Gram
  GLOBAL: str_item;

  test: [[ "TEST" -> replace _loc "let%test" ]];

  name_equal:
    [[ `STRING _; "=" -> ()
     |            "=" -> replace _loc "_ ="
     ]];

  str_item:
    [[ test; name_equal; expr -> <:str_item< >>
    ]];
END

On the fly conversion and diffing the generated code

Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:

  • for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
  • we processed the original file using camlp4 and the translated one using our ppx rewriters
  • in both case we saved the output of -dparsetree (human-readable version of the internal OCaml AST) and -dsource (pretty-printed code)
  • we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource

This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.

We have quite a lot of code in our camlp4 syntax extensions and converting them to ppx was a long mechanical job and so quite error-prone. Given that this diffing turned out to be really helpful to find errors.

The Camlp4 Syntax is not quite the OCaml syntax

While using this we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:

let _ x = x

let f l = List.map l ~f:fun x -> x + 1

These where quite easy to fix automatically as well using camlp4-to-ppx.

Github repo and extension

We published a slightly modified version of this tool on github.

The method we used doesn't work out of the box with all syntax extensions. For instance to convert code using lwt.syntax some more work needs to be done on camlp4-to-ppx. But it is a good starting point.

4