The Parable Of The Perfect Connection

Every programmer in the Intertube-connected era eventually has to write, or at least use, an API for a network service - something like a database, a message queue, or a web service. And, each and every one of them begins with the enthusiasm of the recently inducted as they realize that they can reach out their hand and control something else. And, each and every one of them experiences that moment of frustration and anger when they realize that their buddy out in cyberspace is a bit of a flake.

Now, we aren't talking about a seriously unreliable friend. In fact, your buddy isn't really unreliable at all. He's there 99.9% of the time, and even when he's out for a quick coffee break he tends to come back quickly. Besides, you don't have any real control over him. He's maintained by some other people in a bunker far far away. Those people are confusing, hard to reach, and don't seem to care about your problems. So, you do what countless programmers have done in the past...

You write a loop.

let rec connect_until_success host_and_port =
  connect host_and_port
  >>= function
  | Ok t -> return t
  | Error _ ->
    after (sec 5.)
    >>= fun () ->
    connect_until_success host_and_port

Because you are feeling helpful, you enshrine the loop in an API for other people to use. After all, your buddy is pretty reliable, and it would be a shame if other people had to deal with all the nasty complexity that you've just programmed away.

There are a lot of classic twists and variations on this core storyline:

  • count the number of failures and give up after x tries (x is usually 3 or 1)

  • back off exponentially so you don't "hammer" the service

  • don't wait at all and actually hammer the service in a tight loop because latency is important

  • log the error, because someone will look at the logs carefully. Then retry.

  • keep careful track of the time of the last failure, and always retry, unless the last retry was "recent", because one blip makes sense but not two.

  • return an elaborate type that encompasses all possible failure modes, including the fact that we are retrying. Maybe deliver that information in a side channel stream of updates.

  • forget giving people a connect method at all. Just give them a query method and handle the pesky connection details away from prying eyes. You get bonus points if the API doesn't look like you can ever fail.
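A couple of those variations combined - give up after a fixed number of tries, backing off exponentially between them - might look something like this (a sketch only, reusing the hypothetical connect from above):

let rec connect_with_backoff ?(delay = sec 1.) ?(tries = 3) host_and_port =
  connect host_and_port
  >>= function
  | Ok _ as ok -> return ok
  | Error _ as err when tries <= 1 -> return err
  | Error _ ->
    after delay
    >>= fun () ->
    connect_with_backoff
      ~delay:(Time.Span.scale delay 2.)
      ~tries:(tries - 1)
      host_and_port

Note that this at least reports the final error instead of looping forever, though it still hides the first two failures from the caller.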

Hidden Failure Is Still Just Failure

Sadly, the problem isn't in the cleverness of the technical wizardry you use to cover up for your buddy, it's the fact that covering up failure is just another form of failing.

The connection failed. Not telling the world outside of your API is like hiding a bad grade from your parents. They might not catch you once or twice, but you still got the bad grade, and eventually they are going to notice that something is very very wrong - likely after things have really gone off the rails.

Which leads us to three useful principles of failure that apply to self-healing network connections, and to most other kinds of failure besides.

Fail Quickly, Clearly, and Cleanly

When you design an API, or a system, or even a big complex collection of systems, and you think about how it should fail, make sure that the failure is:

  • Quick: Taking too long to fail is a cardinal sin. Don't retry a thousand times, don't get an hour deep into a computation only to realize that one of the config parameters is bad, and don't forget to add a timeout when the other side might never respond. The sooner you can tell the outside world that you have failed the sooner it can react.

  • Clear: Make sure that your failure behavior is clear, well documented, and can't be missed in a decently written program. It should be obvious from a read of the API and shouldn't require a dip into the underlying code to understand. Beyond that, don't mumble when you fail (I'm looking at you, errno in C). Similarly, don't go on about all the little nuances surrounding your failure with a 20-case variant response. Most API consumers only care about the binary state of failure in the code. The details are generally uninteresting outside of debug logs and human readable messages.

  • Clean: Clean up anything and everything you can after you fail, as aggressively as you can. That means close your file descriptors, free your memory, kill your child processes. Work harder than normal to make the cleanup portion of your code simple and obviously correct. But still remember to be quick: if there is any chance that your cleanup won't succeed, do it after you tell everyone that you have failed. Don't be that function/program/system that never responds again because it hung trying to clean up before it reported the error.
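As a concrete example of "quick", failing fast when the other side might never respond can be as simple as wrapping the call in a timeout. This sketch uses Async's Clock.with_timeout; query and request here are hypothetical stand-ins:

Clock.with_timeout (sec 5.) (query t request)
>>| function
| `Result response -> response
| `Timeout -> Or_error.error_string "query timed out"

Both branches produce an Or_error, so the timeout shows up to the caller the same way any other failure does.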

How Should It Look?

Something like the following API, comments and all.

This makes heavy use of some nice things from our publicly released libraries. If you aren't already familiar with them you can take a deeper look here.

If you want the TLDR version, you really only need to understand Deferred and Or_error to get the gist.

A Deferred is a value that will get filled in at some point in the future (these are sometimes called promises), and when you read it here it just means that the function doesn't return immediately - usually because some network communication needs to happen to get the result.

Or_error is a fancy way of saying, "this might work, or it might give you an error". Returning an Or_error forces the caller to check for an error case in a very clear and explicit way. It's our standard way in an API to indicate that a function might not succeed because, unlike a comment about an exception that might be thrown, or a special return value (like NULL), Or_error can't be missed.

So, if you see something like:

response Or_error.t Deferred.t

You can read it as, "this won't return immediately, and when it does it will either be an error, or a response".
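In practice, consuming such a value looks something like the following (handle_response is hypothetical; the point is that the compiler won't let you forget the Error branch):

let run_query t request =
  query t request
  >>= function
  | Ok response -> handle_response response
  | Error err ->
    Log.Global.error "query failed: %s" (Error.to_string_hum err);
    return ()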

type t

(** connect to the service, returning t or an error if the connection could not
    be established. *)
val connect : ?timeout:Time.Span.t -> ... -> t Or_error.t Deferred.t

(** a simple helper function that calls connect with the original parameters.
    The passed-in [t] is always closed when reconnect is called.  Multiple calls
    to reconnect on the same [t] will result in multiple connections. *)
val reconnect : t -> t Or_error.t Deferred.t

(** connects to the service and runs the provided function if successful.
    If the connection fails or [f] raises, an Error is returned.  [close] is
    automatically called on [t] when [f] completes or raises. *)
val with_t
  :  ?timeout:Time.Span.t
  -> ...
  -> f:(t -> 'a Deferred.t)
  -> 'a Or_error.t Deferred.t

(** If timeout is not given it defaults to a sensible value. *)
val query : t -> ?timeout:Time.Span.t -> ... -> response Or_error.t Deferred.t

val query_exn : t -> ?timeout:Time.Span.t -> ... -> response Deferred.t

(** If timeout is not given it defaults to a sensible value.  The returned
    reader will be closed when the underlying connection is closed, either by
    choice or error.  It is a good idea for the update type to express the closed
    error to differentiate a normal close from an error close.  *)
val pipe_query
  :  t
  -> ?timeout:Time.Span.t
  -> ...
  -> update Pipe.Reader.t Or_error.t Deferred.t

val pipe_query_exn : t -> ?timeout:Time.Span.t -> ... -> update Pipe.Reader.t Deferred.t

(** close is idempotent and may be called many times.  It will never raise or
    block.  Once close has been called all future queries will return Error
    immediately.  A query in flight will return error as soon as possible. *)
val close : t -> unit

(** fulfilled when t is closed for any reason *)
val closed : t -> unit Deferred.t

(** A closed [t] is in an error state.  Once a connection is in an error state
    it will never recover. *)
val state : t -> unit Or_error.t

Seriously, Never?

Up until now I've been making the case for trying once and failing quickly and clearly, and I think that much, if not most, of the time that's the argument that should hold. But the world is a complex place. Sometimes things fail, and somebody somewhere has to try again. So where should that happen, and what should we consider when we start talking about retry logic?

How will this stack?

Loops stack poorly and lead to confusing non-linear behavior. This means that you should usually confine retry logic to a component near the bottom or the top of your stack of abstractions. Near the bottom is nice, because, like TCP, everyone can rely on the behavior. Near the top is nice because you have the most knowledge of the whole system there and can tune the behavior appropriately. Most network service APIs are in the middle somewhere.

Can I opt out?

TCP sits on top of IP and provides a solid retry mechanism that works really well for most of the world, but it would be a mistake in design to only expose the TCP stack. If you are going to provide a self-healing connection/query system as part of your API, make sure to build and expose the low level simple API too. This lets clients with needs you didn't anticipate interact in the way that they want.

Love shouldn't be forever

It's more likely to be a mistake to try forever than to retry once, or for a set period of time. It's one thing to protect a client against a transient failure, but when the transient error lasts for minutes or hours, it's probably time to give up.
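Put differently, prefer a deadline to an infinite loop. A sketch of what that might look like, again assuming the hypothetical connect from earlier:

let connect_until_deadline host_and_port ~deadline =
  let rec loop () =
    connect host_and_port
    >>= function
    | Ok _ as ok -> return ok
    | Error _ as err ->
      if Time.( >= ) (Time.now ()) deadline
      then return err
      else after (sec 5.) >>= loop
  in
  loop ()

Unlike the loop at the top of this article, this one eventually surrenders and hands the error to someone who can do something about it.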

Your resource usage should be bounded

Loops, especially loops that create and clean up resources, have a tendency to consume more than their fair share. This is especially true when the loop is trying to cover for an error case, where things like resource cleanup might not work entirely as advertised. So, it's on the writer of a loop to test it heavily and to have strong bounds on how much CPU, memory, file handles, bound ports, etc. a single self-healing connection can take. Getting this right is hard, and you should be nervous about doing it quickly.

How bad is failure?

It's much easier to justify a looping retry if it's the only thing keeping a large complex system from crashing completely, and it's correspondingly harder to justify when it covers just one more case that any client needs to deal with anyway. For instance, a retry loop on my database connection might cleanly cover the occasional intermittent outage, but there are probably real reasons that the database might be out (network failure, bad credentials, maintenance window), and my program likely has to handle this case well anyway.

Not all failure is created equal

Some failures justify a retry. Some failures don't. It's important in retry logic to avoid big try/with blocks that catch any and every error on the assumption that any query or connection will eventually succeed. Retrying because my connection closed is different from retrying my malformed query. Sadly you can't always tell the difference between the two cases, but that doesn't mean you shouldn't make an effort.
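A sketch of what that effort might look like, assuming the error type distinguishes the cases (the variant names here are invented for illustration):

type query_error =
  | Connection_closed
  | Timed_out
  | Malformed_query
  | Bad_credentials

let should_retry = function
  | Connection_closed | Timed_out -> true      (* plausibly transient *)
  | Malformed_query | Bad_credentials -> false (* retrying won't help *)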

You still have to consider failure

You can use a retry loop to limit errors above a certain abstraction boundary, or to limit the impact of small glitches, but you can't recover gracefully from all of the errors all of the time. When you add a retry loop to your system at any level, stop to consider what should happen when the error is a real error and isn't transient. Who is going to see it? What should they do about it? What state will clients be in?

It's easier to solve a specific problem than a general one

It's much easier to come up with retry logic that makes sense for a particular application in a particular environment than it is to come up with retry logic that is generically good for all clients. This should push you to confine retry logic to clients/APIs that have a single well-considered role and to keep it out of APIs that may be used in many different contexts.

Quick, Clear, and Clean still (mostly) apply

Even when you are considering retry logic, make sure you think about getting stuck (quick), getting debug information about your state to the outside world (clear), and keeping resource usage bounded (clean).