The networks connecting our clients and servers are, on average, more reliable than
consumer-level last miles like cellular or home ISPs, but with enough traffic moving
across the wire, they are still going to fail given enough time. Outages, routing
problems, and other intermittent failures may be statistically unusual on the whole,
but they are still bound to happen all the time at some ambient background rate.
In a microservices architecture, the number of network connections grows rapidly with
the number of services, so the risk of network issues is much higher than in a
monolithic application.
To overcome this sort of inherently unreliable environment, it’s important to design
APIs and clients that will be robust in the event of failure, and will predictably
bring a complex integration to a consistent state despite failures. Remember that a
service might be a client when calling another service.
Planning for failures
Consider a call between any two nodes. There are a variety of failures that can occur:
The initial connection could fail as the client tries to connect to a server.
The call could fail midway while the server is fulfilling the operation, leaving
the work in limbo.
The call could succeed, but the connection breaks before the server can report the
result to its client.
Any one of these leaves the client that made the request in an uncertain situation. In some cases, the failure is definitive enough that the client knows with good certainty that it’s safe to simply retry it, for example, in the case of a total failure to even establish a connection to the server. In many others though, the success of the operation is ambiguous from the perspective of the client, and it doesn’t know whether retrying the operation is safe. A connection terminating midway through message exchange is an example of this case.
This problem is a classic staple of distributed systems, and “distributed system” is
defined broadly in this sense: as few as two computers connected via a network,
passing messages to each other.
As microservices have grown popular, partial failure has become more complicated, because it can happen deep in a chain of services. It is no longer a simple retry from any one client: the retry has to be initiated from the original client, which requires that every service in the chain be idempotent.
Use of idempotency
The easiest way to address inconsistencies in a distributed state caused by failures is to implement server endpoints so that they’re idempotent, which means that they can be called any number of times while guaranteeing that side effects only occur once.
When a client sees any kind of error, it can ensure the convergence of its own state with the server’s by retrying, and can continue to retry until it verifiably succeeds. This fully addresses the problem of an ambiguous failure because the client knows that it can safely handle any failure using one simple technique.
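The retry-until-success pattern described above can be sketched as follows. This is a minimal illustration, not library code; `flaky_create_record` is a hypothetical stand-in for any idempotent server operation:

```python
import time

def retry_until_success(operation, max_attempts=5, wait=0.01):
    """Retry an idempotent operation until it verifiably succeeds.

    This is safe only because the operation is idempotent: even if an
    earlier, ambiguous attempt actually took effect on the server,
    repeating it causes the side effect at most once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except ConnectionError:
            # Ambiguous failure: the call may or may not have reached the
            # server, but idempotency makes a retry safe.
            time.sleep(wait)
    raise RuntimeError(f"gave up after {max_attempts} attempts")

# Simulated endpoint that fails twice with a network error, then succeeds.
calls = {"count": 0}
def flaky_create_record():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection reset by peer")
    return "created"

result, attempts = retry_until_success(flaky_create_record)
print(result, attempts)  # created 3
```

Note that the loop treats every failure the same way: it simply retries, because idempotency guarantees that is always safe.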
As an example, consider the API call for a hypothetical DNS provider that enables us to add
subdomains via an HTTP request:
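The request might look something like this (the host, path, and payload are illustrative, not a real provider's API):

```
PUT /domains/example.com/records/www HTTP/1.1
Host: api.dnsprovider.example
Content-Type: application/json

{"type": "CNAME", "value": "target.example.net", "ttl": 3600}
```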
All the information needed to create a record is included in the call, and it’s perfectly
safe for a client to invoke it any number of times. If the server receives a call that it
realizes is a duplicate because the domain already exists, it simply ignores the request and
responds with a successful status code.
According to HTTP semantics, the PUT and DELETE verbs are idempotent, and the PUT verb in
particular signifies that a target resource should be created or replaced entirely with the
contents of a request’s payload (in a modern RESTful service, a partial modification
would be represented by PATCH).
We have used the request/response communication style as an example; however, messaging-based microservices should follow the same approach in case duplicate messages are delivered in the pipeline. In fact, most message-driven asynchronous microservices are more tolerant of partial failures than synchronous styles of interaction like REST, GraphQL, and RPC. That is why we have provided the light-tram-4j (transactional messaging), light-eventuate-4j (eventual consistency with event sourcing and CQRS), and light-saga-4j (distributed transaction orchestration between services) frameworks, all of which rely on service idempotency.
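For a messaging-based service, the same idea amounts to deduplicating on a message ID before applying any side effect. A minimal sketch, assuming the broker may redeliver messages (the in-memory set stands in for durable storage that a real consumer would update in the same transaction as the side effect):

```python
processed_ids = set()            # in practice, stored durably alongside the side effect
account_balance = {"amount": 0}  # the side effect being protected

def handle_message(message):
    """Apply a message's side effect at most once, even if it is redelivered."""
    if message["id"] in processed_ids:
        # Duplicate delivery: the side effect already happened, so drop it.
        return "duplicate-ignored"
    account_balance["amount"] += message["credit"]
    processed_ids.add(message["id"])
    return "applied"

first = handle_message({"id": "msg-42", "credit": 100})
redelivered = handle_message({"id": "msg-42", "credit": 100})
print(first, redelivered, account_balance["amount"])  # applied duplicate-ignored 100
```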
Guarantee exactly once
While the inherently idempotent HTTP semantics around PUT and DELETE are a good fit for many
API calls, what if we have an operation that needs to be invoked exactly once and no more?
An example might be if we were designing an API endpoint to charge a customer money; accidentally
calling it twice would lead to the customer being double-charged, which is very bad.
This is where idempotency keys come into play. When performing a request, a client generates a
unique ID to identify just that operation and sends it up to the server along with the normal
payload. The server receives the ID and correlates it with the state of the request on its end.
If the client notices a failure, it retries the request with the same ID, and from there it’s
up to the server to figure out what to do with it.
If we consider our sample network failure cases from above:
On retrying a connection failure, on the second request the server will see the ID for the
first time, and process it normally.
On a failure midway through an operation, the server picks up the work and carries it through. The exact behavior is heavily dependent on implementation, but if the previous operation was successfully rolled back by way of an ACID database, it’ll be safe to retry it. Otherwise, the state is recovered and the call is continued.
On a response failure (i.e. the operation executed successfully, but the client couldn’t get
the result), the server simply replies with a cached result of the successful operation.
In light-4j, we have a traceability component, a middleware handler that moves the traceabilityId from the request header to the response header. The client module used to call other services can also propagate the traceabilityId to the next request in the service call stack. This header can be used as an idempotency key on mutating endpoints (i.e., anything under POST in our case).
If the above request fails due to a network connection error, you can safely retry it
with the same idempotency key; the services must be designed to use the idempotency
key/traceabilityId to decide how the business logic handles the retried request.
Being a good distributed citizen
Safely handling failure is hugely important, but beyond that, it’s also recommended that it be
handled in a considerate way. When a client sees that a network operation has failed, there’s
a good chance that it’s due to an intermittent failure that will be gone by the next retry.
However, there’s also a chance that it’s a more serious problem that’s going to be more tenacious;
for example, if the server is in the middle of an incident that’s causing hard downtime. Not only
will retries of the operation not go through, but they may contribute to further degradation.
It’s usually recommended that clients follow something akin to an exponential backoff algorithm
as they see errors. The client blocks for a brief initial wait time on the first failure, but
as the operation continues to fail, it waits for a time proportional to 2^n, where n is the number of
failures that have occurred. By backing off exponentially, we can ensure that clients aren’t
hammering on a downed server and contributing to the problem.
Exponential backoff has a long and interesting history in computer networking.
Furthermore, it’s also a good idea to mix in an element of randomness. If a problem with a
server causes a large number of clients to fail at close to the same time, then even with
back off, their retry schedules could be aligned closely enough that the retries will hammer
the troubled server. This is known as the thundering herd problem.
We can address the thundering herd by adding some amount of random “jitter” to each
client’s wait time. This will space out requests across all clients and give the
server some breathing room.
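One common variant is “full jitter”, where the client waits a random amount between zero and the capped exponential backoff. A sketch, where the base, cap, and jitter strategy are illustrative choices rather than prescribed values:

```python
import random

def backoff_with_jitter(failures, base=0.5, cap=30.0):
    """Wait time in seconds before the next retry.

    The exponential term grows as base * 2^n with the number of failures n,
    capped to avoid absurd waits; drawing uniformly from [0, backoff]
    ("full jitter") spreads retries out across clients so they don't all
    hammer the troubled server at the same moments.
    """
    exponential = min(cap, base * (2 ** failures))
    return random.uniform(0.0, exponential)

# The upper bound doubles with each failure until it hits the cap.
for n in range(8):
    print(f"failure {n}: wait up to {min(30.0, 0.5 * 2 ** n):.1f}s")
```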
The client module in light-4j will be enhanced to retry on failure automatically with an idempotency key, using increasing backoff times and jitter. Given that not all clients need to retry on failure, this must be configurable in client.yml so that a retry can only be initiated from the original client.
Design robust APIs
Considering the possibility of failure in a distributed system and how to handle it is of
paramount importance in building APIs that are both robust and predictable. Retry logic on
clients and idempotency on servers are techniques that are useful in achieving this goal and
relatively simple to implement in any technology stack.
Here are a few core principles to follow while designing your clients and APIs:
Make sure that failures are handled consistently. Have clients retry operations against
remote services. Not doing so could leave data in an inconsistent state that will lead to
problems down the road.
Make sure that failures are handled safely. Use idempotency and idempotency keys to allow
clients to pass a unique value and retry requests as needed.
Make sure that failures are handled responsibly. Use techniques like exponential backoff
and random jitter. Be considerate of servers that may be stuck in a degraded state.