Limit Work for Reliable Servers (with examples in Go/gRPC)



[ 2020-August-10 08:56 ]

One of the leading causes of cascading failures in software systems is trying to do too much work at the same time. Many servers, such as the Go HTTP and gRPC servers, will start working on an unlimited number of requests. Unfortunately, if requests arrive faster than they can be processed, a backlog of requests builds up, and the server is eventually killed because it runs out of memory. To build reliable services, we must limit the amount of work that is in progress. This allows servers to survive overload scenarios by rejecting some of the work, rather than exploding. In this article, I’ll briefly describe why this happens, then discuss how to prevent it, with examples for Go HTTP and gRPC servers. (This was originally written for the Bluecore Engineering blog. It is also similar to a previous article looking at mathematical models for smarter queues.)

Why this happens

One model is to think of the server as a pipe, with a flow rate corresponding to the maximum requests per second it can process. If requests arrive slower than that capacity, everything flows through happily, as shown in the figure below.

However, if requests arrive faster than they can be processed, the excess flow has to go somewhere. By default, most servers try to process all the requests, which means they queue requests in memory, as shown below. If the overload persists long enough, the server will run out of memory and will be terminated.

Things get even worse with a replicated service, where there are multiple copies of the server. In an overload scenario, eventually one of the servers exceeds its memory limit and is killed, as shown in the figure below on the left. This causes the requests flowing to that server to be redirected to the remaining servers. The remaining servers get even more load, so they also exceed their memory limits, as shown on the right. This is a classic cascading failure, and causes all the servers to get killed as soon as one is overloaded, usually within a few seconds.

Solution: Limit work

A server should reject work when it is overloaded. This allows it to serve some requests correctly, which is much better than not processing any requests. In our pipes analogy, this looks like explicitly rejecting some of the flow, as shown below:

To implement this, you need to check the current number of concurrent requests before doing any work, and reject the request if the count is above some threshold. This limit needs to be tuned for each server. If you set it too high, the server will still fail when it is overloaded. If you set it too low, the server will never reach its peak throughput and will waste resources.
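As an illustration (my own sketch, not code from the original article), the check can be a buffered channel used as a counting semaphore: take a slot before doing the work, and reject immediately if no slot is free. The limit of 100 below is a made-up placeholder that you would need to tune.

```go
package overload

import "errors"

// errOverloaded is returned when the concurrency limit is hit.
var errOverloaded = errors.New("server overloaded; try again later")

// inFlight is a counting semaphore: each in-progress request holds one slot.
// The limit of 100 is a placeholder that must be tuned per server.
var inFlight = make(chan struct{}, 100)

// handleWithLimit runs doWork only if a slot is free; otherwise it rejects
// immediately instead of queueing the request in memory.
func handleWithLimit(doWork func() error) error {
	select {
	case inFlight <- struct{}{}:
		defer func() { <-inFlight }() // release the slot when done
		return doWork()
	default:
		return errOverloaded
	}
}
```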

When the programs sending the requests get a rejected response, they need to wait before retrying, to reduce the request rate rather than increase it. In extreme cases they need to stop sending requests entirely (see: exponential backoff, back pressure, circuit breakers, addressing cascading failures).
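On the client side, a rough sketch of retrying with exponential backoff and jitter might look like the following (the retry count and base delay are arbitrary placeholders, not recommendations):

```go
package overload

import (
	"fmt"
	"math/rand"
	"time"
)

// callWithBackoff retries a rejected call with exponential backoff and jitter,
// so repeated rejections slow the client down instead of speeding it up.
func callWithBackoff(call func() error) error {
	const maxAttempts = 5             // arbitrary placeholder
	backoff := 100 * time.Millisecond // arbitrary placeholder
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = call(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		// Sleep for the backoff plus random jitter, then double the backoff.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}
```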

Go HTTP and gRPC servers: Unlimited by default

Unfortunately, the default Go HTTP and gRPC servers accept an unlimited amount of work, so they are susceptible to this overload problem. Thankfully, it is easy to implement a limit yourself, using HTTP middleware or a gRPC interceptor (see an example).
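The linked example is not reproduced here, but a sketch of both approaches might look like the following, using the same buffered-channel semaphore idea: HTTP requests over the limit get a 503, and gRPC calls get RESOURCE_EXHAUSTED. The limit values you pass in are yours to tune.

```go
package overload

import (
	"context"
	"net/http"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// limitMiddleware wraps an http.Handler and rejects requests with 503
// once maxInFlight requests are already being processed.
func limitMiddleware(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
}

// limitUnaryInterceptor does the same for unary gRPC calls, returning
// RESOURCE_EXHAUSTED when the limit is hit.
func limitUnaryInterceptor(maxInFlight int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, maxInFlight)
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return handler(ctx, req)
		default:
			return nil, status.Error(codes.ResourceExhausted, "overloaded")
		}
	}
}
```

You would then wire these in with something like http.ListenAndServe(addr, limitMiddleware(100, mux)) for HTTP, or grpc.NewServer(grpc.UnaryInterceptor(limitUnaryInterceptor(100))) for gRPC.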

For a really robust server, you should also limit the number of connections. Each open connection requires some memory, even if it is doing nothing. In particular, Go gRPC server connections are pretty expensive, using approximately 200 kiB of memory in our tests. This means a large number of idle connections can still cause a server to run out of memory, even with a limit on the number of executing requests. You can use netutil.LimitListener to implement a limit, and consider setting the idle connection timeout on the server to a reasonable value, to ensure that connections don’t stay idle for too long.
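Here is a sketch of a gRPC server that does both. The port, the connection limit of 500, and the 5-minute idle timeout are placeholders I’ve picked for illustration; for a plain HTTP server, the http.Server IdleTimeout field plays the same role as the keepalive setting.

```go
package main

import (
	"log"
	"net"
	"time"

	"golang.org/x/net/netutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the number of simultaneously open connections. 500 is a
	// placeholder: tune it for your server's memory budget.
	lis = netutil.LimitListener(lis, 500)

	// Close connections that have been idle too long so they don't hold
	// memory forever. 5 minutes is an arbitrary choice.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle: 5 * time.Minute,
	}))
	// Register services here, then serve on the limited listener.
	log.Fatal(srv.Serve(lis))
}
```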