gRPC is easy to misconfigure (evanjones.ca)

[ 2021-March-14 11:35 ]

Google's gRPC is an RPC system that supports many languages, and is relatively widely used. I think its popularity is due to being used for parts of Docker and Kubernetes. I think gRPC is mostly fine, but it is surprisingly easy to screw up by misconfiguring it. Part of that is because RPC systems are challenging to get right. They need to be useful for a wide variety of services, ranging from high request rate services that handle thousands of tiny requests each second, to services that need to transfer a lot of data, or services with thousands of concurrent, slow requests that take minutes to complete. As a result, an RPC system like gRPC needs to be very configurable. Unfortunately, it is also pretty easy to configure it in a way that causes hard-to-understand errors.

The rest of this blog post describes two examples of annoying edge cases I ran into recently. I wasted about a day debugging and understanding each of these. Mostly I'm hoping that if someone else runs into these errors, they will find this article and I can save them time. I'm also hopeful that the gRPC team will eventually make this library easier to use, by improving the error messages and documenting best practices.

Client keepalive is dangerous: do not use it

gRPC is designed to reuse TCP connections for multiple requests. However, many networks terminate connections that are idle for too long. For example, the AWS NLB TCP load balancer has a 350 second timeout. TCP has an optional keepalive mechanism. It is enabled by default on Linux, but with a 2 hour timeout before sending the first keepalive. This is useful for cleaning up long dead connections, but not to keep these connections alive through NATs or load balancers. Go configures TCP keepalives to 15 seconds by default, which should be often enough to keep the network connection alive.
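To make the TCP keepalive setting concrete, here is a minimal sketch of setting the interval explicitly on a plain TCP connection in Go, using only the standard library's net.Dialer. The helper names (dialWithKeepAlive, runKeepAliveDemo) and the loopback listener are my own illustration, not part of gRPC, and the 15 second value just mirrors the Go default mentioned above.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithKeepAlive opens a TCP connection with an explicit keepalive
// interval. Hypothetical helper for illustration only.
func dialWithKeepAlive(addr string, interval time.Duration) (net.Conn, error) {
	d := net.Dialer{
		Timeout:   5 * time.Second, // give up if the connection cannot be established
		KeepAlive: interval,        // interval between TCP keepalive probes
	}
	return d.Dial("tcp", addr)
}

// runKeepAliveDemo dials a throwaway listener on the loopback interface so the
// example does not depend on any external server.
func runKeepAliveDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	// Accept the single test connection and close it immediately.
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Close()
		}
	}()

	conn, err := dialWithKeepAlive(ln.Addr().String(), 15*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	fmt.Println("dial error:", runKeepAliveDemo())
}
```

These probes are handled by the operating system, so the application never sees them, which is the limitation the next section is about.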

Unfortunately, TCP keepalives are invisible to applications, and may not be supported by some operating systems. As a result, gRPC has its own keepalives. However, the gRPC client-side keepalive specification itself says: "client authors must coordinate with service owners for whether a particular client-side setting is acceptable". If a client sends keepalive pings too often, servers close the connection. This is to prevent large numbers of idle clients from consuming too many resources. The server default is to allow one ping every 5 minutes, and only while an RPC call is active.

In my opinion, this makes client keepalive virtually unusable, and it should be avoided. The server defaults are insufficient for some load balancers (e.g. Azure's TCP load balancer drops idle connections after 4 minutes by default). It is hard and error-prone to deploy a configuration that is more aggressive. You will have to first deploy all servers to permit the more aggressive pinging, then you will need to deploy the clients. If you want to undo it, you will need to do the opposite: first deploy the clients to ping less, then deploy the servers. If you screw this order up, or make a configuration error, you get intermittent "transport is closing" errors. As an alternative, always set a deadline on your gRPC requests, and use a reasonable retry policy. If that is insufficient, then you can set keepalives on the server, which avoids most of the problems. I wrote a long bug report asking for the gRPC documentation to be improved so hopefully others can avoid my mistake.
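Here is a rough sketch of what "a deadline plus a reasonable retry policy" could look like in Go. It uses only the standard library so it runs on its own: callWithRetry, errUnavailable, and runRetryDemo are names I made up, and errUnavailable stands in for a gRPC UNAVAILABLE status. A real client would check the status code of the returned error instead, and should only retry requests that are safe to repeat.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnavailable is a stand-in for a gRPC UNAVAILABLE status (hypothetical).
var errUnavailable = errors.New("unavailable: transport is closing")

// callWithRetry runs call up to maxAttempts times. Every attempt gets its own
// deadline of perAttempt, and only errUnavailable is treated as retryable.
func callWithRetry(ctx context.Context, maxAttempts int, perAttempt, backoff time.Duration,
	call func(ctx context.Context) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err = call(attemptCtx)
		cancel()
		if err == nil || !errors.Is(err, errUnavailable) {
			return err // success, or an error that retrying will not fix
		}
		if attempt == maxAttempts {
			break
		}
		// Linear backoff before the next attempt; stop early if the caller's
		// context is cancelled.
		select {
		case <-time.After(backoff * time.Duration(attempt)):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// runRetryDemo simulates a connection that fails twice and then succeeds.
// It returns how many times the call was made and the final error.
func runRetryDemo() (int, error) {
	calls := 0
	err := callWithRetry(context.Background(), 5, 100*time.Millisecond, time.Millisecond,
		func(ctx context.Context) error {
			calls++
			if calls < 3 {
				return errUnavailable
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := runRetryDemo()
	fmt.Println(calls, err)
}
```

With a policy like this, a connection that was silently dropped by a load balancer costs one failed attempt rather than a hung request, without any client keepalive configuration.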

For others who might encounter this error, when client keepalive is too aggressive, client RPCs will fail with gRPC code UNAVAILABLE (14) and message "transport is closing". The solution is to remove the client keepalive configuration. If you enable verbose gRPC logs, you will see:

INFO: 2021/03/14 11:02:26 [transport] Client received GoAway with http2.ErrCodeEnhanceYourCalm.

The server will log the following error:

ERROR: 2021/03/14 11:02:26 [transport] transport: Got too many pings from the client, closing the connection.

Servers cannot return errors larger than 7 kiB

If you return an error from a gRPC request, it returns a status code, a status message (unicode string), and optional error details (undocumented but supported by the library). So what happens when a server accidentally returns a really large error message? Unfortunately, the connection may get closed. In general, you can only return a maximum of about 7 kiB in your gRPC error message (3 kiB if you use the optional error details). That should be plenty. However, if you have an error message that prints a variable-length data structure, the right request can cause this limit to be exceeded. That is how I ran into this problem. The solution is to return shorter error messages. I also added a gRPC server interceptor to truncate errors, to make sure I don't accidentally do this again.
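The core of such a truncating interceptor is just a string helper. The sketch below is not my actual interceptor: it only shows the truncation step, using the standard library, and leaves out the gRPC wiring. The names truncateMessage, maxErrorBytes, and truncatedSuffix are made up for this example, and the 1024 byte limit is an arbitrary choice that stays well below the roughly 7 kiB limit.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxErrorBytes is an assumed limit, chosen to be far below ~7 kiB.
const maxErrorBytes = 1024

// truncatedSuffix marks messages that were cut short.
const truncatedSuffix = "... [truncated]"

// truncateMessage returns msg unchanged if it fits in maxBytes. Otherwise it
// cuts msg so the result (including the marker) is at most maxBytes long,
// without splitting a multi-byte UTF-8 character.
func truncateMessage(msg string, maxBytes int) string {
	if len(msg) <= maxBytes {
		return msg
	}
	suffix := truncatedSuffix
	if maxBytes < len(suffix) {
		suffix = "" // no room for the marker, so just cut
	}
	keep := maxBytes - len(suffix)
	if keep < 0 {
		keep = 0
	}
	// Back up until keep points at the first byte of a character.
	for keep > 0 && !utf8.RuneStart(msg[keep]) {
		keep--
	}
	return msg[:keep] + suffix
}

func main() {
	// Simulate an error message that prints a large data structure.
	huge := "invalid request: " + strings.Repeat("x", 100000)
	short := truncateMessage(huge, maxErrorBytes)
	fmt.Println(len(huge), len(short))
}
```

A server interceptor would apply this to the status message of any error returned by a handler before it is sent to the client. Cutting on a character boundary matters because the status message is supposed to be a valid unicode string.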

By default, the Go gRPC implementation limits error messages to 16 MiB. If you exceed this limit, you will see one of two errors on the client. On the server you will see nothing, because as far as it is aware, it returned the error correctly.

gRPC code=13 (Internal): peer header list size exceeded limit
gRPC code=14 (Unavailable): transport is closing

The C client limits error messages to 8 kiB. If you exceed this limit, on the client you will see one of two errors, depending on whether the server is a Go or C server.

code=StatusCode.RESOURCE_EXHAUSTED: received trailing metadata size exceeds limit
code=StatusCode.INTERNAL: Received RST_STREAM with error code 2

On the server, you will see an error like:

ERROR: 2021/03/05 10:13:18 [transport] header list size to send violates the maximum size (8192 bytes) set by client