Load Balancing gRPC services


[ 2020-March-22 12:02 ]

Well-designed distributed systems can be easily scaled by adding or removing servers, which requires a load balancing mechanism to distribute work. Unfortunately, doing this with gRPC requires some configuration tuning. In this article, I'll discuss a few possible approaches. I don't know which are better or more widely used, although based on the poor quality of the documentation, I suspect most people either do nothing or rely on service proxies like Envoy or Linkerd. I wrote this article because I was investigating the alternatives and wanted to write down my notes.

I will discuss two simple scenarios that illustrate some of the challenges. The first is a single client making a large number of requests to a pool of stateless servers. Any server can handle any request, so ideally these requests should be distributed across all servers. The second is multiple clients sending a large number of requests to a pool of servers, and a new server is added (scaling the service up). Ideally, the clients should immediately start sending requests to the new server.

gRPC's default configuration

By default, a gRPC client does a DNS lookup and establishes a single TCP connection to a server. It then sends all requests on this single connection. This is efficient, since connection setup costs are paid once at startup rather than per request, and there is exactly one connection per client, reducing the amount of state that needs to be tracked. There are two easy ways to set this up. The first is using round-robin DNS, where the DNS name has multiple IP addresses. In this case, the DNS client will pick one answer, so gRPC will connect to a random server. The second is using a single IP that distributes TCP connections, such as a Kubernetes "ClusterIP" service, or a TCP load balancer.
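As a concrete sketch of this default behaviour in Go (the target name `greeter.example.com` is a placeholder; this snippet assumes the `google.golang.org/grpc` module):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The default load balancing policy is "pick_first": resolve the DNS
	// name, open one TCP connection to the first usable address, and send
	// every request on that single connection.
	conn, err := grpc.Dial(
		"dns:///greeter.example.com:50051", // hypothetical service name
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// ... create a stub from conn and issue RPCs ...
}
```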

If there are many clients, connections will be distributed uniformly at random. This is not bad, although there will still be variance between servers. However, in the scenario where one client is sending all the requests, only a single server will be used. The default configuration also fails in the scale-up scenario, since the existing clients will never reconnect. You can configure gRPC's maximum connection age on the server to force clients to occasionally reconnect. This will cause clients to eventually use the new server, at the cost of occasional unnecessary reconnects.
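Setting the maximum connection age is a server-side option. A minimal sketch in Go, using grpc-go's `keepalive.ServerParameters` (the durations are illustrative, not recommendations):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen failed: %v", err)
	}
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// After 5 minutes, the server asks the client to reconnect
		// (HTTP/2 GOAWAY), which spreads load to other servers over time.
		MaxConnectionAge: 5 * time.Minute,
		// Give in-flight RPCs 30 seconds to finish before forcing the close.
		MaxConnectionAgeGrace: 30 * time.Second,
	}))
	// ... register services on srv ...
	log.Fatal(srv.Serve(lis))
}
```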

Proxies or service mesh

Another approach is to send all requests to a proxy which can do per-request load balancing. This is the approach taken by Envoy and Linkerd, as well as most HTTP proxies/load balancers. The advantage is there is a single load balancing implementation for all clients and servers, regardless of programming language or other implementation choices. The disadvantage is there is now another process between the clients and servers, so there is some additional overhead. As an example, see Using Linkerd to load balance gRPC on Kubernetes.

In the case of a single client generating load, the proxy will balance the requests across multiple servers. In the case of a new server being added, the proxy can frequently poll or subscribe to updates to the service information (e.g. DNS, Kubernetes, etcd, Zookeeper, Consul, etc). This allows it to quickly detect the new server and start sending requests to it. The approach has the advantage of providing a uniform way to configure services, but the overhead and complexity will need to be monitored carefully. This seems like it can be an excellent choice for "heterogeneous" environments where you want a uniform way to manage different kinds of connections (e.g. HTTP, gRPC, MySQL, ...) and have many different implementations, some of which you might not have control over. It is less obviously a good choice in more homogeneous environments, where you control both the servers and the clients, and it may be simpler and more efficient to configure them directly.
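To make this concrete, here is a minimal sketch of an Envoy cluster that balances gRPC requests across backends. It assumes a DNS name `grpc-backend.internal` that resolves to all server IPs; all names and values are placeholders:

```yaml
clusters:
  - name: grpc_backend
    connect_timeout: 1s
    type: STRICT_DNS        # periodically re-resolve; one endpoint per A record
    lb_policy: ROUND_ROBIN  # balance each request, not each connection
    http2_protocol_options: {}  # gRPC requires HTTP/2 to the backends
    load_assignment:
      cluster_name: grpc_backend
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: grpc-backend.internal  # hypothetical DNS name
                    port_value: 50051
```

Because Envoy speaks HTTP/2 on both sides, it can interleave requests from a single client connection onto many backend connections, which is what makes per-request balancing possible.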

Multiple connections per client / Connect to all servers

It is fairly straightforward to configure gRPC to use multiple TCP connections in a round-robin fashion. Google does this for their Google Cloud APIs, or see the grpc-go-pool project. (Although the Google Cloud APIs seem to do this to "improve throughput," not load balancing. Maybe because these APIs need to work over the Internet?) The advantage is that load will be more evenly balanced, with the disadvantage of having more total connections. This means in the single client scenario, load is distributed over multiple servers, but still not all.

To distribute the load from a single client, you can instruct it to make separate connections to all backends. In Kubernetes, this can be done by using a headless service, which configures Kubernetes DNS to return multiple IP addresses. Then you need to configure your gRPC client to use round-robin load balancing. This will now distribute load across all servers, at the cost of (num clients)×(num servers) total connections.
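In Go, the client side of this is a one-line service config (sketch; the service name is a placeholder, and the headless Service must be created with `clusterIP: None` so DNS returns every pod IP):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial(
		// A Kubernetes headless service returns one A record per pod.
		"dns:///greeter.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Select the round_robin policy: connect to every resolved
		// address and rotate requests across the connections.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```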

Neither of these helps when adding a new server, but setting the maximum connection age does. gRPC will re-resolve the DNS name when any connection closes, so this server-side setting controls how often clients poll for DNS updates. This is a reasonable solution that improves on the default configuration, but polling for DNS updates does mean adding servers will not take effect instantly.

Lookaside load balancing (xDS, gRPCLB)

The gRPC clients support a "lookaside" load balancer. Each client asks the load balancer which servers it should use, then sends requests directly to those servers. This allows the lookaside load balancer to implement any complex policy it wants. The gRPC clients then only need to implement a very simple policy (e.g. round-robin), rather than requiring duplicate implementations in each language. This effectively separates the control plane (which servers to use) from the data plane (sending the requests), and is documented in a gRPC blog post. The initial protocol was called gRPCLB, but it is now deprecated. The gRPC project is now supporting the xDS API from the Envoy project.
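For the xDS flavour, the client side in Go is surprisingly small (a sketch, assuming a control plane is already running: the service name is a placeholder, and the client needs a bootstrap file referenced by the `GRPC_XDS_BOOTSTRAP` environment variable that points at the control plane):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver
)

func main() {
	// The xds resolver asks the control plane for the server list and
	// load balancing policy; the data path still goes directly to servers.
	conn, err := grpc.Dial(
		"xds:///greeter-service", // hypothetical name known to the control plane
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```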

Despite the official blog post, given the extremely poor documentation, I suspect that few people are using this approach in production. Let's hope that the move to piggy-back on Envoy's API means this will have more support in the open source implementation in the future. I did write a tiny example to show how to configure gRPCLB and it does work. However, it did require wading through poorly documented features such as gRPC service config, name resolution and balancers.
