TCP and gRPC Failed Connection Timeouts

[ 2021-August-16 08:52 ]

After my last article about how gRPC is hard to configure, with a client keepalive example, I realized that I don't understand how TCP and gRPC handle failed connections. This article is my attempt to figure it out, and document it so I can find it again in the future. The best reference I've found for how TCP handles failures on recent Linux kernels is When TCP sockets refuse to die. It gets into the details much more than this article does. Instead, I will try to briefly summarize what TCP and gRPC do in some failure scenarios that I have run into before.

My opinion is that most TCP applications probably should turn on TCP keepalives (use setsockopt to set SO_KEEPALIVE, TCP_KEEPIDLE, and TCP_KEEPINTVL), and set the TCP_USER_TIMEOUT option. These settings ensure the kernel will report a failed connection much faster than it would by default (seconds to minutes instead of hours). For example, setting SO_KEEPALIVE=1, TCP_KEEPIDLE=15, TCP_KEEPINTVL=15, TCP_KEEPCNT=5 and TCP_USER_TIMEOUT=90 should probe idle connections every 15 seconds, and close a connection after 90 seconds have passed without an acknowledgement. The disadvantage is this sends additional keepalive TCP packets for idle connections, but 1 packet every 15 seconds sounds reasonable to me (the Go team agrees; 100k connections = 6.7k packets/sec = 417 KiB/sec = 0.3% of a 1 Gbps Ethernet). If your application really cares about the exact timing of when to detect failure, it should send application-level "keep alive" messages. This avoids depending on any operating-system-specific behaviour. This seems like an instance of the end-to-end argument.
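As a concrete sketch, here is how those settings can be applied in Python on Linux (the TCP_KEEP* and TCP_USER_TIMEOUT constants are Linux-specific; note TCP_USER_TIMEOUT is in milliseconds, unlike the keepalive options):

```python
import socket

def configure_tcp_timeouts(sock: socket.socket) -> None:
    """Apply the keepalive and user-timeout settings suggested above.
    Linux-only constants; TCP_USER_TIMEOUT is in milliseconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 15)   # first probe after 15 s idle
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)  # then every 15 s
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # give up after 5 failed probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 90 * 1000)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
configure_tcp_timeouts(sock)
```

The same options can be set from any language that exposes setsockopt; the function name here is just for illustration.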

Applications using gRPC should set timeouts on all requests to some "reasonable" value, instead of using the default of no timeout. Avoid gRPC's client keepalives, because they must be configured the same way on both clients and servers, which in my opinion is error prone as applications evolve (details). However, to enable the TCP_USER_TIMEOUT setting to detect failures faster, it is worth setting the client KEEPALIVE_TIME setting to 5 minutes. This causes requests on dead connections to fail after 20 seconds, rather than 2.5 minutes or 15 minutes, without any chance of a server closing the connection due to too many client keepalive pings. Most applications should probably also use the round_robin load balancing policy instead of the default pick_first, since round_robin will distribute requests across multiple backend instances.
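In gRPC Python, these client settings are passed as channel arguments. A sketch of what that might look like (the option names are standard gRPC channel args; the target address, timeout value, and variable names are placeholders):

```python
# Channel arguments implementing the recommendations above:
# a 5 minute client keepalive (which also enables TCP_USER_TIMEOUT)
# and round_robin load balancing instead of the default pick_first.
CHANNEL_OPTIONS = [
    ("grpc.keepalive_time_ms", 5 * 60 * 1000),
    ("grpc.lb_policy_name", "round_robin"),
]

REQUEST_TIMEOUT_SECONDS = 10.0  # always set a deadline on each call

# With grpcio installed, these would be used roughly like:
#   channel = grpc.insecure_channel("backend.example.com:50051",
#                                   options=CHANNEL_OPTIONS)
#   stub.SomeMethod(request, timeout=REQUEST_TIMEOUT_SECONDS)
```

The 10 second timeout is only an example; "reasonable" depends on what the RPC actually does.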

Connecting to a failed process/machine

TCP: When connecting to a machine that is running but no longer listening on the port, the client quickly gets ECONNREFUSED "Connection refused", because the host replies to the TCP SYN request with a RST packet. However if the IP address is no longer being served, like if the host is down or a Kubernetes pod was deleted, then it takes much longer, since there is no reply to the SYN packet. This is particularly annoying for Kubernetes, since during each deploy, many IP addresses effectively disappear, which causes long connection timeouts rather than a nearly immediate "connection refused" error. This probably means that clients should aggressively try to connect to multiple IPs in parallel after a fairly short timeout (e.g. maybe 100ms?).
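A minimal sketch of the "short timeout, then try the next address" idea, in Python. This version tries addresses sequentially; a production client would race the attempts in parallel, Happy Eyeballs style (RFC 8305). The function name and the default 100 ms value are illustrative:

```python
import socket

def connect_first_available(addrs, per_addr_timeout=0.1):
    """Try each (host, port) with a short timeout, returning the first
    socket that connects. Sequential for simplicity; a real client
    would race attempts in parallel (Happy Eyeballs, RFC 8305)."""
    last_err = None
    for host, port in addrs:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(per_addr_timeout)
        try:
            s.connect((host, port))
            s.settimeout(None)  # restore blocking mode for the caller
            return s
        except OSError as err:  # timeout, refused, unreachable, ...
            last_err = err
            s.close()
    raise last_err
```

With a 100 ms per-address timeout, a silently dropped SYN costs 100 ms instead of the 130-odd seconds described below.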

By default, establishing a TCP connection sends the initial SYN packet followed by 6 retries, with exponentially increasing delays of 1, 2, 4, 8, 16, and 32 seconds (so the SYNs go out at roughly t = 0, 1, 3, 7, 15, 31, and 63 seconds). After the last SYN the client waits for 64 seconds before returning ETIMEDOUT "Connection timed out". I'm either bad at math, or something actually adds a bit more time, because this takes 130 seconds, but by my math (and the Kernel's ip-sysctl.txt), it should take 127 seconds. This can be configured on a per-socket basis using the TCP_SYNCNT socket option (see man 7 tcp). The initial 1 second retransmit timeout comes from RFC6298 Appendix A, which has a nice description of why this value was chosen (long enough for 97% of real-world end-to-end paths, but short enough to recover from lost packets promptly).
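The 127 second figure falls out of the doubling backoff: each SYN (the initial one plus each retry) waits a doubled timeout before the next attempt, or, after the last one, before giving up. A quick check of that arithmetic:

```python
def syn_timeout_seconds(syn_retries: int, initial_rto: int = 1) -> int:
    """Total time before ETIMEDOUT: the initial SYN plus syn_retries
    retransmits, each waiting a doubled timeout (1, 2, 4, ... seconds)
    before the next attempt or before giving up."""
    return sum(initial_rto * 2 ** i for i in range(syn_retries + 1))

print(syn_timeout_seconds(6))  # Linux default net.ipv4.tcp_syn_retries = 6
```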

gRPC Go: Without the WithBlock DialContext option, the connection fails after 20 seconds with code=UNAVAILABLE (14) "transport: Error while dialing dial tcp 192.168.1.123:12345: i/o timeout". Future calls will either immediately return the same error, or will wait for 20 seconds again, if they happen while gRPC is trying to reconnect. If you call DialContext with the WithBlock option, it hangs forever: the gRPC client retries the connection every 20 seconds, with some exponential backoff sleep between connection attempts.

gRPC C/Python: The gRPC connection fails after 20 seconds with code=StatusCode.UNAVAILABLE "failed to connect to all addresses". The next call will similarly wait for 20 seconds. If you pass the wait_for_ready=True option to the RPC call, it will retry the connection forever, reconnecting every 20 seconds.

Sending a message to a failed machine on an established connection

TCP: When sending a message on an established connection, the default is to retry for a long time. Retransmits start after 200 ms, and the timeout then increases exponentially, with up to 15 retransmits, for a total of around 925 seconds: more than 15 minutes! The "official" Kernel ip-sysctl documentation describes how this can be adjusted by changing tcp_retries2. Unfortunately, the minimum retransmit time of 200 ms is a compile-time constant in the kernel, but TCP Tail Loss Probe, which is enabled by default, probably means most retransmits happen much sooner (see the tcp_early_retrans sysctl or detailed slides about this).
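The ~925 second figure can be reproduced from the retransmission timer: the RTO starts at the 200 ms minimum, doubles on each retransmit, is capped at the kernel's 120 second maximum, and the connection is declared dead one final RTO after the last (15th) retransmit. A sketch of that calculation:

```python
def total_retransmit_time(retries: int = 15, rto_min: float = 0.2,
                          rto_max: float = 120.0) -> float:
    """Approximate time before the kernel gives up on unacknowledged data:
    doubling RTO from rto_min, capped at rto_max, over `retries`
    retransmits (tcp_retries2), plus one final wait."""
    total, rto = 0.0, rto_min
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, rto_max)
    total += rto  # final wait after the last retransmit
    return total

print(total_retransmit_time())  # roughly 924.6 seconds
```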

TCP keepalives can be enabled, which will detect when idle connections die. This is useful to promptly free per-connection resources from dead connections, or to keep connections active through NATs. Go turns on TCP keepalives by default, with an initial delay of 15 seconds (TCP_KEEPIDLE), and an interval of 15 seconds (TCP_KEEPINTVL). See the original discussion on a Go issue, and the original change for details. On Linux, the kernel defaults to sending 9 keepalive probes before returning an error to the application. This means with Go, it will take 2.5 minutes to detect a dead connection (15 seconds × 10 intervals). Interestingly, if the application transmits a packet that is not acknowledged, then the connection is no longer considered idle, so keepalives do not apply. This means it will take 15 minutes to time out (see a related Go issue).

To detect a failure faster when a connection is trying to send data, you can set the TCP_USER_TIMEOUT option. This will cause writes to fail if data remains unacknowledged for longer than the specified time. This setting overrides keepalives to determine when the connection should be closed, so set TCP_USER_TIMEOUT to the time that keepalives will close the connection (= TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT). In general, it seems to me that most applications should set this option to something more aggressive (e.g. 1 minute?), because waiting 15 minutes to give up is almost certainly much too long.
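That formula is simple enough to spell out; this tiny helper (the name is mine) computes the keepalive detection time so TCP_USER_TIMEOUT can be set to match:

```python
def keepalive_detection_seconds(keepidle: int, keepintvl: int, keepcnt: int) -> int:
    """Worst-case time for TCP keepalives to detect a dead idle connection:
    the idle delay before the first probe, plus keepcnt failed probes."""
    return keepidle + keepintvl * keepcnt

print(keepalive_detection_seconds(15, 15, 5))  # the settings suggested above: 90
print(keepalive_detection_seconds(15, 15, 9))  # Go defaults + Linux keepcnt=9: 150
```

The second line reproduces the 2.5 minutes (150 seconds) mentioned earlier for Go's defaults with the Linux default of 9 probes.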

gRPC Go: If the connection drops after sending a request, the call will fail after TCP gives up retrying, which is about 15 minutes. It does not matter that Go enables TCP keepalives by default, because the connection is not idle. The gRPC call will return code=Unavailable "read: connection timed out". Enabling gRPC client keepalives (setting gRPC KEEPALIVE_TIME to any value) will cause the client to set TCP_USER_TIMEOUT to the same value as the gRPC KEEPALIVE_TIMEOUT (default: 20 seconds). The connection will then fail after 20 seconds with code=Unavailable "read: connection timed out".

gRPC C/Python: If the connection drops after sending a request, it will fail after TCP gives up retrying the request, which is about 15 minutes. It will fail with StatusCode.UNAVAILABLE "Connection timed out". Enabling gRPC client keepalives (setting gRPC KEEPALIVE_TIME) will cause the client to set TCP_USER_TIMEOUT to the same value as the gRPC KEEPALIVE_TIMEOUT (default: 20 seconds). The connection will then fail after 20 seconds with the same error: StatusCode.UNAVAILABLE "Connection timed out".