A TCP "stuck" connection mystery

[ 2011-September-19 19:17 ]

There is almost nothing as satisfying as fixing a difficult bug. I just figured out a rare hang that has been plaguing a performance test for my PhD research for about the last year. The problem was that occasionally some simulated MySQL clients would get stuck waiting to connect. When I examined the server, there was no trace of the connection. The short version of the story is that TCP connections, under rare circumstances, can get stuck in a state where the client thinks the connection is established, but the server doesn't know about it. I was unlucky enough that my test triggered just the right combination of circumstances to make this possible. The long story is a lesson about how difficult bugs can be caused by the interaction of multiple features, and about why, to debug them, you must always question your assumptions.

To begin, let's briefly recap how TCP connections are established. The client begins by sending a SYN packet to the server. The server replies with a SYN-ACK packet and must remember some state about the connection. When the client receives the SYN-ACK, it considers its connection established, and replies with an ACK (and potentially some initial data). When the server gets the ACK, it considers its side of the connection to be established. However, what happens if a packet goes missing in this process? Well, if the SYN or SYN-ACK packets are lost, the client will retransmit the SYN a few times before giving up, returning an error to the application. If the ACK packet goes missing, then the server will resend the SYN-ACK, causing the client to retransmit the ACK. In other words: in nearly all cases, the client will either eventually have a connection, or will terminate with an error.

However, there is a rare situation where this does not occur. If the client gets the SYN-ACK, it thinks the connection is established. However, if the server never gets the ACK, or forgets about the connection (e.g. it crashes and restarts), then the client will have an established connection, and the server will not know about it. In most protocols, this won't be a problem: the client will send a request to the server as soon as the connection is established. However, some protocols have the server begin the communication by sending a greeting. MySQL's protocol works this way, as do many of the mail protocols (SMTP, POP, IMAP). Thus, the client can be left waiting forever. You can trigger this behaviour by adding a firewall rule on the server to block the client's ACK packets (on Linux: iptables -A INPUT -p tcp --dport SERVER_PORT --tcp-flags ALL ACK -j DROP).
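To make the failure mode concrete, here is a minimal sketch in Python of a client for a server-speaks-first protocol (the host and port are hypothetical). If the server has forgotten the connection, the recv() blocks forever:

    import socket

    # Hypothetical server address; a real MySQL server usually listens on port 3306.
    SERVER = ("192.0.2.10", 3306)

    # connect() returns as soon as the SYN-ACK arrives, so from the client's
    # point of view the connection is now established.
    sock = socket.create_connection(SERVER)

    # In a server-speaks-first protocol the client now waits for the greeting.
    # If the server never learned about this connection (lost ACK plus
    # syncookies, or a reboot), this recv() blocks forever: the client sends
    # nothing, so TCP has nothing to retransmit and never notices a problem.
    greeting = sock.recv(4096)
    print("server greeting:", greeting)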

However, how can this happen without a well-timed reboot? In my test, I start around 200 client connections at approximately the same time. Sometimes, this causes the Linux kernel's queue of pending connections to overflow (a SYN flood). This triggers syncookies, and prints the following kernel log message:

Sep 17 22:08:09 vise4 kernel: [379367.113864] possible SYN flooding on port 3311. Sending cookies.

Syncookies cause the server to send a SYN-ACK, but immediately forget about the connection. Then, on the client side, this huge number of SYN-ACK packets arrives at nearly the same time, causing it to reply with a huge batch of ACK packets. Some of these ACKs get lost because some buffer somewhere overflows. I believe this problem is aggravated by the fact that my client load generator is running in a virtual machine, so there is extra batching and there are extra buffers where packets can get dropped. As a result: on the client, some connections are established, but the server has forgotten about them because of syncookies.
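If you are wondering whether your own server can behave this way, the relevant kernel knobs are visible under /proc. A small Linux-specific sketch that just prints them (the defaults vary between kernel versions):

    # Read the kernel's syncookie setting and SYN backlog limit.
    for name in ("tcp_syncookies", "tcp_max_syn_backlog"):
        with open("/proc/sys/net/ipv4/" + name) as f:
            print(name, "=", f.read().strip())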

Now that I know what is happening, I am amazed that this bug happened as frequently as it did. Here are all the things that are needed for this to occur:

  1. The protocol requires the client to wait for the server to send the first message.
  2. The server gets too many connections at once, so it enables syncookies for some of them.
  3. The client responds with a big batch of ACKs at the same time, so some of them get lost somewhere.

Can this occur in the real world?

Yes, but very rarely. The most likely cause is a well-timed reboot, but some weird packet loss, firewalls, or a well-timed SYN flood could also trigger it. Since the world is a big place, I suspect that this has absolutely happened to others. However, without a test setup like mine, this would only happen once in a blue moon. They would have seen a "stuck" MySQL client connection that they couldn't explain, but they would probably just retry the operation and never see it happen again.

So how do we fix it?

First, for protocols that have the server send the first message, clients should have an application-level connection timeout rather than relying on TCP to detect failed connections. Really, this is good advice for any network application, since even if your code is perfect there could be buggy network devices somewhere between the client and the server. The official MySQL client library does not do this, but it probably should.
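Here is what that looks like for the sketch above: an application-level timeout on the wait for the greeting. The 10-second value is arbitrary; pick whatever makes sense for your application.

    import socket

    SERVER = ("192.0.2.10", 3306)   # hypothetical address

    # The timeout passed to create_connection() bounds the connect, and the
    # settimeout() call bounds every later recv(), so a server that never
    # sends its greeting becomes an error instead of an infinite hang.
    sock = socket.create_connection(SERVER, timeout=10)
    sock.settimeout(10)
    try:
        greeting = sock.recv(4096)
    except socket.timeout:
        # The connection looked established but the server never spoke:
        # close it and let the application retry or report an error.
        sock.close()
        raise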

Second, if you are using TCP, it is better to design your protocol so the client sends the first message. This would have avoided the problem here, because the client would detect a missing ACK and retransmit. In general, if you want to know the state of the other end of a TCP connection, you need to send it a message, and retry if it takes too long to get a response. Requiring the client to send the first message also removes an additional network round-trip, since the client can send the initial request with the first ACK, rather than waiting for the server to return its greeting.
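In the sketch, the difference is just that the client writes its request immediately after connecting, so a lost handshake ACK is either repaired by TCP's retransmissions or surfaces as an error (again a hypothetical protocol and address):

    import socket

    SERVER = ("192.0.2.10", 4000)   # hypothetical client-speaks-first protocol

    sock = socket.create_connection(SERVER, timeout=10)
    sock.settimeout(10)
    # The client speaks first, so this request goes out right away. If the
    # server missed the handshake's final ACK, TCP keeps retransmitting this
    # data until the server responds or the connection fails with an error.
    sock.sendall(b"HELLO v1\n")
    reply = sock.recv(4096)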

How did you ever find this?

The first few times this happened I tried to debug it. However, I quickly realized that it was a "hard" problem, and that it also wasn't very important because restarting my test would make it go away. However, I eventually figured out a way to repeat it more often, so that it was triggered once every 50-100 runs. Each run takes about 1 minute, so this wasn't very fast, but it was way better than before (and also more annoying, encouraging me to fix it). I then started adding some additional logging. However, too much logging slowed the client down and the bug would disappear. I ended up logging events using an in-memory circular buffer, which told me that TCP connections were being established correctly.

It was at this point, about a day later, that I finally looked at both the client and the server using lsof and netstat. This showed that connections were established on the client, but did not exist on the server. I thought this was impossible at first. However, some web searching determined that TCP connections can appear "stuck" in a few rare states, and that this is one of them. I finally figured out that syncookies were involved when I looked at /var/log/syslog just to see if anything weird was going on.

The lesson here is that you should start debugging by using existing tools to collect as much information as possible, rather than "fixing" your code. There are more debugging tools than just gdb and printf. This is really critical for rare bugs, where you may only get the occasional chance to examine the system's state.

At this point, I can now reproduce the bug reliably: start MySQL, then pause it with kill -STOP. Start the clients on the virtual machine host. Resume MySQL with kill -CONT. With about 200 clients, I get a few stuck connections every time I do this, and a message about syncookies in the log file.
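For the curious, the client side of that burst looks roughly like this (a sketch only: the address is hypothetical, and whether the burst actually overflows the SYN queue depends on your kernel settings, buffers, and timing):

    import socket
    import threading

    SERVER = ("192.0.2.10", 3311)    # hypothetical address; port 3311 as in my test
    NUM_CLIENTS = 200
    no_greeting = []

    def client(i):
        try:
            s = socket.create_connection(SERVER, timeout=30)
            s.settimeout(30)
            s.recv(4096)             # wait for the server's greeting
        except (socket.timeout, OSError):
            no_greeting.append(i)    # timed out waiting, or failed to connect

    # Pause the server (kill -STOP), start the burst, then resume it (kill -CONT).
    threads = [threading.Thread(target=client, args=(i,)) for i in range(NUM_CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("clients that never saw a greeting:", no_greeting)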