TCP Performance Gotchas



[ 2008-April-25 17:38 ]

I helped fix a really difficult performance bug recently. A fellow student is working on making databases that work with arbitrary failures. This involves placing another process in front of the databases. In order to measure the performance impact, he created a minimal proxy that just passes commands from clients through to MySQL. He was getting really bad performance with this proxy in some scenarios. The problem turned out to be TCP. TCP is good, but there are some issues that can be important if you are writing high-performance network applications. This problem was caused by the interaction of a few features.

The first well-known performance issue with TCP is the Nagle algorithm. By default, TCP will delay sending a packet that is less than full size while earlier data is still unacknowledged, in an attempt to combine multiple small writes into a single packet. On Linux, the resulting stall is typically around 40 milliseconds, which is how long the peer may hold back its delayed acknowledgment. A carefully written, high-performance application will do its own buffering and only write complete chunks of data to the TCP connection, which then need to be sent immediately. To ensure that this happens, the TCP_NODELAY option needs to be set.
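As a minimal sketch of what setting the option looks like in Java: the class and method names here are made up for illustration, and the socket is left unconnected so the caller can connect it wherever it needs to go.

```java
import java.io.IOException;
import java.net.Socket;

public class NoDelayExample {
    /**
     * Hypothetical helper: creates an unconnected socket with Nagle's
     * algorithm disabled. The caller is expected to call connect() afterwards.
     */
    static Socket newLowLatencySocket() throws IOException {
        Socket socket = new Socket();
        // true means TCP_NODELAY is on, i.e. small writes are sent immediately.
        socket.setTcpNoDelay(true);
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = newLowLatencySocket()) {
            System.out.println("TCP_NODELAY enabled: " + socket.getTcpNoDelay());
        }
    }
}
```

With the option on, every write goes out as soon as possible, so the application becomes responsible for handing TCP sensibly sized chunks.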

This was not our performance problem, as TCP_NODELAY was already enabled. Using tcpdump/Wireshark on the proxy, we saw that five packets were being sent to the client for each result from MySQL. The first four packets were one byte long, while the last packet contained a variable amount of data. It turns out that Java's DataOutputStream is not buffered. Even better, the writeInt method makes four separate calls to write on the underlying OutputStream. Hence, the five packets were coming from one call to writeInt and one call to write. Wrapping the socket's stream in a BufferedOutputStream fixed the problem. However, what was the root cause?
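The effect is easy to reproduce without a network. The sketch below uses a hypothetical CountingStream as a stand-in for the socket's stream and counts how many write calls reach it, with and without a BufferedOutputStream in between. The 100-byte payload is an arbitrary choice. Note that the unbuffered count depends on the JDK: older versions issue one call per byte of writeInt, while newer ones may batch the four bytes into a single call, so no exact number is claimed for that case.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WriteCountDemo {
    /** Stand-in for a socket stream: counts the write calls that reach it. */
    static class CountingStream extends OutputStream {
        int calls = 0;

        @Override
        public void write(int b) {
            calls++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
        }
    }

    /** Sends a length-prefixed message and returns the number of underlying writes. */
    static int countWrites(boolean buffered) throws IOException {
        CountingStream sink = new CountingStream();
        OutputStream target = buffered ? new BufferedOutputStream(sink) : sink;
        DataOutputStream out = new DataOutputStream(target);

        byte[] payload = new byte[100];
        out.writeInt(payload.length); // length prefix
        out.write(payload);           // message body
        out.flush();                  // pushes any buffered bytes to the sink
        return sink.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered writes: " + countWrites(false));
        System.out.println("buffered writes:   " + countWrites(true));
    }
}
```

With TCP_NODELAY set, each call that reaches the socket can become its own packet, which is why the buffered version matters: the whole message reaches the underlying stream in one call at flush time.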

More digging in the packet trace revealed that occasionally, after waiting for a "long" transaction (~500 ms), the proxy would only send three one-byte packets. It would then wait for an ACK, which arrived ~40 milliseconds later. This delay is Linux's delayed acknowledgment timeout. After receiving the ACK, the proxy sent out one packet containing the fourth byte from writeInt plus the remaining data. This implied that Linux's TCP implementation was waiting for the ACK before sending more data. This happens when the congestion window is full. However, why was it considering the congestion window to be full? It had only sent three bytes!

There are two reasons. The first is that Linux tracks the TCP congestion window by counting packets, treating each packet as if it were a maximum-length packet (the maximum segment size or MSS, typically 1460 bytes). Three one-byte packets therefore use up as much of the window as 3 × 1460 = 4380 bytes would. The receiver was not the limit: the packet trace showed that its advertised window was larger than 4380 bytes.

The second reason is that the "standard" TCP congestion control specification, RFC 2581, states that after the connection has been idle for a while (specifically, longer than the retransmit timeout or RTO), the congestion window should be reset to its initial value. On Linux, the initial value is 3 packets. The idea is that if it has been some time since any packets were sent, the network conditions may have changed and the previous congestion window might now be too large. Linux also implements RFC 2861: Congestion Window Validation, which could have exacerbated the issue.
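To make the arithmetic concrete, here is a toy model of a window counted in packets. It is only an illustration of the two reasons above, not how Linux implements congestion control, and the class and method names are invented for this sketch.

```java
public class CwndToyModel {
    static final int MSS_BYTES = 1460;        // typical maximum segment size
    static final int INITIAL_CWND_PACKETS = 3; // initial window described above

    /** Segments that may be sent before an ACK when the window counts packets. */
    static int sentBeforeAck(int cwndPackets, int pendingSegments) {
        return Math.min(cwndPackets, pendingSegments);
    }

    /** Segments that must wait for an ACK to open the window. */
    static int blockedUntilAck(int cwndPackets, int pendingSegments) {
        return pendingSegments - sentBeforeAck(cwndPackets, pendingSegments);
    }

    public static void main(String[] args) {
        // After an idle period the window is back at its initial value.
        int cwnd = INITIAL_CWND_PACKETS;
        // Four one-byte writes from writeInt plus one write with the payload.
        int pending = 5;

        System.out.println("sent immediately: " + sentBeforeAck(cwnd, pending));
        System.out.println("waiting for ACK:  " + blockedUntilAck(cwnd, pending));
        // Three one-byte packets fill a window that could have carried this many bytes:
        System.out.println("window in bytes:  " + cwnd * MSS_BYTES);
    }
}
```

In the real trace the two waiting writes were coalesced into a single packet once the ACK arrived, which the model does not attempt to show.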

A good reference on how Linux actually implements TCP is Congestion Control in Linux TCP. It is from 2002, so it is guaranteed to not be completely correct anymore, but it is still educational.

The conclusion: TCP is really complicated, and having an understanding of all layers of the software stack is critical for tracking down performance issues. This particular issue was caused by a combination of the application not buffering its writes, TCP's delayed acknowledgments, and the congestion window being reset after an idle period. This was also a reminder of how useful packet traces can be for revealing problems in distributed systems. Using strace on the JVM probably would have also revealed the problem, as we would have seen a sequence of one-byte calls to write.