Corrupt data over TCP: It was a kernel bug!


[ 2016-February-12 13:45 ]

I wrote previously about how an application can receive corrupt data over a network even though there is both an Ethernet and a TCP checksum. The answer is that a Linux kernel bug meant the checksum wasn't being checked! A large team at Twitter was involved in debugging this, and Vijay Pandurangan and I wrote a kernel patch (removing 2 lines of code). Read Vijay's description for all the details.

One lesson that I will remember from debugging this: Network corruption is on average very rare. On most days, across Twitter's entire machine fleet, the machines never see a packet with a corrupt TCP checksum. However, if a hardware failure occurs, a huge number of packets can be corrupt. The usual error model that assumes corruption is evenly distributed is wrong. In our case, about 10% of packets passing through a bad switch were corrupt, and something like 0.5% had errors in two bits. This means it is very unlikely, but possible, that this corruption could go undetected by the TCP checksum.

If anything, this has made me even more convinced that you probably shouldn't rely on the TCP checksum to protect your data across a network. This is a bit paranoid, but it's easy and cheap to attach a CRC32C to your message and make it nearly impossible for bad things to happen. You should probably be encrypting your data anyway, which ensures it gets an even stronger check. Data corruption causes very expensive and time-consuming failures. This is definitely a case where an ounce of prevention is worth 100 pounds of cure.
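
To make the two-bit case concrete, here is a minimal sketch in Python. It is not the code we ran at Twitter. The function names (internet_checksum, crc32c, frame, unframe) and the message layout (payload followed by a 4-byte big-endian CRC32C trailer) are assumptions made up for illustration. The first function computes a 16-bit ones' complement sum in the style TCP uses. The rest attach and verify a CRC32C using a plain table-driven implementation, which is slow in pure Python but easy to check by reading.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, the style of sum TCP uses."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def _make_crc32c_table():
    poly = 0x82F63B78  # Castagnoli polynomial, bit-reflected form
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C, one byte at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def frame(payload: bytes) -> bytes:
    """Hypothetical framing: payload followed by a 4-byte CRC32C trailer."""
    return payload + struct.pack("!I", crc32c(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC32C does not match."""
    if len(message) < 4:
        raise ValueError("message too short to contain a CRC32C trailer")
    payload, trailer = message[:-4], message[-4:]
    (expected,) = struct.unpack("!I", trailer)
    if crc32c(payload) != expected:
        raise ValueError("CRC32C mismatch: payload is corrupt")
    return payload


if __name__ == "__main__":
    original = b"ABCD"
    # Flip bit 1 in the low byte of each 16-bit word: one flip goes 1 -> 0,
    # the other 0 -> 1, so the two changes cancel in the ones' complement sum.
    corrupted = b"A@CF"
    print(internet_checksum(original) == internet_checksum(corrupted))  # True
    print(crc32c(original) == crc32c(corrupted))  # False
```

The two flipped bits sit in the same bit position of different 16-bit words and go in opposite directions, so the sum is unchanged and a TCP-style checksum accepts the corrupted bytes. The CRC32C differs, so unframe would reject the message. In a real service you would want a hardware-accelerated CRC32C rather than this loop.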