TCP Checksums Are Not Enough (evanjones.ca)

[ 2008-July-27 06:49 ]

Amazon's S3 storage system had a long outage last week. The cause for the outage has been described as single bit corruption in messages exchanged internally. This led to a failure of the entire system, in an event that is reminiscent of the catastrophic failure of AT&T's telephone switching network in 1990. This is a nice reminder to developers of large scale, reliable distributed systems: TCP checksums are not enough to protect your application [Update 2015-10-08: This does happen in reality].

There are two reasons that you can't rely on TCP checksums. The first is that the TCP checksum is very weak and can easily fail to detect errors in packets. This means that it is possible for a packet to be corrupted somewhere between the sender and the receiver, without the receiver ever noticing. If you build a large system, this is almost guaranteed to happen occasionally. The second problem is that the TCP checksum happens too late, and only protects the contents of a single packet, which is not sufficient. This is the classic end-to-end argument. If the corruption happens before your data gets to TCP, because of a memory or software error, for example, the TCP checksum can't help. If the corruption happens while reassembling a message from multiple packets, again TCP is no help. However, intelligent use of higher level checksums can detect these types of errors.

Amazon's problem was likely more insidious than simply relying on TCP checksums, but it still provides a good example. They mention in their failure analysis that they are going to add checksums to system state messages. Google uses in-memory checksums to protect against software bugs in their Paxos implementation, which is part of their Chubby distributed lock service. In conclusion, if you have state that is critical for a system's correctness, it needs some form of redundancy, in order to detect and recover from errors.