Gigabit Ethernet Latency with Intel's 1000/Pro Adapters (e1000) (evanjones.ca)

[ 2008-November-07 11:30 ]

For my research, I have been carefully measuring network latency. The simplest case is an application that sends a single byte via TCP over the network to another application, which reads the byte and writes a reply. This round trip time represents the minimum latency for a request to be sent to a server and a response to come back. I was measuring this using netperf. Over a 100 Mb Ethernet switch, the measured latency was 250 µs. Over a gigabit Ethernet switch, the latency fell by exactly half, to 125 µs. This is when I became suspicious that something strange was going on. It turns out that the problem is interrupt coalescing, which many Ethernet adapters use to improve throughput at the cost of latency.

Typically, when a network device receives a packet, it copies it into the system memory using DMA, then raises an interrupt to signal that a packet has arrived. This is perfect for low loads. However, for Gigabit or 10G Ethernet, the maximum packet rate is extremely high, and handling one interrupt per packet could be very inefficient. Interrupt coalescing, also called interrupt moderation, is a feature where the network adapter will raise one interrupt for a group of packets. My problem was that the old version of the e1000 driver on my Linux systems used a fixed minimum inter-interrupt interval of 125 µs. Thus, the client would send the packet, the server would process it and reply, and then the response would sit in memory until the timer expired. In reality, the round trip latency was lower than 125 µs, but the interrupt throttle timer imposed a minimum latency.
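As a rough sketch of this effect, the observed round trip can be modeled as the larger of the true round trip time and the minimum inter-interrupt interval. This is a simplified model, not the driver's actual logic: the function name is hypothetical, and it assumes the previous interrupt fired at the moment the request was sent and that only the client side is throttled.

```python
def observed_rtt_us(true_rtt_us: float, min_interval_us: float = 125.0) -> float:
    """Round trip time seen by the client under a simplified throttle model.

    Assumption: the previous interrupt fired when the request was sent, so the
    interrupt for the reply cannot fire until min_interval_us has elapsed.
    """
    return max(true_rtt_us, min_interval_us)


if __name__ == "__main__":
    # Illustrative true round trip times, not measurements.
    for rtt in (40.0, 100.0, 125.0, 200.0):
        print(f"true {rtt:6.1f} us -> observed {observed_rtt_us(rtt):6.1f} us")
```

Under this model, any true round trip shorter than the interval is reported as exactly the interval, which is why the measurement looked suspiciously round.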

This interrupt timer does strange things to the performance of a client and server that exchange many small requests. For example, netperf will frequently measure very close to 8000 round trips per second, which is one round trip per 125 µs timer period, but it will occasionally measure a smaller value. The reason is that sometimes the timing of the interrupts on the two ends is closely synchronized. This causes two interrupt timer periods to elapse between message receptions: one for the transmit interrupt, and another for the receive interrupt. This is probably a performance anomaly that would only rarely happen in reality, since real applications will likely do more than 125 µs of work with each request, so the interrupt timer will be less of an issue. However, for simple benchmarks, it can make the difference between reliable, low-latency performance and performance with additional delay and unpredictable variation.
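The 8000 figure follows directly from the timer interval, and the arithmetic can be checked with a few lines of Python. The function name is hypothetical, and the two-period case is only a lower bound: it assumes every round trip pays for two timer periods, whereas a real run would land somewhere between the two values depending on how often the interrupts line up.

```python
def round_trips_per_second(interval_us: float, periods_per_round_trip: int = 1) -> float:
    """Round trip rate when each round trip costs a whole number of timer periods."""
    return 1_000_000 / (interval_us * periods_per_round_trip)


if __name__ == "__main__":
    # One 125 us period per round trip: the common case.
    print(round_trips_per_second(125.0))
    # Two periods per round trip: the synchronized case, if it happened every time.
    print(round_trips_per_second(125.0, periods_per_round_trip=2))
```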

In my more realistic benchmark, where the server does approximately 110 µs of processing for each request, tuning this parameter makes less of a difference. It still increases throughput significantly with a small number of clients (almost double), so fewer clients are needed to saturate the server (3 instead of 5). However, it decreases the peak throughput with large numbers of clients. This is exactly what one would expect, since improving performance under high load is the entire purpose of interrupt coalescing.

Resources

Interrupt Coalescence in the GEANT2 Knowledge Base - An excellent overview of the issue and how to tune it on various operating systems. The rest of this wiki also has interesting information.
Interrupt Moderation Using Intel Gigabit Ethernet Controllers (AP-450) - Intel application note describing interrupt moderation features and tuning knobs.
e1000 driver home page - Latest versions of the e1000, e1000e, ixgb, ixgbe Linux drivers, and developer's manuals. The version of the driver on this web site is typically newer than the one in the Linux kernel source.
Small Packet Traffic Performance Optimization for 8255x and 8254x Ethernet Controllers (AP-453) - Intel application note describing latency optimization, with a good discussion of the trade-offs between low latency and high throughput.
Topics in High-Performance Messaging - A guide to low-latency communication. It is geared towards 29West's LBM messaging product, but most of the advice is applicable to other systems.