Network Latencies in the Data Center (evanjones.ca)

[ 2021-September-24 20:59 ]

Jeff Dean used to give a talk that included "Numbers Everyone Should Know" (2007 at Stanford, 2009 at LADIS), which listed "round trip within same data center" as 500 us. I was recently wondering whether that is still true and, more importantly, what the distribution of latencies looks like.

To figure this out, I started 4 VMs in Google Cloud's us-west1 region: 2 in one zone (us-west1-b) and 2 in another zone (us-west1-c). I then ran a TCP echo server and client, so I could look at 3 different types of latency: localhost, within a single zone of a cloud region, and between two zones in a single cloud region. Each server sent 10 pings per second, and each message was 100 bytes.

This should measure "best case" network latencies, since none of the machines were being used for anything else. It isn't the absolute best that is possible: with appropriate kernel and software tuning, or with kernel bypass networking like DPDK, you can get substantially lower latencies. I wanted to measure a baseline for a "standard" networking configuration, without any specific tuning. I ran this experiment for about 70 hours, which included one entire weekend day and one entire weekday (in the North American time zones).

My conclusion is that 500 us is still a good estimate. That is approximately the 80th percentile latency in my experiments, and the real latencies are often substantially better. The approximate percentiles are in the table below. I've rounded these to the nearest 50 us to reflect my own uncertainty in these measurements.
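As a rough illustration of the kind of measurement described above, here is a minimal Python sketch of a TCP echo server and a client that times each round trip. This is not the program used for the experiment. The function names, the use of TCP_NODELAY, and the single-connection server are assumptions made for the example; only the 100 byte message size and the rate of 10 pings per second come from the text.

```python
import socket
import threading
import time

MESSAGE_SIZE = 100      # bytes per ping, as in the experiment
PINGS_PER_SECOND = 10   # send rate, as in the experiment


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def serve_echo(listener):
    """Accept one connection and echo fixed-size messages until it closes."""
    conn, _ = listener.accept()
    with conn:
        # Assumption: disable Nagle's algorithm so small messages go out immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        try:
            while True:
                conn.sendall(recv_exact(conn, MESSAGE_SIZE))
        except ConnectionError:
            pass  # client disconnected


def measure_round_trips(host, port, count, interval=1.0 / PINGS_PER_SECOND):
    """Send `count` pings and return the round-trip times in microseconds."""
    payload = b"x" * MESSAGE_SIZE
    rtts_us = []
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter_ns()
            sock.sendall(payload)
            recv_exact(sock, MESSAGE_SIZE)
            rtts_us.append((time.perf_counter_ns() - start) / 1000)
            time.sleep(interval)
    return rtts_us


if __name__ == "__main__":
    # Localhost case: server and client in the same process, on a free port.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_echo, args=(listener,), daemon=True).start()
    print(measure_round_trips("127.0.0.1", port, count=20))
```

For the same-zone and cross-zone cases, the server would listen on a reachable address on one VM and the client would run on another. Python adds its own interpreter overhead, so absolute numbers from this sketch would likely be higher than those in the table.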

There are also some interesting observations in the raw data. First, localhost latencies vary substantially. This should be measuring software scheduling latency, since there is no networking hardware involved. Even for localhost, there were rare delays longer than 1 ms (around p9999, or 0.01% of all localhost pings). The worst observed localhost latencies were around 5 ms! This suggests that many large latency outliers inside data centers may in fact be software latency, and not network queuing or packet loss. Even on these nearly idle machines, I observed unusual delays. This suggests to me that if you have low timeouts on network services, say less than 10 ms, you should expect to see retries even when everything is functioning normally. Not surprisingly, high latency outliers occur more frequently between machines in the same zone than on localhost, and more frequently still across zones. That suggests that network conditions also play a role in latency outliers.
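To make the point about low timeouts concrete, the sketch below estimates how often a given timeout would fire from a list of latency samples. The function names and the sample data are hypothetical. The data is synthetic and chosen only to show the arithmetic; it is not taken from the experiment.

```python
def fraction_over_timeout(latencies_us, timeout_us):
    """Fraction of samples that took strictly longer than the timeout."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    over = sum(1 for value in latencies_us if value > timeout_us)
    return over / len(latencies_us)


def expected_timeouts_per_day(requests_per_second, fraction_over):
    """Expected number of requests per day that exceed the timeout."""
    seconds_per_day = 24 * 60 * 60
    return requests_per_second * seconds_per_day * fraction_over


# Synthetic data: 9,999 fast samples and one 5 ms outlier.
samples_us = [150] * 9_999 + [5_000]
fraction = fraction_over_timeout(samples_us, timeout_us=1_000)
print(fraction)
print(round(expected_timeouts_per_day(10, fraction), 1))
```

Even a tiny tail fraction adds up at a steady request rate, which is why a service with an aggressive timeout should expect some retries in normal operation.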

The other interesting observation is that latency can vary rapidly, but it is correlated with both time and path. For example, for one of the same-zone pairs, one direction was consistently faster than the other. There must be something that is distributing packets using their port numbers, either on the hosts or in the network itself, since otherwise these two connections are identical. In general, if a specific path was "slow" for the last minute, the next minute is likely to be slow as well. Sometimes, these periods can be widespread and last a while. In the cross-zone case, there was a period of about 30 minutes where all the p50 latencies were about 100 us higher than "normal." I assume this means the network links between the zones were quite busy during that period for some reason.
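One way to look for this kind of time correlation is to group the samples for a single path into one-minute buckets and compare each bucket's median to a baseline. The sketch below assumes samples are stored as (timestamp in seconds, round trip in us) pairs. That format, the function names, and the example values are all hypothetical. The 100 us threshold mirrors the shift described above.

```python
from collections import defaultdict
from statistics import median


def per_minute_p50(samples):
    """Group (timestamp_seconds, rtt_us) samples by minute; return {minute: median}."""
    buckets = defaultdict(list)
    for timestamp, rtt_us in samples:
        buckets[int(timestamp // 60)].append(rtt_us)
    return {minute: median(values) for minute, values in sorted(buckets.items())}


def slow_minutes(p50_by_minute, baseline_us, threshold_us=100):
    """Minutes whose median is at least threshold_us above the baseline."""
    return [
        minute
        for minute, p50 in p50_by_minute.items()
        if p50 - baseline_us >= threshold_us
    ]


# Synthetic example: minute 0 is "normal", minute 1 is about 100 us slower.
samples = [(0, 400), (10, 420), (59, 440), (60, 500), (61, 520), (119, 540)]
p50s = per_minute_p50(samples)
print(p50s)
print(slow_minutes(p50s, baseline_us=420))
```

Running this per path and lining up the slow minutes across paths would show whether a slow period is local to one connection or widespread.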

Finally, I saw no observable difference between the weekend day and the weekday. I was expecting there might be a slight increase in latency, due to the network being more utilized on the weekday. That is not visible in my data.

The table and charts below attempt to summarize the data.

Approximate percentile latency by connection type
Connection type	p50	p95	p99	p999
Localhost	143 us	298 us	387 us	543 us
Same zone	362 us	597 us	748 us	1044 us
Cross zone	420 us	669 us	826 us	1111 us
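
For readers who want to produce a similar summary from their own samples, here is a small sketch of computing the percentiles in the table and rounding to the nearest 50 us, as mentioned in the text. It uses the nearest-rank definition of a percentile, which is an assumption: the text does not say which definition was used. The function names are hypothetical.

```python
import math
from fractions import Fraction

# Exact fractions avoid floating point surprises when computing ranks.
PERCENTILES = {
    "p50": Fraction(50, 100),
    "p95": Fraction(95, 100),
    "p99": Fraction(99, 100),
    "p999": Fraction(999, 1000),
}


def percentile(sorted_values, quantile):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(quantile * len(sorted_values)))
    return sorted_values[rank - 1]


def round_to_nearest(value, step=50):
    """Round to the nearest multiple of step (halves round up)."""
    return step * math.floor(value / step + 0.5)


def summarize(latencies_us):
    """Return {label: percentile rounded to the nearest 50 us}."""
    ordered = sorted(latencies_us)
    return {
        label: round_to_nearest(percentile(ordered, quantile))
        for label, quantile in PERCENTILES.items()
    }
```

For example, round_to_nearest(143) gives 150 and round_to_nearest(362) gives 350.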

Experimental details

Machine type: n2d-highcpu-4: 4 VCPUs, 4 GiB RAM, 10 Gbps max network egress
Ubuntu 20.04.3 LTS
Kernel: 5.11.0-1017-gcp