How much does the read/write buffer size matter for socket throughput?

about | archive


[ 2023-July-16 12:04 ]

The read() and write() system calls take a variable-length byte array as an argument. As a simplified model, the time for the system call should be some constant "per-call" time, plus time directly proportional to the number of bytes in the array. That is, the time for each call should be time = (per_call_minimum_time) + (array_len) × (per_byte_time). With this model, using a larger buffer should increase throughput, asymptotically approaching 1/per_byte_time. I was curious: do real system calls behave this way? What are the ideal buffer sizes for read() and write() if we want to maximize throughput? I decided to do some experiments with blocking I/O. These are not rigorous, and I suspect the results will vary significantly if the hardware and software are different than one the system I tested. The really short answer is that a buffer of 32 KiB is a good starting point on today's systems, and I would want to measure the performance to go beyond that. However, for large writes, performance can increase.

On Linux, the simple model holds for small buffers (≤ 4 KiB), but once the program approaches the maximum throughput, the throughput becomes highly variable and in many cases decreases as the buffers get larger. For blocking I/O, approximately 32 KiB is large enough to hit the maximum throughput for read(), but write() throughput improves with buffers up to around 256 KiB - 1 MiB. The reason for the asymmetry is that the Linux kernel will only write less than the entire buffer (a "short write") if there is an error (e.g. a signal causing EINTR). Thus, larger write buffers means the operating system needs to switch to the process less often. On the other head, "short reads", where a read() returns less than the maximum length, become increasingly common as the buffer size increases, which diminishes the benefit. There is a SO_RCVLOWAT socket option to change this that I did not test.

The experiments were run on two 16 CPU Google Cloud T2D instances, which use AMD EPYC Milan processors (3rd generation, released in 2021). Each core is a real physical core. I used Ubuntu 23.04 running kernel 6.2.0-1005-gcp. My benchmark program is written in Rust and is available on Github. On localhost, Unix sockets were able to transfer data at approximately 9000 MiB/s. Localhost TCP sockets were a bit slower, around 7000 MiB/s. When using two separate cloud VMs with a networking throughput limit of 32 Gbps = 3800 MiB/s, I needed to use 6 TCP sockets to reliably reach that maximum throughput. A single TCP socket gets around 1400 MiB/s with 256 KiB buffers, with peaks as high as 2200 MiB/s.

Experiment 1: /dev/zero and /dev/urandom

My first experiment is reading from the /dev/zero and /dev/urandom devices. These are software devices implemented by the kernel, so they should have low overhead and low variability, since other tasks are not involved. Reading from /dev/urandom should be much slower than /dev/zero since the kernel must generate random bytes, rather than just zeros. The chart below shows the throughput for reading from /dev/zero as the buffer size is increased. The results show that the basic linear time per system call model holds until the system reaches maximum throughput (256 kiB buffer = 39000 MiB/s for /dev/zero, or 16 kiB = 410 MiB/s for /dev/urandom). As the buffer size increases further, the throughput decreases as the buffers get too big. This suggests that some other cost for larger buffers starts to outweigh the reduction in number of system calls. Perhaps CPU caches become less effective? The AMD EPYC Milan (3rd gen) CPU I tested on has 32 KiB of L1 data cache and 512 KiB of L2 data cache per core. The performance decreases don't exactly line up with these numbers, but it still seems plausible. The numbers for /dev/urandom are substantially lower, but otherwise similar.

I did a linear least-squares fit on the average time per system call, shown in the following chart. If I use all the data, the fit is not good, because the trend changes for larger buffers. However, if I use the data up to the maximum throughput at 256 KiB, the fit is very good, as shown on the chart below. The linear fit models the minimum time per system call as 167 ns, with 0.0235 ns/byte additional time. If we want to use smaller buffers, using a 64 KiB buffer for reading from /dev/zero gets within 95% of the maximum throughput.

Experiment 2: Unix and localhost TCP sockets

Exchanging data with other processes is the thing I am actually interested in, so I tested Unix and TCP sockets on a single machine. In this case, I varied both the write buffer size and the read buffer size. Unfortunately, these results vary a lot. A more robust comparison would require running each experiment many times, and using some sort of statistical comparison. However, this "quick and dirty" experiment satisfied my curiousity, so I didn't do that. As a result, my conclusions here are vague.

The simple model that increasing buffer size should decrease overhead is true, but only until the buffers are about 4 KiB. Above that point, the results start to be highly variable, and it is much harder to draw general conclusion. However, appears that increasing the write buffer size generally is quite helpful up to at least 256 KiB, and often needed as much as 1 MiB to get the highest localhost throughput. I suspect this is because on Linux with blocking sockets, write() will not return until it has written all the data in the buffer, unless there is an error (e.g. EINTR). As a result, passing a large buffer means the kernel can do a lot of the work without needing to switch back to user space. Unfortunately, the same is not true for read(), which often returns "short reads" with any data that is available in the buffer. This starts with buffer sizes around 2 KiB, with the percentage of short reads increasing as the buffer size gets larger. This means the simple model does not hold, because we aren't actually increasing the bytes per read call. I suspect this is a factor which means this microbenchmark is likely not representative of real programs. A real program will do something with the buffer, which will provide time for more data to be buffered in the kernel, and would probably decrease the number of short reads. This likely means larger buffers are in practice more useful than this microbenchmark suggests. As a result of this, the highest throughput often was achievable with small read buffers. I'm somewhat arbitrarily selecting 16 KiB at the best read buffer, and 256 KiB as the best write buffer, although a 1 MiB write buffer seems to be

To give a sense of how variable the results are, the plot below shows the local Unix socket throughput for each read and write buffer throughput size. I apologize for the ugly plot. I did not want to spend the time to make it more beautiful. This plot is interactive so you can slice the data to the area of interest. I recommend zooming in to the left hand size with read buffers up to about 300 KiB.

The first thing to note is at least on Linux with blocking sockets, the writer will almost never have a "short write", where the write system call returns before writing all the data in the buffer. Unless there is a signal (EINTR) or some other "error" condition, write() will not return until all the bytes are written. The same is not true for reads. The read() system call will often return a "short" read, starting around buffer sizes of 2 KiB. The percentage of short reads generally increases as buffer sizes get bigger, which is logical.

Another note is that sockets have in-kernel send and receive buffers. I did not tune these at all. It is possible that better performance is possible by tuning these settings, but that was not my goal. I wanted to know what happens "out of the box" for general-purpose programs without any special tuning.

Experiment 3: TCP between two hosts

In this experiment, I used two separate hosts connected with 32 Gbps networking in Google Cloud. I first tested the TCP throughput using iperf, to independently verify the network performance. A single TCP connection with iperf is not enough to fully utilize the network. I tried fiddling with some command line options and with Kernel settings like net.ipv4.tcp_rmem and wasn't able to get much better than about 12 Gb/s = 1400 MiB/s. The throughput is also highly varible. Here is some example output with iperf reporting at 2 second intervals, where you can see the throughput ranging from 10 to 19 Gb/s, with an average over the entire interval of 12 Gb/s.

$ iperf -c 10.128.0.38 --time 30 --interval 2 -l 262144
------------------------------------------------------------
Client connecting to 10.128.0.38, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.128.0.39 port 53712 connected with 10.128.0.38 port 5001 (icwnd/mss/irtt=13/1408/229)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-2.0000 sec  2.66 GBytes  11.4 Gbits/sec
[  1] 2.0000-4.0000 sec  2.61 GBytes  11.2 Gbits/sec
[  1] 4.0000-6.0000 sec  2.42 GBytes  10.4 Gbits/sec
[  1] 6.0000-8.0000 sec  2.41 GBytes  10.4 Gbits/sec
[  1] 8.0000-10.0000 sec  2.39 GBytes  10.3 Gbits/sec
[  1] 10.0000-12.0000 sec  2.42 GBytes  10.4 Gbits/sec
[  1] 12.0000-14.0000 sec  2.37 GBytes  10.2 Gbits/sec
[  1] 14.0000-16.0000 sec  2.54 GBytes  10.9 Gbits/sec
[  1] 16.0000-18.0000 sec  2.36 GBytes  10.1 Gbits/sec
[  1] 18.0000-20.0000 sec  2.40 GBytes  10.3 Gbits/sec
[  1] 20.0000-22.0000 sec  2.39 GBytes  10.3 Gbits/sec
[  1] 22.0000-24.0000 sec  2.44 GBytes  10.5 Gbits/sec
[  1] 24.0000-26.0000 sec  2.98 GBytes  12.8 Gbits/sec
[  1] 26.0000-28.0000 sec  4.52 GBytes  19.4 Gbits/sec
[  1] 28.0000-30.0000 sec  4.45 GBytes  19.1 Gbits/sec
[  1] 0.0000-30.0052 sec  41.4 GBytes  11.8 Gbits/sec

To hit the maximum network throughput, I need to use 6 or more parallel TCP connections (iperf -c IP_ADDRESS --time 60 --interval 2 -l 262144 -P 6). Using 3 connections gets around 26 Gb/s, and using 4 or 5 will occasionally hit the maximum, but will also occasionally drop down. Using at least 6 seems to reliably stay at the maximum.

Due to this variability, it is hard to draw any conclusions about buffer size. In particular: a single TCP connection is not limited by CPU. The system uses about 40% of a single CPU core, basically all in the kernel. This is more about how the buffer sizes may impact scheduling choices. That said, it is clear that you cannot hit the maximum throughput with a small write buffer. The experiments with 4 KiB write buffers reached approximately 300 MiB/s, while an 8 KiB write buffer was much faster, around 1400 MiB/s. Larger still generally seems better, up to around 256 KiB, which occasionally reached 2200 MiB/s = 17.6 Gb/s. The plot below shows the TCP socket throughput for each read and write buffer size. Again, I apologize for the ugly plot.