Write Latency and Block Size (evanjones.ca)

[ 2010-August-09 10:11 ]

Today's episode of how to get good performance from disks looks at the impact of block size and alignment on write latency. There are two layers involved in writes. The first layer is the disk cache or page cache, which stores disk pages in memory temporarily. On Linux x86 and x86-64, the page cache stores 4 kB chunks of files from disk (other architectures may use a different size, for example 8 kB on Sparc or IA64). Thus, performing I/O in 4 kB chunks will probably be more efficient. The second layer is the disk interface itself. In the dark ages, someone decided that disk sectors should be 512 bytes long (although this is now increasing to 4 k). This means that every access to the disk must perform I/O in 512 byte chunks. Again, this means that if you perform I/O in 512 byte chunks, you should get more efficient I/O. Combining these two seems to suggest that performing I/O in 4 kB chunks is the best. However, I wanted to find out exactly how much this matters, so I performed a simple experiment. (Short version: align writes on a 4 k boundary, and always write a multiple of 4 k bytes).

The experiment pre-allocates a 64 MB file, then writes to a random location. I test three parameters: cached/uncached, O_DIRECT enabled or disabled, and the block size/alignment. The detailed results are shown in the table below. There are a few important observations. First, if your data is in the page cache, write size doesn't really matter: all are basically equally fast, ignoring the fact that larger writes are more efficient per byte, since the cost of the system call is amortized over more bytes. However, if your data is not in the page cache, then modifying less than a page causes the kernel to read the page in from disk before performing the write. This is terribly slow, because it requires a disk seek (~8 ms on my system). However, if you write entire pages, then the kernel is fast, as it recognizes that it doesn't need to read the old page in. It still takes longer than modifying a cached page (~25 µs VS 10 µs), likely because it needs to find and zero a free page before performing the copy.

Finally, if you want to perform I/O that avoids the page cache, Linux provides the O_DIRECT option to open or fcntl. In this case, it doesn't matter if the page is in the cache or not, but you must perform I/O in blocks that are a multiple of 512 bytes long, corresponding to the disk sector size. This is because this interface literally passes the write to the disk immediately, rather than modifying the page cache. I'm guessing that on newer disks that have 4096 byte sectors, the I/O will need to be 4 k aligned. Mac OS X provides a similar feature (via fcntl's F_NOCACHE option), although its implementation is different. It permits writes of any size, but performs terribly if you perform writes that are not a multiple of 4 kB.

Similar issues will appear with RAID. In the case of RAID5 or 6, I'll call the amount of data written to a disk the chunk size, and I'll use stripe size to refer to number of data disks times the chunk size. Unfortunately, different RAID systems use different terms for these concepts. A write of an entire stripe can simply be passed through to the disks. A write that touches less than the entire stripe will need to read the modified and parity chunk(s), compute the new parity chunk(s), then write the modified and parity chunks. This is clearly going to be a slower operation, so it is best to write entire stripes at a time, which will probably be a very large chunk of data on most RAID configurations. In the case of hardware RAID, a battery backed cache can prevent much of the overhead, since it can delay the write until the entire stripe is updated, and/or hide the latency from the additional reads.

Conclusion: the size and alignment of your writes matters. Do the largest writes you can, but try to align them to page boundaries. For RAID, some additional tests may be required, but the best performance should come from writing the entire stripe (chunk size × data disks).

Side note: To be strictly correct, alignment refers to the starting address or offset, whereas the block size refers to the size of the data transfer. While these are technically independent, in this benchmark I've changed them together. That is, a block size of 4096 is also aligned on a 4096 byte boundary. I suspect that unaligned writes are just equivalent to two less than block sized writes.

Microseconds per write

These tests were performed on a Linux 2.6.27 kernel, on a Western Digital Caviar 250 GB 7200 RPM SATA disk (WD2500JS), using an ext3 file system. I performed similar experiments on an older Macbook running Mac OS X 10.5.8 and obtained similar results, except for the differences between Linux O_DIRECT and Mac OS X F_NOCACHE noted above, and the laptop hard drive is much slower.

Block Size	Cached Normal	Cached Direct	Uncached Normal	Uncached Direct
1	5.887	(n/a)	8306	(n/a)
27	6.135	(n/a)	8479	(n/a)
512	8.713	2498	8742	2579
2048	9.733	2422	8670	2721
4096	10.350	2514	25.039	2560