Durability: Linux File APIs (evanjones.ca)

[ 2020-October-12 10:26 ]

As part of investigating the durability provided by cloud systems, I wanted to make sure I understood the basics. I started by reading the NVMe specification, to understand the guarantees provided by disks. The summary is that you should assume your data is corrupt from the time a write is issued until a flush or force unit access write completes. However, most programs use system calls to write data. This article looks at the guarantees provided by the Linux file APIs. It seems like this should be simple: a program calls write() and after it completes, the data is durable. However, write() only copies data from the application into the kernel's cache in memory. To force the data to be durable you need to use some additional mechanism. This article is a messy collection of notes about what I've learned. (The really brief summary: use fdatasync or open with O_DSYNC.) For a better and clearer overview, see LWN's Ensuring data reaches disk, which walks from application code through to the disk.

The semantics of write()

The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an additional note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.

Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. However, write is not required to transfer everything it was given, and is allowed to write only part of the data. For example, we could have two threads, each appending 1024 bytes to a single file descriptor. It would be acceptable for the two writes to each only write a single byte. This is still "atomic", but also results in undesirable interleaved output. There is a great StackOverflow answer with more details.
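
Because a write may transfer only part of the data, callers usually loop until everything has been written. The following is a minimal sketch in Python, whose os.write wraps the write system call and returns the number of bytes actually written. The helper name write_all is my own, not something from POSIX. Note that the loop does not make the whole transfer atomic: another writer can still interleave between iterations.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of data to fd, retrying after short writes.

    Returns the total number of bytes written. Hypothetical helper name.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write may write fewer bytes than requested; it returns the count.
        total += os.write(fd, view[total:])
    return total
```

Each os.write call is still a separate system call, so two threads using this helper on the same file descriptor can produce interleaved output.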

fsync/fdatasync

The most straightforward way to get your data on disk is to call fsync(). It asks the operating system to transfer all modified blocks in cache to disk, along with all file metadata (e.g. access time, modification time, etc.). In my opinion, that metadata is rarely useful, so you should use fdatasync unless you know you need the metadata. The fdatasync man page says it is required to flush as much metadata as necessary "for a subsequent data read to be handled correctly", which is what most applications care about.
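
As a rough illustration, here is a sketch of the write-then-fdatasync pattern using Python's os module, which exposes both fsync and fdatasync. The function name append_durably is made up for this example, and it assumes the file already exists.

```python
import os
import tempfile


def append_durably(path: str, record: bytes) -> None:
    """Append record to an existing file; return only after fdatasync succeeds."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        view = memoryview(record)
        while len(view) > 0:
            # Handle short writes by advancing past what was written.
            view = view[os.write(fd, view):]
        # Flush the data, plus only the metadata needed to read it back.
        os.fdatasync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    append_durably(tmp_path, b"one record\n")
    os.unlink(tmp_path)
```

Swapping os.fdatasync for os.fsync gives the stronger variant that also flushes all file metadata.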

One issue is that this does not guarantee you can find the file again. In particular, when you first create a file, you need to call fsync on the directory that contains it, otherwise it might not exist after a failure. The reason is basically that in UNIX, a file can exist in multiple directories due to hard links, so when you call fsync on a file, there is no way to tell which directories should be written out (more details). It appears that ext4 may actually fsync the directory automatically, but that might not be true for other filesystems.
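
A sketch of what this looks like when creating a new file: sync the file itself, then open the containing directory and fsync that as well. The name create_file_durably is hypothetical, and the example is in Python, where os.fsync accepts a directory file descriptor on Linux.

```python
import os
import tempfile


def create_file_durably(path: str, data: bytes) -> None:
    """Create path with data, then fsync both the file and its directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            view = view[os.write(fd, view):]
        os.fsync(fd)  # file data and metadata
    finally:
        os.close(fd)

    # The directory entry is separate state: sync the containing directory.
    dir_path = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    create_file_durably(os.path.join(demo_dir, "demo.dat"), b"hello")
```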

The way this is implemented will vary depending on the file system. I used blktrace to examine what disk operations ext4 and xfs use. They both issue normal disk writes for both the file data and the file system journal, issue a cache flush, then finish with a FUA write to the journal, probably to indicate the operation has committed. On disks that do not support FUA, this involves two cache flushes. My experiments show that fdatasync is slightly faster than fsync, and blktrace shows fdatasync tends to write a bit less data (ext4: 20 KiB for fsync vs 16 KiB for fdatasync). My experiments also show that xfs is slightly faster than ext4, and again blktrace shows it tends to flush less data (xfs: 4 KiB for fdatasync).

fsync controversies

In my professional career, I can remember three fsync-related controversies. The first, in 2008, was that Firefox 3's UI would hang when lots of files were being written. The problem is the UI used the SQLite database to save state, which provides strong durability guarantees by calling fsync after each commit. On the ext3 filesystem of the time, fsync wrote out all dirty pages on the system, rather than just the relevant file. This meant that clicking a button in Firefox could wait for many megabytes of data to be written on magnetic disks, which could take seconds. The solution, as I understand from a blog, was to move many database commits to asynchronous background tasks. This means Firefox was previously using stronger durability guarantees than it needed, although the problem was made much worse by the ext3 filesystem.

The second controversy, in 2009, was that after a system crash, users of the new ext4 filesystem found many recently created files would have zero length, which did not happen with the older ext3 filesystem. In the previous controversy, ext3 was flushing too much data, which caused really slow fsync calls. To fix it, ext4 flushes only the relevant dirty pages to disk. For other files, it keeps them in memory for much longer to improve performance (defaulting to 30 seconds, configured with dirty_expire_centisecs; note). This means after a crash, lots of data might be missing. The solution is to add fsyncs to applications that want to ensure data will survive crashes, since fsyncs are much more efficient with ext4. The downside is this still makes operations like installing software slower. For more details, see LWN's article, or Ted Ts'o's explanation.

The third controversy, in 2018, was that Postgres discovered that when fsync encounters an error, it can mark dirty pages as "clean", so future calls to fsync do nothing. This leaves modified pages in memory that are never written to disk. This is pretty catastrophic, since the application thinks some data has been written, but it has not. There are very few things an application can do in this rare case when fsync fails. Postgres and many other applications now crash when it happens. A paper titled Can Applications Recover from fsync Failures? published in USENIX ATC 2020 investigates the issue in detail. The best solution at the moment is to use Direct I/O with O_SYNC or O_DSYNC, which will report errors on specific write operations, but requires the application to manage buffers itself. For more details, see the LWN article or the Postgres wiki page about fsync errors.
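
The "crash when fsync fails" policy can be sketched roughly as below. This is only an illustration in Python with a made-up helper name (fsync_or_die), not how Postgres implements it. The key point is that the error is treated as fatal instead of being retried, since a second fsync may report success without the data ever reaching disk.

```python
import os
import sys
import tempfile


def fsync_or_die(fd: int) -> None:
    """fsync fd; on failure, abort the process instead of retrying."""
    try:
        os.fsync(fd)
    except OSError as err:
        # Retrying is unsafe: the kernel may have marked the dirty pages
        # clean, so a later fsync could succeed without writing them.
        sys.stderr.write(f"fsync failed: {err}; aborting\n")
        os.abort()


if __name__ == "__main__":
    demo_fd, demo_path = tempfile.mkstemp()
    os.write(demo_fd, b"data")
    fsync_or_die(demo_fd)
    os.close(demo_fd)
    os.unlink(demo_path)
```

After a crash, the application would be expected to recover from its own log or other durable state, which is outside the scope of this sketch.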

open with O_SYNC/O_DSYNC

Back to system calls for durability. Another option is to use the O_SYNC or O_DSYNC options with the open() system call. This causes every write to have the same semantics as a write followed by fsync/fdatasync, respectively. The POSIX specification calls this Synchronized I/O File Integrity Completion and Data Integrity Completion. The main advantage of this approach is that you need a single system call, instead of write followed by fdatasync. The biggest disadvantage is that all writes using that file descriptor will be synchronized, which may limit how the application code is structured.
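
For example, a minimal sketch in Python, where os.open passes the flags straight through to open(). The helper name open_dsync is my own, and the file is assumed to exist already.

```python
import os
import tempfile


def open_dsync(path: str) -> int:
    """Open an existing file so each write behaves like write + fdatasync."""
    return os.open(path, os.O_WRONLY | os.O_DSYNC)


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    fd = open_dsync(tmp_path)
    # When this returns, the data has been flushed; no separate sync call.
    os.write(fd, b"synchronized write\n")
    os.close(fd)
    os.unlink(tmp_path)
```

Replacing os.O_DSYNC with os.O_SYNC gives the fsync-like variant that also flushes all file metadata on every write.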

Direct I/O with O_DIRECT

The open() system call has an O_DIRECT option which is intended to bypass the operating system's cache, and instead do I/O directly with the disk. This means in many cases, an application's write call will translate directly into a disk command. However, in general this is not a replacement for fsync or fdatasync, since the disk itself is free to delay or cache those writes. Even worse, there are edge cases that mean O_DIRECT I/O falls back to traditional buffered I/O. The easiest solution is to also use the O_DSYNC option to open, which means each write is effectively followed by fdatasync.
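
A rough sketch of combining the two flags follows. O_DIRECT generally requires the buffer address, the length, and the file offset to be aligned to the device's block size. The 4096-byte value here is an assumption, and the helper names are invented for this example. An anonymous mmap is used because it returns page-aligned memory. Some filesystems reject O_DIRECT entirely, in which case open fails with EINVAL.

```python
import mmap
import os

BLOCK_SIZE = 4096  # assumed alignment; the real requirement depends on the device


def aligned_buffer(data: bytes, block_size: int = BLOCK_SIZE) -> mmap.mmap:
    """Copy data into a page-aligned, zero-padded buffer sized to whole blocks."""
    padded = -(-len(data) // block_size) * block_size  # round up to a block multiple
    buf = mmap.mmap(-1, max(padded, block_size))  # anonymous, zero-filled mapping
    buf[:len(data)] = data
    return buf


def write_direct(path: str, data: bytes) -> int:
    """Write data at offset 0 with O_DIRECT | O_DSYNC; returns bytes written.

    The count includes the zero padding up to the next block boundary.
    """
    buf = aligned_buffer(data)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o644)
    try:
        return os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

Calling write_direct("/some/path", b"payload") would write one full 4096-byte block. A real application would need to track the logical length separately, since the file ends up padded with zeros.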

It turns out that XFS somewhat recently added a "fast path" for O_DIRECT|O_DSYNC writes. If you are overwriting blocks with O_DIRECT|O_DSYNC, XFS will issue a FUA write if the device supports it, rather than using a cache flush. I used blktrace to confirm this happens on my Linux 5.4/Ubuntu 20.04 system. This should be more efficient, since it writes the minimum amount of data to disk, and uses a single operation instead of a write followed by a cache flush. I found a link to the kernel patch that implemented this in 2018, which has some discussion about implementing this optimization for other filesystems, but as far as I know XFS is the only one that does.

sync_file_range

Linux also has sync_file_range, which can allow flushing part of a file to disk, rather than the entire file, and triggering an asynchronous flush, rather than waiting for it. However, the man page states that it is "extremely dangerous" and discourages its use. The best description of some of the differences and dangers with sync_file_range is Yoshinori Matsunobu's post about how it works. Notably, it seems that RocksDB uses this to control when the kernel flushes dirty data to disk, and still uses fdatasync to ensure durability. It has some interesting comments in its source code. For example, it appears that with zfs, the sync_file_range call does not actually flush data. Given my experience that code that is rarely used probably has bugs, I would recommend avoiding this system call without very good reasons.

System calls for durable I/O

My conclusion is there are basically three approaches for durable I/O. All of them require you to call fsync() on the containing directory when you first create a file.

fdatasync or fsync after a write (prefer fdatasync).
write on a file descriptor opened with O_DSYNC or O_SYNC (prefer O_DSYNC).
pwritev2 with the RWF_DSYNC or RWF_SYNC flag (prefer RWF_DSYNC).

Some random performance observations

I have not measured these carefully, and many of these differences are very small, so they could be wrong, and they are highly likely to change. These are roughly ordered from largest to smallest effects.

Overwriting is faster than appending (~2-100% faster): Appending involves additional metadata updates, even after an fallocate system call, but the size of the effect varies. My recommendation is for best performance, call fallocate() to pre-allocate the space you need, then explicitly zero-fill it and fsync. This ensures the blocks are marked as "allocated" rather than "unallocated" in the file system, which is a small (~2%) improvement. Additionally, some disks may have a performance penalty for the first access to a block, which means zero filling can cause large improvements (~100%). Notably, this may happen to AWS EBS disks (not official, I have not confirmed) and GCP Persistent Disk (official; confirmed with benchmark). Others have made the same observation with different disks.
Fewer system calls are faster (~5% faster): It seems to be slightly faster to use open with O_DSYNC or pwritev2 with RWF_SYNC instead of explicitly calling fdatasync. I suspect this is because there is slightly less system call overhead (one call instead of two). However, the difference is pretty small, so do whatever makes your application logic simpler.
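
The pre-allocation recommendation above could look something like the following sketch. Python exposes os.posix_fallocate, not the Linux-specific fallocate(), so that is what is used here. The function name and the 1 MiB chunk size are arbitrary choices for illustration. Since this creates the file, the containing directory would also need an fsync, which is omitted to keep the example short.

```python
import os
import tempfile


def preallocate_zeroed(path: str, size: int, chunk_size: int = 1 << 20) -> None:
    """Allocate size bytes (size > 0) for path, zero-fill them, then fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)  # reserve the space
        zeros = bytes(chunk_size)
        offset = 0
        while offset < size:
            n = min(chunk_size, size - offset)
            # Explicitly writing zeros makes the file system treat the
            # blocks as written, and touches each block on the disk once.
            offset += os.pwrite(fd, zeros[:n], offset)
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    preallocate_zeroed(os.path.join(demo_dir, "log.dat"), 4 * 1024 * 1024)
```

Later writes into the file are then overwrites instead of appends, which is the faster case described above.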