Durability: NVMe disks



[ 2020-September-27 10:26 ]

Durability is the guarantee that data can be accessed after a failure. It seems like this should be very simple: either your system provides durable data storage, or it does not. However, durability is not a binary yes/no property, and instead should be defined as the kinds of failures you want your data to survive. Since there is usually some performance penalty for durability, many systems provide a way for only "important" writes to be durable, while "normal" writes will eventually be durable, with no specific guarantee about when. Finally, durability is rarely tested, since really testing it involves cutting the power to computer systems, which is disruptive and hard to automate. Production environments are designed to avoid these failures, so bugs are rarely triggered and hard to reproduce.

I've recently been investigating the durability guarantees in cloud platforms. I decided to start at the beginning: what guarantees are provided by the disks we connect to computers? To find out, I read the relevant sections of the Non-Volatile Memory Express (NVMe) specification (version 1.4a), since it is the newest standard for high-performance SSDs. It also has an easy-to-find, freely available specification, unlike the older SATA or SCSI standards that were originally designed for magnetic disks. In the rest of this article, I will attempt to summarize the durability NVMe devices provide. I believe that most of this should also apply to SATA and SCSI, although I have not checked those specifications. NVMe was designed as a higher-performance replacement for those protocols, so the semantics are unlikely to be very different.

Ordering and atomicity

Before we can discuss durability, we should discuss some basic semantics of NVMe writes. Commands are submitted to devices using a set of queues. At some later time, the device acknowledges that the commands have completed. There is no ordering guaranteed between commands. From Section 6.3: "each command is processed as an independent entity without reference to other commands [...]. If there are ordering requirements between these commands, host software or the associated application is required to enforce that ordering". This means if the order matters, the software needs to wait for commands to complete before issuing the next commands. However, read commands are guaranteed to return the most recently completed write (Section 6.4.2.1), although they may also return data from uncompleted writes that have been queued.

A related issue with concurrent updates is atomicity. If there are concurrent writes to overlapping ranges, what are the permitted results? The answer is there are no guarantees. Specifically, "After execution of command A and command B, there may be an arbitrary mix of data from command A and command B in the LBA [logical block address] range specified" (Section 6.4.2). This seems to permit literally any result in the case of concurrent writes, such as alternating bytes from command A and command B.

NVMe includes optional support for atomic writes, with different values for "normal operation" and after power failure. The couple of NVMe devices I looked at don't support atomic writes, but apparently some higher-end devices do. The device exposes the size of atomic writes so software can configure itself to use it. For example, see the MariaDB documentation about atomic writes. This can replace MySQL's "doublewrite buffer," which is a mechanism that provides atomic writes on devices that don't natively support them (nearly all disks).
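As a rough illustration of how software could read these sizes, here is a minimal Python sketch that decodes the atomic write fields from a raw 4096-byte Identify Controller buffer. The byte offsets (525 for VWC, 526 for AWUN, 528 for AWUPF) and the "0's based, in logical blocks" encoding are my reading of the specification and should be checked against the revision you are using. The function name, the returned keys, and the way the buffer is obtained are all hypothetical.

```python
import struct

# Offsets into the Identify Controller data structure. These are an
# assumption based on my reading of the spec; verify before relying on them.
VWC_OFFSET = 525    # Volatile Write Cache: bit 0 set means a cache is present
AWUN_OFFSET = 526   # Atomic Write Unit Normal: 16-bit little-endian, 0's based
AWUPF_OFFSET = 528  # Atomic Write Unit Power Fail: same encoding as AWUN


def atomic_write_sizes(identify: bytes, block_size: int) -> dict:
    """Decode atomic write sizes (in bytes) from an Identify Controller buffer."""
    if len(identify) < 4096:
        raise ValueError("expected a 4096-byte Identify Controller buffer")
    awun, awupf = struct.unpack_from("<HH", identify, AWUN_OFFSET)
    return {
        "volatile_write_cache": bool(identify[VWC_OFFSET] & 0x01),
        # 0's based: a raw value of 0 means one logical block.
        "atomic_normal_bytes": (awun + 1) * block_size,
        "atomic_power_fail_bytes": (awupf + 1) * block_size,
    }
```

The logical block size is not part of this structure, so the sketch takes it as a parameter. Namespaces can also report their own atomic write values, which this sketch ignores.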

Basically, NVMe provides "weak" semantics similar to shared memory in multi-threaded programs. There are no guarantees if there are concurrent operations. This means if the order of writes matters, the software needs to submit the commands and wait for them to complete, and never have concurrent writes to overlapping ranges.
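One way software can follow the "no concurrent overlapping writes" rule is to track the LBA ranges of in-flight writes and hold back any new write that overlaps one of them. The following is a minimal Python sketch of that bookkeeping only. The class and method names are hypothetical, and it does not talk to a real device.

```python
class InFlightWrites:
    """Tracks LBA ranges of submitted-but-not-completed writes."""

    def __init__(self):
        self._ranges = {}  # command id -> (start_lba, end_lba), end exclusive
        self._next_id = 0

    def try_submit(self, start_lba: int, num_blocks: int):
        """Return a command id, or None if the range overlaps an in-flight write."""
        if num_blocks <= 0:
            raise ValueError("num_blocks must be positive")
        end_lba = start_lba + num_blocks
        for other_start, other_end in self._ranges.values():
            # Two half-open ranges overlap if each starts before the other ends.
            if start_lba < other_end and other_start < end_lba:
                return None
        command_id = self._next_id
        self._next_id += 1
        self._ranges[command_id] = (start_lba, end_lba)
        return command_id

    def complete(self, command_id: int) -> None:
        """Call when the device reports the command as completed."""
        del self._ranges[command_id]
```

A caller that gets None back would wait for the conflicting write to complete and then retry, which also gives it a well-defined order between the two writes.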

The Flush command

Without special commands, NVMe provides no guarantees about what data will survive a power failure (see the description of the Atomic Write Unit Power Fail (AWUPF) field in Section 5.15.2.2, Figure 249, and Section 6.4.2.1). My reading is that devices are permitted to return an error for all ranges where writes were "in flight" at the time of failure. If you want to be completely safe, you should avoid overwriting critical data by using write-ahead logging. This matches the semantics I found during power fail testing of SATA magnetic hard drives and SSDs in 2010.
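To make the write-ahead logging idea concrete, here is a minimal Python sketch of an append-only log whose records carry a length and a CRC32, so that a torn or missing record at the tail is detected on recovery instead of being trusted. The record format and function names are hypothetical. It relies on os.fsync to push the data to the device, which is an assumption about the operating system and file system rather than something the NVMe specification promises. It also does not handle the case where the device returns a read error for the damaged range.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(fd: int, payload: bytes) -> None:
    """Append one record to a log opened with O_APPEND, then fsync."""
    if not payload:
        raise ValueError("empty payloads are not allowed")
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    written = 0
    while written < len(record):  # os.write may write fewer bytes than asked
        written += os.write(fd, record[written:])
    os.fsync(fd)  # do not treat the record as durable until this returns


def read_records(path: str) -> list:
    """Return all valid records, stopping at the first damaged one."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size : pos + HEADER.size + length]
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or garbage tail: ignore everything from here on
        records.append(payload)
        pos += HEADER.size + length
    return records
```

The critical data is only overwritten in place after the log record describing the change is durable, so a failure during the overwrite can be repaired by replaying the log.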

The first NVMe mechanism that can be used to ensure data is durably written is the Flush command (Section 6.8). It writes everything in the write cache to non-volatile memory. More specifically, "The flush applies to all commands [...] completed by the controller prior to the submission of the Flush command" (Section 6.8). This means if you want a durable write, you need to submit the write, wait for it to complete, submit the flush, and wait for that to complete. If you submit writes after submitting the flush, but before it completes, they might also be flushed ("The controller may also flush additional data and/or metadata", Section 6.8). Most importantly, if you issue a flush and it fails partway through, there is no guarantee about which writes exist on disk. The disk could have any of the writes, with no relation to the order in which they were submitted or completed. It could also choose to return an error for all the ranges.
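Most software does not submit NVMe commands directly, so the closest everyday analogue of "write, wait, flush, wait" is a write followed by fsync. The Python sketch below shows that sequence. It assumes a POSIX system, and it assumes that the kernel and file system turn fsync into a cache flush on the underlying device, which is typical on Linux but is not something the NVMe specification itself guarantees. The function name is hypothetical.

```python
import os


def durable_write(path: str, offset: int, data: bytes) -> None:
    """Write data at offset and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = 0
        while written < len(data):
            # Step 1 and 2: submit the write and wait for it to complete.
            written += os.pwrite(fd, data[written:], offset + written)
        # Step 3 and 4: request the flush and wait for it to complete.
        os.fsync(fd)
    finally:
        os.close(fd)
```

If os.fsync raises an error, the caller should treat the state of every write since the previous successful fsync as unknown, which mirrors the failed-flush case described above.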

Force Unit Access (FUA)

The second mechanism to ensure durability is to set the Force Unit Access option on Write commands. This means that "the controller shall write that data and metadata, if any, to non-volatile media before indicating command completion" (Section 6.15 Figure 404). In other words, data written with a FUA write should survive power failures, and the write will not complete until that is true. Somewhat surprisingly, you can also specify FUA on a Read command. It forces the referenced data to be flushed to non-volatile media before reading it (Section 6.9, Figure 374). This means you can do a set of normal writes, then selectively flush a small portion of them by executing a FUA read of the data you want committed.
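From user space, the nearest equivalent I know of is opening a file with O_DSYNC, so that each write call returns only after the data is reported durable. The Python sketch below assumes a POSIX system. Whether the kernel implements this with an FUA write or with a normal write followed by a flush depends on the kernel, the file system, and the device, so treat the link to FUA as an assumption. The function name is hypothetical.

```python
import os


def write_through(path: str, offset: int, data: bytes) -> None:
    """Write data at offset using synchronous-data semantics (O_DSYNC)."""
    # Fall back to O_SYNC on platforms that do not define O_DSYNC.
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o644)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            # Each pwrite returns only after the data it wrote is synced.
            written += os.pwrite(fd, view[written:], offset + written)
    finally:
        os.close(fd)
```

This trades a separate flush step for a slower write call, which tends to suit workloads where only a few writes need to be durable.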

Disabling write caching

The last mechanism that may ensure durability is to explicitly disable the write cache. If an NVMe device has a volatile write cache, it must be controllable. This means you can disable it (Section 5.21.1.6). It appears to me that if the cache is disabled, then every write must not complete until it is written to non-volatile media, which should be equivalent to setting the FUA bit on every write. However, this is not clearly described in the specification, and I suspect this is rarely used.
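On Linux, the kernel's view of a block device's cache mode is exposed in sysfs, which gives software a simple way to check whether flushes are needed. The Python sketch below reads /sys/block/<device>/queue/write_cache, which reports either "write back" or "write through". The function names and the sysfs_root parameter are hypothetical. The value reflects what the kernel believes, which may not match how the device is actually configured.

```python
from pathlib import Path


def write_cache_mode(device: str, sysfs_root: str = "/sys/block") -> str:
    """Return the kernel's cache mode string for a block device, e.g. nvme0n1."""
    path = Path(sysfs_root) / device / "queue" / "write_cache"
    return path.read_text().strip()


def needs_flush(mode: str) -> bool:
    """True if the kernel treats the device as having a volatile write cache."""
    if mode == "write back":
        return True
    if mode == "write through":
        return False
    raise ValueError(f"unexpected write_cache value: {mode!r}")
```

For example, needs_flush(write_cache_mode("nvme0n1")) would tell you whether the kernel expects to issue flushes for that device, assuming a device with that name exists.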

Devices with power loss protection

Finally, it is worth pointing out that some disks provide "power loss protection." This means the device has been designed to complete any in-flight writes when power is lost. This can be implemented by providing backup power with a supercapacitor or battery that is used to flush the cache. In theory, these devices should report that they do not have a volatile write cache, so software could detect that and just use normal writes. These devices should ideally also treat FUA writes the same as non-FUA writes, and ignore cache flushes, so software that issues them anyway should lose little. As a result, I think it is best to design software for disks that have caches, since it can then work with any storage device. If you are using a device with power loss protection, you should still get better performance and some additional protection from failures.