Durability: NVMe disks (evanjones.ca)

[ 2020-September-27 10:26 ]

Durability is the guarantee that data can be accessed after a failure. It seems like this should be very simple: either your system provides durable data storage, or it does not. However, durability is not a binary yes/no property, and instead should be defined as the kinds of failures you want your data to survive. Since there is usually some performance penalty for durability, many systems provide a way for only "important" writes to be durable, while "normal" writes will eventually be durable, with no specific guarantee about when. Finally, durability is rarely tested, since really testing it involves cutting the power to computer systems, which is disruptive and hard to automate. Production environments are designed to avoid these failures, so bugs are rarely triggered and hard to reproduce.

I've recently been investigating the durability guarantees in cloud platforms. I decided to start at the beginning: what guarantees are provided by the disks we connect to computers? To find out, I read the relevant sections of the Non-Volatile Memory Express (NVMe) specification (version 2.0), since it is the newest standard for high-performance SSDs. It also has an easy to find, freely available specification, unlike the older SATA or SCSI standards that were originally designed for magnetic disks. In the rest of this article, I will attempt to summarize the durability NVMe devices provide. I believe that most of this should also apply to SATA and SCSI. NVMe was designed as a higher performance replacement for those protocols, so the semantics can't be too different.

[Updated 2021-10-28]: Russ Cox asked on Twitter whether disk sector overwrites are atomic. It turns out that the NVMe specification requires that, at a minimum, writes of a single logical block must be atomic. I've updated this article, and also updated the references to the NVMe 2.0 specification. For more details, see this mailing list post from Matthew Wilcox about the NVMe specification, and this excellent StackOverflow answer.

Ordering and atomicity

Before we can discuss durability, we should discuss some basic semantics of NVMe writes. Commands are submitted to devices using a set of queues. Some time later, the device acknowledges that the commands have completed. There is no ordering guaranteed between commands. From the Command Set Specification Section 2.1.2 "Command Ordering Requirements": "each command is processed as an independent entity without reference to other commands [...]. If there are ordering requirements between these commands, host software or the associated application is required to enforce that ordering". This means if the order matters, the software needs to wait for commands to complete before issuing the next commands. However, read commands are guaranteed to return the most recently completed write (Command Set Section 2.1.4.2.2 "Non-volatile requirements"), although they may also return data from uncompleted writes that have been queued.

A related issue with concurrent updates is atomicity. If there are concurrent writes to overlapping ranges, what are the permitted results? Typically, there are no guarantees. Specifically, "After execution of command A and command B, there may be an arbitrary mix of data from command A and command B in the LBA [logical block address] range specified" (Command Set Section 2.1.4.1.1 AWUN/NAWUN Example). This seems to permit literally any result in the case of concurrent writes, such as alternating bytes from command A and command B.

NVMe includes optional support for atomic writes, with different values for "normal operation" and after power failure. These are defined by the Atomic Write Unit Normal (AWUN) and Atomic Write Unit Power Fail (AWUPF) settings for the device. The couple of NVMe devices I looked at have these values set to zero (according to the nvme id-ctrl command). Somewhat confusingly, this means writes of a single logical block are atomic. The specification defines these values as "0's based" (Base 1.4.2 Numerical Descriptions "A 0’s based value is a numbering scheme in which the number 0h represents a value of 1h [...]"; Command Set 4.1.5.2 I/O Command Set specific fields: "This field is specified in logical blocks and is a 0’s based value"). The device exposes the size of atomic writes so software can configure itself to use it. For example, see the MariaDB documentation about atomic writes. This can replace MySQL's "doublewrite buffer," which is a mechanism that provides atomic writes on devices that don't natively support them (nearly all disks).
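
The "0's based" encoding is easy to misread, so a small worked example may help. The sketch below converts a raw AWUN or AWUPF field value into a size in bytes. The function name and the 512 and 4096 byte logical block sizes are illustrative assumptions, not values taken from the specification or from any particular device.

```python
def atomic_write_bytes(zero_based_field: int, logical_block_bytes: int) -> int:
    """Convert a 0's based count of logical blocks into a size in bytes.

    A 0's based field stores (count - 1), so a raw value of 0 means one
    logical block, 1 means two logical blocks, and so on.
    """
    if zero_based_field < 0:
        raise ValueError("field value must be non-negative")
    if logical_block_bytes <= 0:
        raise ValueError("logical block size must be positive")
    blocks = zero_based_field + 1
    return blocks * logical_block_bytes


# Hypothetical device with 512-byte logical blocks reporting AWUN = 0.
print(atomic_write_bytes(0, 512))   # 512
# Hypothetical device with 4096-byte logical blocks reporting AWUN = 7.
print(atomic_write_bytes(7, 4096))  # 32768
```

So a device that reports zero for these fields still promises atomicity for one logical block, not for zero bytes.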

Basically, NVMe provides "weak" ordering semantics similar to shared memory in multi-threaded programs. There are no guarantees if there are concurrent operations. This means if the order of writes matters, the software needs to submit the commands and wait for them to complete, and never have concurrent writes to overlapping ranges. The specification requires single logical block writes to be atomic. However, I would be nervous to rely on this. It requires very careful reading of the specification to determine that this is required. Older devices did not provide this guarantee. I suspect many devices may have bugs, particularly when power fails.
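
One way software could enforce the "no concurrent overlapping writes" rule is to check a batch of writes before submitting them together. The following is a minimal sketch under that assumption. The function names and the representation of a write as a (starting LBA, block count) pair are hypothetical and are not part of NVMe or any driver API.

```python
def ranges_overlap(a, b):
    """Return True if two (start_lba, block_count) ranges share any block.

    Assumes both block counts are positive.
    """
    a_start, a_count = a
    b_start, b_count = b
    return a_start < b_start + b_count and b_start < a_start + a_count


def overlapping_pairs(writes):
    """Return index pairs (i, j), i < j, of writes whose ranges overlap.

    Writes in a returned pair should not be in flight at the same time:
    the later one should wait for the earlier one to complete.
    """
    pairs = []
    for i in range(len(writes)):
        for j in range(i + 1, len(writes)):
            if ranges_overlap(writes[i], writes[j]):
                pairs.append((i, j))
    return pairs


# Three writes of 8 blocks each, starting at LBA 0, 8, and 4.
print(overlapping_pairs([(0, 8), (8, 8), (4, 8)]))  # [(0, 2), (1, 2)]
```

The quadratic scan is fine for a small batch. A real I/O scheduler would more likely use an interval tree or a sorted list.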

The Flush command

Without special commands, NVMe provides no guarantees about what data will survive a power failure (Command Set 2.1.4.2 "AWUPF/NAWUPF"). My reading is that devices are permitted to return an error for all ranges where writes were "in flight" at the time of failure. If you want to be completely safe, you should avoid overwriting critical data by using write-ahead logging. This matches the semantics I found during power fail testing of SATA magnetic hard drives and SSDs in 2010.
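
To make the write-ahead logging suggestion concrete, here is one possible minimal sketch at the file level. It only appends records and never overwrites existing data, and each record carries a checksum so that a torn record at the tail can be detected after a crash. The record format (a length and a CRC32 followed by the payload) and the function names are assumptions chosen for illustration, not a format described in this article or in the specification. It also assumes the file descriptor was opened with O_APPEND and that os.fsync causes the data to reach the device. It does not handle making the log file's directory entry durable.

```python
import os
import struct
import zlib
from typing import List

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(fd: int, payload: bytes) -> None:
    """Append one checksummed record and wait for fsync to succeed."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    view = memoryview(record)
    while view:
        written = os.write(fd, view)  # may be a short write
        view = view[written:]
    os.fsync(fd)


def read_valid_records(path: str) -> List[bytes]:
    """Return payloads up to the first incomplete or corrupt record."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: stop and ignore the rest
        records.append(payload)
        pos = start + length
    return records
```

On recovery, anything after the last valid record is treated as if it had never been written, which is why the critical data itself is never overwritten in place.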

The first NVMe mechanism that can be used to ensure data is durably written is the Flush command (Base Specification 7.1 "Flush command"). It writes everything in the write cache to non-volatile memory. More specifically, "The flush applies to all commands [...] completed by the controller prior to the submission of the Flush command". This means if you want a durable write, you need to submit the write, wait for it to complete, submit the flush, and wait for that to complete. If you submit writes after submitting the flush, but before it completes, they might also be flushed ("The controller may also flush additional data and/or metadata"). Most importantly, if you issue a flush, and it fails in the middle, there is no guarantee about what writes might exist on disk. The disk could have any of the writes, with no relation to the order they were submitted or completed. It could also choose to return an error for all the ranges.
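
The "write, wait, flush, wait" sequence can be sketched at the operating system level as follows. This is an analogy, not a direct use of NVMe commands: it assumes a Unix-like system where os.fsync on a file is expected to make the kernel flush the device's write cache. Whether that actually becomes an NVMe Flush command depends on the kernel, file system, and device. The function name is hypothetical.

```python
import os


def durable_write(fd: int, data: bytes, offset: int) -> None:
    """Write data at offset, returning only after fsync reports success.

    Step 1: submit the write and wait for it to complete (os.pwrite is
    synchronous, and the loop handles short writes).
    Step 2: submit the flush and wait for it to complete (os.fsync).
    If either step raises, the caller must assume the data may or may
    not be on disk.
    """
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written
    os.fsync(fd)
```

A caller would open the file with os.open, call durable_write(fd, b"...", 0), and only then tell anyone else that the data is safe.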

Force Unit Access (FUA)

The second mechanism to ensure durability is to set the Force Unit Access option on Write commands. This means that "the controller shall write that data and metadata, if any, to non-volatile media before indicating command completion" (Command Set Specification 3.2.6 "Write Command" Figure 63). In other words, data written with a FUA write should survive power failures, and the write will not complete until that is true. There is no ordering with other FUA writes, so you should avoid issuing writes for overlapping ranges. Somewhat surprisingly, you can also specify FUA on a Read command. It forces the referenced data to be flushed to non-volatile media before reading it (Command Set Specification 3.2.4 "Read command" Figure 48). This means you can do a set of normal writes, then selectively flush a small portion of them by executing a FUA read of the data you want committed.
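
Applications usually cannot set the FUA bit directly, but the closest file-level analogue I know of is opening a file with O_DSYNC, which makes each write call return only after the data is reported durable. The sketch below assumes a Unix-like system that defines O_DSYNC. Whether the kernel implements it with a FUA write or with a normal write followed by a flush is an implementation detail that depends on the kernel, file system, and device, so treat this as an illustration of the semantics only. The function name is hypothetical.

```python
import os


def synchronous_write(path: str, data: bytes, offset: int) -> None:
    """Write data so that each write call completes only once durable.

    Uses O_DSYNC as a file-level analogue of a FUA write. Raises OSError
    if the platform does not provide O_DSYNC.
    """
    dsync = getattr(os, "O_DSYNC", None)
    if dsync is None:
        raise OSError("O_DSYNC is not available on this platform")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | dsync, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.pwrite(fd, view, offset)  # durable on return
            view = view[written:]
            offset += written
    finally:
        os.close(fd)
```

Unlike the write-then-flush pattern, no separate flush step follows the write, and other cached writes are not necessarily made durable as a side effect.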

Disabling write caching

The last mechanism that may ensure durability is to explicitly disable the write cache. If an NVMe device has a volatile write cache, it must be controllable. This means you can disable it (Base Specification 5.27.1.4 "Volatile Write Cache"). It appears to me that if the cache is disabled, then every write must not complete until it is written to non-volatile media, which should be equivalent to setting the FUA bit on every write. However, this is not clearly described in the specification, and I suspect this is rarely used.

Devices with power loss protection

Finally, it is worth pointing out that some disks provide "power loss protection." This means the device has been designed to complete any in-flight writes when power is lost. This can be implemented by providing backup power with a supercapacitor or battery that is used to flush the cache. In theory, these devices should show that they do not have a volatile write cache, so software could detect that and just use normal writes. However, these devices should ideally also treat FUA writes the same as non-FUA writes, and ignore cache flushes. As a result, I think it is best to design software for disks that have caches, since it can then work with any storage device. If you are using a device with power loss protection, you should still get better performance and some additional protection from failures.
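
As an illustration of how software could detect whether a device reports a volatile write cache, the sketch below reads the Linux block layer's queue/write_cache attribute in sysfs, which contains either "write back" or "write through". This is an assumption about one platform: it shows the kernel's view of the device rather than querying the NVMe controller directly, and the function names and the example device name are hypothetical.

```python
from pathlib import Path


def parse_write_cache(text: str) -> bool:
    """Return True if the sysfs text reports a volatile (write back) cache."""
    mode = text.strip()
    if mode == "write back":
        return True
    if mode == "write through":
        return False
    raise ValueError(f"unrecognized write_cache value: {mode!r}")


def has_volatile_write_cache(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Check the kernel's view of a block device, e.g. device="nvme0n1"."""
    path = Path(sysfs_root) / device / "queue" / "write_cache"
    return parse_write_cache(path.read_text())
```

Even when this reports "write through", issuing flushes or FUA writes anyway keeps the software correct on devices that do have a volatile cache.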