Durability: Cloud applications



[ 2020-December-28 12:49 ]

When designing applications that store critical data, it is important to ensure the data will be durable, so it is accessible after a failure. To better understand the guarantees provided by modern systems, I previously wrote about the durability guarantees provided by the NVMe disk interface, and by Linux file APIs. In this article, I will describe what I know about durable writes available on cloud infrastructure. In particular, I will look at AWS and Google Cloud, since I know them the best. After reading their documentation carefully, my cloud storage durability recommendations are:

- Instance storage (AWS) / Local SSD (GCP): assume the data is lost if the instance stops or the host fails; use it only for temporary data, or replicate it across zones and back it up.
- Network disks (AWS EBS / GCP Persistent Disk): roughly as durable as a physical server with RAID; take regular snapshots and test restoring them.
- Object storage (S3 / GCS): extremely durable, but use versioning or cross-region backups to protect against bugs, mistakes, and bad actors.

In general, if you have data that really matters, it should be replicated across zones and also backed up periodically. One interesting note is that it is impossible for an external customer to test if the cloud storage options correctly implement the durability they claim to provide. This means we can only really judge based on documentation, observed performance, and experience reports. Now let's look at the available options in more detail.

Instance storage (AWS)/Local SSD (GCP)

The lowest-level storage option available in the cloud is EC2 Instance Storage with AWS and Local SSD with GCP. This is effectively part of a disk that is directly attached to the physical machine running the VM. This should provide the highest performance, since there is no network and fewer devices between your application and the storage medium. However, it also provides the worst durability. Both cloud providers document that applications should be extremely careful when using instance storage. AWS says "local instance store volumes are not intended to be used as durable disk storage". Google says "Local SSDs are suitable only for temporary storage such as caches, processing space, or low value data". Let's ignore these warnings, and pretend we write an application that uses only a single instance storage disk. How durable is our data (assuming we use the correct Linux file APIs)? AWS's documentation states that if there is any issue with the host, such as a power failure or an unscheduled reboot, your data will be lost. Google's documentation has a more precise list but is roughly equivalent. The biggest difference is that Google attempts to use live migration to move your VM to another host for scheduled maintenance, and they clearly document that "If the host system experiences a host error, and the underlying drive does not recover within 60 minutes, Compute Engine does not attempt to preserve the data on your local SSD".
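
As a quick reminder of what "the correct Linux file APIs" means here (covered in detail in the earlier article), the application has to force each write to the device before reporting success. A minimal sketch in Python; the path is just a placeholder for a file on the instance storage volume:

    import os

    def durable_append(path: str, data: bytes) -> None:
        # Append the record, then force it through the OS and device caches
        # with fdatasync() before the write is considered successful.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            while data:
                written = os.write(fd, data)
                data = data[written:]
            os.fdatasync(fd)
        finally:
            os.close(fd)

    # Hypothetical mount point for the instance storage disk.
    durable_append("/mnt/instance-store/journal.log", b"record\n")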

In short, local instance storage provides lower durability than a single physical disk in a physical server. Not only will your data be lost if the disk fails, but your data will also be lost for any other issue that causes the cloud provider to turn off that machine, such as applying a security patch. It is also not clear if your data will survive the machine losing power. AWS's documentation suggests reboots will always lose data, while Google's suggests the data might survive if the host recovers within 60 minutes. In either case, I would not trust instance storage with the only copy of my data.

What about a replicated application using instance storage? AWS's documentation suggests this is a great idea: "instance store volumes backed by SSD [...] are ideally suited for many high-performance database workloads [...] like Cassandra and MongoDB, clustered databases, and online transaction processing (OLTP) systems. [...] applications using instance storage for persistent data generally provide data durability through replication". In my opinion, if you want to rely on replicated instance storage, it must be replicated across different zones. If your data is stored on multiple instance storage disks in one zone, then all of them will be lost if that zone loses power, catches fire, or someone accidentally triggers the fire suppression system. From the history of major public cloud providers, single zone failures are relatively likely (e.g. the AWS eu-central-1 air conditioning and fire suppression issue on 2021-06-11, the AWS us-east-1 power issue on 2019-08-31, the AWS Sydney region issue on 2016-06-05, and the GCP europe-west1-b issue on 2015-08-13). On the other hand, multiple zone failures are exceptionally rare, so it should be possible to build highly available and very durable systems by replicating between zones. I am still a bit nervous about a mistake or misconfiguration taking down all instances and entirely losing all data, so I think backups are critical.

Network storage (AWS EBS/GCP Persistent Disk)

Cloud providers offer "disks" where the data is actually stored on multiple physical devices across the network. The providers call these disks durable. AWS in particular says "EBS volumes are designed for an annual failure rate (AFR) of between 0.1 and 0.2 percent [...] This AFR makes EBS volumes 20 times more reliable than typical commodity disk drives, which fail with an AFR of around 4 percent." Google does not publish their design targets, but similarly claims "Persistent disks have built-in redundancy to protect your data against equipment failure". Google also offers regional persistent disks, which automatically replicate across zones to survive single zone failures.

In my opinion, application data is safer on a cloud network disk than on a single physical disk. There is more than one disk involved, so it will survive the occasional "random" disk failure. However, these disks are not perfect. In particular, single zone failures that cause a large number of physical machines to fail at once do happen. During these incidents, customers have permanently lost disks. I personally think of an application with data on network storage as having roughly the same durability as a physical server with RAID: it will survive many failures, but data will be lost on rare occasions. Both AWS and Google recommend that customers take periodic disk snapshots, so disks can be restored in case of failure. These disk snapshots are fast, easy, and cheap, so I highly recommend doing this for any disks with important data. Just remember to test restoring the snapshots periodically, to make sure you know how to do it, and that they actually work.
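
Snapshots are also easy to automate. As an illustrative sketch (not the only way to do it), the AWS SDK for Python can create an EBS snapshot with a single call; the region, volume ID, and tags below are placeholders. GCP persistent disks have an equivalent snapshot operation and scheduled snapshot policies.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create a point-in-time snapshot of the data volume; restore tests should
    # be run against snapshots like this one on a regular schedule.
    snapshot = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
        Description="nightly backup of data volume",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "purpose", "Value": "backup"}],
        }],
    )
    print("started snapshot", snapshot["SnapshotId"])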

The next level of durability is to store application data on multiple network disks in different zones. These applications should be highly durable, since a failure would have to occur across multiple disks in multiple zones. Cloud providers design zones to fail independently, so this should be an exceptionally rare event. This is an excellent configuration for applications that need the highest availability and durability.

Do you need to use forced disk writes with cloud network disks?

An interesting question is whether applications need to use forced disk writes with cloud network disks to get durable writes. That is, if the application does not send a FUA write or cache flush (see my article about the NVMe interface), will writes survive if the host machine or the entire zone loses power? Unfortunately, only the cloud providers can really answer this question. However, from the information I have found, it appears that both AWS and GCP disks only acknowledge writes once they have been replicated and are considered durable. I found a comment from an AWS engineer that claims "All writes to EBS are durability recorded to nonvolatile storage before the write is acknowledged to the operating system running in the EC2 instance [...] explicit FUA / flush / barriers are not required for data durability". This means writes will survive both the host failing and the entire zone losing power. I have not been able to find similar information for GCP persistent disk, but testing with FUA writes and cache flushes shows that they appear to have nearly zero impact on throughput or latency, so I suspect it is the same. This means that applications could get away with O_DIRECT writes only, and FUA writes and cache flushes are probably unnecessary. However, since there is nearly no performance impact, you might as well use O_DSYNC or fdatasync anyway, since then your application will work correctly on other disks. My guess is that Amazon and Google implement this with the equivalent of battery-backed RAM, just like hardware RAID disk controllers often do.
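
To make the tradeoff concrete, here is a rough sketch of the two options in Python; the paths are placeholders, and the anonymous mmap is only there to satisfy O_DIRECT's buffer alignment requirement. The first call relies on the cloud disk acknowledging writes durably; the second adds O_DSYNC so the same code is also correct on ordinary physical disks.

    import mmap
    import os

    BLOCK = 4096  # O_DIRECT needs block-aligned buffers, offsets, and lengths

    def write_block(path: str, extra_flags: int, payload: bytes) -> None:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | extra_flags, 0o644)
        try:
            buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
            buf.write(payload.ljust(BLOCK, b"\0"))
            os.write(fd, buf)
        finally:
            os.close(fd)

    # O_DIRECT only: no flush or FUA is issued; durability depends on the disk
    # acknowledging writes only once they are durable (as EBS appears to).
    write_block("/mnt/data/direct-only.bin", 0, b"record")

    # O_DIRECT + O_DSYNC: each write also forces the device cache, which costs
    # almost nothing on cloud network disks and is correct everywhere.
    write_block("/mnt/data/direct-dsync.bin", os.O_DSYNC, b"record")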

Object storage (S3/GCS)

The object storage systems (AWS S3, GCP GCS) provide very ambitious durability claims. They store data in multiple zones, and claim they will only lose some astronomically small fraction of objects. This type of storage is basically as good as it gets in terms of durability guarantees. However, these impressive claims do nothing to protect you from bugs, mistakes, or bad actors. If your primary data storage is in one of these systems, you should still consider having backups in a different region with different permissions. Another easy solution is to turn on versioning or the other automatic data retention options, so that deleted or overwritten data can be restored for some period of time. Due to the relatively low cost of storage and impressive durability, object storage systems are great places to store backups or other long-term archival data.
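
As an illustration, enabling versioning on an S3 bucket is a single API call with the AWS SDK for Python; the bucket name below is a placeholder, and GCS has an equivalent object versioning setting.

    import boto3

    s3 = boto3.client("s3")

    # Keep old versions of objects so accidental deletes and overwrites can be
    # undone; pair this with a lifecycle rule to expire old versions eventually.
    s3.put_bucket_versioning(
        Bucket="example-primary-data",  # placeholder bucket name
        VersioningConfiguration={"Status": "Enabled"},
    )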