Lessons from the AWS Kinesis Outage

about | archive

[ 2020-December-01 09:57 ]

I love reading post-mortem analyses, because there is a lot to learn both about how systems are designed, and how they fail. On November 25th 2020, AWS Kinesis had a 17 hour outage that affected other AWS services, and customer services built on top of AWS. The post-mortem is excellent, with a lot of interesting details about how this service, and other services at AWS work. I wrote myself some notes with the lessons I think we can learn from this outage. I would like to point out that this does not mean I think the AWS teams did a bad job in any way. It is very easy to point out weaknesses after something has failed. It is much harder to figure out which of the infinite weaknesses in a system you should prioritize in advance. However, we should try our best to learn from examples like this, to attempt to be able to predict how things will fail in the future, and get better and handling the inevitable failures that do occur.

Check limits/quotas carefully
Robust systems need limits on all critical resources, so that unexpected events cause specific requests to fail, rather than crash the entire system. However, these limits are also frequent causes of outages. In this case, Kinesis ran into an operating system thread limit. It is very hard to know in advance which limits matter and which do not, since there are probably thousands of limits that might apply in a software system like this. However, this is a good reminder that a useful reliability exercise is to consider what limits a system might run into, and at least ensure you can estimate if you might be approaching these limits. When adding new limits, it is also worth thinking about how to report or warn operators that a system is approaching these limits, so that it can be investigated before it is a critical problem.
O(N2) resource usage is potentially dangerous
In this Kinesis incident, one front-end server has at least one thread for each other front-end server, which is O(number of servers). However, in aggregate, the usage is O(N2). Another way to think about this is adding one front-end increases the number of threads on every other server. I generally support using the simplest possible implementation (e.g. see my article about the unreasonable effectiveness of linear search). However, any resource usage that scales worse than linearly should be monitored carefully, since it is likely to fail catastrophically with a modest amount of growth. Amazon's solution to this is "cellularization", which is effectively creating a two-level hierarchy. I suspect inside each cell, there will still be all-to-all communication. However, the cells are size limited. At the next level, you could repeat the same all-to-all pattern, except only once for each cell, which reduces N by a large factor. This is a bit similar to how Internet routing scales. In that case, local Ethernet networks use broadcast to find where to send messages, but multiple Ethernet networks are connected by routers, where only the routers talk to each other.
Be able to deploy and revert quickly
The most common cause of outages is change, and the most consistent tool for fixing them is undoing that change. These operations should be fast, so that you can reduce the mean time to repair (MTTR) of outages, by either quickly rolling back changes, or being able to deploy a bug fix or corrected configuration change. For stateful services, making deploys or restarts work quickly can be very hard, but it is worth trying. In the case of Kinesis, "It takes up to an hour for any existing front-end fleet member to learn of new participants". This seems related to the fact that restarting everything "would be a long and careful process". This meant it took about 17 hours to recover from this outage. If instead a fleet-wide restart took 1 hour, they would have been able to recover the service by around 10 AM PST instead of 10 PM PST, reducing a 17 hour outage to 5 hours. It is also worth noting that deploying too quickly can also be problematic, since it can mean you deploy a bug everywhere, instead of only affecting a small portion of your service. However, in case of emergencies, it is useful to have the ability to rush out a change.
Fallback mechanisms are dangerous
The Amazon Cognito service also failed, because it sends data to Kinesis. Interestingly, they had a fallback mechanism: it buffers data locally, any only discards data when the local buffer is full. This is a classic fallback mechanism, which is only used in the rare case of failure. It seems like adding this mechanism should improve the reliability of a system. However, fallback mechanisms are extremely dangerous because they are almost never used. This means they are much more likely to have bugs. In this case, when the buffer is full, the code using the buffer blocked forever, which took down the service. If Cognito had not built this buffer, it probably would have had to immediately discard data on any Kinesis error. This would probably mean that Cognito would be more likely to drop data in general, as it would not tolerate any small Kinesis delays or errors. However, it would have been less engineering effort, and would have avoided this outage entirely. If you really must have a fallback, you should make sure you test it continuously and automatically. If you can only test a rare failure scenario manually, then inevitably bugs will creep in as the code evolves and people forget to run the tests. Ironically, the AWS "builder's library" has a great article about avoiding fallback and "why we almost never use them at Amazon." This specific error involved a queue, which is a frequent cause of a related issue: in normal operation, queues are empty and delays are low. However, when issues occur, the queues fill up, which causes significant delay. The challenges of dealing with the bimodal behaviour of queues are described in detail in another good AWS Builder's Library article.
Critical systems must be as simple as possible
Amazon's status page continued to work during this incident, but Amazon had trouble updating it, because the workflow that updates it uses Cognito. A system which is intended to work even when other parts the system is down, like a status page, needs to be as simple as possible. The fact that the status page was still operating correctly is excellent. In particular, the serving system seems to be separate from the update system. This is an example of robust design, and mechanisms functioning as intended. However, the fact that the update system depends on Cognito is potentially problematic. In particular, this means the update system can't be used to report errors in Cognito itself. It is worth being aware of what dependencies systems like this have, so you can decide if this dependency is acceptable or not.
Practice and streamline any process that people may need during an outage
Amazon could not update the status page, since updating it depended on a system that was down. They had a manual alternative process that explicitly avoided this dependency, however "we encountered several delays [...] as it is a more manual and less familiar tool for our support operators". This is both a good and bad example. They knew the normal system had a high risk of being broken during certain types of outages, and had designed a fallback. Unfortunately, the fallback did not work as intended (because fallbacks are dangerous, see above). If a fallback involves people, you need to regularly practice this kind of failure, to ensure everyone on your team knows how to find the documentation, and can execute the necessary steps. This is like regularly testing backups by actually restoring them. If you don't, you are highly likely to find that your backups do not work, or require a very long time to restore.
Don't forget to scale up as well as out
Historically, the primary way to handle more users or work was to move an application to a bigger computer. In the last ~10-20 years, the pendulum has swung the other way, with most systems designed to run on multiple separate computers. In many cases, this means we are running individual instances of our applications with too few resources. When your application gets big enough, make sure you re-evaluate the resources you are giving each individual instance. Managing a small number of large instances can be more efficient than large numbers of small instances. It may also be more reliable, because you are less likely to hit some scaling limits. This is less important than the other points, but for both cost and reliability reasons, it is worth periodically re-evaluating the resources you are giving your largest applications. Running a smaller number of bigger instances would have avoided this outage, or at least deferred it. The post-mortem notes the team was already working on cellularization, so in an ideal world, running on larger instances could have deferred this outage until the cellularization was completed.