Model Checking a Paxos Implementation



[ 2009-April-15 13:57 ]

I find distributed consensus algorithms, such as Viewstamped Replication or Paxos, to be somewhat magical. Given a set of processes communicating over a network, they will either all eventually agree on the same value, or they will fail to reach agreement if the network is failing in arbitrary ways. These algorithms are used to build reliable distributed systems using replication. However, they are notoriously tricky to implement correctly. Subtle bugs that only occur in rare situations with unusual sequences of failures are easy to introduce. In order to build a replicated distributed system that can survive failures, the replication protocol effectively needs to be bug free. Thus, it is worth spending considerable effort to verify that the implementation is correct.

One approach to validating software is model checking. Model checking takes a system and gives it all possible combinations of inputs, then verifies that it does not violate some safety properties. This is similar to random testing, which I have written about before, just with more intelligence. For consensus algorithms, the safety property is that every participant either has not yet agreed on a value, or has agreed on the same value as all the others. I decided to see if I could take a simple implementation of the plain Paxos algorithm and exhaustively check it via a very naive model checker. I should stress that there is lots of academic work in this area, and that I don't claim my approach is particularly good. However, it provides an interesting and easily understood case study.
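To make that property concrete, the check at the end of each simulated execution can be as small as the following sketch. This is my own illustration, not code from my implementation; it assumes each participant's decision is recorded as a value, or None if it has not agreed on anything yet.

    def check_consensus_safety(decisions):
        """Safety holds if no participant has decided, or if every participant
        that has decided chose the same value."""
        decided = [value for value in decisions if value is not None]
        return len(set(decided)) <= 1

For example, check_consensus_safety(["x", None, "x"]) returns True, while check_consensus_safety(["x", "y", None]) returns False.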

Results

My motivation originally came from reading Paxos Made Live, which describes how Google used a simulator to stress test their Paxos implementation with random failures. For a class project, I tested my own Paxos implementation with a similar approach, which was very effective in finding bugs. However, I found that many of the random inputs my simulator chose were actually equivalent. Thus, if I could be smarter about how to choose inputs, the random simulator would be more effective, as it would test more hard-to-reach corner cases. This led me to wonder how large a system I could exhaustively check, and how I could be smarter about the choice of messages to deliver. So I wrote a very simple Paxos implementation in Python, with two known bugs, along with a simple simulator to exhaustively test it.

The first bug was easy to find. It only requires five processes (three acceptors, two proposers), and my system can exhaustively test this configuration in about 20 minutes. The second bug is much harder: it requires six processes (three acceptors, three proposers), and my system only finds the problem after running for 30 hours. I gave up on exhaustively checking this configuration after 150 hours (6.5 days) and 615 million execution traces; based on its progress, I estimate that exhaustively testing it would take approximately a month on one 3 GHz Xeon CPU. I also tried checking random execution paths, but that left the bug undetected after running for a day.

This reveals the weakness of model checking: there are simply too many possible orderings of events to search the space exhaustively for anything beyond trivial systems. Certainly, a combination of better algorithms and a better implementation (particularly one not in Python) could exhaustively search my simple Paxos implementation in a reasonable amount of time. However, it will always be infeasible to exhaustively test larger systems. That said, this is still a very useful tool, as I have written before. Larger systems are likely to have more bugs, so even simple randomized testing is useful. My experience suggests that designing software so it can be easily tested in this fashion is worthwhile. Unfortunately, if the checker doesn't find any bugs, that does not mean that none exist.

Details: Searching for Bugs

My model is very simple. Each process is a Python class which handles two methods: start, called when the system starts executing, and messageReceived, called when a message is delivered to the process. While executing those methods, the Python code can send messages to other processes. There is also a validator method, called at the end of an execution trace to verify that the state of the system is correct. The simulation assumes nothing about the state of the processes or the contents of the messages.
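To illustrate, a process in this model looks roughly like the sketch below. The method names come from the description above, but the exact signatures and the convention of returning (destination, message) pairs to send are my assumptions, not the actual code from the download.

    class Process(object):
        """Sketch of a process in the simulated system (hypothetical signatures)."""

        def start(self):
            # Called once when the system starts executing.
            # Returns a list of (destination, message) pairs this process wants to send.
            return []

        def messageReceived(self, message):
            # Called when the simulator delivers a message to this process.
            # Updates local state, then returns more (destination, message) pairs to send.
            return []

A concrete proposer or acceptor would subclass this and implement the Paxos message handling; the validator then inspects the processes' final state at the end of each execution trace.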

My first implementation was very simple: a depth-first search over all possible message interleavings. This works only for very simple protocols with very few processes, because it visits many interleavings that are identical. For example, if message a goes to process 1 and message b goes to process 2, the interleavings (a, b) and (b, a) lead to the same result, so only one of them needs to be visited.
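In sketch form, that naive search looks like the code below. Again, this is my own illustration rather than the simulator from the download; it assumes messageReceived returns the messages the process wants to send, as in the interface sketch above, and that the initial pending list is built from each process's start() messages.

    import copy

    def search_all_interleavings(processes, pending, validate):
        # processes: list of Process objects, indexed by process id.
        # pending: list of (destination, message) pairs not yet delivered.
        # validate: called on the processes at the end of each execution trace.
        if not pending:
            validate(processes)
            return
        for i, (destination, message) in enumerate(pending):
            # Each branch explores its own copy of the world.
            branch = copy.deepcopy(processes)
            remaining = pending[:i] + pending[i + 1:]
            sent = branch[destination].messageReceived(message)
            search_all_interleavings(branch, remaining + list(sent), validate)

The number of interleavings grows factorially with the number of pending messages, which is exactly why the redundant orderings matter.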

My next implementation is slightly smarter. It is based on the observation that once you have delivered message a to process 1 and explored all possible states under that choice, the only way to reach a new state is to deliver some other message to process 1 as the next message. This dramatically reduces the search space. However, it might be necessary to deliver messages to other processes in order to generate a different message for process 1, so the search still wastes time exploring redundant parts of the state space before determining that it isn't possible to deliver some other message first. For one configuration of my Paxos implementation, it finds 7 648 unique executions to validate, but examines 68 029 partial execution traces to find them.

To avoid exploring redundant parts of the state space, I implemented a more intelligent algorithm: dynamic partial order reduction. The basic strategy is clever: execute the system, then examine the execution trace for points where a different choice could have been made that would lead to a different state. This visits only unique partial orders of messages, which is exactly what we want. Unfortunately, examining the trace is more complicated than the simple "don't deliver these messages next" rule of my previous implementation, so this version actually ends up being slightly slower. Tracking which states have already been visited also becomes more complicated, and due to a simplification I made, my implementation still visits some redundant states. However, for the same 7 648 unique executions, it explores only 8 601 traces, which is still an improvement.

Code

The download below includes the Mercurial repository, if you want to examine the development history. It contains both the Paxos implementation (with the bug!) and the simulator, which supports both search strategies.

pypaxos.tar.bz2

Useful Reading