LMAX Disruptor: Fast Concurrent Ring Buffer



[ 2011-August-04 10:32 ]

LMAX, a trading firm, has released the Java concurrent ring buffer they call the Disruptor (the name is a play on the Java Phaser). Martin Fowler wrote a long, detailed overview of the architecture, and the project page has a brief, understandable technical paper that is a good introduction. The basic summary: they have very carefully engineered a ring buffer to pass data between threads in Java without using locks or creating garbage. As a result, it gets very good performance, if this communication pattern matches your needs. Anyone interested in high-performance code, and high-performance Java in particular, should read the white paper. It contains many interesting lessons, such as paying attention to cache lines and avoiding garbage collection.
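To make the core idea concrete, here is a heavily simplified single-producer/single-consumer ring in Java. This is my own sketch of the technique (preallocated slots plus monotonic sequence counters instead of locks), not the actual Disruptor API, and it omits all of their cache-line padding and batching tricks:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer sketch. Slots are
// preallocated, so no garbage is created per message, and two sequence
// counters replace locks. Illustration only, not the Disruptor API.
final class SpscRing {
    private static final int SIZE = 1024;  // must be a power of two
    private static final int MASK = SIZE - 1;
    private final long[] slots = new long[SIZE];  // preallocated storage
    private final AtomicLong published = new AtomicLong(-1);  // last slot the producer published
    private final AtomicLong consumed = new AtomicLong(-1);   // last slot the consumer finished

    // Producer thread only: claim the next slot, write it, then publish.
    void put(long value) {
        long seq = published.get() + 1;
        while (seq - consumed.get() > SIZE) {
            Thread.yield();  // ring is full; wait for the consumer
        }
        slots[(int) (seq & MASK)] = value;
        published.lazySet(seq);  // release store: makes the slot write visible
    }

    // Consumer thread only: wait for the next published slot, then read it.
    long take() {
        long seq = consumed.get() + 1;
        while (published.get() < seq) {
            Thread.yield();  // ring is empty; wait for the producer
        }
        long value = slots[(int) (seq & MASK)];
        consumed.lazySet(seq);  // frees the slot for reuse by the producer
        return value;
    }
}
```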

My minor complaint is that the Disruptor has been engineered for systems where there is a one-to-one mapping between threads and cores. You can see this in the way that producers add values to the ring: they first reserve a slot, then later publish it. If there are two publishers and one pauses in the middle (e.g. because the operating system or virtual machine does a context switch), it stalls all consumers. A second example: if you want to distribute work across a pool of threads, the Disruptor can't help. The FAQ suggests distributing work modulo the number of workers, but what happens when one work unit is much more expensive than the others for some reason? It ends up stalling all the future work assigned to that worker. ArrayBlockingQueue, which they clobber in their benchmarks, is much more forgiving.
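For contrast, here is the forgiving pattern sketched in Java: a pool of workers all pulling from one shared ArrayBlockingQueue, so an expensive work unit only occupies the worker that happens to pick it up while the others keep draining the queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuePool {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);

        // Each worker takes the next item whenever it is free. With modulo
        // assignment, everything hashed to a slow worker would instead wait
        // behind the expensive item.
        for (int i = 0; i < 4; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) queue.take().run();  // blocks when empty
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // shut down cleanly
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        queue.put(() -> System.out.println("cheap task"));
        queue.put(() -> { /* an expensive task delays only its own worker */ });
        Thread.sleep(100);  // let the daemon workers run before main exits
    }
}
```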

My point is that even though their ring buffer is very well designed and tuned for high performance, it isn't an appropriate choice for all applications. However, if you have producer/consumer queues in your code, it is worth checking whether the Disruptor matches your needs, and anyone interested in high-performance code should read their paper and/or blog posts.

There is also one minor technical point they discuss in the white paper in a slightly misleading way. The paper states that "[mutex] arbitration is achieved by a context switch to the operating system kernel." This is not strictly incorrect; however, it is important to note that this happens only when the mutex is contended, and the details depend on the implementation. Uncontended pthread mutexes on Linux and uncontended Java synchronization acquire and release the lock with the x86 compare-and-swap instruction (CAS), without involving the operating system. This is very inexpensive and efficient. However, if the mutex is already held, or there are other threads waiting, then it is true that the kernel gets involved, although some implementations first spin for a short time in an attempt to avoid context switches.
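Here is a sketch of what that fast path looks like in Java, loosely following the shape of Drepper's futex-based mutex. The names and the spinning slow path are mine; a real implementation would call futex() to sleep and wake waiters:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a futex-style mutex. In the uncontended case, lock() and
// unlock() are each a single atomic operation in user space; only the
// contended path (stubbed out here) would need the kernel.
final class SketchMutex {
    // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters
    private final AtomicInteger state = new AtomicInteger(0);

    void lock() {
        if (state.compareAndSet(0, 1)) {
            return;  // uncontended fast path: one CAS, no kernel involvement
        }
        // Slow path: mark the lock contended and wait. A real futex mutex
        // would call futex(FUTEX_WAIT) here; we just spin for illustration.
        while (state.getAndSet(2) != 0) {
            Thread.yield();
        }
    }

    void unlock() {
        if (state.getAndSet(0) == 2) {
            // There may have been waiters: a real implementation would call
            // futex(FUTEX_WAKE) here. Our spinning waiters notice on their own.
        }
    }
}
```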

In other words: while it is true that locks are expensive, they aren't as expensive as the small microbenchmark they present makes it seem. They show the following results for incrementing a 64-bit integer 500 million times:

Implementation                    | Milliseconds for 500 million increments
----------------------------------|----------------------------------------
Single thread                     | 300
Single thread with lock           | 10,000
Two threads with lock             | 224,000
Single thread with CAS            | 5,700
Two threads with CAS              | 30,000
Single thread with volatile write | 4,700

This microbenchmark makes locks look horrible, since they take roughly twice as long as a volatile write or a CAS, and 33 times longer than the unprotected increment. I did a brief test of the single-thread cases and got similar results. As a sanity check, it is logical that a single thread with a lock would be approximately twice as expensive as a CAS, since acquiring and releasing the lock requires two CAS operations (see Ulrich Drepper's implementing mutexes on Linux). However, it is important to remember that this test shows an increment, which can be performed with a single instruction and synchronized with a single atomic operation. More complex data structure manipulations, such as rebalancing a binary tree, are far more expensive, so the relative overhead of synchronization will be lower. Additionally, some operations will be extremely difficult to implement with primitive atomic operations. While locks aren't exactly trivial to use, they are much, much easier than lock-free algorithms.
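Something like the following reproduces the single-threaded cases from the table; it is close to the sanity check I ran, though absolute numbers vary with the JVM and hardware, and a serious version would add warm-up iterations:

```java
import java.util.concurrent.atomic.AtomicLong;

// Rough single-threaded reproduction of the increment microbenchmark.
public class IncrementBench {
    static final long N = 500_000_000L;

    static long plain;                               // no synchronization
    static long locked;                              // guarded by 'lock' below
    static final Object lock = new Object();
    static final AtomicLong cas = new AtomicLong();  // incremented via CAS
    static volatile long vol;                        // volatile write per iteration

    static void report(String name, Runnable body) {
        long start = System.nanoTime();
        body.run();
        System.out.printf("%-10s %d ms%n", name, (System.nanoTime() - start) / 1_000_000);
    }

    public static void main(String[] args) {
        report("plain", () -> { for (long i = 0; i < N; i++) plain++; });
        report("lock", () -> { for (long i = 0; i < N; i++) { synchronized (lock) { locked++; } } });
        report("CAS", () -> { for (long i = 0; i < N; i++) cas.incrementAndGet(); });
        report("volatile", () -> { for (long i = 0; i < N; i++) vol++; });
        // Print the counters so the JIT can't eliminate the loops entirely.
        System.out.println(plain + " " + locked + " " + cas.get() + " " + vol);
    }
}
```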