Code that debugs itself: Fixing a deadlock with a watchdog



[ 2018-May-01 11:04 ]

[Originally posted on the Bluecore Engineering blog]. Rare, hard-to-reproduce bugs are the worst. We recently fixed a deadlock that had plagued our code for nearly a year, but only showed up about once a month. We would notice something was stuck, debug it for a bit, then finally restart the process, and the problem would disappear. Eventually something changed and it started happening two or three times a day, which meant we needed to fix it. The technique that found the cause was adding a thread that constantly checks for a deadlock and reports it when one is found. It turns out it is fairly easy to have a process debug itself, which is useful when traditional tools aren’t working.

The problem occurred in a Python web application running on Google App Engine. Occasionally, a request seemed to get “stuck”: it logged that it started, but then never completed. Eventually, the requests would time out and get killed by App Engine. We suspected we had a deadlock, since locking bugs are a common cause of rare hangs. However, the stack traces printed when the requests were killed came from all over our application. We expected that if we had a deadlock, the stacks should be consistent and show a thread waiting on a lock somewhere. The only hint was that about half the traces were in the Python standard library logging module, but with no consistent pattern.

We started by reading the Python logging source code. The module has a single lock that is held while formatting log entries, which seemed promising. To cause a deadlock, we need a cycle of threads waiting on each other. Our theory was that two threads were waiting on each other, as shown in the figure below.

[Figure: Thread A acquires logging._lock, then tries to acquire other_lock; Thread B performs the same operations in the opposite order.]

In the figure, Thread A calls the logging library and holds logging._lock. At the same time, Thread B locks a lock named other_lock. Next, Thread A tries to lock other_lock, maybe because it calls a custom log formatter or log writer. It has to wait. Then, Thread B tries to log something, which requires the logging lock. This creates a cycle between Thread A and Thread B, so they wait forever. Deadlock! Unfortunately, we couldn’t find anything in the logging code that might grab another lock. This seemed like a dead end.
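
For reference, this kind of cycle is easy to reproduce outside our application. Here is a minimal, self-contained sketch of the pattern in the figure; the lock names mirror the diagram, and none of this is our production code:

    import threading
    import time

    logging_lock = threading.Lock()  # stands in for logging._lock
    other_lock = threading.Lock()    # some other lock in the application

    def thread_a():
        with logging_lock:           # A: take the logging lock first
            time.sleep(0.1)          # give B time to take other_lock
            with other_lock:         # A: now wait for other_lock
                pass

    def thread_b():
        with other_lock:             # B: take other_lock first
            time.sleep(0.1)          # give A time to take logging_lock
            with logging_lock:       # B: now wait for the logging lock
                pass

    a = threading.Thread(target=thread_a)
    b = threading.Thread(target=thread_b)
    a.start()
    b.start()
    a.join()  # never returns: A holds logging_lock and waits for other_lock,
    b.join()  # while B holds other_lock and waits for logging_lock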

The other hint was that many stack traces were in “startup” code, which executes on the first few requests to a new instance. Searching the Internet for things like “Python startup deadlock” revealed that Python 2 has a module import lock and there have been issues with deadlocks if code that runs at import time tries to grab a lock. At the time, this didn’t seem relevant, since we didn’t see imports in the stack traces.
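
To make the import lock concrete, here is a small Python 2 demonstration (Python 3.3+ replaced the single global lock with per-module locks, so it behaves differently there). While any thread holds the import lock, an import statement in any other thread blocks until the lock is released, even for a module that is already loaded:

    # Python 2 only: there is a single global module import lock.
    import imp
    import threading

    def background_import():
        import json  # blocks: another thread is holding the import lock

    imp.acquire_lock()  # pretend this thread is in the middle of an import
    t = threading.Thread(target=background_import)
    t.start()
    t.join(2.0)
    print('background import still blocked: %s' % t.is_alive())  # True
    imp.release_lock()  # now the background import can proceed
    t.join()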

We were starting to get desperate. While brainstorming, someone joked that we should just dump all the thread stacks every second. We would “just” need to find the right one out of the millions of stacks it would record. That crazy, impossible idea made us wonder: could we detect when the requests are stuck and log the thread stacks at precisely the right moment?

In general, detecting deadlocks correctly is challenging: you need to know what every thread is waiting for, which requires adding instrumentation to the lock implementation. However, we didn’t need to be 100% correct. If we detected requests that seemed to be stuck for a long time, say 10 seconds, it could be because of our bug. This is not perfect: if a thread is waiting for something that is making progress (e.g. sleep(15)), we will incorrectly think it is deadlocked. However, we should not miss any real deadlocks, as long as the check runs before the request timeout kills the stuck requests. That is, this approach could have “false positives” but no “false negatives.”

To implement this, we added a background thread that dumped all thread stacks every 10 seconds, using Python’s sys._current_frames() function. It then compared the current stacks to the last stacks it captured. If a thread appeared to be in the same place, it was suspicious and we logged the state of the program. This is a sort of “watchdog timer”, constantly searching our program for bugs. We tested it by manually causing a deadlock, then put it in production.
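
Here is a minimal sketch of such a watchdog. The structure is the important part; the names (CHECK_INTERVAL_SECONDS, the stderr report) are ours, and the real version logged more of the program’s state:

    import sys
    import threading
    import time
    import traceback

    CHECK_INTERVAL_SECONDS = 10

    def watchdog():
        last_positions = {}
        while True:
            time.sleep(CHECK_INTERVAL_SECONDS)
            frames = sys._current_frames()  # thread id -> that thread's current stack frame
            for thread_id, frame in frames.items():
                if thread_id == threading.current_thread().ident:
                    continue  # skip the watchdog thread itself
                # A thread sitting on the exact same line as it was 10 seconds
                # ago is suspicious: report its stack.
                if last_positions.get(thread_id) == (frame.f_code, frame.f_lineno):
                    stack = ''.join(traceback.format_stack(frame))
                    sys.stderr.write('watchdog: thread %d looks stuck:\n%s\n'
                                     % (thread_id, stack))
            last_positions = dict((tid, (f.f_code, f.f_lineno))
                                  for tid, f in frames.items())

    watchdog_thread = threading.Thread(target=watchdog)
    watchdog_thread.daemon = True  # never keep the process alive just for the watchdog
    watchdog_thread.start()

Where that report goes turns out to matter more than you might expect, as the next few paragraphs explain.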

Unfortunately, we saw requests getting stuck, but nothing in the logs. We spent a number of hours retesting and convincing ourselves that it should be getting triggered. Eventually, we realized a possible problem: If the deadlock was in the logging library, using logs would cause the watchdog to also get stuck! We changed the code to write the debugging output to a separate service (Google Cloud Task Queues).

Within a few hours of deploying this change, we had our answer: the code was stuck while logging. Mysteriously, requests were always stuck on this line of code, deep in Google App Engine’s logging service:

bytes_left -= 1 + group.lengthString(num_bytes)

This didn’t make any sense: how could this deadlock? It is just calling what seems to be a simple function and doing some math. Maybe the stack traces were inaccurate and it was actually stuck somewhere near this line? Eventually, we remembered the module import lock and the fact that our stuck requests usually happened at startup. What happens if we call the App Engine logging code while holding the module import lock (e.g. imp.acquire_lock())? Our test program immediately hung at the same “bytes_left -= 1 …” line.

With a way to reliably reproduce the problem, we finally understood the cause, shown in the figure below.

Some of our functions contain import statements, to avoid the circular import errors we got when the import was in its usual place at the top of the file. The first time such a function is called, the import is executed, which holds the import lock (Thread A steps 1-3). If the code being imported logs anything, it then tries to grab the logging lock (Thread A step 4).

At the same time, a different thread calls the logging code and acquires the logging lock (Thread B step 1). The log handler then attempts to send the logs to the App Engine logging service, calling the line of code above. The group.lengthString() function is implemented in a native C library. For some reason, this library tries to acquire the import lock (Thread B step 3). At this point, we have a deadlock. Since the code trying to acquire the import lock is a native code library, we can’t “see” it from Python. Instead, all we see is that it is stuck on a function call, which explains how the code was stuck on the line above.
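
The application-side half of that cycle (Thread A in the figure) is just the ordinary deferred-import pattern. A sketch, with hypothetical file and module names, both shown in one listing:

    # app.py (hypothetical)
    def handle_request():
        # Imported here rather than at the top of the file to break a circular
        # import. The first call executes the import, so reports.py below runs
        # while this thread holds the module import lock.
        from reports import build_report
        return build_report()

    # reports.py (hypothetical)
    import logging

    # Module-level code runs during the import, i.e. while the import lock is
    # still held. In our application, a call like this is what then needs the
    # logging lock (Thread A step 4).
    logging.info('loading reports module')

    def build_report():
        return 'report'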

So why did the original stack traces from the requests killed by App Engine not reveal this problem? When App Engine kills one of the requests in the deadlock, the other request is unblocked and continues. However, it is very likely to also time out and get killed shortly afterwards, printing a stack trace from wherever it happens to be, which is what confused our debugging. The line of code above did in fact appear fairly frequently in the stuck traces, but we had dismissed it: it wasn’t obviously waiting on anything, so it didn’t seem like it could possibly cause a deadlock.

This was one of the most difficult bugs we have fixed. In retrospect, we could have saved a lot of time if we had realized how easy it can be to add code that debugs itself. This is applicable to more than just deadlocks. For example, if your server is crashing on rare bad requests, you can record requests in a circular buffer, then dump the buffer on a crash (sketched below). However, we also learned some other lessons for how to fix really hard bugs:

In the end, remember that every bug in a computer system has a cause, and you can figure it out if you have the patience to keep digging.
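
For the circular buffer idea mentioned above, here is a minimal sketch using collections.deque; the handler shape and names are ours, purely for illustration:

    import collections
    import sys
    import traceback

    # Remember the last 100 requests; older entries fall off automatically.
    recent_requests = collections.deque(maxlen=100)

    def handle(request):
        recent_requests.append(request)
        try:
            return do_work(request)
        except Exception:
            # On a crash, dump the recent history alongside the traceback, so the
            # rare bad request is captured even if it arrived a few requests ago.
            sys.stderr.write('crashed on %r\n%s\nrecent requests: %r\n'
                             % (request, traceback.format_exc(), list(recent_requests)))
            raise

    def do_work(request):
        # Stand-in for the real request handler.
        if request == 'bad':
            raise ValueError('rare bad request')
        return 'ok'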

For more details about the bug and how we worked around it, see the test program we wrote to report this issue to Google.