Systematically Exposing OS Kernel Races – An Interview with Ben Blum

By Jakob Engblom

Full-system simulators like Simics provide unparalleled insight into what is going on in a target system. Indeed, better insight is one of the main benefits of simulation, regardless of what we simulate and how. In addition, if we want to, we can also exert control over the target system to make it take different execution paths than it otherwise would. Earlier this year, Ben Blum at Carnegie Mellon University (CMU) presented a Master's thesis that provides a very good example of just what can be achieved by combining the insight and control of a simulator with intelligence and domain knowledge. The system is called Landslide, and it is used to expose race conditions inside operating-system kernels.

Landslide systematically explores the possible execution paths of the kernel in order to provoke latent bugs that might only happen very rarely in actual use of the operating system – but as we all know, such glitches are the ones that tend to be found by customers in critical situations and force engineering to spend months in reproduction attempts.

For this blog post, I interviewed the creator of Landslide, Ben Blum. 

Jakob Engblom: Please introduce yourself!


Ben Blum: I'm Ben Blum, a graduate student in the Computer Science Department at Carnegie Mellon University (CMU), advised by Garth Gibson. I did this work last year as part of the department's Fifth Year Master's program, and I am now staying on for several more years for a Ph.D.

JE: What was the topic of your thesis?

BB: Systematic testing (or sometimes "exploratory testing" or "systematic exploration") is the idea of testing for race conditions in a concurrent system by forcing the system to execute as many different thread interleavings as possible.

Normally when you run a concurrent system, its threads run in an unpredictable pattern (governed by timer events or other device interrupts), and only one possible interleaving out of many is executed. By contrast, when you use conventional stress testing, you exercise a random set of thread interleavings, often a very small subset of the whole execution space. A systematic testing tool tries to enumerate and explore the whole state space of interleavings.
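To make that state space concrete, here is a small, self-contained sketch (not part of Landslide) that simply enumerates every interleaving of two threads that each take a fixed number of abstract steps:

    #include <stdio.h>

    #define STEPS 3  /* steps per thread; interleavings = C(2*STEPS, STEPS) = 20 */

    static long count = 0;

    /* Recursively enumerate every order in which thread A's and thread B's
     * remaining steps can be interleaved, printing each schedule. */
    static void explore(int a_left, int b_left, char *sched, int depth)
    {
        if (a_left == 0 && b_left == 0) {
            sched[depth] = '\0';
            printf("%s\n", sched);
            count++;
            return;
        }
        if (a_left > 0) {
            sched[depth] = 'A';
            explore(a_left - 1, b_left, sched, depth + 1);
        }
        if (b_left > 0) {
            sched[depth] = 'B';
            explore(a_left, b_left - 1, sched, depth + 1);
        }
    }

    int main(void)
    {
        char sched[2 * STEPS + 1];
        explore(STEPS, STEPS, sched, 0);
        printf("%ld interleavings of 2 threads x %d steps each\n", count, STEPS);
        return 0;
    }

Two threads of n steps each already give C(2n, n) interleavings (20 for n = 3); add threads or steps and the space explodes combinatorially, which is exactly the cost Ben returns to below.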

This approach is very powerful: it can find arbitrarily complicated race conditions, including TOCTTOU (Time of Check To Time of Use) bugs. Lighter-weight tools such as data race detectors or static analyses are often powerless against these, since simply adding locks around each access might satisfy such a tool but not actually solve the race.

But systematic testing is also very computationally expensive – you might be thinking "exponential explosion of possibilities", and you'd be right. A big part of the research involves figuring out how to reduce the coverage (and/or granularity) of possible interleavings in a way that still gives meaningful results. Sometimes we use sophisticated pruning algorithms; sometimes we use simple heuristics.

JE: How did you use Simics?

BB: I wrote my tool, called Landslide, as a Simics module. When Simics boots up a kernel, it also loads Landslide, which then gets to monitor and control the kernel's execution.

Being in a simulated environment gives a lot of power: Landslide sees every instruction and memory access the kernel runs, and so keeps fairly detailed internal representations of the kernel's state. There is about an order of magnitude slowdown in doing this (compared to regular Simics simulation speed), but it gets to know (for example) when each kernel thread is runnable and whether heap accesses are invalid.          

JE: And just what is Landslide and how does it work?

BB:  During the first run of a test case, Landslide identifies "decision points" – instructions during the execution at which it thinks that if the kernel had context switched to another thread, something interesting might have happened (the set of decision points defines the state space of thread interleavings). Annotations in the kernel (which I call "tell_landslide()") also help Landslide track what the kernel is doing (especially thread lifecycle events).
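To give a flavor of what such annotations might look like in a kernel's source, here is a rough sketch; the annotation names, types, and helper functions below are invented for illustration and need not match Landslide's real interface.

    /* Sketch of an instrumented kernel; all names are illustrative. */
    typedef struct thread { int tid; } thread_t;

    /* Annotation stubs; in an instrumented kernel these are the hooks
     * that Landslide intercepts to track thread lifecycle events. */
    void tell_landslide_thread_switch(int new_tid);
    void tell_landslide_forking(void);
    void tell_landslide_thread_on_rq(int tid);

    /* Kernel helpers, also stand-ins for the example. */
    void switch_stacks_and_run(thread_t *next);
    thread_t *create_child_thread(void);

    /* Context-switch path: report which thread is about to run. */
    void context_switch(thread_t *next)
    {
        tell_landslide_thread_switch(next->tid);
        switch_stacks_and_run(next);
    }

    /* Fork path: report that a new thread exists and is now runnable. */
    int sys_fork(void)
    {
        tell_landslide_forking();
        thread_t *child = create_child_thread();
        tell_landslide_thread_on_rq(child->tid);
        return child->tid;
    }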

Then, when the test case finishes, Landslide chooses which decision point to execute differently (and chooses which thread to run instead), rewinds to that state, injects timer interrupts to force the kernel's scheduler to run that thread, and proceeds. (A pruning algorithm called Dynamic Partial Order Reduction drives the exploration – it chooses which interleavings to explore next in a way that skips over certain redundant ones.)

Simics has a really great feature that makes this possible: it supports setting bookmarks when the simulation is stopped and later doing reverse-execution back to that point. I used this to implement the "rewinding", and it turns out to be a lot faster than rebooting the kernel every time (especially in a depth-first search).

JE: Cool, so you actually make use of reverse execution to create a backtracking search through the state space of the target system. Kind of obvious once you see it, but also very clever.  Anyway, please continue.

BB: Finally, when Landslide finds a bug, it prints a "decision trace", which is a list of each thread switch that happened and a stack trace at each point. Landslide has a couple of bug-finding checks: tripped asserts, use-after-free heap accesses, deadlocks, and (heuristically) livelocks/infinite loops.
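To picture the kind of bug the use-after-free check catches, here is an invented fragment (not from Landslide or any student kernel): one thread frees a thread control block while another thread, preempted at just the wrong instant, still holds a pointer into it.

    #include <stdlib.h>

    typedef struct tcb {
        int tid;
        int exit_status;
    } tcb_t;

    /* Thread A: the exiting child records its status and frees its own
     * control block. */
    void child_exit(tcb_t *self, int status)
    {
        self->exit_status = status;
        free(self);                    /* the block is gone after this */
    }

    /* Thread B: the parent reaps the child.  If the scheduler runs
     * child_exit() between the decision point below and the read, the
     * read hits freed memory, exactly the sort of access the
     * heap-tracking check is meant to flag. */
    int parent_reap(tcb_t *child)
    {
        /* decision point: a context switch to the exiting child fits here */
        return child->exit_status;     /* may read freed memory */
    }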

[Figure: Landslide]

Which events to consider "decision points" was also a hard question. During the project, I used voluntary reschedules (such as calls to yield()) and synchronisation operations (such as calls to mutex_lock()/mutex_unlock()) as decision points. I felt this struck a good balance between fine-grained interleavings (more likely to find races) and coarse-grained interleavings (more feasible to explore the state space).

But I also wanted to harness the user's own intuition, to focus the search space more intelligently. So the user can also configure decision points of their own – they can write "tell_landslide_decide()" in the kernel's code to say "hey, put a decision point here!", and they can also use another config file to whitelist/blacklist certain modules of the kernel (maybe you don't want to waste any time exploring interleavings in the virtual memory subsystem).
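For example, a student whose kernel has its own condition variables might drop a manual decision point right where a lost wakeup would be most likely to appear. Apart from tell_landslide_decide() itself, everything in this sketch stands in for the kernel's own primitives:

    typedef struct cond cond_t;
    typedef struct mutex mutex_t;
    typedef struct thread thread_t;

    void tell_landslide_decide(void);
    void enqueue_waiter(cond_t *cv, thread_t *t);
    thread_t *current_thread(void);
    void mutex_lock(mutex_t *mp);
    void mutex_unlock(mutex_t *mp);
    void block_current_thread(void);

    void condvar_wait(cond_t *cv, mutex_t *mp)
    {
        enqueue_waiter(cv, current_thread());
        mutex_unlock(mp);

        /* Ask Landslide to also consider a preemption right here, where a
         * racing signal/broadcast could expose a lost wakeup. */
        tell_landslide_decide();

        block_current_thread();
        mutex_lock(mp);
    }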

Partial-Order Reduction Explained

[Figure: Landslide reduction]

Landslide uses an existing algorithm called Dynamic Partial Order Reduction to prune redundant subtrees out of the search space of different thread interleavings. The basic idea of the algorithm is to use a "memory independence relation", which identifies when certain state transitions of different threads have no conflicting shared memory accesses, to know when it wouldn't matter which order those transitions run in. In the picture above, the states are identical after "1" and "2", regardless of the order that these operations are done. Therefore, it is sufficient to explore only one of the subtrees marked in yellow.
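A minimal sketch of the independence check at the heart of that pruning might look as follows. It is simplified (real dynamic partial order reduction implementations also track locks, happens-before ordering, and backtracking sets), but it captures the rule: two transitions commute when neither one writes memory that the other touches.

    #include <stdbool.h>
    #include <stddef.h>

    /* A transition's memory footprint: the addresses it read and wrote. */
    typedef struct {
        const void **reads;  size_t nreads;
        const void **writes; size_t nwrites;
    } footprint_t;

    static bool sets_overlap(const void **a, size_t na, const void **b, size_t nb)
    {
        for (size_t i = 0; i < na; i++)
            for (size_t j = 0; j < nb; j++)
                if (a[i] == b[j])
                    return true;
        return false;
    }

    /* Two transitions are independent (their order does not matter) if
     * neither one writes an address the other reads or writes.  The
     * reduction only needs to explore both orders when this returns false. */
    bool transitions_independent(const footprint_t *t1, const footprint_t *t2)
    {
        return !sets_overlap(t1->writes, t1->nwrites, t2->reads,  t2->nreads) &&
               !sets_overlap(t1->writes, t1->nwrites, t2->writes, t2->nwrites) &&
               !sets_overlap(t2->writes, t2->nwrites, t1->reads,  t1->nreads);
    }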


JE: That sounds like some pretty advanced Simics programming there (I would not call it scripting, it is way beyond that level of sophistication)!

BB: Thanks! Yes, Landslide is basically its own piece of software. It's about 4000 lines of code, written mostly in C.

JE: What types of kernel errors do you detect with Landslide?

BB: Well, in theory, the beauty of systematic testing is that it should be able to find any kind of race condition.

One example of a race that systematic testing can find that a data race detector couldn't is the everyday TOCTTOU: let's say the kernel validates a userspace buffer (with a lock held) then later accesses the buffer, but in the meantime another thread changed the permissions on that memory (with the same lock held), making the first thread's access fail. Because of the locking, it's not a data race, but Landslide can easily force an interleaving when the first thread drops the memory-validation lock (cf. mutex_unlock() being a good decision point, as I noted above).
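A sketch of that shape of bug, with all names invented for illustration:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mutex mutex_t;
    extern mutex_t *vm_lock;

    void mutex_lock(mutex_t *m);
    void mutex_unlock(mutex_t *m);
    bool buffer_is_writable(const void *buf, size_t len);
    void revoke_write_permission(void *buf, size_t len);
    void copy_to_buffer(void *buf, const void *src, size_t len);

    /* Thread 1: checks the buffer under the lock, then uses it after
     * dropping the lock. */
    int sys_read_into(void *user_buf, const void *src, size_t len)
    {
        mutex_lock(vm_lock);
        bool ok = buffer_is_writable(user_buf, len);   /* time of check */
        mutex_unlock(vm_lock);
        if (!ok)
            return -1;

        /* Decision point: right after the unlock, thread 2 may run
         * sys_make_readonly() and revoke the mapping before the copy. */
        copy_to_buffer(user_buf, src, len);            /* time of use */
        return 0;
    }

    /* Thread 2: changes permissions under the same lock, so no data race
     * is ever reported, yet thread 1's later access can still fault. */
    void sys_make_readonly(void *user_buf, size_t len)
    {
        mutex_lock(vm_lock);
        revoke_write_permission(user_buf, len);
        mutex_unlock(vm_lock);
    }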

Landslide itself is a little more limited than "in theory", of course. As implemented, the only source of nondeterminism it controls is the timer interrupts; so it controls thread scheduling, but not device input handling. As such, it's most useful for finding races in core or mostly-core parts of the kernel, such as the thread lifecycle implementation or the virtual memory system. Finding races in device drivers is beyond the scope of Landslide's current model (though potentially in the path of future work).

JE: Are you still working on the tool?

BB: Since I wrote my MS dissertation on it, I've put it on hold for a while. I've been doing some other concurrency-related projects in the meantime (which you could read about if you want on my website, linked above).

JE: What kinds of code have you tested Landslide on?

BB: I wrote Landslide targeting "Pebbles", which is a UNIX-like kernel architecture that students implement in six weeks in our undergrad operating systems class at CMU. Pebbles kernels have to be fully preemptible and support concurrent execution of standard thread-lifecycle events (fork, exec, exit, wait, yield, etc.). Using Pebbles as a case study was a good proving ground for potentially extending this work in the future to more mainstream kernels, such as Linux.

Students implement these kernels from the ground up (we only provide them enough starter code to boot up into an entrypoint function; none of the design or virtual memory or anything is done for them), so each kernel ends up with slightly different design and structure. Landslide had to be flexible enough to be able to attach to and control any of these kernels.

The real test was meeting with some of the students towards the end of the project to get them to try out Landslide, to see if it works "in the wild". Sometimes it was a struggle – the average time spent doing instrumentation was 100 minutes – but in the end, all four (out of five total) groups that completed the prerequisite instrumentation were able to find bugs in their kernels with Landslide.

JE: That is a very elegant twist to the project, I think. You both get to test your tool on real code that you know nothing about, and you teach the students that there are tools that can do almost magic things to help them check their code for errors.

JE: Do you have any fun examples from the work of bugs found?

BB: Sure. My favorite story is of one of the project's defining moments, a time when I just stopped in my tracks and thought "huh. I have made something smarter than me."

I had written a new unit test case, called double_wait, in which two threads in a parent process attempt to wait() on a single child process. I was testing it out on my own kernel (the one I wrote when I was an OS student), and I expected the search to complete with no bugs, each time having one parent thread succeed and the other return failure. Instead, after a bit of searching, Landslide made my kernel trip an assert in its condition variable linked-list logic.

This was a bug nobody ever knew about before – I'd written it three years before and didn't find it during stress-testing; and the TA who manually graded my kernel (by looking over it with a red pen) also overlooked it.
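For readers who want to picture the test, here is a rough reconstruction of the double_wait idea using POSIX calls; the real test targets the Pebbles fork and wait system calls, so the details differ.

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Both parent threads race to reap the single child; exactly one
     * should succeed and the other should get back -1 (no children left). */
    static void *waiter(void *arg)
    {
        (void)arg;
        int status;
        pid_t r = wait(&status);
        printf("wait() returned %d\n", (int)r);
        return NULL;
    }

    int main(void)
    {
        pid_t child = fork();
        if (child == 0)
            _exit(0);                 /* the child exits immediately */

        pthread_t t1, t2;
        pthread_create(&t1, NULL, waiter, NULL);
        pthread_create(&t2, NULL, waiter, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }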

JE: That just goes to show how difficult reasoning about concurrency is for a human being.

JE: As you know, I previously interviewed the creator of SimTester. At a high level, these two tools built on Simics would seem to be similar, but I think that they are actually quite distinct. Could you expand a bit on that?

BB: Imagine my surprise when the SimTester interview came out just when I was writing my thesis's related work section. "Whoa, wait, another tool that runs in Simics and identifies key decision points and injects interrupts?"

So our projects are quite similar in those regards, but at a higher level my strategy is somewhat different from theirs. SimTester focuses on races involving device interrupt handling code, whereas Landslide is geared towards non-driver inter-thread races. Additionally, SimTester’s testing model involves forcing only one interrupt per test run, which is a lighter-weight approach than systematic exploration.

JE: Thank you for your time, this has been really interesting. Where can our readers go to find more information about Landslide?

BB: The thesis can be found at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-CS-12-118_abs.shtml (and the slides from the MSc defense at http://bblum.net/landslide-defence.pdf).

 

 
