Systematically Exposing OS Kernel Races – An Interview with Ben Blum

Full-system simulators like Simics provide unparalleled
insight into what is going on in a target system. Indeed, better insight is one
of the main features of simulation that we get regardless of what we simulate and
how. In addition, if we want to, we can also exert control over the target
system to make it take different execution paths than it otherwise would.
Earlier this year, Ben Blum at Carnegie-Mellon University  CMU presented a Master’s thesis that provides
a very good example of just what can be achieved by combing the insight and
control of a simulator with intelligence and domain knowledge. The system is
called Landslide, and it is used to expose race conditions inside of
operating-system kernels.

Landslide systematically explores the possible execution
paths of the kernel in order to provoke latent bugs that might only happen very
rarely in actual use of the operating system – but as we all know, such
glitches are the ones that tend to be found by customers in critical situations
and force engineering to spend months in reproduction attempts.

For this blog post, I interviewed the creator of Landslide, Ben Blum. 

Jakob Engblom: Please
introduce yourself!


Bblum-by-alanvBen Blum
: I'm Ben Blum, a graduate student in the
Computer Science Department at Carnegie Mellon (CMU) ,
advised by Garth Gibson.
I spent last year doing this work as part of the Fifth Year Master's program in
the Computer Science Department at CMU. Now I am staying on for several more
years for a Ph.D

JE: What was the
topic of your thesis?

BB: Systematic
testing (or sometimes "exploratory testing" or "systematic
exploration") is the idea of testing for race conditions in a concurrent
system by forcing the system to execute as many different thread interleavings as possible.

Normally when you run a concurrent system, its threads run
in an unpredictable pattern (governed by timer events or other device
interrupts), and only one possible interleaving out of many is executed. And
(by contrast), when you use conventional stress testing, you exercise a random
set of thread interleavings, often a very small subset of the whole execution
space. A systematic testing tool tries to enumerate and explore the whole state
space of interleavings.

This approach is very powerful: it can find arbitrarily
complicated race conditions, including TOCTTOU (Time of Check To Time of Use) bugs. Lighter-weight tools such as data race detectors or static analyses are
often powerless against these, since simply adding locks around each access
might satisfy such a tool but not actually solve the race.

But systematic testing is also very computationally
expensive – you might be thinking "exponential explosion of
possibilities", and you'd be right. A big part of the research involves
figuring out how to reduce the coverage (and/or granularity) of possible
interleavings in a way that still gives meaningful results. Sometimes we use
sophisticated pruning algorithms; sometimes we use simple heuristics.

JE: How did you
use Simics?

BB: I wrote my
tool, called Landslide, as a Simics module. When Simics boots up a kernel, it
also loads Landslide, which then gets to monitor and control the kernel's
execution.

Being in a simulated environment gives a lot of power:
Landslide sees every instruction and memory access the kernel runs, and so
keeps fairly detailed internal representations of the kernel's state. There is
about an order of magnitude slowdown in doing this (compared to regular Simics
simulation speed), but it gets to know (for example) when each kernel thread is
runnable and whether heap accesses are invalid.          

JE: And just what
is Landslide and how does it work?

BB:  During the first run of a test case,
Landslide identifies "decision points" – instructions during the
execution at which it thinks that if the kernel had context switched to another
thread, something interesting might have happened (the set of decision points
defines the state space of thread interleavings). Annotations in the kernel
(which I call "tell_landslide()") also help Landslide track what the
kernel is doing (especially thread lifecycle events).

Then, when the test case finishes, Landslide chooses which
decision point to execute differently (and chooses which thread to run
instead), rewinds to that state, injects timer interrupts to force the kernel's
scheduler to run that thread, and proceeds. (A pruning algorithm called Dynamic
Partial Order Reduction drives the exploration – it chooses which interleavings
to explore next in a way that skips over certain redundant ones.)

Simics has a really great feature that makes this possible:
it supports setting bookmarks when the simulation is stopped and later doing
reverse-execution back to that point. I used this to implement the
"rewinding", and it turns out to be a lot faster than rebooting the
kernel every time (especially in a depth-first search).

JE: Cool, so you
actually make use of reverse execution to create a backtracking search through
the state space of the target system. Kind of obvious once you see it, but also
very clever.  Anyway, please continue.

BB: Finally, when
Landslide finds a bug, it prints a "decision trace", which is a list
of each thread switch that happened and a stack trace at each point. Landslide
has a couple of bug-finding checks: tripped asserts, use-after-free heap
accesses, deadlocks, and (heuristically) livelocks /infinite loops.

 

Landslide

Which events to consider "decision points" was
also a hard question. During the project, I used voluntary reschedules (such as
calls to yield()) and synchronisation operations (such as calls to
mutex_lock()/mutex_unlock()) as decision points. I felt this struck a good
balance between fine-grained interleavings (more likely to find races) and
coarse-grained interleavings (more feasible to explore the state space).

But I also wanted to harness the user's own intuition, to
focus the search space more intelligently. So the user can also configure
decision points of their own – they can write
"tell_landslide_decide()" in the kernel's code to say "hey, put
a decision point here!", and they can also use another config file to
whitelist/blacklist certain modules of the kernel (maybe you don't want to
waste any time exploring interleavings in the virtual memory subsystem).

 


Partial-Order Reduction Explained

Landslide-reduction

Landslide uses an
existing algorithm called Dynamic Partial Order Reduction to prune redundant
subtrees out of the search space of different thread interleavings. The basic
idea of the algorithm is to use a "memory independence relation",
which identifies when certain state transitions of different threads have no
conflicting shared memory accesses, to know when it wouldn't matter which order
those transitions run in. In the picture above, the states are identical after
"1" and "2", regardless of the order that these operations
are done. Therefore, it is sufficient to explore only of the subtrees marked in
yellow.


JE: That sounds
like some pretty advanced Simics programming there (I would not call it
scripting, it is way beyond that level of sophistication)!

BB: Thanks! Yes,
Landslide is basically its own piece of software. It's about 4000 lines of
code, written mostly in C.

JE: What types of
kernel errors do you detect with Landslide?

BB: Well, in
theory, the beauty of systematic testing is that it should be able to find any
kind of race condition.

One example of a race that systematic testing can find that
a data race detector couldn't is the everyday TOCTTOU: let's say the kernel
validates a userspace buffer (with a lock held) then later accesses the buffer,
but in the meantime another thread changed the permissions on that memory (with
the same lock held), making the first thread's access fail. Because of the
locking, it's not a data race, but Landslide can easily force an interleaving
when the first thread drops the memory-validation lock (cf. mutex_unlock()
being a good decision point, as I noted above).

Landslide itself is a little more limited than "in theory",
of course. As implemented, the only source of nondeterminism it controls is the
timer interrupts; so it controls thread scheduling, but not device input
handling. As such, it's most useful for finding races in core or mostly-core
parts of the kernel, such as the thread lifecycle implementation or the virtual
memory system. Finding races in device drivers is beyond the scope of
Landslide's current model (though potentially in the path of future work).

JE: Are you still
working on the tool?

BB: Since I wrote my
MS dissertation on it, I've put it on hold for a while. I've been doing some
other concurrency-related projects in the meantime (which you could read about
if you want on my website, linked above).

JE: What kinds of
code have you tested Landslide on?

BB: I wrote
Landslide targeting "Pebbles", which is a UNIX-like kernel
architecture that students implement in six weeks in our undergrad operating
systems class at CMU. Pebbles kernels have to be fully preemptible and support
concurrent execution of standard thread-lifecycle events (fork, exec, exit,
wait, yield, etc.). Using Pebbles as a case study was a good proving ground for
potentially extending this work in the future to more mainstream kernels, such
as Linux.

Students implement these kernels from the ground up (we only
provide them enough starter code to boot up into an entrypoint function; none
of the design or virtual memory or anything is done for them), so each kernel
ends up with slightly different design and structure. Landslide had to be flexible
enough to be able to attach to and control any of these kernels.

The real test was meeting with some of the students towards
the end of the project to get them to try out Landslide, to see if it works
"in the wild". Sometimes it was a struggle – the average time spent
doing instrumentation was 100 minutes – but in the end, all four (out of five
total) groups that completed the prerequisite instrumentation were able to find
bugs in their kernels with Landslide.

JE: That is a
very elegant twist to the project, I think. You both get to test your tool on
real code that you know nothing about, and you teach the students that there
are tools that can do almost magic things to help them check their code for
errors.

JE: Do you have
any fun examples from the work of bugs found?

BB: Sure. My favorite story is of one of the project's defining
moments, a time when I just stopped in my tracks and thought "huh. I have
made something smarter than me."

I had written a new unit test case, called double_wait, in
which two threads in a parent process attempt to wait() on a single child
process. I was testing it out on my own kernel (the one I wrote when I was an
OS student), and I expected the search to complete with no bugs, each time
having one parent thread succeed and the other return failure. Instead, after a
bit of searching, Landslide made my kernel trip an assert in its condition
variable linked-list logic.

This was a bug nobody ever knew about before – I'd written
it three years before and didn't find it during stress-testing; and the TA who
manually graded my kernel (by looking over it with a red pen) also overlooked
it.

JE: That just
goes to show how difficult reasoning about concurrency is for a human being.

JE: As you know,
I previously interviewed the creator of SimTester. At a high level, these two tools built on
Simics would seem to be similar, but I think that they are actually quite
distinct. Could you expand a bit on that?

BB: Imagine my
surprise when the SimTester interview came out just when I was writing my
thesis's related work section. "Whoa, wait, another tool that runs in
Simics and identifies key decision points and injects interrupts?"

So our projects are quite similar in those regards, but at a
higher level my strategy is somewhat different from theirs. SimTester focuses
on races involving device interrupt handling code, whereas Landslide is geared
towards non-driver inter-thread races. Additionally, SimTester’s testing model involves
forcing only one interrupt per test run, which is a lighter-weight approach
than systematic exploration.

JE: Thank you for
your time, this has been really interesting. Where can you go to find more
information about Landslide?

BB: The thesis
can be found at http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-CS-12-118_abs.shtml
(and the slides from the MSc defense at http://bblum.net/landslide-defence.pdf).

 

For additional information from Wind River, visit us on Facebook.

 

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>