Debugging Simics – on Simics

I often write and talk about how useful Simics is for
debugging concurrency bugs and glitches in multithreaded and multicore systems. Recently, we had a case where we proved this on a very complex application,
namely Simics itself. This nicely demonstrated both the recursive completeness
of Simics and its usefulness for conquering tricky bugs in complex software.

The beginning of this story is a bug in Simics, triggered by
a certain Simics configuration. The Simics target is a Power Architecture
machine, running bare-metal test code that exercises the processor simulation.
Occasionally, this setup would crash Simics, due to some bug in Simics or the
models. The bug was difficult to track down, as it only happened in about one run
out of 50. Whenever a debugger was attached to try to diagnose it, the crash
invariably did not happen (a classic Heisenbug).

[Image: Simics-on-simics-1]

Simics is the perfect tool to diagnose these kinds of
issues, but in order to do that, we had to get the failing program into Simics;
that is, we had to run Simics on Simics. The first step was to create a duplicate of the
development host inside of Simics. This was fairly simple, just a matter of
installing a standard Fedora 16 Linux on an 8-core Intel target. Once Linux
was installed and booted, a checkpoint of the system was taken.

Next, the development code tree from the host was packaged
up as a tar file and put on a DVD image file. Simics was started from the
checkpoint of the booted target system, and the DVD image inserted into the
virtual DVD drive and mounted by the Fedora Linux running on Simics. The tar
file was copied to the target file system and unpacked. Once the Simics
installation was thus completed and Simics could run on Simics, a new
checkpoint was taken. The result at this point was a completely
self-contained, controllable, and repeatable environment.
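
The post does not show the actual commands used for this step, but as a minimal sketch, the checkpointing itself is essentially a one-liner in the outer Simics's embedded Python, using the standard write-configuration command (the checkpoint file name here is made up):

    # Minimal sketch, run in the outer Simics' embedded Python once the
    # guest-side work (mounting the DVD, copying and unpacking the tar
    # file) is done. The checkpoint name is invented for illustration.
    import simics

    # Capture the complete target state: booted Fedora 16 plus the
    # unpacked Simics installation. Later sessions can start directly
    # from this checkpoint file.
    simics.SIM_run_command('write-configuration "fedora16-simics-installed.ckpt"')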

The screenshot below shows Simics running on Simics, with
the same desktop wallpaper being used for both the host and outer Simics Fedora
system:

[Screenshot: Simics-on-simics-screenshot-600px]

The next step was to replicate the bug inside of Simics. To
this end, a shell command was used that repeatedly ran the inner Simics until
the bug hit (obviously, this session was started from the checkpoint after the
Simics installation).
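
The actual shell command is not reproduced in the post, but the idea is a simple retry loop. Here is a minimal Python sketch of the same idea, run inside the simulated Fedora 16 system; the launch path and the name of the Power Architecture test script are placeholders, and only the -batch-mode flag reflects standard Simics usage.

    #!/usr/bin/env python
    # Minimal sketch of the retry loop, run inside the simulated Fedora
    # 16 system. The paths and the test script name are placeholders,
    # not taken from the original setup.
    import subprocess

    run = 0
    while True:
        run += 1
        # Launch the inner Simics in batch mode on the failing configuration.
        status = subprocess.call(
            ["./simics", "-batch-mode", "targets/ppc/crash-case.simics"])
        print("run %d exited with status %d" % (run, status))
        # A crash shows up as a negative value (killed by a signal) or a
        # non-zero exit status, depending on how Simics is launched.
        if status != 0:
            print("inner Simics failed on run %d" % run)
            break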

The result was this setup, ready to run Simics until the bug
hit:

[Image: Simics-on-simics-2]

To recap, we have Simics running on Simics. The “inner
Simics” is configured with the Power Architecture setup that resulted in a
crash on the host, and the “outer Simics” is running Fedora 16, providing a
virtual replica of the development host (but inside of Simics).  

Additional scripting in the outer Simics was used to make
the search for and replication of the bug more efficient.

[Image: Simics-on-simics-3]

  • The Simics script varied the time slices given
    to the processors in the IA target system. This caused greater variation in
    scheduling of concurrent processes and threads in the Simics-simulated Fedora
    16 OS, which in turn helped provoke the bug so that it appeared faster (after
    fewer runs of the inner Simics).
  • A checkpoint was taken after the inner Simics
    had been started and the timing variation applied to the IA processors – but
    before it had started executing the test case. This meant that a checkpoint would
    be available that led straight to the bug, with no need to do any warm-up of
    the target or particular configuration of Simics. The checkpoint would in
    effect be a self-contained bug report for the issue.
  • A magic instruction (blue star) was planted in
    the segfault handler of the inner Simics, making it very simple to catch the crash
    of the inner Simics. Often, using a magic instruction like this is simpler than
    trying to capture the right page fault or putting a breakpoint at the right
    place. A magic instruction is a live marker in the code that will always
    trigger, regardless of debug information or OS awareness. Furthermore, it has
    no overhead until it hits. (How these three pieces of scripting could fit
    together is sketched just after this list.)
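
None of this scripting is reproduced in the post, so the following is only a plausible sketch in the outer Simics's embedded Python, relying on the standard cpu-switch-time and write-configuration commands and the Core_Magic_Instruction hap. The time-quantum range, file names, and message text are assumptions, and the C-level code that plants the magic instruction in the inner Simics's segfault handler is not shown here.

    # Plausible sketch of the outer-Simics scripting, in Simics' embedded
    # Python. Numbers and names are illustrative only.
    import random
    import simics

    # 1. Vary the time quantum of the IA cores between runs, to perturb
    #    the scheduling of threads inside the simulated Fedora 16 system.
    quantum = random.randint(1000, 100000)   # cycles; the range is invented
    simics.SIM_run_command("cpu-switch-time cycles = %d" % quantum)

    # 2. Save a checkpoint at this point; in the real session this was
    #    done after the inner Simics had started but before it ran the
    #    test case, so that the checkpoint acts as a self-contained bug
    #    report.
    simics.SIM_run_command('write-configuration "before-inner-run.ckpt"')

    # 3. Stop the outer simulation when the magic instruction planted in
    #    the inner Simics' segfault handler executes on any target CPU.
    def magic_hit(user_data, cpu, magic_value):
        simics.SIM_break_simulation(
            "inner Simics reached its segfault handler (magic %d)" % magic_value)

    simics.SIM_hap_add_callback("Core_Magic_Instruction", magic_hit, None)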

Eventually, after some 20 runs of the inner Simics, the bug
was triggered. Thanks to the checkpoint and Simics repeatability, reproducing
the bug was now trivial: the Simics crash could be reproduced any number of
times, and it was time to go debug and figure out why Simics crashed. An
occasional Heisenbug had been converted into a 100% reproducible Bohrbug.

The first step of debugging was to figure out the mapping of
the many dynamically loaded modules in the inner Simics. This was done by
running the outer Simics and sending a Ctrl-Z to the Fedora shell, pausing the
inner Simics. Then, the /proc file system on the Fedora Linux running on Simics
was interrogated to find the load addresses. Since the checkpoint was taken
after the inner Simics had been started, this mapping was known to be the one
present in the software setup captured in the checkpoint. Every time the
checkpoint is opened, the same mapping applies – so the information was saved
and used to set up symbolic debug information for the Simics modules involved.
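
The exact procedure is not shown in the post; the following is a small sketch of the target-side half of it, reading /proc/<pid>/maps of the inner Simics process to recover the load address of each shared object. The process name passed to pgrep and the output format are assumptions.

    #!/usr/bin/env python
    # Sketch of reading the inner Simics' module load addresses from
    # /proc, run inside the simulated Fedora system after the inner
    # Simics has been stopped with Ctrl-Z.
    import subprocess

    # Find the PID of the (stopped) inner Simics process.
    pid = subprocess.check_output(
        ["pgrep", "-f", "simics-common"]).split()[0].decode()

    # Each executable mapping of a shared object gives the base address
    # needed to set up symbol information in the outer Simics.
    with open("/proc/%s/maps" % pid) as maps:
        for line in maps:
            fields = line.split()
            if len(fields) >= 6 and "x" in fields[1] and fields[5].endswith(".so"):
                base = fields[0].split("-")[0]
                print("0x%s %s" % (base, fields[5]))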

The next step of debugging was to open the checkpoint again,
turn on reverse execution, and run forward until the magic instruction hit.
Then, OS awareness was used to back up until the last time that the inner
Simics was running prior to hitting the segfault handler. This placed the
execution of the outer Simics at the precise instruction where the inner Simics
crashed.
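
The post does not show the exact commands, and the OS-awareness part in particular varies between Simics versions, so the following is only a rough outline of the outer-Simics side of this step, assuming the session was started from the checkpoint taken earlier.

    # Rough outline only: the session starts from the "before-inner-run"
    # checkpoint taken earlier; OS-awareness commands are omitted because
    # the post does not show them.
    import simics

    # Record execution so that we can go backwards later.
    simics.SIM_run_command("enable-reverse-execution")

    # Run forward; a Core_Magic_Instruction callback (as in the earlier
    # sketch) stops the simulation when the inner Simics reaches its
    # segfault handler.
    simics.SIM_run_command("continue")

    # From here, OS awareness was used to reverse to the last point where
    # the inner Simics process itself was running, i.e. the faulting
    # instruction fetch that crashed it.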

It turned out that the inner Simics was trying to execute code at a
location (BCDE) where no code was to be found.

Stepping back one instruction revealed a JMP instruction to
the location BCDE.

So where did this JMP BCDE come from? It was clearly not
part of the static code of Simics, but something that was generated at run time
by Simics itself (Simics contains a JIT compiler and thus modifying running
code at run time is perfectly expected behavior).

To find out how the bad JMP was created, a memory write
breakpoint was put on the instruction (JMP BCDE), and execution was reversed.
Simics stopped at the point where the “JMP” part of the instruction was written
to memory. A stack back trace at this point showed the code that was trying
to write a five-byte “JMP XYZQ” instruction into the JIT-generated code stream.
Since the breakpoint had hit on the write of the byte containing the JMP
instruction code, the other four bytes (containing the actual JMP target
location of XYZQ) had evidently not yet been written when the instruction got
executed and Simics crashed.
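
The post does not give the address or the exact command syntax, so this is a sketch only, using the classic Simics breakpoint and reverse commands with a placeholder address.

    # Sketch of finding the writer of the bad JMP by combining a memory
    # write breakpoint with reverse execution. The address is a
    # placeholder for where the JIT-generated "JMP BCDE" was found.
    import simics

    bad_jmp_addr = 0x12345678   # placeholder address, not from the post

    # Break on any write to the five bytes of the JMP instruction ...
    simics.SIM_run_command("break -w 0x%x 5" % bad_jmp_addr)

    # ... and run backwards: the simulation stops at the most recent
    # write, which turned out to be the JIT emitting the first byte of
    # "JMP XYZQ".
    simics.SIM_run_command("reverse")

    # A stack trace of the inner Simics at this point (using the symbol
    # information recovered from /proc earlier) shows the writing code.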

Stepping forward (on the processor) revealed that a thread
switch happened in the inner Simics, and that the incoming thread immediately
executed the five-byte JMP instruction, such as it was. Since only the JMP
opcode byte had been written, this was a jump to location BCDE rather than the
intended XYZQ (it would also have been fine to execute the original ABCDE code).
Thus, the issue was diagnosed as a read-write race condition, with the twist
that the read was an execution of the memory as code and the write a regular
data write. As soon as the problem was identified, it was of course very easy
to fix.

With the same setup, another race condition in Simics was
also found and fixed, involving the more common case of multiple concurrent
threads updating and reading a shared data structure without sufficient
synchronization.

In summary, this blog post has described one instance where
Simics was used to find and fix concurrency bugs in a real-world complex
software system called Simics. The key to the success was the repeatability
that Simics provides, even for timing-related occasional events, along with checkpoints,
scripting, reverse execution, and debug facilities.

 
