By Jakob Engblom
I often write and talk about how useful Simics is for debugging concurrency bugs and glitches in multithreaded and multicore systems. Recently, we had a case where we proved this on a very complex application, namely Simics itself. This nicely demonstrated both the recursive completeness of Simics, and its usefulness for conquering tricky bugs in complex software.
The beginning of this story is a bug in Simics, triggered by a certain Simics configuration. The Simics target is a Power Architecture machine, running some bare-metal test code testing the processor simulation. Occasionally, this setup would crash Simics, due to some bug in Simics or the models. It was a difficult bug to track down, as it only happened in one run out of 50 or so. Whenever a debugger was attached to try to diagnose it, the crash invariably did not happen (a classic Heisenbug).
Simics is the perfect tool to diagnose these kinds of issues, but in order to do that, we had to get the failing program into Simics; that is, we had to run Simics on Simics. The first step was to create a duplicate of the development host inside of Simics. This was fairly simple, just a matter of installing a standard Fedora 16 Linux on an 8-core Intel target. Once Linux was installed and booted, a checkpoint of the system was taken.
Next, the development code tree from the host was packaged up as a tar file and put on a DVD image file. Simics was started from the checkpoint of the booted target system, and the DVD image inserted into the virtual DVD drive and mounted by the Fedora Linux running on Simics. The tar file was copied to the file system on the target, and unpacked. A new checkpoint was taken after the Simics installation was thus completed and Simics could run on Simics. The result at this point was a completely self-contained, controllable, and repeatable environment.
The screenshot below shows Simics running on Simics, with the same desktop wallpaper being used for both the host and outer Simics Fedora system:
The next step was to replicate the bug inside of Simics. To this end, a shell command was used that repeatedly ran the inner Simics until the bug hit (obviously, this session was started from the checkpoint after the Simics installation).
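The original loop was a shell command; the same idea can be sketched in Python. This is only an illustration: the function name `run_until_failure` and the `max_runs` limit are assumptions made for this example, and the command line of the inner Simics is not given in this post, so the caller has to supply it.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=1000):
    """Run cmd repeatedly and report the first failing run.

    Returns (run_number, returncode) for the first run that exits with a
    non-zero status, or None if all max_runs runs succeed.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        # On POSIX, a negative return code means the process was killed by
        # a signal (for example, -11 for SIGSEGV on Linux).
        if result.returncode != 0:
            return run, result.returncode
    return None
```

Called with the (hypothetical) inner Simics command line as `cmd`, the function returns as soon as one run crashes, which is the point where the outer Simics would be stopped for inspection.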
The result was this setup, ready to run Simics until the bug hit:
To recap, we have Simics running on Simics. The “inner Simics” is configured with the Power Architecture setup that resulted in a crash on the host, and the “outer Simics” is running Fedora 16, providing a virtual replica of the development host (but inside of Simics).
Additional scripting in the outer Simics was used to make the search for and replication of the bug more efficient.
- The Simics script varied the time slices given to the processors in the IA target system. This caused greater variation in scheduling of concurrent processes and threads in the Simics-simulated Fedora 16 OS, which in turn helped provoke the bug so that it appeared faster (after fewer runs of the inner Simics).
- A checkpoint was taken after the inner Simics had been started and the timing variation applied to the IA processors – but before it had started executing the test case. This meant that a checkpoint would be available that led straight to the bug, with no need to do any warm-up of the target or particular configuration of Simics. The checkpoint would in effect be a self-contained bug report for the issue.
- A magic instruction (blue star) was planted in the segfault handler of the inner Simics, making it very simple to catch the crash of the inner Simics. Often, using a magic instruction like this is simpler than trying to capture the right page fault or putting a breakpoint at the right place. A magic instruction is a live marker in the code that will always trigger, regardless of debug information or OS awareness. Furthermore, it has no overhead until it hits.
Eventually, after some 20 runs of the inner Simics, the bug was triggered. Thanks to the checkpoint and Simics repeatability, the crash could now be reproduced any number of times, and it was time to debug and figure out why Simics crashed. An occasional Heisenbug had been converted into a 100% reproducible Bohrbug.
The first step of debugging was to figure out the mapping of the many dynamically loaded modules in the inner Simics. This was done by running the outer Simics and sending a Ctrl-Z to the Fedora shell, pausing the inner Simics. Then, the /proc file system on the Fedora Linux running on Simics was interrogated to find the load addresses. Since the checkpoint was taken after Simics was started, this mapping is part of the software state captured in the checkpoint. Every time the checkpoint is opened, the same mapping applies, so the information was saved and used to set up symbolic debug information for the Simics modules used.
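As a rough illustration of how load addresses can be pulled out of /proc, the sketch below parses text in the Linux /proc/<pid>/maps format and records the lowest mapped address of each file-backed object. The function name `load_addresses` and the sample paths and addresses are made up for this example; the post does not say which tool or script was actually used.

```python
def load_addresses(maps_text):
    """Return {pathname: lowest start address} for file-backed mappings.

    maps_text is the content of /proc/<pid>/maps, where each line reads
    "start-end perms offset dev inode pathname".
    """
    bases = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        addr_range, _perms, _offset, _dev, _inode, path = fields
        path = path.strip()
        if not path.startswith("/"):
            continue  # pseudo-entries such as [stack] or [heap]
        start = int(addr_range.split("-")[0], 16)
        bases[path] = min(start, bases.get(path, start))
    return bases


# Made-up sample in the /proc/<pid>/maps format, for illustration only.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 fd:01 131 /bin/cat
7f5a10000000-7f5a10021000 r-xp 00000000 fd:01 9001 /opt/example/lib/module-a.so
7f5a10220000-7f5a10222000 rw-p 00020000 fd:01 9001 /opt/example/lib/module-a.so
7f5a10400000-7f5a10421000 rw-p 00000000 00:00 0
7fff2c000000-7fff2c021000 rw-p 00000000 00:00 0 [stack]
"""

for path, base in sorted(load_addresses(SAMPLE_MAPS).items()):
    print(f"{base:#x} {path}")
```

On a live system the input would come from reading `/proc/<pid>/maps` for the paused process. Because the checkpoint freezes the process after its modules are loaded, the resulting table only needs to be collected once.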
The next step of debugging was to open the checkpoint again, turn on reverse execution, and run forward until the magic instruction hit. Then, OS awareness was used to back up until the last time that the inner Simics was running prior to hitting the segfault handler. This placed the execution of the outer Simics at the precise instruction where the inner Simics crashed.
It turned out that Simics was trying to execute code in a location (BCDE) where no code was to be found.
Stepping back one instruction led to a JMP instruction to the location BCDE.
So where did this JMP BCDE come from? It was clearly not part of the static code of Simics, but something that was generated at run time by Simics itself (Simics contains a JIT compiler and thus modifying running code at run time is perfectly expected behavior).
To find out how the bad JMP was created, a memory write breakpoint was put on the instruction (JMP BCDE), and execution reversed. Simics stopped at the point where the “JMP” part of the instruction was written to memory. Doing a stack back trace at this point showed the code that was trying to write a five-byte “JMP XYZQ” instruction into the JIT-generated code stream. Since the breakpoint had hit on the write of the byte containing the JMP instruction code, the other four bytes (containing the actual JMP target location of XYZQ) had yet to be written when the instruction got executed and Simics crashed.
Stepping forward (on the processor) revealed that a thread switch happened in the inner Simics, and that the incoming thread immediately executed the five-byte JMP instruction, such as it was. Since only the JMP byte had been written, this was a jump to location BCDE, rather than the intended XYZQ (it would also have been OK to execute the original ABDCE code). Thus, the issue was diagnosed to be a read-write race condition, with the twist that the read was an execution of the memory as code and the write a regular data write. As soon as the problem was identified it was of course very easy to fix.
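The effect of the torn write can be reproduced on paper with a few lines of Python. The sketch assumes that the five-byte JMP is the x86 near jump with a 32-bit relative displacement (opcode 0xE9 followed by a little-endian rel32), which is the usual five-byte form on Intel hosts; the addresses and the original code bytes are invented for illustration and are not the values from the actual bug.

```python
import struct

JMP_REL32 = 0xE9  # x86 opcode: near jump, 32-bit relative displacement


def encode_jmp(site, target):
    """Encode a five-byte JMP placed at address `site` that jumps to `target`."""
    rel = target - (site + 5)  # displacement counts from the next instruction
    return bytes([JMP_REL32]) + struct.pack("<i", rel)


def decode_jmp(code, site):
    """Return the target of the five-byte JMP stored in `code` at address `site`."""
    if code[0] != JMP_REL32:
        raise ValueError("not a JMP rel32 instruction")
    (rel,) = struct.unpack("<i", bytes(code[1:5]))
    return site + 5 + rel


site = 0x1000                                # made-up patch location
code = bytearray(b"\x90\x11\x22\x33\x00")    # made-up original code bytes
patch = encode_jmp(site, 0x2000)             # intended jump target

code[0] = patch[0]                           # writer: only the opcode byte lands
torn_target = decode_jmp(code, site)         # other thread executes right here

code[1:5] = patch[1:5]                       # writer: remaining four bytes land
final_target = decode_jmp(code, site)

print(hex(torn_target), hex(final_target))
```

Between the two writes, the instruction is a valid JMP whose displacement is made of whatever bytes were there before, so the jump lands at an address that nobody intended. That is the situation the incoming thread ran into.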
With the same setup, another race condition in Simics was also found and fixed, involving the more common case of multiple concurrent threads updating and reading a shared data structure without sufficient synchronization.
In summary, this blog post has described one instance where Simics was used to find and fix concurrency bugs in a real-world complex software system called Simics. The key to the success was the repeatability that Simics provides, even for timing-related occasional events, along with checkpoints, scripting, reverse execution, and debug facilities.