Fault Injection Revisited - With Video

Fault injection is a very powerful capability of Simics and simulation in general. Simics 5 introduces a new framework that makes it easier to create fault injectors that support record, replay, repeatability, and reverse debugging, and that makes it easier for users to discover fault injectors and apply them to their systems. This blog post looks in a bit more detail at how this works and what you can do with fault injection in Simics. Check out the accompanying video that shows Simics 5 fault injection in action.

The purpose of the new fault injection framework is to standardize how fault injectors are written, how to attach them to a system, and how to enable and disable the actual faults. Previously, faults were typically implemented in a mix of command-line scripts and Python, in an ad-hoc manner specific to each case. Each fault injector had its own peculiar user interface, and it was not possible to list all the fault injectors available in a particular Simics installation. With the new framework, fault injectors should be more regular, and uniformly support features such as recording the faults injected for precise replay.

Fault injection is a broad term, but in general it means changing the hardware setup or stored or read values in a way that is outside of expected operation.  We expect that faults should result in the software detecting the issue and dealing with it in some way – or that the software does something entirely uncontrolled since it was not prepared for the fault in question. It is a sad fact that the fault handling code is often the least tested part of the system, but also the part of the system that has to kick in when things really go bad. Thus, anything we can do to improve the testing of fault handling code is good for product quality and good for end users.

Here are some typical examples of faults and fault handling, and how you might go about simulating them.

  • Cosmic rays induce bit flips in memory, and the memory-controller ECC (Error-Correcting Code) functionality detects that an error has happened. If we assume that the error was not correctable, the ECC function in the hardware will set some flags and values in the ECC registers, and fire an interrupt at the processor. The processor would then jump to the ECC error handler, which will read the hardware registers to determine the nature of the fault, and then take some appropriate action – rebooting the system, logging the error, recovering from some secondary memory, etc. Simulating this in Simics is a matter of putting the right error reporting values into the ECC registers, and firing the interrupt. There is no need at the Simics level of abstraction to model the actual bits that flipped and the ECC checksums.
  • One board in a rack crashes due to a bad software driver and an unexpected input from the network. Eventually, the rest of the system notices that the board has gone offline since it stops broadcasting its heartbeat messages. According to the recovery design, another board takes over the role of the first board, and the system recovers. Testing this in Simics can be as simple as stopping all processors on the failing board or disconnecting it from the backplane network.
  • An IoT sensor node wakes up to take a sample, and tries to transmit it, but a nearby microwave oven is active and the noise blocks all attempts to transmit. The software notices that the situation seems hopeless right now, and waits a minute before trying to retransmit. Simulating this type of scenario in Simics is simple: stop all outgoing packets from the node and configure the wireless network adapter model to tell the software that the radio environment is full of noise. See “Internet of Things Automatic Testing – Using Simulation” for more on IoT simulation with Simics.
  • A car engine being controlled by an electronic control system suffers a physical failure, and its performance quickly degrades. The control software detects the change in behavior and tries to compensate in the control laws. When this fails to rectify the situation, fault handling code is invoked that shuts the engine down into an assumed fail-safe mode and tells the owner of the car to immediately call for assistance and get the car to a garage. In Simics, this can be tested by feeding the engine control system a series of sensor readings that correspond to the failure. See “The Trinity of Simulation” for more on integrating Simics with physics simulators.
  • Malicious malformed TCP/IP packets are sent to an embedded Linux system, causing the network stack to crash. The crash proves that the stack had no proper handling for those particular types of bad packets. Testing this in Simics is as simple as sending in a stream of packets from a fuzzer or other source of bad data. Recording the sequence of packets sent in is crucial to allow reproduction of the error.
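As a rough illustration of the ECC example above, here is a minimal Python sketch of injecting a fault at the register level. It is a toy model and not the Simics API: the `EccController` and `Cpu` classes, the status bit, and the interrupt number are all hypothetical names chosen for the example. The point it tries to show is that the injector only writes the error report and raises the interrupt; no memory bits are flipped.

```python
# Toy model only: the classes, register names, bit layout, and IRQ number
# are hypothetical. This is not the Simics API or a real memory controller.
from dataclasses import dataclass, field

ECC_STATUS_UNCORRECTABLE = 0x2  # hypothetical "uncorrectable error" flag


@dataclass
class EccController:
    status: int = 0        # error flags that the error handler reads
    fail_address: int = 0  # address the hardware reports as failing


@dataclass
class Cpu:
    pending_irqs: list = field(default_factory=list)

    def raise_irq(self, irq: int) -> None:
        self.pending_irqs.append(irq)


def inject_uncorrectable_ecc_error(ecc: EccController, cpu: Cpu,
                                   address: int, irq: int) -> None:
    """Report an error in the ECC registers and interrupt the processor.

    Memory contents and checksums are deliberately left untouched.
    """
    ecc.fail_address = address
    ecc.status |= ECC_STATUS_UNCORRECTABLE
    cpu.raise_irq(irq)


ecc, cpu = EccController(), Cpu()
inject_uncorrectable_ecc_error(ecc, cpu, address=0x8000_1000, irq=17)
```

After the call, the software under test would see the same register state and interrupt that it would see after a real uncorrectable error, which is all the error handler can observe anyway.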

In all such cases, it is very important to be able to precisely replay any particular fault scenario, so that any failure of the fault handling code can be investigated. Being able to replay faults also makes it possible to include fault injection in automatic regression testing, to make sure that fault handling doesn’t go bad over time due to lack of testing.

How did we go about improving fault injection in Simics 5? Basically, by providing an easy-to-use framework that standardizes how fault injectors should be written and what user interface they present. The framework also makes it very easy to record and replay fault injection actions. The result is a standard flow that goes like this:

  • Fault injectors are proper modules, not just scripts.
  • A user can thus take an inventory of available fault injectors, and decide which one to use.
  • Once a user has decided which one to use, it is loaded into Simics.
  • A new instance of the fault injector is then created.
  • The new instance is then attached to some part of the system. This is a critical step, as attaching the fault injector can involve creating new objects, rewiring the setup of the system to add intercept points, and putting the fault injector into the path of the data to corrupt. Sometimes it is really simple, and sometimes really complicated, depending on the nature of the fault.
  • Once the injector has been attached, individual faults can be enabled and disabled. The precise logic of enable and disable depends on the fault injector, but the principle of operation should be the same for all.
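The flow above can be sketched in a few lines of Python. This is only a toy model of the steps, under the assumption that injectors are registered by name; the registry, the `MemoryWriteInjector` class, and the target and fault names are hypothetical and do not correspond to actual Simics commands or classes.

```python
# Hypothetical names throughout: a toy model of the flow, not the Simics API.

class MemoryWriteInjector:
    """Toy injector: attach it to a target, then enable or disable faults."""

    def __init__(self):
        self.target = None
        self.enabled = set()

    def attach(self, target):
        # A real injector may create objects and rewire the system here.
        self.target = target

    def enable(self, fault):
        if self.target is None:
            raise RuntimeError("attach the injector before enabling faults")
        self.enabled.add(fault)

    def disable(self, fault):
        self.enabled.discard(fault)


# Injectors are registered modules, so a user can take an inventory.
REGISTRY = {"memory-write-injector": MemoryWriteInjector}


def list_injectors():
    return sorted(REGISTRY)


# Pick an injector from the inventory and create a new instance of it.
injector = REGISTRY[list_injectors()[0]]()

# Attach the instance to some part of the system.
injector.attach("board.memory")

# Enable and later disable an individual fault.
injector.enable("stuck-bit-0")
injector.disable("stuck-bit-0")
```

The value of the uniform interface is that a test script can drive any injector through the same attach, enable, and disable calls, whatever the injector does internally.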

The demo video we have posted on YouTube shows how this flow works for a simple memory error injector. The injector in the video plants breakpoints on memory writes and changes the values written, so that later reads return bad values. The simulation setup looks like this:

[Figure: simulation setup with the memory error injector]

There is a test program running that writes values to memory and checks that the same values are read back:

[Figure: the memory test program running on the target]

The demonstrated fault is very easy to implement in a simulator, but would require quite a bit of work on hardware. The fault that is modeled is essentially memory chips going bad and no longer properly storing the values written to them. This is why it is necessary to intercept writes and change the values written. Without interception, the test program's writes of its test patterns would remove any trace of the error.
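To make the argument about intercepting writes concrete, here is a small self-contained Python sketch. It is a toy memory model with a write hook, assuming a hypothetical "stuck at zero" bit at one address; it does not use Simics breakpoints or any Simics API. It compares a one-off corruption of the stored value with an intercepted write.

```python
# Toy model: Memory, the hook mechanism, and the fault are all hypothetical.

class Memory:
    def __init__(self, size):
        self.data = bytearray(size)
        self.write_hooks = []  # callables: (address, value) -> value

    def write(self, address, value):
        # Give every attached hook a chance to change the value written.
        for hook in self.write_hooks:
            value = hook(address, value)
        self.data[address] = value & 0xFF

    def read(self, address):
        return self.data[address]


def stuck_at_zero(bad_address, bit):
    """Build a write hook that forces one bit of one address to zero."""
    mask = ~(1 << bit) & 0xFF

    def hook(address, value):
        return value & mask if address == bad_address else value

    return hook


def memory_test(mem, pattern=0xFF):
    """Write a pattern to every address and return those that read back wrong."""
    failures = []
    for address in range(len(mem.data)):
        mem.write(address, pattern)
        if mem.read(address) != pattern:
            failures.append(address)
    return failures


# One-off corruption: the test overwrites it, so nothing is detected.
plain = Memory(16)
plain.data[4] = 0x00
undetected = memory_test(plain)

# Intercepted writes: the bad cell never stores the pattern correctly.
faulty = Memory(16)
faulty.write_hooks.append(stuck_at_zero(bad_address=4, bit=0))
detected = memory_test(faulty)
```

Here `undetected` is empty while `detected` contains address 4, which mirrors the reasoning in the text: the fault has to live in the write path to survive the test program's own writes.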

The recording and replay shown in the video cover actions on both the simulator side (setting up new faults) and the target side (key presses to make the test program rerun). Such joint recording is critical to ensure that a particular execution can be reproduced later to investigate what happened.
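The idea of joint recording can be illustrated with a minimal Python sketch: simulator-side and target-side events go into a single time-ordered log, and replaying that log against a fresh system gives the same outcome. Everything here is a simplified assumption made for illustration, including the event format, the `ToySystem` class, and the use of JSON as the recording format; it does not describe how Simics stores its recordings.

```python
# Toy model of joint record and replay. Event format and classes are hypothetical.
import json


class ToySystem:
    def __init__(self):
        self.active_faults = set()
        self.results = []

    def apply(self, event):
        if event["source"] == "simulator":
            # Simulator-side action, such as setting up a new fault.
            self.active_faults.add(event["action"])
        elif event["source"] == "target" and event["action"] == "rerun-test":
            # Target-side input, such as a key press that reruns the test.
            outcome = "FAIL" if self.active_faults else "PASS"
            self.results.append((event["time"], outcome))


def run(events):
    """Apply events in time order to a fresh system and return its results."""
    system = ToySystem()
    for event in sorted(events, key=lambda e: e["time"]):
        system.apply(event)
    return system.results


session = [
    {"time": 100, "source": "target", "action": "rerun-test"},
    {"time": 250, "source": "simulator", "action": "enable-fault"},
    {"time": 400, "source": "target", "action": "rerun-test"},
]

recording = json.dumps(session)       # what would be saved and shared
original = run(session)
replayed = run(json.loads(recording))
```

If only the target-side key presses were recorded, the replay would report two passes and the failure would not be reproduced, which is why both sides belong in the same recording.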

In the video, all actions are performed manually and interactively in order to show how things are done. In real use, the whole setup and application of the fault would be scripted and automated. Anything that can be done interactively from the Simics command line can also be done from a script.

In summary, fault injection is a very powerful capability of a simulator-based test lab, and with Simics 5, we are making it easier to create and reuse fault injectors, and to support collaborative workflows around the results of the fault injection campaigns.

More Reading

We have some older blog posts available if you want to read more about fault injection and Simics.