In the 1970s BBC comedy show Fawlty Towers, John Cleese's Basil Fawlty manages to turn harmless everyday situations into total disasters with very little effort. It is a marvellous example of how to inject faults into what could have been a smoothly operating hotel, demonstrating just how things fall apart when the unexpected happens. Injecting faults isn't always that easy, unfortunately (or should that be fortunately?).
I recently read an article by Steve Chessin in ACM Queue about fault injection, where he discusses how the UltraSPARC II and III systems from the late 1990s and early 2000s supported hardware fault injection in order to test the ability of the system to recover from faults. In this case, injecting faults was a bit more difficult than roaming around in a hotel beating your hapless waiter with a frying pan or insulting the guests. In the article, he makes a very important statement:
Handling errors is just attention to detail. Injecting errors is rocket science.
The UltraSPARC systems in question had special facilities added to their cache controllers to make it possible to create cache lines and memory words with faulty ECC (error correction code) data, which would trigger errors when accessed at a later point in time. Without this hardware support, it would have been impossible to actually test the error handling code.
Fault injection is hard, dirty, or even violent work. I have enjoyed many stories from embedded engineers about how faults were injected by pulling boards out of racks, putting chips inside radiation chambers, or taking an axe to a running system and breaking cables in order to test the ability to withstand sudden shocks. In the military field, literally shooting something to bits might be the best test there is.
Being a peaceful person, I have a different solution to the fault injection problem. Use a virtual platform and inject the faults virtually. One of the nice features of a virtual system is the control you have over it, and the fact that nothing is hidden in the hardware. Injecting faults is basically as easy as writing a script that changes some aspect of the target at an appropriate point in time. In no time at all, Basil Fawlty can be let loose inside your system.
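As a minimal sketch of what such a script-level approach can look like (the classes and hooks here are hypothetical, for illustration only; a real virtual platform such as Simics exposes its own scripting API), fault injection can be as simple as registering a callback that corrupts state at the right moment:

```python
# Hypothetical toy memory model illustrating script-driven fault injection
# in a virtual platform. Not an actual virtual-platform API.

class VirtualMemory:
    """Toy memory model with hooks invoked on every read."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.read_hooks = []

    def write(self, addr, value):
        self.data[addr] = value & 0xFF

    def read(self, addr):
        value = self.data[addr]
        # Give every registered injector a chance to tamper with the value
        for hook in self.read_hooks:
            value = hook(addr, value)
        return value

def make_bit_flip_injector(target_addr, bit):
    """Return a hook that flips one bit whenever target_addr is read."""
    def hook(addr, value):
        if addr == target_addr:
            return value ^ (1 << bit)
        return value
    return hook

mem = VirtualMemory(256)
mem.write(0x10, 0x42)
# "Basil Fawlty" at work: a transient single-bit error on reads of 0x10
mem.read_hooks.append(make_bit_flip_injector(0x10, 0))

print(hex(mem.read(0x10)))   # the software under test sees 0x43
print(hex(mem.data[0x10]))   # the underlying stored value is untouched
```

The point is that the injector lives entirely outside the "hardware" model: no soldering irons, radiation chambers, or axes required.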
Sometimes, target system models need to be extended to support fault injection in a clean and efficient way. For example, for the UltraSPARC II cache issues discussed in the article, the cache and its parity data would have had to be modeled in such a way that the processor would trigger an error when a faulty cache line was read. Another example is to include error states in models of communications buses, so that end-point devices can indeed detect that data has been corrupted in transmission. To continue our Fawlty analogy, you sometimes need to provide the rampaging hotel manager with things to smash and guests to insult. For many targets, modeling how things break is just as important as modeling how they work. Here are some examples of errors that have been modeled with Simics in the past:
- Dropping packets on networks
- Injecting bad checksums on network packets
- Transient and permanent memory corruption
- Transient and permanent processor register corruptions
- Data bus errors (intercepting operations and changing their value)
- Address bus errors (sending operations to the wrong destination)
- Cache errors
- Spurious interrupts
- Permanent subsystem failures (by disconnecting boards and other subsystems from the rest of the system)
- Stopping a processor (to simulate a single core going bad)
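To make the first two items in the list concrete, here is a sketch of a link model with built-in fault modes (the FaultyLink class and its parameters are my own invention for illustration, not part of any real platform's API):

```python
import random

# Hypothetical network link model with configurable fault modes,
# illustrating dropped packets and bad-checksum injection.

class FaultyLink:
    def __init__(self, drop_rate=0.0, corrupt_rate=0.0, seed=None):
        self.drop_rate = drop_rate        # probability a packet is dropped
        self.corrupt_rate = corrupt_rate  # probability the checksum is mangled
        self.rng = random.Random(seed)    # seeded for reproducible experiments
        self.delivered = []

    def send(self, packet: bytes):
        if self.rng.random() < self.drop_rate:
            return                         # packet silently dropped
        packet = bytearray(packet)
        if self.rng.random() < self.corrupt_rate:
            packet[-1] ^= 0xFF             # corrupt the checksum byte (assumed last)
        self.delivered.append(bytes(packet))

# Drive traffic through a deliberately unreliable link
link = FaultyLink(drop_rate=0.5, corrupt_rate=0.25, seed=1)
for i in range(100):
    link.send(bytes([i, i, 0x00]))  # last byte plays the role of a checksum

print(len(link.delivered))  # fewer than 100 packets arrive
```

Seeding the random generator matters in practice: a fault scenario that can be replayed exactly is far more useful for debugging the error-handling code than one that changes on every run.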
Injecting errors in a virtual platform does not have to be rocket science, but it certainly does require some attention to detail to make sure relevant error states are modeled in a way that makes them amenable to fault injection. For reliable systems, testing error handling is necessary, and a virtual platform is a great tool for doing it in a cheap and non-destructive way.