In his recent blog post, Maarten Koning writes about fault propagation paths: faults that occur in one place in the code and propagate to affect other parts of a system, or even multiple systems in an interconnected world. I recently had a chance to talk with a group of engineers at a large telecom equipment manufacturer. They told stories of software defects bringing several systems down for extended periods of time, causing outages in their customers' networks. We have all heard stories like these.
I agree with Maarten when he says: "I suspect that a large part of the code in any system is related to managing faults in some way."
In the past, most software developers looked at software defects as failures to provide required functionality. Accordingly, most unit and system tests were designed to find as many such failures as possible. Today we must also consider the scenario where software is deliberately attacked to cause a fault. We need to make sure that faults are contained and not propagated.
I think we can enhance the quality of our products in this context by incorporating more fault-injection test cases into the system tests. Fault-injection testing is not new, of course. But most traditional fault-injection tools and methods either take a black-box approach to testing or significantly modify the platform and execution environment on which the tests are run. This makes them somewhat ineffective and difficult to use. To be useful in analyzing fault propagation paths, I think fault injection must be done in a controlled and deterministic manner. An effective fault-injection test should have at least these characteristics:
- It must be predictable at a function- and component-level granularity. Grabbing all available memory while the application is running is not predictable in this sense, because we have no control over which function will hit this condition first.
- It must not change the application and system under test in any significant way. If the application or platform was significantly modified to cause deliberate faults, we are no longer testing the product we will ship.
- It must be dynamic and random. Playing back a recorded faulty sequence of packets is not very useful after the first run.
- It should be quick to develop and execute. We have millions of lines of code for which new test cases will need to be written.
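To make the first and third characteristics concrete, here is a minimal sketch in C of an allocation wrapper that fails only for one named caller, driven by a seeded random sequence so that a run can be repeated. Everything in it (fault_plan, fi_malloc, the caller names) is hypothetical and invented for illustration. It is also a source-level wrapper, so on its own it does not meet the second characteristic; it only shows what "predictable per function" and "random but repeatable" could look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fault-injection plan; not part of any real tool. */
typedef struct {
    const char *target;   /* only calls made by this function may fail */
    uint32_t    rng;      /* PRNG state; the seed must be nonzero */
    uint32_t    one_in;   /* fail about one in this many eligible calls */
    unsigned    injected; /* number of faults injected so far */
} fault_plan;

/* xorshift32: small PRNG, the same seed always yields the same sequence. */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Allocate memory, or return NULL when the plan selects this call.
 * The PRNG advances only for calls from the target function, so the
 * fault sequence does not depend on what the rest of the system does. */
void *fi_malloc(fault_plan *plan, const char *caller, size_t size) {
    if (plan != NULL) {
        assert(plan->rng != 0 && plan->one_in != 0);
        if (strcmp(plan->target, caller) == 0 &&
            next_random(&plan->rng) % plan->one_in == 0) {
            plan->injected++;
            return NULL;  /* injected allocation failure */
        }
    }
    return malloc(size);
}
```

A test would pass the name of the calling function, for example fi_malloc(&plan, "parse_packet", 64), and record the seed so that a failing run can be replayed exactly.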
I am probably a little biased, but I believe Sensor Point technology can be used very effectively as a white-box fault-injection tool. Sensor Point technology is part of the Wind River Diagnostics product.
So, what are Sensor Points? Sensor Points are arbitrarily large fragments of code that are inserted into a running application or device and activated by patching a branch instruction at a specific address in the application's code. The newly inserted code then executes whenever the program counter reaches the patched address. When the Sensor Point code returns, the existing code continues to execute from where it was patched. If the Sensor Point code is relatively small, there is very little performance impact on the system: the overhead is about 200 instructions per Sensor Point, which is well within the bounds of expected scheduling latency.
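The control flow can be illustrated with a rough source-level analogue. The real mechanism patches a branch instruction in the running binary; the sketch below instead uses a function-pointer slot that is checked at one fixed place in a function. All names here (patch_point, checksum, count_hits) are made up for the example and are not the Sensor Point API.

```c
#include <stddef.h>

/* Hypothetical analogue of a patched address: a slot that may hold
 * extra code to run when execution reaches it. */
typedef void (*probe_fn)(void *context);

typedef struct {
    probe_fn probe;   /* NULL means "not patched" */
    void    *context; /* data handed to the probe */
} patch_point;

static patch_point checksum_entry = { NULL, NULL };

/* Run the inserted code, if any, then fall through to the caller. */
static void patch_point_hit(const patch_point *pp) {
    if (pp->probe != NULL) {
        pp->probe(pp->context);
    }
}

void patch_point_attach(patch_point *pp, probe_fn probe, void *context) {
    pp->probe = probe;
    pp->context = context;
}

void patch_point_detach(patch_point *pp) {
    pp->probe = NULL;
    pp->context = NULL;
}

/* Existing application code. Its result is the same whether or not a
 * probe is attached; the probe only runs in addition. */
unsigned checksum(const unsigned char *data, size_t len) {
    patch_point_hit(&checksum_entry);
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Example inserted code: count how often the patched point is reached. */
static void count_hits(void *context) {
    unsigned *hits = context;
    (*hits)++;
}
```

Attaching count_hits to checksum_entry makes every call to checksum bump a counter, and detaching it restores the original behavior. Unlike this sketch, binary patching needs no check compiled into the application.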
So how can we apply this technique to inject faults into a function?
In addition to inserting new code into an existing application (or into a function, to be more precise), one can also stub out a function completely by using Sensor Points. Sensor Points can also be nested, which makes it possible to enable a Sensor Point only when a particular function is on the stack. Together, these two capabilities allow faults to be injected into a specific function or component without affecting other functions or applications in the system. Hence, one can measure the propagation characteristics of a fault originating at a precise point in the system, perhaps even at a precise point in time.
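As an illustration of that idea only, the C sketch below stubs a function so that it fails, but only while one particular caller is on the stack. It uses an ordinary depth counter and a flag rather than binary patching, and every name in it (read_sensor, update_display, log_sample, SENSOR_IO_ERROR) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SENSOR_IO_ERROR (-5)  /* hypothetical error code */

/* The real function: fills the buffer and reports the byte count. */
static int real_read_sensor(char *buf, size_t len) {
    memset(buf, 'x', len);
    return (int)len;
}

/* The stub: replaces the real function and always reports a fault. */
static int stub_read_sensor(char *buf, size_t len) {
    (void)buf;
    (void)len;
    return SENSOR_IO_ERROR;
}

static int stub_armed = 0;           /* set by the test to enable the stub */
static int update_display_depth = 0; /* > 0 while update_display is on the stack */

/* Call site: the stub is used only when it is armed AND the call
 * originates from within update_display. */
static int read_sensor(char *buf, size_t len) {
    if (stub_armed && update_display_depth > 0) {
        return stub_read_sensor(buf, len);
    }
    return real_read_sensor(buf, len);
}

/* Component under test: should contain the fault and report -1. */
int update_display(void) {
    char buf[8];
    update_display_depth++;
    int rc = read_sensor(buf, sizeof buf);
    update_display_depth--;
    return rc < 0 ? -1 : 0;
}

/* Unrelated component: must be unaffected by the injected fault. */
int log_sample(void) {
    char buf[8];
    return read_sensor(buf, sizeof buf) < 0 ? -1 : 0;
}
```

With stub_armed set, update_display sees the read failure while log_sample still gets real data, so a test can observe how far the fault travels from that one call path.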
I will write more about using Sensor Points in QA testing in the future, and provide some fault-injection Sensor Point examples.


Bulent Kasman has over 20 years of experience in developing systems and network management software. Currently, he is the architect of the Diagnostics product line at Wind River, where he is designing applications to quickly diagnose and repair errors in device software.


