I like to use the term fault propagation a lot. It has so many uses. When we talk about multiprocessor systems, we think about fault propagation across a bus or other interconnect at the electrical level, the protocol level and even at the service level. It is bad enough if a fault in a piece of code manages to take down the processor it runs on. It is multiplicatively worse if the fault propagates to take down other processors directly or indirectly.
Fault propagation is a problem even within a single processor. We put code into separate processes to help isolate it. But there are often fault propagation paths that confuse other parts of the system if a process fails. These fault propagation paths can be fairly predictable (e.g. a bad pointer overwrites a data structure in memory shared with another process) or they can be quite subtle (e.g. one process does a lot of I/O on a shared disk, causing another process to slow down even though plenty of CPU is available for both). Figuring out what caused the fault in the first place can be really hard, especially if the symptom shows up elsewhere via a subtle fault propagation path. Operating systems have implemented all kinds of abstractions and entities to help reduce fault propagation paths. Processes are one useful mechanism: they reduce the scope of what we have to consider when we detect that a fault has occurred.
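To make the process-as-barrier idea concrete, here is a minimal sketch in Python. The names `run_isolated` and `FAULTY_TASK` are made up for illustration, and the "fault" is just an unhandled exception in a child interpreter. The point is only that the parent observes the failure as an exit code instead of being taken down with it.

```python
import subprocess
import sys

# Hypothetical faulty task: raises an unhandled exception and dies.
FAULTY_TASK = "raise RuntimeError('simulated fault in child')"


def run_isolated(source: str, timeout_s: float = 10.0) -> int:
    """Run Python source in a separate interpreter process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's traceback out of our own output
        timeout=timeout_s,    # a hung child should not hang the parent forever
    )
    return result.returncode


if __name__ == "__main__":
    code = run_isolated(FAULTY_TASK)
    # A nonzero exit code tells the parent the child failed.
    print("child failed" if code != 0 else "child succeeded")
    print("parent still running")
```

This only covers the predictable path. The subtle paths mentioned above (shared disk, shared memory) are untouched by it, which is rather the point of the paragraph.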
I suspect that a large part of the code in any system is related to managing faults in some way. We could include resource management subsystems (so one task doesn’t consume all the CPU, file or memory resources), network management (so you can see what is going on and react), debug infrastructure, software upgrade, diagnostics, permissions, error reporting, and so on. That is a lot of technology that we need just to keep control of our applications.
The good news is that the technologies we are using to fight fault propagation paths are getting more powerful. Multicore processors are one opportunity to insert a barrier in fault propagation paths: simply put applications on separate cores so that they can’t steal CPU from each other directly. Multicore processors are like having time and space partitioning technology in hardware. Processor virtualization technology is the ultimate software-based time and space partitioning technology. If we virtualize an entire processor and control what resources it gets and when it gets them (especially CPU and memory), we can protect other applications from those that run in the virtualized environment.
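As a rough illustration of putting applications on separate cores, the sketch below pins the calling process to one core. It assumes a Linux system, since `os.sched_setaffinity` and `os.sched_getaffinity` are not available on every platform, and `pin_to_core` is a hypothetical helper name.

```python
import os


def pin_to_core(core: int) -> set:
    """Restrict the calling process to one core and return the resulting affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        # Assumption: Linux-style affinity calls; other platforms may not expose them.
        raise NotImplementedError("CPU affinity is not exposed on this platform")
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first = min(os.sched_getaffinity(0))
        print("now restricted to cores:", pin_to_core(first))
```

Pinning only addresses direct competition for CPU time. Cores still share caches, memory bandwidth and I/O, so some propagation paths remain.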
My concern is that as these tools get more powerful, those of us who are runtime developers will have a harder and harder time figuring out what is going on inside our systems. For a while back in the late 1980s I was developing GUIs on X Window System version 10. I remember a nightmare I had about an errant pixel in one of my widgets. The layers of software I had in that system spanned application processes (X client to X server), protocols (they communicated over TCP/IP), OS primitives (sockets and other system calls), device drivers (the framebuffer), compilers and the operating system. My nightmare was that the bug that was setting this bit was so far down I could not get to it without an in-circuit emulator and OS source code. Luckily I woke up from that nightmare.
But the nightmare could come back. With multiple operating systems on a multicore processor, or even on a single processor with multiple virtualized processors, the layers of software in our devices get thicker and even harder to fathom. We need to trust that the environment provided is stable. Java developers have been doing this for a while. But I like to “trust but verify”, so I want great tools that illuminate the answers to queries like “how much” or “when” or “what”. I think there is a lot of value in having smarter tools for the runtime, not just the design time, that will help us manage, visualize and control the more complex runtime environments we are increasingly using to manage fault propagation paths.
I am not sure exactly what these tools are, but I am pretty sure I will need them. Does anyone have any ideas in this area?