Virtualization and Fault Handling

In several of my previous posts I have written about the fact that embedded virtualization has low overhead, maintains determinism and all that good stuff. I have also written about some of the benefits of virtualization due to partitioning, scalability and such.

However, there is one aspect of virtualization that gets little 'air time': the hypervisor really is a nice, lean, partitioned management layer for your multicore (or single-core) hardware.

You may wonder why that is important. Well, take a 'system of old': you would have a single core with an OS and, on top of that OS, a piece of software that monitors the 'sanity' of the system. Sanity here means the sanity of both the hardware and the application. This monitor keeps track of exceptions thrown by the processor, the operating system and the application, and based on its state decides how and whether to continue operations. It does this by installing fault handlers to handle the exceptions. Often these handlers try to correct the problem and also log its occurrence, so an engineer can look at the log off-line later and try to debug the problem if that becomes necessary.

Now, the true problem comes when a fault occurs in a fault handler. This can happen if things are really broken, maybe due to faulty hardware, or maybe due to memory corruption. If this happens, there is often no other solution but to reboot, which is fairly heavy-handed. One could of course use a JTAG connection to the system if available, but if the system is already fielded, then this is simply not an option.
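To make the 'fault in a fault handler' case concrete, here is a minimal sketch in C of a handler that notices it has been re-entered. All names (handle_fault, fault_outcome_t, the two recover routines) are hypothetical and invented for illustration; they do not come from any particular OS or BSP, and the nested fault is simulated by a recovery routine that simply calls the handler again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes, invented for this sketch. */
typedef enum { FAULT_RECOVERED, FAULT_REBOOT_REQUIRED } fault_outcome_t;

/* True while a handler is running. A real system would keep one per core. */
static volatile bool in_fault_handler = false;

/* 'recover' is a caller-supplied routine that tries to fix the problem and
 * returns true on success. If it faults itself, handle_fault is re-entered. */
fault_outcome_t handle_fault(int fault_code, bool (*recover)(int))
{
    assert(recover != NULL);

    if (in_fault_handler) {
        /* We faulted while already handling a fault: the handler's own
         * state cannot be trusted any more, so give up on recovery. */
        return FAULT_REBOOT_REQUIRED;
    }

    in_fault_handler = true;
    bool ok = recover(fault_code);
    in_fault_handler = false;

    return ok ? FAULT_RECOVERED : FAULT_REBOOT_REQUIRED;
}

/* A recovery routine that succeeds. */
bool recover_ok(int fault_code)
{
    (void)fault_code;
    return true;
}

/* A recovery routine that itself faults (simulated by re-entering). */
bool recover_that_faults(int fault_code)
{
    return handle_fault(fault_code, recover_ok) == FAULT_RECOVERED;
}
```

With only the OS underneath, FAULT_REBOOT_REQUIRED is about the best this handler can report, which is exactly the heavy-handed outcome described above.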

Project this situation onto a system built on a Hypervisor. The operating system with the application and the fault handlers is actually running on top of a Hypervisor. The operating system is strongly separated from the Hypervisor in terms of memory areas. So if a fault happens, the Hypervisor is still available and in a sane state unless the problem is truly hardware related, in which case you are out of luck. If the problem is software related, then the Hypervisor can step in when one of these fault situations happens.

The hypervisor contains the fault and can actually be used to explore the state of the partition (the Virtual Board) and find out what happened. The Hypervisor has a record of interrupts received and acknowledged, can walk the memory and can dump the memory out over a network connection. It is often not possible to correct the fault, but the hypervisor provides additional debugging capabilities, either manual (by using the Hypervisor debugging shell) or automated (by writing fault handlers and recovery code as part of your Hypervisor BSP).

There are a couple of neat ways of using this fault containment. Firstly, as stated, one can elaborate the fault handler in the Hypervisor to do funky things: handle the code, log the code. However, that is really not an elegant solution, because you are adding code to the Hypervisor which should really not be there. The more elegant design would be to separate the gathering of information (something the Hypervisor should do) from the acting on that information, which can be done in a separate Virtual Board that gets a tiny slice of one of the processors in the system.
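The split between gathering and acting can be sketched as a small fault log that the Hypervisor writes and a monitor Virtual Board reads. This is only an illustration under my own assumptions: fault_record_t, fault_log_put, fault_log_get, monitor_decide and the 0x100 severity threshold are all made-up names and values, not part of any real hypervisor API, and the shared-memory region is modelled as an ordinary struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAULT_LOG_SLOTS 8u  /* power of two, so index wrap-around stays correct */

/* What the Hypervisor records about a fault (hypothetical layout). */
typedef struct {
    uint32_t partition_id;  /* which Virtual Board faulted */
    uint32_t fault_code;    /* exception or fault number   */
    uint64_t fault_addr;    /* faulting address, if known  */
} fault_record_t;

/* Single-producer, single-consumer ring living in shared memory.
 * On real multicore hardware head and tail would also need memory
 * barriers or atomics; they are omitted to keep the sketch short. */
typedef struct {
    fault_record_t slots[FAULT_LOG_SLOTS];
    uint32_t head;  /* advanced only by the Hypervisor (producer) */
    uint32_t tail;  /* advanced only by the monitor (consumer)    */
} fault_log_t;

/* Hypervisor side: gather the information and do nothing else. */
bool fault_log_put(fault_log_t *fl, const fault_record_t *rec)
{
    assert(fl != NULL && rec != NULL);
    if (fl->head - fl->tail == FAULT_LOG_SLOTS) {
        return false;  /* log full, record dropped */
    }
    fl->slots[fl->head % FAULT_LOG_SLOTS] = *rec;
    fl->head++;
    return true;
}

/* Monitor side: fetch the oldest record, if there is one. */
bool fault_log_get(fault_log_t *fl, fault_record_t *out)
{
    assert(fl != NULL && out != NULL);
    if (fl->head == fl->tail) {
        return false;  /* nothing to read */
    }
    *out = fl->slots[fl->tail % FAULT_LOG_SLOTS];
    fl->tail++;
    return true;
}

typedef enum { ACTION_LOG_ONLY, ACTION_RESTART_PARTITION } monitor_action_t;

/* Monitor side: the policy lives here, outside the Hypervisor.
 * The 0x100 threshold is an arbitrary value chosen for the example. */
monitor_action_t monitor_decide(const fault_record_t *rec)
{
    assert(rec != NULL);
    return rec->fault_code >= 0x100u ? ACTION_RESTART_PARTITION
                                     : ACTION_LOG_ONLY;
}
```

The point of the design is that the Hypervisor side stays tiny (one bounded copy into a ring), while anything that smells like policy can change in the monitor partition without touching the Hypervisor at all.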

An example of a problem that could benefit from a fault handler in a hypervisor is the stack overflow detection and handling described by Michael Barr in his post.
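For readers who have not seen stack overflow detection before, here is a minimal sketch of one common approach, stack painting with a known pattern. I am not claiming this is the exact method from that post; the function names, the pattern value and the assumption that the stack grows downward towards index 0 are all mine. A Hypervisor (or a monitor Virtual Board) could run a check like this against a guest's stack region from the outside, without relying on the guest still being sane.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT_PATTERN 0xDEADBEEFu  /* arbitrary, unlikely to occur by chance */

/* Fill the whole stack region with the pattern before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_PAINT_PATTERN;
    }
}

/* Count words at the low end that still hold the pattern. Assuming the
 * stack grows downward towards index 0, this is the unused headroom. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t n = 0;
    while (n < words && stack[n] == STACK_PAINT_PATTERN) {
        n++;
    }
    return n;
}

/* If even the lowest word has been overwritten, the stack has been used
 * right to its limit and has very likely overflowed. */
bool stack_overflow_suspected(const uint32_t *stack, size_t words)
{
    return stack_unused_words(stack, words) == 0;
}
```

The check is a heuristic: a task that happens to write the pattern value itself, or that skips over the low words entirely, will fool it, so it complements rather than replaces hardware protection such as an MPU or MMU guard region.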

All in all, another example of how virtualization is changing the way people develop embedded systems.