In several of my previous posts I have written about the fact that embedded virtualization has low overhead, maintains determinism and all that good stuff. I have also written about some of the benefits of virtualization due to partitioning, scalability and such.
However, there is one aspect of virtualization that gets little 'air time': the hypervisor really is a nice, lean, partitioned management layer for your multicore (or single-core) hardware.
Now, the true problem comes when a fault occurs in a fault handler. This can happen if things are really broken, perhaps because of faulty hardware or memory corruption. If this happens, there is often no other solution but to reboot, a fairly heavy-handed solution. One could of course use a JTAG connection to the system if one is available, but if the system is already fielded, then this is simply not available.
Project this situation to a system built on a Hypervisor. The operating system with the application and the fault handlers is actually running on top of a Hypervisor. The operating system is strongly separated from the Hypervisor in terms of memory areas. So if a fault happens, the Hypervisor is still available and in a sane state unless the problem is truly hardware related, in which case you are out of luck. If the problem is software related, then the Hypervisor can step in when one of these fault situations happens.
The hypervisor contains the fault and can actually be used to explore the state of the partition (the Virtual Board) and find out what happened. The Hypervisor has a record of interrupts received and acknowledged, can walk the memory, and can dump the memory out over a network connection. It is often not possible to correct the fault, but the hypervisor provides additional debugging capabilities, either manual (by using the Hypervisor debugging shell) or automated (by writing fault handlers and recovery code as part of your Hypervisor BSP).
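To make the "record of interrupts received and acknowledged" idea a bit more concrete, here is a minimal sketch in C of what such a record could look like. It is not taken from any real hypervisor: the names `irq_log`, `irq_log_record`, `irq_log_pending` and the capacity `IRQ_LOG_SIZE` are all made up for illustration. The assumption is that the log lives in hypervisor memory, so it is still readable after the guest has faulted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_LOG_SIZE 8u /* hypothetical capacity; a power of two */

typedef enum { IRQ_RECEIVED, IRQ_ACKNOWLEDGED } irq_event_kind;

typedef struct {
    uint32_t irq;        /* interrupt number */
    irq_event_kind kind; /* received by, or acknowledged by, the guest */
} irq_event;

/* Ring buffer kept in hypervisor memory (assumption), so it stays
 * intact when the guest partition faults. */
typedef struct {
    irq_event events[IRQ_LOG_SIZE];
    uint32_t head; /* total number of events ever recorded */
} irq_log;

/* Record one event, overwriting the oldest entry once the log is full. */
static void irq_log_record(irq_log *trace, uint32_t irq, irq_event_kind kind)
{
    assert(trace != NULL);
    trace->events[trace->head % IRQ_LOG_SIZE] = (irq_event){ irq, kind };
    trace->head++;
}

/* Count how many times `irq` was received but not yet acknowledged,
 * looking only at the events still retained in the log. */
static int irq_log_pending(const irq_log *trace, uint32_t irq)
{
    uint32_t count = trace->head < IRQ_LOG_SIZE ? trace->head : IRQ_LOG_SIZE;
    uint32_t start = trace->head - count;
    int pending = 0;

    for (uint32_t i = 0; i < count; i++) {
        const irq_event *e = &trace->events[(start + i) % IRQ_LOG_SIZE];
        if (e->irq != irq) {
            continue;
        }
        if (e->kind == IRQ_RECEIVED) {
            pending++;
        } else if (pending > 0) {
            pending--;
        }
    }
    return pending;
}
```

After a fault, a debugging shell or an automated handler could call something like `irq_log_pending` for each interrupt line to see which interrupts the guest never got around to acknowledging, which is often a good first hint about where it got stuck.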
There are a couple of neat ways of using this fault containment. Firstly, as stated, one can elaborate the fault handler in the Hypervisor to do funky things: handle the code, log the code. However, that is really not an elegant solution, because you are adding code to the Hypervisor which should really not be there. The more elegant design would be to separate the gathering of information (something the Hypervisor should do) from the acting on that information, which can be done in a separate Virtual Board that gets a tiny slice of one of the processors in the system.
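The split between gathering and acting can be sketched roughly as follows. This is an illustration under stated assumptions, not the API of any actual hypervisor: the `fault_record` layout, the `fault_mailbox` in shared memory, the function names, and the recovery policy in `monitor_decide` are all hypothetical. The point is only that the hypervisor side copies facts into a mailbox and does nothing else, while the decision logic lives in the monitoring Virtual Board.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All types and names below are hypothetical, for illustration only. */
typedef enum { FAULT_STACK_OVERFLOW, FAULT_BAD_ACCESS, FAULT_DOUBLE_FAULT } fault_kind;
typedef enum { ACTION_LOG_ONLY, ACTION_RESTART_BOARD, ACTION_HALT_BOARD } recovery_action;

typedef struct {
    uint32_t board_id;    /* which Virtual Board faulted */
    fault_kind kind;
    uintptr_t fault_addr; /* address that triggered the fault */
    uintptr_t pc;         /* guest program counter at the fault */
} fault_record;

#define MAILBOX_SLOTS 4u /* hypothetical capacity; a power of two */

/* Single-producer, single-consumer queue assumed to sit in memory shared
 * between the hypervisor and the monitor board. A real implementation
 * would also need memory barriers or atomics on the two counters. */
typedef struct {
    fault_record slots[MAILBOX_SLOTS];
    uint32_t written; /* advanced only by the hypervisor */
    uint32_t read;    /* advanced only by the monitor board */
} fault_mailbox;

/* Hypervisor side: gather only. Returns false if the mailbox is full. */
static bool hv_report_fault(fault_mailbox *mb, const fault_record *rec)
{
    assert(mb != NULL && rec != NULL);
    if (mb->written - mb->read == MAILBOX_SLOTS) {
        return false;
    }
    mb->slots[mb->written % MAILBOX_SLOTS] = *rec;
    mb->written++;
    return true;
}

/* Monitor board side: fetch the next record, if there is one. */
static bool monitor_next_fault(fault_mailbox *mb, fault_record *out)
{
    assert(mb != NULL && out != NULL);
    if (mb->read == mb->written) {
        return false;
    }
    *out = mb->slots[mb->read % MAILBOX_SLOTS];
    mb->read++;
    return true;
}

/* Monitor board side: the policy lives here, not in the hypervisor. */
static recovery_action monitor_decide(const fault_record *rec)
{
    switch (rec->kind) {
    case FAULT_STACK_OVERFLOW:
        return ACTION_RESTART_BOARD;
    case FAULT_DOUBLE_FAULT:
        return ACTION_HALT_BOARD;
    case FAULT_BAD_ACCESS:
    default:
        return ACTION_LOG_ONLY;
    }
}
```

With this shape, changing what happens on a fault means changing `monitor_decide` in the monitor board, and the code in the hypervisor fault path stays as small as a struct copy.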
An example of a problem that could benefit from a fault handler in a hypervisor is the stack overflow detection and handling described by Michael Barr in his post.
All in all, another example of how virtualization is changing the way people develop embedded systems.
Mark is a senior product manager with Wind River focusing on multicore and virtualization solutions. Prior to joining Wind River, Mark helped development teams build embedded systems across Asia, Europe and North America in the automotive, telecom, consumer electronics and defense industries.

You are out of luck with a hardware failure, unless the server is a Xeon-based fault-tolerant server. In which case, you continue to run smoothly as though nothing had happened at all.
Posted by: KenDonoghue | March 17, 2010 at 06:15 AM
Thanks for the comment Ken.
Fault tolerance in embedded is typically not handled by fault tolerant IT servers.
For telecom systems it is handled in the ATCA domain (for which Xeon processors are indeed popular).
For other embedded systems the solution is very much industry dependent. Case in point: 'safe shutdown' in light of a failure is quite different for a printing press compared to a commercial airliner :)
Posted by: Mark Hermeling | March 17, 2010 at 07:28 PM