Being Helpful or Simply Correct?

By Jakob Engblom

The other day, I spent some time getting a new operating system up on one of our Simics virtual platforms. The platform is stable, the hardware is shipping and it is being used with the very software I was setting up. However, as the operating system was booting, I got quite a lot of warnings from Simics about incorrect hardware settings. The operating system still worked, however, so why did we see all these warnings? Actually, when is it prudent to issue warnings to a user about suspicious uses of hardware? When building Simics models, we want them to be helpful and offer warnings as much as possible – especially for cases where some uses might be considered technically "correct" but are actually somewhat suspicious.

A virtual platform like Simics can print warnings to the simulation console when software performs operations that are not legal according to the hardware specification. Typical examples are writing to offsets in device register banks where there are no registers, changing the value of bits which are marked as "reserved", or writing illegal values to configuration registers.

These warnings are very useful to software developers, since they expose incorrect hardware use in device drivers and operating systems. However, the code might work just fine on the physical hardware. In reality, writes to reserved bits and accesses to non-existing register offsets are usually ignored in a hardware system. Adding logic to detect bad accesses can be seen as a waste of silicon, and what should the hardware do about the error anyway? Warnings make a lot more sense within the user interface of a virtual platform. Being helpful for a software developer is not necessarily the same as being correct in the view of a hardware designer.

Let's look at some of my own recent experience with Simics warnings to get a better idea of the art of good warnings in a virtual platform.

Bad Registers

Accesses to bad register offsets can indicate that the device driver programmer has misunderstood the location of some device registers. Even if the hardware silently ignores the accesses, the final result will not be what the programmer intended. For example, you might end up routing all interrupts in a multiprocessor SoC to a single processor rather than evenly distributing them across all processors. Such things can happen if the programmer miscalculates the location of interrupt control registers for cores beyond core zero. In general, warning about such accesses seems more helpful to a user, and not triggering such warnings would seem a reasonable goal for driver code.

However, there are times when the software is intentionally accessing non-existing registers. I hit such a case when I was bringing up my platform. It turned out that the register in question did exist – in another version of the device, used in a later SoC design from the same company.

Cleverly, the hardware designers had explicitly defined all accesses to non-existing registers as being allowed (and returning zero for reads). This is a hardware trick to allow reuse of driver code across a whole range of hardware, without any kind of conditional compilation. The hardware just ignores anything it does not recognize, and as long as the semantics of the registers that do exist in a certain device are consistent, the net result will be a single driver that works across a whole range of systems.

In our Simics model, we still warn about the illegal access. Think of the classic C idiom (or idiocy) "if (a=b) …"; it is perfectly legal to have an assignment in a conditional, and there are programmers who can make good use of it – but more often than not, it's a mistake. Most compilers will issue a warning. We have a similar situation here. Code carefully written to target the whole family of hardware is "correct" when it accesses the non-existing register. However, if code is written just for one device, the access is most likely an error. To do the right thing, we would need to know the intention of the code being written, which is asking a bit too much. If you know what you are doing, you can always disable that warning for your code.
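To make the trade-off concrete, here is a minimal sketch of a register bank that behaves like the hardware described above (unmapped writes are ignored, unmapped reads return zero) while still recording a warning that the user can switch off. This is not the Simics API; the class `RegisterBank`, the flag `warn_unmapped`, and the register offsets are all hypothetical names chosen for illustration.

```python
# Hypothetical device model sketch; names and offsets are assumptions,
# not taken from Simics or from any real device.
class RegisterBank:
    def __init__(self, offsets, warn_unmapped=True):
        self.regs = {offset: 0 for offset in offsets}
        self.warn_unmapped = warn_unmapped  # user-controlled warning switch
        self.warnings = []

    def _warn(self, message):
        if self.warn_unmapped:
            self.warnings.append(message)

    def read(self, offset):
        if offset not in self.regs:
            self._warn(f"read from unmapped offset 0x{offset:x}")
            return 0  # same result as the hardware: reads return zero
        return self.regs[offset]

    def write(self, offset, value):
        if offset not in self.regs:
            self._warn(f"write to unmapped offset 0x{offset:x}")
            return  # same result as the hardware: the write is ignored
        self.regs[offset] = value


# A driver written for a later device variant touches offset 0x08,
# which does not exist in this (older) variant.
bank = RegisterBank(offsets=[0x00, 0x04])
bank.write(0x04, 0x1234)
bank.write(0x08, 0xFFFF)
value = bank.read(0x08)

# A user who knows the access is intentional turns the warning off.
quiet_bank = RegisterBank(offsets=[0x00, 0x04], warn_unmapped=False)
quiet_bank.write(0x08, 0xFFFF)
```

The functional behavior is identical in both cases; only the amount of feedback differs, which is the point of keeping the warning on by default.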

Bad Bits

There were also a few cases of Simics warning about illegal modes being set.

In the first instance, we had a number of interrupt routing registers indicating which core was to receive a certain interrupt (out of all the cores in the system). The hardware manual clearly stated that setting more than one bit in such a register was illegal and would generate unspecified behavior. Still, we saw the software happily setting all bits in all registers as part of its hardware initialization. Interesting.

It turns out that the registers were properly configured (single bit set) before each individual interrupt was enabled. In effect, having bad interrupt routing information does not matter if the interrupt is not enabled. Thus, the Simics warning was a bit too aggressive in warning when the registers were written. The model was therefore updated to only warn if the interrupt routing is bad when the interrupt is actually enabled. If that happens, you will have to reverse the execution to back up to the last write to the interrupt routing register (or rerun the execution with more logging turned on). Not much difference in effort for the programmer, and no spurious warnings.
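The deferred check can be sketched roughly as follows. This is an illustration only, not the actual Simics model: the class `InterruptRouter` and its methods are hypothetical, and re-checking on a routing write to an already enabled interrupt is my own assumption about what a reasonable model would do.

```python
# Hypothetical interrupt routing model; one core bit mask per interrupt.
class InterruptRouter:
    def __init__(self, num_irqs):
        self.route = [0] * num_irqs
        self.enabled = [False] * num_irqs
        self.warnings = []

    def _check(self, irq):
        # More than one target core bit set is illegal per the manual.
        if bin(self.route[irq]).count("1") > 1:
            self.warnings.append(
                f"irq {irq} enabled with multiple target cores: "
                f"0x{self.route[irq]:x}"
            )

    def write_route(self, irq, core_mask):
        self.route[irq] = core_mask
        # Assumption: only complain if the interrupt is already live.
        if self.enabled[irq]:
            self._check(irq)

    def enable(self, irq):
        self.enabled[irq] = True
        self._check(irq)


router = InterruptRouter(num_irqs=4)
for irq in range(4):
    router.write_route(irq, 0xF)  # initialization sets all bits: no warning
router.write_route(2, 0x1)        # proper single-core routing
router.enable(2)                  # fine, no warning
router.enable(3)                  # routing is still 0xF: warning
```

Only the last call produces a warning, so the initialization pattern seen in the operating system passes silently while a genuinely bad configuration is still caught.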

Another example where we adjusted the warnings was a case where a bit field in a larger register absolutely should not contain zero. However, the reset value of the field was zero (interesting hardware design decision). This meant that nicely written software that set the value of some other bit field in the register by reading its current value, changing a few bits, and writing back the result would generate warnings. Each such operation should certainly be considered correct. Thus, we changed the Simics warning to only issue a warning if the value of the bit field was changed to an illegal value. This provided the full value of the warning for software developers, and removed spurious warnings.
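A minimal sketch of that rule, warning on a transition into the illegal value rather than on the illegal value being present, could look like this. The field position (`FIELD_SHIFT`, `FIELD_MASK`) and the class `ConfigRegister` are invented for the example and do not describe the real device.

```python
# Hypothetical register layout: a 2-bit field at bits 5:4 must not be zero,
# yet its reset value is zero.
FIELD_SHIFT = 4
FIELD_MASK = 0x3 << FIELD_SHIFT


class ConfigRegister:
    def __init__(self):
        self.value = 0  # reset state: the field already holds the illegal value
        self.warnings = []

    def write(self, new_value):
        old_field = (self.value & FIELD_MASK) >> FIELD_SHIFT
        new_field = (new_value & FIELD_MASK) >> FIELD_SHIFT
        # Warn only when software changes the field to the illegal value,
        # not when it merely writes back what was already there.
        if new_field == 0 and old_field != 0:
            self.warnings.append("field changed to illegal value 0")
        self.value = new_value


reg = ConfigRegister()
reg.write(reg.value | 0x1)                  # read-modify-write of another bit: silent
reg.write(reg.value | (2 << FIELD_SHIFT))   # field set to a legal value: silent
reg.write(reg.value & ~FIELD_MASK)          # field cleared back to zero: warning
```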

Dealing with Warnings

When Simics issues a warning about suspicious behavior, it is usually prudent to go back and examine why it happens. We have seen several cases in the past where Simics warnings turned out to pinpoint very real bugs. For some reason, many of these involve the use of memory-management units in processors.

In one example, we kept seeing warnings about TLB entries being set up with unaligned page addresses (for example, mapping a 16MB page to an address not aligned to 16MB). The hardware silently aligns such unaligned mappings by masking all low bits in the page address (this behavior is documented). On hardware, this just meant that a mapping ended up covering a different area of memory than the designers had intended (the start of the mapping was a few tens of MB lower), but since the memory area targeted happened to be included, nothing adverse happened. Until memory filled up and the OS needed to use the memory just below the mapping for code – which it could not do, since the previous mapping was hiding it in the TLB. This error would have been easy to spot if we had checked the warnings earlier.
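The masking itself is simple arithmetic. The sketch below uses made-up numbers (a 16MB page requested at 52MB), not the addresses from the actual bug, to show how the effective start of a mapping slides downward when the low bits are cleared; `aligned_base` is a hypothetical helper name.

```python
MB = 1024 * 1024


def aligned_base(address, page_size):
    """Clear the low bits of the address, as the hardware is described to do."""
    return address & ~(page_size - 1)


page_size = 16 * MB
requested = 0x0340_0000                # 52MB, not a multiple of 16MB
effective = aligned_base(requested, page_size)
shift = requested - effective          # how far the mapping start moved down

print(f"requested 0x{requested:08x}, effective 0x{effective:08x}, "
      f"moved down by {shift // MB}MB")
```

With these illustrative values the mapping starts at 48MB instead of 52MB, so the 4MB just below the intended start is silently claimed by the mapping. A larger page size allows a correspondingly larger shift.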

I think this case is very typical, in that Simics warnings tend to uncover latent problems in the target code. Things that work right here, right now, with this particular operating setup, on a particular type of hardware, but which will break if anything changes (including a respin of an SoC). It might also break randomly in the field, as the software is used and state accumulates.

There is a software engineering ideal that I subscribe to that all code should be warning-free in compilation (in gcc, I always use -Werror). Similarly, it would be prudent to make sure that all code is warning-free in Simics. Especially if code is targeting reliable, secure, or safety-critical systems. Simics offers a unique insight into some of the code behavior that is otherwise invisible, but could be critical.

Summary

Doing warnings right is a problem in any programming support system. In general, software cannot know that other software is "correct" (since that would amount to solving the halting problem), but has to rely on heuristics and good design to catch most mistakes while not generating too many false positives.

As discussed above, getting the warnings in a virtual platform right can take some skill and consideration. Simply being correct in the sense that "if it works on hardware, it works on the virtual platform" is not a sufficiently deep design criterion. A key advantage of a virtual platform is that it is more helpful than hardware. And being more helpful does mean being more picky about suspicious behavior.
