One day, I get this phone call. "We're working with the system, we see the calls that update the exception handlers early on – connecting the clock routines, etc. Then not very much farther on we see the system has run past where tasking should be running but it's not. When we check, we're not getting any clock interrupts, and the clock exception handler isn't there any more. How could the system do that? Is there any way the OS could remove its clock connections?"
The complete story was even more eyebrow-raising. "We're testing for a launch. This vehicle is one of a kind; there are no backups. We're doing environmental limits testing. We're still within spec, but operating at the low end of the electrical ratings. We need to explain this anomalous behavior or scrub."
This was the start of a frantic week of work. On PowerPC, the architecture keeps its exception vectors down in low RAM. A valid method of updating these vectors is to write the new code to the vector address as data, then force the data cache to push it out to RAM, and finally tell the processor "any instructions you have cached for this area are invalid" (invalidate the I-cache). This last step ensures that the CPU will fetch the newly updated instructions the very next time they're called.
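As a rough illustration of that three-step update, here is a minimal sketch in C with GCC-style inline assembly. It is not the code from the system in this story. The names `install_vector` and `sync_icache_range` are hypothetical, the 32-byte cache line is an assumption that must be checked against the actual part, and the 0x100-byte vector slot assumes a classic PowerPC vector layout. On a non-PowerPC host the cache instructions compile away, so only the copy happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE  32u     /* assumption: 32-byte cache lines */
#define VECTOR_SLOT 0x100u  /* assumption: classic PowerPC, vectors 0x100 apart */

/* Make freshly written code in [start, start+len) visible to instruction fetch. */
static void sync_icache_range(void *start, size_t len)
{
    uintptr_t first = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t end   = (uintptr_t)start + len;
#if defined(__powerpc__) || defined(__PPC__)
    uintptr_t a;
    /* Step 2: push each modified data-cache line out to RAM. */
    for (a = first; a < end; a += CACHE_LINE)
        __asm__ volatile ("dcbst 0,%0" : : "r"(a) : "memory");
    __asm__ volatile ("sync" : : : "memory");   /* wait for the stores to complete */
    /* Step 3: invalidate any stale copies of those lines in the I-cache. */
    for (a = first; a < end; a += CACHE_LINE)
        __asm__ volatile ("icbi 0,%0" : : "r"(a) : "memory");
    /* Discard anything already prefetched from the old contents. */
    __asm__ volatile ("sync\n\tisync" : : : "memory");
#else
    (void)first;  /* host build: no cache maintenance is performed */
    (void)end;
#endif
}

/* Hypothetical helper: copy handler code into a vector slot, then sync caches. */
void install_vector(void *vector, const void *handler_code, size_t len)
{
    assert(len <= VECTOR_SLOT);          /* handler stub must fit in one slot */
    memcpy(vector, handler_code, len);   /* Step 1: write the code as data */
    sync_icache_range(vector, len);
}
```

A caller would pass the vector address for the exception in question along with a small handler stub. The point of the sketch is only the ordering: write, flush the D-cache, then invalidate the I-cache.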
The hardware involved was fairly rare and expensive. We didn't have a lot of options available for hardware debug assist. In fact, I had never even seen the hardware in person; all my work was theoretical, and experiments were done with "similar" boards.
I used what I had to trace the execution sequence to the point in question, where we should have been seeing clock ticks come through. With all the hardware I had, there was no way I could recreate the problem. In every case, I had clock ticks come through as expected, and with code to set the clock to an expected value earlier on, every single run showed the same exact results. It was impossible for it to /not/ work.
Boiled down, the sequence under investigation runs like this:
The runtime starts. Vectors are populated with default values and routines. Caches are invalidated and enabled. Board hardware is initialized. A clock is set up and its exception vector is populated via a data write; the D-cache is "flushed" to RAM, the instruction cache is told "this area of RAM is updated" (invalidated), and we continue running. 1/60th of a second later we should get a clock tick, it should be processed through this exception vector and …
The problem seemed to be that the exception vector was running the same old code, not the new code. We could have checked whether the new code had actually been pushed all the way into RAM, except that we had neither that capability at the time nor enough time to set it up.
In every other test setup, we saw the expected behavior – the clock tick delivered and tasking running exactly as expected. In this particular setup, it was as if the clock tick never happened. The CPU side of things clearly ran as if the code had never, ever been invalidated.
As a point of interest, we found that the code executed was smaller than the instruction cache. We'd never exhausted the cache. We formed a theory, hard to prove, but interesting. The units under test were all the -same-, hardware-wise, and all were running at the low end of the voltage range. With the lower voltage, was it possible that some anomaly of the hardware caused the instructions that had been invalidated to... somehow... not be invalid? The only way we could still be executing those instructions is if they were still in the I-cache, and we'd invalidated them. Perhaps the hardware somehow revalidated the cache when run at low voltages?
Bringing power back up to spec showed normal execution. Working with the hardware vendor, we learned that the lab-grade hardware could show anomalies not exhibited by deployment-grade hardware, and indeed the lab-grade hardware would fail tests that the deployment-grade hardware passed without problems.
Testing was repeated on deployment-grade hardware, which ran without incident.
Looking back, had I known then what I know now, it would have been expeditious to execute a field of no-ops the size of the I-cache immediately after installing the timer vector. This would have caused every entry in the I-cache to be replaced with new ones, pushing the timer vector code out completely. The next execution of that vector would have forced the I-cache to re-read the whole vector from RAM. We would have seen the clock ticks where expected, and we could have explicitly tested the I-cache in this way.
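A field of no-ops like that can be laid down at build time, so it does not itself depend on the cache maintenance under suspicion. The sketch below is one hypothetical way to do it with GCC or Clang and a GNU-compatible assembler. `icache_sweep` is an illustrative name, and the 32 KiB cache size is an assumption to be replaced with the real figure for the part. Whether one straight-line pass displaces every line also depends on the cache's associativity and replacement policy, so treat it as an experiment rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define ICACHE_BYTES 32768  /* assumption: 32 KiB I-cache; check the part's manual */
#define INSN_BYTES   4      /* PowerPC instructions are 4 bytes each */
#define ICACHE_NOPS  8192   /* ICACHE_BYTES / INSN_BYTES, as a literal for .rept */

static_assert(ICACHE_NOPS * INSN_BYTES == ICACHE_BYTES,
              "the no-op field must span the whole I-cache");

#define STR2(x) #x
#define STR(x)  STR2(x)

/* Hypothetical helper: execute one straight-line run of no-ops so that
 * instruction fetch walks a region as large as the I-cache. Returns the
 * number of bytes the field covers on PowerPC. (On other hosts the same
 * source still assembles, but a nop may be a different size.) */
__attribute__((noinline))
size_t icache_sweep(void)
{
    __asm__ volatile (
        ".rept " STR(ICACHE_NOPS) "\n\t"
        "nop\n\t"
        ".endr"
        : : : "memory");
    return (size_t)ICACHE_NOPS * INSN_BYTES;
}
```

Calling `icache_sweep()` right after the timer vector is installed, then waiting for the next tick, would separate the two suspects: if ticks appear after the sweep but not without it, stale I-cache contents are implicated rather than the write to RAM.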