When you're debugging a process, or a system, you can never be sure where it will lead you. Sometimes you need special tools to find out what happened. Sometimes you just need a little extra visibility into the system.
When you have a specific bit of code that runs, and that bit of code always fails the same way, you're at least starting with a defined test case and a known, expected failure syndrome. The real challenge is when the system seems to fail with a random set of symptoms – perhaps there are a few specific symptoms, but you can't predict which will strike, or which task it will hit, or when.
Some of my favorite failure syndromes… (paraphrased)
Our clock driver runs almost perfectly, but once every several days it steps backward by a few ticks!
msgQSend is returning a semaphore ID… see!
Every once in a while, coming out of windExit, our task tries to execute something that's not an instruction. It has the same value as the address of my watchdog timer. Can it do that? [NO!]

Sometimes the processor just halts. It will run fine for days at a time, then it just halts. We have a simplified reproduction procedure, but we can't tell when it will halt. It always halts after we flash the "busy" LED. Since it halts, we can't bring back any information over the JTAG.
Debugging any of these is a combination of knowing where to look and which tools to look with. One of these turned out to be a race condition involving an ISR and a global value; another was some form of hardware anomaly where the RAM controller was somehow written to, so the RAM stopped responding. Another was a write through a mishandled pointer. I've also seen errors caused by cache logic un-invalidating invalidated lines, causing execution of stale cached instructions. Some of these literally took months to debug.
Looking back, with today's tools, some of the older problems that were hard to debug would be almost easy now. Using a combination of approaches – enabling the MMU to protect certain objects, visualization tools, and JTAG hardware assist – can make short work of otherwise very complicated problems. Modern processors can assist in some cases; some processors actually have instruction trace buffers built in.
The trickiest problems to debug fall into a few classes: environmental conditions, race conditions, and hardware anomalies are among the hardest. Hardware anomalies can sometimes be found with diagnostics, like BIT and POST testing, or even simple compliance tests, but often they require logic analyzers or other test hardware. Race conditions can be hard to recreate and may take hours or days for the "races" to align and collide; you may need faster machines, or tricks involving clock rates to make time appear to move faster, to get the race condition to surface. Environmental conditions may be the hardest of all to recreate: radiation, electromagnetic noise, vibration, heat and cold extremes – some of these may be impossible to reproduce without a very expensive laboratory.
Sometimes the best tool to use is the one at your fingertips. Recently, assisting a customer with debug, we started looking at why a PCI device seemed more stable after a power-cycle than after holding down the "reset" button – in theory these should result in roughly the same condition. When we ran the device on one board, it operated as expected, but on another board it did not. One of the tools we had at hand was the address ranges of devices on the PCI bus, and the expected values we should be able to read after either power-on or reset. We used some VxWorks magic to manually probe the device for some of its information early in the hardware-init stage and record it to dump later on – you can do this by adding some code to the right spot. What we found in the malfunctioning case was correct information after a power-cycle, but more like random goop after hitting the reset button. A short discussion with some other folks on the hardware side of the house, and we learned that /RESET was not being propagated to the PCI bus.
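The "probe early, dump later" trick is easy to sketch. This is not the customer's actual code – the names, the stubbed register read, and the buffer size are all assumptions – but it shows the shape: stash raw values in a static buffer during early hardware init (before the console is usable), then print them later from the shell. On VxWorks the stub would typically be replaced by a real PCI config-space or memory-mapped read.

```c
#include <stdint.h>
#include <stdio.h>

/* Small static buffer filled during early init, dumped once the
 * console is up. All names here are illustrative. */
#define CAPTURE_MAX 16

static uint32_t capture_buf[CAPTURE_MAX];
static uint32_t capture_addr[CAPTURE_MAX];
static int      capture_cnt = 0;

/* Stub for the early device read; swap in the real bus access
 * (e.g. a PCI config-space read) on target. */
static uint32_t read_device_reg(uint32_t addr)
{
    (void)addr;
    return 0x12345678u;   /* placeholder value for this sketch */
}

/* Call from early hardware init, before printf/console works. */
void capture_reg(uint32_t addr)
{
    if (capture_cnt < CAPTURE_MAX) {
        capture_addr[capture_cnt] = addr;
        capture_buf[capture_cnt]  = read_device_reg(addr);
        capture_cnt++;
    }
}

/* Call later, e.g. from the target shell, to inspect what the
 * device looked like right after power-on or reset. */
void dump_capture(void)
{
    for (int i = 0; i < capture_cnt; ++i)
        printf("addr 0x%08x = 0x%08x\n", capture_addr[i], capture_buf[i]);
}
```

Comparing the dumped values against the device's documented post-reset register contents is what separates "correct information" from "random goop" without ever attaching an analyzer.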
In this case, in lieu of using a hardware analyzer to find out what was – or wasn't – happening, we used investigation and a process of elimination. We looked at values being read early on after a reset cycle to see why a device driver was malfunctioning. The only way the PCI device wouldn't give the proper values after a RESET cycle is if it were not being reset. Though a hardware analyzer would have shown the problem after a single reset cycle, a bit of crafty coding and a well-thought-out investigation let us forgo hardware we did not have.