Quit Bugging Me: Driving Me Backwards

You're working on a highly precise instrument – the kind of instrument that relies on extremely accurate measurements of time. Work is proceeding normally. Suddenly… events come in out of order, time jumps around – but only for a few ticks.

The heart of the system – in this case – is a hyper-accurate clock.  The clock's accuracy can be verified by external sources.  The clock itself is not only part of the computer, but also part of the scientific apparatus.  It is crucial that the timing marks in the data are as accurate as possible, or the data becomes worthless.

Everything seems to work perfectly. All the calibration is done, data sets are processing exactly as intended, then suddenly… time goes backwards. Testing has been going on for months, the tests haven't found problems, and the data all looks good, but on the longest test runs, once in a long while, the clock seems to skip a beat… and move backwards.

An unlikely test is suggested: turn up the clock rate, either in actuality or virtually. It's easy to say, "The OS must have dropped the ball somewhere, perhaps delayed processing of some critical bit of information at just the wrong time. The timer driver runs for days at a time with no errors – how can this test find a problem with the system?"

In cases like this it's almost always best to work backwards from the symptoms. The main symptom is that once in a long while, regardless of what the rest of the system is doing, time jumps a couple of ticks backwards all at once.

The best way to debug a situation like this is to start with what was happening when the system took its trip into the past. Software normally isn't sentimental and doesn't care what happened in ancient history a fraction of a second ago. Look for patterns. In this particular case, the most common pattern in the failed tests was that the system had been running for many weeks without failure when it suddenly took a step backward in time. What happened before those brief steps back in time varied – there were no obvious patterns.

Since the system was in a different state each time the problem occurred, the most obvious commonality between the incidents was that the system had been running a "longer" test. This means re-creating the failure may be a very long-term endeavor. What do you do in these cases? (Problems like this used to be difficult to debug before there were system visualization tools like the System Viewer.)

One thing you can do is speed time up. If your system is not processor-limited, increase the system tick rate and increase the data rates on other tests you can run simultaneously. Since it takes a long time for the problem to show itself, compressing time may help debug it.
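For instance, on a VxWorks-style system (an assumption here, suggested by the System Viewer mention) the tick rate can be raised at run time with sysClkRateGet()/sysClkRateSet(). A minimal sketch – the factor of ten is arbitrary:

    #include <vxWorks.h>
    #include <sysLib.h>
    #include <stdio.h>

    /* Sketch: compress time by raising the system clock tick rate.
     * Assumes VxWorks-style sysClkRateGet()/sysClkRateSet(); the
     * factor of 10 is arbitrary and only for illustration. */
    void speedUpTicks (void)
        {
        int oldRate = sysClkRateGet ();

        if (sysClkRateSet (oldRate * 10) == OK)
            printf ("tick rate raised from %d to %d ticks/sec\n",
                    oldRate, sysClkRateGet ());
        else
            printf ("tick rate unchanged - hardware may not support %d\n",
                    oldRate * 10);
        }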

The other thing to do is review the code involved. Start small in scope – what was running at the time the problem occurred? Work from the assumption that the code has not become corrupted. When an otherwise stable driver starts going awry, look at everything that could affect that driver – especially the driver itself.

I've actually seen this problem – and others similar to it in nature – more than once. In the cases I've seen it's mostly been race conditions – either "mistakes" made in the code sequence, or timing constraints imposed by hardware that the coder needs to be aware of and work around.

This is a bug known as a race condition. A critical section is not properly protected, an IRQ is re-enabled too early, hardware access is allowed again too soon – something along these lines. The general feeling is that of a figure-eight race track. As long as the drivers are spaced just right, the cars go speeding through the crossover without colliding, but it only takes the smallest bit of mistiming to cause a small collision and precipitate a gigantic crash.
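The usual protection is to keep every touch of the shared state inside a critical section. A minimal sketch in C, assuming a VxWorks-style intLock()/intUnlock() pair and a 64-bit tick count kept as two 32-bit halves (so it cannot be read atomically) – any mechanism that masks the tick interrupt would do the same job:

    #include <vxWorks.h>
    #include <intLib.h>

    /* Shared with the timer ISR: a 64-bit tick count kept as two
     * 32-bit halves on a 32-bit CPU, so it cannot be read in a
     * single atomic operation. */
    static volatile unsigned long tickHigh;
    static volatile unsigned long tickLow;

    /* Sketch: take a consistent snapshot from task level by closing
     * the "crossover" while the two halves are read. */
    unsigned long long tickSnapshot (void)
        {
        int key = intLock ();              /* keep the tick ISR out  */
        unsigned long hi = tickHigh;
        unsigned long lo = tickLow;
        intUnlock (key);                   /* let interrupts back in */

        return ((unsigned long long) hi << 32) | lo;
        }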

The timer driver had a sequencing mistake. The global tick count was incremented immediately after the interrupts were re-enabled. Once every trillion ticks or so, the clock tick comes through and right behind it a second tick comes through. With the IRQ already re-enabled, the second tick's handling completes first, then the first tick's handling finishes, and the tick value is seen to "go backwards" by one tick. This makes time seem to jump forward, backward, and forward again – two forward, one back – and then continue normally. Looking at the disassembled code, there was only a window of perhaps 5 instructions where a second interrupt could possibly have triggered the bug, which is why the system could run for months without exhibiting the problem.
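A simplified sketch of that kind of mis-sequencing – not the actual driver, and the hwAckTickIrq()/hwEnableTickIrq()/postTickEvent() helpers are hypothetical stand-ins for the real hardware and event code:

    /* Hypothetical stand-ins for the real hardware and event hooks. */
    extern void hwAckTickIrq (void);
    extern void hwEnableTickIrq (void);
    extern void postTickEvent (unsigned long tick);

    static volatile unsigned long tickCount;

    /* BUGGY ordering (sketch): the tick IRQ is re-enabled before the
     * shared tick state is finished with.  If a second tick lands in
     * the few-instruction window below, the nested handler posts its
     * (later) tick value first, the pre-empted handler then posts its
     * earlier value, and consumers see time step backwards by one. */
    void timerIsrBuggy (void)
        {
        hwAckTickIrq ();
        hwEnableTickIrq ();                 /* window opens here      */

        unsigned long myTick = ++tickCount; /* this tick's number     */
        postTickEvent (myTick);             /* may run AFTER a nested,
                                             * later tick's post      */
        }

    /* FIXED ordering (sketch): finish all work on the shared state
     * before allowing the tick interrupt to nest again. */
    void timerIsrFixed (void)
        {
        hwAckTickIrq ();

        unsigned long myTick = ++tickCount;
        postTickEvent (myTick);

        hwEnableTickIrq ();                 /* re-enable last         */
        }

With the fixed ordering the window is gone: the count and the posted value always move forward together, no matter when the next tick arrives.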

The first time I saw a hardware-induced race condition was on what I think of now as a vintage VME computer. It was a 68010-based board with a serial chip (and a bunch of discrete electronics!). A programming example for "implementing an adjustable buffer feature" (heh… turning on the 4-char buffer) included a number of "nops" between writes to the serial chip. We had a customer try to return the hardware once, because when they removed all the "nops" the serial driver stopped working ("they don't do anything, right? we need it to run fast!"). We explained that the no-ops actually created a delay just long enough for the serial chip to "become ready again". Removing those delay instructions induced a race condition where the CPU (serial driver) was too fast for the serial chip. (There were only perhaps a dozen or so no-ops.) In this case, the serial chip itself needed some time to settle out after a mode change, and accessing it too soon gave unpredictable results.
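In C, the same idea looks something like the sketch below. The register address and the length of the delay are hypothetical – the original fix really was a dozen literal no-op instructions – but the shape is the same: give the chip time to settle between accesses.

    #include <stdint.h>

    /* Hypothetical address for the serial chip's mode register -
     * a stand-in, not the real board's memory map. */
    #define SERIAL_MODE_REG  (*(volatile uint8_t *) 0xFFF10000)

    /* A compiler-proof stand-in for the original run of "nops": a
     * small volatile spin the optimizer cannot remove. */
    static void serialSettleDelay (void)
        {
        volatile int i;
        for (i = 0; i < 12; i++)    /* roughly "a dozen or so no-ops" */
            ;                       /* burn a little time             */
        }

    /* Write two mode bytes, letting the chip become ready again
     * between accesses. */
    void serialSetMode (uint8_t mode1, uint8_t mode2)
        {
        SERIAL_MODE_REG = mode1;
        serialSettleDelay ();       /* chip needs time to settle */
        SERIAL_MODE_REG = mode2;
        serialSettleDelay ();
        }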