Quit bugging me: surprise NaN!

You've got a complex system: dozens of tasks and a handful of interrupt routines.  Your system runs fine for hours, sometimes days on end.  The loading on your system is somewhat "bursty" in that average loading is only about 20% of peak load, and peaks only happen once every hour for less than 5 seconds.  Every once in a while, for whatever reason, you suddenly get a divide by zero, a not-a-number, or some other unanticipated erroneous result from one of your floating point calculations.

In your system you only have a few tasks that employ floating point.  You're pretty sure none of your ISRs do; all they do is copy or move data around, so they shouldn't have to compute anything.  Yet for some reason, occasionally any one of the tasks running floating point will go belly-up with unexplainable results.  It just looks like some kind of random corruption.

Upon further examination you find that your data isn't clean: every now and then you get results that are impossible for the device you're reading.  It just doesn't generate values in those ranges.  When you read your device in a tight loop, no matter for how long, it doesn't deviate; values are always within range.  Now you've got a mystery: is something corrupting data?

The symptom is almost always the best place to start.  It's a random occurrence as far as timing goes.  Sometimes the values cause an error that is caught (an exception), and now you know that sometimes the values are silently corrupted somehow.  Your system has some indeterminism in how it initializes, depending on network conditions, etc., so tasks aren't always in the same places in RAM.  Yet… somehow, only your floating point tasks are affected.  You begin to suspect the system itself is a root cause.

Believe it or not, your best tool to start with in a situation like this is a code survey.  Because of the random timing of the problem and its affinity for hitting only floating-point tasks, combined with values that couldn't have been generated by the device being handled, the first place to check is: interrupt handlers!  Even though you're sure you're only moving data and not doing calculations, what kind of data are you moving?  You may think you're just copying some innocent structure from here to there… but if it contains floating point data, you may well be using the floating point hardware in the computer.  It's not just about calculations.  Floating point data is handled differently and the compiler knows about it, so even when source code merely moves floating point data around, the compiler can generate machine code that uses the FP registers to carry the data.  You might just find something like this in your compiled code!

0003df68   c11fbfe8      ldd         [ %fp - 0x00000018 ], %f0
0003df6c   c91fbfe0      ldd         [ %fp - 0x00000020 ], %f4

And that could ruin your day.  All it takes is one ISR unexpectedly changing the contents of any floating point register, and the task context held in the FP registers is now corrupted.  Suggested fix: remove the use of floating point at ISR time.  ISRs don't save and restore the floating point context because doing so is costly (it could more than double the context switch time).  Alternatively, alter that ISR so it saves and restores the FP context itself.
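One way to take floating point out of an ISR's data path is to copy the record as raw bytes instead of assigning the structure.  The following is only a sketch under stated assumptions: sample_t and sample_copy_isr_safe are hypothetical names invented for illustration, not part of any system described here.  The volatile byte accesses are intended to stop the compiler from merging the copy into wider loads and stores, which is where FP registers tend to sneak in.  Whether your compiler cooperates is something to confirm from the disassembly of the ISR, not to take on faith.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record moved by an ISR.  The double members are what
 * can tempt a compiler into using FP registers for a plain
 * structure assignment. */
typedef struct {
    uint32_t sequence;
    uint32_t flags;
    double   reading;
    double   timestamp;
} sample_t;

/* Pattern to be suspicious of at interrupt level:
 *
 *     *dst = *src;
 *
 * It looks like a pure data move, but the generated code may load
 * and store the doubles through FP registers. */

/* Copy the record one byte at a time.  Each volatile access has to be
 * performed as written, so the compiler cannot combine them into
 * 64-bit moves.  This trades speed for predictability. */
static void sample_copy_isr_safe(sample_t *dst, const sample_t *src)
{
    volatile unsigned char *d = (volatile unsigned char *)dst;
    const volatile unsigned char *s = (const volatile unsigned char *)src;
    size_t i;

    for (i = 0; i < sizeof(sample_t); i++) {
        d[i] = s[i];
    }
}
```

A byte loop is slower than a word copy, so for larger records it may be better to have the ISR hand off only a pointer or an index and leave the actual copy to a task that owns an FP context.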

3 Comments

  1. Vipin

    The bigger problem is that the compiler may use FP registers even when you have no floating point variables or floating point calculations.
    How?
    Well, if you have an 8-byte structure (say 2 integer variables or 8 char variables) and you do a structure copy, the compiler may cleverly use FP registers to move the contents, even if you do not use any optimization flag.
    Tip: Compile any code that should not use floating point with -msoft-float.
    [Ed: This is a valid concern for 64-bit entities, for example long-long, however this does disable floating point usage entirely; if you're using task-only code and can enable VX_FP_TASK, this may provide a better solution. In any case it's best to leave any heavy number crunching to a service task and keep the ISR code as short as possible.]
