You've got a complex system. There are dozens of tasks and a handful of interrupt routines. Your system runs fine for hours, or sometimes days, on end. The loading on your system is somewhat "bursty": average loading is only about 20% of peak, and peaks happen only once an hour, for less than 5 seconds. Every once in a while, for whatever reason, you suddenly get a divide by zero, a not-a-number, or some other unanticipated erroneous result from one of your floating point calculations.
In your system, only a few tasks employ floating point. You're pretty sure none of your ISRs do: all they do is copy or move data around, so they shouldn't have to compute anything. Yet for some reason, occasionally any one of the floating-point tasks will go belly-up with unexplainable results. It looks like some kind of random corruption.
Upon further examination you find that your data isn't clean: every now and then you get results that are impossible for the device you're reading. It simply doesn't generate values in those ranges. And when you read the device in a tight loop, no matter for how long, it doesn't deviate; values are always within range. Now you've got a mystery: is something corrupting data?
The symptom is almost always the best place to start. The timing is random, the values sometimes cause an error that is caught (exceptions), and now you know the values themselves somehow become corrupted. Your system has some nondeterminism in how it initializes, depending on network conditions and so on, so tasks aren't always in the same places in RAM. Yet… somehow, only your floating-point tasks are affected. You begin to suspect the system itself is the root cause.
Believe it or not, your best tool to start with in a situation like this is a code survey. Given the random timing, the affinity for floating-point tasks, and values that couldn't have been generated by the device being handled, the first place to check is: interrupt handlers! Even though you're sure you're only moving data and not doing calculations, what kind of data are you moving? You may think you're just copying some innocent structure from here to there… but if it contains floating point data, you'll be using the floating point hardware in the computer. It's not just about calculations: floating point data requires different handling, and the compiler knows it, so even source code that merely moves floating point data around gets compiled into machine code that uses the FP registers to carry it. You might just find something like this in your compiled code!
0003df68 c11fbfe8 ldd [ %fp - 0x00000018 ], %f0
0003df6c c91fbfe0 ldd [ %fp - 0x00000020 ], %f4
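To see how that kind of code gets generated, here's a minimal sketch (the struct and function names are hypothetical): a plain structure copy with no arithmetic anywhere, which a compiler targeting FP hardware is free to translate into FP-register loads and stores like the ldd instructions above.

```c
/* Hypothetical example: an "innocent" record an ISR might shuttle
 * between buffers. The names are made up for illustration. */
struct sample {
    double timestamp;   /* floating point hiding inside the struct */
    double reading;
};

/* No arithmetic here -- just a move. Yet on a target with FP hardware
 * the compiler may copy the double members through FP registers,
 * producing instructions like the ldd [...], %f0 shown above. If this
 * runs at ISR time, it can silently clobber the interrupted task's
 * FP context. */
void copy_sample(struct sample *dst, const struct sample *src)
{
    *dst = *src;    /* plain structure assignment, no "math" at all */
}
```

Whether the FP registers actually get used depends on the target and optimization level, which is exactly why a survey of the compiled output, not just the source, is what catches this.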
And that could ruin your day. All it takes is one ISR unexpectedly changing the contents of any floating point register, and the task context held in the FP regs is corrupted. Suggested fix: remove the use of floating point at ISR time; it's costly anyway, because ISRs typically don't save/restore the floating point context (doing so could more than double the context switch time). Alternatively, alter that ISR so it maintains its own FP context.
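One conservative way to take the first option, keeping an ISR's data shuffling out of the FP registers entirely, is to copy the bytes explicitly through integer registers. This is a sketch under that assumption, not a guarantee: whether any given copy touches FP registers depends on your compiler and target, so always verify against the generated code.

```c
#include <stddef.h>

/* ISR-safe copy sketch: moves bytes one at a time through integer
 * registers, so no FP register is involved. A plain struct assignment
 * (or even memcpy on some platforms) may be optimized into FP
 * loads/stores; a char-at-a-time loop typically will not be -- but
 * confirm with your own toolchain's output. */
void isr_safe_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
}
```

At ISR time you'd call isr_safe_copy(&out, &in, sizeof in) instead of assigning the structure. The cost is a slower copy, which is usually acceptable for the small records an ISR moves.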