Quit Bugging Me: Another Surprise NaN!

An earlier QBM, "Surprise NaN", covered how floating-point computations can be corrupted from unexpected sources.  Today's Quit Bugging Me is about… another Surprise! NaN!

An application runs several tasks.  One task, which computes a set of floating-point values, periodically comes up with bad values, and sometimes its task name becomes corrupted as well.

Here is a short example, done with the Solaris simulator.  The kernel options do -not- include any hardening, nor any guard pages or stack protection.  In fact, only the options related to "MMU_BASIC" are included.  I have a breakpoint set just before the exit so I can catch overflows.  surpriseNAN recurses on its argument, easily overflowing the default stack size.  The function calls logMsg first thing, which is a cheap way to get the task ID printed out.  The kernel shell output below is edited to show the main points.

Symptoms:  the task names are valid on entry but later disappear, there is a crash that involves "reschedule", and the values held in the FP registers are badly corrupted:

-> task spawned: id = 0x40b4820, name = t2

task spawned: id = 0x40bb968, name = t3

task spawned: id = 0x40ad6d8, name = t4

task spawned: id = 0x40a6590, name = t5

task spawned: id = 0x409f448, name = t6

task spawned: id = 0x4098300, name = t7

0x40bb968 (t3): alive.

0x40ad6d8 (t4): alive.

0x40a6590 (t5): alive.

Some time later the breakpoint is hit.  Several of the tasks are "missing", though none completed.

-> i


  NAME      ENTRY     TID    PRI   STATUS      PC       SP     ERRNO  DELAY

---------- ------------ -------- --- ---------- -------- -------- ------- -----

tExcTask   excTask   40d8250   0 PEND          a89cc  40d8058       0     0

tLogTask   logTask   40d37b8   0 PEND          a89cc  40d35c0       0     0

tWdbTask   5eea4     40ce308   3 PEND          664c8  40cdfd8       0     0

!       surpriseNAN  40bb968 100 READY         6a80c  40b4cb8       0     0

!       surpriseNAN  40ad6d8 100 READY         6a80c  40a6a28       0     0

       surpriseNAN   40a6590 100 SUSPEND       6a810  409f950       0     0

value = 0 = 0x0

Notice that task t5 hit the breakpoint and was suspended.  The other two tasks are "ready" and never hit the breakpoint, yet they are no longer running.  And what happened to the names?  Let's look at one of the tasks…

-> ti 0x40bb968


NAME       ENTRY      TID    PRI   STATUS      PC       SP     ERRNO  DELAY

---------- ------------ -------- --- ---------- -------- -------- ------- -----

        surpriseNAN  40bb968 100 READY         6a80c  40b4cb8       0     0


stack: base 0x40bb968  end 0x40b4c08  size 27856  high 27856  margin 0    


options: 0x1d

VX_SUPERVISOR_MODE  VX_DEALLOC_STACK    VX_FP_TASK          VX_STDIO            


%pc    =    6a80c   %npc   =    6a810   %psr   =        0   %mask  =        0

%tbr   =        0   %y     =        0                                        

%g0    =        0   %g1    =       5f   %g2    =        0   %g3    =  40ad6d8

%g4    =        0   %g5    =        0   %g6    =        0   %g7    =        0

%i0    =        1   %i1    =    db800   %i2    =        0   %i3    =        0

%i4    =        0   %i5    =        0   %fp    =  40b4d30   %i7    =  40d9d60

%l0    =  40bb968   %l1    =        0   %l2    =        0   %l3    =        0

%l4    =        0   %l5    =        0   %l6    =        0   %l7    =        0

%o0    =        0   %o1    =    db800   %o2    =        0   %o3    =        0

%o4    =        0   %o5    =        0   %sp    =  40b4cb8   %o7    =    6a804


%fsr   = eeeeeeee
%f00   = 0.793335   %f02   =        4   %f04   =        8   %f06   =        8
%f08   =  7.87933   %f10   = 0.00819755   %f12   = 0.223435   %f14   = -0.668483
%f16   = -0.902314   %f18   =        0   %f20   =        0   %f22   =      NaN
%f24   =      NaN   %f26   =      NaN   %f28   =      NaN   %f30   =      NaN
value = 67819224 = 0x40ad6d8


What happened here?  The FP registers are trashed, and the name is trashed too… something must have corrupted the system.  Is it a stack overrun?

-> checkStack

  NAME        ENTRY        TID     SIZE   CUR  HIGH  MARGIN

------------ ------------ -------- ----- ----- ----- ------

tExcTask     excTask      40d8250  15984   504  3016  12968 

tLogTask     logTask      40d37b8  15984   504  2104  13880 

tWdbTask     0x000005eea4 40ce308  16040   816 11376   4664 

!           surpriseNAN  40bb968  27856 27824 27856      0 OVERFLOW

!           surpriseNAN  40ad6d8  27856 27824 27856      0 OVERFLOW

            surpriseNAN  40a6590  27856 27712 27856      0 OVERFLOW

So in fact there is not just one stack overflow, but three.  How did that happen?

Looking back at the task IDs returned from taskSpawn, we can see that these tasks were all spawned with their OS data structures right in a row.  With no memory protection included in the kernel, the task stacks are not protected.
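To make "right in a row" concrete, here is a small host-side sketch in C that sorts the six task IDs from the spawn messages and computes the distance between neighbors.  It is plain arithmetic on the numbers printed above, not anything that runs on the target; the helper names (tid_gaps, print_tid_gaps) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Task IDs copied from the "task spawned" lines (t2 through t7). */
static const uint32_t spawned_tids[] = {
    0x40b4820, 0x40bb968, 0x40ad6d8, 0x40a6590, 0x409f448, 0x4098300
};
enum { TID_COUNT = sizeof spawned_tids / sizeof spawned_tids[0] };

/* qsort comparator: highest address first. */
static int cmp_desc(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x < y) - (x > y);
}

/* Sort a copy of the IDs and store the distance between each neighboring
 * pair.  gaps must have room for TID_COUNT - 1 entries. */
static void tid_gaps(uint32_t *gaps)
{
    uint32_t sorted[TID_COUNT];
    int i;

    assert(gaps != NULL);
    for (i = 0; i < TID_COUNT; i++)
        sorted[i] = spawned_tids[i];
    qsort(sorted, TID_COUNT, sizeof sorted[0], cmp_desc);
    for (i = 0; i + 1 < TID_COUNT; i++)
        gaps[i] = sorted[i] - sorted[i + 1];
}

/* Print each gap in hex and decimal. */
static void print_tid_gaps(void)
{
    uint32_t gaps[TID_COUNT - 1];
    int i;

    tid_gaps(gaps);
    for (i = 0; i < TID_COUNT - 1; i++)
        printf("gap %d: 0x%x (%u bytes)\n", i, (unsigned)gaps[i], (unsigned)gaps[i]);
}
```

Every gap comes out to 0x7148 (29000) bytes, only a little more than the 27856-byte stack size that ti reports.  So each task's stack plus its bookkeeping occupies one evenly spaced slot, with another task's slot directly on either side.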

When surpriseNAN recursively writes past the limit of its own stack, it corrupts the area where the floating-point data are saved.  Some of the values written work out to be valid floating-point numbers; others are not numbers.  This is near the area where the task name is stored, so with a sufficient overflow the task name is corrupted as well as the FP data.
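Why do some overwritten registers still print as ordinary numbers while others show NaN?  A 64-bit pattern is a NaN only when all 11 exponent bits are set and the fraction is non-zero; any other garbage decodes as some finite (if meaningless) value.  The sketch below shows that test on a host machine, assuming IEEE 754 doubles; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True when a 64-bit pattern, read as an IEEE 754 double, is a NaN:
 * all 11 exponent bits set and a non-zero fraction. */
static bool bits_are_nan(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7ffu;
    uint64_t fraction = bits & 0xfffffffffffffULL;
    return exponent == 0x7ffu && fraction != 0;
}

/* Reinterpret raw bits as a double, much as restoring a saved FP
 * register does: the hardware has no idea the bits are garbage. */
static double bits_as_double(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, bits_are_nan(0xffffffffffffffffULL) is true, while a register pair full of 0xee bytes is not a NaN at all: its exponent field is 0x6ee, so it decodes as a large negative finite number.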

When the task overflows its stack far enough, it eventually overwrites other data.  In this case the other data is the register set of the task immediately adjacent (at lower addresses) in RAM.
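The effect can be imitated on a host with a toy model: one flat byte array holding a neighbor's saved name and FP registers at the low addresses, and a downward-growing stack directly above them.  This is only a sketch under invented assumptions.  The layout, the sizes, the name "tVictim" and the 0xff frame contents are all made up for illustration and are not the real VxWorks control block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum {
    NAME_OFF   = 0,  NAME_LEN = 8,            /* neighbor's task name      */
    FP_OFF     = 8,  FP_COUNT = 4,            /* neighbor's saved FP regs  */
    STACK_END  = FP_OFF + FP_COUNT * 8,       /* lowest legal stack offset */
    STACK_SIZE = 256,
    ARENA_SIZE = STACK_END + STACK_SIZE
};

struct toy_arena {
    uint8_t mem[ARENA_SIZE];
    size_t  sp;               /* stack pointer as an offset into mem */
};

/* Lay out the neighbor's name and FP values, and empty the stack. */
static void arena_init(struct toy_arena *a)
{
    int i;
    memset(a->mem, 0, sizeof a->mem);
    memcpy(a->mem + NAME_OFF, "tVictim", NAME_LEN);
    for (i = 0; i < FP_COUNT; i++) {
        double v = 1.5 * (i + 1);             /* 1.5, 3.0, 4.5, 6.0 */
        memcpy(a->mem + FP_OFF + 8 * i, &v, sizeof v);
    }
    a->sp = ARENA_SIZE;
}

/* Push one frame of 0xff bytes.  Deliberately NO check against STACK_END:
 * that missing check is the bug being modeled.  Only the arena bound is
 * enforced, so the simulation itself stays well defined. */
static bool push_frame(struct toy_arena *a, size_t frame_bytes)
{
    assert(a != NULL);
    if (frame_bytes > a->sp)
        return false;
    a->sp -= frame_bytes;
    memset(a->mem + a->sp, 0xff, frame_bytes);
    return true;
}

/* Read back the neighbor's saved FP register i. */
static double victim_fp(const struct toy_arena *a, int i)
{
    double d;
    memcpy(&d, a->mem + FP_OFF + 8 * i, sizeof d);
    return d;
}

/* True while the neighbor's name is still the one set by arena_init. */
static bool victim_name_intact(const struct toy_arena *a)
{
    return memcmp(a->mem + NAME_OFF, "tVictim", NAME_LEN) == 0;
}
```

Pushing 256 bytes of frames fills the stack exactly and leaves the neighbor untouched.  One more 16-byte frame turns the two highest FP slots into NaN while the name still reads correctly, and a further 24 bytes wipe out the remaining registers and the name.  In this toy layout the FP data goes bad before the name does, simply because it sits closer to the stack.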

Something to notice in the corrupted FP registers: the FSR is set to a stack-fill value (so the FSR screams "eeeeeeee!").  This is one more telltale sign that the context for this task has been corrupted.
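The same fill pattern is what makes a high-water report like the checkStack output above possible.  If a stack is filled with 0xee bytes when the task is created, the bytes that still hold the fill at the far end are the unused margin.  The following is a minimal host-side sketch of that idea, not the actual checkStack implementation; the names and the array-based stack are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_BYTE 0xeeu

/* True when a 32-bit word is nothing but fill bytes, like fsr = eeeeeeee. */
static bool word_is_stack_fill(uint32_t word)
{
    return word == 0xeeeeeeeeu;
}

/* For a downward-growing stack held in stack[0..size), where stack[0] is
 * the far end (the last byte the task is allowed to use): count how many
 * bytes at the far end still hold the fill pattern.  That count is the
 * margin; size minus margin is the high-water mark. */
static size_t stack_margin(const uint8_t *stack, size_t size)
{
    size_t margin = 0;

    assert(stack != NULL);
    while (margin < size && stack[margin] == STACK_FILL_BYTE)
        margin++;
    return margin;
}
```

A margin of 0, as in the three OVERFLOW lines, means even the very last byte of the stack has been written, so the task has reached or passed its limit.  Conversely, a saved register that reads eeeeeeee has never been written with a real value since the fill was applied.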

