An earlier QBM, "Surprise NAN", covered how floating-point computations can be corrupted by unexpected sources. Today's Quit Bugging Me is about... another Surprise! NAN!
An application runs several tasks. One task, which computes a set of floating-point values, periodically comes up with bad values, and sometimes the task name becomes corrupted.
Here is a short example... done with the Solaris simulator. Kernel options do -not- include any hardening, nor any guard pages / stack protection. In fact, only the options related to "MMU_BASIC" are included. I have a breakpoint set just before the exit so I can catch overflows. surpriseNAN recurses on its argument, easily overflowing the default stack size. I call logMsg first thing; it's a cheap way to get the task ID printed out. I'm going to provide edited kernel shell output to show the points.
Symptoms: the task names are valid on entry but suddenly go away, there's a crash that involves "reschedule", and the values contained in the FP registers are... badly corrupted:
-> task spawned: id = 0x40b4820, name = t2
task spawned: id = 0x40bb968, name = t3
task spawned: id = 0x40ad6d8, name = t4
task spawned: id = 0x40a6590, name = t5
task spawned: id = 0x409f448, name = t6
task spawned: id = 0x4098300, name = t7
0x40bb968 (t3): alive.
0x40ad6d8 (t4): alive.
0x40a6590 (t5): alive.
Some time later the breakpoint is hit. Several of the tasks are "missing", though none completed.
-> i
NAME     ENTRY      TID      PRI STATUS     PC       SP       ERRNO   DELAY
-------- ---------- -------- --- ---------- -------- -------- ------- -----
tExcTask excTask     40d8250   0 PEND          a89cc  40d8058       0     0
tLogTask logTask     40d37b8   0 PEND          a89cc  40d35c0       0     0
tWdbTask 5eea4       40ce308   3 PEND          664c8  40cdfd8       0     0
!        surpriseNAN 40bb968 100 READY         6a80c  40b4cb8       0     0
!        surpriseNAN 40ad6d8 100 READY         6a80c  40a6a28       0     0
         surpriseNAN 40a6590 100 SUSPEND       6a810  409f950       0     0
value = 0 = 0x0
Notice - task t5 hit the breakpoint and was suspended. The other two tasks are "ready" but never hit the breakpoint, and they are no longer running. And what happened to the names? Let's look at one of the tasks...
-> ti 0x40bb968
NAME     ENTRY       TID     PRI STATUS     PC       SP       ERRNO   DELAY
-------- ----------- ------- --- ---------- -------- -------- ------- -----
         surpriseNAN 40bb968 100 READY         6a80c  40b4cb8       0     0
stack: base 0x40bb968 end 0x40b4c08 size 27856 high 27856 margin 0
options: 0x1d
VX_SUPERVISOR_MODE VX_DEALLOC_STACK VX_FP_TASK VX_STDIO
%pc = 6a80c %npc = 6a810 %psr = 0 %mask = 0
%tbr = 0 %y = 0
%g0 = 0 %g1 = 5f %g2 = 0 %g3 = 40ad6d8
%g4 = 0 %g5 = 0 %g6 = 0 %g7 = 0
%i0 = 1 %i1 = db800 %i2 = 0 %i3 = 0
%i4 = 0 %i5 = 0 %fp = 40b4d30 %i7 = 40d9d60
%l0 = 40bb968 %l1 = 0 %l2 = 0 %l3 = 0
%l4 = 0 %l5 = 0 %l6 = 0 %l7 = 0
%o0 = 0 %o1 = db800 %o2 = 0 %o3 = 0
%o4 = 0 %o5 = 0 %sp = 40b4cb8 %o7 = 6a804
%fsr = eeeeeeee
%f00 = 0.793335 %f02 = 4 %f04 = 8 %f06 = 8
%f08 = 7.87933 %f10 = 0.00819755 %f12 = 0.223435 %f14 = -0.668483
%f16 = -0.902314 %f18 = 0 %f20 = 0 %f22 = NaN
%f24 = NaN %f26 = NaN %f28 = NaN %f30 = NaN
value = 67819224 = 0x40ad6d8
What happened here? The FP registers are trashed, and so is the name... something must have corrupted the system. Is it a stack overrun?
-> checkStack
NAME         ENTRY        TID       SIZE   CUR  HIGH MARGIN
------------ ------------ -------- ----- ----- ----- ------
tExcTask     excTask      40d8250  15984   504  3016  12968
tLogTask     logTask      40d37b8  15984   504  2104  13880
tWdbTask     0x000005eea4 40ce308  16040   816 11376   4664
!            surpriseNAN  40bb968  27856 27824 27856      0 OVERFLOW
!            surpriseNAN  40ad6d8  27856 27824 27856      0 OVERFLOW
             surpriseNAN  40a6590  27856 27712 27856      0 OVERFLOW
So in fact there is not just one stack overflow, but three. What has happened here?
Looking back at the task IDs returned from taskSpawn, we can see these tasks were all spawned with their OS data structures right in a row. No memory protection was included in this kernel, so nothing protects one task's stack and data from its neighbors.
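As a rough check on the "right in a row" observation, here is a small C sketch that does the arithmetic on the numbers from the shell output. The task IDs, the stack size, and the stack end address are copied from the transcript; the function names (uniform_spacing, gap_below_stack, report_layout) are made up for this illustration. It also assumes, as the ti output hints (the stack base equals the task ID), that each task ID marks where that task's control data sits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Task IDs copied from the taskSpawn output, highest address first. */
static const uint32_t task_ids[] = {
    0x40bb968, 0x40b4820, 0x40ad6d8, 0x40a6590, 0x409f448, 0x4098300
};
#define NUM_TASKS (sizeof task_ids / sizeof task_ids[0])

/* Returns the common spacing between neighbouring IDs, or 0 if uneven.
 * Expects the IDs sorted from highest to lowest address. */
uint32_t uniform_spacing(const uint32_t *ids, size_t n)
{
    assert(ids != NULL && n >= 2);
    uint32_t step = ids[0] - ids[1];
    for (size_t i = 1; i + 1 < n; i++) {
        if (ids[i] - ids[i + 1] != step)
            return 0;
    }
    return step;
}

/* Bytes between the end of one task's stack and the next lower task ID. */
uint32_t gap_below_stack(uint32_t stack_end, uint32_t lower_task_id)
{
    assert(stack_end >= lower_task_id);
    return stack_end - lower_task_id;
}

/* Print the layout figures derived from the transcript. */
void report_layout(void)
{
    const uint32_t stack_size = 27856;     /* "size" reported by ti */
    const uint32_t stack_end = 0x40b4c08;  /* "end" reported by ti for 0x40bb968 */
    uint32_t step = uniform_spacing(task_ids, NUM_TASKS);

    printf("spacing between task IDs = 0x%x (%u bytes)\n",
           (unsigned)step, (unsigned)step);
    printf("non-stack bytes per task = %u\n", (unsigned)(step - stack_size));
    printf("gap below stack end      = %u bytes\n",
           (unsigned)gap_below_stack(stack_end, task_ids[1]));
}
```

Calling report_layout() shows the IDs are evenly spaced and that only a small gap separates the end of one stack from the next lower task ID, so a deep enough overflow has very little room before it lands on a neighbor.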
When surpriseNAN recursively writes past the end of its own stack, it corrupts the area where the floating-point data are saved. Some of the values written work out to be valid floating-point numbers; others are not numbers (NaN). This is near the area where the task name is stored. With a sufficient overflow, the task name is corrupted as well as the FP data.
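To see why stray stack contents show up as a mix of ordinary numbers and NaNs, here is a minimal C sketch that reinterprets raw 64-bit words as IEEE-754 doubles. The helper names (bits_to_double, bits_are_nan) are hypothetical, and the bit patterns in the usage note are illustrative only; the actual bytes that landed in the FP save area are not known.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a raw 64-bit word as a double, much as a context restore
 * loads whatever bytes happen to sit in the FP save area. */
double bits_to_double(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);   /* well-defined way to type-pun in C */
    return d;
}

/* An IEEE-754 double is NaN when its 11-bit exponent field is all ones
 * and its 52-bit fraction field is non-zero. */
int bits_are_nan(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7ffu;
    uint64_t fraction = bits & 0xfffffffffffffULL;
    return exponent == 0x7ffu && fraction != 0;
}
```

For example, 0x4010000000000000 reads back as 4.0 and 0x4020000000000000 as 8.0, while an all-ones word such as 0xffffffffffffffff is a NaN. Most random words decode to some finite number, which is why corrupted FP data can look plausible at first glance.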
When the task overflows its stack far enough, it will eventually overwrite other data. In this case the other data is the register set of the task immediately adjacent (at lower addresses) in RAM.
Something to notice in the corrupted FP registers: the FSR is set to a stack-fill value (so the FSR screams "eeeeeeee!"). This is one more telltale that the context for this task has been corrupted.
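The stack-fill value is also what makes the HIGH and MARGIN columns of checkStack possible. The following C sketch simulates the idea on an ordinary byte array; it is not the kernel's implementation, and the names (stack_fill, stack_margin, looks_like_stack_fill) are invented for this example. It assumes a stack that grows toward lower addresses, with index 0 standing in for the stack end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xeeu
#define STACK_FILL_WORD 0xeeeeeeeeu

/* Pre-fill a stack region with the fill pattern before the task runs. */
void stack_fill(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL_BYTE, size);
}

/* Margin = count of untouched fill bytes at the low-address end.
 * stack[0] is the stack end; the stack grows down toward it. */
size_t stack_margin(const uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    size_t margin = 0;
    while (margin < size && stack[margin] == STACK_FILL_BYTE)
        margin++;
    return margin;
}

/* High-water mark = bytes that have been written at some point. */
size_t stack_high_water(const uint8_t *stack, size_t size)
{
    return size - stack_margin(stack, size);
}

/* A saved register holding the fill word is suspicious: a stack-fill
 * value has no business appearing in a register context. */
int looks_like_stack_fill(uint32_t saved_reg)
{
    return saved_reg == STACK_FILL_WORD;
}
```

A margin of 0, as in the checkStack output for all three surpriseNAN tasks, means the very last byte of the stack was written, so the overflow may have gone further. The scan has a blind spot: real data that happens to equal the fill byte is counted as untouched, so the reported margin is an upper bound on the space that was really left.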