Quit Bugging Me: Cell Death.
You're working from home. You trigger a remote transfer - a single update file is sent via a standard transfer protocol. Partway through the transfer, the cell phone goes down - actually, all the phones connected to that cell concentrator go down. This would be a bad thing to happen. It actually did happen in a laboratory environment.
In the lab, efforts to repeat the problem fail - until that exact same file is transferred. Transferring other files works without a problem. The thought becomes: "Perhaps there are characters embedded in the data stream that affect the software somehow, like XON/XOFF. Perhaps the software is somehow processing the data as it comes through?"
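A quick way to test the XON/XOFF theory is to scan the suspect file for those control bytes. This is only a minimal sketch: the function name is made up for illustration, and it assumes the standard ASCII values (XON is DC1, 0x11; XOFF is DC3, 0x13). Nothing in the story says this exact check was run.

```python
# Standard ASCII software flow-control characters.
XON = 0x11   # DC1: resume transmission
XOFF = 0x13  # DC3: pause transmission


def find_flow_control_bytes(data: bytes):
    """Return (offset, name) for every XON or XOFF byte found in data.

    Hypothetical helper for illustration; not part of the original tooling.
    """
    names = {XON: "XON", XOFF: "XOFF"}
    return [(i, names[b]) for i, b in enumerate(data) if b in names]


if __name__ == "__main__":
    sample = b"ab\x11cd\x13"
    for offset, name in find_flow_control_bytes(sample):
        print(f"{name} at offset {offset}")
```

If a binary file turns up plenty of these bytes and still transfers cleanly elsewhere, in-band flow control becomes a less likely culprit.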
The contents of the file are now suspect, as well as the OS - perhaps it is reacting to the contents of the file. Somehow, transferring -this file- causes resets. You can make the file a bit shorter or longer, or change its name, and it still resets the system. You start digging through the file itself.
In this case, in the file, you notice a particularly long field of repeating characters. It doesn't mean anything until some time later. One of the technicians relates a story - in his old office they used to have a nice Frame Relay connection to the outside world. It was mostly reliable. But... transferring certain files always led to a need to reboot the modem. You could transfer those -same- files if they were compressed. It was later found that the uncompressed image contained a large bitmap, and the bitmap - if opened in an editor - contained a huge field of "BBBBBB", capital B's. The modem was designed in such a way that if there were problems, you could send it a special key to make it shut down. That special key? A long string of "B"s....
Is it possible that something in the data stream itself is triggering the problem?
After experimenting, you find that there is indeed a long pattern in this file, and sending this pattern through the file transfer protocol causes the new Cell Base to crash. Now you have a mystery - why does this exact pattern always cause the Cell Base to crash? The software is not yet cleared of suspicion, but now that you know the problem has something to do with the use of this one high-speed serial channel, you start debugging.
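Hunting for a long field of repeating characters doesn't have to be done by eye. Here is a minimal sketch that finds the longest run of a single repeated byte. It assumes the pattern is one byte value repeated, which the story doesn't actually confirm, and the function name is hypothetical.

```python
def longest_run(data: bytes):
    """Return (offset, length, byte_value) of the longest run of one repeated byte.

    Returns (0, 0, None) for empty input. On ties, the earliest run wins.
    Hypothetical helper for illustration only.
    """
    best = (0, 0, None)
    start = 0
    for i in range(1, len(data) + 1):
        # A run ends at the end of the data or when the byte value changes.
        if i == len(data) or data[i] != data[start]:
            if i - start > best[1]:
                best = (start, i - start, data[start])
            start = i
    return best


if __name__ == "__main__":
    offset, length, value = longest_run(b"aBBBBc")
    print(offset, length, chr(value))
```

Once the run is located, those bytes can be cut into a small file of their own and sent through the same transfer path to see whether the pattern alone reproduces the crash.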
Eventually we start adding signal traces from the hardware. It's found that while transferring the magic pattern, the PowerFail line on the Power Supply triggers, and the system resets due to the power alarm. The only problem with this automatic system is... the power bus was just fine.
This raises the question: how can transferring a specific pattern of bytes cause an incorrect power fail "reset" signal? These functions aren't even related in any way!
Digging into the hardware design now, you find that the high-speed serial I/O is laid out alongside some of the power sense circuitry. It was just a fluke of the design. Transferring enough data - with the right electrical patterns - across the high-speed serial channel actually caused a power-supply sense circuit to trigger anomalously, resetting the entire system. There's some level of electromagnetic bleed-over, just enough that the right pattern, repeating long enough, induces the sensing circuit to believe it needs to reset the whole board - which it does.
The fix at this point becomes almost trivial. The logic analyzer shows an extremely short transient captured by the sensing circuit. This circuit is built with programmable logic, so the fix is to program the logic to screen out short transients and look for longer "real" signals. In this way, the hardware doesn't need to be laid out again and re-manufactured, an expensive option that would also delay the project.
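The idea behind screening out short transients can be modeled in a few lines. This is a software sketch only: the real fix lived in programmable logic, and the function name, the sampled 0/1 input, and the `min_width` threshold are all assumptions made for illustration. The output goes high only after the input has stayed high for `min_width` consecutive samples.

```python
def filter_glitches(samples, min_width):
    """Pass a high level through only after min_width consecutive high samples.

    samples: iterable of 0/1 readings of the (hypothetical) power-fail line.
    min_width: number of consecutive high samples required (an assumed value).
    """
    out = []
    high_count = 0
    for s in samples:
        # Count consecutive high samples; any low sample resets the count.
        high_count = high_count + 1 if s else 0
        out.append(1 if high_count >= min_width else 0)
    return out


if __name__ == "__main__":
    # A one-sample glitch followed by a sustained, "real" assertion.
    line = [0, 1, 0, 0, 1, 1, 1, 1, 0]
    print(filter_glitches(line, min_width=3))
```

The trade-off is a small delay: a genuine power-fail signal is recognized `min_width - 1` samples later than it would be without the filter, so the threshold has to stay short enough to leave time for an orderly shutdown.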
[Thank you to A.G. for the story!]


As an Engineering Specialist, it is Mike Deliman's responsibility to enable customers to achieve success in their endeavors, assist sales groups in evangelizing Wind River's technologies, and bring feedback of customer needs and experiences back into Marketing and Engineering. Mike has over 15 years of experience with VxWorks. 


