Jack Ganssle
Digi-Key
A watchdog timer (WDT) is a bit of hardware that monitors the execution of code and resets the processor if the software crashes. For many years there has been a raging debate in the embedded world about the importance of WDTs. More than a few engineers feel they are unnecessary; a better solution, they claim, is to write firmware that does not crash. That is a noble sentiment, for perfection is a lofty and admirable goal.
However, few products ever reach that level of quality. As the software swells in size, even an uncompromising focus on quality will hardly yield perfection. A million-line program, if the code is 99.99 percent correct (a number far higher than achieved by the vast majority of organizations), has 100 lurking bugs. Any of those may crash the system or, worse, put it into a dangerous operating mode. (Alas, the average embedded system ships with only 95 percent of all of the bugs removed.1)
Bugs are not the only problem. Perfectly designed and built hardware on which perfect code executes can still fail.
Increasingly, cosmic rays are causing problems in digital systems. Consisting mostly of high-energy protons from space, they can interact with transistors on ICs and flip bits. In the early days of microprocessors, these were much less of a threat than today because the process geometry was large – lots of energy was needed to effect a bit-flip. Today the problem has grown significantly worse since a 45 nm geometry is routine and 28 nm not uncommon, with smaller nodes appearing yearly.
In the 1990s, IBM found that a typical computer experiences about one error due to cosmic rays every month per 256 MB of RAM.2 Geometries have shrunk a lot since then so presumably the problem has gotten worse.
Intel believes that cosmic rays are likely to be an increasing source of computer errors in the future. Their patent 7,309,866 uses a MEMS sensor to detect incoming cosmic rays and then signals a circuit to take corrective action.3
H. Kobayashi, et al. found that errors from cosmic rays and other particles more than doubled in devices built from 180 nm geometry compared to those of 250 nm.4
A 2004 paper by Tezzaron Semiconductor shows that SRAMs and logic are the primary sufferers of cosmic ray upsets.5 The authors claim a system with one GB of SRAM can expect a soft error every two weeks, and the problems are ten times worse in Denver than at sea level.
Amazingly, a particle with as little as a 10 femtocoulomb charge has enough energy to flip an SRAM bit.6 A decade ago, the larger cells needed five times more energy.
The bottom line: even perfectly-written code can crash. Only a watchdog timer can help a crashed system recover.
A great WDT
Since the WDT is the very last line of defense, its design must anticipate any failure mode. One may ask, “What are the characteristics of a great watchdog?”
First, the WDT must be independent of the CPU. No matter what odd mode the processor finds itself in, the watchdog timer has to be functional. Further, once set up at initialization, nothing the processor does should be able to disable or reprogram the watchdog. Otherwise a rogue program could accidentally disable this protective mechanism, rendering it useless.
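As a minimal sketch of the "configure once, then lock" idea, the following C code models a watchdog in software. It is an illustration only: the type wdt_model_t and the functions wdt_init, wdt_disable, wdt_kick and wdt_tick are hypothetical names invented here, and a real watchdog enforces the lock in hardware with a device-specific register interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a lockable watchdog.
 * It does not describe any real device's registers. */
typedef struct {
    uint32_t timeout_ticks; /* reload value, fixed at initialization */
    uint32_t counter;       /* ticks left before the reset fires */
    bool enabled;
    bool locked;            /* set once; blocks any later reconfiguration */
    bool reset_fired;       /* stands in for the hard reset line */
} wdt_model_t;

/* Configure, enable and lock the watchdog. Fails if already locked. */
bool wdt_init(wdt_model_t *w, uint32_t timeout_ticks)
{
    assert(w != NULL && timeout_ticks > 0);
    if (w->locked) {
        return false;
    }
    w->timeout_ticks = timeout_ticks;
    w->counter = timeout_ticks;
    w->enabled = true;
    w->reset_fired = false;
    w->locked = true;
    return true;
}

/* A rogue (or well-meaning) attempt to turn the watchdog off.
 * Refused once the configuration is locked. */
bool wdt_disable(wdt_model_t *w)
{
    assert(w != NULL);
    if (w->locked) {
        return false;
    }
    w->enabled = false;
    return true;
}

/* Healthy software calls this periodically to restart the countdown. */
void wdt_kick(wdt_model_t *w)
{
    assert(w != NULL);
    w->counter = w->timeout_ticks;
}

/* One tick of the watchdog's own clock, independent of the CPU. */
void wdt_tick(wdt_model_t *w)
{
    assert(w != NULL);
    if (!w->enabled || w->reset_fired) {
        return;
    }
    if (w->counter > 0) {
        w->counter--;
    }
    if (w->counter == 0) {
        w->reset_fired = true;
    }
}
```

In this model, after wdt_init succeeds, both wdt_disable and a second wdt_init return false, so the only way for software to hold off the reset is to keep calling wdt_kick.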
The WDT must always, under any condition barring perhaps a hardware failure, bring the system back to life. This means issuing a hard reset to the CPU. No other option is guaranteed to bring a crashed processor back to life.
Some WDTs issue a non-maskable interrupt (NMI) instead of a reset. The idea is that the NMI’s service routine can snapshot the stack and log debugging information. Alas, there is no reason to believe that a CPU in an arbitrary dysfunctional mode will respond to any interrupt; quite a lot of processing is required before the service routine is invoked. On many processors an interrupt service routine will not start if, for example, the stack pointer holds an odd or otherwise misaligned address; indeed, the CPU may enter a double bus fault, in which it shuts down and only a hard reset will restore operation.
The NMI approach is interesting, however. One alternative sometimes used is to issue an NMI and start a timer. After a few milliseconds the timer resets the CPU. If the NMI service routine works, it logs debugging information; either way, the inevitable hard reset ensures the device comes back to life.
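The NMI-then-reset sequence can be sketched as a small state machine. The C model below is a simulation under stated assumptions, not a description of any particular watchdog chip: the names two_stage_wdt_t, wdt2_init, wdt2_kick and wdt2_advance_1ms are hypothetical, time advances in 1 ms steps, and the flags nmi_asserted and reset_asserted stand in for the physical NMI and reset lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    WDT_RUNNING,     /* waiting for kicks */
    WDT_NMI_PENDING, /* NMI issued, grace timer counting down to reset */
    WDT_RESET        /* hard reset asserted */
} wdt_stage_t;

/* Hypothetical model of a two-stage (NMI, then deferred reset) watchdog. */
typedef struct {
    uint32_t timeout_ms; /* maximum time allowed between kicks */
    uint32_t grace_ms;   /* time the NMI handler gets before the reset */
    uint32_t elapsed_ms; /* time since the last kick, or since the NMI */
    wdt_stage_t stage;
    bool nmi_asserted;
    bool reset_asserted;
} two_stage_wdt_t;

void wdt2_init(two_stage_wdt_t *w, uint32_t timeout_ms, uint32_t grace_ms)
{
    assert(w != NULL && timeout_ms > 0 && grace_ms > 0);
    w->timeout_ms = timeout_ms;
    w->grace_ms = grace_ms;
    w->elapsed_ms = 0;
    w->stage = WDT_RUNNING;
    w->nmi_asserted = false;
    w->reset_asserted = false;
}

/* Kicks only count while running. Once the NMI has fired,
 * the reset is inevitable no matter what the software does. */
void wdt2_kick(two_stage_wdt_t *w)
{
    assert(w != NULL);
    if (w->stage == WDT_RUNNING) {
        w->elapsed_ms = 0;
    }
}

/* Advance the watchdog's independent clock by one millisecond. */
void wdt2_advance_1ms(two_stage_wdt_t *w)
{
    assert(w != NULL);
    switch (w->stage) {
    case WDT_RUNNING:
        if (++w->elapsed_ms >= w->timeout_ms) {
            w->stage = WDT_NMI_PENDING;
            w->nmi_asserted = true; /* handler may log debug data */
            w->elapsed_ms = 0;      /* start the grace timer */
        }
        break;
    case WDT_NMI_PENDING:
        if (++w->elapsed_ms >= w->grace_ms) {
            w->stage = WDT_RESET;
            w->reset_asserted = true;
        }
        break;
    case WDT_RESET:
        break; /* held in reset */
    }
}
```

The key design choice in this sketch is that wdt2_kick is ignored after the NMI fires, so a crashed CPU that never services the interrupt, or a handler that hangs, still ends in a hard reset once the grace period expires.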
It is critical that the watchdog, independently of a perhaps crippled CPU, puts the system into a safe state when the system controls dangerous hardware. Moving machinery, hazardous radiation, etc. must be disabled, parked or otherwise disengaged, since the reset may not work if a hardware failure has crashed the processor.
Today’s embedded systems often have very sophisticated peripherals; in some cases the I/O may be much more complex than the microprocessor. The WDT reset sequence must ensure that these devices are brought back to a known state. When code crashes, it may issue bizarre streams of data to the peripherals. If the design of the peripherals is such that the CPU is not always able to put the devices into known correct states, those devices need a hard reset from the WDT.
Finally, it is wise to leave debugging breadcrumbs behind, if possible. The previously-mentioned NMI/deferred-reset is an example. Save the stack and other critical parameters into a region of non-volatile memory the developers can access. Unfortunately, a reset destroys all processor state information, but there is often application-specific data that can help diagnose problems, like pointers into state machine tables. Before initializing those after a reset, save them. If there is a real-time clock, also save the time of the reset.
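One way to picture the "save them before initializing" step is sketched below in C. It assumes, purely for illustration, that a few application variables live in a RAM region that survives a reset, and that a log record can be written somewhere the developers can read later. The names crumbs_t, reset_log_t, crumbs_update and crumbs_save_on_boot, the chosen fields, and the magic value are all hypothetical. The magic number and checksum are an added safeguard so that random power-up contents are not mistaken for real data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRUMB_MAGIC 0xC0FFEE01u /* arbitrary marker chosen for this sketch */

/* Application state assumed to survive a reset (hypothetical layout). */
typedef struct {
    uint32_t magic;
    uint32_t state_index;  /* e.g. position in a state machine table */
    uint32_t last_task_id; /* task that was running most recently */
    uint32_t checksum;
} crumbs_t;

/* Record written to developer-accessible non-volatile storage. */
typedef struct {
    uint32_t state_index;
    uint32_t last_task_id;
    uint32_t reset_rtc_seconds; /* time of the reset, if an RTC exists */
} reset_log_t;

static uint32_t crumbs_checksum(const crumbs_t *c)
{
    return (uint32_t)~(c->magic ^ c->state_index ^ c->last_task_id);
}

/* Called by the running application whenever the tracked state changes. */
void crumbs_update(crumbs_t *c, uint32_t state_index, uint32_t task_id)
{
    assert(c != NULL);
    c->magic = CRUMB_MAGIC;
    c->state_index = state_index;
    c->last_task_id = task_id;
    c->checksum = crumbs_checksum(c);
}

/* Called early after reset, before normal initialization overwrites the
 * surviving state. Copies valid breadcrumbs into the log along with the
 * reset time, then invalidates the region so stale data is not logged
 * again. Returns true if a log record was written. */
bool crumbs_save_on_boot(crumbs_t *live, reset_log_t *log, uint32_t rtc_now)
{
    assert(live != NULL && log != NULL);
    bool valid = live->magic == CRUMB_MAGIC &&
                 live->checksum == crumbs_checksum(live);
    if (valid) {
        log->state_index = live->state_index;
        log->last_task_id = live->last_task_id;
        log->reset_rtc_seconds = rtc_now;
    }
    memset(live, 0, sizeof *live);
    return valid;
}
```

In use, the startup code would call crumbs_save_on_boot once, ahead of any routine that initializes the tracked variables, and the application would call crumbs_update as it runs. How the surviving region and the log are actually placed in memory depends on the toolchain and hardware and is outside this sketch.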
Part 2 - Internal and External WDTs, Software considerations