Software Tech News 5:3 - A Stitch in Time

Volume 5, Number 3 - Experience Based Management

A Stitch in Time

By Larry Bernstein, Stevens Institute of Technology

1.0 Introduction

The 1990s were to be the decade of fault-tolerant computing. Fault-tolerant hardware was hot and software fault tolerance was about to happen. But it didn't. Instead, the Web Wave took off. We lost interest in fault tolerance until a rash of server failures, denial-of-service attacks, web service failures and the September 11 tragedy brought it back. Even though software fault tolerance was not hot, there was progress. One breakthrough is the use of software rejuvenation. Here are three ways it has been used to greatly improve software reliability without the painstaking work of digging out every defect, including defects that would never cause a system failure.

Software faults are most often caused by design faults. Design faults occur when a software engineer either misunderstands a specification or simply makes a mistake. Software faults are common for the simple reason that the complexity in modern systems is often pushed into the software part of the system. It is estimated that 60-90% of current computer errors are from software faults. Running software is a predictable state machine. Software manipulates variables that have states. Unfortunately, flaws in the software that permit the variables to take on values outside their intended operating limits often cause software failures.

Software rejuvenation is special software that gracefully terminates an application, and immediately restarts it at a known, clean, internal state. Instead of running for a year, with all the mysteries that untried time expanses can harbor, a system is run for one day, 364 times. It is re-initialized each day, process by process, while the system continues to operate. Rejuvenation precedes failure, anticipates it, and avoids it. It transforms non-stationary, random processes into stationary ones.

2.0 Billing Data Collection

One system that collects billing data from telephone company switches has run for two years with no reported outages. Its rejuvenation interval is set at one week. In another billing data subsystem, a 16,000-line C program with notoriously leaky memory failed after 52 iterations. After seven lines of rejuvenation code were added, with the period set at 15 iterations, the program ran flawlessly.
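The second example can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the original C code: the `LeakyWorker` class, its `step` method, and the `run` helper are hypothetical names, and the leak is modeled as a list that grows on every iteration. Only the numbers 52 and 15 come from the text.

```python
class LeakyWorker:
    """Hypothetical stand-in for a program that leaks on every iteration."""

    SAFE_ITERATIONS = 52  # iterations survived before failure (from the text)

    def __init__(self):
        self.leaked = []  # known, clean initial state

    def step(self):
        self.leaked.append(object())  # simulated memory leak
        if len(self.leaked) > self.SAFE_ITERATIONS:
            raise MemoryError("worker exhausted its memory")


def run(iterations, rejuvenation_period=None):
    """Return how many iterations completed before a failure, if any."""
    worker = LeakyWorker()
    for i in range(1, iterations + 1):
        try:
            worker.step()
        except MemoryError:
            return i - 1
        if rejuvenation_period and i % rejuvenation_period == 0:
            # Rejuvenate: drop the aged state and restart from a clean one.
            worker = LeakyWorker()
    return iterations
```

Without rejuvenation, `run(1000)` stops after 52 iterations. With `rejuvenation_period=15`, the leak never accumulates past 15 entries, so all 1000 iterations complete even though the defect itself is still present.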

3.0 Store and Forward Message Switcher

While software cannot be designed without bugs, it does not have to be as buggy as it is. For example, as early as 1977, a software-based store and forward message switch was in its fourth year of operation, and it handled all administrative messages for Indiana Bell without a single failure. This record was achieved after a very buggy start followed by a substantial investment in failure prevention and bug fixes. One particularly error-prone part of the software was the set of pointers used to resolve clashes in the hash function that indexed a message data file. The messages could remain in the file system for up to thirty days. There were many hash clashes due to the volume of messages and the similarity of their names.

Once the obvious bugs were fixed, the residual ones were hard to find. They led to unexpected behavior and system crashes, and a firm requirement was not to lose any messages. Failures caused by latent faults can appear to be random and transient, but they are predictable if only we can reproduce the initial conditions and transaction load that trigger them. Such faults are sometimes called Heisenbugs. It was just too costly to find and fix all the Heisenbugs in the file index code, so rejuvenation was used to rebuild the hash tables daily in the early hours of the morning, when there was no message traffic. With fresh hash tables, the chance of triggering a fault was small, especially after the bugs that were sensitive to the traffic mix were found and fixed. This experience shows that it is not necessary for software to be inherently buggy.
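The daily rebuild can be sketched as follows. This is an illustrative reconstruction, not the Indiana Bell code: the `MessageIndex` class, the byte-sum hash, and the `rebuild_index` function are all assumptions made for the example. The point it shows is that the index is derived data, so it can be thrown away and regenerated from the message file, which remains the authoritative copy.

```python
class MessageIndex:
    """Hypothetical chained hash index mapping message names to file positions."""

    def __init__(self, num_buckets=8):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, name):
        # Deliberately weak hash so that similar names clash, as in the text.
        return self.buckets[sum(name.encode()) % self.num_buckets]

    def add(self, name, position):
        self._bucket(name).append((name, position))

    def lookup(self, name):
        for stored_name, position in self._bucket(name):
            if stored_name == name:
                return position
        return None


def rebuild_index(message_file, num_buckets=8):
    """Rejuvenate: build a fresh index from the message file itself."""
    fresh = MessageIndex(num_buckets)
    for position, (name, _body) in enumerate(message_file):
        fresh.add(name, position)
    return fresh
```

Any damage that accumulated in the old collision chains disappears with the old index object. No message is lost, because the rebuild reads only the message file and never the damaged chains.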

4.0 NASA Uses It Too

The NASA mission to explore Pluto has a very long mission life of 12 years. A fault-tolerant environment incorporating on-board preventive maintenance is critical to maximize the reliability of a spacecraft in a deep-space mission. The approach relies on the inherent system redundancy (the dual processor strings that perform spacecraft and scientific functions during encounter time). The two processor strings are scheduled to go on and off duty periodically, in order to reduce the likelihood of system failure due to radiation damage and other reversible aging processes.

Since the software is reinitialized when a string is powered on, switching between strings results in software rejuvenation. This avoids failures caused by potential error conditions accrued in the system environment such as memory leakage, unreleased file locks and data corruption. The implementation of this idea involves deliberately stopping the running program and cleaning its internal state through flushing buffers, garbage collection, reinitializing the internal kernel tables or, more thoroughly, rebooting the computer.

Such preventive maintenance procedures may result in appreciable system downtime. However, by exploiting the inherent hardware redundancy in this Pluto mission example, the performance cost is minimal. One of the strings is always performing, and starting the standby string before the active string is turned off masks the overhead of the standby string's initialization. An essential issue in preventive maintenance is to determine the optimal interval between successive maintenance activities, balancing the risk of system failure due to component fatigue or aging against the risk due to unsuccessful maintenance itself.¹
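The ordering of the handover is what hides the initialization cost. The sketch below is a toy model, not flight software: `ProcessorString`, `accrued_state`, and `switch_over` are hypothetical names, and a simple counter stands in for leaked memory, unreleased locks and similar accumulated error conditions.

```python
class ProcessorString:
    """Hypothetical model of one of the two redundant processor strings."""

    def __init__(self, name):
        self.name = name
        self.powered = False
        self.accrued_state = 0  # stands in for leaks, stale locks, etc.

    def power_on(self):
        # Software is reinitialized at power-on, so accrued state is cleared.
        self.accrued_state = 0
        self.powered = True

    def power_off(self):
        self.powered = False

    def work(self):
        if not self.powered:
            raise RuntimeError(self.name + " is not powered")
        self.accrued_state += 1  # aging accumulates while on duty


def switch_over(active, standby):
    """Start the standby string first, then retire the active one."""
    standby.power_on()   # initialization overlaps with the active string's duty
    active.power_off()   # only now is the aged string taken off duty
    return standby, active  # roles are swapped
```

Because `power_on` runs before `power_off`, there is no moment at which both strings are off, and the newly active string starts from a clean state.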

5.0 Where To Get Rejuvenation

Watchd and Libft are software fault tolerance components. They may be used with any UNIX or NT application to let the application withstand faults.

Watchd is a watchdog daemon process for detecting UNIX process failures (crashes and hangs) and restarting those processes. The fault tolerance mechanism is based on a cyclic protocol. Both Watchd and Libft may be obtained at the Lucent web site, www.lucent.com.
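The general idea of a watchdog can be sketched in a few lines. This is not Watchd's implementation or its cyclic protocol, which the article does not describe in detail. It is a minimal illustration, assuming a hypothetical `supervise` function that treats a nonzero exit code as a crash and restarts the command. Detecting hangs would need a heartbeat mechanism, which is omitted here.

```python
import subprocess
import time


def supervise(command, max_restarts=3, poll_interval=0.05):
    """Run `command`, restarting it after each crash (nonzero exit code).

    Returns the number of restarts needed before a clean exit. Raises
    RuntimeError if the process keeps crashing past `max_restarts`.
    """
    restarts = 0
    while True:
        process = subprocess.Popen(command)
        # Poll until the supervised process terminates.
        while process.poll() is None:
            time.sleep(poll_interval)
        if process.returncode == 0:
            return restarts  # clean exit, nothing more to do
        if restarts >= max_restarts:
            raise RuntimeError("process keeps failing; giving up")
        restarts += 1  # crash detected: loop around and start a fresh process
```

A call such as `supervise(["my_server"])`, where `my_server` is a placeholder for the real program, keeps the application available across crashes without changing the application itself.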

Windows 95 has a special library, WinFT, that provides automatic detection and restarting of failed processes; diagnosing and rebooting of a malfunctioning or strangled OS; checkpointing and recovery of critical volatile data; and preventive actions, such as software rejuvenation. See Joao Carreira et al., "Fault Tolerance for Windows Applications," Byte Magazine, February 1997, pp. 51-52.

6.0 Wrap Up

Most software runs non-periodically, which allows internal states to develop chaotically without bound. Software rejuvenation seeks to contain the execution domain by making it periodic. An application is gracefully terminated and immediately restarted at a known, clean, internal state. Failure is anticipated and avoided. Non-stationary, random processes are transformed into stationary ones. The software states are re-initialized each day, process by process, while the system continues to operate. Rejuvenating more frequently reduces the cost of downtime but increases overhead. Rejuvenation does not remove bugs; it merely avoids them with incredibly good effect.

Rejuvenation does not remove bugs that exist beyond its carefully circumscribed limits. Instead, it avoids the vast unknown territory that conceals them.

1. Y. Huang and C. M. R. Kintala, "Software Implemented Fault Tolerance: Technologies and Experience," Proceedings of 23rd Intl. Symposium on Fault-Tolerant Computing, Toulouse, France, pp. 2-9, June 1993. Also appeared as a chapter in the book Software Fault Tolerance, M. Lyu (Ed.), John Wiley & Sons, March 1995.