Volume 6, Number 2 - Software Quality Assurance


Software Rejuvenation: Avoiding Failures Even When There Are Faults

"Software Engineering Design Which Enables Faulty Software to Run Without Failure - a Study of ARQ Wireless Protocol Implementation in C++"

by Lawrence Bernstein, Professor, Computer Science; Yu-Dong Yao, Professor, Electrical Engineering; Kevin Yao, Master’s Candidate, Computer Science

Overview

Skeptics doubt that software rejuvenation technology can avoid failures when faults remain in the system, as Larry Bernstein argued on the front page of the DoD Software Tech News (Volume 3, Number 4) in the article "A Software Engineering Course for Trustworthy Software." Here is a case study showing that his conclusions are valid. You may wish to repeat this work in your organization to convince your process-driven engineers and your hackers that there is a right and proper role for software design technology in realizing trustworthy software.

In this case study, communications protocol software was built by a skilled programmer. It contained a memory leak that would crash the system. The software failed under certain stressful tests. Once software rejuvenation library software was bound into the communications software, the protocol no longer failed, even though the bug was still there. By using known software design technology, bugs may become benign.

Software execution is very sensitive to its initial conditions and the external data it receives. What appear to be random failures are often repeatable. The difficulty in finding and fixing them lies in the detective work needed to discover the particular initial conditions and data sequences that trigger the fault so that it becomes a failure.

Prof. Lui Sha’s model of reliability is based on these postulates [2]:

  1. Complexity begets faults. For a given execution time, software reliability decreases as complexity increases.
  2. Faults are not equal. Some are easy to find and fix and others are Heisenbugs. Faults are not random.
  3. All budgets have limits. There is not unlimited time or money to pay for exhaustive testing.

Prof. Bernstein’s effectiveness extension of the reliability model adds an effectiveness factor. In ‘7x24’ systems, the longer the software system runs, the lower its reliability and the more likely a fault will be executed and become a failure. Reliability can be improved by investing in tools, simplifying the design, or increasing the development effort beyond that projected using Software Cost Estimation models such as COCOMO II. For example, one may inspect code twice or include a diabolic testing group in addition to the normal test groups.
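The text does not spell out the formulas behind these two models. As a hedged sketch, the form usually attributed to them is an exponential reliability function. The symbols below are assumptions for illustration and are not taken from this article: k is a scaling constant, C the complexity, t the continuous execution time, E the development effort, and ε the effectiveness factor.

```latex
\begin{align}
R(t) &= e^{-kCt/E} && \text{(basic model)} \\
R(t) &= e^{-kCt/(E\varepsilon)} && \text{(with effectiveness factor)}
\end{align}
```

Under this reading, reliability rises when complexity C falls, when effort E or effectiveness ε grows, or when the continuous execution time t is kept short.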

Study Perspective

Our goal was to study and quantify the parameters for the software reliability model by gathering real data from a controlled implementation of a wireless communications protocol. The study included these steps:

  1. Review the requirements.
  2. Create a Software Architecture.
  3. Design the software.
  4. Review the Software Architecture and design.
  5. Implement (develop) Automatic Repeat reQuest (ARQ) communication software protocols in C++ [3].
  6. Inspect the code.
  7. Test the software.
  8. Measure reliability data from running system tests.
  9. Once the developer is satisfied that the code works reliably, stress test the software with diabolic tests. The test chosen was one where all frames fail in a window of 1000 frames, where the software should have continued to try to send the frames. In the case study, the software crashed.
  10. Do not fix any bugs detected during the diabolic stress test. Instead, add a fault tolerance library to see if the bug can be avoided. If the bug is not executed, the software is shown to run reliably even though it is still defective.

Software Requirements

The code will simulate the standard Selective Repeat ARQ protocol [3]. In this protocol, the message bytes are grouped into a set of frames, each frame is delimited by special header and trailer bytes, and a number of frames are sent in a burst. The receiver software either acknowledges successful receipt of the frames or sends special control frames to the sender software signaling that the frames that had errors must be resent. The software stops after correctly sending the assigned number of frames.
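The sender-side bookkeeping for selective repeat can be sketched as follows. This is a minimal illustration, not the case-study code: the class name `SelectiveRepeatSender`, its methods, and the fixed window size are all hypothetical, and timers and control frames are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sender-side bookkeeping for Selective Repeat ARQ.
// Only frames that are still unacknowledged are (re)sent.
class SelectiveRepeatSender {
public:
    SelectiveRepeatSender(std::size_t total_frames, std::size_t window)
        : acked_(total_frames, false), window_(window) {
        assert(window > 0);
    }

    // Sequence numbers the sender may transmit now: every unacknowledged
    // frame inside the window that starts at the oldest unacknowledged frame.
    std::vector<std::size_t> frames_to_send() const {
        std::vector<std::size_t> out;
        for (std::size_t i = base_; i < acked_.size() && i < base_ + window_; ++i) {
            if (!acked_[i]) out.push_back(i);
        }
        return out;
    }

    // The receiver acknowledged frame `seq`; slide the window past
    // every leading frame that has now been acknowledged.
    void on_ack(std::size_t seq) {
        if (seq >= acked_.size()) return;  // ignore out-of-range ACKs
        acked_[seq] = true;
        while (base_ < acked_.size() && acked_[base_]) ++base_;
    }

    // True once every frame has been acknowledged.
    bool done() const { return base_ == acked_.size(); }

private:
    std::vector<bool> acked_;
    std::size_t window_;
    std::size_t base_ = 0;
};
```

With five frames and a window of three, acknowledging frames 0 and 2 leaves frames 1 and 3 eligible to send: frame 1 is the retransmission, and frame 3 enters because the window has moved forward by one.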

A procedure call sends frames to the network. The network may deliver a frame correctly, corrupt it, or lose it. The protocol can detect many kinds of network corruption and frame loss.
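One way to picture the network's three behaviors is a small emulator driven by loss and corruption probabilities. The sketch below is an assumption for illustration only: the `NetworkEmulator` class, its `transmit` method, and the single-bit corruption model are hypothetical and are not the emulator used in the study.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Frame = std::vector<std::uint8_t>;

// Hypothetical network emulator: each frame is lost, corrupted, or
// delivered intact according to the configured probabilities.
class NetworkEmulator {
public:
    NetworkEmulator(double loss_rate, double corruption_rate, unsigned seed)
        : loss_(loss_rate), corrupt_(corruption_rate), rng_(seed) {
        assert(loss_rate >= 0.0 && loss_rate <= 1.0);
        assert(corruption_rate >= 0.0 && corruption_rate <= 1.0);
    }

    // Returns false when the frame is lost. Otherwise returns true and
    // leaves the delivered bytes in `frame`, possibly with one bit
    // flipped to model corruption.
    bool transmit(Frame& frame) {
        if (loss_(rng_)) return false;
        if (!frame.empty() && corrupt_(rng_)) {
            std::uniform_int_distribution<std::size_t> pick(0, frame.size() - 1);
            frame[pick(rng_)] ^= 0x01;
        }
        return true;
    }

private:
    std::bernoulli_distribution loss_;
    std::bernoulli_distribution corrupt_;
    std::mt19937 rng_;
};
```

Setting the loss rate to 1.0 gives the kind of worst case a diabolic test looks for: `transmit` never delivers a frame, so the sender must keep retrying.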

Based on the requirement, the program will run in a simulated hardware/software environment. It will have three parts: A-side (sender), B-side (receiver) and a network emulator to simulate the network environment. The overall structure of the environment is shown in Figure 1.

Figure 1. Layered Structure and Design Diagram

The program only implements unidirectional transfer of data (from A to B). Of course, the B side will have to send frames to A to acknowledge receipt of data.

Software Development Process History

(as recorded by the developer)

  1. I wrote documentation, tried to finish requirements document and architecture document and even the test plan first before I began coding. I became bored and under time pressure started coding before all documents were ready.
  2. I rushed into coding without complete understanding of the problem in the mistaken belief that coding would lead to insight into the problem.
  3. I built a simple version of the software and was proud to see that it worked. Early success gave me confidence to plunge ahead.
  4. I was stuck after finishing a simple version of the protocol because it would not scale to the Stop-N-Wait version. The early version lacked timers, flow control counters and frame counters.
  5. I made a design change to an event driven architecture.
  6. I ran unit tests and fixed bugs. A bug surfaced when sending 1000 frames with a 50% loss rate. A pointer operation for one boundary condition exceeded the design constraints.
  7. The stress test with a 100% frame loss rate exhausted the system memory, and the software hung. The memory was consumed through the use of fprintf() for debugging and statement recording.
  8. I added the Libft library to the program to provide fault tolerance [1]. A memory conflict happened at the startup of Windows 2000. Windows NT 4.0 ran with the latest Service Pack 6a and Swift. There were many problems when configuring the Windows NT computer. These were resource limitation and configuration conflict problems, typical of so many software developments.
  9. I reran the stress tests and all worked perfectly.

Stress Test Without Rejuvenation

  1. Sending out 1000 frames 1000 times, with 0% loss rate and 0% corruption rate - passed.
  2. Sending out 1000 frames 1000 times, with 10% loss rate and 10% corruption rate - passed.
  3. Sending out 1000 frames 1000 times, with 50% loss rate and 50% corruption rate - passed.
  4. Sending out 1000 frames 1000 times, with 100% loss rate and 100% corruption rate - failed.

There was a memory leak in the code. The leak occurs under all conditions but does not cause the software to fail except under the most stressful conditions. The stress test uncovered the fault. If the system were used in production, an unpredictable amount of memory would have been consumed, and other programs running in the same memory space might have failed at random for lack of memory. The fault thus becomes coupled from one program to another. The stress test was effective in detecting the fault so that a potential system failure could be avoided.

After the fault tolerance library (Libft) was bound to the ARQ product, the stress test was repeated. With a 100% loss rate, the program never failed and, as expected, never stopped. The memory usage did not increase. The failure from the memory leak was avoided.

The results from the stress test are illustrated in Figure 2.

After the fault tolerance library (Libft) was built in to reset the program every time 100 frames were sent, the 1000-frame test cases were repeated. This time there was no crash.
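The effect of a periodic reset on an unfixed leak can be illustrated with a small simulation. This sketch does not use the Libft API, and nothing in it comes from the case-study code: `LeakyProcess`, `run`, the 1 KiB leak per attempt, and the memory budget are all hypothetical stand-ins. The leak is only counted, not actually allocated.

```cpp
#include <cassert>
#include <cstddef>

// Simulated protocol process with a deliberate "leak": every send attempt
// consumes memory that is never released while the process keeps running.
struct LeakyProcess {
    std::size_t leaked_bytes = 0;
    void send_attempt() { leaked_bytes += 1024; }  // fault left unfixed on purpose
};

// Runs `attempts` send attempts against a fixed memory budget.
// If `rejuvenate_every` is nonzero, the process state is discarded and
// rebuilt after that many attempts, standing in for a clean restart.
// Returns true if the run finishes without exhausting the budget.
bool run(std::size_t attempts, std::size_t budget_bytes,
         std::size_t rejuvenate_every) {
    LeakyProcess proc;
    for (std::size_t i = 1; i <= attempts; ++i) {
        proc.send_attempt();
        if (proc.leaked_bytes > budget_bytes) return false;  // simulated crash
        if (rejuvenate_every != 0 && i % rejuvenate_every == 0) {
            proc = LeakyProcess{};  // restart: the leaked memory is reclaimed
        }
    }
    return true;
}
```

With a 1 MiB budget, `run(1000000, 1024 * 1024, 0)` exhausts memory, while `run(1000000, 1024 * 1024, 100)` completes because the leaked total never exceeds 100 KiB between restarts. The fault is still present in both runs; only the second keeps it from becoming a failure.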

Figure 2. Stress Test Results

Software Reliability Analysis

Two defects were found during system test. One defect crashed the software due to a memory leak and was avoided once the software fault tolerance library was added to the ARQ product. The other defect could occasionally cause the wrong frame to be sent but it did not hang or crash the package and the code was fixed.

The rush to coding without careful design led to poor software architecture. Too often developers do not feel they have the time to start again, so they live with an ill-conceived architecture that leads to untrustworthy software execution. The need for a solid architecture and good prototype was illustrated by the events in the case study.

The stress tests were designed by Professor Bernstein and went well beyond the imagination of the developer. The stress test was successful in inducing a latent fault to hang the system. Too often a latent fault in one process consumes resources that cause other fault free processes to fail.

The wisdom of exploiting fault tolerant software technology was demonstrated in the case study. Try it. You will like it, and so will those you live with, who will no longer be startled in the middle of the night as you handle a frantic phone call.

About the Author

Lawrence Bernstein is a recognized expert in software technology, network architecture, network management software, software project management, and technology conversion. He conceived of the notion of software rejuvenation. He is currently teaching graduate courses on Computer Networks and undergraduate Software Engineering at Stevens Institute of Technology in Hoboken, NJ.

During a distinguished 35-year career at Bell Laboratories he was Chief Technical Officer of the Operations Systems Business Unit and an Executive Director managing large software projects. Since retirement he heads his own consulting firm.

Author Contact Information

Larry Bernstein
Stevens Institute of Technology
4 Marion Avenue
Short Hills, NJ 07078
973-258-9213
http://guinness.cs.stevens-tech.edu/~lbernste/