Volume 6, Number 2 - Software Quality Assurance
Skeptics doubt that software rejuvenation technology can avoid failures while faults remain in the system, a claim Larry Bernstein presented on the front page of the DoD Software Tech News (Volume 3, Number 4) in the article "A Software Engineering Course for Trustworthy Software." Here is a case study showing that his conclusions are valid. You may wish to repeat this work in your organization to convince your process-driven engineers and your hackers that there is a right and proper role for software design technology in realizing trustworthy software.
In this case study, communications protocol software was built by a skilled programmer. It contained a memory leak that would crash the system, and the software failed under certain stressful tests. Once the software rejuvenation library was bound into the communications software, the protocol no longer failed, even though the bug was still there. By applying known software design technology, bugs may be rendered benign.
Software execution is very sensitive to its initial conditions and the external data it receives. What appear to be random failures are often repeatable. The difficulty in finding and fixing them lies in the detective work needed to discover the particular initial conditions and data sequences that trigger the fault so that it becomes a failure.
Prof. Lui Sha’s model of reliability is based on these postulates [2]:
Prof. Bernstein’s extension of the reliability model adds an effectiveness factor. In "7x24" systems, the longer the software system runs, the lower its reliability and the more likely it is that a fault will be executed and become a failure. Reliability can be improved by investing in tools, simplifying the design, or increasing the development effort beyond the level projected by software cost estimation models such as COCOMO II. For example, one may inspect code twice or add a diabolic testing group alongside the normal test groups.
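One way to write the extended model (a sketch consistent with the description above; the exponential form and the constant k are illustrative assumptions, not quoted from reference [2]) is

$$R(t) = e^{-k\,\frac{C}{\varepsilon E}\,t}$$

where C is the complexity of the design, E is the development effort, ε is the effectiveness of the tools and methods applied, and t is the execution time since the software last started. Rejuvenation attacks the t term: restarting the software at a fixed interval keeps t bounded, so reliability never decays below the value reached at the end of one rejuvenation period, while simpler designs, better tools, and added effort attack C, ε, and E respectively.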
Our goal was to study and quantify the parameters for the software reliability model by gathering real data from a controlled implementation of a wireless communications protocol. The study included these steps:
The code simulates the standard Selective Repeat ARQ protocol [3]. Message bytes are grouped into a set of frames, each frame is delimited by special header and trailer bytes, and a number of frames are sent in a burst. The receiver software either acknowledges successful receipt of the frames or sends special control frames to the sender signaling that frames received with errors must be resent. The software stops after correctly sending the assigned number of frames.
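As a concrete illustration only (this is not the course code; names such as PAYLOAD_BYTES, BURST_SIZE, and build_burst are hypothetical), a Selective Repeat sender might frame message bytes and assemble a burst like this:

```c
#include <stdint.h>
#include <string.h>

#define HEADER_BYTE   0x7E   /* special byte that marks the start of a frame */
#define TRAILER_BYTE  0x7F   /* special byte that marks the end of a frame   */
#define PAYLOAD_BYTES 64     /* message bytes carried per frame              */
#define BURST_SIZE    8      /* frames transmitted in one burst              */

typedef struct {
    uint8_t  header;                   /* HEADER_BYTE                         */
    uint8_t  seq;                      /* sequence number for selective ACK   */
    uint8_t  payload[PAYLOAD_BYTES];   /* slice of the message                */
    uint16_t checksum;                 /* lets the receiver detect corruption */
    uint8_t  trailer;                  /* TRAILER_BYTE                        */
} frame_t;

/* Simple additive checksum over the payload (illustration only). */
static uint16_t frame_checksum(const uint8_t *data, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint16_t)(sum + data[i]);
    return sum;
}

/* Pack up to BURST_SIZE frames starting at msg[offset]; returns the number
 * of frames built so the caller can hand them to the network emulator.     */
static size_t build_burst(const uint8_t *msg, size_t len, size_t offset,
                          uint8_t first_seq, frame_t burst[BURST_SIZE])
{
    size_t n = 0;
    while (n < BURST_SIZE && offset < len) {
        frame_t *f = &burst[n];
        size_t chunk = (len - offset < PAYLOAD_BYTES) ? len - offset
                                                      : PAYLOAD_BYTES;
        f->header = HEADER_BYTE;
        f->seq    = (uint8_t)(first_seq + n);
        memset(f->payload, 0, PAYLOAD_BYTES);
        memcpy(f->payload, msg + offset, chunk);
        f->checksum = frame_checksum(f->payload, PAYLOAD_BYTES);
        f->trailer  = TRAILER_BYTE;
        offset += chunk;
        n++;
    }
    return n;
}
```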
A procedure call sends frames to the network. The network may deliver a frame correctly, corrupt it, or lose it. The protocol can detect many kinds of network corruption and frame loss.
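Continuing the sketch above (again with hypothetical names; loss_pct and corrupt_pct stand for whatever loss and corruption rates a test run configures), the network behavior can be modeled as a single function that delivers, corrupts, or drops each frame:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hands a frame to the B-side; provided elsewhere in this sketch. */
void deliver_to_receiver(const frame_t *f);

/* Deliver, corrupt, or drop one frame according to the configured rates.
 * Returns false when the frame is lost entirely.                         */
static bool network_send(frame_t *f, int loss_pct, int corrupt_pct)
{
    int roll = rand() % 100;

    if (roll < loss_pct)
        return false;                                  /* frame is lost   */

    if (roll < loss_pct + corrupt_pct)
        f->payload[rand() % PAYLOAD_BYTES] ^= 0xFFu;   /* corrupt payload,
                                                          checksum now fails */

    deliver_to_receiver(f);                            /* frame arrives   */
    return true;
}
```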
Based on the requirements, the program runs in a simulated hardware/software environment with three parts: an A-side (sender), a B-side (receiver), and a network emulator that simulates the network. The overall structure of the environment is shown in Figure 1.
Figure 1. Layered Structure and Design Diagram
The program only implements unidirectional transfer of data (from A to B). Of course, the B side will have to send frames to A to acknowledge receipt of data.
There was a memory leak in the code. The leak is present under all conditions but does not cause the software to fail except under the most stressful ones. The stress test uncovered the fault. Had the system been used in production, a random amount of memory would have been consumed, and other programs running in the same memory space might have failed at random for lack of memory; in this way the fault couples from one program to another. The stress test was effective in detecting the fault so that a potential system failure could be avoided. After the fault tolerance library (Libft) was bound to the ARQ product, the stress test was repeated. With a 100% loss rate, the program never failed and, as expected, never stopped. Memory usage did not increase. The memory leak failure was avoided.
The results from the stress test are illustrated in Figure 2.
After the fault tolerance library (Libft) was built in to reset the program every time 100 frames were sent, the 1000-frame test cases were repeated. This time there was no crash.
Figure 2. Stress Test Results
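Libft's actual interface is not reproduced in this article; as a hedged sketch of the rejuvenation scheme just described (reset after every 100 frames), a sender might checkpoint its minimal protocol state and restart its own process image, so that any memory leaked during the previous cycle is returned to the operating system. The names FRAMES_PER_CYCLE, checkpoint, and rejuvenate are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FRAMES_PER_CYCLE 100   /* rejuvenate after this many frames are sent */

/* Save the minimal protocol state needed to resume where we left off. */
static void checkpoint(size_t next_offset, unsigned next_seq)
{
    FILE *fp = fopen("arq.ckpt", "w");
    if (fp != NULL) {
        fprintf(fp, "%zu %u\n", next_offset, next_seq);
        fclose(fp);
    }
}

/* Periodic rejuvenation: checkpoint, then restart the process image so any
 * heap memory leaked during the previous cycle is reclaimed by the OS.     */
static void rejuvenate(char **argv, size_t next_offset, unsigned next_seq)
{
    checkpoint(next_offset, next_seq);
    execv(argv[0], argv);        /* assumes argv[0] names the executable */
    perror("execv");             /* reached only if the restart fails    */
    exit(EXIT_FAILURE);
}
```

In this sketch the sender's main loop would call rejuvenate() whenever the count of sent frames reaches a multiple of FRAMES_PER_CYCLE, and on startup it would read arq.ckpt, if present, to resume from the saved offset and sequence number.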
Two defects were found during system test. One crashed the software because of the memory leak and was avoided once the software fault tolerance library was added to the ARQ product. The other could occasionally cause the wrong frame to be sent; it did not hang or crash the package, and the code was fixed.
The rush to coding without careful design led to poor software architecture. Too often developers do not feel they have the time to start again and so live with an ill-conceived architecture, leading to untrustworthy software execution. The need for a solid architecture and a good prototype was illustrated by the events in this case study.
The stress tests were designed by Professor Bernstein and went well beyond the imagination of the developer. The stress test succeeded in inducing a latent fault to hang the system. Too often a latent fault in one process consumes resources and causes other, fault-free processes to fail.
The wisdom of exploiting fault-tolerant software technology was demonstrated in this case study. Try it; you will like it, and so will those you live with, who will no longer be startled in the middle of the night as you handle a frantic phone call.
Lawrence Bernstein is a recognized expert in software technology, network architecture, network management software, software project management, and technology conversion. He conceived of the notion of software rejuvenation. He is currently teaching graduate courses on Computer Networks and undergraduate Software Engineering at Stevens Institute of Technology in Hoboken, NJ.
During a distinguished 35-year career at Bell Laboratories he was Chief Technical Officer of the Operations Systems Business Unit and an Executive Director managing large software projects. Since retirement he heads his own consulting firm.