More Reliable, Faster, Cheaper Testing with Software Reliability

John Musa, Software Reliability Engineering and Testing Courses


Introduction

The testing of software systems is subject to strong conflicting forces. A system must function sufficiently reliably for its application, but it must also reach the market no later than its competitors (preferably before) and at a competitive cost. Government systems may be less market-driven, but balancing reliability, time of delivery, and cost is also important for them. One of the most effective ways to do this is to apply software reliability engineering to testing (and development) 1,2.

Software reliability engineering has resulted in a new view of testing, in which:

  1. The most efficient testing involves activities throughout the entire life cycle.

  2. Testers are empowered to take leadership positions in proactively meeting user needs.

  3. Testers collaborate closely with system engineers, system architects, users, managers, and developers.

Software reliability engineering delivers the desired functionality for a product much more efficiently by quantitatively characterizing its expected use. It uses this information to focus resources precisely on the most used and/or most critical functions (by "critical" I mean having great extra value when successful, or great extra impact when failing, with respect to human life, cost, or capability) and to make test realistically represent field conditions. Thus software reliability engineering tends to increase reliability while decreasing development time and cost. Software reliability engineering then balances customer needs for the major quality characteristics of reliability, availability, delivery time, and life-cycle cost more effectively by:

  1. Setting quantitative reliability as well as schedule and cost objectives

  2. Engineering strategies to meet the objectives

  3. Tracking reliability in test as a release criterion

SRE is a proven, standard, widespread best practice that is built on a sound theoretical foundation 3 and is widely applicable. As an example, Tierney 4 reported the results of a survey taken in late 1997 that showed that Microsoft has applied software reliability engineering in 50 percent of its software development groups, including projects such as Windows NT and Word. It has been an AT&T best current practice since May 1991. Qualification as an AT&T best current practice requires widespread use, a documented large benefit/cost ratio, and a probing review by two boards of high-level managers. Some 70 project managers also reviewed the practice of software reliability engineering. Standards for approval as an AT&T best current practice are high; only five of 30 proposed best current practices were approved in 1991. An AIAA standard for software reliability engineering was approved in 1993, and IEEE standards are under development. McGraw-Hill and the IEEE Computer Society Press recently recognized the rapid maturing and standardization of the field, publishing a handbook on the topic 5.

SRE is low in cost and its deployment has very little schedule impact. You can apply software reliability engineering to any software-based system, including legacy systems, beginning at the start of any release cycle. It encourages greater communication among different project roles. With SRE, testers typically participate as members of the system engineering team. They help develop operational profiles, set failure intensity objectives, and select project reliability strategies.

SRE is very customer-oriented. It involves direct interaction with customers, which (given a reasonable degree of competence on your part) enhances your image as a supplier and improves customer satisfaction. SRE is highly correlated with attaining Levels 4 and 5 of the SEI Capability Maturity Model.


Process Overview

Applying software reliability engineering to test involves five major activities: defining the "just right" reliability, developing operational profiles, preparing for test, executing test, and guiding test.

I will illustrate these activities in the context of an actual project at AT&T, which I call Fone Follower. I selected this example because of its simplicity; it in no way implies that software reliability engineering is limited to telecommunications systems. I have changed certain information to keep explanation simple and protect proprietary data.

Fone Follower is a system that lets telephone calls "follow" subscribers anywhere in the world (even to cell phones). Subscribers dial into a voice-response system and enter the telephone numbers at which they plan to be at various times. Incoming calls (voice or fax) that would normally be routed to a subscriber's telephone are then sent to Fone Follower, which forwards them in accordance with the program entered. If there is no response to a voice call and the subscriber has pager service, Fone Follower pages. If there is still no response or if the subscriber doesn't have pager service, Fone Follower forwards calls to the subscriber's voice mail.


Defining "Just Right" Reliability

To define the "just right" level of reliability for the product, you set the failure intensity objective (FIO), balancing among major quality characteristics users need. A failure is a departure of system behavior in execution from user needs. Failure intensity is simply the number of failures per unit time. The best way to determine the FIO is to use field data from a similar release or product. This data includes customer satisfaction surveys related to measured failure intensity, and an analysis of competing products. Then you engineer project software reliability strategies to meet these objectives. For example, you may determine the resources you will devote to requirements reviews, the amount of unit test, the degree to which you will implement fault tolerant features, etc.
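For illustration, suppose field data from a similar release is available. The following minimal sketch (Python, with invented numbers rather than actual project data) shows how such data can be converted into a failure intensity objective:

```python
# Hypothetical illustration: deriving a failure intensity objective (FIO)
# from field data of a similar release. All numbers are invented.

field_failures = 120          # failures reported over the observation period
field_exec_hours = 60_000     # execution hours accumulated in that period

field_failure_intensity = field_failures / field_exec_hours   # failures per hour
print(f"Measured field failure intensity: {field_failure_intensity:.4f} failures/hour")

# Suppose customer satisfaction surveys and analysis of competing products
# suggest users need roughly half the current failure intensity.
fio = field_failure_intensity / 2
print(f"FIO: {fio:.4f} failures/hour ({fio * 1000:.1f} failures per 1000 hours)")
```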


Developing Operational Profiles

An operation is a major task of short duration performed by a system, which returns control to the system when complete. It is a logical rather than a physical concept, in that an operation can be executed over several machines and it can be executed in noncontiguous time segments. An operation can be initiated by a user, another system, or the system's own controller. Some examples of operations are a command activated by a user, a transaction sent for processing from another system, a response to an event occurring in an external system, and a routine housekeeping task activated by your own system controller. The operational profile is simply the complete set of operations with their probabilities of occurrence. Table 1 shows an illustration of an operational profile from Fone Follower.

Table 1. Fone Follower Operational Profile

Operation                                        Occurrence Probability
Process voice call, no pager, answered           0.18
Process voice call, no pager, no answer          0.17
Process voice call, pager, answered              0.17
Process fax call                                 0.15
Process voice call, pager, answer on page        0.12
Process voice call, pager, no answer on page     0.10
Enter forwardees                                 0.10
Audit section of phone number database           0.009
Add subscriber                                   0.0005
Delete subscriber                                0.0005
Recover from hardware failure                    0.000001
Total                                            1.0
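For readers who prefer code to tables, the profile of Table 1 can be captured directly as a mapping from operations to occurrence probabilities. The sketch below (Python, purely illustrative) also checks that the probabilities sum to approximately 1:

```python
# Fone Follower operational profile from Table 1 as a simple mapping
# from operation name to occurrence probability (illustrative sketch).
operational_profile = {
    "Process voice call, no pager, answered":        0.18,
    "Process voice call, no pager, no answer":       0.17,
    "Process voice call, pager, answered":           0.17,
    "Process fax call":                              0.15,
    "Process voice call, pager, answer on page":     0.12,
    "Process voice call, pager, no answer on page":  0.10,
    "Enter forwardees":                              0.10,
    "Audit section of phone number database":        0.009,
    "Add subscriber":                                0.0005,
    "Delete subscriber":                             0.0005,
    "Recover from hardware failure":                 0.000001,
}

# The probabilities of a complete operational profile should sum to (about) 1.
total = sum(operational_profile.values())
assert abs(total - 1.0) < 0.01, total
```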

You can use operational profiles in system engineering to reduce the number of operations to those that are cost effective with respect to life-cycle system costs and benefits, to plan a competitive release strategy (schedule a small number of most-used operations for a speeded-up first version and defer the others to a later version), and to focus resources on the functions and modules that are most used or most critical. But operational profiles will also play a major role in preparing for and executing test.

To develop an operational profile, you identify the initiators of operations, enumerate the operations that are produced by each initiator, determine the occurrence rates of the operations, and determine the occurrence probabilities by dividing the occurrence rates by total operation occurrence rates.
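A sketch of that last step, with hypothetical occurrence rates (occurrences per hour) standing in for real field data:

```python
# Converting occurrence rates into occurrence probabilities by dividing each
# rate by the total rate. The rates below are invented for illustration only.
occurrence_rates = {
    "Process fax call":  225.0,   # occurrences per hour
    "Enter forwardees":  150.0,
    "Add subscriber":      0.75,
}

total_rate = sum(occurrence_rates.values())
occurrence_probabilities = {
    operation: rate / total_rate for operation, rate in occurrence_rates.items()
}
print(occurrence_probabilities)
```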

Many first-time users of SRE think that determining operation occurrence rates will be very difficult; our experience indicates much less difficulty than expected. Frequently, field data already exists for the same or similar systems, perhaps previous versions. If not, you can often collect it. Even if there is no direct data, you can usually make reasonable estimates from related information. Finally, failure intensity achieved in test is very robust with respect to errors in operation occurrence rates.

Preparing for Test

To prepare for test, we prepare the test cases and the test procedures. We allocate test cases to operations in accordance with their occurrence probabilities, with special consideration given to critical operations. We then select test cases within each operation on a uniform basis. Test procedures are load test controllers that set up environmental conditions and randomly select and invoke test cases from the test case set, based on the operational profile.
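A greatly simplified sketch of these two activities follows; the helper names and numbers are illustrative assumptions, not part of any actual test tool:

```python
import random

def allocate_test_cases(profile, total_cases):
    """Allocate test cases to operations in proportion to occurrence probability.

    Giving every operation at least one test case is a crude stand-in for the
    special consideration owed to rare but critical operations (e.g., recovery
    from hardware failure).
    """
    return {op: max(1, round(p * total_cases)) for op, p in profile.items()}

def select_next_test_case(profile, test_cases_by_operation, rng=random):
    """Pick an operation per the operational profile, then a test case uniformly within it."""
    operations = list(profile)
    weights = [profile[op] for op in operations]
    operation = rng.choices(operations, weights=weights, k=1)[0]
    return operation, rng.choice(test_cases_by_operation[operation])

# Tiny usage example with a two-operation profile (hypothetical numbers):
profile = {"Process fax call": 0.6, "Enter forwardees": 0.4}
cases = {op: [f"{op} #{i}" for i in range(1, n + 1)]
         for op, n in allocate_test_cases(profile, total_cases=10).items()}
print(select_next_test_case(profile, cases))
```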

Executing Test

We allocate test time among feature test, load test, and regression test. In feature test, test runs are executed essentially independently of each other, with interactions minimized. In load test, large numbers of test runs are executed simultaneously. Load test stimulates failures that can occur as a result of interactions among runs. In regression test, feature test runs are repeated after each build to see if any changes made to the system have spawned faults that cause failures. We identify failures, determine when they occurred, and establish the severity of their impact.
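The failure data needed from test execution is modest; a minimal sketch of a failure log recording when each failure occurred and how severe its impact was (the severity scale and entries are invented):

```python
from dataclasses import dataclass

@dataclass
class Failure:
    exec_hours: float   # cumulative execution time at which the failure occurred
    severity: int       # e.g., 1 (most severe impact) through 4 (minor annoyance)
    description: str

# Example entries, invented for illustration.
failure_log = [
    Failure(12.5, 3, "Fax call misrouted to voice mail"),
    Failure(40.0, 2, "Pager not signaled when voice call unanswered"),
]
```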

Guiding Test

You will interpret failure data differently for software you are developing and software you are acquiring. For software you are developing you attempt to remove the faults that are causing failures. You track progress, generally at fixed time intervals, by looking at the failure intensity to failure intensity objective (FI/FIO) ratio. For software you are acquiring (this can be by contract, purchase, or reuse from a library), you determine whether that software should be accepted or rejected, with limits on the risks taken. For acquired software, you interpret failure data after each failure.

For developed software, we estimate the FI/FIO ratio from the times of failure events or the number of failures per time interval, using reliability estimation programs such as CASRE 5. These programs are based on software reliability models and statistical inference. Figure 1 shows a typical plot of the FI/FIO ratio. Significant upward trends in the plot commonly indicate nonstationary test selection or system evolution due to poor change control. Both need correction if you are to have a quality test effort that you can rely on. We consider releasing the software when the FI/FIO ratio reaches 0.5.



Figure 1. Plot of FI/FIO Ratio
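Tools such as CASRE apply full software reliability models and statistical inference; as a much cruder stand-in that conveys the idea, the sketch below estimates failure intensity from recent failure times and compares it with the objective and the 0.5 release guideline (the windowed estimator and all numbers are illustrative assumptions, not CASRE's algorithm):

```python
def fi_fio_ratio(failure_times, current_exec_hours, fio, window=10):
    """Crude FI/FIO estimate from the most recent failures.

    failure_times: cumulative execution hours at which failures occurred.
    fio: failure intensity objective, in failures per execution hour.
    """
    recent = failure_times[-window:]
    if not recent:
        return 0.0
    window_start = failure_times[-window - 1] if len(failure_times) > window else 0.0
    failure_intensity = len(recent) / (current_exec_hours - window_start)
    return failure_intensity / fio

# Hypothetical failure times (hours) and an FIO of 0.01 failures per hour.
ratio = fi_fio_ratio([12.5, 40.0, 95.0, 180.0], current_exec_hours=600.0, fio=0.01)
print(f"FI/FIO = {ratio:.2f} -> {'consider release' if ratio <= 0.5 else 'keep testing'}")
```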

For acquired software we apply a reliability demonstration chart, shown in Figure 2. Failure times are normalized by multiplying by the failure intensity objective. Each failure is plotted on the chart. Depending on the region in which it falls, you may accept or reject the software being tested or continue test. Figure 2 shows a test in which the first two failures indicate you should continue testing and the third indicates you should accept the software.

Charts can be constructed for different levels of consumer risk (the risk of accepting a bad program) and supplier risk (the risk of rejecting a good program).



Figure 2. Reliability Demonstration Chart
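Underlying the chart is a sequential test. The sketch below is an illustrative reimplementation based on the standard sequential probability ratio test, assuming a discrimination ratio of 2 and 10 percent consumer and supplier risks; the failure times in the usage example are invented, not the Figure 2 data, but they reproduce the same continue, continue, accept pattern:

```python
import math

def demo_chart_decision(n, failure_time, fio, gamma=2.0,
                        consumer_risk=0.10, supplier_risk=0.10):
    """Decision at the n-th failure, occurring at cumulative execution time failure_time.

    gamma is the discrimination ratio (factor by which an unacceptable failure
    intensity exceeds the objective). Based on a sequential probability ratio test.
    """
    tau = fio * failure_time                             # normalized failure time
    log_ratio = n * math.log(gamma) - (gamma - 1.0) * tau
    reject_bound = math.log((1.0 - consumer_risk) / supplier_risk)
    accept_bound = math.log(consumer_risk / (1.0 - supplier_risk))
    if log_ratio >= reject_bound:
        return "reject"
    if log_ratio <= accept_bound:
        return "accept"
    return "continue"

# Invented failure times (hours) with an FIO of 0.001 failures per hour:
for n, t in enumerate([1000.0, 2000.0, 5000.0], start=1):
    print(n, demo_chart_decision(n, t, fio=0.001))   # continue, continue, accept
```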


Conclusion

Practitioners in many organizations (see 1,2 for lists) have found software reliability engineering unique in providing a standard, proven way to engineer testing so that they can deliver a software-based system with confidence in its reliability, in minimum time, and with maximum efficiency. It is a vital skill in today's marketplace.

About the Author

John D. Musa teaches courses and consults in software reliability engineering and testing. He has been involved in software reliability engineering since 1973 and is generally recognized as one of the creators of that field. Recently, he was Technical Manager of Software Reliability Engineering at AT&T Bell Laboratories, Murray Hill. He organized and led the transfer of software reliability engineering into practice within AT&T, spearheading the effort that defined it as a "best current practice." Musa has also been actively involved in research to advance the theory and practice of software reliability engineering. He has published more than 100 articles and papers, given more than 175 major presentations, and made several videos. He is principal author of Software Reliability: Measurement, Prediction, Application and author of Software Reliability Engineering: More Reliable Software, Faster Development and Testing.

Musa received an MS in electrical engineering from Dartmouth College. He has been listed in Who's Who in America and American Men and Women of Science since 1990. He is a fellow of the IEEE and the IEEE Computer and Reliability Societies and a member of the ACM and ACM Sigsoft.

Author Contact Information

John D. Musa
39 Hamilton Road
Morristown, NJ 07960-5341 U.S.A.
Phone: 1-973-267-5284
Fax: 1-973-267-6788
E-mail: [email protected]
http://members.aol.com/JohnDMusa
