Software Reliability Engineering - An Overview

More Reliable Software Faster and Cheaper

By John D. Musa, Software Reliability Engineering and Testing Courses

Software reliability engineering (SRE) focuses on making practitioners more competitive without working longer. This is valuable in a world of globalization and outsourcing where professionals want more time for their personal lives. Customers demand that software-based products be more reliable, built faster, and built cheaper (in general order of importance). Success in meeting these demands affects the market share and profitability of a product for a company and hence individuals in the company. The demands conflict, causing risk and overwhelming pressure, and hence a strong need for this practice that can help balance them.

SRE is now standard, proven, widespread, and widely applicable. It is very effective and low in cost, and its implementation has virtually no schedule impact. We will describe it and how it works. We will then outline the SRE process to give you a feel for the practice.

What’s SRE Like?

SRE, backed with science and technology, quantitatively plans and guides software development and test, always keeping in mind a sound business perspective. You add and integrate it with other good processes and practices; you do not replace them. You choose the most cost-effective software reliability strategies for your situation.

We deliberately define software reliability in the same way as hardware reliability, so that we can determine system reliability from hardware and software component reliabilities, even though the mechanisms of failure are different.

Why SRE Works

SRE works by quantitatively characterizing two things about the product: the expected relative use of its functions and its required major quality characteristics. The major quality characteristics are reliability, availability, delivery date, and life-cycle cost. You then apply these characteristics in managing development and test.

When you have characterized use, you can substantially increase development efficiency by focusing resources on functions in proportion to use and criticality. You can also maximize test effectiveness by making tests highly representative of use in the field. Increased efficiency increases the effective resource pool available to add customer value.

When you have determined the precise balance of major quality characteristics that meets user needs, you spend your increased resource pool to carefully match them. You choose software reliability strategies to meet the objectives, based on data collected from previous projects. You also track reliability in system test against its objective to adjust your test process and to determine when you can end test. The result is greater efficiency in converting resources to customer value.

A Proven, Standard, Widespread Best Practice

SRE is a proven, standard, widespread best practice. As one example of its proven benefits, AT&T applied SRE to two different releases of a switching system, International Definity PBX. Customer-reported problems decreased by a factor of 10, the system test interval decreased by a factor of 2, and total development time decreased 30%. No serious service outages occurred in 2 years of deployment of thousands of systems in the field.

SRE became an AT&T Best Current Practice in May 1991 after undergoing rigorous scrutiny as to its cost-effectiveness by a large number of managers. McGraw-Hill recognized SRE as a standard practice by publishing an SRE handbook in 1996. The IEEE and the American Institute of Aeronautics and Astronautics, among others, have developed standards.

Users have published almost 70 articles about their successful application of SRE, and the number continues to grow [1,2]. Since practitioners generally publish infrequently, the total number of successful applications is probably many times greater. We have picked examples of major users for this special issue.

SRE is widely applicable. Technically, you can apply SRE to any software-based product, starting at the beginning of any release cycle. Economically, you can apply SRE to any software-based product, except for very small components (perhaps those involving a total effort of less than 2 staff months). However, if you use a very small component for several products, then it probably will be feasible to use SRE. If not, it still may be worthwhile to implement SRE in abbreviated form.

SRE is independent of development technology and platform. It requires no changes in architecture, design, or code, but it may suggest changes that would be useful. You can deploy it in one step or in stages.

SRE is very customer-oriented: it involves frequent direct close interaction with customers. This enhances a supplier’s image and improves customer satisfaction. SRE is highly correlated with attaining Levels 3 through 5 of the Software Engineering Institute Capability Maturity Model.

Despite the word "software," SRE deals with the entire product, although it focuses on the software part. It takes a full-life-cycle view, involving system engineers, system architects, developers, users, and managers in a collaborative relationship.

The cost of implementing SRE is small. There is an investment cost of not more than 3 equivalent staff days per person in an organization, which includes a 2-day course for everyone and planning with a much smaller number. The operating cost over the project life cycle typically varies from 0.1 to 3 percent of total project cost, dropping rapidly as project size increases. The largest cost component is the cost of developing the operational profile.

The schedule impact of SRE is minimal. Most SRE activities involve only a small effort that can parallel other software development work. The only significant critical path activity is 2 days of training.

SRE Process

The SRE process has six principal activities, as shown in Figure 1. The figure shows the software development process below the SRE process and in parallel with it in time. Both processes follow spiral models, but for simplicity we don’t show the feedback paths. In the field, we collect certain data and use it to improve the SRE process for succeeding releases.

Figure 1. SRE Process

Define Product

The first activity is to define the product. This involves first determining the supplier and the customer, which is not a trivial activity in these days of products built at multiple levels and across corporate boundaries. Then you list all the systems associated with the product that you must test independently. An example is variations, which are versions of the base product that you design for different environments. For example, you may design a product for both Windows and Macintosh platforms.

Implement Operational Profiles

Operational profiles quantify how software is used. An operation is a major system logical task, which returns control to the system when complete. An operational profile is a complete set of operations with their probabilities of occurrence. Table 1 shows an illustration of an operational profile.

Operation                     Occurrence Probability
Enter card                    0.332
Verify PIN                    0.332
Withdraw checking             0.199
Withdraw savings              0.066
Deposit checking              0.040
Deposit savings               0.020
Query status                  0.00664
Test terminal                 0.00332
Input to stolen cards list    0.00058
Backup files                  0.000023
Total                         1.000000
 
Table 1. Operational Profile for ATM Machine

To develop an operational profile, we first list the operations. We then determine the operation occurrence rates and find the occurrence probabilities by dividing the occurrence rates by the total occurrence rate.
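The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the SRE toolset: the function name `operational_profile` and the occurrence rates are hypothetical, chosen only to show the arithmetic.

```python
def operational_profile(occurrence_rates):
    """Convert occurrence rates (e.g. operations per hour) to probabilities."""
    total = sum(occurrence_rates.values())
    if total <= 0:
        raise ValueError("total occurrence rate must be positive")
    # Each probability is the operation's rate divided by the total rate.
    return {op: rate / total for op, rate in occurrence_rates.items()}


# Hypothetical rates in operations per hour (assumed values, not from Table 1).
rates = {
    "Withdraw checking": 600.0,
    "Deposit checking": 300.0,
    "Query status": 100.0,
}
profile = operational_profile(rates)
for op, probability in profile.items():
    print(f"{op:20s} {probability:.3f}")
```

Because only the ratios matter, the rates can be in any unit (per hour, per day, per customer) as long as every operation uses the same one.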

When implementing SRE for the first time, some software practitioners are initially concerned about possible difficulties in determining occurrence rates. Experience shows them that this is usually not a difficult problem. Usage data that they are unaware of often exists on the business side of the house. Companies usually approve new products for development only after making a business case study. Such a study must typically estimate occurrence rates for the use of the planned functions to demonstrate profitability. In addition, data is often available or can be derived from a previous release or similar system. One can collect data from the field, and if all else fails, one can usually make reasonable estimates of expected use. In any case, even if there are errors in estimating occurrence rates, the advantage of having an operational profile far outweighs not having one at all.

Once you have developed the operational profile, you can employ it, along with criticality information, in many different ways to more effectively distribute your development and test resources. In addition, you can:

1. Identify operations whose usage does not justify their cost and remove them or handle them in other ways (Reduced Operation Software or ROS)

2. Plan a more competitive release strategy using operational development. With operational development, development proceeds operation by operation, ordered by the operational profile. This makes it possible to deliver the most used, most critical capabilities to customers earlier than scheduled because the less used, less critical capabilities are delivered later.

Engineer "Just Right" Reliability

A failure is any departure of system behavior in execution from user needs. It differs from a fault, which is a defect in system implementation that causes the failure when executed. Failure intensity is simply the number of failures per time unit. It is an alternative way of expressing reliability.
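To illustrate why failure intensity is an alternative way of expressing reliability, the sketch below converts between the two. It assumes the failure intensity is constant over the period of interest, so that reliability over a duration t is exp(-failure intensity × t). That assumption, the function names, and the numbers are ours for illustration and are not stated in this article.

```python
import math


def reliability(failure_intensity, duration):
    """Probability of failure-free operation for `duration`,
    assuming a constant failure intensity (same time unit as duration)."""
    return math.exp(-failure_intensity * duration)


def failure_intensity_for(reliability_target, duration):
    """Constant failure intensity that yields `reliability_target` over `duration`."""
    return -math.log(reliability_target) / duration


# Hypothetical example: 0.001 failures per hour over a 10-hour mission.
r = reliability(0.001, 10.0)
print(f"reliability over 10 hours: {r:.5f}")
print(f"failure intensity recovered: {failure_intensity_for(r, 10.0):.6f} per hour")
```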

To engineer the "just right" level of reliability for a product, you must first define "failure" for the product. You then analyze user needs and set the product’s failure intensity objective (FIO). You compute the FIO for the software you are developing by subtracting the total of the expected failure intensities of all hardware and acquired software components from the product FIO. You track reliability growth during system test of all systems you are developing with the failure intensity to failure intensity objective (FI/FIO) ratios. You also apply the developed software FIOs in choosing the mix of software reliability strategies that meet reliability, schedule, and product cost objectives with the lowest development cost.
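The subtraction that yields the developed software FIO can be sketched as follows. The component names and failure intensities are hypothetical values for illustration, expressed in failures per 1000 hours.

```python
def developed_software_fio(product_fio, acquired_failure_intensities):
    """Product FIO minus the expected failure intensities of all
    hardware and acquired software components (same units throughout)."""
    acquired_total = sum(acquired_failure_intensities.values())
    fio = product_fio - acquired_total
    if fio <= 0:
        # The acquired components alone already use up the whole budget.
        raise ValueError("acquired components exceed the product FIO")
    return fio


# Hypothetical values, in failures per 1000 hours.
product_fio = 100.0
acquired = {"hardware": 1.0, "operating system": 4.0, "database": 5.0}
software_fio = developed_software_fio(product_fio, acquired)
print(f"developed software FIO: {software_fio} failures per 1000 hours")
```

The error case is worth keeping: if the acquired components consume the entire product FIO, no amount of work on the developed software can meet the objective, and the product FIO or the choice of components has to be revisited.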

Prepare For Test

The Prepare for Test activity uses the operational profiles you have developed to guide the preparation of test cases and test procedures. You distribute test cases across operations in accordance with the operational profile and criticality information. For example, if there are 500 test cases to distribute and operation A has an occurrence probability of 0.17 and ordinary criticality, it will get that fraction of them, or 85.
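A minimal sketch of this proportional distribution follows. It covers only operations of ordinary criticality. The largest-remainder rounding, which keeps the counts summing to the total, is our choice for the sketch and is not prescribed by the article. Operations B and C and their probabilities are hypothetical.

```python
import math


def allocate_test_cases(profile, total_cases):
    """Distribute total_cases across operations in proportion to the profile."""
    total_prob = sum(profile.values())
    exact = {op: total_cases * p / total_prob for op, p in profile.items()}
    counts = {op: math.floor(x) for op, x in exact.items()}
    # Hand out the cases lost to rounding, largest fractional part first.
    leftover = total_cases - sum(counts.values())
    by_remainder = sorted(exact, key=lambda op: exact[op] - counts[op], reverse=True)
    for op in by_remainder[:leftover]:
        counts[op] += 1
    return counts


# Operation A matches the example in the text; B and C are hypothetical.
profile = {"A": 0.17, "B": 0.50, "C": 0.33}
counts = allocate_test_cases(profile, 500)
print(counts)
```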

The test procedure is the controller that invokes test cases during execution. It uses the operational profile, modified to account for critical operations and for reused operations from previous releases, to determine the proportions of invocation.
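One way to picture such a controller is as a weighted random selection of operations. The sketch below uses the probabilities from Table 1 unmodified. It assumes any adjustment for critical or reused operations has already been applied to the profile passed in. The function name `select_operations` and the seed are hypothetical.

```python
import random
from collections import Counter


def select_operations(profile, n_invocations, seed=None):
    """Pick operations to invoke, each with probability proportional
    to its weight in the (already adjusted) operational profile."""
    rng = random.Random(seed)  # seeded so that a test run can be repeated
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_invocations)


# Probabilities from Table 1.
profile = {
    "Enter card": 0.332,
    "Verify PIN": 0.332,
    "Withdraw checking": 0.199,
    "Withdraw savings": 0.066,
    "Deposit checking": 0.040,
    "Deposit savings": 0.020,
    "Query status": 0.00664,
    "Test terminal": 0.00332,
    "Input to stolen cards list": 0.00058,
    "Backup files": 0.000023,
}
runs = select_operations(profile, 10_000, seed=1)
observed = Counter(runs)
for op in profile:
    print(f"{op:28s} {observed[op] / len(runs):.4f}")
```

With enough invocations the observed proportions approach the profile, which is what makes the test representative of field use.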

Execute Test

In the Execute Test activity, you first allocate test time among the associated systems and then among types of test (feature, load, and regression). You then distribute test time among operations in accordance with the operational profile. Identify failures, along with when they occur, for use in Guide Test.

Guide Test

For software that you develop, track reliability growth as you attempt to remove faults. Input failure data that you collect in Execute Test to a reliability estimation program such as CASRE [2]. Execute this program periodically and plot the FI/FIO ratio as test proceeds as shown in Figure 2. If you observe a significant upward trend in this ratio, you should determine and correct the causes. The most common causes are system evolution, which may indicate poor change control, and changes in test selection probability with time, which may indicate a poor test process.
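The idea of tracking the FI/FIO ratio can be illustrated with a deliberately crude estimator: failures observed in a trailing window of execution time, divided by the window length. This is a stand-in for illustration only. It is not the reliability growth modeling that a program such as CASRE performs, and the failure times, window, and FIO below are hypothetical.

```python
def fi_fio_ratio(failure_times, now, window, fio):
    """Crude FI/FIO estimate: failures in the trailing window of
    execution time, per unit time, divided by the objective."""
    recent = [t for t in failure_times if now - window < t <= now]
    failure_intensity = len(recent) / window
    return failure_intensity / fio


# Hypothetical cumulative execution times (hours) at which failures occurred.
failure_times = [2, 5, 9, 14, 22, 35, 50, 80, 120, 190]
fio = 0.01  # hypothetical objective, failures per hour

ratios = {
    now: fi_fio_ratio(failure_times, now, window=50, fio=fio)
    for now in (50, 100, 200)
}
for now, ratio in ratios.items():
    print(f"after {now:3d} hours: FI/FIO = {ratio:.1f}")
```

A ratio that falls as test proceeds indicates reliability growth. A sustained rise is the signal to look for the causes named above.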

Figure 2. Plot of FI/FIO Ratio

If you find you are close to your scheduled test completion date but have an FI/FIO ratio substantially greater than 0.5, you have three feasible options: defer some features or operations, rebalance your major quality characteristic objectives, or increase work hours for your organization. When the FI/FIO ratio reaches 0.5, you should consider release as long as essential documentation is complete and you have resolved outstanding high severity failures (you have removed the faults causing them).

If you expect that customers will acceptance test the software you are delivering, you should do so first with a procedure called certification test. Certification test involves plotting the time of each failure as it occurs in normalized units (MTTFs) on a reliability demonstration chart as shown in Figure 3. In this case, the first two failures fall in the Continue region. This means that there is not enough data to reach an accept or reject decision. The third failure falls in the Accept region, which indicates that you can accept the software, subject to the levels of risk associated with the chart you are using. If these levels of risk are unacceptable, you construct another chart with the levels you desire [1] and replot the data.
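The decision logic behind such a chart can be sketched in code. The sketch assumes the region boundaries come from a sequential probability ratio test on a constant failure intensity, with a supplier risk `alpha`, a consumer risk `beta`, and a discrimination ratio `gamma`. That construction, the parameter names, the default values, and the failure times are assumptions made for illustration and are not taken from Figure 3.

```python
import math


def certification_decision(n, normalized_time, alpha=0.1, beta=0.1, gamma=2.0):
    """Classify failure number n occurring at normalized_time
    (failure time multiplied by the FIO, i.e. measured in MTTFs)."""
    log_gamma = math.log(gamma)
    # Boundaries of the Continue region for failure number n.
    accept_bound = (math.log(beta / (1 - alpha)) - n * log_gamma) / (1 - gamma)
    reject_bound = (math.log((1 - beta) / alpha) - n * log_gamma) / (1 - gamma)
    if normalized_time >= accept_bound:
        return "Accept"
    if normalized_time <= reject_bound:
        return "Reject"
    return "Continue"


# Hypothetical normalized failure times for the first three failures.
failures = [0.8, 2.0, 4.5]
decisions = [
    certification_decision(n, t) for n, t in enumerate(failures, start=1)
]
print(decisions)
```

Lower risk levels or a discrimination ratio closer to 1 widen the Continue region, so a decision requires more test time.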

Collect Field Data

After you ship a product, you collect certain field data to use in engineering succeeding releases and other products, often building recording and reporting routines into the product. Collect data on failure intensity and on customer satisfaction with the major quality characteristics and use this information in setting the failure intensity objective for the next release. Measure operational profiles in the field and use this information to correct the operational profiles you estimated. Finally, collect information that will let you refine the process of choosing reliability strategies in future projects.

Figure 3. Reliability Demonstration Chart

Conclusion

If you apply SRE in all the software-based products you develop, you can be confident that you have maximized your efficiency in balancing your customers’ needs for reliability and availability, time of delivery, and cost. Being able to do this is a vital skill to possess if you are to be competitive in today’s marketplace.

To Explore Further

1. Musa, J. D. 2004. Software Reliability Engineering: More Reliable Software Faster and Cheaper - Second Edition, ISBN 1-4184-9388-0 (hardcover), ISBN 1-4184-9387-2 (paperback), AuthorHouse. This book is a very practitioner-oriented, systematic, thorough, up-to-date presentation of SRE practice. It includes workshop templates for applying SRE to your project and more than 700 frequently asked questions. You can browse the book and buy it through the website noted below.

2. Software Reliability Engineering website. An essential guide to keeping current in software reliability. Includes short and long overviews; book, course, and consulting information; a bibliography of articles by SRE users; deployment advice; the Question of the Month; links to standards, data, and the downloadable CASRE program: http://members.aol.com/JohnDMusa/

About the Author

John D. Musa is one of the founders of SRE, an IEEE Fellow, and Engineer of the Year 2004. In Who’s Who in America since 1990, he is a prolific researcher, international consultant and teacher, and experienced and practical software developer and manager. Widely recognized as the leader in reducing SRE to practice, he spearheaded the effort that convinced AT&T to make it a "Best Current Practice" in 1991. He has actively implemented and disseminated worldwide his vision of applying SRE to guiding business, engineering, and management decisions in the development, testing, acquisition, and use of software-based systems. He is principal author of the widely acclaimed pioneering book Software Reliability: Measurement, Prediction, Application and author of the eminently practical books Software Reliability Engineering: More Reliable Software, Faster Development and Testing and Software Reliability Engineering: More Reliable Software Faster and Cheaper - Second Edition.

Author Contact Information:

John D. Musa Software Reliability Engineering and Testing Courses Email: [email protected]

Copyright John D. Musa 2004. Permission is granted to reproduce or distribute this article in its entirety with proper credits included, provided it is not sold or otherwise commercially exploited.

December 2004
Vol. 8, Number 1

Software Reliability Engineering
 

 
