Using Failure History to Improve Reliability in Information Technology

by Dolores R. Wallace, National Institute of Standards and Technology

Introduction

Achieving 100% software reliability may seem an unreasonable goal, and both developers and consumers of many software products are largely unsure about the reliability of their product or purchase. Today, however, many opportunities exist for gaining some assurance about software products. Current practices address process (e.g., CMM, ISO 9000), people (e.g., software engineering degrees, certification exams, licensing), and product (e.g., measurement of the product itself); together they encompass the major areas of progress toward software reliability. One product-oriented approach uses history data on the faults and failures of software systems, collected either from the development and assurance processes or from operational use, to improve the reliability of software products. The information contained in these histories characterizes the nature of faults, or defects, for a specific product line. The objectives are to use the history to determine how to prevent faults from entering the product, to remove faults before the product is released, and to measure a product's fault-frequency profile against others in the same domain. Finally, the histories may reveal problems that call for better methods of preventing or detecting faults, thereby motivating justifiable research directions.

Case Studies

Two case studies indicate how history data can be used. The first is a study of failures of medical devices after release; the second examines faults recorded during the development of a transportation application. The generic lessons learned here can be applied in other domains, and the specific lessons may indicate how to study a domain and use failure history to ensure that reliable software is produced before a system is released.

Medical Device Failures

The medical device failures, none of which involved death or serious injury, occurred between 1983 and 1997. The 342 software-related failures were due principally to faults in logic, calculation, data, and change impact, with the remainder attributed to requirements, omission, quality assurance, timing, interface, initialization, configuration management, and fault tolerance.

Many of these faults could have been prevented with requirements verification and with quality assurance practices aimed at the specific types of faults. For example, several of the calculation faults stemmed from using incorrect specifications for floating-point calculation on the target computer or from not checking formulas carefully against their source. Others, such as faults in logic involving multiple conditions, indicate a need first for specifying the requirements correctly and second for finding methods of testing multiple conditions without exhaustive testing. In other instances, configuration management practices did not carefully retain version control for different international specifications. These are examples of specific lessons.
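
One established technique of the kind called for in the last point is modified condition/decision coverage (MC/DC), which demonstrates each condition's independent effect on a decision using on the order of n+1 test vectors rather than all 2^n combinations. The Python sketch below is a minimal illustration only; the decision function and all names in it are hypothetical, not drawn from the case study.

    from itertools import product

    def decision(a, b, c):
        # Example multi-condition decision, as might appear in device logic.
        return (a and b) or c

    def mcdc_pairs(fn, n):
        """For each condition, find a pair of test vectors that differ only
        in that condition yet flip the decision outcome, showing that the
        condition independently affects the result."""
        pairs = {}
        for i in range(n):
            for vec in product([False, True], repeat=n):
                flipped = list(vec)
                flipped[i] = not flipped[i]
                if fn(*vec) != fn(*flipped):
                    pairs[i] = (vec, tuple(flipped))
                    break
        return pairs

    for cond, (v1, v2) in mcdc_pairs(decision, 3).items():
        print(f"condition {cond}: {v1} -> {decision(*v1)}, "
              f"{v2} -> {decision(*v2)}")

The union of such pairs forms a small test set that exercises every condition's influence, which is the kind of non-exhaustive multiple-condition testing the faults above suggest is needed.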

A different type of lesson is that knowledge of the faults characteristic of a particular system aids decisions about directing verification and validation resources for optimum value. Specifically, the results indicate how failure data can be used to examine worst-case scenarios in a product line and, from there, to identify how best to apply quality practices in searching for the types of faults prevalent in or characteristic of that product line. The results of this study affirm generally accepted quality practices regardless of domain, and may indicate the need for more sophisticated corrective techniques for solving requirements specification and logic problems.

Transportation Application

The second case study involves a transportation application, with data collected during the activities from requirements specification through acceptance testing of at least some parts of the system. In this instance, one lesson had already been learned: because most faults occur during requirements specification, more effort should be expended to catch faults during that activity. The prevalent fault classes in this project are logic, specification, output, computation, performance, improvement, and initialization, with interface, omission, data handling, input, and documentation comprising the remaining groups. There was not enough information to classify approximately 10% of the faults.

Each fault is associated with its severity level and the development or test activity during which it was discovered. For example, 32% of the most severe problems were discovered during requirements specification, but 31% were not discovered until system test. Of all faults, 53% were discovered during requirements specification, while 15% were caught during integration test and 13% during system test. Why did those faults escape detection until so late, especially since 21 of those found in system test had severity level 1? Could some of them have been detected by other activities during design and coding? Further examination of the fault classes detected in each group may lead to an understanding of the issues, and of the verification methods, to be addressed better in the next similar project. More data are needed from additional projects to answer the types of questions posed by these two case studies.
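
Concretely, the examination described here amounts to cross-tabulating fault records by class, severity, and discovery activity. The minimal Python sketch below shows one way such a tabulation might look; the record fields and sample entries are hypothetical stand-ins, not the project's actual data.

    from collections import Counter

    # Illustrative records only; real entries would come from project logs.
    faults = [
        {"class": "logic", "severity": 1, "activity": "system test"},
        {"class": "specification", "severity": 2, "activity": "requirements"},
        {"class": "computation", "severity": 1, "activity": "integration test"},
        {"class": "interface", "severity": 3, "activity": "requirements"},
    ]

    # Distribution of faults over the activities in which they were found.
    by_activity = Counter(f["activity"] for f in faults)
    total = len(faults)
    for activity, count in by_activity.items():
        print(f"{activity}: {count} faults ({100 * count / total:.0f}%)")

    # The population the questions above single out: the most severe
    # faults that escaped detection until system test.
    late_severe = [f for f in faults
                   if f["severity"] == 1 and f["activity"] == "system test"]
    print(f"severity-1 faults found in system test: {len(late_severe)}")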

NIST Repository and Support Tools for Collection of Data

One activity of the Error, Fault, and Failure Repository Project in the Information Technology Laboratory of the National Institute of Standards and Technology (NIST) is to collect additional data from projects. The objective of the project is to improve software quality by establishing fault models that reflect software failures in real-world systems. Greater understanding of the types of software errors that lead to faults and failures, along with the frequency and distribution of those faults, may result in the production of more reliable software. Project data from many domains will enable researchers and practitioners to understand weaknesses in current development and assurance methods and to affirm benefits of generally accepted quality practices. Some data may demonstrate the need for further research in selected topics, such as requirements specification and testing of complex systems. The NIST repository and support tools for collection of data are available to the public at http://hissa.nist.gov/effProject/project.html
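
As a rough sketch of what a submitted fault record might contain, the Python fragment below defines a simple record type and serializes it. The field names are assumptions chosen for illustration from the fault attributes discussed in the case studies; they are not the repository's actual schema.

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class FaultRecord:
        # Hypothetical fields, not the NIST repository's actual format.
        project_domain: str   # e.g., "medical device", "transportation"
        fault_class: str      # e.g., "logic", "calculation", "interface"
        severity: int         # 1 = most severe
        activity_found: str   # lifecycle activity in which the fault surfaced
        description: str

    record = FaultRecord(
        project_domain="transportation",
        fault_class="logic",
        severity=1,
        activity_found="system test",
        description="Multi-condition guard selected the wrong branch.",
    )
    print(json.dumps(asdict(record), indent=2))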

About the Author

Ms. Wallace leads the Reference Data: Software Fault & Failure Data Collection Project (http://hissa.nist.gov/effProject/), which provides metrology and reference data for software assurance. She is interested in methods of experimentation and measurement of software technology. Her publications on software verification and validation include NIST SP 500-234, Reference Information for the Software Verification and Validation Process; V&V articles in the Encyclopedia of Software Engineering (Wiley); and the IEEE Tutorials on Software Requirements Engineering and Software Engineering. She is co-author of Software Quality Control, Error Analysis, and Testing (Noyes Data Corporation, 1995), served as co-chair for IEEE Std 1012-1998, Software Verification and Validation, and has published papers on software experimentation and other software engineering topics. She received the 1994 Department of Commerce Bronze Medal Award. Currently she serves on the editorial board of the American Society for Quality's Software Quality Professional and on the Industrial Advisory Board for the IEEE Computer Society's Software Engineering Body of Knowledge Project. She has a master's degree in mathematics from Case Western Reserve University.

Author Contact Information

Dolores R. Wallace
National Institute of Standards and Technology (NIST)
Information Technology Laboratory
[email protected]
http://hissa.nist.gov/



