Practical Software and System Availability and Reliability Estimation

by John Gaffney, Lockheed Martin

Introduction

In an increasing number of cases, the acquirers of software intensive systems for both Government and civilian applications are requiring that these systems meet certain availability and/or reliability objectives. This article presents aspects of a method and supportive tools that have been developed and applied at Lockheed Martin in the estimation of availability and reliability of various large software intensive systems. The methodology and supportive tools have evolved from work started at the former IBM Federal Systems, now various Lockheed Martin companies.

Availability and reliability are closely related measures. A definition for availability is the proportion of some period of time that the system is operating satisfactorily; what "satisfactorily" means must, of course, be defined. A definition for reliability is the probability that the system does not fail for a stated length of time, starting from some specific time. For calculations/estimates of availability, we need information about both failure times/rates and times/rates for service restoration (not necessarily repair). We have used the term system in the definitions just presented; we could also apply them to the hardware or software components of a system, such as a software intensive system. We can combine the unavailabilities of a system due to software and due to hardware (and, on some occasions, to procedures) into an overall figure for system unavailability; availability is equal to one minus unavailability.
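As a simple illustration of how these quantities combine, consider the short sketch below. It assumes the common mean-time formulation of availability (mean time between failures versus mean time to restore service) and the approximation that small unavailabilities simply add; the numbers are invented for illustration and are not from any particular program.

def availability(mtbf_hours, mttr_hours):
    """Availability = uptime / (uptime + service restoration time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative values only (assumed, not from the article).
hw_unavailability = 1.0 - availability(mtbf_hours=2000.0, mttr_hours=4.0)
sw_unavailability = 1.0 - availability(mtbf_hours=500.0, mttr_hours=0.5)

# Combine the hardware- and software-caused unavailabilities into an
# overall system figure (additive approximation for small unavailabilities).
system_unavailability = hw_unavailability + sw_unavailability
system_availability = 1.0 - system_unavailability

print(f"System unavailability: {system_unavailability:.5f}")
print(f"System availability:   {system_availability:.5f}")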

There are two principal types of availability and reliability models: the white box and the black box. The former requires detailed knowledge of the (hardware and software) elements of the system of concern and of their interactions. The latter looks at the externally visible behavior of the system, without requiring knowledge of the detailed nature of those elements or their interactions (such as interactions among failure-tolerant software and/or hardware units that might mask the effects of certain types of hardware or software failures from the externally visible behavior of the system).

The principal parameter to be estimated for hardware or software failure is λ, the failure rate (e.g., failures per hour, per month, etc.). In the case of software, λ is actually a function of time, since the failure rate eventually decreases as defects are discovered and few if any new defects (not present before testing began) are introduced. In the case of hardware, we often consider λ to be fixed; that is, the failure rate for the hardware, or for hardware-caused system outages (under the black box model as defined above), is constant because the hardware is assumed to have "burned in" or to be mature or to have a mature design. We now focus on software, since it has a non-constant λ. After the software has been delivered (i.e., placed in operation), the function λ(t) is often taken to be a monotonically decreasing function of time, say of the form:

λ(t) = λ0 * exp(-bt),

where λ0 = E*b, E = the number of defects in the software at t = 0 (i.e., at delivery/start of operation), and b = 1/tp, where tp, the time constant, is the time at which 0.63*E (i.e., (1 - 1/e)*E) defects will have been discovered. In actuality, this decay is often not observed for some period after the software is delivered/placed in operation. During that period, there is often a "surge" of defect discovery, due to one or both of two principal causes: the test environment did not accurately represent the operational environment (so stimuli of types not applied during testing were encountered), and/or there was additional software in operation, not present during testing, that opened other error paths.
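The relationship among E, b, and tp can be made concrete with a short calculation. The sketch below simply evaluates the model stated above; the parameter values (E = 120 defects, tp = 6 months) are made-up illustrations, not data from an actual program.

import math

def failure_rate(t, E, tp):
    """Software failure rate lambda(t) = E*b*exp(-b*t), with b = 1/tp."""
    b = 1.0 / tp          # decay constant
    lam0 = E * b          # initial failure rate at delivery (t = 0)
    return lam0 * math.exp(-b * t)

def defects_found_by(t, E, tp):
    """Expected defects discovered by time t: E * (1 - exp(-t/tp))."""
    return E * (1.0 - math.exp(-t / tp))

E, tp = 120.0, 6.0        # assumed: 120 latent defects, 6-month time constant
print(failure_rate(0.0, E, tp))       # initial rate: E/tp = 20 defects/month
print(defects_found_by(tp, E, tp))    # about 0.63*E defects found by t = tp

Note that the cumulative-discovery function is just the integral of λ(t), which is why about 63 percent (1 - 1/e) of the E defects are expected to have been found by t = tp.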

We can obtain an estimate for E in various ways. The most accurate is to fit actual defect discovery data, obtained during the development and testing process, to an assumed equation, say using the STEER model tool, the original version of which was developed at IBM Federal Systems. If both integration and testing are considered, the form of the time-based model of defect discovery is a Rayleigh curve (or another similar single-peak, or monomodal, curve of the Weibull family). An estimate for E would then be the area under the right-hand side of the monomodal curve from the time the software is delivered (or placed in operation) to infinity. Alternatively, an exponential fit can be made to the defect discovery data from the time of the peak forward. That is, the portion of the curve to the right of the peak can be approximated as a (decaying) exponential, which is the form given above for λ(t). The values of the parameters E and b (= 1/tp) for this fitted curve can then be used to estimate λ(t) in the vicinity of some time of interest, t = tn, at which it is desired to compute the software reliability or availability, or the unavailability of the system due to software-caused failures.
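As an illustration of the post-peak exponential fit (a sketch, not the STEER tool itself), the following fits d(t) = E*b*exp(-bt) to monthly defect-discovery counts taken from the peak forward, using SciPy's curve_fit. The counts and starting guesses are invented for illustration only.

import numpy as np
from scipy.optimize import curve_fit

# Assumed data: monthly defect-discovery counts, with t measured in months
# from the peak of the discovery curve.
months = np.arange(0, 10)
defects_per_month = np.array([30, 24, 18, 15, 11, 9, 7, 5, 4, 3], float)

def discovery_rate(t, E, b):
    """Post-peak discovery-rate model d(t) = E*b*exp(-b*t)."""
    return E * b * np.exp(-b * t)

(E_hat, b_hat), _ = curve_fit(discovery_rate, months, defects_per_month,
                              p0=[100.0, 0.2])   # rough starting guesses

tp_hat = 1.0 / b_hat
print(f"Estimated residual defects E ~ {E_hat:.0f}")
print(f"Estimated time constant tp ~ {tp_hat:.1f} months")

Here the fitted E is the area under the exponential from the fit origin (the peak) forward, i.e., the defects estimated to remain at that point; with t measured from that origin, E and b (= 1/tp) can be used directly in the λ(t) expression above.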

About the Author

John Gaffney is a Software Engineering Consultant at Lockheed Martin Mission Systems in Rockville, MD. He provides support to Lockheed Martin organizations in software and systems measurement. Earlier, he worked at the Software Productivity Consortium, where he started its measurement program. Previously, he worked at IBM. He has taught at Polytechnic University in Brooklyn, NY, and at Johns Hopkins University. He holds a BA from Harvard, an MS from Stevens Institute of Technology, and is a Registered Professional Engineer (Electrical) in the District of Columbia.

Author Contact Information

John Gaffney Jr.
Lockheed Martin
Mission Systems & Software Resource Ctr
9211 Corporate Boulevard
Rockville, MD 20850
Phone: (301) 640-2359
Fax: (301) 640-2429
[email protected]

