Software Reliability Engineering for Mass Market Products

Brendan Murphy and Mario R. Garzia
Microsoft Corporation

Microsoft has found the approach presented in this article to be effective in assessing the reliability of mass market software products such as Windows, both before and after product release. In addition, our approach speeds the failure detection and resolution process, ensuring that fixes are developed quickly and deployed to the user base, resulting in an ever-increasing level of reliability.

In today's environment the reliability of software systems is fundamental not just for corporations running large mission-critical servers but also for the knowledge worker at the office and even the consumer who stores family photos and does banking on a personal computer. Customers have ever-increasing expectations of product reliability, which requires not only a focus on producing reliable products but also the ability to quickly produce fixes that can be delivered to those customers.

Historically the computer industry used reliability modeling to predict future behaviour. Hardware failure rates dominated overall system reliability, and systems were run and managed in very controlled environments. By capturing failures that occurred during system test and feeding them into these models, it was possible to predict a product's future reliability with reasonable accuracy. As the complexity of systems and software increased, the causes of system failures changed. The operating system started to have an increasing impact on overall system reliability, as did other factors relating to the complexity of the system, such as Human Computer Interface (HCI) failures [2]. While companies continued to rely on software modeling, they also started to measure system behaviour at customer sites to gain a wider perspective on the actual failure rate of computers [3]. Over time, customer feedback indicated that measuring reliability in terms of system crashes did not fully match their perception of reliability.

Mass market products, such as the Windows operating system, have a number of unique characteristics. First, it is no longer possible to test all combinations of user configurations. For Windows XP there are currently 35,000+ available drivers, with each driver having over three different versions in the field, making the number of possible combinations of hardware and drivers, for all practical purposes, infinite. Additionally, it is virtually impossible to capture the usage profile of the product. For instance, system management procedures for Windows vary from zero management in the home all the way to servers in corporate datacenters with strict policies.

Therefore the question Microsoft faced in developing new products was whether to persist with reliability models as a reliability predictor or to move towards a measurement-based approach.

Product Reliability - Prediction and Measurements

A system's reliability is a measure of its ability to provide failure-free operation. For many practical situations this is represented as the failure rate. If the total number of failures for N installations in a time period T is F, then a good estimate for the failure rate of the software is [4] λ = F / (N * T). This approach to measuring failure rates has been widely used [1].
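
As a concrete illustration of this estimate, the following Python sketch computes λ and the implied mean time to failure; the failure count, installation count, and observation period are made-up values, not measured data.

    # Minimal sketch of the failure-rate estimate lambda = F / (N * T) [4].
    # The counts and time period below are illustrative values only.

    def failure_rate(failures, installations, period_hours):
        """Estimated failures per installation-hour of operation."""
        return failures / (installations * period_hours)

    # Example: 120 failures reported by 1,000 installations observed for 720 hours.
    lam = failure_rate(failures=120, installations=1000, period_hours=720.0)
    print("Estimated failure rate: %.2e failures per installation-hour" % lam)
    print("Implied mean time to failure: %.0f hours" % (1.0 / lam))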

Reliability growth models use data collected during testing to predict the future reliability of the product (a sketch of one such model follows the list below), assuming that:

  • The configuration of the test systems is representative of the user environment
  • Product usage and management does not impact reliability
  • Failures captured during system test are representative of customer failures
  • Failures occur once and are then corrected
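
The article does not tie itself to a particular growth model; purely as an illustration of how such models extrapolate test data, the sketch below fits the commonly used Goel-Okumoto model, whose mean value function is mu(t) = a(1 - exp(-b*t)), to hypothetical cumulative failure counts from system test and estimates the defects remaining.

    import math

    # Hypothetical cumulative failure counts at the end of each week of system test.
    weeks    = [1, 2, 3, 4, 5, 6, 7, 8]
    cum_fail = [12, 22, 30, 36, 41, 44, 46, 48]

    def mu(t, a, b):
        """Goel-Okumoto mean value function: expected failures found by time t."""
        return a * (1.0 - math.exp(-b * t))

    # Crude least-squares fit over a small parameter grid (sufficient for a sketch).
    best = None
    for a in range(40, 81):                              # candidate total defect counts
        for b in (i / 100.0 for i in range(1, 101)):     # candidate detection rates per week
            err = sum((mu(t, a, b) - y) ** 2 for t, y in zip(weeks, cum_fail))
            if best is None or err < best[0]:
                best = (err, a, b)

    _, a_hat, b_hat = best
    print("Fitted a = %d total defects, b = %.2f per week" % (a_hat, b_hat))
    print("Expected defects remaining after week 8: about %d" % (a_hat - cum_fail[-1]))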

We can test these assumptions against the two usage extremes of a mass market product such as Windows. Corporate IT departments typically have a standardized server environment, high-quality administrators, and experienced users. The opposite extreme is a home user running Windows on a machine bought from a white-box manufacturer.

Corporate customers traditionally pre-stage their systems. The computers use only a standard set of components, such as signed drivers. The systems' usage profile and management will still affect product reliability [2, 3], but a local call desk will assist users, thereby limiting the impact of any failures.

The home PC is increasingly becoming more complex than those found in corporate environments. These systems often consist of state-of-the-art CPUs, graphical processors, sound cards, modems and networking. Within the home they will be connected to a number of products, of different ages, such as cable modems, printers, cameras, MP3 players and other computers in the house. The home user's configuration is less likely to have been well tested, as the possible number of configurations is infinite. The quality of some of the drivers may be questionable, and these will not have been digitally signed by Microsoft. A signed driver indicates that it has been designed for Microsoft operating systems and has been tested for compatibility with Windows. Users are informed when they are running unsigned drivers but may have no option but to install the driver (for instance if their cable modem driver, provided by the ISP, is unsigned). The user may have little to no computer experience but may still be performing system management tasks, such as creating a network of computers in the home; this will inevitably increase the likelihood of HCI-induced failures (e.g. due to users attempting to configure systems in ways they were never designed for).

For systems in controlled corporate environments it is possible to build reliability models that are representative of their usage profiles, but this is much more difficult for systems in the home environment. For home systems a measurement-based approach is applied.

Problems in areas normally not addressed by reliability models, such as HCI, can be effectively addressed through measurement (identifying areas of complexity that result in incorrect system settings). HCI problems that are identified can be addressed using traditional HCI techniques [6]. To address faults in released products Microsoft has developed Windows Error Reporting (WER), which collects fault data from, and provides fault resolutions to, all users. The effectiveness of these processes in addressing overall product reliability is continually measured and improved.

Both modeling and measurement have advantages and disadvantages. Modeling provides a means of prediction, but without complete usage characterization it addresses only a subset of failures. For home users in particular there is a weakening relationship between actual and predicted failure rates. Applying measurement techniques allows companies to capture a complete picture of reliability, albeit later in the development cycle. Having chosen a measurement approach, the next issue to address is how to interpret the data collected through the measurement programs.

Characterizing Reliability of Software Systems

To measure the reliability of a software product it is essential to determine what constitutes a failure. Reliability is defined as the probability that the system operates without failure for a specified amount of time t, where a failure can be defined as a departure from requirements. A failure is traditionally viewed as having occurred when a system stops responding or working, e.g., the system crashed or hung. However, as already mentioned, there are many types of failures (departures from requirements) beyond hard failures. Customers may view a service disruption that does not result from a system failure as a system reliability event.

Based on extensive customer feedback, Microsoft takes a broad view of customer reliability requirements, whereby systems and software should operate without disruption. As such, system and software reliability also takes into account other types of disruptions, including planned events, e.g., system shutdowns required to install an application. Microsoft categorizes disruptions into six different classes of events:

  • Resilient: The system will continue to provide the user service in the face of internal or external disruptions.
  • Recoverable: Following disruption the system can be easily restored, through instrumentation and diagnosis, to a previously known state with no data loss.
  • Undisruptable: System changes and upgrades do not impact the service being provided by the system.
  • Performing: Provides accurate and timely service whenever needed.
  • Production Ready: On release the system contains a minimum number of bugs, requiring a limited number of predictable patches/fixes.
  • Predictable: It works as advertised, and is backward compatible.

As mentioned, a key aspect of mass market software products is the extensive variety of possible usage scenarios. For example, home users want their drivers (all 100,000+ possible versions!) to work with their specific systems but are not overly bothered if a configuration change requires a shutdown, as these are frequently occurring events (e.g., turning the machine off at night). For the data center IT system manager, dependent upon a limited set of well-tested drivers running across thousands of servers, any configuration change requiring a shutdown is very costly. The disruption classification above provides a framework for assigning weights based on specific user needs. Using the proper weights for each class of disruption we can arrive at a meaningful assessment of reliability for a specific customer scenario. With these scenario-specific weights, the reliability of a software system can then be defined as the 6-tuple

R = (R_Resilient, R_Recoverable, R_Undisruptable, R_Performing, R_ProductionReady, R_Predictable)

where we use common reliability metrics to define each member of the 6-tuple. For example, for the Resilient class of disruptions we define the reliability R_Resilient as

Mean Time To Resilience Disruption (MTTRD) = (sum of system uptimes) / (# of resilience disruptions)

Similar definitions can be developed for corresponding Availability and Downtime metrics. Any improvement in the individual metrics will result in a corresponding improvement in the reliability experienced by the user. These six measures of reliability can be aggregated into a single overall reliability number using the scenario-specific weights, while keeping the measurements separate allows us to identify where improvement might be needed. For instance, if we conclude that Resilient systems should function without crashes or hangs irrespective of hardware and software errors, then the subset of events {I_i} for the Resilient class of disruption would be {crashes, hangs} and R_Resilient is a measure of the time to a crash or hang.
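
As an illustration of how the 6-tuple and scenario weights might be combined, the following Python sketch computes per-class mean times to disruption and a single weighted figure; the uptime, disruption counts, and weights are hypothetical values, as the article does not publish concrete numbers.

    # Sketch of the 6-tuple reliability view with scenario weights. The class
    # names follow the article; all numeric values are made up for illustration.

    uptime_hours = 10000.0        # total observed uptime across the measured systems

    disruptions = {               # observed disruption counts per class (hypothetical)
        "Resilient":       4,     # crashes and hangs
        "Recoverable":     2,     # disruptions needing restore/diagnosis
        "Undisruptable":   6,     # shutdowns forced by changes or upgrades
        "Performing":      3,     # accuracy/timeliness failures
        "ProductionReady": 5,     # post-release patches/fixes required
        "Predictable":     1,     # compatibility ("works as advertised") failures
    }

    # Per-class mean time to disruption, e.g. MTTRD for the Resilient class.
    mean_time_to_disruption = {
        cls: uptime_hours / count for cls, count in disruptions.items()
    }

    # Hypothetical data-center weights (summing to 1) that penalise change-induced
    # shutdowns far more than a home-user scenario would.
    weights = {
        "Resilient": 0.30, "Recoverable": 0.15, "Undisruptable": 0.30,
        "Performing": 0.10, "ProductionReady": 0.10, "Predictable": 0.05,
    }

    # One possible aggregation: a weighted disruption rate, inverted to give a
    # single scenario-level mean time to disruption (higher is better).
    weighted_rate = sum(weights[c] * disruptions[c] / uptime_hours for c in disruptions)
    print("MTTRD (Resilient): %.0f hours" % mean_time_to_disruption["Resilient"])
    print("Weighted mean time to disruption: %.0f hours" % (1.0 / weighted_rate))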

This approach recognizes that a mass market software product like Windows does not have a single reliability figure; product reliability is a function of the customer usage scenario, and the level of reliability will likely vary from scenario to scenario. Additionally, with hundreds of millions of users it is necessary to develop multiple measurement approaches, depending on whether the goal is to collect usage profiles from a targeted set of users or failures occurring on all user systems.

Breadth and Depth Measurement Approach

Once reliability goals for specific customer scenarios have been defined it is necessary to implement an effective reliability measurement approach. In the past, failures reported by users to a product support organization have been used [1] to evaluate a product's reliability. But it is well known that customers do not report all the problems they encounter, especially when they solve them themselves. This non-reporting is far more pronounced in mass-market products than in software products with on-site service contracts. The reports received are likely to be limited to hard failures (as these may require a support call to resolve).

The best approach for collecting failure data is through automated system reporting based on product instrumentation and triggers. When collecting data from software products with a user base in the hundreds of millions, scalability becomes a major issue. Our approach is to collect data from both a breadth and a depth perspective. We take a broad sample of data to ensure we are covering the entire population base and the variety of possible issues. This data is not detailed, but it does allow us to scale to large numbers of reporting customers. We then focus on getting detailed data from a small subset of those users. The broad data is used to assess the cost/probability of disruptions for the user base, and the depth data is targeted at identifying the root cause of disruptions for use in product improvement. For privacy reasons, the customer decides if and what data they will share.

For the Windows product we have automated reliability measurement mechanisms including WER, which focuses on crashes and hangs, and the Microsoft Reliability Analysis Service (MRAS) [7], which focuses on reliability and availability tracking of Windows servers and of products running on servers such as the MS SQL database, the MS IIS web server, the MS Exchange mail server, and the Windows Active Directory. WER collects high-level information on each crash/hang reported to Microsoft but in general obtains a dump from only a very small number of the crashes submitted, which is sufficient to identify the cause of the problem. Aside from providing Microsoft with data for product improvement, both of these processes also provide customer feedback. Table 1 shows the information collected and the customer feedback provided in each case.

Table 1. Information collected by Microsoft and feedback provided to customers.

  Data Collection | Microsoft Data            | Customer Feedback
  ----------------|---------------------------|------------------------------------------
  WER             | crash & hang dumps        | available fixes for crashes & hangs
  MRAS            | disruption times & causes | reliability metrics & disruption reasons

WER is available to all Windows XP and Windows Server 2003 users; MRAS has been deployed to over 200 corporate customers and is being used extensively within Microsoft to collect data from thousands of servers. Both WER and MRAS have been instrumental in assessing the reliability of beta versions of Windows Server 2003 at both Microsoft and customer sites and are being used for ongoing reliability tracking and new product version evaluation [5].

Acknowledgement

This paper is derived from a prior paper [5], published at the ISSRE 2004 conference at Saint-Malo. The authors would like to acknowledge the work of Ben Errez and Pankaj Jalote who were joint authors of this prior paper.

References

1. R. Chillarege, S. Biyani, J. Rosenthal, "Measurement of failure rate in widely distributed software", Proc. 25th Fault Tolerant Computing Symposium, FTCS-25, 1995, pp. 424-433.

2. J. Gray, "A census of Tandem system availability between 1985 and 1990", IEEE Transactions on Reliability, Vol 39:4, Oct 1990, pp. 409-418.

3. B. Murphy, T. Gent, "Measuring system and software reliability using an automated data collection process", Quality and Reliability Engineering International, 1995.

4. K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Second Edition, John Wiley and Sons, 2002.

5. P. Jalote, B. Murphy, M. R. Garzia and B. Errez. "Measuring Reliability of Software Products". ISSRE 2004 Conference. Saint-Malo, Bretagne, France, 2004.

6. A general list of usability papers and techniques can be found at http://www.microsoft.com/usability/publications.htm

7. M. R. Garzia, "Assessing the reliability of Windows servers", Proc. Conference on Dependable Systems and Networks (DSN), San Francisco, 2003.

About the Authors

Mario Garzia is Director of Windows Reliability at Microsoft. Prior to joining Microsoft in 1997, he was a Distinguished Member of Technical Staff at AT&T Bell Laboratories working in the areas of telecommunication and computer system and service performance and reliability. He is the co-author of the book Network Modeling, Simulation and Analysis and has published over 40 technical papers in the areas of modeling, performance and reliability in refereed journals and conference proceedings. Garzia holds a Ph.D. in Mathematical Systems Theory from Case Western Reserve University; his M.S. and B.S. are in Mathematics. He is an IEEE Senior Member.

Brendan Murphy is a researcher in system dependability at Microsoft Research in Cambridge. His research interests include analyzing the relationship between software development techniques and subsequent product failures at customer sites. Prior to joining Microsoft he ran the reliability group at Digital that monitored customer systems in the field. Brendan graduated from Newcastle University.
