Survivability as a Component of Software Metrics

David L. Wells, Object Services and Consulting, Inc., and David E. Langworthy, Langworthy Associates


Introduction

Software metrics provide estimates of software quality that are used to decide where to spend additional development and testing resources or to determine the suitability of software for particular (often critical) applications. Metrics have traditionally focused on code quality. However, the trend toward constructing large, distributed applications as collections of independent “services” interacting across a software backplane (e.g., CORBA) makes the process of configuring the application an important part of the development process. This affects the kinds of software metrics required, since perfect software that is imperfectly deployed, or deployed in a way that leaves it vulnerable to failure or attack, is of no more value than imperfect software that fails of its own accord. This paper describes metrics we developed [5] for measuring the survivability of software systems; these metrics can be applied to the more general realm of software metrics.


The Importance of Configurations

Service-based applications can have many physical configurations that provide the same (or approximately the same) logical functionality using identical services. Multiple configurations are possible because the same services can be replicated, deployed, and interconnected in different ways across the underlying infrastructure.

A truly useful metric for distributed, service-based software must measure both the quality of the software itself (the traditional role) and the quality of its configuration vis-à-vis the underlying infrastructure and the kinds of threats to which the software and infrastructure are subject. In the real world, systems can fail for a variety of reasons other than code and specification errors (e.g., a virus might corrupt the file system that the software relies upon). Thus, rather than asking simply whether the specification and code are correct, it is necessary to ask how likely it is that the system will continue to provide the desired functionality or, failing this, something approaching it. A survivable system [1,2] is one in which actions can be taken to reconfigure applications in the event of partial failures so as to achieve functionality approximating that of the original system. The usefulness of a survivable system can be judged in several ways: how useful is what it is doing now? how useful is it likely to be in the future? and, if it breaks, can it be repaired so that it can again do something useful?


Overview of Utility Theory

Utility theory is the study of decision making under risk and uncertainty among large groups of participants with differing goals and preferences [4]. A participant has direct control over the decisions it makes, but those decisions are only indirectly linked to their outcomes, which also depend on the decisions of other participants and on random chance.

Utility can be used to quantify the goodness of states and actions in a survivable system. System states can be compared using utility measures to determine which are preferred and, as a result, which survival actions should be taken in an attempt to move the system to a better state or to avoid worse states. A key aspect of measuring the utility of a system state or administrative action is that utility depends on both the services that are currently running and the future configurations that can be reached. Future configurations must be considered to differentiate a rigid configuration that offers good current performance from a flexible configuration that offers slightly lower current performance but is more resilient to faults and more likely to continue offering good performance. A balance must be struck between present and future performance. For example, for most systems the potential configurations a year in the future are not nearly as important as the configurations the system could reach during the next 12 hours.


Applying Utility Theory to Software Metrics

Every client receives a benefit from every service it uses, expressed as a utility function, U, that maps a description of the service being provided to a value received. The service to be received can be described in many ways, including using quality of service (QoS) concepts such as timeliness, precision, and accuracy of the results to be provided. Further, utility itself can have multiple definitions, depending on the overall goals to be achieved. For example, one utility function could value maximizing the work performed, another utility function could value minimizing the likelihood that the level of service provided falls below some threshold, and a third utility function could value minimizing the probability that information is divulged to an opponent. All are equally valid, and depending upon circumstances could in turn be valued to different degrees. This would result in a combined utility function that is some aggregation of the underlying utility functions.
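For concreteness, one natural aggregation (a sketch on our part; the framework does not mandate a particular form) is a weighted sum of the underlying utility functions:

U_{combined}(c) = \sum_{k} w_k \, U_k(c), \quad w_k \geq 0,

where each weight w_k expresses how highly the corresponding goal (maximizing work, maintaining a service threshold, protecting information) is valued under the current circumstances.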

The benefit a client receives from a service is accrued only if the service completes its task; i.e., an instantaneous, ephemeral connection to a service provides no value. Thus, every benefit function must include a duration over which the service must be provided in order to attain the specified benefit. Our analysis restricts the duration to fixed-size discrete time intervals; a client receives the benefit only if the service is still being provided at the end of the interval. We define the utility of a configuration, U(c), to be the aggregation across all clients in a configuration of the value of the services they receive. Because there can be multiple utility functions, we differentiate between them using subscripts when necessary, e.g., Uwork(c).
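As a sketch (our notation, assuming the simple additive aggregation used in the examples below), if s_j denotes the service description delivered to client j during the interval and U_j is that client's benefit function, then

U(c) = \sum_{j \in \mathit{clients}(c)} U_j(s_j),

where clients(c) is the set of clients being served in configuration c.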

A configuration provides utility only for the tasks it completes. Since a system that begins a time interval in some configuration c may end it in some other configuration that provides a possibly different utility, a more useful measure is the expected utility of a configuration c, EU(c). EU(c) measures the benefit of the collection of potential configurations, C, that can be reached from c in one time interval. It is the probability-weighted sum of the utilities of the individual configurations that can be reached, where the probability function P(ci) gives the probability that ci is the configuration actually instantiated out of all the configurations in the set.
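Written out, with C the set of configurations reachable from c within one time interval, the expected utility is the probability-weighted sum

EU(c) = \sum_{c_i \in C} P(c_i) \, U(c_i), \qquad \sum_{c_i \in C} P(c_i) = 1.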


Expected utility allows us to compute the benefit expected from a configuration even after considering the near-term negative events that can cause the configuration to degrade. A second utility measure allows us to consider longer-term changes to the system and to incorporate the ability to perform beneficial administrative transformations. We call this net utility, NU(c). Net utility captures the fact that the long-term desirability of a configuration depends on both the services that are currently running and the future configurations that can be reached. Net utility is thus a sum of future expected utilities. In general, not all time periods are of equal importance; the near-term behavior of a system is usually valued more highly than behavior far into the future. To handle this, we introduce a discount function, D(t), which maps from time to an appropriate weighting factor. The discount function is analogous to net present value in finance.
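One way to write this (a sketch consistent with the description above; the exact closed form is not fixed here) is the discounted sum

NU(c) = \sum_{t \geq 1} D(t) \, EU_t(c),

where EU_t(c) is the expected utility of the configurations reachable from c after t time intervals and D(t) decreases as t grows.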


The use of a discount factor has an additional benefit, since it allows us to discount far-future states for computational as well as policy reasons. This has a practical advantage: as one projects the configuration space further into the future, the computations rapidly become more expensive (due to state explosion) and the results rapidly become less precise (due to imprecise estimates of event probabilities). The benevolent myopia introduced by the discount factor allows us to ignore incomputable or dubious future states.


Utility Metrics

The meaning and power of the metrics defined above vary greatly depending on the precise definition of the base utility function U(c). As noted, the base utility function measures what is valued most highly. We introduce two very different utility metrics.

Utility of Value is based on a measure of aggregate performance. This work developed from a market-based, distributed resource-allocation prototype. The goal of the market was to maximize the aggregate value of all the services provided by the system. End users or administrators would assign values to services, and the resources, both hardware and software, would compete to offer the best service at the lowest cost. The resources' goal was to accumulate profits, which would be gathered by the owners of the resources and allocated back to end users and administrators, closing the loop. If users value a service highly, it will replicate itself to ensure that it is highly available. If resources are removed from the system, prices will rise and only the more valued services will obtain resources; if resources are added, prices will fall and lower-priority services will run. Utility of Value implements a simple microeconomic model that tends toward Pareto optimality, a local optimality criterion. If the Net Utility of Value is maximized, then the future performance of the system will be maximized. There are many possible definitions of survivability, but a relatively straightforward one is that the system continues to offer good performance into the future.
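In symbols (a sketch using our own notation), if v(s) is the value that end users or administrators assign to service s and S(c) is the set of services that complete their tasks in configuration c, then

U_{value}(c) = \sum_{s \in S(c)} v(s),

which is exactly the additive value function used in the replica-balancing example below.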

Utility of Operation is based on a binary measure of whether the system meets some minimal level of operation over a given interval. This gives rise to a very different notion of survivability. Using this measure, EU(c) is itself a probability: the probability that the system is operational. Maximizing the Net Utility of Operation minimizes the possibility of a catastrophic failure in the future, possibly at the cost of optimal average-case performance. This is arguably a better survivability metric than the Net Utility of Value, since the purpose of survivability is to avoid catastrophic failures. The two could be used in conjunction, so that after a minimal level of service is guaranteed, performance is optimized for the normal case.
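In symbols (again a sketch), the base utility is an indicator function:

U_{op}(c) = 1 if c meets the minimal level of operation over the interval, and 0 otherwise,

so that EU_{op}(c) = \sum_{c_i \in C} P(c_i) \, U_{op}(c_i) is simply the probability that the system is still operational at the end of the interval.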

Examples:

Replica Balancing

There are two services, A and B, and six hosts, 1 - 6. Each service can be replicated, and each replica requires an entire host. There is a 10% probability that any given host fails during a period, so the probability of success of a service with n replicas is 1 - 0.1^n. In the initial configuration, C1, each service has three replicas: C1 = A{1, 2, 3}; B{4, 5, 6}. At step 2, B loses two replicas, so C2 = A{1, 2, 3}; B{4}. The third configuration, C3, is the result of a possibly automatic administrative action that reassigns one of A's backups to provide a single backup for B: C3 = A{1, 2}; B{4, 3}. This last transition is voluntary; the administrator or survivability service would take whatever action seemed best.

Table 1 calculates the expected utility for each configuration in the example. A bar over a service label in the State column indicates that the service is not operational at the end of the period. The second column is the value of the configuration; the aggregation function is simple addition, so if both A and B are operational the value of the configuration is 2000. The P(i) and E(i) columns show the calculation of the expected utility of Ci, EU(Ci), which appears in the last row of the table.

Table 1. Expected Utility


In C1 everything is running fine: out of a possible value of 2000, the expected utility is 1998, almost perfect. After the failures, the expected utility drops to 1899 because of the uncertainty that B will complete. C3 reflects the administrative action of taking a replica from A and giving it to B. This increases the expected utility to 1980, a dramatic improvement considering that no resources were added.
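These numbers can be reproduced with a short calculation. The following Python sketch is our own illustration (not code from the original prototype); it relies on the fact that, with an additive value function and independent host failures, the probability-weighted sum over joint outcomes reduces to a per-service sum.

# Replica-balancing example: each service is worth 1000, and each host
# fails independently with probability 0.1 per period, so a service with
# n replicas completes with probability 1 - 0.1**n.

HOST_FAIL = 0.1
SERVICE_VALUE = 1000.0

def p_complete(n_replicas):
    """Probability that a service with n replicas is still running at period end."""
    return 1.0 - HOST_FAIL ** n_replicas

def expected_utility(replica_counts):
    """EU(c): value of each service weighted by its completion probability."""
    return sum(SERVICE_VALUE * p_complete(n) for n in replica_counts.values())

configs = {
    "C1": {"A": 3, "B": 3},   # A{1,2,3}; B{4,5,6}
    "C2": {"A": 3, "B": 1},   # B has lost two replicas
    "C3": {"A": 2, "B": 2},   # one of A's replicas reassigned to B
}

for name, counts in configs.items():
    print(name, round(expected_utility(counts)))   # 1998, 1899, 1980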

Utility of Value vs. Utility of Operation

The following illustrates the difference between Utility of Value (which optimizes for performance) and Utility of Operation (which optimizes for stability). Service A now has two levels of operation, high and low. The high level offers a value of 2000 and requires three hosts to run. The low level constitutes the minimal level of operation; it offers a value of 1000 but requires only one host to run. If the high level of service cannot be maintained, A automatically drops to the low level of service. In the example, A starts out at the high level of QoS. If A loses a host, it drops to the low level of QoS with one replica. The probability that A completes the period at the high level is the probability that all three hosts complete. The probability that A completes the period at the low level is the probability that any single host completes minus the probability that A completes at the high level. There are now six possible outcomes. B is still worth 1000, so if A completes at the high level along with B, the value is 3000.

Table 2 calculates the Utility of Value. In the initial configuration all hosts are operational and the expected Utility of Value is nearly optimal at 2739. After the failures, the expected value drops by about 150, reflecting B's instability. C3 evaluates the administrative action of removing a host from A to increase B's stability. In this case, the action does not appear to be desirable and would not be taken: removing a host from A would cause it to drop from the high level of QoS to the low level of QoS, at a cost of nearly 1000.

Table 2. Utility of Value


A Utility of Value metric maximizes perceived performance, and maintaining A at a high level of QoS is consistent with this goal. However, the survivability of the system is sacrificed by this choice, as Table 3, which uses Utility of Operation, shows.

Table 3. Utility of Operation


In the initial state all hosts are operational and A is operating at the high level. After the failures, B is reduced to one replica and the expected Utility of Operation drops to 0.8991. A is still operating at the high level, but this is not reflected in the binary operational metric. Step 3 reflects the administrative action of taking a host from A. This causes A to drop from the high level to the low level and increases the stability of B. As a result, the expected operational utility increases to 0.9801.
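These figures can be checked with the same style of calculation as before. The Python sketch below is again our own illustration; it assumes, consistent with the numbers in the text, that the system counts as operational only if every service retains at least one live replica at the end of the period.

# Utility of Operation: EU is the probability that all services remain
# minimally operational, with hosts failing independently at rate 0.1.

HOST_FAIL = 0.1

def p_service_up(n_replicas):
    """Probability that a service with n replicas keeps at least one live host."""
    return 1.0 - HOST_FAIL ** n_replicas

def expected_utility_of_operation(replica_counts):
    """Probability that every service is still operational at period end."""
    p = 1.0
    for n in replica_counts.values():
        p *= p_service_up(n)
    return p

print(round(expected_utility_of_operation({"A": 3, "B": 1}), 4))  # C2: 0.8991
print(round(expected_utility_of_operation({"A": 2, "B": 2}), 4))  # C3: 0.9801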


Conclusions

The metrics presented allow measurement of the useful work that is likely to be done by software as actually deployed and subject to the various kinds of attacks and failures that exist in the real world. These metrics can be combined with more traditional software metrics that measure the likelihood of failure due to software or specification errors to produce a combined metric that measures both the quality of the code and its expected long-term behavior in a realistic environment.

About the Authors

David Wells is Vice President of Object Services and Consulting, Inc. and the head of software research. Wells received his D. Eng. degree in Computer Science from the University of Wisconsin-Milwaukee in 1980. He was an Assistant Professor in the Computer Science Department at Southern Methodist University from 1980 to 1986, where he conducted research in databases, computer security, and computer graphics.

Dr. Wells was the Principal Investigator on the DARPA/ITO project Survivability in Object Services Architectures. He was previously PI on the DARPA funded Open OODB and Open OODB II projects at Texas Instruments where he was the principal architect of a modular object-oriented database that seamlessly added persistence to programming objects. Those projects produced many of the ideas of flexible service binding used in this software survivability work. Wells has also done work in cryptography for databases and risk assessment. Wells holds 5 patents and has published over 20 technical articles in journals and conferences.

David E. Langworthy performs experimental computer science research in both academic and industrial contexts. He has ten years' experience in the design of scalable distributed systems with a focus on object-oriented database technology. Langworthy received his PhD from Brown University in May 1995.

While completing his PhD, Langworthy was a consultant at Microsoft, where he designed the information retrieval system for the Microsoft Network. This system scaled to thousands of queries per second using parallel arrays of NT servers, and the work resulted in fundamentally new technology for combined query evaluation. Other accomplishments include teaching a course in Object-Oriented Analysis and Design, developing courseware, and consulting for Semaphore and the Trilogy Development Group.

Author Contact Information

David L. Wells
Object Services and Consulting, Inc.
Baltimore, MD
Phone: (410) 318-8938
Fax: (410) 318-8948
E-mail: [email protected]


David E. Langworthy
Langworthy Associates
E-mail: [email protected]

