Software Rejuvenation and Self-healing
Lawrence Bernstein, Industry Research Professor, Computer Science Department
Chandra M. R. Kintala, Distinguished Professor, Electrical and Computer Engineering Department
Stevens Institute of Technology, Hoboken, NJ 07030
Software rejuvenation is a periodic, pre-emptive restart of a running system to prevent future failures. It is one aspect of a self-healing system. It was first introduced, implemented, modeled, and analyzed in [1]. It is used in systems ranging from a data collector that most US telephone companies use to collect billing information to NASA’s long-duration space mission to Pluto [2]. It is also implemented in IBM’s Netfinity resource manager [3]. The billing system failures described in [1], and the use of software rejuvenation to prevent them, are quite similar to the failures and the fix that Nick van der Zweep described in the Computerworld article (QuickLink Ref# 43636) dated January 12, 2004.
Software rejuvenation incurs overhead, so modeling to find the optimal times to rejuvenate is crucial. A simple and useful model based on continuous-time Markov chains was introduced in [1] to analyze the reliability improvements due to software rejuvenation; the model is also useful for finding optimal trigger rates for rejuvenation. This model was then extended using stochastic Petri nets to study rejuvenation using the fail-over mechanisms in IBM’s cluster-based systems [4]. X2000, for NASA’s 12-year Pluto-Kuiper Express mission, used software rejuvenation to perform simultaneous on-board preventive maintenance of software and hardware components during the cruise and exploration phases. Reliability analysis showed an improvement of two orders of magnitude due to software rejuvenation [2]; the optimal rejuvenation interval was found to be 31.2 weeks in the 12-year cruise phase. A recent paper [5] described software rejuvenation in web servers and how it can be analyzed to determine the optimal rejuvenation interval.
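To make the Markov-chain idea concrete, the following is a minimal sketch of a four-state continuous-time chain (robust, failure-probable, failed, rejuvenating) solved for its steady state. The state structure is an assumption about the general shape of such models, not a reproduction of the analysis in [1], and all function names, parameter names, and rates are hypothetical values chosen only for illustration.

```python
def steady_state(aging, failure, repair, trigger, finish):
    """Steady-state probabilities of a four-state rejuvenation chain.

    Rates (per hour, all illustrative assumptions):
      aging   : robust -> failure-probable
      failure : failure-probable -> failed
      repair  : failed -> robust
      trigger : failure-probable -> rejuvenating (0 disables rejuvenation)
      finish  : rejuvenating -> robust
    Returns (p_robust, p_prone, p_failed, p_rejuvenating).
    """
    # Balance equations, each expressed relative to p_prone:
    #   p_robust * aging  = p_prone * (failure + trigger)
    #   p_failed * repair = p_prone * failure
    #   p_rejuv  * finish = p_prone * trigger
    ratio_robust = (failure + trigger) / aging
    ratio_failed = failure / repair
    ratio_rejuv = trigger / finish
    p_prone = 1.0 / (1.0 + ratio_robust + ratio_failed + ratio_rejuv)
    return (p_prone * ratio_robust, p_prone,
            p_prone * ratio_failed, p_prone * ratio_rejuv)


def downtime_fraction(aging, failure, repair, trigger, finish):
    """Long-run fraction of time spent failed or rejuvenating."""
    _, _, p_failed, p_rejuv = steady_state(aging, failure, repair, trigger, finish)
    return p_failed + p_rejuv


if __name__ == "__main__":
    # Hypothetical rates: ages in ~10 days, fails ~30 days later,
    # 30-minute repair, 5-minute rejuvenation.
    base = dict(aging=1 / 240, failure=1 / 720, repair=2.0, finish=12.0)
    for trigger in (0.0, 1 / 336, 1 / 168):
        print(trigger, downtime_fraction(trigger=trigger, **base))
```

With these made-up rates the downtime fraction drops when rejuvenation is enabled; with a slower rejuvenation (for example `finish=6.0`) it rises instead, which illustrates why the trigger rate has to be modeled rather than guessed.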
Recent experiments at Stevens Institute of Technology showed that data link protocols suffering from memory leak failures could be made reliable using rejuvenation libraries, without having to fix the memory leak bug [6]. In essence, rejuvenation bounds the execution space of the working software so that latent failure modes are not executed. Had this technology been used in the Patriot missile system during the first Iraq war, the counter overflow problem that caused the anti-Scud system to fail would not have occurred. The need for this technology was first identified during field tests of the earlier Safeguard anti-missile system. It was then applied to avoid hash table problems in a data switch.
Since the 1960s, data communication designers have known to have software modules restart a line when it hangs. Rejuvenation technology restarts a line before it hangs, avoiding potential secondary problems. It is a low-cost, easy-to-implement technology that makes systems more trustworthy.
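The idea of restarting before the hang can be sketched as follows. This is a toy illustration under stated assumptions, not the rejuvenation library used in [6]: `LeakyLink` is a hypothetical stand-in for a protocol handler that leaks one buffer per send, and `RejuvenatedLink` is a hypothetical wrapper that replaces it with a fresh instance after a bounded number of sends, so the leak never reaches the point of failure.

```python
class LeakyLink:
    """Hypothetical link handler with a latent fault: it leaks a buffer per send."""

    CAPACITY = 100  # buffers available before the link hangs (illustrative)

    def __init__(self):
        self.leaked = 0

    def send(self, frame):
        if self.leaked >= self.CAPACITY:
            raise RuntimeError("link hung: out of buffers")
        self.leaked += 1  # the bug: this buffer is never released
        return len(frame)


class RejuvenatedLink:
    """Restarts the wrapped link after max_sends sends, before it can hang."""

    def __init__(self, factory, max_sends):
        self._factory = factory      # callable that builds a fresh link
        self._max_sends = max_sends  # bound on work done between restarts
        self._link = factory()
        self._sends = 0
        self.restarts = 0

    def send(self, frame):
        if self._sends >= self._max_sends:
            # Pre-emptive restart: discard the aged state, start clean.
            self._link = self._factory()
            self._sends = 0
            self.restarts += 1
        self._sends += 1
        return self._link.send(frame)


if __name__ == "__main__":
    link = RejuvenatedLink(LeakyLink, max_sends=50)
    for _ in range(1000):
        link.send(b"frame")  # never raises: state is reset every 50 sends
    print("restarts:", link.restarts)
```

The bound of 50 sends is well inside the assumed capacity of 100, so the faulty code path is never reached even though the leak itself is left unfixed. A real system would also need to preserve or re-establish any session state across the restart, which this sketch ignores.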
Software rejuvenation is one aspect of self-healing. Interesting new problems in the study of rejuvenation for large-scale systems are:
[1] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software Rejuvenation: Analysis, Module and Applications,” in Proc. of 25th Symposium on Fault Tolerant Computing, FTCS-25, pp. 381–390, Pasadena, California, June 1995.
[2] A. T. Tai, L. Alkalai, and S. N. Chau, “On-Board Preventive Maintenance: A Design-Oriented Analytic Study for Long-Life Applications,” in Performance Evaluation, Vol. 35, No. 3-4, pp. 215–232, June 1999.
[3] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert, “Proactive Management of Software Aging,” in IBM Journal of Research & Development, Vol. 45, No. 2, March 2001.
[4] K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi, “Analysis and Implementation of Software Rejuvenation in Cluster Systems,” in Proc. of the Joint Intl. Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, June 2001.
[5] Y. Bao, X. Sun, and K. Trivedi, “Adaptive Software Rejuvenation: Degradation Models and Rejuvenation Schemes,” in Proc. of the International Conference on Dependable Systems and Networks, DSN-2003, June 2003.
[6] L. Bernstein, Y.-D. Yao, and K. Yao, “Software Avoiding Failures Even When there are Faults,” The DoD SoftwareTech News, Vol. 6, No. 2, pp. 8–11, October 2003, www.softwaretech-news.com, http://iac.dtic.mil/dacs