>> Recovery-Oriented Computing (ROC)

One of the most recent projects in fault tolerance is Recovery-Oriented Computing (ROC), a joint research effort between Stanford University and UC Berkeley.
This project makes the following three basic assumptions:
  • failure rates of both software and hardware are non-negligible and continue to increase
  • systems cannot be completely modeled for reliability analysis, so their failure models cannot be predicted in advance
  • human error by system operators and during system maintenance is a major source of system failures
A widely accepted equation for system availability is A = MTTF / (MTTF + MTTR), where MTTF is Mean Time to Failure and MTTR is Mean Time to Repair. The goal is to approach A = 1.0, and historically the approach has been to push MTTF toward infinity. ROC instead focuses on decreasing MTTR until it is much smaller than MTTF. Several reasons lie behind this strategy: as noted above, human error is inevitable, and MTTR can be measured directly, whereas MTTF can range to 120 years and is therefore impractical to measure. Moreover, lowering application-level MTTR directly improves the user experience: for a site such as eBay, several brief outages of a few minutes spread over a long period are far less disruptive than a single outage of several hours in one day. Finally, frequent "recovery" may lengthen the effective MTTF.
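
To make the arithmetic concrete, the short sketch below (with made-up MTTF/MTTR figures) evaluates the availability equation and shows that cutting MTTR by a factor of ten improves availability just as much as multiplying MTTF by ten:

    # A minimal sketch of the availability equation A = MTTF / (MTTF + MTTR).
    # The MTTF/MTTR figures below are made up and only illustrate the trade-off.

    def availability(mttf_hours: float, mttr_hours: float) -> float:
        """Fraction of time the system is up."""
        return mttf_hours / (mttf_hours + mttr_hours)

    baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)            # ~0.9901
    ten_x_mttf = availability(mttf_hours=10000.0, mttr_hours=10.0)         # ~0.9990
    ten_x_faster_repair = availability(mttf_hours=1000.0, mttr_hours=1.0)  # ~0.9990

    print(baseline, ten_x_mttf, ten_x_faster_repair)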

Research in ROC is still in progress, but the following areas cover the most important characteristics of a truly recovery-oriented computing system:




Isolation and Redundancy

Isolation is an important quality for a recovery-oriented computing system: it is necessary for fault containment and safe online recovery, and it also enables the diagnostic and verification techniques described below. With isolation comes the need for redundancy, so that the system can continue to function and deliver its services while parts of it are isolated. In keeping with the ROC philosophy, however, the isolation and redundancy mechanisms themselves must be designed to withstand failure.
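
As a rough illustration of how isolation and redundancy work together, the hypothetical sketch below (all class and component names are invented) takes a suspect replica out of rotation while a redundant replica keeps serving requests:

    # Hypothetical sketch: a pool of redundant replicas in which any replica can
    # be isolated (taken out of rotation) while the rest keep serving.

    class Replica:
        def __init__(self, name: str):
            self.name = name
            self.isolated = False   # True while the replica is quarantined for recovery

        def handle(self, request: str) -> str:
            return f"{self.name} handled {request}"

    class RedundantPool:
        def __init__(self, replicas):
            self.replicas = list(replicas)

        def isolate(self, name: str) -> None:
            # Fault containment: remove a suspect replica from rotation.
            for r in self.replicas:
                if r.name == name:
                    r.isolated = True

        def restore(self, name: str) -> None:
            for r in self.replicas:
                if r.name == name:
                    r.isolated = False

        def dispatch(self, request: str) -> str:
            # Redundancy keeps the service available while a replica is isolated.
            for r in self.replicas:
                if not r.isolated:
                    return r.handle(request)
            raise RuntimeError("no healthy replica available")

    pool = RedundantPool([Replica("replica-a"), Replica("replica-b")])
    pool.isolate("replica-a")              # quarantine a suspect component
    print(pool.dispatch("GET /orders"))    # replica-b keeps serving
    pool.restore("replica-a")              # bring it back after recovery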




System-wide Support for Undo

The "Undo" option is available in most software applications, such as Microsoft Word, in order to account for human error. However, very few undo facilities are available for system maintenance, so system operators are expected to flawlessly perform complex tasks that have great impact on a system. Moreover, the lack of this option prevents the effective diagnosis and learning process of trial-and-error investigation.

This area has been researched in detail in the form of an undoable e-mail system, as described in a paper written by Aaron B. Brown and David A. Patterson of UC Berkeley, entitled "Undo for Operators: Building an Undoable E-Mail Store." The model for Operator Undo is based on the "Three R's," which are Rewind, Repair, and Replay. In Rewind, the system state is physically rolled back in time; in Repair, the operator fixes the rolled-back system to keep the problem from reoccurring; and in Replay, the repaired system is rolled forward to the present by replaying parts of the previously-rewound timeline in the repaired system.
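
The toy sketch below illustrates the Three R's flow in miniature; it is not the mechanism of the Brown and Patterson e-mail store, and all names in it are invented. A saved snapshot supplies Rewind, an operator-supplied fix supplies Repair, and a log of user-visible operations supplies Replay:

    # Toy sketch of the "Three R's" (Rewind, Repair, Replay); hypothetical, not
    # the mechanism used in the Brown/Patterson undoable e-mail store.
    import copy

    class UndoableStore:
        def __init__(self):
            self.state = {}        # current system state (e.g., mailboxes)
            self.snapshots = []    # (state copy, log length) pairs for Rewind
            self.log = []          # timeline of user-visible operations for Replay

        def snapshot(self):
            self.snapshots.append((copy.deepcopy(self.state), len(self.log)))

        def apply(self, op):
            # Normal operation: execute and record a user-visible action.
            op(self.state)
            self.log.append(op)

        def three_rs(self, snapshot_index, repair):
            # Rewind: physically roll state back to an earlier point in time.
            saved_state, log_len = self.snapshots[snapshot_index]
            self.state = copy.deepcopy(saved_state)
            replay_ops, self.log = self.log[log_len:], self.log[:log_len]
            # Repair: the operator fixes the rolled-back system.
            repair(self.state)
            # Replay: roll the repaired system forward by re-running the rewound timeline.
            for op in replay_ops:
                self.apply(op)

    store = UndoableStore()
    store.snapshot()                                   # a point the operator can rewind to
    store.apply(lambda s: s.setdefault("inbox", []).append("msg1"))
    # ... the operator later discovers a misconfiguration, rewinds, repairs, replays:
    store.three_rs(0, repair=lambda s: s.update(spam_filter="fixed"))
    print(store.state)   # {'spam_filter': 'fixed', 'inbox': ['msg1']}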

The main objection to undoable systems is their cost: they take time to develop, slow down system performance, and consume a significant amount of storage. In response to this criticism, studies show that most operator errors are noticed within seconds to minutes of being made, so even a relatively short undo window has high potential for correcting faults. How far back events should be undone, however, is still an area of much study.




Integrated Diagnostic Support

For any form of ROC to be effective, one of the most important tasks a system must perform is swiftly identifying the cause of an error and isolating and repairing the components that have been affected. Latent errors that may later lead to widespread failure must also be identified and dealt with, mostly through systematic testing. Research in diagnostic support therefore focuses on the design of testing interfaces and frameworks for system components, methods for software verification, and of course root-cause analysis algorithms.

One particular area of focus for the Stanford/Berkeley ROC group is helping operators track down errors more quickly. In existing high-dependability computer systems, programmers construct a failure-analysis tree in which every software and hardware element is carefully recorded, so that the tree can trace and depict every possible way the system can fail. However, with the widespread popularity of heterogeneous systems such as the Internet, which combine components from multiple vendors, the problem of finding and isolating errors becomes much more complex.
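
For readers unfamiliar with failure-analysis trees, the small sketch below shows the general idea with an invented example: leaves are individual component failures, and AND/OR gates combine them into system-level failure conditions:

    # Minimal sketch of a failure-analysis (fault) tree; the example system and
    # component names are invented.

    def evaluate(node, failed_components):
        """Return True if this subtree's failure condition holds."""
        if isinstance(node, str):                       # leaf: a single component
            return node in failed_components
        gate, children = node
        results = [evaluate(c, failed_components) for c in children]
        return all(results) if gate == "AND" else any(results)

    # "The system fails if the disk fails, or if both power supplies fail."
    system_failure = ("OR", ["disk", ("AND", ["psu-1", "psu-2"])])

    print(evaluate(system_failure, {"psu-1"}))            # False: one PSU is redundant
    print(evaluate(system_failure, {"psu-1", "psu-2"}))   # True: both PSUs down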

To help determine the cause of faults in such heterogeneous environments, Stanford graduate students Emre Kiciman and Eugene Fratkin, together with Mike Chen of Berkeley, created PinPoint, a ROC-based program that analyzes which components are faulty. Whenever a user navigates to a PinPoint-enabled website, the program records which components are used in serving the user's request. If a request fails, PinPoint logs it and, over time, data-mines the accumulated information to find out which components are suspected of causing the most failures, so that in the future either software or operators can resolve the problem faster.
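
The fragment below is only a loose illustration of the idea behind PinPoint, not its actual code: each request trace records which components it touched and whether it failed, and a simple tally then ranks components by how often they appear on failed paths (the real system applies more sophisticated data mining):

    # Hypothetical sketch of PinPoint-style diagnosis: correlate the components
    # seen on request paths with request failures to rank likely culprits.
    from collections import Counter

    def rank_suspects(request_traces):
        """request_traces: iterable of (components_used, failed) pairs."""
        failed_counts, total_counts = Counter(), Counter()
        for components, failed in request_traces:
            for c in components:
                total_counts[c] += 1
                if failed:
                    failed_counts[c] += 1
        # Fraction of requests through each component that failed, highest first.
        scores = {c: failed_counts[c] / total_counts[c] for c in total_counts}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    traces = [
        ({"frontend", "catalog", "db-shard-1"}, False),
        ({"frontend", "cart", "db-shard-2"}, True),
        ({"frontend", "cart", "db-shard-1"}, True),
        ({"frontend", "catalog", "db-shard-2"}, False),
    ]
    print(rank_suspects(traces))   # "cart" surfaces as the most suspect component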




Quick Comeback

In recovery-oriented computing, sometimes the easiest way to fix an error, or even to prevent one, is a simple reboot. Rebooting wipes the slate clean and fixes a large class of so-termed "transient" failures, problems that do not recur. In fact, estimates show that 80% of errors are transient. Rebooting works because it returns the system to its start state, which is the easiest state to understand and the most thoroughly tested. Unfortunately, rebooting can be very time-consuming, particularly for a very large system, and important data stored in memory may be lost along the way.

Thus, the idea of "recursive micro-rebooting" has been proposed to accomplish all that rebooting can do, but on a much smaller scale. Generally, a failure does not involve the entire system but only a small portion of it. Provided the system is well designed and such portions are discrete enough to be handled separately, it is theoretically possible to reboot only that portion and allow the rest of the system to function normally. Such a reboot has a much less dramatic impact on the system as a whole, and it is far faster.

For a system to be recursively recoverable, however, it must consist of components that are independently recoverable at multiple levels--that is, each part of the system can be shut down and repaired without affecting the other components. Components must therefore be loosely coupled and prepared to be denied service by other components that may be in the process of micro-rebooting. Fortunately, loosely coupled, componentized platforms such as Sun's J2EE and Microsoft's .NET are gaining popularity. Micro-rebooting does not even have to wait for an error to occur: it can also be applied beforehand as a preventative measure, and many major search engines periodically perform rolling reboots of all nodes in their search clusters.
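
The toy sketch below (invented names, not the ground-station code discussed next) shows the two ingredients micro-rebooting needs: a component that can be reinitialized on its own, and a caller that tolerates the temporary denial of service instead of panicking:

    # Toy sketch of micro-rebooting: restart one component while its peers keep
    # running and simply tolerate the temporary denial of service.
    import time

    class Component:
        def __init__(self, name: str):
            self.name = name
            self.available = True

        def micro_reboot(self):
            # Reinitialize just this component; the rest of the system keeps running.
            self.available = False
            time.sleep(0.01)          # stand-in for re-initialization work
            self.available = True

    class Caller:
        def __init__(self, dependency: Component):
            self.dependency = dependency

        def do_work(self) -> str:
            # Loosely coupled: don't panic if the dependency is mid-reboot.
            if not self.dependency.available:
                return "dependency rebooting, retry later"
            return f"work done via {self.dependency.name}"

    session_cache = Component("session-cache")
    caller = Caller(session_cache)
    print(caller.do_work())          # normal path
    session_cache.available = False  # simulate the cache being mid-reboot
    print(caller.do_work())          # caller degrades gracefully instead of failing
    session_cache.micro_reboot()     # quick, localized recovery
    print(caller.do_work())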

In one application of micro-rebooting, Stanford graduate students George Candea and James Cutler tested the concept on ground-station software. They modified each receiving-station software module so that it would not "panic" if other subcomponents were reinitialized. They collected data on the most common forms of failure and then experimented to see which components could be rebooted to minimize downtime. In the end, they succeeded in automating the recovery process for a range of recurring problems, cutting the average restoration time from 10 minutes to two; the actual rebooting time decreased from about 30 seconds to six.




Dependability Benchmarking/Verification

After ROC systems are implemented, a critical step is testing them to ensure that they deal with faults correctly. The rapid improvements in performance over the years have come about partly through the competitive pressure of universally adopted benchmarking standards, such as clock speed and memory performance. Similarly, fault-tolerant computing needs some sort of benchmarking or test standard to verify that a system does indeed function as intended. The Stanford/Berkeley ROC group advocates injecting test errors into ROC systems so that response times and recovery mechanisms can be measured and publicized. These test errors would span all common fault classes, including hardware, software, and operator errors. They would give insight into what systems should do when they encounter an error, and they would help developers test their repair mechanisms, which is difficult to do in an error-free environment. Perhaps in the future, with the development and advancement of fault-tolerant technology, customers will be able to choose their machines based not solely on performance but also on measures of reliability and availability.
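
As a rough sketch of the fault-injection idea, the fragment below stands in a fake system under test (the fault types, class names, and recovery behavior are all hypothetical) and measures mean recovery time per fault class, the kind of figure such a benchmark might publicize:

    # Hypothetical sketch of a dependability benchmark: inject representative
    # faults (hardware, software, operator) and measure time to recover.
    import time

    FAULT_TYPES = ["disk_failure", "process_crash", "operator_misconfiguration"]

    class FakeSystem:
        """Stand-in for a real system under test; recovers shortly after a fault."""
        def __init__(self):
            self._healthy_at = 0.0

        def inject_fault(self, fault_type: str) -> None:
            # A real harness would trigger an actual hardware/software/operator
            # fault here; this stub just pretends recovery takes a short time.
            self._healthy_at = time.monotonic() + 0.05

        def is_healthy(self) -> bool:
            return time.monotonic() >= self._healthy_at

    def inject_and_measure(system, fault_type: str) -> float:
        """Inject one fault and return the observed recovery time in seconds."""
        system.inject_fault(fault_type)
        start = time.monotonic()
        while not system.is_healthy():
            time.sleep(0.01)
        return time.monotonic() - start

    def run_benchmark(system, trials_per_fault: int = 3):
        # Mean recovery time per fault class.
        return {
            fault: sum(inject_and_measure(system, fault)
                       for _ in range(trials_per_fault)) / trials_per_fault
            for fault in FAULT_TYPES
        }

    print(run_benchmark(FakeSystem()))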
