Fault Tolerant Computing: How?

>> Autonomic Computing

In October 2001, IBM released a 38-page "manifesto" arguing that the complexity of current computing systems has begun to outpace the ability of human administrators to cope with it. It proposes a solution in the form of autonomic computing, a phrase deliberately chosen by IBM's senior vice president of research, Paul Horn. The term refers to the body's autonomic nervous system, which governs heart rate and body temperature so that a person does not have to consciously handle these low-level yet vital functions. Similarly, the fundamental concept in autonomic computing is self-regulation and self-governing operation of the entire system, not just parts of it.

IBM's manifesto laid out the eight key elements of their plan for autonomic computing:

Approaches from different areas, such as artificial intelligence, control theory, complex adaptive systems, catastrophe theory, and cybernetics, were mentioned as plausible starting points for research in autonomic computing. Today, many projects around the world, in universities and in industry, incorporate the idea of autonomic computing. The four main ideas that keep recurring are self-configuration, self-optimization, self-healing, and self-protection.

The University of California, Berkeley, continues to work on the OceanStore project, which aims to provide a consistent, highly available, and durable storage utility on top of an infrastructure composed of untrusted servers. Promiscuous data caching is the idea behind the project: any server may create a local replica of any data object, which provides faster access and robustness to network partitions. Because the servers are untrusted, the design also requires redundancy and cryptographic techniques to protect the data from the servers on which it resides.
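As a rough illustration of why caching on untrusted servers needs cryptographic support, the sketch below names each object by the SHA-256 hash of its contents, so a reader can check any replica against the identifier it asked for. This is an assumption made for illustration, not a description of OceanStore's actual protocol; the names `object_id`, `UntrustedServer`, and `read` are hypothetical, and encryption of the data itself is left out.

```python
import hashlib
from typing import Dict, List, Optional


def object_id(data: bytes) -> str:
    """Name an object by the SHA-256 hash of its contents (illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


class UntrustedServer:
    """Hypothetical server that may cache a replica of any object."""

    def __init__(self) -> None:
        self.cache: Dict[str, bytes] = {}

    def replicate(self, oid: str, data: bytes) -> None:
        self.cache[oid] = data

    def fetch(self, oid: str) -> Optional[bytes]:
        return self.cache.get(oid)


def read(oid: str, servers: List[UntrustedServer]) -> Optional[bytes]:
    """Return the first replica whose hash matches the requested identifier."""
    for server in servers:
        data = server.fetch(oid)
        if data is not None and object_id(data) == oid:
            return data
    return None  # no server held a valid replica


original = b"quarterly report"
oid = object_id(original)

empty, tampered, honest = UntrustedServer(), UntrustedServer(), UntrustedServer()
tampered.replicate(oid, b"forged report")  # a misbehaving server
honest.replicate(oid, original)

result = read(oid, [empty, tampered, honest])
```

The reader skips the missing and forged replicas and accepts only the copy that hashes to the requested identifier, so any server can hold a replica without being trusted.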

UC Berkeley also has a better-known research project, run in conjunction with Stanford University, that focuses on "Recovery-Oriented Computing." Since human error cannot be eradicated, the project attempts to reduce recovery time rather than increase the time between failures. A more in-depth summary of their research can be found here.
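The reasoning can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to recover. The sketch below uses made-up numbers, not figures from the project, to show that cutting recovery time by a factor of ten improves availability as much as making failures ten times rarer.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)
faster_recovery = availability(mttf_hours=1000.0, mttr_hours=1.0)
rarer_failures = availability(mttf_hours=10000.0, mttr_hours=10.0)

print(f"baseline:        {baseline:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
print(f"rarer failures:  {rarer_failures:.6f}")
```

Both improvements give the same availability, which is the sense in which recovery time is as legitimate a target as time between failures.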

Self-securing storage is the keyword at Carnegie Mellon University. It enables the storage device to protect data even when the client OS is compromised. It uses the fact that storage servers run separate software on separate hardware, allowing the existence of server-embedded security that cannot be disabled by any software running on client systems, as shown in the figure above.

At Cornell University, Professor Kenneth Birman is working on Astrolabe, a system that automates self-configuration and monitoring and controls adaptation. It works by creating a virtual system-wide hierarchical database that evolves as the underlying information changes.
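One way to picture a hierarchical database that tracks the underlying information is a tree of zones in which each parent's row is recomputed from its children. The sketch below is only an illustration under that assumption: the `Zone` class, the `load` attribute, and the summary fields are hypothetical and are not Astrolabe's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Zone:
    """Hypothetical node in the hierarchy: a single host or a group of zones."""
    name: str
    attrs: Dict[str, float] = field(default_factory=dict)  # leaf measurements
    children: List["Zone"] = field(default_factory=list)

    def summary(self) -> Dict[str, float]:
        """Recompute this zone's row from the current state beneath it."""
        if not self.children:
            return {"hosts": 1, "max_load": self.attrs["load"]}
        rows = [child.summary() for child in self.children]
        return {
            "hosts": sum(row["hosts"] for row in rows),
            "max_load": max(row["max_load"] for row in rows),
        }


host_a = Zone("a", {"load": 0.2})
host_b = Zone("b", {"load": 0.7})
host_c = Zone("c", {"load": 0.4})
rack = Zone("rack1", children=[host_a, host_b])
root = Zone("root", children=[rack, host_c])

before = root.summary()      # the busiest host is b
host_b.attrs["load"] = 0.1   # the underlying information changes
after = root.summary()       # the root's view follows the change
```

Because every summary is derived from the level below, a change on one host shows up at the root without anyone editing the root's row directly.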

At IBM itself, several projects are still in progress, such as Océano, which manages a complex of servers using optimization algorithms to find the best way to distribute tasks and the cheapest places to store data. It attempts to anticipate demand and have the computers under its command ready just before they are needed.

Hewlett-Packard is also pursuing a similar project, referred to as planetary computing. The goal of planetary computing is infrastructure on demand; moreover, the infrastructure is to be scalable, flexible, economical, and always on. It is to be based on a shared pool of processing power and data storage capacity that can be automatically reconfigured.

Many challenges still need to be addressed. Engineering challenges include creating programming tools and techniques that help manage relationships among autonomic elements, and working out how to install, configure, monitor, and upgrade those elements. The systems will also face the security issues that all computers must address: authentication, authorization, encryption, signing, secure auditing, and others. At the heart of autonomic computing, though, is the challenge of defining appropriate behavioral abstractions and models. Several approaches have been studied. Melanie Mitchell of the Santa Fe Institute has used genetic algorithms to evolve the local transformation rules of simple cellular automata so that they achieve desired global behaviors. Another method, under study at NASA, uses algorithms that derive goals for individual agents from high-level global objectives. These are just two of the many methods being pursued, as the goal of autonomic computing remains tantalizingly beyond reach.
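To give a feel for the cellular-automaton approach, the sketch below runs a one-dimensional binary automaton from a local rule table and scores the rule on a global task: settling every cell to the initial majority value. The task, the parameters, and the function names are assumptions chosen for illustration, and the genetic algorithm that would mutate and select rule tables by this score is omitted.

```python
import random
from typing import List, Sequence

RADIUS = 1  # each cell sees itself and one neighbor on each side


def step(cells: Sequence[int], rule: Sequence[int]) -> List[int]:
    """Apply the local rule to every cell at once, with wrap-around edges."""
    n = len(cells)
    nxt = []
    for i in range(n):
        # Encode the neighborhood (left, self, right) as a 3-bit table index.
        idx = 0
        for d in range(-RADIUS, RADIUS + 1):
            idx = (idx << 1) | cells[(i + d) % n]
        nxt.append(rule[idx])
    return nxt


def run(cells: Sequence[int], rule: Sequence[int], steps: int) -> List[int]:
    for _ in range(steps):
        cells = step(cells, rule)
    return list(cells)


def fitness(rule: Sequence[int], trials: int = 20, size: int = 21,
            steps: int = 40, seed: int = 0) -> float:
    """Fraction of random starts that end with every cell at the initial majority."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        start = [rng.randint(0, 1) for _ in range(size)]
        majority = int(sum(start) > size // 2)  # size is odd, so no ties
        final = run(start, rule, steps)
        if all(cell == majority for cell in final):
            correct += 1
    return correct / trials


# Local majority vote: output 1 when at least two of the three cells are 1.
MAJORITY_RULE = [0, 0, 0, 1, 0, 1, 1, 1]

score = fitness(MAJORITY_RULE)
print(f"local majority rule fitness: {score:.2f}")
```

A purely local vote tends to freeze into stable blocks instead of reaching a global consensus, which is why searching the space of rule tables for better ones is an interesting problem.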

Vocabulary

Artificial Intelligence (AI) - The capacity of a computer or system to perform tasks commonly associated with the higher intellectual processes characteristic of humans. AI can be seen as an attempt to model aspects of human thought on computers. Although certain aspects of AI will undoubtedly contribute to autonomic computing, autonomic computing does not have the emulation of human thought as its primary objective.

Catastrophe Theory - A special branch of dynamical systems theory that studies and classifies phenomena characterized by sudden shifts in behavior arising from small changes in circumstances.

Control Theory - The mathematical analysis of the systems and mechanisms for achieving a desired state under changing internal and external conditions.
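A minimal sketch of the idea, assuming the simplest kind of controller (proportional feedback): at each step the controller measures how far the system is from the desired state and corrects a fixed fraction of that error, while an outside disturbance pushes the system away. The function name, the gain, and the numbers are hypothetical.

```python
from typing import Callable, List


def simulate(setpoint: float, initial: float, gain: float,
             disturbance: Callable[[int], float], steps: int) -> List[float]:
    """Proportional feedback loop; returns the value after each step."""
    value = initial
    history = []
    for t in range(steps):
        error = setpoint - value   # distance from the desired state
        value += gain * error      # corrective action
        value += disturbance(t)    # changing external conditions
        history.append(value)
    return history


def shock(t: int) -> float:
    """A one-off external disturbance at step 10."""
    return 5.0 if t == 10 else 0.0


trace = simulate(setpoint=20.0, initial=15.0, gain=0.5,
                 disturbance=shock, steps=30)
```

The value approaches the setpoint, is knocked away by the disturbance at step 10, and is then driven back toward the setpoint by the same feedback rule.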

Cybernetics - A term derived from the Greek word for "steersman" that was introduced in 1947 to describe the science of control and communication in animals and machines.
