Fault Tolerant Computing: How?

>> The history of fault tolerant computing

Over the past half century, binary computing machines have changed considerably and have grown exponentially in complexity and speed. Early computers had no built-in fault tolerance system and relied solely on programmers to detect erroneous code. The first computers simply failed after executing flawed code, and only then were they inspected and repaired. Early attempts to address faults in machines included "N modular redundancy" and "M of N majority voting". In the "N modular redundancy" technique, the system automatically detects and fixes the faulty module or circuit card and then notifies the operator of the error. The "M of N majority voting" technique uses three or more copies of any replicable hardware component (in the simplest case, three identical machines), executes each command on all of them simultaneously, and checks the results for disagreement. In the example of three parallel machines, the system functions as follows:

If all three machines agree, the system continues execution.
If two of the machines agree but one diverges, execution continues using the two agreeing machines and the disagreeing machine is reported as faulty.
If all three machines disagree, the system halts execution.

Both techniques are still in wide use today but are slowly being replaced by newer methods that aim to keep increasingly complex systems reliable.
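The three-machine voting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `vote` and `NoMajorityError` are hypothetical, machine outputs are modeled as plain comparable values, and "halting" is modeled as raising an exception.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


class NoMajorityError(RuntimeError):
    """Raised when no strict majority of machines agree (the system halts)."""


def vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return (agreed_value, indices_of_faulty_machines).

    `outputs` holds one result per redundant machine. A value wins only
    if more than half of the machines produced it.
    """
    if not outputs:
        raise ValueError("at least one machine output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority, e.g. all three machines disagree.
        raise NoMajorityError("machines disagree; halting")
    # Any machine that differs from the majority value is reported as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For example, `vote([7, 9, 7])` returns the agreed value `7` and reports machine index `1` as faulty, while `vote([1, 2, 3])` raises `NoMajorityError`.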

The BBN Pluribus

The Pluribus system was created by Bolt, Beranek and Newman and consisted of six to fourteen Lockheed SUE computers combined to provide a reliable machine system. The BBN Pluribus was heavily oriented towards its application as a communication processor for the ARPA (Advanced Research Projects Agency) network and therefore required greater reliability than normal systems. In this system, every processor was identical and could access common memory. Initially designed as an IMP (Interface Message Processor), the Pluribus system was later used for processing seismic data and was eventually used in the ARPANET endeavor.

The SUE processor had a single bus for accessing both memory and I/O, and made use of a separate module (the arbiter) to control bus access and to resolve conflicts. Due to the redundancy employed in the hardware design, the system was highly reliable. For example, if the connection between a processor and a common memory bus was lost, the system would continue its task normally by removing either of the two buses from the operational system. If a processor bus became unusable due to faulty hardware, the remaining processor bus(es) would generally be able to provide sufficient computational power to keep the system running.

The fault tolerance scheme in the Pluribus system was based largely on its software and relied on a hardware architecture that was closely tailored to the Pluribus operating system. The main responsibility of the Pluribus operating system was to maintain an up-to-date map of the available hardware and software resources, so that the system always knew which variables and data structures were associated with the processes using those components. As a result, the system continued to function even after hardware components ceased to be operational. The Pluribus operating system was organized as a hierarchical sequence of stages:

STAGE	FUNCTION
0	Checksum local memory code (for stages 0, 1, 2). Initialize local interrupt vectors, and enable interrupts. Discover Processor bus I/O. Find some real-time clock for system timing.
1	Discover all usable common memory pages. Establish page for communication between processors.
2	Find and checksum common memory code (for stages 3, 4, 5). Checksum whole page ("reliability page").
3	Discover all common busses, PIDs, and real-time clocks.
4	Discover all processor bus couplers and processors.
5	Verify checksum (from stage 2) of reliability page code (for rest of stages plus perhaps some application routines). External reloading of missing code pages is possible once this stage is running.
6	Checksum of all local code.
7	Checksum common memory code. Maintain page allocation map.
8	Discover common I/O interfaces.
9	Poll application-dependent reliability and initialization routines. Periodically trigger restarts of halted processors.
10	Application system.

Each processor begins with only the first stage (0) enabled. The system then executes the subsequent stage after establishing a proper map of its segment of the system state. The current stage may use information produced by earlier stages and therefore only becomes enabled if the previous stages were successfully executed. After being enabled, a stage is tested intermittently in order to confirm that the conditions for a successful execution still apply. Previous stages are also re-checked at this time, although most of the processing power is spent in verifying the current stage. The application stage only takes place once all previous stages have properly executed, guaranteeing a reliable execution of the application. Because previous stages are periodically checked at every stage, any change in the running environment is detected before the execution of the application.
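The staged enabling and re-checking described above can be illustrated with a small Python sketch. It is not the Pluribus code: the class `StagedSystem`, its `step` method, and the check functions are hypothetical, each stage is reduced to a single boolean check, and the fallback policy on failure (drop back to the failed stage until it succeeds again) is an assumption, since the text does not say exactly what happens when a re-check fails.

```python
from typing import Callable, List


class StagedSystem:
    """Enable stages one at a time; re-verify all enabled stages each pass."""

    def __init__(self, checks: List[Callable[[], bool]]) -> None:
        if not checks:
            raise ValueError("at least one stage is required")
        self.checks = checks
        self.highest = 0  # only stage 0 is enabled at start

    def step(self) -> int:
        """Run one verification pass and return the highest enabled stage."""
        # Re-check every enabled stage, lowest first.
        for i in range(self.highest + 1):
            if not self.checks[i]():
                # Assumed policy: fall back to the failed stage; later
                # stages stay disabled until it passes again.
                self.highest = i
                return self.highest
        # All enabled stages still hold, so the next stage may be enabled.
        if self.highest < len(self.checks) - 1:
            self.highest += 1
        return self.highest

    @property
    def application_enabled(self) -> bool:
        """True once the final (application) stage has been enabled."""
        return self.highest == len(self.checks) - 1
```

A caller would invoke `step()` periodically. With four always-passing checks, three calls enable the final stage; if the check for stage 1 later starts failing, the next call drops the system back to stage 1, and the application stage is only re-enabled after the earlier stages verify again.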
