"Our bodies have great availability. I have soft errors all the time: my memory fails once in a while, but I don't 'crash.' My whole body doesn't shut down when I cut a finger."
--Robert Morris, director of IBM's Almaden Research Laboratory, on the inspiration of IBM's autonomic computing project
Three general methods are used, and are still being researched, to deal with the errors, failures, and faults that occur so frequently in computer systems. These three software techniques differ in how they handle potential faults in a program.
Fault avoidance consists of creating programs that are free of faults. Developing fault-free software must include a precise system specification, the extensive use of reviews during development, and careful planning and implementation of system testing.
However, because constructing a completely fault-free program is widely recognized as unachievable in general, a second technique is fault removal. This method accepts that faults exist and removes them after the program has been written, but this proves to be another very difficult goal to reach.
Finally, fault tolerance is the technique that has been the focus of much recent research. It consists of accepting the existence of faults, admitting that they cannot all be removed, and using a combination of detection and recovery mechanisms to ensure that the faults do not cause programs to fail.
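To make "detection and recovery" concrete, here is a minimal Python sketch of one well-known pattern, often called a recovery block. The text above does not prescribe this pattern; it is shown only as an illustration. All names (`recovery_block`, `faulty_sqrt`, `backup_sqrt`, `sqrt_tolerant`) are hypothetical, and the "fault" is a deliberately planted bug.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllAlternatesFailed(Exception):
    """Raised when no alternate produces an acceptable result."""


def recovery_block(
    alternates: Sequence[Callable[[], T]],
    acceptance_test: Callable[[T], bool],
) -> T:
    """Try each alternate in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate()      # run one version of the computation
        except Exception:
            continue                  # detection: this alternate crashed
        if acceptance_test(result):   # detection: check the result itself
            return result
        # recovery: discard the rejected result and try the next alternate
    raise AllAlternatesFailed("no alternate passed the acceptance test")


def faulty_sqrt(x: float) -> float:
    """Hypothetical primary version containing a design fault."""
    return x / 2


def backup_sqrt(x: float) -> float:
    """Hypothetical simpler backup version."""
    return x ** 0.5


def sqrt_tolerant(x: float) -> float:
    """Square root that tolerates the fault planted in faulty_sqrt."""
    return recovery_block(
        [lambda: faulty_sqrt(x), lambda: backup_sqrt(x)],
        lambda r: abs(r * r - x) < 1e-9,  # acceptance test: r squared must equal x
    )
```

Calling `sqrt_tolerant(9.0)` returns `3.0`: the primary version yields 4.5, the acceptance test rejects it, and the backup result is used instead. The fault is still present in the program, but it does not cause a failure.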
Some general methods to improve dependability in both hardware and software are listed below:
- good design methods
- extra-reliable (expensive) components
- improved production techniques
- improved power supply and cooling
- adding redundancy:
  - in time: perform computations several times, either with the same hardware (HW) and software (SW) or with different HW and/or SW (design diversity)
  - in space: have multiple units perform the same function, and add majority voting equipment
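The two forms of redundancy can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions, not a description of any particular system: the "units" are plain functions, the voter is software rather than voting equipment, and every name here (`majority_vote`, `redundant_in_space`, `redundant_in_time`, `square`, `stuck_at_zero`, `make_glitchy_square`) is hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, TypeVar

R = TypeVar("R", bound=Hashable)


def majority_vote(results: Sequence[R]) -> R:
    """Return the value reported by a strict majority, or raise if there is none."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among results")
    return value


def redundant_in_space(units: Sequence[Callable[[int], int]], x: int) -> int:
    """Space redundancy: several units compute the same function, then vote."""
    return majority_vote([unit(x) for unit in units])


def redundant_in_time(unit: Callable[[int], int], x: int, runs: int = 3) -> int:
    """Time redundancy: one unit repeats the computation, then vote."""
    return majority_vote([unit(x) for _ in range(runs)])


def square(x: int) -> int:
    """Hypothetical healthy unit."""
    return x * x


def stuck_at_zero(x: int) -> int:
    """Hypothetical unit with a permanent fault: it always outputs 0."""
    return 0


def make_glitchy_square() -> Callable[[int], int]:
    """Hypothetical unit with a transient fault: wrong on its first call only."""
    calls = {"n": 0}

    def unit(x: int) -> int:
        calls["n"] += 1
        return -1 if calls["n"] == 1 else x * x

    return unit
```

With these definitions, `redundant_in_space([square, square, stuck_at_zero], 4)` returns `16` because two healthy units outvote the faulty one, and `redundant_in_time(make_glitchy_square(), 4)` returns `16` because the single bad run is outvoted by the two repeats. Note that repeating a computation on the same unit would not mask the permanent fault in `stuck_at_zero`; that case needs redundancy in space or design diversity.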
Some terms used when discussing dependability:
- failure: a component's inability to perform to its specifications (note: specifications can be wrong)
- error: the cause of a failure
- fault: an anomalous component condition:
  - internal: design or manufacturing fault, damage, aging
  - external: harsh conditions, radiation, electromagnetism, misuse
- design faults: faults whose repair results in a system with a different specification
- hardware faults: non-design faults, i.e., faults caused by physical processes
- permanent fault: a stable deviation from what is required
- intermittent fault: one that occurs every now and then
- transient fault: one that occurs only once, hopefully
>> How Fault Tolerance works
For more information on this fascinating area of research, please visit the next section of our page.