>> The Blue Screen of Death...



You know what it is. We know what it is. Almost everyone who has had any amount of experience with computer technology knows this one fact: Computers fail. Computers are riddled with bugs, errors, and all sorts of embarrassing problems that incite in the average user copious amounts of frustration. The following haikus give a comical but sadly true look at the unreliability of computers:


Serious error.
All shortcuts have disappeared.
Screen. Mind. Both are blank.

Yesterday it worked
Today it is not working
Windows is like that.

Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.

Of course, for the normal user with a personal PC, having Internet Explorer crash or enduring a system reboot after encountering the notorious "blue screen of death" is certainly irritating and inconvenient, but hardly life-altering. A much bigger problem arises with the failure of hardware--that is, when a hard drive gets corrupted and a person's entire collection of saved data is lost, or when the processor chip burns out and one is left computerless for days (heaven forbid!). So the problem of computer failure occurs on two levels: software failures, in which Windows and other software applications crash, and hardware failures, in which the underlying circuit components no longer work correctly.



>> The Cost of Computer Failures

The problem of unreliability is magnified in industry and business, where even a few minutes of downtime can translate to thousands or even millions of dollars lost. The table below (from a 2003 Microsoft White Paper on Strategies for Fault-Tolerant Computing) shows the cost of an hour of downtime in various sectors of the economy:

Industry Sector                       Hourly Cost of Downtime
Manufacturing                         $28,000
Transportation                        $90,000
Retail, Catalog Sales                 $90,000
Retail, Home Shopping                 $113,000
Media, Pay Per View                   $1,100,000
Banking datacenter                    $2,500,000
Financial, Credit Card Processing     $2,600,000
Brokerage                             $6,500,000



Even more important than reducing cost, however, is society's absolute dependence on critical systems that cannot afford to be down, even momentarily. Such systems include hospital equipment such as life support, as well as emergency telecommunications. In these situations, unreliability can result in severe injury or loss of life. Thus, it is of critical importance to develop computers that are highly available.

In the IT community, availability is generally defined as the percentage of time a system is able to serve its intended function, taking into account the reliability of all its components, including the hardware, software, operating system, etc. Availability is typically measured in "nines," so "four nines" refers to 99.99% availability, "five nines" to 99.999% availability, and so on. The following table (from the same Microsoft paper mentioned above) displays the annual downtime that corresponds to each level of availability.

Availability     Annual Downtime
99%              87.6 hours
99.9%            8.76 hours
99.99%           52.5 minutes
99.999%          5.25 minutes
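
The relationship between an availability percentage and annual downtime is simple arithmetic, and can be sketched in a few lines of Python. This is only an illustration: the function name annual_downtime_hours and the constant HOURS_PER_YEAR are made up for this example, and the calculation assumes a 365-day year (8,760 hours). The results should agree with the Microsoft figures up to rounding.

```python
# Assumption: a non-leap year of 365 days, i.e. 8,760 hours.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year that a system is unavailable.

    availability_percent is the fraction of time the system is up,
    expressed as a percentage (for example, 99.99 for "four nines").
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Print downtime for "two nines" through "five nines".
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why moving from four nines to five nines is so much harder than it sounds.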



Very generally, the goal of fault-tolerant computing is to reduce the amount of downtime in computer systems. For more information on what fault-tolerant computing is, visit the next section of the site.