>> Tandem & NonStop Computing

James Treybig and a group of engineers from Hewlett-Packard founded Tandem Computers in 1974, with a business plan focused on fault-tolerant systems that were safe from "single point failures." In 1975, they completed the first version of the NonStop line of systems, which remains a strong competitor in the market today and still uses the same fundamental concepts found in the original model. A study conducted in 1999 found that NonStop ran 90% of the world's securities trades, 80% of the world's ATMs, and 66% of all credit card transactions, and that it also played a major role in phone systems, emergency 911, and cellular network infrastructures. Moreover, in 2000, The Standish Group, an independent research company, found that the NonStop Himalaya systems had 1/4 the downtime of any other vendor's system.

The three main NonStop fundamentals are continuous availability, unlimited scalability, and data integrity. The basic idea behind all these goals rests on redundancy, which is the use of multiple components, each less available on its own than the system as a whole needs to be, in order to increase overall system reliability. The simplest form of redundancy, replication, is applied in the unique hardware architecture of Tandem systems. These systems have a modular design based on loosely coupled processors, each with its own resources, interconnected using a message-based architecture.
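The arithmetic behind replication can be illustrated with a short sketch. It assumes that component failures are independent and that a replicated group stays up as long as at least one replica is up; the function name and the 99% figure are illustrative only and are not taken from Tandem documentation.

```python
def replicated_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group that is up whenever any one replica is up.

    Assumes independent failures, which real systems only approximate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The group is down only when every replica is down at the same time.
    unavailability = 1.0 - component_availability
    return 1.0 - unavailability ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, replicated_availability(0.99, n))
```

Under those assumptions, two components that are each up 99% of the time form a pair that is up about 99.99% of the time, which is why modest individual components can add up to a highly available system.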

Several other companies had similar ideas of switchover or "failover," but their systems operated by restarting programs on other CPUs. The original NonStop I used a custom operating system called Guardian that allowed message passing and checkpointing for each operation, so Guardian could restart from any instruction in the program. More specifically, Tandem's system created process pairs consisting of a primary process and a backup process. First, the primary process would begin, then clone itself elsewhere as a backup. Guardian would send an initial message to the backup process, "telling" the process its status as a backup. Finally, the backup would become passive and accept messages from either Guardian or the primary process. When an error occurred, Guardian would automatically re-route a message from a failing component to a functioning one. Checkpointing and synchronization were important for maintaining "uptimes" measured in years, in contrast to typical systems of the day, which failed every few days.
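The process-pair idea described above can be sketched in a few lines. This is a toy model and not Guardian's actual interface: the class names, the `handle` method, and the running-total state are all hypothetical, and a single Python object stands in for the message routing that the operating system would perform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Process:
    """One half of a process pair; a running total is its only state."""
    name: str
    total: int = 0
    alive: bool = True

    def receive_checkpoint(self, total: int) -> None:
        # The passive backup only records state sent by the primary.
        self.total = total


class ProcessPair:
    """Hypothetical stand-in for the routing the text attributes to Guardian."""

    def __init__(self) -> None:
        self.primary: Process = Process("primary")
        self.backup: Optional[Process] = Process("backup")

    def handle(self, amount: int) -> int:
        if not self.primary.alive:
            if self.backup is None:
                raise RuntimeError("both processes of the pair have failed")
            # Takeover: the backup resumes from the last checkpoint
            # instead of restarting the program from the beginning.
            self.primary, self.backup = self.backup, None
        self.primary.total += amount
        if self.backup is not None:
            # Checkpoint after every operation so the backup stays current.
            self.backup.receive_checkpoint(self.primary.total)
        return self.primary.total


if __name__ == "__main__":
    pair = ProcessPair()
    pair.handle(10)
    pair.handle(5)
    pair.primary.alive = False  # simulate a failure of the primary
    print(pair.handle(1))       # the backup continues from the checkpointed total
```

Because the state is checkpointed after each operation, the caller of `handle` never sees the failure; the request after the simulated fault is simply served by the former backup.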

After NonStop I was finished in 1975, NonStop II was released in 1981, with improvements in speed and, more importantly, memory, including a virtual memory system. In 1983, NonStop TXP doubled the speed and quadrupled the physical memory. Along with the TXP, FOX, a fibre optic bus system, was finished, allowing NonStop systems to be connected together. Later versions continued the upgrades, and in 1991, Cyclone/R and CLX/R ran RISC implementations of Guardian on MIPS R3000-based CPU modules. The first system that changed the underlying architecture of the NonStop system was the NonStop Himalaya (the S-Series), which was finished in 1993. This system based both the I/O and inter-CPU buses on Tandem's new ServerNet system, which was a true peer-to-peer network.

In 1997, Tandem was acquired by Compaq, which was later acquired by Hewlett-Packard in 2002. The NonStop product is still being produced under HP and can be found at http://h71033.www7.hp.com/.

Today, the NonStop products incorporate fault tolerance and data integrity by:

  • Incorporating self-checking processors that use data replication and comparison logic to ensure that faults are detected and that affected components are isolated and taken offline so as to avoid contaminating other sectors. ServerNet technology in routers also ensures that data and addresses are correctly transmitted and prevents adapters and controllers from corrupting memory.

  • Utilizing mirroring to replicate disks so that if one disk drive fails, the server can continue to operate using a mirrored copy. Then, when the faulty disk is repaired or replaced, the mirrored data is copied onto the repaired or replacement disk without a disruption in service.

  • Using power-failure protection in case of a general power outage. NonStop can run on battery-powered backup for up to one hour, eliminating the risk of losing critical data in memory.

  • Integrating two lockstepped processor chips that run identical instruction streams. The output of these processors is compared continuously, and if there is ever a difference, the processor is shut down immediately to keep corrupted data from propagating. This technique is called fail-fast, and it is applied to both hardware and software to eliminate the possibility of magnifying the error in the future.
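The lockstep comparison in the last item can be modelled in software as a rough sketch. Real NonStop hardware compares chip outputs in logic, not in code; the `LockstepPair` class, the `LockstepFault` exception, and the two sample functions below are hypothetical names chosen for illustration.

```python
from typing import Callable


class LockstepFault(Exception):
    """Raised when the two lockstepped units disagree or the pair is offline."""


class LockstepPair:
    """Runs the same operation on two units and compares their outputs."""

    def __init__(self, unit_a: Callable[[int], int], unit_b: Callable[[int], int]) -> None:
        self.unit_a = unit_a
        self.unit_b = unit_b
        self.halted = False

    def execute(self, operand: int) -> int:
        if self.halted:
            raise LockstepFault("processor is offline")
        out_a = self.unit_a(operand)
        out_b = self.unit_b(operand)
        if out_a != out_b:
            # Fail fast: halt before the suspect result can leave the module.
            self.halted = True
            raise LockstepFault(f"outputs differ: {out_a} != {out_b}")
        return out_a


def double(x: int) -> int:
    return x * 2


def faulty_double(x: int) -> int:
    # Simulated hardware fault: wrong answer for one particular input.
    return x * 2 + (1 if x == 3 else 0)


if __name__ == "__main__":
    pair = LockstepPair(double, faulty_double)
    print(pair.execute(2))
    try:
        pair.execute(3)
    except LockstepFault as fault:
        print("halted:", fault)
```

The design choice worth noting is that the pair never tries to guess which unit is right. It stops at the first disagreement and stays stopped, leaving recovery to a redundant component elsewhere in the system.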