programmer error, and to reduce the likelihood that programs
could fail unexpectedly. In this way, it was hoped that Ada would not
only allow programmers to do all Defense programming in a familiar and
consistent environment, but also to write higher-quality and more
reliable code in less time. These are important considerations when developing
defense systems.
Since 1983, Ada has become fairly popular. Although it has seen some use
outside the military sector, in areas such as commercial banking and
aviation systems, its main use has been in military computer
systems. In 1987, the Department of Defense issued the so-called "Ada
mandate", stating that "the Ada programming language shall be the single,
common, computer programming language for Defense computer resources
used in intelligence systems, for the command and control of military
forces, or as an integral part of a weapon system." In other words,
all defense software would be implemented using Ada.
Ada was designed from the ground up to be a better language to write
these sorts of systems in, both from the standpoint of implementation and
maintenance costs and reliability. However, in 1997, the Ada mandate
was reversed by Assistant Secretary of Defense Emmett Paige, Jr, on
recommendation from the National Academy of Sciences' Research Council.
Among other reasons, the report from the Research Council cited low
commercial adoption of Ada. Although the military had been writing millions
of lines of Ada code, not many others had. As a result, it was found
that programmers were not as familiar with Ada as other languages, managers
were not as familiar with its characteristics, it was not being taught
in schools, and tools and compilers for Ada were available only in a
limited fashion.
The original impetus for Ada's design was the "software crisis" in
the Department of Defense in the late 1960s and early 1970s -- defense
applications were being written in a multitude of different, incompatible
languages, leading to increased costs for development and maintenance,
delays, and faulty software. Providing a single, common language
-- Ada -- was supposed to make it easier to develop high-quality
software that could be easily maintained. In the fifteen years since
Ada was introduced, however, the software engineering market had changed.
A lot had been learned about the process of designing software, and
the commercial sector had its own solutions to many of the problems
that Defense addressed with Ada.
In light of this, the Department of Defense abandoned its Ada mandate,
requiring instead that standard, non-proprietary languages be used,
to keep down language proliferation and high maintenance costs. The
Department of Defense claims it is not abandoning Ada, but only making
the programming language a choice that needs to be considered like any other
aspect of a computer system project.
However, this decision reveals one important fact: although the Department
of Defense did see lower development and maintenance costs as a result
of abandoning proprietary languages, it did not find a marked improvement
in the quality and reliability of its code simply by switching to a language
designed, more than other high-level languages, to ensure those properties.
As of today, defense programmers are given the same tools as those in
the commercial sector, not only because those tools are cheaper and better
understood, but because the specialized military tool (Ada) does not do
the job significantly better.
Smart Ship
In 1995, the U.S. Navy, on the advice of the Naval Research Advisory
Committee (NRAC), started a program to research labor and manpower saving
ideas. The results of this program, dubbed the Smart Ship, are being
tested aboard the USS Yorktown. The Navy quickly deemed the program a
success in reducing manpower, maintenance and costs. In September of
1997, however, the Yorktown's propulsion system failed. The ship had to
be towed to the naval base at Norfolk, and was not restored to operational
status for two days. The culprit? The software running on the PCs designed
to control the ship crashed, taking the rest of the ship down with it.
According to a memo from Vice Admiral Henry Giffin, commander of the
Atlantic Fleet's Naval Surface Force, "The Yorktown's Standard Monitoring
Control System administrators entered zero into the data field for the
Remote Data Base Manager program. That caused the database to overflow
and crash the LAN consoles and miniature remote terminal units." As
part of cost-cutting measures, the Navy has a policy of using commercial
off-the-shelf (COTS) hardware and software. The Smart Ship program used
standard Pentium Pro PCs, and the standard operating system for the
Navy Information Technology for the 21st Century initiative: Windows
NT 4.0.
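The Navy's memo does not describe the underlying code, but a failure of
this kind typically comes down to an unguarded arithmetic or storage
operation performed on unvalidated operator input. A purely illustrative
Python sketch (the function names and the fuel-rate calculation are
hypothetical, not the Yorktown's actual software) shows how a single
unchecked zero can abort a control loop, and how a guard at the input
boundary prevents it:

    def fuel_rate(total_fuel_used, elapsed_minutes):
        # Unguarded: an operator-entered zero flows straight into the
        # arithmetic and raises ZeroDivisionError, crashing the caller.
        return total_fuel_used / elapsed_minutes

    def fuel_rate_guarded(total_fuel_used, elapsed_minutes):
        # Guarded: reject implausible operator input at the boundary
        # instead of letting it take down the control loop.
        if elapsed_minutes <= 0:
            raise ValueError("elapsed_minutes must be positive")
        return total_fuel_used / elapsed_minutes

    print(fuel_rate_guarded(1200.0, 30.0))      # 40.0
    try:
        fuel_rate_guarded(1200.0, 0.0)          # rejected, not a crash
    except ValueError as err:
        print("rejected bad input:", err)

The point is not the particular calculation but the discipline: input from
a console is part of the system's environment and must be validated before
it reaches code that cannot tolerate it.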
But according to engineer Anthony DiGiorgio, of the Atlantic Fleet
Technical Support Center, "using Windows NT, which is known to have
some failure modes, on a warship is similar to hoping that luck will
be in our favor." Although Windows NT is promoted by Microsoft as a
stable and robust operating system, it is not by any means bug-free --
the list of bugs that Microsoft fixes and introduces in each new version
is impressively long -- and trusting full control of a ship's systems
to such an operating system is dangerous. A naval vessel that cannot
operate for even a few minutes could be in grave danger, especially
during a time of war. So if the ship is to be controlled by a computer,
that computer needs to be absolutely reliable -- or at least more reliable
than the crew it replaces.
Obviously the software the Navy used aboard the Yorktown was flawed.
It is a pilot program, however, and such bugs may be expected. This one
was fixed and the program moved on. More important, however, are
the greater concerns the bug reveals: The use of Windows NT, and the
fact that the software could cause such a major failure, point to a failure
in the design process. Such a large and important system should
have been as bug-free as possible before it reached the ocean.
In fact, a letter published in Scientific American a year and a half
after the incident reveals more: Harvey McKelvey, former director of
Navy programs for CAE Electronics, which designed portions of the software
used, wrote that the incident was due to "a decision to allow the ship
to manipulate the software to simulate machine casualties," which was
not the intended mode of operation. CAE "was on record with the navy
in January 1997 expressing serious concern for system integrity and
reliability while this unorthodox and risky access to the core software
was allowed."
In other words, not only did the software allow the ship to be rendered
inoperative, but the Navy decided to use the software in ways explicitly
warned against by its authors. One of the most important parts of implementing
any software system is the human factor: Computer systems are designed,
built and used by people, and if those people do not pay the proper
amount of attention to detail, failures can occur. In this incident,
the people failed in multiple areas: The design process favored cost
and speed over safety and reliability. The implementation was built
on a system known to be faulty, and the system was used in a way not
intended by the designers.
The Navy Smart Ship program continues without major modification; the
USS Yorktown trials have been declared a success and the Navy plans
to make similar modifications to other Aegis-class vessels. Yet these
issues remain. One can only hope
that the Navy recognizes and repairs the flaws in its computer and software
system design process before a major catastrophe occurs.
Space Shuttle
NASA's
Space Transportation System, commonly referred to as the Space Shuttle,
is one of the most meticulously engineered transportation devices made
by man. The computer systems onboard are no exception. They are subject
to an extremely rigorous design and testing process before ever
being deployed in space, and errors have been few and far between. The
entire process and methodology used to construct the shuttle computer
systems was designed to prevent any error from occurring.
There are five main computers, or "general purpose computers" (GPCs)
aboard the shuttle. The AP-101 computer series was designed explicitly
for use aboard the shuttle, and only two models have been produced:
The AP-101B, which was installed aboard the shuttles initially, began
its design process in 1972, eight years before the first shuttle launch.
In 1991, these were upgraded to the AP-101S, originally designed in
1985. These computers are ludicrously underpowered compared to computers
available today, but NASA still uses them, with good reason: They work.
These machines have twenty years of testing and design behind them.
They have a very small amount of active memory -- which until the 1991
upgrade was still using ferrite core, a form of memory that has not
been used in computer design since the early 1970s. The machines have
no hard drives or other modern storage devices. Mission programs are
loaded from tapes, and the computers have so little storage that new
tapes must be loaded at various points during the mission for new tasks.
Even with all the testing and verification of hardware, the computers
are still not trusted. Four of the computers serve as primary systems,
and each is connected (where possible) to a separate set of sensor systems.
Each then independently determines the shuttle's course of action, and
a voting algorithm is used to select what to do. Actually, once the
launch is complete, one of the four computers is loaded with the descent
(landing) program and put to "sleep," disconnected from the rest of
the shuttle -- in this way, should something happen to the other three
computers, it will still be able to land. And in the event of a failure
of all four primary systems, a completely separate backup computer is
available, and can be switched in by the pilot at a moment's notice.
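Neither NASA nor the sources cited here spell out the actual voting logic,
so the following is only a minimal sketch of the general idea, in Python,
with hypothetical command values and a hypothetical agreement tolerance:
each redundant computer proposes a command from its own sensors, small
numerical differences are forgiven, and the majority value wins.

    from collections import Counter

    def vote(commands, quantum=0.1):
        # Bucket each proposed command to a coarse quantum so that small
        # floating-point differences between healthy computers do not
        # register as disagreement.
        buckets = [round(c / quantum) for c in commands]
        value, count = Counter(buckets).most_common(1)[0]
        if count <= len(commands) // 2:
            raise RuntimeError("no majority among redundant computers")
        return value * quantum

    # Three healthy computers agree; a fourth has failed and reports garbage.
    print(vote([12.50, 12.50, 12.51, 97.30]))   # ~12.5; the failed unit is outvoted

A real flight system must also decide what to do when no majority exists;
in the shuttle's case, that is where the separately programmed backup
computer comes in.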
To match the hardware, the software is also written and verified meticulously.
The shuttle control software is not large by commercial standards --
some 420,000 lines of code -- but the authors of the code at IBM Federal
Systems Division and Lockheed Martin Corp. consider it essentially bug-free.
In the shuttle's entire flight history, there have been only seventeen
failures due to software, none of them major enough to impact a mission.
Unlike much other software, the NASA shuttle systems are written only
after an incredible amount of documentation and plans have been commissioned.
Before a single line of code is written, it has been planned out and
approved. This way, with everything planned and known in advance, the
likelihood is reduced that changes will need to be made along the way,
changes which could interact haphazardly with other parts of the system.
Further, once the code has
been written, it is subject to independent verification, line by line,
to make sure it is correct.
To make things even more secure, the fifth, backup computer runs
entirely separate software, called the Backup Flight Software (BFS),
from the four primary computers, which use the Primary Avionics Software
System (PASS). The two software systems are written by separate companies
and the coders are not allowed to see each other's work. This is to
ensure that if a software bug causes the primary computers to fail in
some given scenario, it is likely that the authors of the backup software
will not have introduced the same bug, and switching to the backup system
will let the shuttle continue.
The design and implementation techniques for the space shuttle systems
are top-notch. In 1986, the shuttle Challenger exploded on takeoff.
The accident had nothing to do with the computer systems, but Richard
Feynman, Nobel Laureate in physics and the member of the commission
to investigate the disaster who actually discovered the cause of failure
(the O-ring in the solid rocket booster (SRB)), was impressed enough with
the computer system and its design to write that "The computer software
checking system and attitude is of the highest quality. There appears
to be no process of gradually fooling oneself while degrading standards
so characteristic of the Solid Rocket Booster or Space Shuttle Main
Engine safety systems." In other words, even compared with the rest
of the shuttle, the avionics (computer) system has incredibly high standards
of quality.
There is no doubt that the computer systems aboard the space shuttle
are designed with the utmost attention to detail, using the most well-thought-out
and well-implemented design methodologies possible. The hardware systems
are as reliable as possible, and only changed when deemed necessary
and safe. The software is subject to countless hours of verification
and testing. Yet there are still flaws. Bugs are still present in the
code, albeit very few compared to most other software systems. As of
1988, NASA had logged over 700 anomalies involving the computer system
in only 24 completed missions. Not all of these were software bugs,
or even computer-related, but it still shows that no computer system
is 100% reliable.
Interview with Dr. Stringer-Calvert at SRI
A great deal of work has been done on verification of computer software
systems using formal verification techniques. Someone wishing
to verify a piece of software must first express the specification
for that software in the higher-order logic of a verification system.
Then, such a system can verify that the operation of the software matches
the specification. In this manner bugs can be detected and eliminated.
Further, formal verification can help to assure the reliability of critical
systems.
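As a toy illustration of the idea (not the syntax or method of any
particular verification tool, and with purely hypothetical names), the
specification can be stated separately from the implementation and the
two checked against each other. A real verifier produces a proof covering
all inputs; the Python sketch below merely checks a small bounded domain.

    def spec_abs(x: int, y: int) -> bool:
        # Specification: y is the absolute value of x.
        return y >= 0 and (y == x or y == -x)

    def impl_abs(x: int) -> int:
        # Implementation under scrutiny.
        return x if x > 0 else -x

    # A verifier would prove the property for every possible input; this
    # sketch only checks it exhaustively over a bounded domain.
    assert all(spec_abs(x, impl_abs(x)) for x in range(-1000, 1001))
    print("implementation meets the specification on the checked domain")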
I spoke with Dr. David W. J. Stringer-Calvert, a scientist at SRI's
Computer Science Laboratory who works on a formal verification system
called PVS, or Prototype Verification System. He had a number of interesting
things to say about the ways that his system is used to verify critical
systems, including aviation systems in the aerospace, military, and
commercial sectors.
Unlike in traditional engineering, in which it is possible to perhaps
double the strength of a structural member by doubling the amount of
material used, software cannot be strengthened in this manner by doubling
the amount of code. Usually the opposite happens. Rather, it is possible
to decrease the likelihood of failure by using formal methods to conclusively
prove that the important parts of a system are correct.
One of the primary difficulties in verifying a complete system is that
it is nontrivial at best to express the complete specification, including
all possible inputs, for a system designed to function in the real world.
For example, suppose a system does everything properly when the user
enters data that the designer anticipated they would enter, but crashes
when the user enters a zero where they shouldn't have. In this case,
a formal verification system would report that the system meets its
specification -- but the specification was simply not robust enough.
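To make the gap concrete, here is a hypothetical continuation of the toy
style above: the specification as written quantifies only over the inputs
the designer anticipated, so checking the implementation against it says
nothing about what happens when a zero arrives.

    def spec_split(total: float, count: int, result: float) -> bool:
        # Specification as written: only the anticipated counts 1..100
        # are ever mentioned.
        return 1 <= count <= 100 and result == total / count

    def impl_split(total: float, count: int) -> float:
        return total / count   # raises ZeroDivisionError when count == 0

    # The implementation satisfies the specification over its stated domain...
    assert all(spec_split(50.0, c, impl_split(50.0, c)) for c in range(1, 101))
    # ...yet an unanticipated zero from the user still brings it down:
    # impl_split(50.0, 0) would raise ZeroDivisionError.

The verification is sound; it is the specification that fails to describe
the real world.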
Dr. Stringer-Calvert noted that while formal verification can be used
to verify with 100% accuracy that a short program is logically correct,
it is often impossible to verify the total correctness of a large system
because the time to do so would be impractical. For example, to verify
with 100% accuracy that the systems that control an airplane are "correct"
could take centuries using the best computers available. Rather, formal
verification of important parts of systems can help expose bugs and
decrease the probability of failure.
Regarding redundant computers in critical systems, Dr. Stringer-Calvert
was somewhat skeptical. Suppose that redundant systems were implemented
according to protocol, that three separate teams were given the same
design task but that they were rigorously separated and forced to use
different hardware, software, and design methodologies. Suppose then
that these three systems were implemented, and another computer existed
just to tally the results and report the final result, taking a vote
if necessary. It appears that this one vote-taking computer represents
a single point of failure. According to one article he co-authored,
"Mechanisms for fault tolerance are a significant component of many
safety-critical systems: they can account for half the software in a
typical flight-control system, and are sufficiently complicated that
they can become its primary source of failure!"
(http://www.csl.sri.com/reports/html/fmtrends98.html)
Dr. Stringer-Calvert also pointed out that if two computer systems
don't agree on a result, then one of them has bugs and could have been
better implemented from the beginning. One correct system, he asserted,
was better than three systems that could be incorrect and would have
to vote to get a result. After all, what if the computer that was in
the minority had the correct result?
Certification trade groups recognize the importance of formal verification
in systems for aviation, for example. The organization known as RTCA,
or Requirements and Technical Concepts for Aviation, Inc., publishes
a document known as "DO-178B" that specifies guidelines for how reliable
aviation computer systems should be created. This document makes reference
to formal verification as an important method for providing "evidence that
the system is complete and correct with respect to its requirements."
(http://www.csl.sri.com/reports/html/csl-95-1.html) While such formal
verification is not considered mandatory by this trade group, it is
listed as an important tool. Other organizations, such as NASA, do,
however, sometimes require formal verification for their systems.