programmer error, and to reduce the likelihood that programs
could fail unexpectedly. In this way, it was hoped that Ada would not
only allow programmers to do all Defense programming in a familiar and
consistent environment, but also to write higher-quality and more
reliable code in less time. These are important considerations when developing
defense systems.
Since 1983, Ada has become fairly popular. Although it has seen some use
outside the military sector, in areas such as commercial banking and
aviation systems, its main use has been in military computer
systems. In 1987, the Department of Defense issued the so-called "Ada
mandate", stating that "the Ada programming language shall be the single,
common, computer programming language for Defense computer resources
used in intelligence systems, for the command and control of military
forces, or as an integral part of a weapon system." In other words,
all defense software would be implemented using Ada.
Ada was designed from the ground up to be a better language to write
these sorts of systems in, both from the standpoint of implementation and
maintenance costs and reliability. However, in 1997, the Ada mandate
was reversed by Assistant Secretary of Defense Emmett Paige, Jr, on
recommendation from the National Academy of Sciences' Research Council.
Among other reasons, the report from the Research Council cited low
commercial adoption of Ada. Although the military had been writing millions
of lines of Ada code, not many others had. As a result, it was found
that programmers were not as familiar with Ada as other languages, managers
were not as familiar with its characteristics, it was not being taught
in schools, and tools and compilers for Ada were available only in a
limited fashion.
The original impetus for Ada's design was the "software crisis" in
the Department of Defense in the late 1960s and early 1970s -- defense
applications were being written in a multitude of different, incompatible
languages, leading to increased costs for development and maintenance,
delays, and faulty software. Providing a single, common language
-- Ada -- was supposed to make it easier to develop high-quality
software that could be easily maintained. In the fifteen years since
Ada was introduced, however, the software engineering market had changed.
A lot had been learned about the process of designing software, and
the commercial sector had its own solutions to many of the problems
that Defense addressed with Ada.
In light of this, the Department of Defense abandoned its Ada mandate,
requiring instead that standard, non-proprietary languages be used,
to keep down language proliferation and high maintenance costs. The
Department of Defense claims it is not abandoning Ada, but only making
the programming language a choice that needs to be considered like any other
aspect of a computer system project.
However, this decision reveals one important fact: although the Department
of Defense did see lower development and maintenance costs as a result
of abandoning proprietary languages, it did not find a marked improvement
in the quality and reliability of its code simply by switching to a language
designed, more than other high-level languages, to ensure those properties.
As of today, defense programmers are given the same tools as those in
the commercial sector, not only because those tools are cheaper and better
understood, but because the specialized military tool (Ada) does not do
the job significantly better.
Smart Ship
In 1995, the U.S. Navy, on the advice of the Naval Research Advisory
Committee (NRAC), started a program to research labor and manpower saving
ideas. The results of this program, dubbed the Smart Ship, are being
tested aboard the USS Yorktown. The Navy quickly deemed the program a
success in reducing manpower, maintenance and costs. In September of
1997, however, the Yorktown's propulsion system failed. The ship had to
be towed to the naval base at Norfolk, and was not restored to operational
status for two days. The culprit? The software running on the PCs designed
to control the ship crashed, taking the rest of the ship down with it.
According to a memo from Vice Admiral Henry Giffin, commander of the
Atlantic Fleet's Naval Surface Force, "The Yorktown's Standard Monitoring
Control System administrators entered zero into the data field for the
Remote Data Base Manager program. That caused the database to overflow
and crash the LAN consoles and miniature remote terminal units." As
part of cost-cutting measures, the Navy has a policy of using commercial
off-the-shelf (COTS) hardware and software. The Smart Ship program used
standard Pentium Pro PCs, and the standard operating system for the
Navy Information Technology for the 21st Century initiative: Windows
NT 4.0.
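The Navy's memo does not describe the underlying code, but a failure of
this kind typically comes down to an unguarded arithmetic or storage
operation performed on unvalidated operator input. A purely illustrative
Python sketch (the function names and the fuel-rate calculation are
hypothetical, not the Yorktown's actual software) shows how a single
unchecked zero can abort a control loop, and how a guard at the input
boundary prevents it:

    def fuel_rate(total_fuel_used, elapsed_minutes):
        # Unguarded: an operator-entered zero flows straight into the
        # arithmetic and raises ZeroDivisionError, crashing the caller.
        return total_fuel_used / elapsed_minutes

    def fuel_rate_guarded(total_fuel_used, elapsed_minutes):
        # Guarded: reject implausible operator input at the boundary
        # instead of letting it take down the control loop.
        if elapsed_minutes <= 0:
            raise ValueError("elapsed_minutes must be positive")
        return total_fuel_used / elapsed_minutes

    print(fuel_rate_guarded(1200.0, 30.0))      # 40.0
    try:
        fuel_rate_guarded(1200.0, 0.0)          # rejected, not a crash
    except ValueError as err:
        print("rejected bad input:", err)

The point is not the particular calculation but the discipline: input from
a console is part of the system's environment and must be validated before
it reaches code that cannot tolerate it.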
But according to engineer Anthony DiGiorgio, of the Atlantic Fleet
Technical Support Center, "using Windows NT, which is known to have
some failure modes, on a warship is similar to hoping that luck will
be in our favor." Although Windows NT is promoted by Microsoft as a
stable and robust operating system, it is not by any means bug-free --
the list of bugs that Microsoft fixes and introduces in each new version
is impressively long -- and trusting full control of a ship's systems
to such an operating system is dangerous. A naval vessel that cannot
operate for even a few minutes could be in grave danger, especially
during a time of war. So if the ship is to be controlled by a computer,
that computer needs to be absolutely reliable -- or at least more reliable
than the crew it replaces.
Obviously the software the Navy used aboard the Yorktown was flawed.
It is a pilot program, however, and such bugs may be expected. This one
was fixed and the program moved on. More important, however, are
the greater concerns the bug reveals: The use of Windows NT, and the
fact that the software could cause such a major failure, point to a failure
in the design process. Such a large and important system should
have been as bug-free as possible before it reached the ocean.
In fact, a letter published in Scientific American a year and a half
after the incident reveals more: Harvey McKelvey, former director of
Navy programs for CAE Electronics, which designed portions of the software
used, wrote that the incident was due to "a decision to allow the ship
to manipulate the software to simulate machine casualties," which was
not the intended mode of operation. CAE "was on record with the navy
in January 1997 expressing serious concern for system integrity and
reliability while this unorthodox and risky access to the core software
was allowed."
In other words, not only did the software allow the ship to be rendered
inoperative, but the Navy decided to use the software in ways explicitly
warned against by its authors. One of the most important parts of implementing
any software system is the human factor: Computer systems are designed,
built and used by people, and if those people do not pay the proper
amount of attention to detail, failures can occur. In this incident,
the people failed in multiple areas: The design process favored cost
and speed over safety and reliability. The implementation was built
on a system known to be faulty, and the system was used in a way not
intended by the designers.
The Navy Smart Ship program continues without major modification; the
USS Yorktown trials have been declared a success and the Navy plans
to make similar modifications to other Aegis-class vessels. Yet these
issues remain. One can only hope
that the Navy recognizes and repairs the flaws in its computer and software
system design process before a major catastrophe occurs.
Space Shuttle
NASA's
Space Transportation System, commonly referred to as the Space Shuttle,
is one of the most meticulously engineered transportation devices made
by man. The computer systems onboard are no exception. They are subject
to an extremely rigorous design and testing process before ever
being deployed in space, and errors have been few and far between. The
entire process and methodology used to construct the shuttle computer
systems was designed to prevent any error from occurring.
There are five main computers, or "general purpose computers" (GPCs)
aboard the shuttle. The AP-101 computer series was designed explicitly
for use aboard the shuttle, and only two models have been produced:
The AP-101B, which was installed aboard the shuttles initially, began
its design process in 1972, eight years before the first shuttle launch.
In 1991, these were upgraded to the AP-101S, originally designed in
1985. These computers are ludicrously underpowered compared to computers
available today, but NASA still uses them, with good reason: They work.
These machines have twenty years of testing and design behind them.
They have a very small amount of active memory -- which until the 1991
upgrade was still using ferrite core, a form of memory that has not
been used in computer design since the early 1970s. The machines have
no hard drives or other modern storage devices. Mission programs are
loaded from tapes, and the computers have so little storage that new
tapes must be loaded at various points during the mission for new tasks.
Even with all the testing and verification of hardware, the computers
are still not trusted. Four of the computers serve as primary systems,
and each is connected (where possible) to a separate set of sensor systems.
Each then independently determines the shuttle's course of action, and
a voting algorithm is used to select what to do. Actually, once the
launch is complete, one of the four computers is loaded with the descent
(landing) program and put to "sleep," disconnected from the rest of
the shuttle -- in this way, should something happen to the other three
computers, it will still be able to land. And in the event of a failure
of all four primary systems, a completely separate backup computer is
available, and can be switched in by the pilot at a moment's notice.
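Neither NASA nor the sources cited here spell out the actual voting logic,
so the following is only a minimal sketch of the general idea, in Python,
with hypothetical command values and a hypothetical agreement tolerance:
each redundant computer proposes a command from its own sensors, small
numerical differences are forgiven, and the majority value wins.

    from collections import Counter

    def vote(commands, quantum=0.1):
        # Bucket each proposed command to a coarse quantum so that small
        # floating-point differences between healthy computers do not
        # register as disagreement.
        buckets = [round(c / quantum) for c in commands]
        value, count = Counter(buckets).most_common(1)[0]
        if count <= len(commands) // 2:
            raise RuntimeError("no majority among redundant computers")
        return value * quantum

    # Three healthy computers agree; a fourth has failed and reports garbage.
    print(vote([12.50, 12.50, 12.51, 97.30]))   # ~12.5; the failed unit is outvoted

A real flight system must also decide what to do when no majority exists;
in the shuttle's case, that is where the separately programmed backup
computer comes in.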
To match the hardware, the software is also written and verified meticulously.
The shuttle control software is not large by commercial standards --
some 420,000 lines of code -- but the authors of the code at IBM Federal
Systems Division and Lockheed Martin Corp. consider it essentially bug-free.
In the shuttle's entire flight history, there have been only seventeen
failures due to software, none of them major enough to impact a mission.
Unlike much other software, the NASA shuttle systems are written only
after an incredible amount of documentation and plans have been commissioned.
Before a single line of code is written, it has been planned out and
approved. This way, with everything planned and known in advance, the
likelihood is reduced that changes will need to be made along the way,
changes which could interact haphazardly with other parts of the system.
Further, once the code has
been written, it is subject to independent verification, line by line,
to make sure it is correct.
To make things even more secure, the fifth, backup computer runs
entirely separate software, called the Backup Flight Software (BFS),
from the four primary computers, which use the Primary Avionics Software
System (PASS). The two software systems are written by separate companies
and the coders are not allowed to see each other's work. This is to
ensure that if a software bug causes the primary computers to fail in
some given scenario, it is likely that the authors of the backup software
will not have introduced the same bug, and switching to the backup system
will let the shuttle continue.
The design and implementation techniques for the space shuttle systems
are top-notch. In 1986, the shuttle Challenger exploded on takeoff.
The accident had nothing to do with the computer systems, but Richard
Feynman, Nobel Laureate in physics and the member of the commission
to investigate the disaster who actually discovered the cause of failure
(the O-ring in the solid rocket booster (SRB)), was impressed enough with
the computer system and its design to write that "The computer software
checking system and attitude is of the highest quality. There appears
to be no process of gradually fooling oneself while degrading standards
so characteristic of the Solid Rocket Booster or Space Shuttle Main
Engine safety systems." In other words, even compared with the rest
of the shuttle, the avionics (computer) system has incredibly high standards
of quality.
There is no doubt that the computer systems aboard the space shuttle
are designed with the utmost attention to detail, using the most well-thought-out
and well-implemented design methodologies possible. The hardware systems
are as reliable as possible, and only changed when deemed necessary
and safe. The software is subject to countless hours of verification
and testing. Yet there are still flaws. Bugs are still present in the
code, albeit very few compared to most other software systems. As of
1988, NASA had logged over 700 anomalies involving the computer system
in only 24 completed missions. Not all of these were software bugs,
or even computer-related, but it still shows that no computer system
is 100% reliable.
Interview with Dr. Stringer-Calvert at SRI
A great deal of work has been done on verification of computer software
systems using formal verification techniques. Someone wishing
to verify a piece of software must first express the specification
for that software in the higher-order logic of a verification system.
Then, such a system can verify that the operation of the software matches
the specification. In this manner bugs can be detected and eliminated.
Further, formal verification can help to assure the reliability of critical
systems.
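As a toy illustration of the idea (not the syntax or method of any
particular verification tool, and with purely hypothetical names), the
specification can be stated separately from the implementation and the
two checked against each other. A real verifier produces a proof covering
all inputs; the Python sketch below merely checks a small bounded domain.

    def spec_abs(x: int, y: int) -> bool:
        # Specification: y is the absolute value of x.
        return y >= 0 and (y == x or y == -x)

    def impl_abs(x: int) -> int:
        # Implementation under scrutiny.
        return x if x > 0 else -x

    # A verifier would prove the property for every possible input; this
    # sketch only checks it exhaustively over a bounded domain.
    assert all(spec_abs(x, impl_abs(x)) for x in range(-1000, 1001))
    print("implementation meets the specification on the checked domain")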
I spoke with Dr. David W. J. Stringer-Calvert, a scientist at SRI's
Computer Science Laboratory who works on a formal verification system
called PVS, or Prototype Verification System. He had a number of interesting
things to say about the ways that his system is used to verify critical
systems, including aviation systems in the aerospace, military, and
commercial sectors.
Unlike in traditional engineering, in which it is possible to perhaps
double the strength of a structural member by doubling the amount of
material used, software cannot be strengthened in this manner by doubling
the amount of code. Usually the opposite happens. Rather, it is possible
to decrease the likelihood of failure by using formal methods to conclusively
prove that the important parts of a system are correct.
One of the primary difficulties in verifying a complete system is that
it is nontrivial at best to express the complete specification, including
all possible inputs, for a system designed to function in the real world.
For example, suppose a system does everything properly when the user
enters data that the designer anticipated they would enter, but crashes
when the user enters a zero where they shouldn't have. In this case,
a formal verification system would report that the system meets its
specification -- but the specification was simply not robust enough.
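To make the gap concrete, here is a hypothetical continuation of the toy
style above: the specification as written quantifies only over the inputs
the designer anticipated, so checking the implementation against it says
nothing about what happens when a zero arrives.

    def spec_split(total: float, count: int, result: float) -> bool:
        # Specification as written: only the anticipated counts 1..100
        # are ever mentioned.
        return 1 <= count <= 100 and result == total / count

    def impl_split(total: float, count: int) -> float:
        return total / count   # raises ZeroDivisionError when count == 0

    # The implementation satisfies the specification over its stated domain...
    assert all(spec_split(50.0, c, impl_split(50.0, c)) for c in range(1, 101))
    # ...yet an unanticipated zero from the user still brings it down:
    # impl_split(50.0, 0) would raise ZeroDivisionError.

The verification is sound; it is the specification that fails to describe
the real world.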
Dr. Stringer-Calvert noted that while formal verification can be used
to verify with 100% accuracy that a short program is logically correct,
it is often impossible to verify the total correctness of a large system
because the time to do so would be impractical. For example, to verify
with 100% accuracy that the systems that control an airplane are "correct"
could take centuries using the best computers available. Rather, formal
verification of important parts of systems can help expose bugs and
decrease the probability of failure.
Regarding redundant computers in critical systems, Dr. Stringer-Calvert
was somewhat skeptical. Suppose that redundant systems were implemented
according to protocol, that three separate teams were given the same
design task but that they were rigorously separated and forced to use
different hardware, software, and design methodologies. Suppose then
that these three systems were implemented, and another computer existed
just to tally the results and report the final result, taking a vote
if necessary. It appears that this one vote-taking computer represents
a single point of failure. According to one article he co-authored,
"Mechanisms for fault tolerance are a significant component of many
safety-critical systems: they can account for half the software in a
typical flight-control system, and are sufficiently complicated that
they can become its primary source of failure!"
(http://www.csl.sri.com/reports/html/fmtrends98.html)
Dr. Stringer-Calvert also pointed out that if two computer systems
don't agree on a result, then one of them has bugs and could have been
better implemented from the beginning. One correct system, he asserted,
was better than three systems that could be incorrect and would have
to vote to get a result. After all, what if the computer that was in
the minority had the correct result?
Certification trade groups recognize the importance of formal verification
in systems for aviation, for example. The organization known as RTCA,
or Requirements and Technical Concepts for Aviation, Inc., publishes
a document known as "DO-178B" that specifies guidelines for how reliable
aviation computer systems should be created. This document makes reference
to formal verification as an important method for providing "evidence that
the system is complete and correct with respect to its requirements."
(http://www.csl.sri.com/reports/html/csl-95-1.html) While such formal
verification is not considered mandatory by this trade group, it is
listed as an important tool. Other organizations, such as NASA, do,
however, sometimes require formal verification for their systems.