Overview
Reliable systems are necessary in cases where there is the potential
for serious loss of life or resources. Because of the ethical burden
placed upon designers and implementers of reliable systems, it becomes
necessary to use methods that are drastically different from those used
in traditional software systems. A substantial development effort is
necessary to produce reliable systems, regardless of whether the system
is being used for aerospace, military, or civilian purposes. In addition,
the consequences of failure in critical systems are so dire that, beyond
the ethical considerations, companies are compelled to expend the effort
necessary to deliver such systems. The design of a reliable system is more
deliberate than that of the average off-the-shelf software title. The planning
stage typically begins months, even years, before the first line of
code is ever written. Once implementation begins, every design decision
and line of code is scrutinized. When the coding is complete, the work
is not even half done. The majority of time is spent debugging, testing,
and verifying the system.
Tools and Methodologies Used in Building Reliable Systems
In discussing the reliability of critical systems, it is important
to note the distinction between building reliable subsystems and building
a complete system. Software engineering was originally seen as one of
the multitude of tasks in building a complete system. In complex systems,
there are typically many points of failure, where an error might be
catastrophic. Because software engineering is a relatively new field
compared to its engineering brethren, the focus often falls on software
portions of these systems when disaster strikes. A complete system is
only as reliable as its weakest component, and the goal of building complex
systems from the software perspective is to ensure that software is
not that weakest link.
Among the best tools systems developers have at their disposal are
solid methodologies and techniques for developing robust, fault-tolerant
systems. These tools include a well-thought-out design and set of specifications,
as well as formal verification of the system once it has been written.
In addition, the government maintains a set of stringent standards regulating
the design of critical systems.
These tools are far from perfect, however, and require time and effort
from the systems developers to use effectively. In addition, to realize
the benefits of these tools, developers need to understand exactly what
each tool is meant to accomplish. Formal verification is a tool that
is applied to code only after it has been written. It cannot guarantee
anything about the quality of system design. It can only ensure that
the code written meets the specifications for the project.
The real work, therefore, lies in creating a high-quality specification.
The challenge of doing good design work lies in the inability of human
designers to completely and accurately predict all possible scenarios
under which the system will have to perform. If the designers of a system
fail to predict certain scenarios, then the formal verification tools
will have no way of detecting such a shortcoming. It is important that
designers and testers understand exactly the strengths and limitations
of such tools. They are meant to be used, alongside careful testing and
good implementation practices, as aids to improving the quality of a program,
rather than as the exclusive means of verifying program correctness.
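To make the limitation concrete, here is a small, purely hypothetical sketch
(the sensor routine, its specification, and the glitch scenario are invented
for illustration and taken from no particular system). The specification
covers only the readings its authors anticipated, so code that demonstrably
satisfies it can still misbehave in an unanticipated scenario:

    # Hypothetical example: a routine that satisfies its written
    # specification can still fail in scenarios the specification
    # never anticipated. All names and values are invented.

    def raw_to_altitude_m(raw: int) -> float:
        """Convert a 10-bit sensor reading (0..1023) to an altitude in metres."""
        return raw * 5000.0 / 1023.0

    def meets_spec(raw: int) -> bool:
        """The written specification: for readings 0..1023 the result
        lies between 0 and 5000 metres."""
        return 0.0 <= raw_to_altitude_m(raw) <= 5000.0

    # "Verification" in miniature: check every input the designers anticipated.
    # The code satisfies this specification completely.
    assert all(meets_spec(raw) for raw in range(1024))

    # A scenario the designers never wrote into the specification: a faulty
    # sensor returns -1. The specification is silent, so no checker can
    # object, yet the routine reports a physically meaningless altitude.
    print(raw_to_altitude_m(-1))  # roughly -4.9 metres

The point of the sketch is not the arithmetic but the gap: every tool applied
downstream of the specification, formal or otherwise, inherits whatever
scenarios the specification's authors failed to imagine.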
Military versus Commercial Sector Differences
One of the questions we had starting this project was how systems were
developed in the military versus how such systems were developed in
business. In our research, we discovered several things: 1) the government
tends to heavily regulate itself, along with any critical area where
lives or critical resources are at stake; 2) good software engineering
practices hold regardless of the environment in which development takes
place; 3) the government tends to stay with a technology that has proven
stable in the past, even as that technology becomes outdated; and 4)
competition and the desire for new and better features in the commercial
sector drive a more rapid product cycle.
Initially, we thought that the more rapid product cycle used in the
commercial sector would cause reliability problems, given the massive
debugging and testing effort that typically goes into such systems,
but this turned out not to be the case, as both military and commercial
systems have at times suffered software reliability disasters.
Possible explanations for this observation include: 1) the tremendous
amount of testing and debugging does pay off, eliminating a large
number of bugs from these systems; 2) there are always esoteric bugs
arising from complex system interactions that are impossible to simulate
in a laboratory environment, since simulations are themselves programs
designed to test certain aspects of a system.
Ethical Responsibilities
The ethical responsibilities of a team member working on a reliable
software project transcend military/commercial boundaries; they apply
to everybody. Every person has a responsibility to do their job to the
best of their ability (e.g., following good software engineering practices,
exercising great circumspection, and participating in peer review). When
reliable systems do fail, it may be politically and socially acceptable
to blame a person, or a portion of the team, directly responsible for the
bug leading to the
failure. However, the team as a whole should shoulder the burden for
the failure, since in such systems there are many dependencies and relationships
between the people who write, test, and implement the system.