Overview

Reliable systems are necessary wherever there is the potential for serious loss of life or resources. Because of the ethical burden placed upon the designers and implementers of reliable systems, it becomes necessary to use methods that are drastically different from those used in traditional software systems. A substantial development effort is required to produce reliable systems, regardless of whether the system is intended for aerospace, military, or civilian use. Moreover, the consequences of failure in critical systems are so dire, beyond the ethical consequences alone, that companies are compelled to expend the effort necessary to deliver such systems. The design of a reliable system is far more deliberate than that of the average off-the-shelf software title. The planning stage typically begins months, even years, before the first line of code is written. Once implementation begins, every design decision and line of code is scrutinized. When the coding is complete, the work is not even half done: the majority of the time is spent debugging, testing, and verifying the system.

Tools and methodologies used in building reliable systems

In discussing the reliability of critical systems, it is important to note the distinction between building reliable subsystems and building a complete system. Software engineering was originally seen as just one of the multitude of tasks involved in building a complete system. Complex systems typically have many points of failure where an error can be catastrophic. Because software engineering is a relatively new field compared to its engineering brethren, the focus often falls on the software portions of these systems when disaster strikes. A complete system is only as reliable as its weakest component, and from the software perspective, the goal of building complex systems is to ensure that the software is not that weakest link.

Among the best tools systems developers have at their disposal are solid methodologies and techniques for developing robust, fault-tolerant systems. These include a well-thought-out set of design documents and specifications, as well as formal verification of the system once it has been written. In addition, the government maintains stringent standards regulating the design of critical systems.

These tools are far from perfect, however, and require time and effort from systems developers to use effectively. Moreover, to realize the benefits of these tools, developers need to understand exactly what each tool is meant to accomplish. Formal verification, for example, is applied to code only after it has been written. It cannot guarantee anything about the quality of the system design; it can only ensure that the code as written meets the specifications for the project.

The real work, therefore, lies in creating a high-quality specification. The challenge of doing good design work lies in the inability of human designers to completely and accurately predict every scenario under which the system will have to perform. If the designers of a system fail to predict certain scenarios, the formal verification tools will have no way of detecting that shortcoming. It is important that designers and testers understand the strengths and limitations of such tools: they are meant to improve the quality of a program alongside careful testing and good implementation practice, rather than to serve as the exclusive means of verifying program correctness.
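
To illustrate this limitation concretely, consider the following minimal sketch (in Python, with assertions standing in for a formal pre- and postcondition; the function and its contract are hypothetical). A verifier can confirm that the body honors the stated contract, but it has no way of flagging a scenario the specification never mentions.

    def feet_to_meters(altitude_ft: float) -> float:
        """Convert a sensed altitude from feet to meters.

        Stated precondition:  0 <= altitude_ft <= 60000
        Stated postcondition: result == altitude_ft * 0.3048
        """
        assert 0 <= altitude_ft <= 60000, "precondition violated"
        result = altitude_ft * 0.3048
        assert result == altitude_ft * 0.3048, "postcondition violated"
        return result

    # Verification can prove the body meets this contract for every input the
    # precondition admits. It cannot point out that the contract says nothing
    # about a faulty sensor reporting, say, -2000 ft: that scenario was never
    # written down, so the "verified" system remains unprotected against it.

The point is not the arithmetic but the shape of the argument: a tool that checks code against a specification inherits every blind spot in that specification.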

Military versus Commercial Sector Differences

One of the questions we had when starting this project was how systems were developed in the military versus how such systems were developed in the commercial sector. In our research, we discovered several things: 1) the government tends to heavily regulate itself and any critical area where lives or critical resources are at stake; 2) good software engineering practices hold regardless of the environment in which development takes place; 3) the government tends to stay with a technology that has proven stable in the past, even as that technology becomes outdated; and 4) competition and the desire for new and better features in the commercial sector drive a more rapid product cycle.

Initially, we thought that the more rapid product cycle used in the commercial sector would cause reliability problems, given the massive debugging and testing effort that typically goes into such systems, but this turned out not to be the case: military and commercial systems alike have at times suffered software reliability disasters. Possible explanations for this observation include: 1) the tremendous amount of testing and debugging does pay off and eliminates a large number of bugs from these systems; and 2) there are always esoteric bugs arising from complex system interactions that are impossible to simulate in a laboratory environment, since simulations are themselves programs designed to test only certain aspects of a system.

Ethical Responsibilities

The ethical responsibilities of a team member working on a reliable software project transcend military and commercial boundaries; they apply to everybody. Every person has a responsibility to do their job to the best of their ability (i.e., following good software engineering practices, exercising great circumspection, conducting peer reviews, and so on). When reliable systems do fail, it may be politically and socially acceptable to blame a person, or the portion of the team, directly responsible for the bug leading to the failure. However, the team as a whole should shoulder the burden for the failure, since in such systems there are many dependencies and relationships among the people who write, test, and implement the system.