London Ambulance Disaster

In October 1992, the London Ambulance Service suffered a disaster that brought its operations to a virtual standstill for 36 hours and cost approximately 20 people their lives. Upon investigation, it was discovered that the new computer-aided dispatch (CAD) software was responsible for the crisis.

The problem lay in the design of the software, which was wholly inadequate for the needs of the London Ambulance Service. So poorly did the system perform that response times to emergency calls reached as high as 11 hours during the 36-hour crisis.

Some of the worst problems in the system included the following:

  • Inability of the software to distinguish between duplicate calls from different people pertaining to the same incident (see the sketch after this list).
  • Failure of the software to keep track of logged calls. One particularly tragic case was a woman who called the emergency number after being trapped by the body of her collapsed husband. She repeatedly called the ambulance service every half-hour, only to be told that there was no record of her earlier call. When an ambulance eventually arrived nearly 3 hours later, her husband had already died.
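
The first of these problems, recognizing that several calls describe one incident, is essentially a deduplication problem. The following is a minimal, hypothetical sketch of how incoming calls might be grouped into incidents; it is not the LAS CAD system's actual logic, and the one-hour window and exact-location match are assumptions made for illustration.

    # Illustrative sketch only (not the LAS CAD system's actual logic):
    # group incoming calls into incidents when they report the same location
    # within a short time window, so repeat callers do not spawn duplicate jobs.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Call:
        caller: str
        location: str              # assumed to be a normalized address or grid reference
        minutes_since_start: int   # minutes since the shift began

    @dataclass
    class Incident:
        location: str
        first_call_minutes: int
        calls: List[Call] = field(default_factory=list)

    TIME_WINDOW_MINUTES = 60       # assumed window for treating calls as one incident

    def assign_call(incidents: List[Incident], call: Call) -> Incident:
        """Attach the call to a matching open incident, or open a new one."""
        for incident in incidents:
            same_place = incident.location == call.location
            recent = call.minutes_since_start - incident.first_call_minutes <= TIME_WINDOW_MINUTES
            if same_place and recent:
                incident.calls.append(call)
                return incident
        new_incident = Incident(location=call.location,
                                first_call_minutes=call.minutes_since_start,
                                calls=[call])
        incidents.append(new_incident)
        return new_incident

Even a rule this crude keeps repeat calls attached to a logged incident, so a caller ringing back can at least be told that her earlier call is on record.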

 

A more detailed analysis revealed the chain of events that led to such a catastrophic failure. The official report, released in February of the following year, stated that "neither the Computer Aided Dispatch system itself, nor its users, were ready for full implementation on 26 October 1992. The CAD software was not complete, not properly tuned, and not fully tested." Before the CAD system, the LAS had relied on a manual, paper-based approach to dispatching, in which urgent calls were written down, transcribed, and passed on by hand.

The LAS had also been criticized in public forums for slow response times under its manual dispatch system, and its upper management believed that a computer-based dispatch program was the quickest way to improve response times and efficiency. The LAS therefore pressured the contractor to compress the development timeline beyond what was prudent, resulting in a lower-quality product.

Perhaps because of the compressed timeline, the LAS allowed the system to "go live" on October 26, 1992 with staff who had little or no training in it. This single-phase deployment lacked testing and critical oversight, and was a contributing factor in the subsequent disaster. Even worse, given such a rushed system, the LAS had no backup plan and no failover mechanism to handle problems in the CAD system.

In addition, Systems Options, the firm contracted by the LAS to build the software, was completely out of its depth. Systems Options had no prior experience writing large, complex systems, much less mission-critical CAD systems on which people's lives would depend. Allegations arose that the LAS had simply settled for the lowest bid from competing software vendors, rather than choosing a contractor with the experience and depth needed to deliver a reliable product.

The LAS made critical mistakes at every juncture in the design, development, and deployment of its CAD system, and there were unfortunately no checks in place to prevent the ambulance crisis from occurring.

China Airlines A300 Disaster

In April 1994, a China Airlines A300 crashed at Japan's Nagoya airport, killing 264 of the 271 people on board. The most likely cause of the crash was not software alone, but the confused interaction between software and human: in this case, between the 26-year-old copilot, who was attempting to land the plane, and the aircraft's autopilot.

Two minutes before the plane was due to land, the autopilot went into take-off/go-around mode for reasons the investigation could not determine. In effect, this caused the autopilot to try to control the plane in a way that was directly opposite to what the human pilot was attempting to do.

Despite warnings from the pilot, the copilot continued to attempt to land the plane with the autopilot in go-around mode. The autopilot was therefore attempting to gain altitude by pitching the nose up, while the copilot tried to continue the descent using other flight controls. The crew then switched the autopilot out of go-around mode, but could not undo some of the trim changes the autopilot had made to the horizontal stabilizer, which caused the plane to keep gaining altitude.

The continued climb prevented the plane from landing, so the crew engaged the autopilot's go-around mode again, causing the plane to continue climbing. Finally, the aircraft stalled after its angle of ascent increased to 53 degrees, and the plane fell toward the ground, crashing tail-first.

This is not an overt case of software failure costing lives, but rather a case where the specification and interface were less than optimal for communication between the human pilot and the autopilot. The design of the software was such that there were no audio cues signifying when the autopilot was engaged or disengaged. Also, below a certain critical altitude the autopilot resisted de-activation, because the designers feared that below this altitude there would be insufficient time for a human pilot to regain control of the aircraft.

In addition, the system lacked a way to resolve a conflict of control: the question of who to trust in such a situation, the autopilot or the human pilot, was never addressed. If the system had built-in safeguards that resolved conflicting actions and handed control over to a single agent, then perhaps this disaster might have been averted.
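
One way to picture such a safeguard is the following minimal, hypothetical sketch; it is not the A300's actual control logic, and the threshold value is an assumption made for illustration. When the pilot's and the autopilot's pitch commands strongly oppose each other, authority is handed to a single agent, here the human pilot.

    # Illustrative sketch only (not the A300's actual control laws):
    # if the pilot and the autopilot command pitch in opposite directions,
    # hand full authority to one agent instead of letting the two fight.
    CONFLICT_THRESHOLD_DEG = 5.0   # assumed gap, in degrees of commanded pitch

    def resolve_pitch_command(pilot_cmd_deg: float, autopilot_cmd_deg: float) -> tuple:
        """Return (authority, pitch command) after checking for a control conflict."""
        opposing = pilot_cmd_deg * autopilot_cmd_deg < 0
        large_gap = abs(pilot_cmd_deg - autopilot_cmd_deg) > CONFLICT_THRESHOLD_DEG
        if opposing and large_gap:
            # Conflict detected: disengage the autopilot and let the human fly.
            return ("pilot", pilot_cmd_deg)
        return ("autopilot", autopilot_cmd_deg)

    # Example: the autopilot, in go-around mode, commands nose up (+10 degrees)
    # while the copilot commands nose down (-5 degrees); authority goes to the pilot.
    # resolve_pitch_command(-5.0, 10.0) -> ("pilot", -5.0)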

Lauda Air B767 Accident

On May 26, 1991, a Lauda Air Boeing 767 suffered an in-flight failure and broke apart over Thailand at an altitude of approximately 7,000 meters after departing from Bangkok. Some questions about the precise cause remain unresolved, owing to a damaged flight data recorder, but the strongest possibility is that a thrust reverser deployed during the flight. The reverser's deployment reduced the airplane's lift by 25%, causing the flight crew to lose control.

The problem lay in design and testing. Initially, Boeing claimed that software was in place that made accidental in-flight deployment of the thrust reversers impossible. Later simulations showed, however, that the failure of certain physical locks could lead to a scenario in which a thrust reverser deployed despite the software supposedly in place to prevent it. Another possibility investigated was a fault in the proximity switch electronics unit (PSEU) and its accompanying operating software.

Simulations of scenarios similar to the crash showed that unless full wheel and full rudder were applied within seconds of such a thrust-reverser deployment, the airplane would no longer be capable of controlled flight. The same report concluded that "recovery from the event was uncontrollable [sic] for an unexpecting flight crew".

Because the investigating groups never positively identified the cause as an in-flight thrust-reverser deployment, the precise sequence of events that led to the crash remains speculation.

However, later testing made clear that there were flaws in the system Boeing had designed. The question of where the responsibility for such critical flaws lies remains unresolved. Although software failure is only one of several possibilities in this case, why did Boeing fail to isolate the multiple problems, both software- and hardware-related, during the testing phase of the 767's subsystems?

The other interesting aspect is the tension between the amount of control given to software automation, such as the autopilot, and the amount of control given to human users. In this case, it is possible that over-reliance on software automation decreased the crew's readiness to respond to emergencies by fostering a sense of complacency: the crew may have felt the computer would handle most emergency situations adequately.

Airbus A320 Crash in France:
User Interface in Critical Systems Can Be Critical

The overall reliability in an otherwise robust safety-critical system can be compromised by a poor human-computer interface, as this case study shows.

Synopsis: On January 20, 1992, an Airbus A320 jetliner crashed near Strasbourg, France. A board of inquiry found the fault to lie in "pilot error." However, others have criticized the design of the A320's "glass cockpit" which, allegedly, was confusing and hampered the pilots' ability to monitor flight conditions, such as the high rate of descent experienced by the doomed plane. Further, the system did not warn the pilots of danger in time for corrective action.

The A320 jetliner was introduced in 1987 by Airbus. There had been two crashes of A320s before the 1992 crash. The first occurred in 1988, when a plane owned by Air France crashed during a demonstration flight at an air show. The official cause was reckless piloting, though the pilot insists the plane failed to warn him of the loss of altitude. The second involved an Indian Airlines A320 landing at Bangalore: a pilot pushed an incorrect button, which idled the engines, causing the plane to drop rapidly and crash-land on a golf course.

The 1992 crash of the Air Inter flight near Strasbourg killed 87 people. The pilots were apparently unaware of the plane's too-rapid descent as it approached Strasbourg. The only warning may have come a second before the crash, from a system that measures the aircraft's altitude using a radio beam and calls out altitudes at certain height intervals: the pilots received a low-altitude message, but with no time left to do anything about it. A second warning system, which alerts pilots to an excessive descent rate or low altitude, was not installed on Air Inter's aircraft; it was not required, and Air Inter felt it gave too many false warnings, leading pilots to ignore them.

How did the plane begin descending too fast? It is believed that the pilots had confused the "vertical-speed" and "flight-path-angle" modes of descent and were in the wrong mode. The two modes had very similar display formats. The pilots were very busy at the time making a last-minute change in the flight plan requested by the tower, and thus were probably concentrating on the navigational display, so the altitude and vertical-speed indicators on the main display were overlooked.
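
The following minimal, hypothetical sketch illustrates how two descent modes that interpret the same pilot entry differently, yet render it in a nearly identical terse format, can be confused, and how attaching the mode and units to the display removes the ambiguity. It is not the A320's actual display logic, and the numeric conventions are assumptions made for illustration.

    # Illustrative sketch only (not the A320's actual display logic):
    # the same entered number means different things in the two descent modes,
    # but a terse display renders both almost identically.
    from enum import Enum

    class DescentMode(Enum):
        VERTICAL_SPEED = "V/S"       # entry interpreted as hundreds of feet per minute (assumed)
        FLIGHT_PATH_ANGLE = "FPA"    # entry interpreted as tenths of a degree (assumed)

    def terse_display(mode: DescentMode, entered_value: float) -> str:
        """Ambiguous style: both modes show the same bare number."""
        return f"-{entered_value:.0f}"

    def explicit_display(mode: DescentMode, entered_value: float) -> str:
        """Safer style: the rendered value carries its mode and units."""
        if mode is DescentMode.VERTICAL_SPEED:
            return f"V/S -{entered_value * 100:.0f} ft/min"
        return f"FPA -{entered_value / 10:.1f} deg"

    # The same keypad entry, 33, looks identical in the terse style ("-33")
    # but unmistakably different in the explicit style.
    for mode in DescentMode:
        print(mode.value, terse_display(mode, 33), explicit_display(mode, 33))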

An inquiry board deemed the cause of the accident to be pilot error, because the pilots should have noticed (or not made) the error in the descent mode. Others feel some of the blame lies in the design of the system interface. Flint Pellet, of Global Information Systems Technology, wrote on the RISKS forum that it is "more of a user-interface design error, if you ask me. If you overload a person with things to do and input to consider to the point where they can no longer keep up, it is hardly reasonable to simply brush it off as 'human error' when they fail to keep up."

The A320 uses extensive computer control. The control system has full authority to override a pilot's actions: for example, it allows the pilot to bank the aircraft only so far, so that stresses on the plane do not exceed its limits. Some pilots have claimed this is restrictive in the event of an emergency. Further, the plane is so automated that the pilot is for the most part reduced to the role of system manager, programming in flight paths and the like, while the computer actually flies the plane. With little to do, pilots can become complacent. The computers display flight information on screens in the cockpit, hence the "glass cockpit". These displays, as implemented in the A320, make it somewhat more difficult to monitor trends in flight data than traditional mechanical instruments.
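
As a rough illustration of that kind of envelope protection, consider the following hypothetical sketch; the limit value is an assumption for illustration, not Airbus's actual control law. The flight computer simply clamps the commanded bank angle to a fixed limit, no matter how far the pilot pushes the sidestick.

    # Illustrative sketch only (assumed limit, not Airbus's actual control laws):
    # envelope protection clamps the pilot's commanded bank angle to a limit
    # so that stresses on the airframe stay within bounds.
    MAX_BANK_DEG = 67.0   # assumed limit, in degrees

    def limit_bank_command(commanded_bank_deg: float) -> float:
        """Return the bank angle the flight computer will actually permit."""
        return max(-MAX_BANK_DEG, min(MAX_BANK_DEG, commanded_bank_deg))

    # Example: a pilot commanding a 90-degree bank gets only the 67-degree limit.
    # limit_bank_command(90.0) -> 67.0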

So, while the crash of the A320 in Strasbourg was due to a pilot error in inputting an incorrect flight mode, at least part of the blame lies in a user interface that made it hard for the error to be detected. The human factor must be carefully considered in the design of a safety-critical system, including such factors as complacency arising from little interaction with the system.

Computer Failures in Two Traffic Systems

These case studies remind us to test "upgrades" before installing them on critical systems, and to think about what the safest state for a system is in the event of an error.

Synopsis: The traffic system in Austin, Texas, failed after a software modification was installed without being tested first. The traffic system in Lakewood, Colorado, failed when the only disk drive on the only computer failed. In both cases the lights defaulted to blinking red, causing massive backups and a multitude of accidents.

On April 13, 1990, programmers for the city of Austin, Texas, modified software controlling the city's stoplights. The software was loaded into the main computer, which sent the changes to the stoplights. The software had not been properly tested beforehand, and about 360 of the roughly 600 intersections controlled by the system received erroneous data. On receiving the bad data, the lights defaulted to blinking red, bringing traffic to a grinding halt and causing widespread accidents.

In order to fix the system, each intersection had to be individually reset by city work crews. This could not be done remotely.

Some critics, such as King Ables of Micro Electronics and Computer Technology, believe it would have been better for the lights to fall back to a green-yellow-red cycle with a default timing, and that this would have lessened the traffic problems and accidents.
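
That fail-safe idea might look something like the following minimal, hypothetical sketch; it is not the Austin system's actual controller logic, and the plan format and default timings are assumptions made for illustration. When downloaded timing data fails a sanity check, the controller falls back to a fixed default cycle rather than to all-way blinking red.

    # Illustrative sketch only (not the Austin system's actual controller logic):
    # when a downloaded timing plan looks bad, fall back to a fixed default
    # green-yellow-red cycle instead of all-way blinking red.
    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    # Assumed default cycle: (signal color, duration in seconds).
    DEFAULT_CYCLE: List[Tuple[str, int]] = [("green", 30), ("yellow", 4), ("red", 34)]

    @dataclass
    class TimingPlan:
        phases: List[Tuple[str, int]]

    def is_sane(plan: Optional[TimingPlan]) -> bool:
        """Basic sanity checks on a downloaded timing plan."""
        if plan is None or not plan.phases:
            return False
        return all(duration > 0 for _, duration in plan.phases)

    def select_plan(downloaded: Optional[TimingPlan]) -> TimingPlan:
        """Use the downloaded plan if it passes the checks; otherwise the default cycle."""
        if is_sane(downloaded):
            return downloaded
        return TimingPlan(phases=DEFAULT_CYCLE)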

Information on whether fire and rescue vehicle response times were hampered was not available.

In the Lakewood, Colorado, case, a hard drive running the city's traffic management software failed on February 27, 1990. The city had no backup computer or drive, though the hard drive was backed up on tape. Of course, there was no drive to restore the tape contents onto.

As in the Austin case, the lights defaulted to blinking red and had to be individually reset at each intersection. Traffic was extremely slow and the accident rate high.