Headlines have started to pin the blame for the British Airways outage on a single employee, enabling what could become a pretty spectacular feat of corporate arse-covering.
Preliminary investigations have pointed to a single worker’s “human error” as the likely culprit for the failure of BA’s Boadicea House data centre, according to The Times. Reportedly this unlucky contractor switched off an Uninterruptible Power Supply. As a result, power was somewhat interrupted.
According to an email sent by Bill Francis, the head of IT at BA’s parent company, and seen by the Press Association: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries… It was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system.”
There’s a good explanation of what exactly might have happened at the Boadicea House data centre over at The Register.
Regardless, one dead data centre should be no problem. Large companies like BA are expected to have invested in redundancies for just this sort of outage. So where were they?
It’s hard to conceive how an outage at one data centre could also knock out BA’s backups. Perhaps they hadn’t been tested, or there weren’t proper copies of up-to-date data at the second site. Whatever the reason, the shutdown dragged on, affecting passengers throughout the bank holiday weekend.
This looks like a textbook case of a “single point of failure” in BA’s operations. One part of its infrastructure went down, and the rest toppled over like dominoes. It is almost certainly the case that the Boadicea House contractor was not responsible for the failure at BA’s second site. That would have been the job of whole teams of staff, not one person. You can reasonably infer that BA was inadequately prepared.
Any business will pay for IT. It can pay upfront, and avoid – or at least mitigate – the kind of problems BA faced. Or it can do what BA did, and pay later. The bill for this particular crisis will not be cheap: analysts have put the cost somewhere around £100m.
The right answer to such a crisis is for BA’s higher-ups to examine its procedures and determine exactly where the company needs to invest to avoid a repeat.
The contrast with another recent outage – Amazon’s – is instructive. When an engineer’s typo brought down some of its US services, the company refused to blame a single individual. Amazon explained that its inability to rely on redundancies to bring services back online quickly was the real culprit.
From the Amazon Web Services team: “We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes.
“We have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years… and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
BA now needs to take the same approach. While a single worker might have taken it offline, the bigger problem is that many other employees failed in their duty to ensure the company got back up and running with minimal downtime. The responsibility for that failure surely extends up the chain of command.
The right response to any such crisis is to blame your processes, not your people.