Was The British Airways Epic Outage Caused By A Contractor Who Accidentally Switched Off The Power?

Last weekend, a computer outage caused epic chaos for British Airways, which was forced to cancel flights from both London Heathrow and Gatwick airports. That in turn caused stranded passengers no end of frustration until everything got sorted. Now, if you were one of those passengers, you'd want to know what caused this. Here’s a possible answer:

A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker. An investigation into the power outage is likely to focus on human error rather than any equipment failure, it said.

I have a bit of a problem with the above, based on my experience with the kind of data centers that companies the size of British Airways typically use. Actually, two problems.

Typically with the sorts of data centers that these companies have, you have two sources of power. One is the feed from your local power company, and the second is a diesel generator which takes over if that feed fails. From there, power goes to one or more massive, centralized uninterruptible power supplies (UPS). The UPS then distributes power to the power distribution units (PDUs) in the racks that hold the servers. The servers plug into the PDUs in the racks, and those PDUs usually have on/off switches that are harder to use than your average on/off switch. Now, it is possible to bump into a power switch on a PDU and take down a couple of servers by accident. But I find it implausible to do something of this scale by accident. The only way I can possibly see a scenario like this happening is if somehow the UPS was shut down. But from my experience, that isn’t exactly easy to do. In fact, it takes effort that goes beyond what would constitute an accident.
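
To make the scale argument concrete, here is a rough sketch of that power chain as a toy model. Everything in it is an assumption for illustration: the class names, the ten racks of four servers, and the decision to ignore UPS battery runtime are all made up, and none of it describes British Airways' actual setup.

```python
from dataclasses import dataclass, field


@dataclass
class PDU:
    """A rack-level power distribution unit (hypothetical model)."""
    name: str
    servers: list[str]
    switched_on: bool = True


@dataclass
class DataCenter:
    """Toy power chain: utility or generator -> UPS -> PDUs -> servers."""
    utility_ok: bool = True
    generator_ok: bool = True
    ups_on: bool = True
    pdus: list[PDU] = field(default_factory=list)

    def powered_servers(self) -> set[str]:
        # The UPS needs at least one upstream source (battery runtime is
        # ignored in this sketch) and must itself be switched on.
        upstream_ok = self.utility_ok or self.generator_ok
        if not self.ups_on or not upstream_ok:
            return set()
        # Only servers plugged into a switched-on PDU receive power.
        return {s for p in self.pdus if p.switched_on for s in p.servers}


def build_example() -> DataCenter:
    # Assumed layout: 10 racks, one PDU per rack, 4 servers per PDU.
    pdus = [PDU(f"pdu-{r}", [f"srv-{r}-{i}" for i in range(4)])
            for r in range(10)]
    return DataCenter(pdus=pdus)
```

In this model, flipping one PDU off only drops the four servers behind it, and losing the utility feed changes nothing as long as the generator is available. Only switching off the UPS takes every server down at once, which is why that is the one scenario I can picture here.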

That brings me to point number two. If a company has systems that cannot fail under any circumstances, it uses a clustered configuration, meaning the system is made up of more than one computer and, if the company is paranoid, those computers may be spread across different locations. The idea is that if one member of the cluster fails, the others pick up the slack. This event kind of implies that British Airways doesn’t employ a cluster or any other fault-tolerant configuration. Given what they do, that seems really weird, as this sort of configuration is commonplace in organizations this size.
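
The failover idea can be sketched in a few lines. This is a deliberately simplified illustration, not how any real cluster manager works: the `Node` class, the site names, and the "first healthy node wins" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One member of a hypothetical cluster."""
    name: str
    site: str
    healthy: bool = True


def pick_active(nodes: list[Node]) -> Optional[Node]:
    """Return the first healthy node, or None if the whole cluster is down."""
    for node in nodes:
        if node.healthy:
            return node
    return None


def fail_site(nodes: list[Node], site: str) -> None:
    """Simulate a site-wide power loss by marking every node there unhealthy."""
    for node in nodes:
        if node.site == site:
            node.healthy = False
```

With two nodes in one location and a third somewhere else, losing power at the first location still leaves a node to pick up the slack. The service only disappears if every site fails, or if there was never a second site to begin with.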

My feeling is that this source is serving up something that doesn’t quite pass the smell test. Though I will admit that there have been cases where something like this has happened and this sort of explanation, no matter how implausible it might be, ends up being the truth. And if that’s the case, British Airways has some serious explaining to do.


