Facebook Explains What Happened Yesterday… And I Explain Why That’s Bad

Yesterday’s Facebook outage was massive and could not have come at a worse time for the company given all the issues that it is facing at the moment. But in an attempt at damage control, the company put out a blog post explaining what happened yesterday:

Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.

Bad as that was, it wasn’t that simple. Brian Krebs got some inside info:

That’s bad. If you can’t get physical or remote access to roll back a bad change, you need to rethink how you deploy changes. You always have to have a rollback plan, and this should not have taken six hours to fix.

This is likely to come up as a topic of conversation in the hearings that Congress is holding. If this were just people being unable to post cat videos for a few hours, that would be one thing. But WhatsApp is effectively a critical piece of communications infrastructure in many countries. Take India, where it is routinely used for communication between patients and doctors, and used by many to pay for things. That’s going to get a lot of attention from a lot of people. And Facebook isn’t going to like the attention it’s getting because it may accelerate people making the call to #DeleteFacebook.

