#Fail: A Typo Caused AWS To Break The Internet

If you were wondering what caused AWS to break the Internet earlier this week, wonder no longer. Amazon did a post mortem on the incident and here’s what they found:

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

That’s a lot of text. But here’s the part that is crucial:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

In other words, a typo pretty much took down the Internet.

#Fail

Here’s the good news. AWS said it’s adding safety checks and ways to improve recovery times. The tool used to remove servers from the system will be modified to prevent someone from accidentally removing too much capacity at once. And by extension, a single typo shouldn’t be able to take down the entire Internet again.
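To give a rough idea of what a check like that might look like, here is a minimal Python sketch. It is not AWS's actual tool. The function name `check_removal`, the `CapacityError` exception, and the `max_fraction` and `min_remaining` thresholds are all made up for illustration. The idea is simply that a removal request gets refused if it would take out too large a share of the fleet in one go, or leave too few servers running.

```python
class CapacityError(Exception):
    """Raised when a removal request would take out too much capacity."""


def check_removal(active, requested, max_fraction=0.05, min_remaining=3):
    """Return the servers that are safe to remove, or raise CapacityError.

    Hypothetical guard, not AWS's implementation. The thresholds are
    illustrative defaults, not values from the post mortem.
    """
    active = set(active)
    # Only servers that are actually in the active fleet count as removals.
    to_remove = set(requested) & active

    # Refuse requests that remove too large a share of the fleet at once.
    if len(to_remove) > max_fraction * len(active):
        raise CapacityError(
            f"refusing to remove {len(to_remove)} of {len(active)} servers "
            f"(limit is {max_fraction:.0%} per command)"
        )

    # Refuse requests that would drop the fleet below a minimum size.
    if len(active) - len(to_remove) < min_remaining:
        raise CapacityError(
            f"removal would leave fewer than {min_remaining} servers"
        )

    return sorted(to_remove)


if __name__ == "__main__":
    fleet = [f"srv-{i:03d}" for i in range(100)]
    print(check_removal(fleet, fleet[:3]))   # small request goes through
    try:
        check_removal(fleet, fleet[:40])     # fat-fingered request is blocked
    except CapacityError as err:
        print("blocked:", err)
```

With a guard like this in front of the removal command, a mistyped input that matches far more servers than intended fails loudly instead of quietly pulling capacity.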

This entry was posted on March 2, 2017 at 2:23 pm and is filed under Commentary with tags Amazon.

The IT Nerd