Amazon Explains Why AWS Went Down On Tuesday

On Tuesday, AWS suffered a massive outage that took a large part of the Internet down with it. Amazon has since published a post-event summary explaining what went wrong. It is a detailed and fairly technical explanation:

To explain this event, we need to share a little about the internals of the AWS network. While the majority of AWS services and all customer applications run within the main AWS network, AWS makes use of an internal network to host foundational services including monitoring, internal DNS, authorization services, and parts of the EC2 control plane. Because of the importance of these services in this internal network, we connect this network with multiple geographically isolated networking devices and scale the capacity of this network significantly to ensure high availability of this network connection. These networking devices provide additional routing and network address translation that allow AWS services to communicate between the internal network and the main AWS network. At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
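The last part of that explanation describes a feedback loop: delays cause errors, errors cause retries, and the retries add to the congestion. A common way for client code to avoid feeding a loop like that is to retry with capped exponential backoff and random jitter. The Python below is only a minimal sketch of that general idea. It is not taken from Amazon's summary and does not describe how AWS clients actually behave; the function name `call_with_retries` and its parameters are hypothetical and chosen purely for illustration.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait a random, growing delay and retry.

    Hypothetical helper for illustration only. The delay ceiling doubles on
    each failed attempt (base, 2*base, 4*base, ...) and never exceeds `cap`.
    The actual wait is drawn uniformly between 0 and that ceiling, so many
    clients failing at the same moment do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: give up and let the caller see the error.
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

Without the jitter and the cap, every client that failed at the same time would come back at the same time, which is the kind of synchronized surge the summary describes. Limiting the number of attempts matters just as much, since unlimited retries keep the pressure on a network that is already struggling.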

Hopefully Amazon addresses this so that it doesn’t happen again, though I am not optimistic given that AWS doesn’t exactly have a good track record when it comes to stability.
