Last week, Microsoft had a major outage that affected many of its services, including:
- Teams
- Xbox Live
- Outlook
- Microsoft 365
- Minecraft
- Azure
- GitHub
- Microsoft Store
At the time, Microsoft said that a networking change caused this, and I said this:
My question for Microsoft, which I hope they answer, is what specifically happened and what they will do to ensure that it doesn’t happen again. Microsoft does give out some version of this information, so I for one will be interested to see what they say.
And now Microsoft has posted a Preliminary Post Incident Review that goes into more detail and answers the questions that I had:
We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed.
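To unpack that a bit: this is, in effect, a routing reconvergence event across the whole WAN. Here’s a toy sketch in Python (entirely my own illustration, not Microsoft’s network or tooling) of why packets get dropped while every router is busy rebuilding its forwarding table:

```python
# Toy model (my own illustration, not Microsoft's tooling): one command floods
# routing updates to every router, and packets are dropped while the routers
# recompute their forwarding tables.

class Router:
    def __init__(self, name: str):
        self.name = name
        self.converged = True  # router has a valid forwarding table

    def receive_topology_update(self) -> None:
        # A burst of routing messages forces a full recomputation of the
        # adjacency and forwarding tables.
        self.converged = False

    def finish_recomputation(self) -> None:
        self.converged = True

    def forward(self, packet: str) -> str:
        # While recomputing, the router has no reliable next hop for traffic.
        return "delivered" if self.converged else "dropped"

wan = [Router(f"router-{i}") for i in range(5)]

# One command on one router ends up notifying *all* routers in the WAN...
for router in wan:
    router.receive_topology_update()

# ...so traffic traversing the WAN is dropped until reconvergence completes.
print([router.forward("packet") for router in wan])  # all 'dropped'

for router in wan:
    router.finish_recomputation()
print([router.forward("packet") for router in wan])  # all 'delivered'
```

Once the recomputation finishes, forwarding resumes on its own, which lines up with Microsoft saying below that the network "started to recover automatically."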
And this is how they responded:
Our monitoring initially detected DNS and WAN related issues from 07:12 UTC. We began investigating by reviewing all recent changes. By 08:10 UTC, the network started to recover automatically. By 08:20 UTC, as the automatic recovery was happening, we identified the problematic command that triggered the issues. Networking telemetry shows that nearly all network devices had recovered by 09:00 UTC, by which point the vast majority of regions and services had recovered. Final networking equipment recovered by 09:35 UTC.
Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network. Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC.
And this is how they will stop this from happening again (I’ve added a rough sketch of what the first item could look like, after the list):
- We have blocked highly impactful commands from getting executed on the devices (Completed)
- We will require all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023)
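To give a feel for what “blocking highly impactful commands” and “safe change guidelines” can mean in practice, here’s a rough sketch (my own hypothetical guardrail, with made-up command and device names, not anything Microsoft has published) of a pre-execution check that refuses to run a command unless it has been qualified for that specific device:

```python
# Hypothetical guardrail (my own sketch, not Microsoft's actual system):
# block known highly impactful commands outright, and refuse anything that
# has not passed a qualification process for the specific device model.

BLOCKED_COMMANDS = {"clear bgp neighbors all", "reload"}        # made-up examples
QUALIFIED_CHANGES = {("update interface ip", "edge-router-x")}  # (command, device model)

def run_on_device(command: str, device_model: str) -> str:
    if command in BLOCKED_COMMANDS:
        raise PermissionError(f"'{command}' is blocked on production devices")
    if (command, device_model) not in QUALIFIED_CHANGES:
        raise PermissionError(
            f"'{command}' has not been qualified for {device_model}; "
            "it must go through the safe change process first"
        )
    return f"executing '{command}' on {device_model}"

print(run_on_device("update interface ip", "edge-router-x"))  # allowed
# run_on_device("reload", "edge-router-x")  # would raise: blocked outright
```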
This is all good, and I really wish that other companies would do the same thing, as you’re more likely to trust a company that is open and transparent. Kudos to you, Microsoft.