So What Happened To Cause Rogers To Go Down On Friday? Rogers Won’t Say In Detail But Cloudflare Has A Good Idea What Happened

Rogers went down hard on Friday, taking all its services offline across Canada. Home phone, cell phone, Internet, and TV were all offline from 5AM EST Friday into Saturday. It even affected debit card payments and 9-1-1 services, among other things. It was Rogers’ second major outage in 15 months, and it enraged people so much that it is driving some to move their business to other telcos. You can get my thoughts on that here. But the real question is what happened to take Rogers down? Well, Rogers hasn’t said what happened. At least not in detail. They put out a statement yesterday via Rogers CEO Tony Staffieri that tries to speak to that, but in my opinion it lacks a lot of detail:

I also want to share what we know about what happened yesterday. We now believe we’ve narrowed the cause to a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning. 

That’s pretty vague. But fortunately for us and unfortunately for Rogers, Cloudflare has a pretty good idea of what happened and has documented it here. Here are the highlights:

Based on what we’re seeing and similar incidents in the past, we believe this is likely to be an internal error, not a cyber attack.

Cloudflare Radar shows a near complete loss of traffic from Rogers ASN AS812, that started around 08:45 UTC (all times in this blog are UTC).

Cloudflare data shows that there was a clear spike in BGP (Border Gateway Protocol) updates after 08:15, reaching its peak at 08:45.
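To make the idea of a "spike in BGP updates" a bit more concrete, here is a minimal Python sketch of how update timestamps could be grouped into 15-minute buckets to find the busiest one. The timestamps are invented for illustration and are not Cloudflare's data, and the function names `bucket_counts` and `peak_bucket` are my own.

```python
from collections import Counter
from datetime import datetime


def bucket_counts(timestamps, minutes=15):
    """Count events per time bucket, keyed by the bucket's start time (HH:MM)."""
    counts = Counter()
    for ts in timestamps:
        # Round the timestamp down to the start of its bucket.
        floored = ts.replace(minute=ts.minute - ts.minute % minutes,
                             second=0, microsecond=0)
        counts[floored.strftime("%H:%M")] += 1
    return counts


def peak_bucket(counts):
    """Return the bucket label with the most events."""
    return max(counts, key=counts.get)


# Invented sample times (UTC), NOT real measurements.
sample = ["08:02", "08:20", "08:31", "08:46", "08:47", "08:50", "08:58", "09:05"]
timestamps = [datetime.strptime(t, "%H:%M") for t in sample]

counts = bucket_counts(timestamps)
peak = peak_bucket(counts)
print(peak, counts[peak])
```

With this made-up sample, the busiest bucket is the one starting at 08:45, which is the kind of shape the Cloudflare chart describes.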

BGP is a mechanism to exchange routing information between networks on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver each network packet to its final destination. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t exist.

The Internet is literally a network of networks, or for the maths fans, a graph, with each individual network a node in it, and the edges representing the interconnections. All of this is bound together by BGP. BGP allows one network (say Rogers) to advertise its presence to other networks that form the Internet. Rogers is not advertising its presence, so other networks can’t find Rogers network and so it is unavailable.
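The "network of networks" idea can be sketched in a few lines of Python. This is only an illustration: the topology below is hypothetical (the names `TransitA` and `TransitB` are made up), and real BGP path selection is far more involved than a simple graph search. It just shows that once a network is no longer part of the graph, nothing can find a path to it.

```python
from collections import deque


def reachable(links, start, goal):
    """Breadth-first search over a graph of networks (node -> set of peers)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def without_network(links, gone):
    """Return a copy of the graph in which one network is no longer advertised."""
    return {
        node: {peer for peer in peers if peer != gone}
        for node, peers in links.items()
        if node != gone
    }


# Hypothetical topology, for illustration only.
links = {
    "Rogers": {"TransitA", "TransitB"},
    "TransitA": {"Rogers", "TransitB", "Cloudflare"},
    "TransitB": {"Rogers", "TransitA"},
    "Cloudflare": {"TransitA"},
}

before = reachable(links, "Cloudflare", "Rogers")
after = reachable(without_network(links, "Rogers"), "Cloudflare", "Rogers")
print(before, after)
```

Before the change there is a path from Cloudflare to Rogers. After Rogers drops out of the graph, there is none, even though every other network is still up.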

A BGP update message informs a router of changes made to a prefix (a group of IP addresses) advertisement or entirely withdraws the prefix. In this next chart, we can see that at 08:45 there was a withdrawal of prefixes from Rogers ASN.
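Here is a rough Python sketch of what announcing and withdrawing prefixes does to a router's view of the world. It is a toy model and not a BGP implementation: the `RoutingTable` class and the tuple format for updates are assumptions of mine, the prefixes are reserved documentation ranges rather than Rogers' real ones, and AS64500 is a reserved documentation ASN. Only AS812 comes from the Cloudflare post.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    # Maps a prefix to the set of origin ASNs currently advertising it.
    routes: dict = field(default_factory=dict)

    def apply(self, update):
        """Apply one (kind, prefix, asn) update to the table."""
        kind, prefix, asn = update
        if kind == "announce":
            self.routes.setdefault(prefix, set()).add(asn)
        elif kind == "withdraw":
            origins = self.routes.get(prefix, set())
            origins.discard(asn)
            if not origins:
                # Nobody advertises this prefix any more, so forget it.
                self.routes.pop(prefix, None)
        else:
            raise ValueError(f"unknown update kind: {kind}")

    def prefixes_for(self, asn):
        """List the prefixes that the given ASN is currently advertising."""
        return sorted(p for p, origins in self.routes.items() if asn in origins)


table = RoutingTable()

# Illustrative announcements using documentation prefixes.
for update in [
    ("announce", "192.0.2.0/24", 812),
    ("announce", "198.51.100.0/24", 812),
    ("announce", "203.0.113.0/24", 64500),
]:
    table.apply(update)
announced = table.prefixes_for(812)

# The withdrawals: AS812 pulls every prefix it had advertised.
for prefix in announced:
    table.apply(("withdraw", prefix, 812))
remaining = table.prefixes_for(812)

print(announced, remaining)
```

Once the last withdrawal is applied, the table has no entries left for AS812, while the unrelated network's prefix is untouched. That is the "erased off the Internet" effect in miniature.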

So in short, it appears from Cloudflare’s perspective that Rogers made a change that took the entire Rogers network down by effectively erasing the Rogers network off the Internet. And the fact that this happened inside Rogers’ 2AM to 6AM EST maintenance window makes Cloudflare’s perspective more believable than anything that Rogers is saying at the moment.

As if to back this up, someone took the BGP update that came from Rogers and turned it into a visualization that shows how Rogers was effectively erased off the Internet:

What you’re seeing is that as routes disappear, the Rogers network tries to find new ways to route traffic. But it eventually runs out of routes, and at that point they are down.

All of this is pretty damning and doesn’t put Rogers in the best light. This is why I would like to see Rogers publish a root cause analysis that provides way more detail than the lame statement that Rogers put out yesterday, preferably verified by a third party that people can trust, with complete details of what they are going to do to make sure that something like this never happens again. And by detail, I mean beyond this action plan that Rogers put out:

As CEO, I take full responsibility for ensuring we at Rogers earn back your full trust, and am focused on the following action plan to further strengthen the resiliency of our network:

  1. Fully restore all services: While this has been nearly done, we are continuing to monitor closely to ensure stability across our network as traffic returns to normal. 
  2. Complete root cause analysis and testing: Our leading technical experts and global vendors are continuing to dig deep into the root cause and identify steps to increase redundancy in our networks and systems. 
  3. Make any necessary changes: We will take every step necessary, and continue to make significant investments in our networks to strengthen our technology systems, increase network stability for our customers, and enhance our testing. 

The cynic in me reads this action plan like this:

  • Rogers is lacking in the area of redundancy in its networks and systems, and now it’s a problem. Especially since they didn’t seem to have a rollback strategy for whatever they were doing that caused this. That’s mind-blowing, as having a rollback strategy is IT 101.
  • Rogers needs to make investments in strengthening their technology systems because they haven’t done so in the past, and now it’s a problem.
  • Rogers needs to make investments in network stability because they have not done so in the past and now it’s a problem.
  • Rogers needs to enhance their testing because it is lacking, and now it’s a problem. I am guessing that they didn’t test whatever they were doing enough, or at all.

What Canadians need to know is what specific actions Rogers will take, when they will start taking action, and how long it is going to take. Because at the moment Rogers looks pretty bad based on the info supplied by Cloudflare.

So how about it, Rogers? Will you come clean about what happened on Friday in detail? Will you provide a detailed action plan to remedy this with timetables? Canadians deserve complete and fulsome answers from you, as this was a far-reaching and catastrophic event that suggests that no Canadian should ever trust you or your services again unless you give Canadians a significant reason or reasons to trust you going forward.
