Google Voice Outage Caused By Expired Certificates…. REALLY?

Back in mid February, Google Voice went down for about four hours. That left users unable to log in and use their Google Voice accounts. That’s a problem if you rely on Google Voice. And a lot of people and companies do given the times that we live in. Well, Google has released an incident report [Warning: PDF] and it is eyebrow raising. The outage was caused by expired TLS certificates:

Google Voice uses the Session Initiation Protocol (SIP) to control voice calls over Internet Protocol. During normal operation, Google Voice client devices aim to maintain continuous SIP connection to Google Voice services. When a connection breaks, the client immediately attempts to restore connectivity. All Google Voice SIP traffic is encrypted using Transport Layer Security (TLS). The TLS certificates and certificate configurations used by Google Voice frontend systems are rotated regularly.

Due to an issue with updating certificate configurations, the active certificate in Google Voice frontend systems inadvertently expired at 2021-02-15 23:51:00, triggering the issue. During the impact period, any clients attempting to establish or reestablish an SIP connection were unable to do so. These clients were unable to initiate or receive VoIP calls during the impact period. Client devices with an SIP connection that was established before the incident and not interrupted during the incident were unaffected.

And this is what they are going to do to stop this from happening again:

To guard against the issue recurring and to reduce the impact of similar events, we are taking the following actions:

  • Configure additional proactive alerting for upcoming certificate expiration events.
  • Configure additional reactive alerting for TLS errors in Google Voice frontend systems.
  • Improve automated tooling for certificate rotation and configuration updates.
  • Utilize more flexible infrastructure for rapid deployment of configuration changes.
  • Update resource allocation systems to more efficiently provision emergency resources during incidents.
  • Develop training and practice scenarios for emergency rollouts of Google Voice frontend systems and configurations.

Now I expect a small or medium company to have issues keeping track of when certificates that power their infrastructure expire. But for a company the size of Google to have this issue is mind blowing.

Chris Hickman, chief security officer at Keyfactor (www.keyfactor.com), a provider of cloud-first PKI as-a-Service and crypto-agility solutions has this to say:

An outage happens when expired certificates fail to authenticate or establish secure communication tunnels. A certificate expiration on its own is not necessarily a security response incident but is disruptive and can lead to outages like that experienced by Google Voice customers. Certificate expiration is an important mechanism to make sure certificates are still being issued to a valid system, similarly to why a driver’s license or passport needs to be renewed periodically. It offers a check and balance system, in the form of workflow and approvals, to maintain legitimacy and authorization. Changes implemented last year by the CA/B forum reduced the lifetime of an SSL/TLS certificate to 398 days and therefore has compounded the issue of keeping up with expiring certificates.

Recent research found that 73% of enterprise respondents experienced unplanned downtime and outages due to mismanaged digital certificates. More than half of those organizations said they experienced four or more certificate-related outages in the past two years. Service outages due to expired certificates are fairly common – and avoidable. Whether you’re a large enterprise or a small business, certificates expire. The key is maintaining visibility to every certificate on the network to stay ahead of expirations and renewals or better yet, using automation to ensure certificates are renewed prior to expiration without the need for human intervention.

These steps can help IT teams avoid similar outages and potential disruptions: 

  • Conduct an audit to understand how many digital certificates the organization has.
  • Build an inventory to identify where certificates live and what they’re used for. 
  • Document the hash algorithm they use and their overall health. 
  • Flag certificate expiration dates. 
  • Assign or note who owns every certificate.
  • Map the methods used to protect valuable code-signing certificates. 
  • Ensure a centralized method is used to securely update every certificate.”

Maybe Google should reach out to Keyfactor as clearly this is a weak point for them.

Leave a Reply

%d bloggers like this: