Here’s why a bunch of Google services went down on Sunday
Last Sunday, Google’s cloud services suffered an outage that lasted several hours. Services such as Google Cloud Platform, YouTube, Gmail, and Google Drive were affected in parts of the US, as were third-party services that run on Google Cloud Platform, including Snapchat and iCloud, among others. Google has since detailed both the cause of the outage and its plans to prevent it from happening again.
Google’s report opens with an apology, since both companies and individual users rely on these services. Users of Google services in affected areas had their requests handed off to servers in other regions, which is fine for web searches but can cause problems for bandwidth-heavy services like YouTube. Third-party applications without appropriate fallbacks simply didn’t work for the duration of the outage. The impact on the company’s own services was substantial:
- YouTube views dropped by 10% worldwide
- Google Cloud Storage had a 30% reduction in traffic
- Approximately 1% of Gmail users had problems
- Low-bandwidth services like Google Search were only mildly affected, seeing increased latency as requests were rerouted to unaffected regions
Put simply, the cause of the outage was “a configuration change that was intended for a small number of servers in a single region” being “incorrectly applied to a larger number of servers across several neighboring regions”. This caused these servers to stop using more than half of their available network capacity, resulting in network congestion. To make matters worse, the same network congestion that may have stopped you watching a YouTube video stopped the company’s engineers from restoring the correct configurations.
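Google has not published the tooling involved, so the following is only a hypothetical sketch of the kind of safeguard that could catch a change leaking beyond its intended scope. It checks a proposed list of target servers against the single region the change was meant for and against a cap on how many servers it may touch. The names `Server`, `check_rollout_scope`, `intended_region` and `max_servers` are all invented for illustration and are not taken from Google’s report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical record for one machine targeted by a config change.
    name: str
    region: str


def check_rollout_scope(targets, intended_region, max_servers):
    """Return a list of reasons to block the change; an empty list means it may proceed."""
    problems = []

    # Flag any target that sits outside the one region the change was meant for.
    outside = [s.name for s in targets if s.region != intended_region]
    if outside:
        problems.append(
            f"{len(outside)} target(s) outside {intended_region}: {', '.join(outside)}"
        )

    # Flag a change that touches more servers than the operator expected.
    if len(targets) > max_servers:
        problems.append(f"{len(targets)} targets exceeds limit of {max_servers}")

    return problems


if __name__ == "__main__":
    targets = [
        Server("a1", "region-a"),
        Server("a2", "region-a"),
        Server("b1", "region-b"),  # should not be part of this change
    ]
    for problem in check_rollout_scope(targets, "region-a", max_servers=2):
        print("BLOCKED:", problem)
```

A check like this would only address the scoping mistake. It would not help with the second problem Google describes, where congestion slowed down the engineers trying to restore the correct configuration.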
Google is now conducting a full investigation to understand the causes of both the initial loss of capacity and the slow restoration. In the company’s words:
“With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration. We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.”