Why Facebook’s outage shut many employees out of their own offices
Facebook, Instagram, and WhatsApp all went down yesterday. You probably heard about it, given that not only were those three massive services offline for six hours, but a few other websites were taken down with them too. It also led to some humorous tidbits of information coming out of Facebook, including the fact that multiple Facebook employees were shut out of their offices as a result. What exactly happened, though?
Understanding Border Gateway Protocol (BGP)
A lot of this is simplified and cut down in order to explain the basic concepts of what went wrong at Facebook. Cloudflare has a fantastic technical write-up on the entire situation if you want a completely in-depth technical analysis complete with data from their own DNS.
Whenever you visit a website, the human-readable domain name that you type isn’t what directly connects you to that website. Instead, it maps to an IP address that a Domain Name System (DNS) server looks up for you. Even more fundamental than that is BGP, which stands for Border Gateway Protocol. This mechanism routes information between autonomous systems across the internet. It’s effectively the backbone that binds the wider internet together, and it is how one network advertises its existence to other networks.
When we use the internet, we’re interacting with thousands of internet service providers, routers, and servers. Every website you visit and every connection that’s routed in the background passes through several systems that have no direct relationship with each other. BGP is how the networks along the way work out the best path from your computer, smartphone, or other device to your destination.
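As a rough illustration of what "best path" can mean, the sketch below picks between candidate routes by the number of autonomous systems each one crosses. This is a deliberate simplification: real BGP routers weigh several attributes, and operator policy often takes priority over path length. The prefix and AS numbers are reserved documentation values, and `best_route` is a hypothetical helper, not part of any real router software.

```python
# Toy model of one BGP tie-breaker: prefer the route with the shortest AS path.
# All prefixes and AS numbers are documentation-only values (assumptions).

def best_route(routes):
    """Return the candidate route that crosses the fewest autonomous systems."""
    return min(routes, key=lambda route: len(route["as_path"]))


candidates = [
    {"prefix": "203.0.113.0/24", "as_path": [64496, 64499, 64510]},
    {"prefix": "203.0.113.0/24", "as_path": [64497, 64510]},
]

chosen = best_route(candidates)
print(chosen["as_path"])  # [64497, 64510]
```

Both candidates lead to the same prefix, so the route that passes through two autonomous systems is chosen over the one that passes through three.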
BGP connects autonomous systems in particular. Each autonomous system is owned by a single entity and runs its own network. They can be an internet service provider, a large company, or even a university. I graduated from University College Dublin last month, and it has its own autonomous system that was allocated in 1993. Information on these systems is public.
The problem is, the internet is a living, breathing being. These networks update constantly, and autonomous systems share their routes with each other all the time. Each autonomous system builds up its own map of the internet from what the others tell it. Famously, when the Pakistani government attempted to ban YouTube in 2008, it used BGP to route YouTube into a black hole. Because this was then shared with other autonomous systems which copied that configuration, nearly all of YouTube’s traffic got routed into a black hole in Pakistan. YouTube itself was completely fine, but abuse of BGP routing effectively killed the website temporarily.
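One reason a bad announcement can win out is that routers forward traffic along the most specific matching prefix they know about. The sketch below is a minimal illustration of that longest-prefix-match rule, not a reconstruction of the 2008 incident: the addresses are documentation ranges, and `pick_route` and the next-hop labels are hypothetical.

```python
import ipaddress

# Toy routing table: (announced prefix, where traffic for it gets sent).
# Addresses come from a documentation range; the labels are made up.
TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), "legitimate-origin"),
    (ipaddress.ip_network("203.0.113.0/25"), "black-hole"),
]


def pick_route(destination, table):
    """Return the next hop for the most specific prefix containing destination."""
    address = ipaddress.ip_address(destination)
    matches = [(network, hop) for network, hop in table if address in network]
    if not matches:
        return None  # no route at all: the destination is unreachable
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(pick_route("203.0.113.10", TABLE))   # black-hole
print(pick_route("203.0.113.200", TABLE))  # legitimate-origin
print(pick_route("198.51.100.1", TABLE))   # None
```

The first address falls inside both prefixes, so the narrower /25 wins even though the /24 belongs to the real owner. The last lookup shows the other failure mode: with no matching route at all, there is nowhere to send the traffic.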
Facebook’s part to play
Here’s the problem: Facebook operates its own DNS servers. These are responsible for telling your internet service provider and all of the intermediaries along the route where “facebook.com” (and all of the company’s other products, like Instagram and WhatsApp) actually is. Facebook stopped announcing the BGP routes that lead to those DNS servers, so autonomous systems worldwide no longer knew how to reach them. This meant that Facebook had effectively disconnected itself from the internet. Brian Krebs, a cybersecurity reporter, said that it appeared to be a “routine BGP update gone wrong”.
From trusted source: Person on FB recovery effort said the outage was from a routine BGP update gone wrong. But the update blocked remote users from reverting changes, and people with physical access didn’t have network/logical access. So blocked at both ends from reversing it.
— briankrebs (@briankrebs) October 4, 2021
In Facebook’s initial post mortem, it said the following:
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
Later on, the company’s more in-depth breakdown of the situation provided more information.
During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.
Facebook then went on to explain that when their DNS servers can’t speak to their data centers, they withdraw their BGP advertisements. This is what cut Facebook off from the rest of the world, and what made its DNS completely unreachable. The company also talked about how it was difficult for engineers to get on-site in order to fix the problem, which makes sense, as multiple reports talked about how Facebook staff had problems even entering their offices.
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
— Sheera Frenkel (@sheeraf) October 4, 2021
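To make the withdrawal behavior Facebook described more concrete, here is a minimal simulation of DNS nodes that stop advertising themselves when they can't reach the data centers. It is a sketch of the idea only: `DnsNode`, `health_check`, `reachable_nodes`, and the node names are all hypothetical and do not reflect Facebook's actual software.

```python
# Toy model: a DNS node withdraws its BGP advertisement when it
# cannot reach the data centers behind it. All names are hypothetical.

class DnsNode:
    def __init__(self, name):
        self.name = name
        self.advertising = True  # a route to this node is being announced

    def health_check(self, can_reach_data_centers):
        # Withdraw the advertisement when the backbone looks unhealthy,
        # and announce it again once connectivity returns.
        self.advertising = can_reach_data_centers


def reachable_nodes(nodes):
    """Names of the DNS nodes the rest of the internet still has a route to."""
    return [node.name for node in nodes if node.advertising]


nodes = [DnsNode("dns-a"), DnsNode("dns-b")]
print(reachable_nodes(nodes))  # ['dns-a', 'dns-b']

# The backbone goes down, so every node fails its check at the same time.
for node in nodes:
    node.health_check(can_reach_data_centers=False)
print(reachable_nodes(nodes))  # []
```

For a single unhealthy node, stepping aside is sensible because the others keep answering. When every node fails the same check at once, nothing is left advertising, which matches the outcome Facebook described.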
Cloudflare’s excellent write-up goes into detail about some of the problems that it noticed, along with some of the ramifications of Facebook going down. Cloudflare operates its own DNS resolver, 1.1.1.1, and the provider saw that Facebook’s domains stopped resolving. In fact, its engineers initially worried that the problem was with their own systems. Facebook had stopped announcing routes to its DNS servers, meaning that they were unreachable.
When Facebook stopped broadcasting routes, DNS resolvers went haywire. Between applications constantly attempting to reconnect to Facebook and people repeatedly retrying by hand, a “tsunami” of additional DNS traffic hit Cloudflare’s servers.
Funnily enough, some Huawei device owners noted that they could no longer connect to Wi-Fi networks either. It’s possible that Huawei is using Facebook’s servers in some way or another to verify if an internet connection is active. It might also have been an unfortunate coincidence.
It gets worse still. When Facebook went down, Cloudflare noted that queries for other platforms like Twitter, Signal, Telegram, and TikTok went up. Twitter began to struggle under the load as well, and for a brief few minutes, many thought that it would go down too.
TWITTER PLEASE YOU’RE ALL THAT’S LEFT pic.twitter.com/0zfmcCert4
— Adam Conway (@AdamConwayIE) October 4, 2021
Websites that use single sign-on with Facebook also ran into problems, as many users couldn’t even log in. Much of the internet felt the effects in one way or another, with services across the globe reporting problems. Several hours later, Facebook came back online.
If this entire debacle proves one thing, it’s as Eva Galperin, director of cybersecurity at the Electronic Frontier Foundation, says: “the internet is held together with bubblegum and string”.