Facebook, Instagram, WhatsApp… Behind the scenes of a historic outage

For six hours, Facebook and its affiliated services were inaccessible. Unprecedented! Futura performs the autopsy of this mishap.

4 October, 3:40 p.m. UTC. Cloudflare, one of the largest companies for optimizing and securing Internet traffic, notices that Facebook has stopped associating its domain names with its IP addresses. On computers around the world, a blank page showing a connection error appears, and the Facebook, Instagram, WhatsApp and Facebook Messenger applications freeze on smartphones.

Facebook and its various services have completely disappeared from the face of the Web. A breath of fresh air for some, who criticize the network's anxiety-inducing atmosphere and its tendency to let misinformation, and those who spread it, slip through. A disaster for the hundreds of millions of users left without the social network's tools and services.

Internet users then rushed to other social networks, first and foremost Twitter, to find out what had happened. Those networks were suddenly flooded with connections to the point of saturation. It was also via Twitter that Facebook announced that a real problem was being addressed. And this global outage lasted six hours! A record. A giant cyberattack? No, rather a handling error during a server configuration operation. What really happened?

DNS servers wrongly accused

DNS, IP, BGP… These acronyms were used extensively that evening to describe the source of the problem encountered by Facebook. Each of them does play a part in the disaster scenario, but the first to be blamed, wrongly, was DNS. DNS, the Domain Name System, is what makes it possible to associate a web address, for example facebook.com, with the server hosting the page. That server is identified by an IP address, a sequence of digits that can be compared to a unique telephone number. To use an analogy, it is the equivalent of making a call from a mobile phone by simply selecting the name of the contact in the address book: the name is linked to a telephone number that the network can handle. Given its size, the social network has its own DNS servers. But during yesterday's blackout, they were still working, even though they were running in isolation and no longer communicating with the rest of the network. The problem therefore did not come from them.
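
The address-book analogy can be sketched in a few lines of Python. This is only a toy model, not a real resolver: the `ADDRESS_BOOK` table and the `resolve` function are hypothetical names chosen for illustration, and the IP addresses are placeholders taken from a range reserved for documentation, not Facebook's real addresses.

```python
from typing import Optional

# Hypothetical address book: names mapped to placeholder IP addresses.
# 192.0.2.0/24 is reserved for documentation; these are not real addresses.
ADDRESS_BOOK = {
    "facebook.com": "192.0.2.10",
    "instagram.com": "192.0.2.20",
}


def resolve(name: str) -> Optional[str]:
    """Return the IP address recorded for a name, or None if it is unknown."""
    # Names are case-insensitive and may carry a trailing dot.
    return ADDRESS_BOOK.get(name.lower().rstrip("."))


print(resolve("facebook.com"))
print(resolve("unknown.example"))
```

A name missing from the table yields `None`, which is roughly what a browser faces when no DNS server can answer for a domain.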

A BGP and AS duo no longer on speaking terms

The real troublemaker is not these DNS servers but a protocol called BGP, for Border Gateway Protocol. When data is transmitted, it is BGP that evaluates the best routes for carrying the data packets to their destination across the entire network. Instead of having to go through all the DNS servers to match an address to a number for data delivery, this protocol simply queries large servers belonging to AS (Autonomous Systems), the networks managed by Internet operators. They are the ones that hold the largest directories of IP addresses. They provide BGP with the map of the network so that it can send the data packets quickly. The DNS servers are among the IP addresses that communicate with these AS servers.
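
To give an idea of what "evaluating the best route" can mean, here is a deliberately simplified Python sketch. It assumes that the only criterion is the number of autonomous systems crossed; real BGP routers weigh several other attributes before and after that one. The `ADVERTISED_PATHS` list and the `best_path` function are hypothetical, and the AS numbers come from a range reserved for documentation.

```python
# Hypothetical advertisements for one destination: each entry lists the
# autonomous systems (AS numbers) the traffic would cross to get there.
# AS numbers 64496-64511 are reserved for documentation.
ADVERTISED_PATHS = [
    [64500, 64501, 64510],
    [64502, 64510],
    [64503, 64504, 64505, 64510],
]


def best_path(paths):
    """Pick the advertisement that crosses the fewest autonomous systems."""
    if not paths:
        return None  # no advertisement at all, so no known route
    return min(paths, key=len)


print(best_path(ADVERTISED_PATHS))
```

With an empty list of advertisements the function returns `None`: without any announced path, there is simply nothing to choose from.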

BGP Updates in Facebook

In the large AS servers, updates to the network map are very rare. At around 3:40 p.m., a big spike is visible, showing that the IP address database used by the BGP protocol has been erased. © Cloudflare

Here again, Facebook has its own AS servers, which store the IP addresses of all its services as well as those of its DNS servers. And it is precisely on these servers that the problem lay. During an update operation, technicians accidentally deleted the IP address database used by the BGP protocol. From that point on, the AS servers no longer had any instructions for sending the data packets. No more routes, no more traffic: Facebook and all of its services were disconnected.
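
The effect of a vanished route can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Facebook's actual configuration: the `RoutingTable` class, the `hypothetical-facebook-edge` label and the 192.0.2.0/24 prefix (a range reserved for documentation) are all invented for the example.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next-hop label."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        # Record that this prefix can be reached through next_hop.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Forget the prefix; nothing happens if it was never announced.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Return the next hop of the most specific matching prefix, if any.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "hypothetical-facebook-edge")
before = table.lookup("192.0.2.53")  # placeholder address of a DNS server
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")
print(before, after)
```

Once the prefix is withdrawn, the lookup finds nothing, even though the machine behind the address may still be running. This matches the situation described above, where the DNS servers were working but could no longer be reached.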

And, since troubles never come alone, the outage dragged on because of several additional factors. With Facebook’s IP addresses cut off from the network, company personnel could no longer access the servers remotely to restore the network. A phenomenon accentuated by the massive adoption of teleworking since the start of the pandemic. Worse yet, on site, employees were stuck at the doors because their access badges no longer worked as a result of the failure. Finally, once the “cables were reconnected”, it was also necessary to reckon with a tsunami of requests coming from users all trying to connect at the same time.

In the end, this major mishap shows once again that the Internet is incredibly complex and that a small error can have global consequences.

