They delete a Facebook employee’s explanation of the cause of the outage

The outage that Facebook, WhatsApp and Instagram suffered yesterday left us wondering what went wrong.

It is already known that the immediate cause was the withdrawal of the route announcements for the company’s networks, which took out of service the DNS servers that translate domain names into their IP addresses.

This caused applications or browsers to return error messages when trying to contact Facebook’s servers. It was as if they had been suddenly disconnected from the network.
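To make that chain of events concrete, here is a minimal toy sketch in Python. It is not Facebook’s actual setup: the `ToyInternet` class, the `example.com` domain and the addresses (taken from the ranges reserved for documentation) are all assumptions made for illustration. It only shows that once a network prefix is no longer announced, the DNS server inside it becomes unreachable and name lookups fail, even though the server itself is still running.

```python
import ipaddress


class ToyInternet:
    """Toy model: announced prefixes plus authoritative DNS servers (hypothetical)."""

    def __init__(self):
        self.announced = set()   # prefixes currently advertised, standing in for BGP routes
        self.dns_servers = {}    # DNS server IP -> {domain: address}

    def announce(self, prefix):
        self.announced.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.announced.discard(ipaddress.ip_network(prefix))

    def reachable(self, ip):
        # An address is reachable only if some announced prefix contains it.
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.announced)

    def resolve(self, domain):
        # Ask every DNS server that holds the record and can still be reached.
        for server_ip, records in self.dns_servers.items():
            if domain in records and self.reachable(server_ip):
                return records[domain]
        raise LookupError(f"cannot resolve {domain}: no reachable DNS server")


net = ToyInternet()
net.announce("192.0.2.0/24")
net.dns_servers["192.0.2.53"] = {"example.com": "198.51.100.10"}

print(net.resolve("example.com"))  # 198.51.100.10

# The routes are withdrawn: the DNS server still exists, but nobody can reach it.
net.withdraw("192.0.2.0/24")
try:
    net.resolve("example.com")
except LookupError as err:
    print(err)
```

In this model the lookup fails for the same reason described above: the record is still there, but there is no longer a path to the server that holds it.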

What is less clear is the root cause of the problem, since a failure of this kind is unusual.

Before the problem spread, an alleged Facebook employee who worked on the investigation and recovery team posted an explanation of what happened in a Reddit thread [archive], but he then deleted both his account and the message (why?).

This is a global outage of all FB-related services/infrastructure (source: I am currently on the recovery/investigation team).

As many of you know, DNS for FB services has been affected and this is probably a symptom of the real issue, which is that BGP peering with Facebook’s peering routers has gone down, probably due to a configuration change that took effect shortly before the outages began (starting at approximately 15:40 UTC).

Now, there are people trying to gain access to the peering routers to apply fixes, but the people with physical access are separate from the people who know how to authenticate to the systems and from the people who know what to actually do, so now there’s a logistical challenge to unify all that knowledge.

Part of this is also due to reduced staffing at the data centers because of anti-pandemic measures.

Later, he returned to the thread to provide additional information:

I am not aware of any discussion that is considering a threat/attack vector.

I think the original change was “automatic” (like configuring via a web interface). However, now that the connection to the outside world is down, remote access to these tools is gone, so the emergency procedure is to get physical access to the peering routers and do all the configuration locally.

So everything suggests that the failure was due to an automatic update that did not work as expected, which was aggravated by the fact that the routers could not be accessed remotely. This forced technicians to go to Facebook’s data centers in person to make the corrections manually.
