Amazon Web Services Apologises After Major Outage Disrupts Global Platforms

Web Reporter

Amazon Web Services (AWS) has issued a formal apology following a massive outage on Monday that disrupted thousands of websites and online services worldwide, including Snapchat, Reddit, and Lloyds Bank.

The disruption, which originated in AWS’s Northern Virginia data hub, its largest cluster of data centres, occurred on 20 October and left many popular platforms offline for several hours. Some services, including banking apps and online games, remained affected well into the afternoon.

In a statement, the company said the outage was caused by a technical fault that prevented its internal systems from connecting websites to their corresponding IP addresses — the numerical identifiers that computers use to locate online services.
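For readers unfamiliar with the mechanism, the lookup that failed is the same kind of name-to-address resolution every internet client performs. The minimal sketch below uses Python’s standard library to illustrate the general idea; it is not AWS’s internal tooling, and example.com is a placeholder domain.

```python
import socket

# Resolve a hostname to its IP address, the same step any client
# performs before connecting. During the outage, AWS's internal
# equivalent of this lookup stopped returning answers for affected
# endpoints, so dependent services could not find one another.
hostname = "example.com"  # placeholder domain, not an affected service

try:
    ip_address = socket.gethostbyname(hostname)
    print(f"{hostname} resolves to {ip_address}")
except socket.gaierror as err:
    # This is roughly what affected applications saw: a name that
    # simply would not resolve to an address.
    print(f"DNS lookup failed for {hostname}: {err}")
```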

“We apologise for the impact this event caused our customers,” AWS said. “We know how critical our services are to our customers, their applications and end users, and their businesses.”

While many services such as Roblox and Fortnite were restored within hours, others — including Lloyds Bank, Reddit, and the US payments app Venmo — experienced prolonged disruptions. Even some smart home devices were affected, with owners of internet-connected “smart beds” from Eight Sleep reporting that their adjustable mattresses overheated or became stuck in raised positions during the outage.

Experts said the widespread impact underscored the global dependence on AWS, which, along with Microsoft Azure, dominates the cloud computing market. “This incident shows how reliant major companies are on a few cloud providers,” said Dr. Junade Ali, a software engineer and fellow at the Institution of Engineering and Technology.

In a detailed technical summary, Amazon explained that the outage stemmed from an issue in its US-EAST-1 region — the same one responsible for a number of past incidents. A fault in the automated systems that manage the Domain Name System (DNS) database led to a breakdown in communication between servers.

The problem, Amazon said, was triggered by a “latent race condition”: a dormant software bug in which the outcome depends on the order and timing of competing operations, and which surfaced only under an unusual sequence of events. Because much of the process is automated, the failure propagated quickly before engineers could intervene.
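Amazon has not published the code involved, but the class of bug can be shown in miniature. In the hedged sketch below, two automated workers (their names invented for illustration) modify a shared record without any locking; depending on how the operating system schedules them, the record can end up updated, stale, or empty, and an empty record is exactly the kind of state that makes name lookups fail.

```python
import threading
import time
import random

# A shared record that automation keeps up to date. All names here
# are invented for illustration; AWS has not published its internals.
record = {"service": "10.0.0.1"}
# No lock is taken anywhere below -- that omission is the bug.

def apply_update(new_ip, delay):
    # Read-modify-write: a slow worker widens the race window.
    snapshot = dict(record)   # read the current state
    time.sleep(delay)         # simulate a delayed automation step
    snapshot["service"] = new_ip
    record.clear()
    record.update(snapshot)   # write back, possibly clobbering newer data

def cleanup():
    # Deletes state it believes is obsolete. If mistimed, it deletes
    # the live entry instead.
    time.sleep(random.uniform(0, 0.02))
    record.clear()

workers = [
    threading.Thread(target=apply_update, args=("10.0.0.2", 0.01)),
    threading.Thread(target=cleanup),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Depending on which worker ran last, the record may be updated,
# stale, or empty -- and an empty record means lookups start failing.
print("final record:", record)
```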

Dr. Ali described the event as a case of “faulty automation,” explaining that the internal “address book” AWS systems rely on failed to locate one of its key components, causing cascading errors.

He added that the incident highlights the need for greater redundancy and resilience in cloud computing. “Those who had a single point of failure in this Amazon region were susceptible to being taken offline,” he said. “Companies should diversify their cloud providers so they can switch over when one goes down.”
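One simple form of the diversification Dr. Ali describes is client-side failover across endpoints hosted with independent providers. The sketch below is an illustration only: the URLs are placeholders, and real deployments would typically handle failover at the DNS or load-balancer level rather than in ad-hoc client code.

```python
import urllib.error
import urllib.request

# Placeholder endpoints hosted with different providers; in practice
# these would be the same application deployed to two clouds.
ENDPOINTS = [
    "https://primary.example.com/health",    # e.g. hosted on AWS
    "https://secondary.example.net/health",  # e.g. hosted elsewhere
]

def first_healthy(endpoints, timeout=3):
    """Return the first endpoint that answers its health check."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # this provider is down or unreachable; try the next
    raise RuntimeError("all providers unavailable")

# A single point of failure becomes a fallback chain: if the primary
# provider's region is down, requests shift to the secondary.
try:
    print("serving from:", first_healthy(ENDPOINTS))
except RuntimeError as err:
    print(err)
```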

AWS said it is taking steps to prevent similar issues in the future and will “do everything we can to learn from the event and improve our availability.”
