
8 largest IT outages in history

IT outages can be caused by cyberattacks, hardware failure, natural disasters and human error. Learn about some of the biggest outages here.

The internet is embedded in our everyday lives. When such a resource goes dark, the effects can be paralyzing.

There are many variables to consider when assessing the severity of a tech outage, such as scale, duration and losses. But the bottom line is that when any internet service goes down, people notice. There were no hard-and-fast criteria used to compile the following list of large-scale IT outages; the one true throughline is that every one of them was a big enough deal to live on in notoriety.

The eight largest IT outages in history

There has been no shortage of IT outages since the advent of the internet, with some of the most drastic instances coming within the last decade. The increased dependency on cloud computing and internet-based services in general has played a large part in ensuring the effects of recent outages were felt far and wide. Here are some of the largest-scale incidents in history.

1. Dyn (2016)

The only entry on this list that was the result of a cyberattack is Dyn's outage on Oct. 21, 2016, lasting approximately two hours. The outage was due to one of the largest distributed denial of service (DDoS) attacks in history. Before being acquired by Oracle shortly after the attack, Dyn was a domain name system (DNS) provider responsible for taking human-friendly URLs, or domain names, and translating them into IP addresses.

The attack was carried out through a botnet -- a network of internet-connected machines infected with the same malware, known as Mirai -- which made coordinated and repeated requests to Dyn's servers, shutting them down completely. Many big-name sites and services, such as CNN, Netflix, Twitter and Reddit, relied on Dyn for DNS services throughout the U.S. and Europe, and all of them were knocked out for the time Dyn was down.
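The translation work Dyn performed can be illustrated with a toy resolver. This is a minimal sketch, not Dyn's actual implementation: the in-memory zone table stands in for the distributed DNS infrastructure a real provider operates, and the records in it are invented for illustration.

```python
# Toy DNS resolver illustrating the name-to-address translation a provider
# like Dyn performs. The ZONE table below is invented for illustration; a
# real resolver queries authoritative name servers over the network.
ZONE = {
    "example.com": "93.184.216.34",
    "www.example.com": "93.184.216.34",
}

def resolve(domain: str) -> str:
    """Return the IP address on record for a domain name."""
    try:
        return ZONE[domain.lower().rstrip(".")]
    except KeyError:
        # When the resolver is unreachable or has no record, clients see
        # exactly this kind of failure -- the site "disappears."
        raise LookupError(f"no record for {domain!r}")

print(resolve("www.example.com"))  # -> 93.184.216.34
```

When a shared DNS provider goes down, every site that depends on it fails this lookup step at once, which is why so many unrelated services vanished together during the Dyn attack.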

2. Amazon Web Services (2017)

In an example of a particularly costly typo, what should have been a simple debugging of the billing system for Amazon's Simple Storage Service (S3) quickly went awry in February 2017, taking the service down for approximately four hours. While attempting to remove servers for a subsystem, an S3 engineer executed a mistyped command that instead removed a "larger set of servers … than intended," as Amazon described it.

The servers removed supported two key subsystems in the northern Virginia region, and the disruption meant that they required a full restart. During this time, other AWS services that relied on S3 for storage, including Elastic Compute Cloud and Elastic Block Store, were down.

While S3 subsystems are designed to minimize customer disruption in the event they fail, the subsystems in this region had not been completely restarted for years. As a result, the restart process took four hours and resulted in an AWS outage that cost companies relying on the service millions of dollars in losses.

3. Verizon/BGP (2019)

In what was essentially an internet traffic jam, Verizon experienced an outage lasting approximately three hours beginning around 6:30 a.m. on June 24, 2019. A failure involving the Border Gateway Protocol (BGP), which is responsible for joining networks together and directing traffic between them, caused all of Verizon's internet traffic to be routed through DQE Communications, a small ISP in Pennsylvania.

DQE was using a BGP optimizer from Noction that shared routing information with a customer -- Allegheny Technologies -- before also sharing it with Verizon, according to a blog post from Cloudflare, one of the affected services. Verizon, in turn, sent out this specific routing information to the rest of the internet -- a leak that should have been prevented by filtering that Verizon did not have in place. Large swaths of internet traffic were subsequently routed through DQE and Allegheny, whose networks were unequipped to handle such heavy loads. Besides Cloudflare, major services such as Amazon, Google and Facebook were also disrupted.
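The reason leaked routes from a small ISP could pull in so much traffic comes down to BGP's longest-prefix-match rule: a more specific route always wins over a broader one. The sketch below models that rule with Python's standard `ipaddress` module; the prefixes and next-hop labels are invented for illustration and are not the actual routes involved in the incident.

```python
# Toy illustration of longest-prefix matching, the BGP behavior that lets a
# leaked "more specific" route override a legitimate aggregate. The prefixes
# and next hops here are invented for illustration.
from ipaddress import ip_address, ip_network

def best_route(dest: str, routes: dict) -> str:
    """Pick the next hop whose prefix matches dest most specifically."""
    addr = ip_address(dest)
    matches = [(ip_network(prefix), hop) for prefix, hop in routes.items()
               if addr in ip_network(prefix)]
    # The longest prefix (largest prefixlen) is the most specific match.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

routes = {
    "104.16.0.0/12": "legitimate transit",  # broad, legitimate aggregate
    "104.16.80.0/21": "leaked via DQE",     # narrower leaked route
}
print(best_route("104.16.81.5", routes))  # -> leaked via DQE
```

Because routers everywhere preferred the narrower leaked prefixes, traffic for the affected destinations funneled into networks that could not carry it -- hence the "traffic jam."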

4. Google (2020)

Even though the downtime lasted only 47 minutes, users around the world on the morning of Dec. 14, 2020, felt the effects of being unable to access any service that required a Google account, including Gmail, Google Drive and YouTube.

Google uses a quota system to manage and allocate storage for its authentication services and had switched to a new system earlier that year. Unfortunately, parts of the old quota system were left in place during the cutover, causing it to incorrectly report resource usage, according to Google's post on the incident.

While there were some fail-safes in place, none of them accounted for this scenario. New authentication data could not be written, so the data quickly turned stale, resulting in errors on authentication lookups and crashing the system.

5. Fastly (2021)

Fastly is a content delivery network powering sites and services for organizations such as the BBC, Shopify, Amazon, CNN and the U.S. and U.K. governments, all of which were affected when Fastly went down on June 8, 2021.

Fastly released a software update in May 2021 that contained a bug, but the bug could only be triggered under specific circumstances, which is why it went unnoticed for nearly a month. At approximately 5:47 a.m. EDT on June 8, a customer submitted a configuration change that, while not problematic in and of itself, created the scenario that triggered the bug. The scale of the outage was substantial, with Fastly reporting that 85% of its network returned errors.

If there was a silver lining to be found, it's that remediation was swift. In its blog post on the incident, Fastly said that within 49 minutes, 95% of its network had returned to normal, with a full bug fix rolling out hours later. But even though sites were only down for a few minutes, news outlets couldn't publish stories and stores couldn't make sales during that window. The damage was both widespread and varied.

6. Meta (2021)

Fail-safes are only useful if they don't fail. Meta -- the parent company of Facebook, Instagram and WhatsApp -- found this out the hard way on Oct. 4, 2021.

While performing routine maintenance, engineers issued a command intended to assess Facebook's network capacity. Instead, it disconnected all of the company's data centers across the globe. According to Meta, the command should have been blocked by its systems' auditing, but a bug in the audit tool allowed it to go through. All of Meta's services went dark for nearly seven hours.

While the damage caused by the outage is difficult to quantify, Facebook lost $47.3 billion in market value during the downtime, according to Bloomberg.

7. Rogers Communications (2022)

Similar to the 2019 Verizon outage, Canadian telecom provider Rogers Communications suffered a massive outage on July 8, 2022, due to a routing issue. The outage affected more than 12 million customers nationwide, according to a report published by the Canadian Radio-television and Telecommunications Commission.

Human error was at play here. While configuring the distribution routers, which are responsible for directing internet traffic, Rogers staff removed a key filter known as the access control list. As a result, all possible routes to the internet ended up passing through the routers of Rogers' core network, ultimately exceeding their capacity and causing them to crash. The outage lasted for nearly a day, with mobile networks, internet and 911 service unavailable during that time.

8. CrowdStrike (2024)

The July 2024 CrowdStrike outage is one of the biggest IT outages not just in recent memory but of all time.

Just after midnight Eastern Daylight Time on July 19, 2024, cybersecurity company CrowdStrike rolled out an update for its Falcon sensors to Microsoft Windows hosts around the world. But it contained a faulty configuration file that crashed the machines, causing a blue screen of death. CrowdStrike discovered the issue and rolled back the update a little over an hour later, but it was too late for any systems that had checked in with CrowdStrike's cloud updating service during that window of time. According to Microsoft's estimates, approximately 8.5 million devices were affected across multiple industries, including travel, finance and healthcare.

Because the problematic file prevented Windows from even booting, the recommended fix was slow, requiring users to start up machines in Safe Mode and navigate to the directory where the file was stored so they could delete it. Microsoft and CrowdStrike eventually released instructions on how the problem could also be addressed by creating and using bootable USB drives. Regardless of the approach, the sheer scale of the outage made remediation a cumbersome process.
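The manual step at the heart of that fix -- finding and deleting the faulty channel files -- can be sketched as follows. The directory and `C-00000291*.sys` file pattern follow CrowdStrike's published remediation guidance, but this is an illustrative sketch, not an official tool, and in practice the deletion had to be performed from Safe Mode on each affected machine.

```python
# Sketch of the manual remediation step, assuming the file pattern and
# directory from CrowdStrike's published guidance. In practice this had to
# be run from Windows Safe Mode on each affected host, against
# C:\Windows\System32\drivers\CrowdStrike.
from pathlib import Path

def remove_faulty_channel_files(driver_dir: Path) -> int:
    """Delete the faulty channel files and return how many were removed."""
    removed = 0
    if driver_dir.is_dir():
        for f in driver_dir.glob("C-00000291*.sys"):
            f.unlink()  # delete the bad channel file
            removed += 1
    return removed
```

A one-line script is trivial; what made remediation cumbersome was that it had to be repeated, largely by hand, on millions of machines that could not boot normally.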

Though some systems were restored and brought back online within hours, it wasn't until July 29 at 8 p.m. -- 10 days later -- that CrowdStrike declared approximately 99% of Windows sensors were online. The lasting effects of the outage could be seen firsthand through the struggles of companies such as American Airlines, United Airlines and Delta Air Lines, which were still experiencing problems days after the initial failure.

Common causes of IT outages

An important takeaway when looking at history's biggest IT outages is that they can be caused by many factors and are, to an extent, inevitable. Some of the more common reasons outages occur include the following:

  • Human error. Nobody's perfect. A recurring theme in the list of outages above is that many of the incidents occurred simply due to someone's mistake or oversight. Misconfigured hardware, mistyped commands and overlooked software bugs are all common culprits.
  • Hardware failures. Failing disks, servers and network equipment are among the most common sources of unplanned outages, and power outages and natural disasters can take otherwise healthy hardware offline as well.
  • Malicious behavior. Cyberattacks -- including DDoS attacks, data breaches and ransomware -- can take systems down deliberately, as the Dyn incident showed. Always be on the lookout for suspicious activity.

How to prepare for outages

It's impossible to be completely prepared for an outage that you don't see coming. But there are certainly some measures that both providers and end users can take to soften the blow in the event of an incident:

  • Have a safety net. They might not be flawless, but redundancy and failover systems, which allow a system to automatically switch to backup infrastructure, can go a long way toward providing protection. Monitoring and alerting also help catch problems early.
  • Test and communicate. Whatever form your contingency plan takes, make sure it has been well tested. Plans should be designed with cross-company consensus on critical elements, as outages can affect more than just IT.
  • Back it up. End users should keep offline resources and maintain regular backups -- of both data and infrastructure -- to fall back on in the event of an outage.
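The "safety net" advice above can be sketched as a simple failover wrapper: try the primary service, and switch to the backup when it fails. The service functions here are stand-ins invented for illustration; real failover systems typically add health checks, retries and alerting on top of this basic pattern.

```python
# Minimal failover sketch: call the primary service, and fall back to the
# backup if it raises. The fetch functions are stand-ins invented for
# illustration.
def with_failover(primary, backup):
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # In production, log the failure and fire an alert here.
            return backup(*args, **kwargs)
    return call

def primary_fetch(key):  # stand-in for the main service
    raise ConnectionError("primary is down")

def backup_fetch(key):  # stand-in for the redundant backup
    return f"{key} (from backup)"

fetch = with_failover(primary_fetch, backup_fetch)
print(fetch("user-profile"))  # -> user-profile (from backup)
```

The design choice worth noting is that the caller never sees the failure: the switch to backup infrastructure happens automatically, which is exactly what "redundancy and failover" buys during an outage.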

Grant Hatchimonji is a freelance writer and a solutions architect who does software engineering and consulting.
