Browse Definitions :

Getty Images/iStockphoto

Explaining the largest IT outage in history and what's next

A CrowdStrike update caused a massive IT outage, crashing millions of Windows systems. Critical services and business operations were disrupted, revealing tech reliance risks.

What might be considered the largest IT outage in history was triggered by a botched software update from security vendor CrowdStrike, affecting millions of Windows systems around the world.

The outage occurred July 19, 2024, with millions of Windows systems failing and showing the infamous blue screen of death (BSOD).

CrowdStrike -- the company at the core of the outage -- is an endpoint security vendor whose primary technology is the Falcon platform, which helps protect systems against potential threats in a bid to minimize cybersecurity risks.

In many respects, the outage was a real manifestation of fears that computing users had at the end of the last century with the Y2K bug. With Y2K, the fear was that a bug in software systems would trigger widespread technology failures. While the CrowdStrike failure was not Y2K, it was a software issue that did, in fact, trigger massive disruption on a scale that has not been seen before.

What caused the outage?

The CrowdStrike Falcon platform is widely used by organizations of all sizes across many industries. It is the pervasiveness of CrowdStrike's technology and its integration into so many mission-critical operations and industries that amplified the effect.

The outage was not a Microsoft Windows flaw directly, but rather a flaw in CrowdStrike Falcon that triggered the issue.

Falcon hooks into the Microsoft Windows OS as a Windows kernel process. The process has high privileges, giving Falcon the ability to monitor operations in real time across the OS. There was a logic flaw in Falcon sensor version 7.11 and above, causing it to crash. Due to CrowdStrike Falcon's tight integration into the Microsoft Windows kernel, it resulted in a Windows system crash and BSOD.

The flaw in CrowdStrike Falcon was inside of a sensor configuration update. The sensor is regularly updated -- sometimes multiple times daily -- to provide users with mitigation and threat protection.

The flawed update was contained in a file that CrowdStrike refers to as "channel files," which specifically provide configuration updates for behavioral protections. Channel file 291 is an update that was supposed to help improve how Falcon evaluates named pipe execution on Microsoft Windows. Named pipes are a common type of communication mechanism for interprocess communications on Microsoft Windows.

With channel file 291, CrowdStrike inadvertently introduced a logic error, causing the Falcon sensor to crash and, subsequently, Windows systems in which it was integrated.

The flaw isn't in all versions of channel file 291. The problematic version is channel file 291 (C-00000291*.sys) with timestamp 2024-07-19 0409 UTC. Channel file 291 timestamped 2024-07-19 0527 UTC or later does not have the logic flaw. By that time, CrowdStrike had noticed its error and reverted the change. But, for many of its users, that reversion came too late as they had already updated, leading to BSOD and inoperable systems.

What services were affected?

Microsoft estimated that approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error flaw. That's less than 1% of Microsoft's global Windows install base.

But, despite the small percentage of the overall Windows install base, the systems affected were those running critical operations. Services affected include the following.

Airlines and airports

The outage grounded thousands of flights worldwide, leading to significant delays and cancellations of more than 10,000 flights around the world. In the United States, affected airlines included Delta, United and American Airlines. These airlines were forced to cancel hundreds of flights until systems were restored. Globally, multiple airlines and airports were affected, including KLM, Porter Airlines, Toronto Pearson International Airport, Zurich Airport and Amsterdam Schiphol Airport.

Public transit

Public transit in multiple cities was affected, including Chicago, Cincinnati, Minneapolis, New York City and Washington, D.C.

Healthcare

Hospitals and healthcare clinics around the world faced significant disruptions in appointment systems, leading to delays and cancellations. Some states also reported 911 emergency services being affected, including Alaska, Indiana and New Hampshire.

Financial services

Online banking systems and financial institutions around the world were affected by the outage. Multiple payment platforms were directly affected, and there were individuals who did not get their paychecks when expected.

Media and broadcasting

Multiple media and broadcast outlets around the world, including British broadcaster Sky News, were taken off the air by the outage.

Why Apple and Linux were not affected

CrowdStrike's software doesn't just run on Microsoft Windows; it also runs on Apple's macOS and the Linux OS.

But the July outage only affected Microsoft Windows. The root cause of the outage was a faulty sensor configuration update that specifically affected Windows systems. The channel file 291 update was never issued to macOS or Linux systems as the update deals with named pipe execution that only occurs on the Microsoft Windows OS.

The way that the Falcon sensor integrates as a Windows kernel process is also not the same in macOS or Linux. Those OSes have different integration points to limit potential risk.

However, there was a reported incident in June from Linux vendor Red Hat, where the Falcon sensor -- running as an eBPF program in Linux -- triggered a kernel panic. In Linux, a kernel panic is a type of crash, though typically not as dramatic as BSOD. That issue was resolved without Red Hat reporting any major incidents.

How long will it take businesses to recover from this outage?

CrowdStrike itself was able to identify and deploy a fix for the issue in 79 minutes. While CrowdStrike quickly identified and deployed a fix for the issue, the recovery process for businesses is complex and time-consuming. Among the issues is that, once the problematic update was installed, the underlying Windows OS would trigger BSOD, rendering the system inoperative using the normal boot process.

IT administrators had to manually boot affected systems into Safe Mode or the Windows Recovery Environment to delete the problematic channel file 291 and restore normal operations. That process is labor-intensive, especially for organizations with many affected devices. In some cases, the process also required physical access to each machine, adding further time and effort to the process.

Some businesses were able to apply the fix within a few days. However, the process was not straightforward for all, particularly those with extensive IT infrastructure and encrypted drives. The use of the Microsoft Windows BitLocker encryption technology by some organizations made it significantly more time-consuming to recover as BitLocker recovery keys were required.

It is estimated that it could potentially take months for some organizations to entirely recover all affected systems from the outage.

Timeline of events

Here is a timeline of outage events.

July 19, 2024

04:09 UTC: CrowdStrike released a sensor configuration update to Windows systems. This update included channel file 291, which contained a logic error that led to system crashes and BSOD on affected systems.

04:09-05:27 UTC: Systems running Falcon sensor for Windows version 7.11 and above that were online during this period downloaded the faulty update and began experiencing crashes.

05:27 UTC: CrowdStrike identified and remediated the issue by reverting the problematic update. The corrected version of channel file 291 was deployed.

15:30 UTC: The Cybersecurity and Infrastructure Security Agency issued an alert acknowledging the widespread IT outage due to the CrowdStrike update. CISA confirmed the issue was not related to a cyberattack and provided initial guidance for remediation.

July 20, 2024

21:59 UTC: CrowdStrike updated its remediation and guidance hub, providing consolidated resources for affected customers. The company reiterated the issue was not a cyberattack and assured customers that systems not currently affected would continue to operate as expected.

July 21, 2024

21:22 UTC: Microsoft released a custom Windows Preinstallation Environment recovery tool to find and remove the faulty CrowdStrike update that crashed an estimated 8.5 million Windows devices.

How can businesses be better prepared for tech outages?

The CrowdStrike Windows outage highlighted the vulnerabilities of modern society's heavy reliance on technology. While system backups and automated processes are essential, having manual procedures in place can significantly enhance business continuity during tech outages.

But there are a few things businesses can do to be better prepared for tech outages, including the following.

Test all updates before deploying to production

It has been a best practice for years to allow automated updates to ensure systems are always up to date. However, the CrowdStrike issue laid bare the underlying risk with that approach. For mission-critical systems, testing updates before deployment or having some form of staging environment before pushing updates to production might help to mitigate some risk.

Develop and document manual workarounds

Manual workarounds ensure critical business processes can continue even when technology fails. This approach was common before the digital age and, in the event of outage, can serve as a fallback. Documenting and practicing manual procedures can help mitigate the effect of outages, ensuring businesses can still operate and serve their customers, even during an outage.

Perform disaster recovery and business continuity planning

Outages happen for any number of different reasons. Having extensive disaster recovery and business continuity practices and plans in place is critical. Part of that effort should include the use of redundant systems and infrastructure to minimize downtime and ensure critical functions can switch to backup systems when needed.

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.

Dig Deeper on Business software

Networking
  • What is wavelength?

    Wavelength is the distance between identical points, or adjacent crests, in the adjacent cycles of a waveform signal propagated ...

  • subnet (subnetwork)

    A subnet, or subnetwork, is a segmented piece of a larger network. More specifically, subnets are a logical partition of an IP ...

  • secure access service edge (SASE)

    Secure access service edge (SASE), pronounced sassy, is a cloud architecture model that bundles together network and cloud-native...

Security
  • intrusion detection system (IDS)

    An intrusion detection system monitors (IDS) network traffic for suspicious activity and sends alerts when such activity is ...

  • cyber attack

    A cyber attack is any malicious attempt to gain unauthorized access to a computer, computing system or computer network with the ...

  • digital signature

    A digital signature is a mathematical technique used to validate the authenticity and integrity of a digital document, message or...

CIO
  • What is data privacy?

    Data privacy, also called information privacy, is an aspect of data protection that addresses the proper storage, access, ...

  • product development (new product development)

    Product development -- also called new product management -- is a series of steps that includes the conceptualization, design, ...

  • innovation culture

    Innovation culture is the work environment that leaders cultivate to nurture unorthodox thinking and its application.

HRSoftware
  • organizational network analysis (ONA)

    Organizational network analysis (ONA) is a quantitative method for modeling and analyzing how communications, information, ...

  • HireVue

    HireVue is an enterprise video interviewing technology provider of a platform that lets recruiters and hiring managers screen ...

  • Human Resource Certification Institute (HRCI)

    Human Resource Certification Institute (HRCI) is a U.S.-based credentialing organization offering certifications to HR ...

Customer Experience
  • What is an abandoned call?

    An abandoned call is a call or other type of contact initiated to a call center or contact center that is ended before any ...

  • What is an outbound call?

    An outbound call is one initiated by a contact center agent to prospective customers and focuses on sales, lead generation, ...

  • What is lead-to-revenue management (L2RM)?

    Lead-to-revenue management (L2RM) is a set of sales and marketing methods focusing on generating revenue throughout the customer ...

Close