CrowdStrike outage explained: What caused it and what’s next

A CrowdStrike update caused a massive IT outage, crashing millions of Windows systems. Critical services and business operations were disrupted, revealing tech reliance risks.

What might be considered the largest IT outage in history was triggered by a botched software update from security vendor CrowdStrike, affecting millions of Windows systems around the world. Insurers estimate the outage will cost U.S. Fortune 500 companies $5.4 billion.

The outage occurred July 19, 2024, with millions of Windows systems failing and showing the infamous blue screen of death (BSOD).

CrowdStrike -- the company at the core of the outage -- is an endpoint security vendor whose primary technology is the Falcon platform, which helps protect systems against potential threats in a bid to minimize cybersecurity risks.

In many respects, the outage was a real manifestation of fears that computing users had at the end of the last century with the Y2K bug. With Y2K, the fear was that a bug in software systems would trigger widespread technology failures. While the CrowdStrike failure was not Y2K, it was a software issue that did, in fact, trigger massive disruption on a scale that has not been seen before.

What caused the outage?

The CrowdStrike Falcon platform is widely used by organizations of all sizes across many industries. It is the pervasiveness of CrowdStrike's technology and its integration into so many mission-critical operations and industries that amplified the effect.

The outage was not caused by a flaw in Microsoft Windows itself, but by a flaw in CrowdStrike Falcon.

Falcon hooks into the Microsoft Windows OS as a Windows kernel process. The process has high privileges, giving Falcon the ability to monitor operations in real time across the OS. A logic flaw in Falcon sensor version 7.11 and above caused the sensor to crash. Because Falcon is tightly integrated into the Microsoft Windows kernel, the sensor crash took Windows down with it, resulting in a system crash and BSOD.

The flaw in CrowdStrike Falcon was inside of a sensor configuration update. The sensor is regularly updated -- sometimes multiple times daily -- to provide users with mitigation and threat protection.

The flawed update was contained in what CrowdStrike calls a "channel file," a type of file that provides configuration updates for behavioral protections. Channel file 291 is an update that was supposed to help improve how Falcon evaluates named pipe execution on Microsoft Windows. Named pipes are a common mechanism for interprocess communication on Microsoft Windows.

With channel file 291, CrowdStrike inadvertently introduced a logic error that crashed the Falcon sensor and, subsequently, the Windows systems into which it was integrated.

The flaw isn't in all versions of channel file 291. The problematic version is channel file 291 (C-00000291*.sys) with timestamp 2024-07-19 0409 UTC. Channel file 291 timestamped 2024-07-19 0527 UTC or later does not have the logic flaw. By that time, CrowdStrike had noticed its error and reverted the change. But, for many of its users, that reversion came too late as they had already updated, leading to BSOD and inoperable systems.
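
The timestamp rule above can be expressed as a small check. The following Python sketch is illustrative only and is not CrowdStrike tooling: the function name, the return labels and the choice to compare at minute precision are assumptions. Only the file name pattern and the two UTC timestamps come from the details above. Timestamps that match neither rule are reported as unknown, since nothing above says how to treat them.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase

# Timestamps quoted above, expressed as timezone-aware UTC values.
FLAWED_STAMP = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIXED_FROM = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def classify_channel_file(name: str, stamp: datetime) -> str:
    """Classify a channel file by name and timezone-aware timestamp.

    Returns "not-291", "flawed", "fixed" or "unknown" (hypothetical labels).
    """
    # Match the C-00000291*.sys pattern without regard to letter case.
    if not fnmatchcase(name.lower(), "c-00000291*.sys"):
        return "not-291"
    # Normalize to UTC and drop seconds so the comparison is per minute.
    stamp = stamp.astimezone(timezone.utc).replace(second=0, microsecond=0)
    if stamp == FLAWED_STAMP:
        return "flawed"
    if stamp >= FIXED_FROM:
        return "fixed"
    return "unknown"
```

In practice, the timestamp would have to be read from the file itself, for example from its modification time. Whether that value reliably reflects the channel file version is also an assumption.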

What services were affected?

Microsoft estimated that approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error. That's less than 1% of Microsoft's global Windows install base.

But, despite the small percentage of the overall Windows install base, the systems affected were those running critical operations. Services affected include the following.

Airlines and airports

The outage grounded flights worldwide, leading to significant delays and the cancellation of more than 10,000 flights. In the United States, affected airlines included Delta, United and American Airlines, which were forced to cancel hundreds of flights until systems were restored. Globally, multiple airlines and airports were affected, including KLM, Porter Airlines, Toronto Pearson International Airport, Zurich Airport and Amsterdam Schiphol Airport.

Public transit

Public transit in multiple cities was affected, including Chicago, Cincinnati, Minneapolis, New York City and Washington, D.C.

Healthcare

Hospitals and healthcare clinics around the world faced significant disruptions in appointment systems, leading to delays and cancellations. Some states also reported 911 emergency services being affected, including Alaska, Indiana and New Hampshire.

Financial services

Online banking systems and financial institutions around the world were affected by the outage. Multiple payment platforms were directly affected, and some individuals did not receive their paychecks when expected.

Media and broadcasting

Multiple media and broadcast outlets around the world, including British broadcaster Sky News, were taken off the air by the outage.

Why Apple and Linux were not affected

CrowdStrike's software doesn't just run on Microsoft Windows; it also runs on Apple's macOS and the Linux OS.

But the July outage only affected Microsoft Windows. The root cause of the outage was a faulty sensor configuration update that specifically affected Windows systems. The channel file 291 update was never issued to macOS or Linux systems as the update deals with named pipe execution that only occurs on the Microsoft Windows OS.

The way that the Falcon sensor integrates as a Windows kernel process is also not the same in macOS or Linux. Those OSes have different integration points to limit potential risk.

However, there was a reported incident in June from Linux vendor Red Hat, where the Falcon sensor -- running as an eBPF program in Linux -- triggered a kernel panic. In Linux, a kernel panic is a type of crash, though typically not as dramatic as BSOD. That issue was resolved without Red Hat reporting any major incidents.

How long will it take businesses to recover from this outage?

CrowdStrike itself was able to identify and deploy a fix for the issue in 79 minutes. The recovery process for businesses, however, is complex and time-consuming. Once the problematic update was installed, the underlying Windows OS would trigger BSOD, leaving the system unable to start through the normal boot process.

IT administrators had to manually boot affected systems into Safe Mode or the Windows Recovery Environment to delete the problematic channel file 291 and restore normal operations. That work is labor-intensive, especially for organizations with many affected devices. In some cases, it also required physical access to each machine, adding further time and effort.
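
The logic of that manual fix can be sketched in a few lines of Python. This is a hedged illustration, not CrowdStrike's published procedure: the function name and the `dry_run` flag are hypothetical, the directory that holds the channel files is left as a parameter, and a machine booted into Safe Mode or the Windows Recovery Environment would more likely be repaired from a command prompt than with a script. The sketch removes every file that matches the C-00000291*.sys pattern and does not check timestamps.

```python
from pathlib import Path


def remove_channel_file_291(directory: Path, dry_run: bool = True) -> list[Path]:
    """Find, and optionally delete, files matching C-00000291*.sys.

    With dry_run=True (the default) nothing is deleted; the matches are
    only reported so an administrator can review them first.
    """
    matches = sorted(directory.glob("C-00000291*.sys"))
    if not dry_run:
        for path in matches:
            path.unlink()  # delete the matching channel file
    return matches
```

Defaulting to a dry run is a deliberate choice here: on a system that is already unstable, listing what would be removed before removing it is the safer order of operations.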

Some businesses were able to apply the fix within a few days. However, the process was not straightforward for all, particularly those with extensive IT infrastructure and encrypted drives. The use of the Microsoft Windows BitLocker encryption technology by some organizations made it significantly more time-consuming to recover as BitLocker recovery keys were required.

By some estimates, it could take months for some organizations to fully recover all affected systems.

Hackers take advantage of outage

While the outage was not due to a cyberattack, threat actors have taken advantage of the incident.

According to a blog post from CrowdStrike, the security vendor has received reports of the following malicious activity:

  • Phishing emails sent to customers posing as CrowdStrike support.
  • Fake phone calls impersonating CrowdStrike staff.
  • Scripts offered for sale that claim to automate recovery from the botched update.
  • Individuals posing as independent researchers, claiming the outage was due to a cyberattack and offering remediation insights.

The Cybersecurity and Infrastructure Security Agency (CISA) urges individuals and organizations to only follow instructions from legitimate sources and avoid opening suspicious emails and links.

How can businesses be better prepared for tech outages?

The CrowdStrike Windows outage highlighted the vulnerabilities of modern society's heavy reliance on technology. While system backups and automated processes are essential, having manual procedures in place can significantly enhance business continuity during tech outages.

Businesses can take several steps to be better prepared for tech outages, including the following.

Test all updates before deploying to production

It has been a best practice for years to allow automated updates to ensure systems are always up to date. However, the CrowdStrike issue laid bare the underlying risk with that approach. For mission-critical systems, testing updates before deployment or having some form of staging environment before pushing updates to production might help to mitigate some risk.
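
One way to picture a staging step is a ring-based rollout, in which an update goes to a small group of test machines first and reaches production only if that group stays healthy. The Python sketch below is a minimal illustration under stated assumptions: the function name, the ring structure, the `apply_update` callback and the failure threshold are all hypothetical and do not describe how CrowdStrike or any particular vendor ships updates.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Apply an update ring by ring, halting when a ring fails too often.

    rings: groups of host names, ordered from least to most critical.
    apply_update: returns True if the host is healthy after the update.
    Returns the hosts updated successfully and whether every ring passed.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        # Stop before the next ring if this one exceeded the threshold.
        if ring and failures / len(ring) > max_failure_rate:
            return updated, False
    return updated, True
```

With the default threshold of zero, a single failed canary host stops the rollout before any production host is touched. That is the property a staging environment is meant to provide.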

Develop and document manual workarounds

Manual workarounds ensure critical business processes can continue even when technology fails. This approach was common before the digital age and can serve as a fallback in the event of an outage. Documenting and practicing manual procedures helps ensure businesses can still operate and serve their customers when systems are down.

Perform disaster recovery and business continuity planning

Outages happen for any number of different reasons. Having extensive disaster recovery and business continuity practices and plans in place is critical. Part of that effort should include the use of redundant systems and infrastructure to minimize downtime and ensure critical functions can switch to backup systems when needed.
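
The idea of switching to backup systems can be sketched as a simple failover loop: try the primary system first and move on to each backup only when the previous one fails. The Python below is an assumption-laden illustration. The function name is hypothetical, each system is modeled as a plain callable, and real failover also involves health checks, data replication and testing that this sketch leaves out.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Iterable[Callable[[], T]]) -> T:
    """Call each endpoint in order and return the first successful result.

    endpoints: callables ordered from primary to last-resort backup.
    Raises RuntimeError if every endpoint fails.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # any failure moves on to the next backup
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```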

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.
