CrowdStrike outage underscores software testing dilemmas
Experts say efforts to avoid incidents such as last week's CrowdStrike outage will face time-honored tradeoffs between velocity, stability, access and security.
Endpoint security vendor CrowdStrike pledged to improve its software testing after a flawed content update caused a massive Windows systems outage last week. But avoiding future incidents might not be that simple, according to IT experts.
CrowdStrike issued a preliminary incident review this week detailing how a software bug caused a failure in the content validation tool it uses to look for errors such as the one that sent some 8.5 million Windows machines into a reboot loop last week. The company also clarified that the update was to configuration data within what it calls a Rapid Response Content template rather than to its main codebase or OS kernel drivers.
Rapid Response Content is meant to monitor systems for emerging cybersecurity threats "at operational speed," according to the CrowdStrike report. This type of file is more frequently updated than CrowdStrike's core application and had been subject to a less-extensive testing process. Following last week's incident, however, the company pledged to apply the same software testing procedures, including canary deployments, to Rapid Response Content.
In one software engineer's view, CrowdStrike is rightly answering for "doing lots of stuff it shouldn't," including less-thorough testing on Rapid Response Content updates that still had direct access to the Windows OS.
"Even testing for one minute would have discovered these issues," said Kyler Middleton, senior principal software engineer at healthcare tech company Veradigm. "In my mind, that one minute of testing would have been acceptable."
In general, software testing and test coverage are lacking in many corners of the market. For example, a Federal Communications Commission report this week found that an AT&T mobile network outage in February was caused by a network configuration update that had not followed the telco's own internal testing procedures.
This wasn't an isolated incident for IT, according to IDC data. The analyst firm's June 2024 "DevOps Practices, Perceptions, and Tooling" survey found that just 44% of 300 respondents' software quality tests were automated. Additionally, 27.3% chose testing and QA as one of the top two bottlenecks in their DevOps pipelines.
"Testing continues to be a significant point of friction [in application development]," IDC analyst Katie Norton said. "Software quality governance requires automation with agile, continuous quality initiatives in the face of constrained QA staff and increasing software complexity."
In other IDC surveys, software testing for both security and quality appears to be among the most promising uses for generative AI, Norton said.
"I am hopeful that the next few years will see improvements in these statistics," she said. "However, AI can't fix the lack of or failure to follow policy and procedures."
Balancing velocity, stability and security
However, Middleton and other industry observers acknowledged that the CrowdStrike outage wasn't simply caused by lax software testing processes. Instead, it's an example of the complex set of factors software developers must weigh when testing releases.
The CrowdStrike flaw was caused by multiple layers of bugs. That includes a bug in the content validator, a software testing tool that should have detected the flaw in the Rapid Response Content configuration template -- an indirect update method that, in theory, poses less of a risk of causing a system crash than updates to system files themselves, said Gabe Knuth, an analyst at TechTarget's Enterprise Strategy Group.
"This is a challenge in fully automated systems because they, too, rely on software to progress releases from development through delivery," Knuth said. "If there's a bug in the software somewhere in that CI/CD pipeline … it can lead to a situation like this. So to discover the testing bug in an automated way, you'd have to test the tests. But that's software, too, so you'd have to test the test that tests the tests and so on."
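One level of "testing the tests" that Knuth describes can be sketched by running a validator against fixtures whose correct verdict is already known. The Python example below is a minimal illustration under stated assumptions, not CrowdStrike's tooling: the template layout (`fields`, `expected_field_count`) and the names `validate_template` and `audit_validator` are hypothetical. It also does not escape the regress Knuth points to, since the audit is itself software.

```python
from typing import Callable, Iterable, List, Tuple

Template = dict  # hypothetical template shape, for illustration only


def validate_template(t: Template) -> bool:
    """Hypothetical validator: accept a template only if it is well formed."""
    fields = t.get("fields")
    if not isinstance(fields, list) or not fields:
        return False
    if t.get("expected_field_count") != len(fields):
        return False
    # Every field must be a non-empty string.
    return all(isinstance(f, str) and bool(f) for f in fields)


def audit_validator(
    validator: Callable[[Template], bool],
    known_good: Iterable[Template],
    known_bad: Iterable[Template],
) -> List[Tuple[str, Template]]:
    """Return every fixture the validator classified incorrectly."""
    mistakes = []
    for t in known_good:
        if not validator(t):
            mistakes.append(("rejected good", t))
    for t in known_bad:
        if validator(t):
            mistakes.append(("accepted bad", t))
    return mistakes


KNOWN_GOOD = [{"fields": ["a", "b"], "expected_field_count": 2}]
KNOWN_BAD = [
    {"fields": [], "expected_field_count": 0},         # empty template
    {"fields": ["a"], "expected_field_count": 2},      # count mismatch
    {"fields": ["a", ""], "expected_field_count": 2},  # blank field
]

if __name__ == "__main__":
    # An empty list means the validator agreed with every fixture.
    print(audit_validator(validate_template, KNOWN_GOOD, KNOWN_BAD))
```

A validator that wrongly accepts everything, such as `lambda t: True`, would be flagged once for each known-bad fixture, which is the kind of silent failure a pipeline gate of this sort is meant to surface.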
How extensive software testing should be depends on a risk assessment that encompasses not only the stability of systems but also how quickly they can be updated to mitigate rapidly emerging security threats.
"What's worse?" Knuth said. "A bug that crashes millions of endpoints and causes global disruption while it's fixed or a damaging vulnerability that results in lost intellectual property, private information, state secrets, etc.?"
As painful as it was to have a Windows outage ground airline flights and affect hospital systems, that kind of security compromise would be worse for many companies, Middleton said.
"In the end, companies would rather risk an availability failure from a bad update [to] their security tooling than risk a confidentiality failure from malware compromising their hosts," she said. "On the outside, as consumers, we see it as about the same -- the services we use aren't available. But from the inside, it's totally different."
While compromised service availability affects the bottom line to a small degree, according to Middleton, malware could leak data that causes a company to close due to legal fees or causes so much damage to a company's reputation that it loses customers.
"Companies would much rather be shut down by a bad update than malware," Middleton said. More extensive software testing "does come with risks. These update files are composed to quickly respond to emerging malware threats, and any delay, even one minute, could possibly leave the door open for a sensitive enterprise server to be infected."
IT pros call for canary deployments and more
In response to the outage, CrowdStrike will perform incrementally phased rollouts of changes, or canary updates, with Rapid Response Content files the same way it does with less frequent app updates, according to the company's preliminary post-incident report.
For some organizations, this will offer some reassurance that a flawed update won't hit every customer machine all at once.
"I'm incredibly surprised, even though they call it 'Rapid Response,' that [CrowdStrike] doesn't have some phased approach that allows them to check in on the health of the endpoints that have been deployed," said Andy Domeier, senior director of technology at SPS Commerce, a Minneapolis-based communications network for supply chain and logistics businesses. "Even with some logical order of customer criticality, they could have circuit breakers to stop a deployment early that they see causes health issues. For example, don't [update] airlines until your confidence level is higher from seeing the health of endpoints from other customers."
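The phased approach Domeier describes, with a circuit breaker that halts a deployment when endpoint health degrades, can be sketched roughly as follows. This Python example is an illustration only and assumes a great deal: the ring names, the `deploy` and `is_healthy` callbacks and the failure threshold are all hypothetical, and nothing here reflects how CrowdStrike actually ships updates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Ring:
    """A group of hosts that receives the update at the same stage."""
    name: str
    hosts: Sequence[str]


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None  # ring where the breaker tripped, if any


def phased_rollout(
    rings: Sequence[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> RolloutResult:
    """Deploy ring by ring, least critical first; stop on poor health."""
    result = RolloutResult()
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        # Check the health of what was just deployed before moving on.
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            result.halted_at = ring.name  # circuit breaker trips here
            return result
        result.completed.append(ring.name)
    return result
```

Ordering the rings so that the most critical customers come last means a bad update stops before it reaches them, at the cost of those customers receiving new threat detections later than everyone else.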
Other software engineers said canary deployments would be a good step. However, they said CrowdStrike should rethink its application architecture more broadly so that rapidly updated files are separated from the operating system kernel by an abstraction layer, such as a management controller, hypervisor or eBPF program.
"It is absolutely irresponsible to auto-deploy a kernel module update globally without a health-mediated process or, at least, a recovery path at a lower level of the control plane," said David Strauss, co-founder and CTO at WebOps service provider Pantheon. "Something that remains functional even if the OS deployed on top crashes."
Customers that run such relatively high-octane malware detection software on relatively noncritical machines also bear some responsibility for the impact of the CrowdStrike outage, Strauss added.
"The use of CrowdStrike on things like airline gate terminals is absurd to me," he said. "Machines like that are single-purpose and should be secured using restricted privileges … and integrity validation. … The only place where it makes sense to watch for malware is when you can't do those two things. Even then, app stores, signed releases and OS-enforced sandboxing are the modern approaches to handling that -- much more than scanning agents that run on end-user computer devices."
Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.