InfoSec community sounds off on CrowdStrike outage, next steps

Security experts offered their thoughts on the recent IT outage, praising CrowdStrike's response time but saying the outage highlights issues in the software updating process.

While infosec experts agree the recent global IT outage caused by a defective CrowdStrike channel file update highlights inherent problems with the way software is updated, many applauded the vendor for its swift incident response.

Last month, more than 8.5 million Windows devices experienced blue screens of death and reboot loops, triggered by an errant configuration update for CrowdStrike's Windows sensors released on July 19. Although 8.5 million represents a small share of all Windows devices, the outage led to days-long disruptions for a number of organizations across multiple sectors, including healthcare and transportation.

CrowdStrike's investigation determined that a bug in the vendor's content validation system was responsible in part for the massive outage. Moreover, the sensor update was classified as Rapid Response Content, which does not undergo the same type of pre-release testing as releases classified as Sensor Content. It did not help matters that the remediation process was largely manual, though Microsoft and CrowdStrike later published tools to make that process easier.

On the positive side of things, the security vendor acted quickly to respond to the outage, assist customers and provide detailed status updates on its investigation. CrowdStrike CEO George Kurtz likewise provided additional information via communications such as LinkedIn blog posts.

CrowdStrike said 97% of Windows sensors were back online one week after the outage. And on Wednesday, the vendor said in its remediation hub that approximately 99% of sensors have been brought back online.

In the wake of the outage, infosec experts shared their thoughts with TechTarget Editorial on the incident and CrowdStrike's response to it. While the security vendor's rapid response was largely praised, some experts felt more rigorous software testing and phased rollouts could have prevented the issue from reaching such a massive scale.

The failure of CrowdStrike's testing process

Both CrowdStrike's testing process and its use of its own products to identify issues failed, allowing an undetected bug to ship to customers, said Chris Eng, chief research officer at Veracode, a secure application development company based in Burlington, Mass. He said staged rollouts may have helped some customers avoid the outage.

"What this incident also illustrates is that [quality assurance] tooling itself can contain bugs, just like any other software," Eng said. "The speed and complexity of modern software development requires rigorous testing and multiple safeguards in place to build resiliency and avoid incidents like this."

Tony Anscombe, chief security evangelist at ESET, a security software provider based in Slovakia, agreed that security vendors need to ensure they have failsafes when releasing updates. Anscombe added that the need to continually update software is inherently linked to how threat actors continue to adapt their tactics, techniques and procedures. He also addressed CrowdStrike's hold on major critical infrastructure organizations.

"It also demonstrates the need for diversity; the reliance on a single provider by so many critical infrastructure companies needs to be changed," Anscombe said.

Like Anscombe, Chris Wysopal, CTO and co-founder of Veracode, stressed that software must be continually updated to keep up with the influx of attacks. He told TechTarget Editorial that the outage came down to two competing problems. The first is a design and architecture issue in the way Windows drivers run in the kernel, which he said allows any driver bug to cause a blue screen and crash the system. Wysopal stressed that the kernel driver problem does not affect Linux or macOS. The second is the need to update security software constantly.

"Driver software has to be heavily tested, and so the testing requirements are higher for software like that," he said. "On the flip side, you have antimalware software, which must be constantly updated because there are always new attacks coming out.

"These two things are opposing each other -- one is really rigorous testing requirements, and on the other hand you have to update several times a day when new attacks come out because they move so quickly," Wysopal said. "The best solution would be for Microsoft to rearchitect its drivers so there's no way it could crash the system, but I'm not holding my breath for that. It was a decision made many, many years ago."

Tim Mackey, head of software supply chain risk strategy at Synopsys, told TechTarget Editorial that the outage may have vendors rethinking how to better handle system updates where data is encrypted at rest. Like Eng, he called for phased update rollouts in situations where there is less control over the deployment environment.

"While there has been speculation as to the root cause and assertions of the nature of pre-release testing, the reality is that with most outages there are multiple contributing factors," Mackey said. "CrowdStrike intends to improve how it's using various software testing techniques. That shouldn't be read as to imply that those techniques weren't being used, but rather it's clear something was able to slip through the gaps.

"Such analysis is common as a post-incident assessment and is typically part of the threat modelling any software producer should be doing -- where the threat in this case has internal origins," Mackey added.

On the Falcon Content update page, CrowdStrike said it will improve its Rapid Response Content testing with stability testing, content update and rollback testing, and local developer testing. Regarding deployment, CrowdStrike said it will improve monitoring and implement staggered update rollouts.

"Based on what we experienced and what CrowdStrike states they're going to do moving forward, bulk update of software without validating the success and monitoring failures that should pause rollout of updates is a key lesson that anyone who remotely deploys software should learn from," Mackey said.
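The pattern Mackey describes, pushing an update in stages and pausing when monitoring shows too many failures, can be sketched in a few lines. The following is a minimal illustration only, not CrowdStrike's actual process: the function name `staged_rollout`, the stage sizes, the 2% failure threshold and the `apply_update` health-check callback are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    healthy: List[str] = field(default_factory=list)  # updated and reported healthy
    failed: List[str] = field(default_factory=list)   # updated and reported unhealthy
    halted: bool = False                               # True if the rollout was paused early


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],          # hypothetical: returns True if the host is healthy afterward
    stage_percents: Sequence[int] = (1, 10, 50, 100),  # assumed cumulative stage sizes
    max_failure_rate: float = 0.02,                # assumed per-stage threshold
) -> RolloutResult:
    """Update hosts in growing stages and pause if a stage fails too often."""
    result = RolloutResult()
    done = 0  # number of hosts already attempted
    for percent in stage_percents:
        # Cumulative target for this stage, rounded up so small fleets still get a canary.
        target = min(len(hosts), -(-len(hosts) * percent // 100))
        batch = hosts[done:target]
        if not batch:
            continue
        batch_failures = 0
        for host in batch:
            if apply_update(host):
                result.healthy.append(host)
            else:
                result.failed.append(host)
                batch_failures += 1
        done = target
        # Monitor the stage that just finished before widening the rollout.
        if batch_failures / len(batch) > max_failure_rate:
            result.halted = True
            break
    return result
```

Under these assumptions, an update that breaks every host in a fleet of 1,000 would stop after the first 10 machines instead of reaching all of them. The trade-off is speed: each stage needs time for health signals to arrive before the next one begins.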

Timeline of the defective CrowdStrike channel file update that downed more than 8 million Windows devices in July.

CrowdStrike's response

Chris Steffen, vice president of research at analyst firm Enterprise Management Associates, told TechTarget Editorial that while this outage was particularly painful, CrowdStrike is a "good security vendor" he has recommended throughout the years and will continue to recommend.

"I believe that it should be viewed as an isolated incident and not a larger problem," Steffen said.

He said the faulty update was caused by a "process problem within the development teams," and that while these kinds of issues aren't isolated to CrowdStrike, he felt there was room for the security vendor to improve.

"Could CrowdStrike have done better? Of course. And I am sure that they will, either through improved release processes or better education of dev teams and end users," Steffen said. "By all accounts, they realized their error and deployed a fix after just over an hour. Could it have been sooner? Possibly. But the damage from the initial release was already done by that point. It is also fair to note that this was not the first issue that CrowdStrike had created with a faulty release. So, there is absolutely room for improvement."

Jake Williams, a faculty member at cybersecurity research and advisory firm IANS Research in Boston, said Kurtz should have announced early in the outage response that CrowdStrike was hiring an external auditor to review company processes. "This is almost certain to happen anyway; nothing less will instill full customer trust," Williams said. "Announcing this early in the incident would have been a positive step towards restoring customer confidence."

Paul Davis, field CISO of supply chain security vendor JFrog, commended CrowdStrike's incident response team for taking quick action to determine the root cause and notify customers, adding Kurtz's LinkedIn blog outlining what happened was "honest and clear."

Regarding the idea that CrowdStrike could have prevented the errant update that led to the outage with better testing before it was released, Davis said he didn't necessarily agree.

"Writing software is a complex process, which gets even more challenging as the software's functionality changes or ages over time, making testing every potential deployment scenario near impossible," Davis said.

"In the world of security, one must always be prepared for the unexpected and have an incident plan for those surprise events," he said. "There is no such thing as perfect software. After all, software is built by humans and to err is human. It's how quickly you identify and recover from the problem that matters most."

Similarly, Danny Jenkins, CEO and co-founder at security software provider ThreatLocker, said, "hindsight is always 20/20."

"It is very easy to say they could have done things differently post-event," Jenkins said. "In general, CrowdStrike was in a difficult situation. Its response was very fast."

Omdia senior principal analyst Fernando Montenegro noted CrowdStrike's swift tactical action and an immediate response from executives but said there was room for improvement -- pointing to the "borderline useless gesture" of CrowdStrike allegedly sending $10 Uber Eats gift cards to partners following the outage.

CrowdStrike's response is just beginning

Although the most acute aspects of the outage seem to have been addressed, the more complicated aspects of CrowdStrike's response have only just begun.

In addition to assisting any remaining customers affected by the outage, CrowdStrike must implement new testing and deployment processes for its Rapid Response Content sensor updates. The vendor must also answer to Congress.

Congressmen Mark Green (R-Tenn.) and Andrew Garbarino (R-N.Y.) requested public testimony from Kurtz before the House Committee on Homeland Security regarding the global IT outage. In an open letter, the representatives positively referenced CrowdStrike's response but said Americans deserved to know the truth of the outage in detail.

"While we appreciate CrowdStrike's response and coordination with stakeholders, we cannot ignore the magnitude of this incident, which some have claimed is the largest IT outage in history," the representatives wrote. "Recognizing that Americans will undoubtedly feel the lasting, real-world consequences of this incident, they deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking."

Moreover, CrowdStrike may have to contend with multiple lawsuits. Law firm Labaton Keller Sucharow LLP announced on July 30 that it had filed a class action lawsuit against CrowdStrike and certain executives on behalf of investors financially affected by the outage. According to the complaint, CrowdStrike's claims that the Falcon platform is "validated, tested, and certified" were false and misleading because, as the firm argues, CrowdStrike had deficient testing processes for its updates.

A CrowdStrike spokesperson told TechTarget Editorial that "We believe this case lacks merit and we will vigorously defend the company."

CNBC reported Monday, meanwhile, that Delta allegedly hired attorney David Boies to pursue damages from CrowdStrike for costs related to the outage. TechTarget Editorial contacted both Delta and Boies' law firm, Boies Schiller Flexner LLP, but neither responded by press time. A CrowdStrike spokesperson said that, "We are aware of the reporting, but have no knowledge of a lawsuit and have no further comment."

Omdia's Montenegro said while CrowdStrike handled the outage response well, the company's true test comes now.

"In the short term, CrowdStrike must withstand the inevitable quagmire of legal wranglings, navigate uncomfortable conversations with existing customers around losses, and fight off renewed vigor from its competitors," Montenegro said.

"In the longer term, the question becomes: how can it demonstrate that it has improved its processes to reduce the likelihood of this happening again in the future, all the while maintaining the efficacy of its offering? How will internal practices at CrowdStrike change? What changes, new features or configuration options are being added to the product to address this type of situation?"

Alexander Culafi is a senior information security news writer and podcast host for TechTarget Editorial.

Arielle Waldman is a news writer for TechTarget Editorial covering enterprise security.
