CrowdStrike details errors that led to mass IT outage
CrowdStrike's investigation into the recent defective update found that a 'confluence' of issues led to the release of the channel file last month, causing a mass IT outage.
CrowdStrike on Tuesday published its full root cause analysis for last month's defective channel file update that caused more than 8 million Windows systems to enter reboot loops.
The report, titled "External Technical Root Cause Analysis -- Channel File 291," examined the factors that led to the botched Falcon sensor update being delivered to CrowdStrike customers, triggering a mass IT outage on July 19. The cybersecurity vendor had previously issued a preliminary report that attributed the incident to a vulnerability in the company's content validator, which allowed the channel file to pass through internal checks.
In the 12-page report, CrowdStrike identified a series of issues that contributed to the errant release of Channel File 291, starting with an earlier update to its Falcon platform. In February, the company released version 7.11 of its Windows sensor, which contained a new type of inter-process communication (IPC) template.
"The new IPC Template Type defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291's Template Instances supplied only 20 input values to match against," the report read. "This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field. In part, this was due to the use of wildcard matching criteria for the 21st input during testing and in the initial IPC Template Instances."
CrowdStrike explained that two IPC template instances were released on July 19, one of which included a non-wildcard matching criterion for the 21st input parameter. Those template instances produced a new version of Channel File 291 that required the sensors to inspect the 21st input parameter. However, CrowdStrike said no IPC template instances in previous channel file updates had used the 21st input parameter field.
"Sensors that received the new version of Channel File 291 carrying the problematic content were exposed to a latent out-of-bounds read issue in the Content Interpreter. At the next IPC notification from the operating system, the new IPC Template Instances were evaluated, specifying a comparison against the 21st input value," the report read. "The Content Interpreter expected only 20 values. Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash."
A logic error in the content validator allowed Channel File 291 to be sent to the content interpreter, CrowdStrike said. While the defective channel file was merely a content update for configuration settings, it impacted the sensor software, including the Windows kernel driver, running on customers' systems. The incident has sparked a debate within the tech industry over whether Microsoft should provide kernel-level access to third-party vendors like CrowdStrike.
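The mismatch the report describes can be illustrated with a minimal Python sketch. This is not CrowdStrike's code: the function name, the `WILDCARD` marker and the sample values are hypothetical, and Python raises an `IndexError` where the kernel-mode sensor performed an actual out-of-bounds memory read. The sketch only shows why a wildcard in the 21st field hid the problem while a non-wildcard criterion exposed it.

```python
WILDCARD = None  # hypothetical marker meaning "match anything"


def evaluate(criteria, inputs):
    """Return True if every non-wildcard criterion equals the input at the same index."""
    for index, expected in enumerate(criteria):
        if expected is WILDCARD:
            continue  # wildcard: the input at this position is never read
        if inputs[index] != expected:  # fails if index is past the end of inputs
            return False
    return True


# The integration code supplies only 20 values (assumed sample data).
inputs = [f"value{i}" for i in range(20)]

# 21 criteria with a wildcard in the 21st field: the missing input is never touched.
wildcard_instance = list(inputs) + [WILDCARD]

# 21 criteria with a non-wildcard 21st field: the interpreter must read input 21.
strict_instance = list(inputs) + ["some-value"]

print(evaluate(wildcard_instance, inputs))
try:
    evaluate(strict_instance, inputs)
except IndexError as exc:
    print("out-of-bounds read:", exc)
```

In this sketch the first instance evaluates cleanly because the 21st position is skipped, which mirrors how earlier template instances passed testing and field deployment without incident.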
"In summary, it was the confluence of these issues that resulted in a system crash: the mismatch between the 21 inputs validated by the Content Validator versus the 20 provided to the Content Interpreter, the latent out-of-bounds read issue in the Content Interpreter, and the lack of a specific test for non-wildcard matching criteria in the 21st field," CrowdStrike wrote. "While this scenario with Channel File 291 is now incapable of recurring, it also informs process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience."
Remediations and improvements
CrowdStrike has faced heavy criticism over the last several weeks from those who said the cybersecurity vendor's testing processes and safeguards for content updates were woefully inadequate. CrowdStrike said in its preliminary report that content updates like Channel File 291 are what the vendor calls Rapid Response Content, which previously were not subjected to the same internal tests and reviews as software updates.
In its root cause analysis, CrowdStrike outlined several mitigations and changes to improve the process. The company said it updated the Falcon platform to give customers greater control over how Rapid Response Content is deployed. "Customers can choose where and when Rapid Response Content updates are deployed," the report read. "We are continuing to enhance this capability to provide more granular control over Rapid Response Content deployments together with content update details via release notes, to which customers can subscribe."
Content and template type updates will also undergo more testing procedures, including fuzz testing, CrowdStrike said. Additionally, the company will use staged deployment for such updates going forward.
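As a rough illustration of staged deployment in general, the Python sketch below pushes an update to progressively wider groups of hosts and halts at the first group that reports a problem. The group names and the `is_healthy` callback are hypothetical and are not taken from CrowdStrike's report; a real system would also have to undo the update on hosts that already received it.

```python
def staged_rollout(rings, is_healthy):
    """Deploy to each ring in order; stop and report a rollback on the first failure.

    Returns a (status, reached) tuple, where reached lists the rings that
    received the update before the rollout finished or was halted.
    """
    reached = []
    for ring in rings:
        reached.append(ring)  # the update is pushed to this ring
        if not is_healthy(ring):
            # Problems detected: wider rings never receive the update.
            return "rolled_back", reached
    return "completed", reached


# Hypothetical ring names, ordered from narrowest to widest.
RINGS = ["canary", "early", "broad", "all"]

status, reached = staged_rollout(RINGS, lambda ring: ring != "early")
print(status, reached)  # rolled_back ['canary', 'early']
```

The point of the ordering is that a failure in a narrow group limits how many systems are affected, since the wider groups are never reached.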
"Staged deployment mitigates impact if a new Template Instance causes failures such as system crashes, false-positive detection volume spikes or performance issues," the report read. "New Template Instances that have passed canary testing are to be successively promoted to wider deployment rings or rolled back if problems are detected. Each ring is designed to identify and mitigate potential issues before wider deployment."
Lastly, CrowdStrike said the content validator, which allowed the defective channel file to pass through to the content interpreter, has been modified to allow only wildcard matching criteria in the 21st field. That change prevents the problem on sensor versions that provide only 20 inputs. The vendor said it will release an additional fix for the content validator later this month.
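One way to picture this kind of validator check is the Python sketch below, which rejects any template instance that places a non-wildcard criterion in a field the sensor does not supply. It is an assumption about the general shape of such a check, not CrowdStrike's implementation; `validate_instance`, `WILDCARD` and the sample criteria are hypothetical names.

```python
WILDCARD = None  # hypothetical marker meaning "match anything"


def validate_instance(criteria, provided_input_count):
    """Raise ValueError if a non-wildcard criterion targets a field with no input."""
    for index, expected in enumerate(criteria):
        if index >= provided_input_count and expected is not WILDCARD:
            raise ValueError(
                f"field {index + 1} has a non-wildcard criterion, "
                f"but only {provided_input_count} inputs are provided"
            )


# Wildcard in the 21st field against 20 inputs: accepted.
validate_instance(["x"] * 20 + [WILDCARD], 20)

# Non-wildcard in the 21st field against 20 inputs: rejected before deployment.
try:
    validate_instance(["x"] * 20 + ["y"], 20)
except ValueError as exc:
    print("rejected:", exc)
```

A check like this fails at validation time, before content ships, instead of leaving the mismatch to surface when the interpreter evaluates the instance on a customer's system.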
"The Content Validator is being modified to add new checks to ensure that content in Template Instances does not include matching criteria that match over more fields than are being provided as input to the Content Interpreter," CrowdStrike said. "This fix will be released to production by August 19, 2024."
Along with the root cause analysis, CrowdStrike said it is conducting independent reviews of internal processes. "CrowdStrike has engaged two independent third-party software security vendors to conduct further review of the Falcon sensor code for both security and quality assurance," the company said. "Additionally, we are conducting an independent review of the end-to-end quality process from development through deployment."
Dustin Childs, head of threat awareness at Trend Micro's Zero Day Initiative, told TechTarget Editorial that CrowdStrike's overall response has been good, with mostly consistent messaging and a rapidly mobilized response team to assist customers with remediation.
"In a lot of ways, for crisis response, they did relatively well," he said. "I hate to say that for a competitor, but I'll give props where it's due."
However, Childs had concerns about the root cause analysis report.
"It's really hard for me to make an accurate judgment, because I don't think they are being 100% transparent," he said. "I think they're withholding information because that's sometimes what CrowdStrike does. They've always played things very close to the vest."
As an example, Childs said CrowdStrike has typically required security researchers to sign NDAs regarding vulnerabilities found in the company's products. "They're not very transparent when it comes to their security."
Senior security news writer Alex Culafi contributed to this article.
Rob Wright is a longtime reporter and senior news director for TechTarget Editorial's security team. He drives breaking infosec news and trends coverage.