How Zoom security incident response survived the pandemic
March 2020's influx of users meant the video conferencing company had to massively scale its incident response operation and the observability infrastructure that fed it, and fast.
BOSTON -- As the COVID-19 pandemic spread across the globe in early 2020, engineers at video conferencing company Zoom faced a daunting twofold task: improve incident response while accommodating exponential growth.
In December 2019, the number of daily meeting participants on Zoom's free and paid services was approximately 10 million. By March, when the pandemic reached the U.S. and forced a major transition to remote work, that number reached 200 million, according to a blog post by CEO Eric Yuan. From December 2019 to April 2020, the number of daily meeting participants grew 30-fold. The company also expanded its services to include Zoom Apps, which integrates with external business workflow services such as DocuSign's electronic signature tool, Asana workflow management and Dropbox online storage; a Contact Center offering; and a revamped Zoom Whiteboard feature.
Zoom's hybrid cloud infrastructure, including its security monitoring and observability data volumes, has grown in step with that usage, according to Zoom engineers who spoke at the AWS re:Inforce conference this week. The volume of security data logs produced by the company's apps in AWS, data center colocation facilities and SaaS services has gone from gigabytes per day to hundreds of terabytes per day.
Before the pandemic, the company's security data log infrastructure consisted of a handful of servers -- a single search head, 10 clustered log indexers and a few intermediate log forwarders, said Iman Roodbaei, security infrastructure and data architect at Zoom, during the presentation.
"Now, we have more than 250 indexers, more than 200,000 forwarders at the peak of use, and more than 50 intermediate forwarders," Roodbaei said.
Zoom security under scrutiny amid a firehose of data
Zoom's security practices were called into question by customers just after the initial pandemic surge in late March 2020, when reports surfaced about a vulnerability in Zoom's Windows client that could expose user credentials. This, along with concerns about Zoom bombing and encryption, prompted Yuan to make a public commitment to improve security in a blog post on April 1. In the 90 days that followed, Zoom hired executives including a new president of product and engineering, a chief information security officer and a chief operating officer, and enacted a freeze on all its software features not related to privacy, trust and security.
Zoom had already contracted with AWS on a Security Epics program before the pandemic, according to this week's presentation. The program consists of professional and technical services, including AWS Organizations for account access control, Amazon GuardDuty for threat detection, and AWS Security Hub for cloud security posture management; upskilling training for more than 100 operations and security staff through hands-on workshops and labs; and incident response simulations.
"Zoom was able to rapidly scale their security practice in response to unforeseen business circumstances because they started working on a framework [for scalability] before the pandemic," said Amit Agrawal, AWS lead solutions architect, during the presentation.
To accommodate long-term log data storage that now requires petabyte scale, Zoom makes heavy use of Amazon EC2 autoscaling, along with automated compute provisioning via HashiCorp's Terraform infrastructure-as-code tool. All of Zoom's cloud accounts send logs to the Amazon Kinesis data pipeline service via various monitoring and security tools such as Amazon CloudWatch, AWS CloudTrail and AWS Config.
As data is ingested into this pipeline, Amazon Kinesis triggers AWS Lambda functions that funnel data into Amazon S3 storage buckets. This event-driven architecture is easy to scale to accommodate burgeoning data growth, including the occasional spikes in data ingestion load, Roodbaei said.
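The Kinesis-to-Lambda-to-S3 step described above can be sketched roughly as follows. This is a minimal illustration, not Zoom's code: the bucket name, the key layout and the `handle_batch` helper are hypothetical, and the payloads are assumed to be JSON. The S3 write is passed in as a callable so the sketch runs without AWS credentials; in a deployed function it would likely be a boto3 S3 client's `put_object`. The one AWS-specific detail it relies on is that Kinesis delivers record data to Lambda base64-encoded under `Records[].kinesis.data`.

```python
import base64
import json
from typing import Callable, List, Optional


def decode_kinesis_records(event: dict) -> List[dict]:
    """Decode the base64 payloads that Kinesis hands to a Lambda function."""
    decoded = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))  # assumption: producers send JSON
    return decoded


def handle_batch(
    event: dict,
    put_object: Callable[..., object],
    bucket: str = "example-security-logs",  # hypothetical bucket name
) -> Optional[str]:
    """Write one batch of records to object storage as a JSON Lines file."""
    records = decode_kinesis_records(event)
    if not records:
        return None  # nothing to store for an empty batch

    # Hypothetical key layout: name the object after the first sequence number.
    first_seq = event["Records"][0]["kinesis"]["sequenceNumber"]
    key = f"raw/{first_seq}.jsonl"
    body = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")

    # Same keyword arguments as boto3's S3 client put_object.
    put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

Because each invocation handles only the batch it was given, a spike in ingestion simply results in more concurrent invocations rather than a larger server, which is the scaling property Roodbaei described.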
From there, Amazon's SQS messaging service triggers Zoom's custom-built SmartLogger utility, which further parses and sorts log data for use by the company's security information and event management (SIEM) and security orchestration, automation and response (SOAR) systems. Along the way, Amazon's Kinesis Data Firehose helps to normalize the data before it's passed to Zoom's Splunk SIEM system, which is important for getting the most out of that tool's analytics, according to Roodbaei.
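How the SQS hand-off works internally was not described in the presentation. One common pattern, assumed here purely for illustration, is for S3 to post an event notification to an SQS queue whenever a new log object lands, and for a consumer to read the bucket and key out of each message. The sketch below shows only that parsing step; the function name is hypothetical and it says nothing about how SmartLogger itself is built.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def s3_objects_from_sqs_event(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs referenced by S3 notifications in SQS messages."""
    objects = []
    for message in event.get("Records", []):
        # Each SQS message body is assumed to hold one S3 notification as JSON.
        notification = json.loads(message["body"])
        # Test notifications carry no "Records" list, so default to empty.
        for rec in notification.get("Records", []):
            bucket = rec["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(rec["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects
```

A downstream parser would then fetch each object and sort its contents by log type before forwarding them to the SIEM and SOAR systems.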
"It's very easy for any data producer, with a minimal amount of writing applications and such, to send data to any downstream systems [through Kinesis Firehose]," he said. "It's also helping us with data normalization, which is very important... to reducing our compute [requirements] and [making] our search as fast as possible."
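To make the normalization point concrete, here is a small sketch of mapping two differently shaped log records onto one flat schema, so that a single search can cover both. The input field names follow the CloudTrail record and GuardDuty finding formats; the target schema (`time`, `source`, `account_id`, `region`, `action`) and the function names are assumptions made for this example, not the schema Zoom uses.

```python
from datetime import datetime, timezone
from typing import Dict


def _to_utc_iso(ts: str) -> str:
    """Render an ISO 8601 timestamp in one canonical UTC form."""
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def normalize_cloudtrail(record: dict) -> Dict[str, str]:
    """Map a CloudTrail record onto the hypothetical common schema."""
    return {
        "time": _to_utc_iso(record["eventTime"]),
        "source": "cloudtrail",
        "account_id": record["recipientAccountId"],
        "region": record["awsRegion"],
        "action": record["eventName"],
    }


def normalize_guardduty(finding: dict) -> Dict[str, str]:
    """Map a GuardDuty finding onto the same hypothetical schema."""
    return {
        "time": _to_utc_iso(finding["createdAt"]),
        "source": "guardduty",
        "account_id": finding["accountId"],
        "region": finding["region"],
        "action": finding["type"],
    }


NORMALIZERS = {"cloudtrail": normalize_cloudtrail, "guardduty": normalize_guardduty}


def normalize(source: str, record: dict) -> Dict[str, str]:
    """Dispatch to the normalizer registered for the given source."""
    return NORMALIZERS[source](record)
```

With every record reduced to the same handful of fields before indexing, searches do not need per-source field extraction at query time, which is one plausible way normalization could cut compute and speed up search.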
SOARing above security incident noise
With the company's cloud infrastructure updated to accommodate security log data growth, Zoom engineers next turned to ways they could harness that data and use automation to make incident response more effective. About a year ago, the company added Splunk's SOAR tool to pinpoint the most critical incidents for its security operations center teams, and to automate triage for lower-level issues.
SOAR helps Zoom's incident response teams tune alert logic and reduce the number of incident alerts they have to sort through, said Vijay Chepuri, engineering manager for security monitoring and logging at Zoom, during the presentation.
Chepuri gave a hypothetical example of a security incident response workflow in which Amazon GuardDuty identifies an AWS Identity and Access Management account that may be compromised, and Splunk's SIEM inspects CloudTrail logs and GuardDuty service monitoring data to assign a priority to the incident. From there, Splunk SOAR would automatically block the suspicious account and create a ticket in the company's helpdesk system, after which the security team might remove any rogue resources created by the compromised account.
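The shape of that hypothetical workflow can be sketched as a small decision function. Everything here is illustrative: the `Finding` class, the severity thresholds, the list of sensitive API calls and the action names are assumptions, and the sketch does not use or represent Splunk SOAR's playbook API. It only shows the idea of scoring a finding and mapping the score to automated steps.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical list of API calls treated as signs of an active compromise.
SENSITIVE_CALLS = {"CreateUser", "CreateAccessKey", "RunInstances"}


@dataclass
class Finding:
    principal: str                      # the IAM identity flagged by detection
    severity: float                     # numeric severity reported with the finding
    recent_api_calls: List[str] = field(default_factory=list)  # from audit logs


def assign_priority(finding: Finding) -> str:
    """Combine detection severity with audit-log context (thresholds illustrative)."""
    touched_sensitive = any(c in SENSITIVE_CALLS for c in finding.recent_api_calls)
    if finding.severity >= 7.0 or (finding.severity >= 4.0 and touched_sensitive):
        return "high"
    if finding.severity >= 4.0:
        return "medium"
    return "low"


def plan_response(finding: Finding) -> List[str]:
    """Map a priority to ordered response steps; action names are placeholders."""
    priority = assign_priority(finding)
    if priority == "high":
        return ["block_principal", "open_ticket", "review_created_resources"]
    if priority == "medium":
        return ["open_ticket"]
    return []  # low-priority findings are logged but raise no alert
```

In this sketch only high-priority findings trigger the automatic block, while the cleanup of rogue resources stays a queued step for a human, mirroring the split between automated and manual work in Chepuri's example.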
"SOAR is still a work in progress, and we're still working to close gaps and mature our process of collaboration between the security team and program owners," Chepuri said in an interview after the presentation.
So far, however, "it's reduced false positives and improved our alert efficiency by 30 to 40%," Chepuri said. "It's cut down hundreds of thousands of incidents [to deal with] to hundreds."
Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.