Enterprises rework log analytics to cut observability costs

Organizations such as Netflix could be harbingers of an observability cost crisis, in which monitoring cloud-native apps consumes an untenable portion of operational costs.

As observability data grows, enterprises are rethinking the way it's analyzed to avoid breaking the bank with infrastructure costs.

Mainstream enterprises have begun feeling the pinch of observability data growth and its attendant infrastructure costs over the last two years. But video streaming service Netflix first encountered this problem more than a decade ago. In 2012, according to a presentation at the recent Monitorama conference, observability data management accounted for an estimated 25% of the Netflix AWS bill.

At the time, Netflix processed observability data through its internally developed Atlas in-memory database, which queried data after it was written to a back-end cloud storage system and issued thousands of alerts per day.

"It worked when we had thousands of alerts. But when we got to hundreds of thousands of alerts, it started to break," said David Gildeh, telemetry insights leader at Netflix, during his presentation. "Not only did we have to write all this volume into the database … [but as] more and more queries just kept hitting the database … we would have had to add more Atlas nodes just to scale with that load."

Today, that architecture wouldn't just be expensive -- it would be unworkable. As of this year, Atlas generates millions of alerts on 1.5 petabytes of log data per day, along with 17 billion metrics and 700 billion distributed traces.

However, Netflix's observability infrastructure costs have not scaled proportionately with this data growth, according to Gildeh. In fact, observability data processing now accounts for less than 5% of Netflix infrastructure costs.

Gildeh attributed the savings to streaming log analytics within the Atlas system, a shift spearheaded by his predecessors on the telemetry engineering team. Streaming analytics processes queries as data is collected rather than requiring data to be stored and indexed first.

"Now, in real time, our ingestion service forwards on all the queries we need for our alerting system in real time to the alerting service, which does its evaluations and triggers the alerts," he said. "Not only do we get lower latency, that also has allowed us to scale to millions of alerts without having to add a huge amount of infrastructure."

Netflix designed its own streaming log analytics system because vendor products were too expensive, David Gildeh said during his Monitorama presentation.

Puma drops Splunk for Coralogix streaming log analytics

Like data pipelines, streaming analytics emerged from data science, the field that ran into massive data management problems earlier than most enterprise application domains. Many open source and commercial tools now use streaming analytics to handle large-scale data ingestion without overwhelming infrastructure costs.

Now this architecture is becoming more common among observability tools. Grafana's Loki log monitoring tool also queries log streams. And an emerging log analytics vendor, Coralogix, caught the eye of a senior DevOps manager at athletic wear company Puma late last year with a product similar to the streaming log analytics system Netflix built for itself -- cost savings included.

"We had this vendor in Bulgaria who had developed a Splunk-based system for monitoring Salesforce Commerce Cloud. … But we were building this new, headless front end [for our website] that ran on AWS" starting in 2021, said Michael Gaskin, senior DevOps manager for global e-commerce at Puma. "All of our focus was on monitoring this new thing that we were building."

Gaskin considered Splunk's Observability Cloud toolset. But that product and those from competitors such as New Relic and Datadog were prohibitively expensive for Gaskin's application, which generates hundreds of gigabytes of logs per day.

Gaskin attended last year's AWS Summit Berlin in search of an alternative. During the conference, he set up and experimented with the Coralogix tool from his hotel room, and the experience sold him on the product. Because Coralogix analyzes log streams as they come in, it can parse them into simpler metrics before storing them, and it charges a lower price for log data stored that way, Gaskin said.
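
That ingest-time parsing is simple to illustrate. The following simplified Python sketch shows the general log-to-metric pattern; the log format, regular expression and field names here are assumptions for illustration, not Coralogix's actual pipeline:

```python
import re
from collections import defaultdict

# Naive access-log pattern -- an assumption for illustration, not a real format.
ACCESS_LOG = re.compile(r'(?P<status>\d{3}) (?P<bytes>\d+) (?P<latency>\d+)ms$')

class LogToMetrics:
    """Parse log lines into cheap aggregates at ingest time, before storage."""

    def __init__(self):
        self.status_counts = defaultdict(int)
        self.latency_total_ms = 0
        self.samples = 0

    def ingest(self, line: str) -> None:
        match = ACCESS_LOG.search(line)
        if not match:
            return  # unparsed lines could be routed to cheaper archive storage
        self.status_counts[match['status']] += 1
        self.latency_total_ms += int(match['latency'])
        self.samples += 1

pipeline = LogToMetrics()
for line in ['GET /checkout 200 5123 87ms', 'GET /cart 500 0 412ms']:
    pipeline.ingest(line)

# Raw lines reduced to a handful of numbers before anything is stored.
print(dict(pipeline.status_counts))                  # {'200': 1, '500': 1}
print(pipeline.latency_total_ms / pipeline.samples)  # 249.5
```

Aggregates like these are far smaller than the raw lines they summarize, which is what makes the cheaper storage pricing possible.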

"Back in Salesforce times, you couldn't do anything with [CDN logs] because they were too big," Gaskin said. "You could only request them from Salesforce in one-hour increments, which was a pain. But now I could ingest all the logs from [Fastly] into Coralogix and not bankrupt myself in the process."

Between data ingestion costs and professional services, Gaskin estimated that the 5-GB-per-day Splunk license he used with Salesforce Commerce Cloud cost about $20,000 per year. Further complicating matters was the fact that he oversees a DevOps team of third-party contractors, and getting them to set up more extensive log monitoring using Splunk would have been a logistical hassle, Gaskin said.

By contrast, the initial license Gaskin negotiated with Coralogix last year cost $50,000 and could accommodate 400 GB of log data per day from the new e-commerce application, including logs from the company's Fastly CDN.
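
Normalizing those two figures shows how stark the difference is. A back-of-the-envelope calculation using only the numbers Gaskin cited (and folding Splunk's professional services into its total, as he did):

```python
# Rough per-unit cost comparison from the figures cited in this article.
splunk_annual_cost = 20_000       # ~$20K/year, including professional services
splunk_gb_per_day = 5             # 5-GB-per-day license

coralogix_annual_cost = 50_000    # initial license negotiated last year
coralogix_gb_per_day = 400        # daily log volume it accommodates

print(splunk_annual_cost / splunk_gb_per_day)        # 4000.0 -> ~$4,000 per GB/day per year
print(coralogix_annual_cost / coralogix_gb_per_day)  # 125.0  -> ~$125 per GB/day per year
```

By that rough measure, the per-unit ingest cost fell by a factor of about 30, even though total annual spend went up.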

Splunk offers some event streaming analysis features of its own; its ingest actions let users route, filter and mask data as it's being ingested into Splunk Enterprise or the Splunk Cloud Platform.
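
Generically, ingest-time filtering and masking look something like the following Python sketch -- an illustration of the pattern, not Splunk configuration, SPL or Coralogix code, with patterns invented for the example:

```python
import re
from typing import Optional

DEBUG_NOISE = re.compile(r'\bDEBUG\b')      # lines not worth indexing
CARD_NUMBER = re.compile(r'\b\d{13,16}\b')  # naive card-number pattern, illustrative only

def ingest_action(line: str) -> Optional[str]:
    """Filter or mask a log line before it reaches indexed (expensive) storage."""
    if DEBUG_NOISE.search(line):
        return None                       # filter: drop the line entirely
    return CARD_NUMBER.sub('****', line)  # mask: redact sensitive values

print(ingest_action('INFO payment ok card=4111111111111111'))  # INFO payment ok card=****
print(ingest_action('DEBUG cache miss'))                        # None -- never indexed
```

Still, there are some things Gaskin said he misses about Splunk while using Coralogix.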

"It's a double-edged sword. Coralogix is really easy to use … [but] if you're one of those people who can write crazy [Splunk Search Processing Language] searches in Splunk to transform things in all kinds of ways and extract data in really mind-bending fashion, Coralogix isn't really [caught] up on that yet," he said.

Long-term, observability vendors face cost disruptions

At its scale, Netflix can afford the in-house expertise to build its own systems, but it has considered vendor products as well, Gildeh said during his presentation. He didn't disclose specific numbers but showed in a graph that most of the half-dozen or so log analytics vendor products Netflix considered dwarfed the cost of its internally developed streaming log analytics system.

"None of the vendors could even get in the same ballpark as what we're able to do," he said. "This is why we build everything -- because we can do it way cheaper than what vendors can offer us."

In the long run, enterprises facing the observability cost crunch without Netflix's internal expertise will begin to demand new approaches to pricing, especially from application performance management vendors, which are typically known for their premium licensing costs, said Nancy Gohring, an analyst at IDC.

"I'm hearing from a lot of vendors that are developing pricing model changes. So my sense is, more change is coming," she said. However, as in Puma's case, "this topic presents an opportunity for new technologies to emerge and potentially disrupt the market."

Such a wave of disruption already occurred over the last decade with the emergence of cloud-native observability tools such as SignalFx, now part of Splunk, and Lightstep, now owned by ServiceNow.

"They all developed technologies designed to more efficiently pull intelligence out of a growing amount of data," Gohring said. "Even if they didn't all harp on cost, at the end of the day, efficiently making sense of a huge volume of data comes down to cost."

Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached at [email protected] or on Twitter @PariseauTT.
