Grafana Loki users reap log data savings, with tradeoffs
Grafana Loki won't replace advanced log analytics tools, but it may be a boon for shops that want to collect massive amounts of log data for troubleshooting applications.
Log data has taken on massive proportions as complex microservices apps and distributed systems proliferate, and it's clear that new approaches to log management are needed to handle the volume.
Early adopters of the Grafana Loki log data management tool said it's helped them manage logs more efficiently alongside Prometheus metrics, specifically for troubleshooting purposes on Kubernetes-based microservices apps.
Grafana Loki may not be a one-size-fits-all tool -- its creators in the open source community said it isn't meant to replace tools like Splunk or Elasticsearch for business intelligence analytics, for example. But it can have major advantages if used correctly, chiefly the ability to ingest large amounts of log data quickly and store it cheaply.
"We are ingesting about 2,000 logs per minute, per service, and we are centralizing logs from about 25 services," said Piyush Baderia, DevOps engineer at Paytm Insider, an event ticketing service based in Mumbai. "[With Grafana Loki,] we've saved around 75% [on storage costs] already, and we're looking to maximize it to about 90% if possible -- we think that's achievable."
Log analytics platforms index the full contents of each log and store that data in a structured database for later queries. Loki instead indexes only a set of labels, stored as key-value metadata pairs in a NoSQL database such as AWS DynamoDB.
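To make the distinction concrete, here is a minimal Python sketch of label-based indexing. It is a toy model rather than Loki's actual implementation: the class name `LabelIndexedLogStore`, its methods and the sample labels are all hypothetical. The point it illustrates is that only the label set is used to locate a stream, while the log text itself is scanned at query time.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy store that indexes label sets only, never the log text (hypothetical)."""

    def __init__(self):
        # Index key: frozenset of (label, value) pairs.
        # Value: list of (timestamp, line) entries for that stream.
        self._streams = defaultdict(list)

    def push(self, labels, timestamp, line):
        """Append a log line to the stream identified by its labels."""
        self._streams[frozenset(labels.items())].append((timestamp, line))

    def query(self, selector, start, end, contains=""):
        """Pick streams by label, then filter lines by time and substring."""
        wanted = set(selector.items())
        results = []
        for stream_labels, entries in self._streams.items():
            if not wanted <= stream_labels:
                continue  # label mismatch: the stream is skipped without reading it
            for ts, line in entries:
                if start <= ts < end and contains in line:
                    results.append((ts, line))
        return sorted(results)
```

Because the index holds one entry per stream rather than one per word, it stays small even when log volume is high. The cost is that a query must supply labels and a time range to keep the scan manageable.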
Paytm Insider had run into performance problems using open source Elasticsearch to collect log data, as high volumes of log data overwhelmed Elasticsearch front-end database shards. Paytm Insider's IT staff struggled to scale Elasticsearch effectively under that load, and some logs got dropped during collection as a result. Since Grafana Loki indexes only metadata, Paytm Insider has been able to avoid these performance management challenges.
Full log data remains available for later access, but because it doesn't have to be kept on hand for indexing, Paytm Insider stores the log data it collects with Grafana Loki in tiers: first in StatefulSets running in live Kubernetes pods, then in Amazon S3, and finally in Amazon Glacier.
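A tiering policy of this kind can be expressed as a simple age-based rule. The Python sketch below is illustrative only: the function name `storage_tier` and the 7-day and 90-day cutoffs are hypothetical, since Paytm Insider did not disclose its retention thresholds.

```python
def storage_tier(age_days, hot_days=7, warm_days=90):
    """Return the tier a log chunk of the given age would live in.

    hot_days and warm_days are hypothetical cutoffs, not figures from
    Paytm Insider.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < hot_days:
        return "kubernetes"  # newest logs stay on the live cluster
    if age_days < warm_days:
        return "s3"          # older logs move to object storage
    return "glacier"         # oldest logs move to archival storage
```

With these assumed cutoffs, a day-old chunk stays on the cluster, a month-old chunk sits in S3, and a year-old chunk is archived in Glacier.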
"Grafana Loki [metadata] indexes are stored in DynamoDB ... and we have a multi-tiered system for storing logs," said Aayush Anand, also a DevOps engineer at Paytm Insider. "This simplifies the process [of log collection] and also helped us decrease the cost of storing the logs."
Grafana Loki best if paired with Prometheus
Paytm Insider was well-suited to become an early adopter of Grafana Loki because it already used Prometheus for Kubernetes monitoring metrics, which is important for effective troubleshooting using the tool.
Since Grafana Loki indexes only metadata, users who want to query log data for troubleshooting must first narrow down the time frame in which an issue occurred using Prometheus metrics, then check the Loki log data associated with that time frame to view more detailed information. This can be done through one interface, Grafana's Explore view. Previously, Paytm Insider engineers had to use two different platforms and UIs to correlate metrics and logs.
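The metrics-first workflow can be sketched in a few lines of Python. This is a hedged illustration, not Paytm Insider's tooling: the helper names `incident_window` and `loki_query_url`, the sample threshold and the host `loki.example` are hypothetical. The endpoint path and the `query`, `start`, `end` and `limit` parameters follow Loki's HTTP API, but they should be checked against the version in use.

```python
from urllib.parse import urlencode


def incident_window(samples, threshold, padding_s=60):
    """Return (start, end) in Unix seconds spanning samples above threshold.

    samples is a list of (timestamp_s, value) pairs, for example an error
    rate exported from Prometheus. Returns None if nothing crosses the line.
    """
    hot = [ts for ts, value in samples if value > threshold]
    if not hot:
        return None
    return min(hot) - padding_s, max(hot) + padding_s


def loki_query_url(base_url, logql, start_s, end_s, limit=1000):
    """Build a range-query URL for Loki; times are sent as Unix nanoseconds."""
    params = {
        "query": logql,
        "start": start_s * 1_000_000_000,
        "end": end_s * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"
```

The first function stands in for what an engineer does by eye on a metrics graph. The second turns the resulting window into a log query that is scoped by both label and time, which is the kind of query a label-only index handles well.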
Paytm Insider was also prepared to deal with Grafana Loki's early growing pains. The product reached version 1.0 in late November 2019, without support for parallelized queries. That feature was added in version 1.3, released Feb. 5, and Paytm Insider engineers said they looked forward to testing it, since it should speed up queries and make queries over large volumes of data more practical.
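Parallelizing a long-range query generally comes down to splitting its time range into fixed-size intervals and running the pieces concurrently. The Python sketch below assumes a four-hour interval and a thread pool. The names `split_range`, `parallel_query` and `run_subquery` are hypothetical, and the code illustrates the general technique rather than Loki's own query code.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta


def split_range(start, end, step=timedelta(hours=4)):
    """Split [start, end) into consecutive sub-ranges of at most `step`."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + step, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


def parallel_query(run_subquery, start, end, step=timedelta(hours=4), workers=8):
    """Run run_subquery(chunk_start, chunk_end) per chunk and merge results.

    run_subquery is a caller-supplied function (hypothetical here) that
    returns a list of log entries for one sub-range.
    """
    chunks = split_range(start, end, step)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in chunk order, so the merged list stays ordered.
        parts = pool.map(lambda chunk: run_subquery(*chunk), chunks)
        return [entry for part in parts for entry in part]
```

Under these assumptions, a seven-day query becomes 42 four-hour subqueries, each small enough to be served quickly by the backing store.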
"Now even if you make a query for 70 logs of seven days, it will divide the query into four-hour durations and then make the queries in parallel to DynamoDB," Anand said.
The overhauled Query Frontend released in version 1.3 will also offer better protection from potential DDoS attacks by organizing queries into limited per-tenant queues, according to a Grafana Labs blog post. Anand said he's looking forward to another feature still on the Grafana Loki roadmap, result caching, which should further improve query performance by holding the most recent query results in cache.
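The idea behind limited per-tenant queues is that one noisy tenant can fill only its own queue, not everyone's. The Python sketch below is a simplified model: the class name `TenantQueues`, the fixed per-tenant limit and the round-robin dequeue order are assumptions made for illustration, not details confirmed by the Grafana Labs post.

```python
from collections import OrderedDict, deque


class TenantQueues:
    """Bounded per-tenant query queues with round-robin dequeue (hypothetical)."""

    def __init__(self, max_per_tenant):
        self.max_per_tenant = max_per_tenant
        self._queues = OrderedDict()  # tenant -> deque of pending queries

    def enqueue(self, tenant, query):
        """Queue a query; return False if this tenant's queue is already full."""
        queue = self._queues.setdefault(tenant, deque())
        if len(queue) >= self.max_per_tenant:
            return False  # reject instead of letting one tenant crowd out others
        queue.append(query)
        return True

    def dequeue(self):
        """Return (tenant, query) from the next non-empty queue, or None."""
        for tenant in list(self._queues):
            queue = self._queues[tenant]
            self._queues.move_to_end(tenant)  # rotate so tenants take turns
            if queue:
                return tenant, queue.popleft()
        return None
```

In this model, a tenant that floods the system has its extra queries rejected at the door, while other tenants' queries continue to be served in turn.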
Grafana Loki optimizes log data collection for a particular use case, but log analytics vendors are also taking new approaches to streamline the cost of log data storage and retrieval. Sumo Logic and Splunk have also introduced tiered pricing for log data storage and query plans based on frequency of access. Elastic Inc. does not charge separately to index IT ops and security data behind its SIEM product.