Monitoring distributed systems means first, do no harm

SRE veterans who work with large, complex, latency-sensitive infrastructures say some monitoring tools can hurt the systems they're meant to help.

BROOKLYN, N.Y. -- The best approach to monitoring distributed systems combines Schrödinger's cat with the Hippocratic Oath: Don't let observation itself harm that which is observed.

Despite the importance of that principle, some commonly used tools for monitoring distributed systems, such as synthetic monitoring and log monitoring tools, can be counterproductive. They offer more detailed information than tools that rely on less invasive data collection methods, such as metrics- and event-based monitoring, but they also strain systems as they extract that data, said site reliability engineering experts at SREcon here this week.

One SRE discovered this the hard way at the worst possible time -- at an investment bank during the 2008 financial crisis.

"I was a member of the middleware team, and we owned the messaging and distributed transaction management infrastructure that provided the glue for all the distributed processing that was going on at the entire company," said Danny Chen, now a trading solutions SRE at Bloomberg, the global finance, media and tech company based in New York, in a presentation.

As stock trades increased during the market crisis, the bank's infrastructure suffered performance degradation that at first seemed like an application problem, Chen said. The culprit turned out to be a logging tool that was saturating the network -- and the root cause was a non-critical function that converted binary-encoded timestamps into human-readable form.
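The pattern Chen describes is easy to reproduce. As a minimal sketch -- assuming nothing about the bank's actual logging library, whose names and internals were not disclosed -- the fix amounts to recording raw timestamps on the hot path and deferring human-readable conversion to read time:

```python
import time

def log_eager(buf, message):
    # The costly pattern: every log call pays for strftime at write
    # time, whether or not anyone ever reads the line.
    ts = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
    buf.append(f"{ts} {message}")

def log_deferred(buf, message):
    # Record the raw binary timestamp; conversion to human-readable
    # form is deferred to read time, off the critical path.
    buf.append((time.time_ns(), message))

def render(entry):
    # Run only when a human actually reads the log.
    ts_ns, message = entry
    ts = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(ts_ns / 1e9))
    return f"{ts} {message}"

if __name__ == "__main__":
    buf = []
    log_deferred(buf, "order filled")
    print(render(buf[0]))
```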

Over a decade later, developers of low-level libraries used in logging tools remain unaware of their potential effects on latency, and there's widespread confusion about the different performance profiles of metrics-based monitoring and logging tools, Chen said.

Most Unix and Linux systems have long exposed hooks that allow less invasive access to system metrics. These metrics are less detailed than logs, but can be extracted more easily, and SREs must know when and how to use each type of tool, he said.
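On Linux, for example, many of those hooks live under /proc. Here is a sketch of the kind of cheap, metrics-style read Chen contrasts with full logging; the field layout follows the proc(5) man page:

```python
def read_loadavg(path="/proc/loadavg"):
    # /proc/loadavg is a kernel-maintained pseudo-file: one small read,
    # no log pipeline and no formatting on the system's hot path.
    with open(path) as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

if __name__ == "__main__":
    print("load averages (1/5/15 min):", read_loadavg())
```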

"I'm not saying don't log -- I'm saying log smarter," Chen said. "Understand what the costs are as you make decisions."

Chen participated in a performance management working group in the 1980s that sought to perfect access to Unix logs, but those efforts failed, he said. He urged the next generation of engineers not to wait for vendors to fix the problem. SREs should also test the effects of tools on distributed systems before a crisis hits, to understand how they might disrupt performance under heavy load.

USDS' knocks on the door threaten the whole house

At high scale, even less strenuous techniques for monitoring distributed systems can have unintended consequences, according to a presentation by Aaron Wieczorek, an SRE at the United States Digital Service (USDS) in Washington, D.C.

The USDS, a distributed incident response team for federal government websites, wants to learn about outages before they make headlines, as Healthcare.gov's high-profile problems did in 2013 -- the crisis that prompted the agency's creation. That meant USDS had to devise a system to monitor more than 25,000 .gov and .mil website endpoints, which it began to build in 2018 with open source time-series monitoring tools such as Prometheus, Grafana and InfluxDB.

These tools made simple HTTP requests of government websites to determine their availability and measure their performance, collected and displayed the response data, and issued alerts on the websites' status.
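A minimal sketch of that style of availability check, assuming a plain HTTP GET rather than USDS' actual probe configuration, and a hypothetical endpoint:

```python
import time
import urllib.request

def probe(url, timeout=10):
    # One HTTP GET per endpoint, recording status and latency -- the
    # "knock on the front door" style of check Wieczorek describes.
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception:
        status = None  # treat network errors and timeouts as down
    return {"url": url, "status": status,
            "latency_s": round(time.monotonic() - start, 3)}

if __name__ == "__main__":
    print(probe("https://example.gov"))  # hypothetical endpoint
```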

That approach was just a knock on the front door of the federal websites, Wieczorek said -- the tools don't test deeper site functions, such as logging in to an account or completing a transaction. The sheer number of endpoints involved, however, required the USDS to tread carefully to avoid causing issues for the sites.

"We got a scary email from Amazon," soon after the project to monitor distributed systems began, Wieczorek said. Owners of the sites pinged by USDS probes complained about their impact in AWS EC2 Abuse Reports. Enough of those reports could cause AWS to shut down the USDS account.

USDS scaled back its probe requests from every three to five minutes to 15- to 30-minute intervals, and added HAProxy servers to issue requests from more varied IP addresses so they looked less like a DDoS attack or other malicious action. But that ballooned the agency's AWS costs, to more than $700 a month for HAProxy instances alone.
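Stretching the interval only helps if probes are also spread out within it. A sketch of that kind of staggered schedule follows; the 15-minute figure comes from the talk, but the even spacing and shuffled order are assumptions:

```python
import random
import time
import urllib.request

def check(url, timeout=10):
    # Minimal availability check: HTTP status code, or None on failure.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return None

def run_rounds(urls, interval_s=15 * 60):
    # Spread one full round of checks evenly across the interval, in a
    # shuffled order, so no site ever sees a synchronized burst of
    # probes that resembles a DDoS.
    spacing = interval_s / max(len(urls), 1)
    while True:
        for url in random.sample(urls, len(urls)):
            check(url)
            time.sleep(spacing)
```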


Again, USDS redesigned the system, this time to run on AWS Lambda and Amazon API Gateway. That design was more complex to set up and made debugging trickier, but ran at 10% of the cost of the previous infrastructure.
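A sketch of what a serverless probe of that shape might look like, assuming a standard Python Lambda handler; the event fields and default endpoint are illustrative, not USDS' actual code:

```python
import json
import time
import urllib.request

def handler(event, context):
    # AWS Lambda entry point: one endpoint per invocation, so capacity
    # comes from concurrent invocations instead of always-on proxies.
    url = event.get("url", "https://example.gov")  # illustrative default
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except Exception:
        status = None
    body = {"url": url,
            "up": status is not None and status < 400,
            "latency_s": round(time.monotonic() - start, 3)}
    return {"statusCode": 200, "body": json.dumps(body)}
```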

"Proactive monitoring is better than seeing an outage in the Wall Street Journal or on Twitter," Wieczorek concluded. "But let's be nice to the systems we're trying to help."

Beware user imitations when monitoring distributed systems

The issues for USDS stemmed from the extremely high scale of its operation, but synthetic monitoring tools meant to mimic more complex user behavior can affect even smaller latency-sensitive infrastructures, SREcon attendees said.

"Synthetic monitoring tools are meant to hit a website the way a real user would, but they might be based on artificial patterns," said James Meickle, senior SRE at Quantopian, a fintech startup in Boston. He said he encountered these problems when he worked for a performance monitoring software vendor four years ago.

If users don't carefully evaluate the sequence of scripts a synthetic monitoring tool runs, and adapt it to their particular environment, the tests can seriously degrade the performance of the systems under test, Meickle said. He recalled an instance from his previous job when a synthetic monitoring test hit a set of servers with 50 times the traffic that actual users would generate.

"The test was over in five minutes," he said. "The servers just melted."
