AIOps platforms delve deeper into root cause analysis

AIOps tools offer fresh insights into the root cause of IT incidents, triangulating between application outages, code changes and infrastructure issues.

The promise of AIOps platforms for enterprise IT pros lies in their potential to provide automated root cause analysis, and early customers have begun to use these tools to speed up problem resolution.

The city of Las Vegas needed an IT monitoring tool to replace a legacy SolarWinds deployment in early 2018 and found FixStream's Meridian AIOps platform. The city introduced FixStream to its Oracle ERP and service-oriented architecture (SOA) environments as part of its smart city project, an initiative that will see municipal operations optimized with a combination of IoT sensors and software automation. Las Vegas is one of many U.S. cities working with AWS, IBM and other IT vendors on such projects.

FixStream's Meridian offers an overview of how business process performance corresponds to IT infrastructure, which matters more as the city's digital transformation brings more frequent, faster system updates, said Michael Sherwood, CIO for the city of Las Vegas.

"FixStream tells us where problems are and how to solve them, which takes the guesswork, finger-pointing and delays out of incident response," he said. "It's like having a new help desk department, but it's not made up of people."

The tool first analyzes a problem and offers insights as to the cause. It then automatically creates a ticket in the city's ServiceNow IT service management system. ServiceNow acquired DxContinuum in 2017 and released its intellectual property as part of a similar help desk automation feature, called Agent Intelligence, in January 2018. But it's the high-level business process view that sets FixStream apart from ServiceNow and other tools, Sherwood said.

FixStream's Meridian AIOps platform creates topology views that illustrate the connections between parts of the IT infrastructure and how they underpin applications, along with how those applications underpin business processes. This was a crucial level of detail when a credit card payment system crashed shortly after FixStream was introduced to monitor Oracle ERP and SOA this spring.

"Instead of telling us, 'You can't take credit cards through the website right now,' FixStream told us, 'This service on this Oracle ERP database is down,'" Sherwood said.

The system automatically correlated an application problem with problems in deeper layers of the IT infrastructure. The speedy diagnosis led to a fix that took the city's IT department a few hours rather than a day or two.
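The kind of correlation described above can be pictured as a walk down a dependency graph, from a failing business process to the deepest failing component beneath it. The following is a minimal sketch of that idea, not FixStream's implementation; the node names, the `DEPENDS_ON` map and the `STATUS` values are all hypothetical and chosen only to mirror the payment outage in the example.

```python
from collections import deque

# Hypothetical topology: each node lists the components it depends on.
DEPENDS_ON = {
    "online-payments": ["payment-web-app"],                 # business process
    "payment-web-app": ["soa-gateway", "erp-db-service"],   # application
    "soa-gateway": ["host-a"],                              # middleware
    "erp-db-service": ["host-b"],                           # database service
    "host-a": [],                                           # infrastructure
    "host-b": [],
}

# Hypothetical health states reported by monitoring.
STATUS = {
    "online-payments": "down",
    "payment-web-app": "down",
    "soa-gateway": "up",
    "erp-db-service": "down",
    "host-a": "up",
    "host-b": "up",
}


def root_causes(start, depends_on, status):
    """Return failing nodes under `start` that have no failing dependency."""
    causes = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if status.get(node) != "down":
            continue  # healthy nodes cannot be the cause
        down_deps = [d for d in depends_on.get(node, []) if status.get(d) == "down"]
        if not down_deps:
            # Nothing beneath this node is failing, so it is a candidate cause.
            causes.append(node)
        for dep in down_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return causes


print(root_causes("online-payments", DEPENDS_ON, STATUS))
```

In this toy topology, the walk starts at the failing payment process and stops at the database service, which is the lowest failing layer. A real platform would also have to discover the topology and weigh noisy or partial health signals, which this sketch does not attempt.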

AIOps platform connects IT to business performance

Some IT monitoring vendors associate application performance management (APM) data with business outcomes in a way similar to FixStream. AppDynamics, for example, offers Business iQ, which associates application performance with business performance metrics and end-user experience. Dynatrace offers end-user experience monitoring and automated root cause analysis based on AI.

The differences lie in the AIOps platforms' deployment architectures and infrastructure focus, said Nancy Gohring, an analyst with 451 Research who specializes in IT monitoring tools and wrote a white paper that analyzes FixStream's approach.

"Dynatrace and AppDynamics use an agent on every host that collects app-level information, including code-level details," Gohring said. "FixStream uses data collectors that are deployed once per data center, which means they are more similar to network performance monitoring tools that offer insights into network, storage and compute instead of application performance."

FixStream integrates with both Dynatrace and AppDynamics to join its infrastructure data to the APM data those vendors collect. Its strongest differentiation is in the way it digests all that data into easily readable reports for senior IT leaders, Gohring said.

"It ties business processes and SLAs [service-level agreements] to the performance of both apps and infrastructure," she said.

OverOps fuses IT monitoring data with code analysis

While FixStream makes connections between low-level infrastructure and overall business performance, another AIOps platform, made by OverOps, connects code changes to machine performance data. As a result, DevOps teams that frequently deploy custom applications can tell whether an incident stems from a code change or an infrastructure glitch.

OverOps' eponymous software has been available for more than a year, and larger companies, such as Intuit and Comcast, have recently adopted the software. OverOps identified the root cause of a problem with Comcast's Xfinity cable systems as related to fluctuations in remote-control batteries, said Tal Weiss, co-founder and CTO of OverOps, based in San Francisco.

OverOps uses an agent that can be deployed on containers, VMs or bare-metal servers, in public clouds or on premises. It monitors the Java Virtual Machine or Common Language Runtime interface for .NET apps. Each time code loads into the CPU via these interfaces, OverOps captures a data signature and compares it with code it's previously seen to detect changes.
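The article does not say how OverOps computes its signatures, but the general technique of fingerprinting code and comparing it with what was seen before can be sketched briefly. The example below assumes a SHA-256 hash over the loaded code's bytes; the `ChangeDetector` class and the unit name `PaymentService.charge` are hypothetical and used purely for illustration.

```python
import hashlib


def fingerprint(code_bytes: bytes) -> str:
    """Return a stable signature for a unit of code (assumed: SHA-256)."""
    return hashlib.sha256(code_bytes).hexdigest()


class ChangeDetector:
    """Remember the last signature seen for each code unit."""

    def __init__(self):
        self._seen = {}  # unit name -> last fingerprint

    def observe(self, name: str, code_bytes: bytes) -> str:
        """Record the code and report 'new', 'unchanged' or 'changed'."""
        sig = fingerprint(code_bytes)
        previous = self._seen.get(name)
        self._seen[name] = sig
        if previous is None:
            return "new"
        return "unchanged" if previous == sig else "changed"


detector = ChangeDetector()
print(detector.observe("PaymentService.charge", b"bytecode v1"))
print(detector.observe("PaymentService.charge", b"bytecode v1"))
print(detector.observe("PaymentService.charge", b"bytecode v2"))
```

The three calls report a first sighting, an identical reload and a changed version, in that order. Linking a detected change to the deployment or developer behind it would need additional metadata that this sketch leaves out.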

Figure: OverOps Grafana dashboard. OverOps exports reliability data to Grafana for visual display.

From there, the agent produces a stream of log-like files that contain both machine data and code information, such as the number of defects and the developer team responsible for a change. The tool is primarily intended to catch errors before they reach production, but it can be used to trace the root cause of production glitches, as well.

"If an IT ops or DevOps person sees a network failure, with one click, they can see if there were code changes that precipitated it, if there's an [Atlassian] Jira ticket associated with those changes and which developer to communicate with about the problem," Weiss said.

In August 2018, OverOps updated its AIOps platform to feed code analysis data into broader IT ops platforms with a RESTful API and support for StatsD. Available integrations include Splunk, ELK, Dynatrace and AppDynamics. In the same update, the OverOps Extensions feature also added a serverless AWS Lambda-based framework, as well as on-premises code options, so users can create custom functions and workflows based on OverOps data.
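For readers unfamiliar with StatsD, it is a simple text protocol in which each metric is a line of the form `name:value|type`, usually sent over UDP to port 8125. The sketch below shows that wire format in general terms; it is not OverOps' integration code, and the metric name `app.checkout.new_errors`, the host and the helper functions are hypothetical.

```python
import socket


def format_statsd(name: str, value, metric_type: str = "c") -> str:
    """Build a StatsD line, e.g. 'name:1|c' for a counter."""
    allowed = {"c", "g", "ms", "s"}  # counter, gauge, timer, set
    if metric_type not in allowed:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP (fire and forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric: count one newly detected error in a checkout service.
line = format_statsd("app.checkout.new_errors", 1, "c")
print(line)
# send_statsd(line)  # uncomment when a StatsD-compatible listener is available
```

Because the format is this small, any monitoring tool that accepts StatsD input can ingest such counters without a vendor-specific client.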

"There's been a platform vs. best-of-breed tool discussion forever, but the market is definitely moving toward platforms -- that's where the money is," Gohring said.
