Overcome these challenges to detect anomalies in IT monitoring
Faster and more accurate anomaly detection is a major benefit of machine learning in IT systems monitoring -- but it's not something that enterprises can achieve overnight.
Anomaly detection can drastically change -- and improve -- the way IT teams monitor system performance. In particular, it can reduce analyst fatigue, eliminate false alarms and more quickly uncover events that are truly significant within an IT environment.
In previous articles, we discussed the benefits of applying anomaly detection -- as a function of machine learning -- to log analysis, and to enable time-series monitoring. Here, we dive into the challenges an organization must overcome to successfully detect anomalies during IT monitoring, in general. These challenges range from data formatting issues to staffing and skill set limitations.
Benefits of anomaly detection in IT system monitoring
Before reviewing the main challenges of anomaly detection in IT monitoring -- and how to address them -- let's recap the benefits of the technology.
Traditional IT performance monitoring dashboards plot a metric, such as CPU usage, over time. This offers little insight into what is going on with an application as a whole. What's more, traditional dashboards are not trend-aware; they don't take into consideration the normal ebb and flow of business. As a result, they flag events as problems that are not problems at all.
Anomaly detection uses applied mathematics and machine learning to flag only those events that are statistically significant. This addresses the signal-to-noise problem of the traditional approach outlined above. In other words, anomaly detection uses artificial intelligence to replace the practice of setting thresholds manually.
For example, IT teams can detect anomalies and potential issues with a web page based on end-user click patterns, or identify outliers in a traffic routing log that indicate events in need of investigation.
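To make the contrast concrete, here is a minimal sketch in Python; the metric values, window size and three-standard-deviation cutoff are illustrative assumptions, not a prescription. It compares a fixed threshold with a simple statistical baseline:

```python
import pandas as pd

# Hypothetical per-minute CPU readings; the values are illustrative only.
cpu = pd.Series([22, 25, 24, 23, 26, 24, 25, 71, 24, 23], name="cpu_pct")

# Static threshold: nothing below 80% ever fires, so the jump to 71% is missed.
static_alerts = cpu[cpu > 80]

# Statistical baseline: compare each reading to the rolling mean and standard
# deviation of the preceding window, so the normal ebb and flow sets the bar.
baseline_mean = cpu.rolling(window=5, min_periods=3).mean().shift(1)
baseline_std = cpu.rolling(window=5, min_periods=3).std().shift(1)
z_scores = (cpu - baseline_mean) / baseline_std

anomalies = cpu[z_scores.abs() > 3]
print(static_alerts)  # empty
print(anomalies)      # flags the jump to 71%
```

The fixed threshold never fires because the spike stays below 80%, while the rolling statistics flag it as unusual relative to recent behavior.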
Top data, staffing challenges
To reach the utopia of anomaly detection in IT monitoring, organizations must overcome three major obstacles.
The first -- and one that's true of machine learning in general -- is that enterprise data must conform to a specific format (as outlined further in the next section). In particular, IT teams must convert every data field into a number. This isn't too significant a hurdle, considering that any string of text, such as an IP address, can be converted easily into a number.
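For instance, a short Python sketch shows two common ways to turn non-numeric fields into numbers; the field values here are made up for illustration:

```python
import ipaddress

# An IPv4 address maps directly to a 32-bit integer.
ip_as_number = int(ipaddress.ip_address("192.168.10.45"))  # 3232238125

# Other strings, such as status values or host names, can be mapped to numbers
# with a simple lookup table (or a hashing/encoding scheme for open-ended sets).
statuses = ["OK", "WARN", "CRITICAL"]
status_to_number = {value: index for index, value in enumerate(statuses)}

print(ip_as_number, status_to_number["CRITICAL"])  # 3232238125 2
```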
Second, teams must collect data from disparate systems and compile it into a central file or source. A log from a single resource, such as a database or server, is not sufficient to provide an end-to-end view of an application, such as an inventory system. And it's not feasible to simply join different files on an outage timestamp, because system outages build over time as events begin to cascade.
To start, build a dataset that tracks system outages and the events that led up to them. Experiment with just one or two metrics and ask programmers to write code to classify that data, and then feed the results into whatever graphics and log analysis system you use.
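A minimal sketch of that starting point might look like the following; the metric names, timestamps and 15-minute "approaching outage" window are assumptions to be replaced with your own observations:

```python
import pandas as pd

# Hypothetical per-minute metrics pulled from monitoring; names are illustrative.
metrics = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-05-01 09:30", "2024-05-01 09:58", "2024-05-01 09:59",
        "2024-05-01 10:00", "2024-05-01 10:01",
    ]),
    "cpu_pct": [30, 35, 62, 91, 97],
    "latency_ms": [110, 120, 180, 450, 900],
})

# Outage windows your team observed and recorded (assumed example values).
outages = [(pd.Timestamp("2024-05-01 10:00"), pd.Timestamp("2024-05-01 10:30"))]

def classify(ts):
    """Label a sample as outage, approaching outage (within 15 minutes) or normal."""
    for start, end in outages:
        if start <= ts <= end:
            return "outage"
        if start - pd.Timedelta(minutes=15) <= ts < start:
            return "approaching outage"
    return "normal"

metrics["label"] = metrics["timestamp"].apply(classify)
metrics.to_csv("training_data.csv", index=False)  # hand off to charting/log analysis
```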
The third challenge is to find those programmers or engineers with the right skill sets in applied mathematics. They need to know linear algebra, calculus and statistics -- topics to which most IT professionals were exposed in college, but likely haven't revisited since.
Even with an AI-driven IT monitoring tool in place, skilled programmers are a must. This is because these tools don't follow a one-size-fits-all approach; each business has its own unique mix of systems, applications and requirements. As a result, it is necessary to write custom code.
Even cloud-based machine learning tools such as Amazon SageMaker don't completely eliminate the need for programmers, who must prepare data to be fed into machine learning models.
A deeper look at the data challenge
IT teams can identify and react to anomalies using unclassified data, but they'll require classified data to forecast and predict future IT system issues and trends. Classified data, however, is more difficult to obtain -- especially for performance monitoring and cybersecurity.
With classified data, some criteria are correlated with some outcome; more specifically, inputs, called features, are associated with outputs, called labels. This is called [features, label] data and is the format required for most machine learning. For performance monitoring, an example set of features could be:
([time of year, time of day, CPU % used, end-to-end latency in the app, length of MQSeries queue, etc.])
The labels could then include:
- normal
- outage
- approaching outage
- need additional resources
Without these labels, IT teams cannot build a predictive model; they can only make observations as to what is happening now.
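As a rough illustration of the [features, label] format -- the values and column names below are invented for the example -- the data handed to a model ends up looking like this:

```python
import pandas as pd

# One row per observation: numeric features plus the label describing the outcome.
data = pd.DataFrame(
    [
        # month, hour, cpu_pct, latency_ms, queue_len, label
        [5, 9, 35, 120, 12, "normal"],
        [5, 10, 91, 450, 240, "approaching outage"],
        [5, 10, 97, 900, 610, "outage"],
        [11, 14, 88, 300, 95, "need additional resources"],
    ],
    columns=["month", "hour", "cpu_pct", "latency_ms", "queue_len", "label"],
)

X = data.drop(columns=["label"])  # features fed to the model
y = data["label"]                 # labels -- without them, only observation is possible
```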
A human touch
Think of how Google predicts, and even suggests, the words you type into its search bar. That model works well because Google has gathered millions, if not billions, of data records. The model also improves over time, as machine learning is self-correcting.
But to get this data, Google, along with other cloud giants with machine learning-based services, such as Amazon, hires humans. Programs like Amazon Mechanical Turk recruit freelancers for data labeling: a worker, for example, might review a set of images and categorize them as flowers or fish. These labels ultimately help train machine learning models, including models that detect anomalies.
To assign labels to features requires a great deal of manual effort. Teams must observe system outages and how they are reflected in their logs. Downloading public firewall logs from the internet, which is one approach to train machine learning models, won't work, as they are missing the most important piece: the labels.
You could parse your organization's ticketing system to extract labels, but this requires high volumes of accurate records. The more data your organization has, the more accurate the model will be. Figure out how to best mine your organization's ticket support system. Dump the data from IBM Rational, Slack or whatever system is in use, and insist that IT staff are consistent in how they document support issues.
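What that mining looks like depends entirely on the ticketing system in use, but as a rough sketch -- the file name, column names and category values below are assumptions -- the idea is to turn consistently documented tickets into labeled time windows:

```python
import pandas as pd

# Hypothetical ticket export; adjust the file and column names to match
# whatever your ticketing system actually produces.
tickets = pd.read_csv("ticket_export.csv", parse_dates=["opened_at", "closed_at"])

# Keep only tickets that were documented consistently enough to yield a label.
category_to_label = {
    "outage": "outage",
    "capacity": "need additional resources",
}
tickets["label"] = tickets["category"].str.lower().map(category_to_label)
incidents = tickets.dropna(subset=["label"])

# Each (opened_at, closed_at, label) window can then be joined against metric
# timestamps to label the monitoring data, as in the earlier sketch.
print(incidents[["opened_at", "closed_at", "label"]].head())
```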
Where to store your data
To process logs, you can read simple text files into a Python program. However, the results will not persist -- meaning they'll disappear when the program stops -- and you cannot train models on large data sets that way.
Use a distributed file system and resource manager, such as Mesos or YARN, to scale programs and persist data. This is important, as you'll want to keep historical data in perpetuity, and this leads to large data sets.
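As a minimal sketch of that setup -- the HDFS paths are placeholders -- a Spark job running on YARN or Mesos can read raw logs and persist them in a columnar format for later training runs:

```python
from pyspark.sql import SparkSession

# Read raw log lines with Spark (which can run on YARN or Mesos) and persist
# them as Parquet on a distributed file system. Paths are placeholders.
spark = SparkSession.builder.appName("log-ingest").getOrCreate()

logs = spark.read.text("hdfs:///logs/app/*.log")           # one row per log line
logs.write.mode("append").parquet("hdfs:///warehouse/app_logs")

# Later runs can reload the full history, so nothing is lost when a script exits.
historical = spark.read.parquet("hdfs:///warehouse/app_logs")
print(historical.count())
```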
Machine learning frameworks to detect anomalies
To actually program anomaly detection capabilities, pick an algorithm, such as K-means clustering, and then pick a machine learning framework.
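As a rough sketch of that combination -- using scikit-learn's KMeans, with made-up metric values and a simple distance-based cutoff -- anomaly detection can flag observations that sit far from every cluster learned from historical data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Historical (cpu_pct, latency_ms) samples; values are invented for illustration.
# In practice you would scale the features first, since latency dominates here.
history = np.array([
    [30, 110], [32, 120], [28, 115], [31, 118],   # quiet period
    [70, 240], [68, 250], [72, 235], [69, 245],   # busy but normal period
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(history)

# Distance from each training point to its nearest cluster center sets the bar.
train_distances = kmeans.transform(history).min(axis=1)
threshold = train_distances.mean() + 3 * train_distances.std()

# Score new observations: anything far from every learned center is anomalous.
new_samples = np.array([[31, 117], [95, 900]], dtype=float)
new_distances = kmeans.transform(new_samples).min(axis=1)
print(new_samples[new_distances > threshold])  # flags [95, 900]
```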
Below is a list of some machine learning frameworks and charting programs, along with some of their benefits and limitations.
Apache Spark ML. A challenge with Spark ML is that it works with its own data frames -- data structures organized by columns -- rather than Pandas dataframes, which most other machine learning libraries use. This means that if an organization adopts Apache Spark ML but wants to use some of the other charting libraries and frameworks mentioned below, it has to convert the Spark dataframes into Pandas dataframes. However, Spark offers greater scale than Pandas and is a relatively easy-to-use machine learning framework.
Keras. This framework offers one of the simplest possible ways to build neural networks. It sits atop TensorFlow, which enables it to scale immensely with minimal complexity.
Scikit-learn. This machine learning framework is also comparatively easy to use. Users can feed almost any kind of data into scikit-learn, and it does the heavy lifting to convert that data into the proper format for the selected algorithm.
Amazon SageMaker. The idea behind this cloud-based machine learning service, and similar ones from Azure and Google, is that it eliminates the need for programming. However, that's not entirely true, as users still need to put data into the proper format. To use SageMaker, upload data to Amazon S3 storage and then run classification or regression models against it. The service will automatically find correlations and perform classification.
Matplotlib. This is one of the most widely used charting libraries.
Seaborn. Seaborn is a data visualization library built atop Matplotlib. Data scientists use it to find correlations between data points.
Plotly. This is another popular charting library, with interactive features.