Key SOC metrics and KPIs: How to define and use them
Enterprises struggle to get the most out of their security operations centers. Using the proper SOC metrics and KPIs can help. Learn how to define and benefit from them here.
To some, metrics are the holy grail of infosec. Being able to monitor, measure, analyze and communicate the security state of an enterprise can be powerful.
One area where metrics are extremely important is in the security operations center (SOC). Yet, despite their importance, many SOC teams struggle to define the specific metric needs of their organization's security program, as well as use those metrics to improve their company's security posture.
Why SOCs need metrics
Once enterprises realized log data gleaned from their IT infrastructure was insufficient, network operations centers advanced, and dedicated SOCs formed.
SOCs handle many different functions, from managing and maintaining security tools to detecting and analyzing threats, responding to incidents, verifying security compliance and other security administration tasks.
Many SOCs evolved from the enterprise SIEM tools used to monitor corporate infrastructure and from the administrators who set up and managed cybersecurity platforms. While many businesses choose to operate a SOC in-house, managed security service providers and cloud services companies also provide SOC functions to enterprises or supplement an enterprise's internal resources.
As SOCs matured, analytics and decision support processes were added. This analysis creates additional cybersecurity value for enterprises and provides insight into how effectively security resources are used.
Metrics and their supporting definitions are more important than ever, as is using those metrics to make changes and monitor enterprise environments. Even defining what constitutes an incident is important, as not all security incidents have the same effect or require the same response.
Developing SOC metrics: Getting started
There are several resources enterprises can use to learn more about security metrics, starting with the seminal book Security Metrics: Replacing Fear, Uncertainty, and Doubt by Andrew Jaquith. The Center for Internet Security also provides guidance on security metrics. The SANS Institute offers several papers related to SOC metrics, and NIST hosts the National Vulnerability Database, which provides metrics for tracked vulnerabilities.
When developing SOC metrics, security operations teams should first identify the highest-value processes and the areas that consume the most resources, since these are where metrics and management attention are most needed. This should be part of continuous improvement and shouldn't be limited by how the enterprise SIEM or any particular tool is licensed or can be used.
With an outsourced SOC, it is critical to set these metrics upfront and include them in a contract to ensure the SOC can generate the data and support the required metrics.
Examples of key SOC metrics and KPIs
While some SOC metrics and KPIs may be tailored to a specific organization, there are some common SOC metrics used across the security industry. These include the following:
- operational health of infrastructure components;
- number of cybersecurity tickets/incidents;
- severity of cybersecurity tickets/incidents;
- time to cybersecurity threat detection, referred to as mean time to detect or discover;
- time to cybersecurity threat response;
- time to cybersecurity threat containment;
- time to cybersecurity threat resolution;
- mean time to recovery or repair;
- threats detected by cybersecurity tools over time;
- global threat intelligence numbers;
- user/group access levels to data/apps;
- onboarding/offboarding numbers;
- assessment of false positives and true positives; and
- state of regulatory compliance.
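Several of the metrics above can be derived from the same set of incident records. The following Python sketch is one minimal way to do that, assuming each ticket can be exported with a severity, an estimated start time, a detection time and an analyst verdict. The Incident class, its field names and the summarize function are hypothetical and would need to be mapped to whatever the ticketing or SIEM tool actually exports.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical fields; map them to the real ticket export.
    severity: str
    occurred_at: datetime   # best estimate of when the activity began
    detected_at: datetime   # when a tool or analyst first flagged it
    true_positive: bool     # analyst verdict after triage


def summarize(incidents):
    """Return ticket counts by severity, mean time to detect in minutes
    (confirmed incidents only) and the share of tickets that were true positives."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    confirmed = [i for i in incidents if i.true_positive]
    detect_minutes = [
        (i.detected_at - i.occurred_at).total_seconds() / 60 for i in confirmed
    ]
    return {
        "by_severity": dict(Counter(i.severity for i in incidents)),
        "mttd_minutes": mean(detect_minutes) if detect_minutes else None,
        "true_positive_share": len(confirmed) / len(incidents),
    }
```

False positives are excluded from the mean time to detect here because they have no real start time. That is a design choice for this sketch, not an industry rule.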
How SOC metrics improve security posture
An enterprise's specific metric needs will vary depending on the tools used, the size of its infrastructure and the capabilities of its security department. However, even basic security controls can be mapped to the tools and processes to identify the potential metrics to use, as shown in the following examples.
Monitoring and baselining firewall alerts or failed login attempts makes a sudden increase in these numbers stand out. A spike can indicate that malicious activity is occurring. Modern cybersecurity tools often include AI to assist with data analysis over time. The AI can set statistical thresholds, and when numbers exceed those thresholds, the AI will notify security administrators and may even identify the root cause of the security incident.
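The idea of a statistical threshold can be illustrated with a simple rule: flag a count that sits more than a chosen number of standard deviations above its historical mean. The Python sketch below assumes daily failed-login counts are already available as a list. The function name and the default multiplier of 3 are illustrative assumptions, and commercial tools typically use more sophisticated models than this.

```python
from statistics import mean, stdev


def exceeds_baseline(history, current, k=3.0):
    """Return True when `current` is more than k sample standard deviations
    above the mean of the baseline counts in `history`."""
    if len(history) < 2:
        raise ValueError("need at least two baseline observations")
    threshold = mean(history) + k * stdev(history)
    return current > threshold
```

For example, with a baseline of [100, 110, 90, 105, 95] failed logins per day, a day with 120 failed logins stays under the threshold, while a day with 400 is flagged.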
What is included in or excluded from data collection and monitoring must be well defined. For example, if an enterprise has cloud-deployed servers, apps and data -- and if those cloud resources are not monitored by the SOC -- then the metrics may not reflect all the IT services in an environment. Thus, it's critical the SOC be properly scoped to monitor and analyze everything deemed business-critical to the organization.
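One way to make scope measurable is to compare the list of business-critical assets against the list of assets that actually send data to the SOC. The Python sketch below assumes both lists are available as asset identifiers, for example from an asset inventory and from the SIEM's list of log sources. The function name and the asset names in the usage note are hypothetical.

```python
def monitoring_coverage(critical_assets, monitored_assets):
    """Return the share of business-critical assets the SOC monitors,
    plus a sorted list of the critical assets that are not monitored."""
    critical = set(critical_assets)
    if not critical:
        raise ValueError("critical asset list is empty")
    monitored = critical & set(monitored_assets)
    unmonitored = sorted(critical - monitored)
    return len(monitored) / len(critical), unmonitored
```

Given critical assets erp-db, web-01 and cloud-app, of which only the first two report to the SOC, the function returns a coverage of two-thirds and lists cloud-app as the gap.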
Basic SOC metrics should focus on data, such as security alerts pulled from network or server equipment, forensics from incident investigations, and response and compliance reporting coming from SIEM and SOAR tools. Each of these categories can be broken down into more detailed metrics, as well as the corresponding data sources or tools used to generate the metrics in the SOC.
Many security operations teams monitor endpoint security tool logs and respond when high-risk malware is detected, so a metric could be built around the identification of an incident and the steps and time taken to remediate the issue. Data generated in this process can be used to determine KPIs, such as the cost of incident response -- in resources, as well as in direct financial terms -- and how effective the response is.
A security team should also track elapsed times at different stages in the incident response process -- starting from when the alert is generated by a security tool to when an analyst begins investigating and, finally, when the analyst or AI-based network detection and response tool resolves the problem. Measuring and baselining each of these steps can be useful to evaluate how effective current remediation processes are -- and to potentially identify where changes could streamline certain tasks. Of course, the process of collecting and analyzing SOC-related data becomes more complicated when multiple systems are involved in a single incident or when sensitive data is involved. Having an analyst or AI-based monitoring tool validate when an incident began should be part of checking the effectiveness of an endpoint security tool.
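The stage-by-stage timing described above can be sketched in Python as follows. The example assumes each incident has a timestamp for three stages: alert generated, investigation started and problem resolved. The stage names, the timeline dictionary layout and both function names are hypothetical, and the median is used as the baseline here only because it is less sensitive to a single unusually long incident.

```python
from datetime import datetime
from statistics import median

# Hypothetical stage names, in the order they occur.
STAGES = ("alerted", "investigation_started", "resolved")


def stage_durations(timeline):
    """Return minutes elapsed between consecutive stages of one incident.
    `timeline` maps each stage name to a datetime."""
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        delta = timeline[end] - timeline[start]
        durations[f"{start}_to_{end}"] = delta.total_seconds() / 60
    return durations


def baseline(timelines):
    """Return the median minutes per stage transition across incidents."""
    per_incident = [stage_durations(t) for t in timelines]
    if not per_incident:
        raise ValueError("no incident timelines supplied")
    return {key: median(d[key] for d in per_incident) for key in per_incident[0]}
```

Comparing a new incident's durations against the baseline shows which stage, triage or remediation, is slowing the response.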
Measuring the effectiveness of various security tools can help determine which ones are most effective -- and which need to be replaced with modern alternatives. For example, the security team should consider how long it takes to determine if something malicious happened, if the tool can capture the start of an incident and the detection time, or if a different tool detected the incident. These factors could signal a need to study the tool's effectiveness, configuration or end-of-life status to ensure the overall protection of the enterprise.
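A rough comparison of tools can be built from detection records alone. The Python sketch below assumes each record names the incident, the tool that raised a detection and the minutes between the validated start of the incident and that detection. The record layout, the tool names in the usage note and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def detection_by_tool(detections):
    """Summarize detections per tool.
    `detections` is an iterable of (incident_id, tool, minutes_to_detect).
    Returns, per tool, its mean detection delay and how many incidents
    it was the first tool to detect."""
    delays = defaultdict(list)
    first = {}  # incident_id -> (tool, minutes) of the earliest detection
    for incident_id, tool, minutes in detections:
        delays[tool].append(minutes)
        if incident_id not in first or minutes < first[incident_id][1]:
            first[incident_id] = (tool, minutes)
    first_counts = defaultdict(int)
    for tool, _ in first.values():
        first_counts[tool] += 1
    return {
        tool: {
            "mean_delay_minutes": mean(values),
            "first_detections": first_counts[tool],
        }
        for tool, values in delays.items()
    }
```

A tool that is rarely first to detect, or whose mean delay keeps growing, is a candidate for the configuration or end-of-life review described above.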