Which data center KPIs are the most useful?
As admins evaluate data center performance, they need ways to look at both short-term and long-term data. KPIs offer one way to evaluate components, such as storage.
Data center admins must regularly gauge hardware and software performance to help them make decisions about upgrades and staffing. Key performance indicators (KPIs) give IT and business teams a consistent way to track data center health and monitor individual components, such as storage.
To ensure storage hardware is up to date, admins should track these three main data center KPIs.
Utilization. This shows the ratio of storage capacity used compared to total storage capacity. A low ratio indicates storage waste; relatively little of the available storage is being used. Business leaders usually limit new storage investments when the utilization KPI is low. A high ratio indicates the need for additional capacity and can help business leaders justify storage hardware purchases.
Availability. This is the ratio of measured storage uptime to planned or desired storage uptime. Availability is measured for major storage subsystems -- such as storage servers or storage arrays -- or for storage tiers when storage is pooled in software-defined environments.
A ratio close to 1.0 indicates that a particular storage resource is available for nearly all of its planned uptime. A declining ratio gives business leaders an early warning of storage problems that could affect workload availability, user satisfaction and business revenue.
Planned unavailability. This is the ratio of actual downtime to planned downtime. A ratio of 1.0 means the work performed on a storage resource took exactly the allotted time. A ratio of less than 1.0 indicates the actual downtime was shorter than predicted. If the ratio climbs above 1.0, the actual downtime was longer than expected.
Having a ratio higher than 1.0 could indicate staffing issues, equipment shortages, purchasing approval delays or prolonged service hours. Business leaders often use this data center KPI as a gauge of operational effectiveness and can do a root cause analysis if it is consistently high.
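The three ratios above are simple enough to compute directly from monitoring data. The following is a minimal sketch, not a standard implementation: the function names, the units (terabytes and hours) and the sample figures are all assumptions chosen for illustration.

```python
def utilization(used_tb: float, total_tb: float) -> float:
    """Ratio of storage capacity used to total storage capacity."""
    if total_tb <= 0:
        raise ValueError("total capacity must be positive")
    return used_tb / total_tb


def availability(measured_uptime_h: float, planned_uptime_h: float) -> float:
    """Ratio of measured storage uptime to planned (desired) uptime."""
    if planned_uptime_h <= 0:
        raise ValueError("planned uptime must be positive")
    return measured_uptime_h / planned_uptime_h


def planned_unavailability(actual_downtime_h: float, planned_downtime_h: float) -> float:
    """Ratio of actual downtime to planned downtime.

    1.0 means the work took the allotted time; above 1.0 means it overran.
    """
    if planned_downtime_h <= 0:
        raise ValueError("planned downtime must be positive")
    return actual_downtime_h / planned_downtime_h


# Hypothetical figures for one storage array over a 30-day month (720 hours).
print(f"Utilization:            {utilization(42, 60):.2f}")
print(f"Availability:           {availability(719, 720):.4f}")
print(f"Planned unavailability: {planned_unavailability(5, 4):.2f}")
```

In this made-up example, the array is 70% full, was up for 719 of 720 planned hours, and a maintenance window planned for four hours took five, which yields a planned unavailability ratio above 1.0.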
IT and business leaders might adopt more granular data center KPIs for storage, such as mean time between failures (MTBF) and mean time to repair (MTTR).
MTBF is the average time between equipment failures or issues that require a service ticket. With respect to storage, this is an average measure of a storage system's reliability. Any changes in the average over time can offer valuable insight into possible system problems that could demand a closer investigation or equipment evaluation.
MTTR is the average time between the start of an incident and its resolution. In most cases, MTTR is simply tracked over time and leaders look for changes in MTTR as an indirect measure of the average cost of repair, staff expertise and system reliability issues.
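Both averages can be derived from a log of incidents. The sketch below assumes each incident is recorded as a (start, resolved) pair of timestamps, and it measures MTBF as the mean gap between the starts of consecutive incidents. That is one simple reading of "time between failures"; some teams instead divide total operating time by the number of failures. The function names and sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mttr_hours(incidents):
    """Mean time, in hours, from the start of an incident to its resolution."""
    return mean(
        [(resolved - start).total_seconds() / 3600 for start, resolved in incidents]
    )


def mtbf_hours(incidents):
    """Mean time, in hours, between the starts of consecutive incidents.

    Assumption: the gap is measured start-to-start. Needs at least two incidents.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    starts = sorted(start for start, _ in incidents)
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(starts, starts[1:])
    ]
    return mean(gaps)


# Hypothetical service tickets for one storage system.
incidents = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 10, 0)),    # 2 hours to repair
    (datetime(2024, 1, 11, 8, 0), datetime(2024, 1, 11, 12, 0)),  # 4 hours to repair
    (datetime(2024, 1, 31, 8, 0), datetime(2024, 1, 31, 11, 0)),  # 3 hours to repair
]

print(f"MTTR: {mttr_hours(incidents):.1f} hours")
print(f"MTBF: {mtbf_hours(incidents):.1f} hours")
```

Recomputing these values over a rolling window, such as each quarter, makes it easier to spot the changes over time that the KPIs are meant to surface.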