What is mean time to detect (MTTD)?

Mean time to detect (MTTD) is a measure of how long a problem exists in an IT deployment before the appropriate parties become aware of it. MTTD is also known as mean time to discover or mean time to identify. MTTD is a common key performance indicator (KPI) for IT incident management. A shorter MTTD indicates that users suffer from IT disruptions for less time compared with a longer MTTD.

Problems can be detected by people, such as end users reporting an outage, or software, such as system monitoring and management tools. Generally, IT organizations strive to detect issues before end users do in order to minimize disruption, but this is not always possible.

MTTD is calculated as the sum of all incident detection times divided by the total number of incidents.

Affected IT equipment and the software programs running on it should record the onset of an issue. For example, IT teams could trace a security intrusion to a password entered into the breached system at a specific time. The MTTD KPI can indicate whether IT monitoring technologies collect sufficient data and cover the probable sources of incidents.

How to calculate MTTD

The formula for MTTD is the sum of all incident detection times for a given technician, team or time period divided by the total number of incidents. To gauge performance, IT teams can then compare the resulting MTTD with those for other time periods, other incident response teams and so on.
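Written symbolically, the formula above might look like the following. The notation is an assumption made for illustration and does not come from a standard: n is the number of incidents in scope (for a technician, team or time period), and each incident i has a start time and a detection time.

```latex
\mathrm{MTTD} = \frac{1}{n} \sum_{i=1}^{n} \left( t_i^{\mathrm{detect}} - t_i^{\mathrm{start}} \right)
```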

For example, say the 24/7 IT operations support team for internal applications at a national bank tracks its MTTD monthly. In August, the team experienced eight incidents, and it determined each incident's start and discovery time based on system logs, the organization's intrusion detection system and help desk tickets filed by users, as shown in Table 1.

Table 1. Example incident start and discovery times.
Start time	Detection time	Elapsed (min.)
2:35 a.m.	3:42 a.m.	67
4:13 p.m.	8:30 p.m.	257
1:10 p.m.	1:55 p.m.	45
1:43 p.m.	2:25 p.m.	42
8:05 a.m.	11:16 a.m.	191
3:15 p.m.	3:30 p.m.	15
9:28 a.m.	4:14 p.m.	406
10:09 p.m.	12:32 a.m.	143

For the example shown in Table 1, the MTTD is calculated as follows.

(67 + 257 + 45 + 42 + 191 + 15 + 406 + 143) / 8

MTTD = 145.75 minutes
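The same arithmetic can be sketched in Python. This is a minimal illustration, not a prescribed tool: the function names are hypothetical, and because Table 1 lists no dates, the sketch assumes that a detection time earlier than its start time falls on the following day.

```python
# (start, detection) pairs copied from Table 1.
INCIDENTS = [
    ("2:35 a.m.", "3:42 a.m."),
    ("4:13 p.m.", "8:30 p.m."),
    ("1:10 p.m.", "1:55 p.m."),
    ("1:43 p.m.", "2:25 p.m."),
    ("8:05 a.m.", "11:16 a.m."),
    ("3:15 p.m.", "3:30 p.m."),
    ("9:28 a.m.", "4:14 p.m."),
    ("10:09 p.m.", "12:32 a.m."),
]

MINUTES_PER_DAY = 24 * 60


def to_minutes(clock: str) -> int:
    """Convert a time such as '4:13 p.m.' to minutes after midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = (int(part) for part in time_part.split(":"))
    hours %= 12  # 12 a.m. and 12 p.m. both become 0 before the p.m. offset
    if meridiem.startswith("p"):
        hours += 12
    return hours * 60 + minutes


def elapsed_minutes(start: str, detected: str) -> int:
    """Minutes from start to detection, assuming at most one midnight crossing."""
    return (to_minutes(detected) - to_minutes(start)) % MINUTES_PER_DAY


def mean_time_to_detect(elapsed: list[int]) -> float:
    """Sum of detection times divided by the number of incidents."""
    return sum(elapsed) / len(elapsed)


elapsed = [elapsed_minutes(start, detected) for start, detected in INCIDENTS]
mttd = mean_time_to_detect(elapsed)
print(elapsed)
print(f"MTTD = {mttd:.2f} minutes")
```

In practice, incident records would normally carry full timestamps with dates, which removes the need for the midnight assumption.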

Some organizations might choose to remove outliers from the equation, as shown in Table 2. In this case, 406 minutes is the highest time to detect, and 15 minutes is the lowest. Without these outliers, the MTTD equals 124.17 minutes.

Table 2. Example incident times with outliers removed.
Start time	Detection time	Elapsed (min.)
2:35 a.m.	3:42 a.m.	67
4:13 p.m.	8:30 p.m.	257
1:10 p.m.	1:55 p.m.	45
1:43 p.m.	2:25 p.m.	42
8:05 a.m.	11:16 a.m.	191
~~3:15 p.m.~~	~~3:30 p.m.~~	~~15~~
~~9:28 a.m.~~	~~4:14 p.m.~~	~~406~~
10:09 p.m.	12:32 a.m.	143
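The outlier removal in Table 2 can be sketched as a simple trimmed mean. The helper name is hypothetical, and dropping exactly one lowest and one highest value is only the rule used in this example; organizations may define outliers differently.

```python
# Elapsed minutes from Table 1.
elapsed = [67, 257, 45, 42, 191, 15, 406, 143]


def trimmed_mttd(values: list[int]) -> float:
    """Mean after discarding the single lowest and single highest value."""
    if len(values) < 3:
        raise ValueError("need at least three incidents to trim both ends")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


print(f"MTTD without outliers = {trimmed_mttd(elapsed):.2f} minutes")
```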

Organizations can also tier incidents by severity, as shown in Table 3. For example, an organization might decide that a decrease in MTTD for security problems is more important than a decrease in MTTD for minor performance issues. In this example, the MTTD for the most severe problems -- 42.33 minutes -- is significantly lower than the overall MTTD.

Table 3. Example incident times ranked by severity.
Start time	Detection time	Elapsed (min.)	Severity
2:35 a.m.	3:42 a.m.	67	High
4:13 p.m.	8:30 p.m.	257	Low
1:10 p.m.	1:55 p.m.	45	High
1:43 p.m.	2:25 p.m.	42	Medium
8:05 a.m.	11:16 a.m.	191	Medium
3:15 p.m.	3:30 p.m.	15	High
9:28 a.m.	4:14 p.m.	406	Low
10:09 p.m.	12:32 a.m.	143	Low
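A per-severity breakdown of Table 3 can be sketched as follows. The function name and data layout are assumptions made for illustration; the elapsed times and severity labels come from the table.

```python
from collections import defaultdict

# (elapsed minutes, severity) pairs copied from Table 3.
INCIDENTS = [
    (67, "High"),
    (257, "Low"),
    (45, "High"),
    (42, "Medium"),
    (191, "Medium"),
    (15, "High"),
    (406, "Low"),
    (143, "Low"),
]


def mttd_by_severity(incidents):
    """Group elapsed times by severity and average each group."""
    buckets = defaultdict(list)
    for minutes, severity in incidents:
        buckets[severity].append(minutes)
    return {sev: sum(vals) / len(vals) for sev, vals in buckets.items()}


result = mttd_by_severity(INCIDENTS)
for severity in ("High", "Medium", "Low"):
    print(f"{severity}: {result[severity]:.2f} minutes")
```

With the Table 3 data, the high-severity tier averages about 42.33 minutes, the medium tier 116.50 minutes and the low tier about 268.67 minutes.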

If the MTTD is lower for August than for July and June, the IT team might observe a trend of faster problem discovery. However, individual organizations set different thresholds for what constitutes a significant change, and assessing improvement in incident response must incorporate other metrics.

Related IT incident management metrics

MTTD is one of several metrics used to gauge the efficiency and efficacy of IT incident response. Other common metrics include the following:

Mean time to repair or restore (MTTR). How long it takes to fix a problem once detected.
Mean time between failures (MTBF). How long the IT deployment goes without a performance degradation or outage.
First-time resolution rate (FTRR). The percentage of incidents resolved without requiring follow-up, used as a measure of how effectively the team troubleshoots a problem.
Downtime. The percentage of time that systems are not operational over a given time period, such as 0.1% downtime per year, which corresponds to 99.9% availability.
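To give a sense of scale for the downtime metric, the sketch below converts a downtime percentage into hours per year. The function name and the sample percentages are illustrative assumptions, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(downtime_percent: float) -> float:
    """Convert a yearly downtime percentage into hours of downtime."""
    return HOURS_PER_YEAR * downtime_percent / 100


# Illustrative percentages only.
for percent in (1.0, 0.1, 0.01):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% downtime is about {hours:.2f} hours per year")
```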

IT organizations use MTTD to gauge the effectiveness of monitoring and management systems, as well as the communication routes from users -- either internal or external customers -- to the troubleshooting parties. Changes in MTTD or MTBF can indicate the effects of implementing a new tool or approach.

Together, metrics such as FTRR and MTTR can help assess a response team's troubleshooting skills and IT management capabilities. Similarly, MTTD and MTTR combined show the overall timeline of incident response.

The relationship between MTTD and MTTR: MTTD shows the time from failure to detection, whereas MTTR shows the time from detection to resolution. Together, MTTD and MTTR cover the full timeline of a failure or incident.
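The timeline can be sketched for a single incident. The start and detection times below match the first row of Table 1, but the calendar date and the resolution time are invented for illustration.

```python
from datetime import datetime

# Hypothetical incident: the year, the day and the resolution time are assumptions.
failure = datetime(2023, 8, 1, 2, 35)
detected = datetime(2023, 8, 1, 3, 42)
resolved = datetime(2023, 8, 1, 5, 12)

time_to_detect = detected - failure   # contributes to MTTD
time_to_repair = resolved - detected  # contributes to MTTR
total_duration = resolved - failure   # full incident timeline


def as_minutes(delta) -> int:
    """Express a timedelta as whole minutes."""
    return int(delta.total_seconds() // 60)


print(f"Time to detect: {as_minutes(time_to_detect)} min")
print(f"Time to repair: {as_minutes(time_to_repair)} min")
print(f"Total incident: {as_minutes(total_duration)} min")
```

For any one incident, the two intervals add up to the total duration, which is why the two averages together describe the overall response timeline.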

MTTD does not reflect either the security threat level to the deployment or its resiliency. For example, an organization might track the number of incidents in a given time period to determine the extent of its IT deployment's exposure to attack or failure, regardless of how quickly these incidents are discovered and resolved.
