mean time to innocence
What is mean time to innocence?
Mean time to innocence is the average time that elapses between the detection of a system problem and the point at which a given team can show that it, or its part of the system, is not the root cause. A related metric is mean time to blame (MTTB), which is the average time it takes to find the root cause of a problem, i.e., what is to blame for an outage or a sudden drop in application performance.
Mean time to innocence, or MTTI, is not an organization-wide metric; it is specific to a given team or part of the technology stack. Server administrators, for example, would seek a low MTTI so they can quickly show that the servers are not the cause of a major slowdown in a key client-facing system.
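Because the metric is just an average of elapsed times per team, it can be illustrated with a small calculation. The following is a minimal Python sketch, not a standard implementation: the incident records, the team names and the function name `mean_time_to_innocence` are all hypothetical, and the sketch assumes each incident records a detection time plus the time at which each team was ruled out.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (assumed data, for illustration only):
# when the problem was detected and when each team was able to show
# that its part of the stack was not the root cause.
incidents = [
    {
        "detected": datetime(2024, 1, 10, 9, 0),
        "cleared": {
            "servers": datetime(2024, 1, 10, 9, 20),
            "network": datetime(2024, 1, 10, 9, 50),
        },
    },
    {
        "detected": datetime(2024, 2, 3, 14, 0),
        "cleared": {
            "servers": datetime(2024, 2, 3, 14, 40),
        },
    },
]


def mean_time_to_innocence(incidents, team):
    """Return the average minutes from detection to the team being ruled out.

    Incidents in which the team was never ruled out are skipped.
    Returns None if the team was not ruled out in any incident.
    """
    durations = [
        (inc["cleared"][team] - inc["detected"]).total_seconds() / 60
        for inc in incidents
        if team in inc["cleared"]
    ]
    return mean(durations) if durations else None


servers_mtti = mean_time_to_innocence(incidents, "servers")  # 30.0 minutes
network_mtti = mean_time_to_innocence(incidents, "network")  # 50.0 minutes
```

The result differs per team for the same set of incidents, which is what makes mean time to innocence a team-level figure rather than an organization-wide one.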
What is mean time to identify?
The initialism MTTI is more commonly interpreted as mean time to identify, which is the average amount of time an organization takes to identify the root cause of an outage. Mean time to identify is not siloed by team since it's centered on the incident rather than on personal or group responsibility for it. This also helps keep the focus on fixing whatever problem led to the outage.
Mean time to identify usually pairs with another metric called mean time to resolve (MTTR), which is the average time elapsed from incident detection to the restoration of normal services. In modern architectures, it's possible to have a shorter MTTR than mean time to identify. In these cases, simply restarting a microservice or taking a poorly performing instance out of a production load-balancing set can restore service long before the reason for performance degradation can be worked out.
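To make the relationship between the two metrics concrete, here is a minimal Python sketch that computes both from the same incident list. The records, field names (`detected`, `restored`, `identified`) and the helper `mean_minutes` are assumptions made for illustration; both metrics are measured from detection, as described above.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incidents (assumed data). In each one, service was restored,
# for example by restarting a microservice, before the root cause was found.
first = datetime(2024, 3, 1, 8, 0)
second = datetime(2024, 3, 9, 16, 30)
incidents = [
    {
        "detected": first,
        "restored": first + timedelta(minutes=10),
        "identified": first + timedelta(minutes=90),
    },
    {
        "detected": second,
        "restored": second + timedelta(minutes=30),
        "identified": second + timedelta(minutes=60),
    },
]


def mean_minutes(incidents, end_field):
    """Average minutes from detection to the timestamp stored in end_field."""
    return mean(
        [(inc[end_field] - inc["detected"]).total_seconds() / 60 for inc in incidents]
    )


mttr = mean_minutes(incidents, "restored")                # 20.0 minutes
mean_time_to_identify = mean_minutes(incidents, "identified")  # 75.0 minutes
```

With these sample numbers, MTTR is shorter than mean time to identify because service came back before the cause of the degradation was worked out.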
How to avoid the blame game
The phrase mean time to innocence is common in the IT industry. However, even when it's used as a joke, it can be indicative of a bad IT culture.
Mean time to innocence and mean time to blame focus on assigning and avoiding blame rather than on the affected service and how to restore it, avoid repeats and minimize the effect on end users. Sometimes, this attitude propagates from the top down, with leadership content to assign blame and penalize those deemed responsible for a problem. Other times, it percolates from the bottom up, with team members and managers trying to avoid responsibility and, intentionally or accidentally, making it harder to find the root cause of a problem.
Teams focused solely on avoiding blame are distracted from the work of finding the causes of network problems, and they waste time and resources that could go toward faster resolution and service restoration. Blame shifting is a related term for trying to move the blame for a problem from one's own team to someone else's.
The best way to avoid a mean-time-to-innocence culture is to not play the blame game at all. Instead, focus on these steps:
- Restore services, and then fix any systemic or procedural issues that led to the incident.
- Prioritize and troubleshoot incidents affecting customer service.
- Avoid speaking and thinking in terms of blame, guilt and innocence.
Follow the blameless retrospective approach advocated as part of the DevOps methodology, or blameless post-mortems, as they're called in site reliability engineering circles. In these cases, no penalty is normally levied for simply making a mistake that caused a problem. Career-ending mistakes can still happen, but usually only after a pattern of carelessness or cluelessness rather than under a one-strike-and-you're-out approach.