MTTD (Mean Time to Detect)

MTTD (Mean Time to Detect) is the average time between when a production issue actually begins and when monitoring or alerting systems first detect it. It is one of the foundational metrics for measuring reliability program effectiveness.

MTTD is often the largest hidden component of total incident duration. An incident's "official" clock typically starts when an alert fires and a ticket gets created, but the user-visible impact began earlier: sometimes minutes, sometimes hours. A 30-minute MTTD doesn't mean the system was healthy for those 30 minutes; it means the system was already failing and the team didn't yet know.
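As a rough illustration of the definition, MTTD can be computed by averaging the gap between when impact began and when it was first detected. The sketch below is a minimal example, not a prescribed method: the incident timestamps are made up, and the function name `mean_time_to_detect` is hypothetical. In practice the hard part is establishing the true impact start time, which usually has to be reconstructed after the fact.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (impact_began, first_detected).
# These timestamps are invented for illustration only.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),     # 30 min undetected
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 10)),   # 10 min undetected
    (datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 20, 22, 50)), # 50 min undetected
]


def mean_time_to_detect(records):
    """Return the average gap between impact start and first detection."""
    delays = [(detected - began).total_seconds() for began, detected in records]
    return timedelta(seconds=mean(delays))


mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:30:00
```

Note that the result depends entirely on using the impact start, not the ticket creation time, as the first timestamp; measuring from ticket creation would report an MTTD of zero by construction.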

Reducing MTTD compounds the effect of MTTR improvements. The clock on user-visible impact starts when the failure begins, not when the incident is created. A team that cuts diagnosis time from 30 minutes to 5 saves 25 minutes per incident, but customers get the full benefit only if detection happened on time. If detection was 20 minutes late, total impact is still 25 minutes of bad customer experience: 20 minutes undetected plus 5 minutes of diagnosis. MTTD improvements typically come from better instrumentation, smarter alerting (especially on user-impacting signals rather than just infrastructure metrics), and synthetic monitoring that detects symptoms before customers notice them.
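The arithmetic in this example can be made explicit with a small sketch. It assumes, for simplicity, that user-visible impact is just the sum of the detection delay and the diagnosis time; the function name `user_impact_minutes` is hypothetical, and real incidents also include remediation and verification time.

```python
def user_impact_minutes(detection_delay, diagnosis):
    """Simplified model: impact = time undetected + time spent diagnosing."""
    return detection_delay + diagnosis


# Detection is 20 minutes late in both cases; only diagnosis time changes.
before = user_impact_minutes(detection_delay=20, diagnosis=30)
after = user_impact_minutes(detection_delay=20, diagnosis=5)

# Same faster diagnosis, but with detection happening on time.
on_time = user_impact_minutes(detection_delay=0, diagnosis=5)

print(before, after, on_time)  # 50 25 5
```

Under this model, faster diagnosis takes impact from 50 minutes to 25, while fixing detection as well takes it to 5: the detection delay sets a floor that MTTR improvements alone cannot lower.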

AI SRE can reduce MTTD by analyzing telemetry continuously rather than reactively. Instead of waiting for a threshold to be crossed, Traversal can detect anomalous patterns and surface degradation before it becomes severe enough to trigger conventional alerts. This is especially valuable for the failure modes AI-enabled systems introduce, where surface metrics may stay green while semantic quality is degrading.
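To illustrate the general difference between a static threshold and pattern-based detection, here is a minimal sketch using a rolling z-score. This is a generic textbook technique chosen for illustration, not a description of how Traversal works; the function name, window size, z-score cutoff, latency samples, and the 200 ms static threshold are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(samples, window=10, z_threshold=3.0):
    """Return indices of samples that deviate sharply from the recent window."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                flagged.append(i)
        history.append(value)
    return flagged


# Hypothetical latency samples (ms): a stable baseline, then a degradation
# that never reaches the assumed static alert threshold of 200 ms.
latency_ms = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 130, 135, 140]
static_threshold_ms = 200

static_alerts = [i for i, v in enumerate(latency_ms) if v > static_threshold_ms]
anomaly_alerts = rolling_zscore_alerts(latency_ms)

print(static_alerts)      # [] -- the static threshold never fires
print(anomaly_alerts[0])  # 11 -- the first degraded sample is flagged
```

One design caveat: because the window absorbs the degraded samples, a sustained shift eventually looks normal to this detector, so a sketch like this would need a more robust baseline in real use.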

MTTR (Mean Time to Resolution)

MTTR (Mean Time to Resolution) is the average time it takes to fully resolve a production incident: from initial detection through investigation, remediation, and verification that the issue has been corrected. It is the headline metric for incident response programs.

Incident Response

Incident response is the structured process of detecting, diagnosing, and resolving production issues that affect users, encompassing alert triage, investigation, remediation, and post-incident learning.

Observability

Observability is the practice of instrumenting production systems to expose enough internal state, through metrics, events, logs, and traces, that engineers can ask new questions about system behavior without needing to ship new code.

Alert Fatigue

Alert fatigue is the operational condition in which engineers receive so many alerts, many of which are non-actionable, redundant, or false positives, that critical signals get missed and responders become desensitized to paging.


