MTTD (Mean Time to Detect)

MTTD (Mean Time to Detect) is the average time between when a production issue actually begins and when monitoring or alerting systems first detect it — one of the foundational metrics for measuring reliability program effectiveness.
MTTD is often the largest hidden component of total incident duration. An incident's "official" clock typically starts when an alert fires and a ticket gets created, but the user-visible impact began earlier: sometimes minutes, sometimes hours. A 30-minute MTTD doesn't mean the system was healthy for those 30 minutes; it means the system was already failing and the team didn't yet know.
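As a rough illustration of the definition, MTTD can be computed by averaging the gap between when each incident began and when it was first detected. The sketch below assumes incident records are available as dictionaries with hypothetical `started_at` and `detected_at` fields. In practice the true start time usually has to be reconstructed after the fact, for example from logs or user reports. The sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Average of (detected_at - started_at) across incident records.

    The field names are assumptions for this sketch, not a standard schema.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    gaps = [i["detected_at"] - i["started_at"] for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)

# Invented sample data: detection gaps of 30, 10, and 20 minutes.
incidents = [
    {"started_at": datetime(2024, 1, 1, 10, 0), "detected_at": datetime(2024, 1, 1, 10, 30)},
    {"started_at": datetime(2024, 1, 2, 14, 0), "detected_at": datetime(2024, 1, 2, 14, 10)},
    {"started_at": datetime(2024, 1, 3, 9, 0), "detected_at": datetime(2024, 1, 3, 9, 20)},
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:20:00
```

Note that `started_at` here is the moment impact began, not the moment the ticket was opened. Using ticket creation time instead would make the detection gap look like zero.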
Reducing MTTD compounds the effect of MTTR improvements. The clock on user-visible impact starts when the failure begins, not when the incident is created. A team that cuts diagnosis time from 30 minutes to 5 saves 25 minutes per incident, but that saving applies only to the time after detection. If detection was 20 minutes late, users still experience 25 minutes of degraded service: 20 undetected plus 5 in diagnosis. MTTD improvements typically come from better instrumentation, smarter alerting (especially on user-impacting signals rather than just infrastructure metrics), and synthetic monitoring that detects symptoms before customers notice them.
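The arithmetic above can be written out as a small model. This sketch makes a simplifying assumption that user-visible impact is just detection delay plus diagnosis time, ignoring mitigation and recovery. The function name is hypothetical.

```python
def user_impact_minutes(detection_delay, diagnosis_time):
    """Simplified model: impact runs from failure start to end of diagnosis."""
    return detection_delay + diagnosis_time

# Detection is 20 minutes late in both cases; only diagnosis speed changes.
before = user_impact_minutes(detection_delay=20, diagnosis_time=30)
after = user_impact_minutes(detection_delay=20, diagnosis_time=5)

# Same fast diagnosis, but with detection happening on time.
on_time = user_impact_minutes(detection_delay=0, diagnosis_time=5)

print(before, after, on_time)  # 50 25 5
```

Under this model, once diagnosis is fast, the detection delay dominates what remains: 20 of the 25 remaining minutes are time the team did not yet know about the failure.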
AI SRE can reduce MTTD by analyzing telemetry continuously rather than reactively. Instead of waiting for a threshold to be crossed, Traversal can detect anomalous patterns, surfacing degradation before it becomes severe enough to trigger conventional alerts. This is especially valuable for the failure modes AI-enabled systems introduce, where surface metrics may stay green while semantic quality is degrading.
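To make the contrast between threshold alerting and anomaly detection concrete, here is a minimal sketch using a generic rolling z-score. It is not a description of how Traversal works, which the text does not specify. The function names, window size, z-score limit, and latency values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def first_static_alert(series, threshold):
    """Index of the first sample that exceeds a fixed threshold, or None."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None

def first_anomaly(series, window=10, z_limit=3.0):
    """Index of the first sample more than z_limit standard deviations
    above the mean of the preceding `window` samples, or None."""
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: z-score undefined, skip this sample
        if (series[i] - mean(baseline)) / sigma > z_limit:
            return i
    return None

# Invented latency samples (ms): steady near 100, then a gradual climb.
latency_ms = [100, 102, 99, 101, 98, 100, 103, 97, 101, 99,
              100, 102, 120, 150, 200, 300, 450, 600]

static_idx = first_static_alert(latency_ms, threshold=500)
anomaly_idx = first_anomaly(latency_ms)
print(static_idx, anomaly_idx)  # 17 12
```

In this invented series the fixed 500 ms threshold fires at sample 17, while the rolling baseline flags the departure at sample 12, five samples earlier. A real detector would also need to handle seasonality and noise, which this sketch ignores.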