Incident Response

Incident response is the structured process of detecting, diagnosing, and resolving production issues that affect users, encompassing alert triage, investigation, remediation, and post-incident learning.
Incident response is the most operationally visible function of a reliability team and the most variable in quality across organizations. A well-run incident converges quickly: the blast radius is established within minutes, the leading hypothesis is named and falsifiable, the bridge has the right people on it (no more, no fewer), and the incident commander directs threads in parallel rather than letting investigation collapse into sequential conversation. A poorly run incident does the opposite: too many people join the bridge, no clear hypothesis emerges, the team anchors on the first plausible explanation, and wrong rollbacks extend the incident by 40 minutes or more.
The first fifteen minutes of a serious incident shape everything that follows. A good first fifteen minutes produces five outputs: confirmed real user impact, an estimate of blast radius, a bounded time window for investigation anchored to the actual behavior change (not the alert time), a ranked short list of candidate causes stated as falsifiable claims, and a named incident commander with explicit authority. If fifteen minutes pass without these outputs, the incident is running without a foundation. The AI SRE Handbook covers the protocol in detail.
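The five outputs can be tracked as a simple checklist. The sketch below is illustrative only: the class name `FirstFifteen`, its field names, and the example values are assumptions made for this example, not part of the AI SRE Handbook protocol.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class FirstFifteen:
    """Hypothetical record of the five first-fifteen-minute outputs."""

    user_impact_confirmed: bool = False
    blast_radius: Optional[str] = None
    # Window anchored to when behavior changed, not when the alert fired.
    window_start: Optional[datetime] = None
    window_end: Optional[datetime] = None
    # Candidate causes, most likely first, each stated as a falsifiable claim.
    hypotheses: list[str] = field(default_factory=list)
    incident_commander: Optional[str] = None

    def missing_outputs(self) -> list[str]:
        """Return the outputs that have not been produced yet."""
        missing = []
        if not self.user_impact_confirmed:
            missing.append("confirmed user impact")
        if not self.blast_radius:
            missing.append("blast radius estimate")
        if self.window_start is None or self.window_end is None:
            missing.append("bounded time window")
        if not self.hypotheses:
            missing.append("ranked candidate causes")
        if not self.incident_commander:
            missing.append("named incident commander")
        return missing


# Example: impact is confirmed and a commander is named, nothing else yet.
state = FirstFifteen(user_impact_confirmed=True, incident_commander="alice")
print(state.missing_outputs())
# ['blast radius estimate', 'bounded time window', 'ranked candidate causes']
```

A non-empty result at the fifteen-minute mark is the signal described above: the incident is still running without a foundation.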
AI SRE changes the shape of incident response by compressing the orientation phase. When an alert fires, a real-time investigation begins before the human bridge opens: the Production World Model is already mapped, dependencies are already traversed, and ranked hypotheses are already produced. The incident commander's job in the first fifteen minutes shifts from assembling context to validating the AI's hypothesis set, identifying what it may have missed, and directing human investigation toward the gaps. This puts the most expensive minutes of an incident to more efficient use.
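One way to picture "directing human investigation toward the gaps" is to compare the components named in the AI's hypotheses against the dependencies of the affected service. The sketch below is a toy illustration under loud assumptions: the hypothesis format (`component`, `claim`, `score`), the service names, and the function `uncovered_components` are all invented here and do not describe how any AI SRE system represents its output.

```python
def uncovered_components(dependencies, hypotheses):
    """Return dependencies that no hypothesis names, preserving input order."""
    covered = {h["component"] for h in hypotheses}
    return [d for d in dependencies if d not in covered]


# Hypothetical dependency list for the affected service.
deps = ["checkout-api", "payments-db", "auth-service", "cdn"]

# Hypothetical AI-produced hypotheses, each a falsifiable claim with a score.
ai_hypotheses = [
    {"component": "checkout-api", "claim": "bad deploy raised error rate", "score": 0.3},
    {"component": "payments-db", "claim": "connection pool exhausted", "score": 0.6},
]

# Validate the highest-scored claims first.
ranked = sorted(ai_hypotheses, key=lambda h: h["score"], reverse=True)

# Components no hypothesis touches are candidates for human investigation.
gaps = uncovered_components(deps, ai_hypotheses)
print([h["component"] for h in ranked])  # ['payments-db', 'checkout-api']
print(gaps)  # ['auth-service', 'cdn']
```

In this toy setup the commander would check the `payments-db` claim first and assign someone to look at `auth-service` and `cdn`, which the hypothesis set does not cover.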