Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is the discipline of applying software engineering practices to operations problems: automating toil, defining measurable reliability objectives, balancing velocity against risk through error budgets, and treating production reliability as an engineering function rather than a reactive support burden.

SRE was developed at Google in the early 2000s and codified in the book Site Reliability Engineering: How Google Runs Production Systems (2016), widely known as the Google SRE Book, which became the foundational text of the discipline. The core insight was that production reliability had been treated as a separate function from software engineering: a manual, reactive, often understaffed activity that absorbed cost without producing engineering value. SRE reframed reliability as an engineering problem: define measurable SLOs, instrument systems to measure them, automate the manual work (toil) that gets in the way, and use error budgets to make reliability-vs-velocity tradeoffs explicit. The discipline introduced practices that are now industry standard, including postmortems, incident command, alerting discipline, and change discipline.

The contributions of classic SRE remain foundational. Even in environments where AI has reshaped the operational landscape, organizations still need clear service expectations, fast and well-run incidents, meaningful postmortems, capacity planning, change discipline, and a bias toward automation over repetitive human work. SRE has not become obsolete in the age of AI. Rather, the load now placed on SRE has grown beyond what its original human-centered workflows were designed to absorb.

AI SRE

An AI SRE (AI Site Reliability Engineer) is an autonomous agentic system that performs causal investigation, root cause analysis, and remediation across production environments, operating as a continuously available teammate alongside human reliability engineers.

Toil

Toil is the manual, repetitive, automatable operational work that scales with service growth: work that produces no enduring engineering value, consumes time that could be spent on higher-leverage activity, and accumulates as a structural tax on engineering capacity.

Service Level Objective (SLO)

A Service Level Objective (SLO) is a target value for a service's reliability: for example, "99.9% of requests will complete in under 200 milliseconds over a 30-day rolling window." SLOs are the foundational unit of measurement in modern reliability engineering and the input that defines a service's error budget.
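As a rough illustration of how the example SLO above could be checked against measured data, the sketch below computes the fraction of requests that completed under the latency threshold and compares it with the target. The names `LatencySLO`, `good_fraction`, and `meets_slo` are hypothetical, and the list of latencies stands in for whatever a real monitoring system would report over the 30-day window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency SLO such as '99.9% under 200 ms'."""
    target: float        # required fraction of good requests, e.g. 0.999
    threshold_ms: float  # a request is "good" if it finishes faster than this


def good_fraction(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the fraction of requests that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], slo: LatencySLO) -> bool:
    """True when the observed good fraction is at or above the SLO target."""
    return good_fraction(latencies_ms, slo.threshold_ms) >= slo.target


# Assumed sample data: 9,995 fast requests and 5 slow ones in the window.
slo = LatencySLO(target=0.999, threshold_ms=200.0)
window = [120.0] * 9995 + [450.0] * 5
print(meets_slo(window, slo))
```

In practice the good and total request counts usually come from pre-aggregated metrics rather than raw latency lists, but the comparison against the target is the same.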

Error Budget

An error budget is the amount of unreliability a service is permitted before reliability work takes priority over feature work. It is defined as the difference between 100% and the service's target reliability (a 99.9% target leaves a 0.1% budget) and is treated as a budget to be spent deliberately.
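The arithmetic behind an error budget can be sketched as follows, assuming a 99.9% target over a 30-day window as in the SLO example. The function names are hypothetical, and the request counts are made-up illustration values rather than measurements.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Budget as a fraction: the gap between 100% and the target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view: minutes of full unavailability the window tolerates."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Event-based view: share of the budget still unspent (negative if overspent)."""
    allowed_bad = error_budget_fraction(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# A 99.9% target over 30 days tolerates roughly 43.2 minutes of downtime.
print(round(allowed_bad_minutes(0.999, 30), 1))

# Assumed counts: 1,000,000 requests allow about 1,000 bad ones at 99.9%;
# 250 bad requests so far leaves roughly three quarters of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A negative result from `budget_remaining` means the budget is overspent, which under the definition above is the signal that reliability work should take priority over feature work.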

Incident Response

Incident response is the structured process of detecting, diagnosing, and resolving production issues that affect users, encompassing alert triage, investigation, remediation, and post-incident learning.
