Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is the discipline of applying software engineering practices to operations problems — automating toil, defining measurable reliability objectives, balancing velocity against risk through error budgets, and treating production reliability as an engineering function rather than a reactive support burden.

SRE was developed at Google in the early 2000s and codified in the 2016 book Site Reliability Engineering: How Google Runs Production Systems (commonly called the Google SRE Book), which became the foundational text of the discipline. The core insight was that production reliability had been treated as a separate function from software engineering: a manual, reactive, often understaffed activity that absorbed cost without producing engineering value. SRE reframed reliability as an engineering problem: define measurable service level objectives (SLOs), instrument systems to measure them, automate the manual work (toil) that gets in the way, and use error budgets (the amount of unreliability an SLO permits over a given window) to make reliability-vs-velocity tradeoffs explicit. The discipline introduced or popularized practices that are now industry standard, including postmortems, incident command, alerting discipline, and change discipline.
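The error-budget idea can be made concrete with a little arithmetic. The following is a minimal sketch, not a reference implementation: it assumes a request-based availability SLO, and the names `ErrorBudget`, `error_budget`, `can_ship_risky_change`, and the `freeze_threshold` parameter are hypothetical, chosen here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Illustrative container; field names are assumptions, not a standard."""
    allowed_failures: float    # failures the SLO tolerates in the window
    consumed_fraction: float   # share of the budget spent (can exceed 1.0)
    remaining_failures: float  # allowed minus observed (negative if overspent)


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> ErrorBudget:
    """Compute budget usage for a request-based SLO over one window.

    slo_target is a fraction such as 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be within [0, total_requests]")

    # The budget is the complement of the SLO, scaled by observed traffic.
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed
    return ErrorBudget(allowed, consumed, allowed - failed_requests)


def can_ship_risky_change(budget: ErrorBudget,
                          freeze_threshold: float = 1.0) -> bool:
    """Hypothetical policy: allow risky releases only while the consumed
    share of the budget is below the threshold."""
    return budget.consumed_fraction < freeze_threshold


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
    budget = error_budget(0.999, 1_000_000, 250)
    print(f"allowed failures:  {budget.allowed_failures:.0f}")
    print(f"budget consumed:   {budget.consumed_fraction:.0%}")
    print(f"risky change ok?   {can_ship_risky_change(budget)}")
```

Under this sketch the reliability-vs-velocity tradeoff becomes a number a team can act on: while budget remains, it can be spent on faster or riskier releases, and once it is exhausted the policy shifts effort toward reliability work. Real policies typically also account for the time window and burn rate, which are omitted here.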

The contributions of classic SRE remain foundational. Even in environments where AI has reshaped the operational landscape, organizations still need clear service expectations, fast and well-run incident response, meaningful postmortems, capacity planning, change discipline, and a bias toward automation over repetitive human work. SRE has not become obsolete in the age of AI. Rather, the load now placed on it has grown beyond what its original human-centered workflows were designed to absorb.