Error Budget

An error budget is the amount of unreliability a service is permitted before reliability work takes priority over feature work. It is defined as the difference between 100% and the service's target reliability, and it is treated as a budget to be deliberately spent.

The error budget framework was introduced in the Google SRE Book as a way to make reliability-vs-velocity tradeoffs explicit. Rather than treating every outage as a failure to be apologized for, the discipline acknowledges that 100% availability is the wrong target for almost everything: the cost of approaching it grows steeply, while the marginal customer benefit shrinks. Setting a Service Level Objective of, say, 99.9% creates a 0.1% budget. The team can spend that budget on a faster release cadence, riskier changes, or aggressive experimentation, knowing exactly when the budget is gone and reliability work has to take precedence.
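To make the arithmetic concrete, here is a minimal sketch that converts an availability target into allowed downtime. The function name and the 30-day window are assumptions chosen for illustration; the 99.9% target is the example used above.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction (0.999 means 99.9%). The 30-day default
    window is an illustrative assumption, not a fixed rule.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    # The budget is whatever the target leaves over: 100% minus the SLO.
    return (1.0 - slo_target) * window_minutes


budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes of downtime allowed")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime to spend.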

The framework remains foundational but needs extension for AI-enabled systems. Traditional error budgets cover availability and latency, but an AI system can be fully available and meet every latency target while still failing semantically: returning answers that are wrong, stale, or unsafe. Mature reliability programs are extending budgets to cover answer quality regression, fallback rate, policy compliance, and other dimensions specific to AI-enabled components. Without those extended budgets, an organization can meet every infrastructure SLO while running a system that is failing its users in ways that won't show up in a weekly reliability review.
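One way to picture an extended budget is to track each dimension with the same good-events-over-total-events accounting. The sketch below is illustrative only: the `BudgetDimension` class, the dimension names, the targets, and the counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetDimension:
    """One budgeted dimension (hypothetical structure for illustration)."""
    name: str
    target: float  # required fraction of "good" events, e.g. 0.999
    good: int      # events that met the objective
    total: int     # all events observed in the window

    def remaining(self) -> float:
        """Fraction of the budget left; a negative value means overspent."""
        allowed_bad = (1.0 - self.target) * self.total
        actual_bad = self.total - self.good
        if allowed_bad == 0:
            return 0.0 if actual_bad == 0 else float("-inf")
        return 1.0 - actual_bad / allowed_bad


# Made-up numbers: the infrastructure dimension is healthy,
# while the answer-quality dimension has overspent its budget.
availability = BudgetDimension("availability", 0.999, good=99_950, total=100_000)
answer_quality = BudgetDimension("answer_quality", 0.95, good=94_000, total=100_000)

for dim in (availability, answer_quality):
    print(f"{dim.name}: {dim.remaining():+.0%} of budget remaining")
```

In this made-up example the availability budget is only half spent, yet the answer-quality budget is already exhausted, which is the situation an infrastructure-only review would miss.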

Error budgets work best when paired with the discipline of feeding budget consumption back into prioritization. When the budget is burning, engineering leadership has a clear signal that reliability investment needs to take priority. When the budget is unspent at the end of the quarter, the team has license to ship faster. AI SRE reduces the time spent on each incident; shorter incidents consume less of the budget, which leaves more of it available for shipping changes without sacrificing velocity.
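A common way to turn budget consumption into a prioritization signal is a burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly by the end of the window. The sketch below assumes a simple request-count model; the function names and the threshold of 1.0 are illustrative choices, not a prescribed policy.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def priority(rate: float, threshold: float = 1.0) -> str:
    """Map a burn rate to a coarse signal (the threshold is an assumption)."""
    return "prioritize reliability" if rate > threshold else "ship features"


# Made-up numbers: 30 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(bad_events=30, total_events=10_000, slo_target=0.999)
print(f"burn rate {rate:.1f}: {priority(rate)}")
```

Here the service is burning budget about three times faster than the SLO allows, so the signal points toward reliability work.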

Service Level Objective (SLO)

A Service Level Objective (SLO) is a target value for a service's reliability: for example, "99.9% of requests will complete in under 200 milliseconds over a 30-day rolling window." SLOs are the foundational unit of measurement in modern reliability engineering and the input that defines a service's error budget.
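As a rough sketch of how such a target might be checked, the snippet below computes the fraction of requests that completed in under 200 milliseconds and compares it with the 99.9% objective. The sample latencies and the function name are made up for illustration, and a real check would cover the full 30-day rolling window rather than ten samples.

```python
def slo_compliance(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Return the fraction of requests that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    # "Under 200 ms" is strict: a request at exactly 200 ms does not count.
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical latency samples, in milliseconds.
samples = [120, 180, 199, 200, 450, 90, 150, 175, 60, 110]
compliance = slo_compliance(samples)
meets_slo = compliance >= 0.999

print(f"compliance {compliance:.1%}, meets SLO: {meets_slo}")
```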

Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is the discipline of applying software engineering practices to operations problems: automating toil, defining measurable reliability objectives, balancing velocity against risk through error budgets, and treating production reliability as an engineering function rather than a reactive support burden.

MTTR (Mean Time to Resolution)

MTTR (Mean Time to Resolution) is the average time it takes to fully remediate a production incident: from initial detection through investigation, remediation, and verification that the issue has been corrected. It is the headline metric for incident response programs.
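A minimal sketch of the calculation, assuming each incident is recorded as a pair of detection and verified-resolution timestamps. The incident data and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (detected_at, resolved_and_verified_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 16, 0)),
    (datetime(2024, 1, 20, 2, 15), datetime(2024, 1, 20, 2, 30)),
]


def mean_time_to_resolution(
    records: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average the detection-to-resolution duration across incidents."""
    if not records:
        raise ValueError("need at least one incident")
    durations = [resolved - detected for detected, resolved in records]
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr}")
```

An average can hide one very long incident, so it may be worth looking at the individual durations alongside the mean.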

Toil

Toil is the manual, repetitive, automatable operational work that scales with service growth: work that produces no enduring engineering value, consumes time that could be spent on higher-leverage activity, and accumulates as a structural tax on engineering capacity.
