Error Budget

An error budget is the amount of unreliability a service is permitted before reliability work takes priority over feature work. It is defined as the difference between 100% and the service's target reliability, and it is treated as a budget to be spent deliberately.

The error budget framework was introduced in the Google SRE Book as a way to make reliability-versus-velocity tradeoffs explicit. Rather than treating every outage as a failure to be apologized for, the discipline acknowledges that 100% availability is the wrong target for almost everything: the cost of approaching it rises steeply, while the marginal customer benefit shrinks. Setting a Service Level Objective (SLO) of, say, 99.9% creates a 0.1% budget. The team can spend that budget on a faster release cadence, riskier changes, or aggressive experimentation, knowing exactly when the budget is gone and reliability work has to take precedence.
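To make the 99.9% example concrete, the following minimal sketch converts an SLO into a budget fraction and into permitted downtime over a window. The function names and the 30-day window are assumptions chosen for illustration; they do not come from the SRE Book or from any particular tool.

```python
def error_budget_fraction(slo: float) -> float:
    """Return the permitted unreliability for an SLO given as a fraction (0.999 = 99.9%)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return how many minutes of full outage the budget permits in the window.

    The 30-day default is an illustrative assumption; use the window your SLO names.
    """
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo) * window_minutes


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
print(f"{allowed_downtime_minutes(0.999):.1f} minutes")
# Each additional nine shrinks the budget by a factor of ten.
print(f"{allowed_downtime_minutes(0.9999):.2f} minutes")
```

Under these assumptions, moving from 99.9% to 99.99% cuts the monthly budget from about 43 minutes to about 4, which illustrates why each additional nine is so much harder to sustain.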

The framework remains foundational but needs extension for AI-enabled systems. Traditional error budgets cover availability and latency. An AI system can be fully available and meet every latency target while still failing semantically, returning answers that are wrong, stale, or unsafe. Mature reliability programs are extending budgets to cover answer quality regression, fallback rate, policy compliance, and other dimensions specific to AI-enabled components. Without those extended budgets, an organization can meet every infrastructure SLO while running a system that is failing its users in ways that won't show up in a weekly reliability review.

Error budgets work best when budget consumption is fed back into prioritization. When the budget is burning faster than planned, engineering leadership has a clear signal that reliability investment needs to take priority. When budget remains unspent at quarter end, the team has license to ship faster. AI SRE can reduce the time spent on each incident, so each incident consumes less of the budget and the same budget stretches further without sacrificing velocity.
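The feedback loop described above can be sketched as a burn-rate check. This is a minimal illustration, not a standard policy: the `assess_budget` function, the `BudgetStatus` type, the action labels, and the thresholds (burn rate above 1.0, budget fully consumed) are all hypothetical. It also assumes a request-based SLO and roughly uniform traffic across the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    burn_rate: float  # observed error rate divided by the error rate the SLO allows
    consumed: float   # estimated fraction of the window's budget already spent
    action: str       # hypothetical prioritization label


def assess_budget(slo: float, total_requests: int, failed_requests: int,
                  window_elapsed: float) -> BudgetStatus:
    """Estimate budget consumption so far and suggest a priority.

    window_elapsed is the fraction of the SLO window that has passed (0 < x <= 1).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")
    if not 0.0 < window_elapsed <= 1.0:
        raise ValueError("window_elapsed must be in (0, 1]")

    observed_error_rate = failed_requests / total_requests
    burn_rate = observed_error_rate / (1.0 - slo)
    # With uniform traffic, a burn rate of 1.0 spends the budget exactly at window end.
    consumed = burn_rate * window_elapsed

    if consumed >= 1.0:
        action = "freeze risky releases"    # budget is gone
    elif burn_rate > 1.0:
        action = "prioritize reliability"   # on track to run out early
    else:
        action = "ship features"            # spending within plan
    return BudgetStatus(burn_rate, consumed, action)


# A quarter of the way through the window, erring at twice the allowed rate.
status = assess_budget(slo=0.999, total_requests=1_000_000,
                       failed_requests=2_000, window_elapsed=0.25)
print(status.action)  # prioritize reliability
```

In this example about half the budget is gone with three quarters of the window remaining, so the sketch flags reliability work before the budget is actually exhausted. The same ratio could in principle be computed for an AI-specific indicator such as fallback rate, provided a target has been set for it.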