Service Level Objective (SLO)

A Service Level Objective (SLO) is a target value for a service's reliability: for example, "99.9% of requests will complete in under 200 milliseconds over a 30-day rolling window." SLOs are the foundational unit of measurement in modern reliability engineering and the input that defines a service's error budget.
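As a minimal sketch of how the example target above could be checked against measured request latencies: the names `LatencySLO`, `compliance`, and `meets_slo` are hypothetical and chosen for illustration, and the sketch assumes the latencies for the window have already been collected into a list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency SLO (names are illustrative)."""
    target: float        # required fraction of good requests, e.g. 0.999
    threshold_ms: float  # a request counts as good if it finishes under this


def compliance(latencies_ms: Sequence[float], slo: LatencySLO) -> float:
    """Return the fraction of requests that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed in the window")
    good = sum(1 for ms in latencies_ms if ms < slo.threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], slo: LatencySLO) -> bool:
    """True when the observed good fraction is at or above the target."""
    return compliance(latencies_ms, slo) >= slo.target


# "99.9% of requests will complete in under 200 milliseconds"
slo = LatencySLO(target=0.999, threshold_ms=200.0)

# Made-up sample data: 9,990 fast requests and 10 slow ones.
window = [50.0] * 9990 + [500.0] * 10
print(meets_slo(window, slo))
```

In practice the window would be a rolling 30 days of production data rather than an in-memory list, but the comparison of an observed fraction against a target is the same.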

The SLO framework was codified in Google's Site Reliability Engineering book (the "SRE Book") and has since become standard practice across the industry. Its power is that it makes the tradeoff between reliability and delivery velocity explicit. Rather than treating reliability as an aspiration ("we want the service to be reliable") or a binary state ("it's up or it's down"), an SLO defines a measurable target. The team's job becomes meeting that target: no more, no less. Consistently exceeding the target means engineering effort is being spent on reliability that could be invested elsewhere, such as in feature work. Missing the target means the error budget is exhausted and reliability work needs to take precedence.
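The error budget follows directly from the target: it is the share of events (or time) that is allowed to be bad, that is, one minus the SLO. The sketch below is illustrative only; the function names are hypothetical, and the traffic figures in the usage lines are made up.

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - target) * total_events


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Time-based view of the same budget, in minutes of full outage."""
    return (1.0 - target) * window_days * 24 * 60


def budget_status(target: float, total_events: int, bad_events: int) -> str:
    """Classify the window: is there budget left, or has it been spent?"""
    remaining = error_budget(target, total_events) - bad_events
    return "budget remaining" if remaining > 0 else "budget exhausted"


# A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
print(allowed_downtime_minutes(0.999, 30))

# With 1,000,000 requests in the window, about 1,000 may fail.
print(budget_status(0.999, 1_000_000, bad_events=400))
print(budget_status(0.999, 1_000_000, bad_events=1_500))
```

A status of "budget remaining" corresponds to room for riskier changes, while "budget exhausted" corresponds to the case where reliability work takes precedence.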

Good SLOs measure user-impacting behavior, not infrastructure metrics. A 99.99% CPU-availability SLO means little to the customer; a 99.9% successful-checkout SLO is directly meaningful. The shift from infrastructure SLOs to outcome SLOs has been one of the main ways the discipline has matured over the past decade. Modern programs typically maintain SLO hierarchies: high-level business-outcome SLOs at the top, decomposed into service-level SLOs that the underlying engineering teams own. Because each service-level target traces back to a business outcome, the decomposition makes the case for reliability investment defensible to executive stakeholders.

SLOs for AI-enabled systems need to extend beyond traditional latency and availability. A service can be 100% available and meet every latency target while still failing semantically, returning answers that are wrong, stale, or unsafe. Mature reliability programs are extending SLOs to cover answer-quality regression, fallback rate, policy compliance, and other dimensions specific to AI-enabled components. The discipline is the same; the surface being measured is broader.
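One way to picture the broader surface is a set of objectives evaluated side by side, where availability and latency are only two of the entries. The sketch below is a hypothetical illustration: the `Objective` class, the dimension names, and every target and observed value are assumptions made for the example, not recommended thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Objective:
    """One SLO dimension (hypothetical structure for illustration)."""
    name: str
    target: float
    higher_is_better: bool = True  # False for rates that should stay low

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


def evaluate(objectives: List[Objective], observed: Dict[str, float]) -> Dict[str, bool]:
    """Return a pass/fail verdict per dimension."""
    return {o.name: o.is_met(observed[o.name]) for o in objectives}


# Illustrative targets only; real values depend on the product.
objectives = [
    Objective("availability", 0.999),
    Objective("latency_under_200ms", 0.999),
    Objective("answer_quality", 0.95),           # share of graded answers passing
    Objective("fallback_rate", 0.02, higher_is_better=False),
    Objective("policy_compliance", 0.999),
]

# A service that is fully available and fast, yet failing semantically.
observed = {
    "availability": 1.0,
    "latency_under_200ms": 1.0,
    "answer_quality": 0.90,
    "fallback_rate": 0.01,
    "policy_compliance": 0.9995,
}

results = evaluate(objectives, observed)
failing = [name for name, ok in results.items() if not ok]
print(failing)
```

Here the traditional dimensions pass while answer quality misses its target, which is the failure mode that availability and latency SLOs alone would not surface.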