Service Level Objective (SLO)

A Service Level Objective (SLO) is a target value for a service's reliability: for example, "99.9% of requests will complete in under 200 milliseconds over a 30-day rolling window." SLOs are the foundational unit of measurement in modern reliability engineering and the input that defines a service's error budget.
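As a rough illustration of the example target above, the sketch below checks a window of request latencies against a latency SLO. The `LatencySLO` class and the function names are hypothetical, chosen for this example, and the sketch assumes the rolling window has already been selected upstream.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency SLO (names are illustrative)."""
    target: float        # required fraction of good requests, e.g. 0.999
    threshold_ms: float  # a request is "good" if it completes under this


def compliance(latencies_ms: Sequence[float], slo: LatencySLO) -> float:
    """Fraction of requests in the window that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for ms in latencies_ms if ms < slo.threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], slo: LatencySLO) -> bool:
    """True when the observed good-request fraction reaches the target."""
    return compliance(latencies_ms, slo) >= slo.target


if __name__ == "__main__":
    slo = LatencySLO(target=0.999, threshold_ms=200)
    # Assumed sample data: 999 fast requests and 1 slow request.
    window = [100.0] * 999 + [250.0]
    print(compliance(window, slo), meets_slo(window, slo))
```

A production system would usually compute this from aggregated counters rather than raw latency lists, but the ratio of good events to total events is the same.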

The SLO framework's power is that it makes the reliability-velocity tradeoff explicit. Rather than treating reliability as an aspiration ("we want the service to be reliable") or a binary state ("it's up or it's down"), an SLO defines a measurable target. The team's job becomes meeting that target: no more, no less. Consistently exceeding the target means engineering effort is being spent on reliability that could be invested elsewhere. Missing the target means the error budget is exhausted and reliability work needs to take precedence.

Good SLOs measure user-impacting behavior, not infrastructure metrics. A 99.99% CPU availability SLO is meaningless to the customer; a 99.9% successful-checkout SLO is directly meaningful. The shift from infrastructure SLOs to outcome SLOs has been one of the maturity arcs of the discipline over the past decade. Modern programs typically maintain SLO hierarchies: high-level business-outcome SLOs at the top, decomposed into service-level SLOs that the underlying engineering teams own. The decomposition makes the business case for reliability investment defensible to executive stakeholders.

SLOs for AI-enabled systems need to extend beyond traditional latency and availability. A service can meet its availability and latency targets 100% of the time while still failing semantically: returning answers that are wrong, stale, or unsafe. Mature reliability programs are extending SLOs to cover answer quality regression, fallback rate, policy compliance, and other dimensions specific to AI-enabled components. The discipline is the same; the surface being measured is broader.
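One way to picture the broader surface is a set of objectives evaluated together, where some dimensions are "higher is better" (availability, answer quality) and others are "lower is better" (fallback rate). The sketch below is only illustrative: the `Objective` class, the metric names, and the target values are assumptions for the example, not a standard schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Objective:
    """Hypothetical objective on one measured dimension."""
    name: str
    target: float
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        # Rates such as fallback rate must stay at or below the target.
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


def violations(objectives: List[Objective], observed: Dict[str, float]) -> List[str]:
    """Names of the objectives whose observed value misses the target."""
    return [o.name for o in objectives if not o.is_met(observed[o.name])]


if __name__ == "__main__":
    # Assumed targets, for illustration only.
    objectives = [
        Objective("availability", 0.999),
        Objective("answer_quality", 0.95),
        Objective("fallback_rate", 0.02, higher_is_better=False),
    ]
    # Fully available, yet answer quality has regressed below its target.
    observed = {"availability": 1.0, "answer_quality": 0.91, "fallback_rate": 0.01}
    print(violations(objectives, observed))
```

The point of the sketch is that a service can pass the availability check while still reporting a violation on a semantic dimension.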

Error Budget

An error budget is the amount of unreliability a service is permitted before reliability work takes priority over feature work. It is defined as the difference between 100% and the service's target reliability (a 99.9% target leaves a 0.1% budget), and it is treated as a budget to be deliberately spent.
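The arithmetic is simple enough to sketch. For a 99.9% target, the budget is 0.1% of whatever is being counted: roughly 43.2 minutes of downtime in a 30-day window, or about 1,000 failed requests per million. The function names below are hypothetical, and the sketch assumes a single target applied to a single window.

```python
def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Minutes of full unavailability the target permits over the window."""
    return (1.0 - target) * window_days * 24 * 60


def allowed_bad_events(target: float, total_events: int) -> float:
    """Number of failed events the target permits out of total_events."""
    return (1.0 - target) * total_events


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = allowed_bad_events(target, total_events)
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - bad_events / allowed


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(0.999, 30), 1))      # minutes per 30 days
    print(round(budget_remaining(0.999, 1_000_000, 250), 3))  # fraction unspent
```

A negative result from `budget_remaining` corresponds to the situation described above: the budget is exhausted and reliability work takes priority.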

Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is the discipline of applying software engineering practices to operations problems: automating toil, defining measurable reliability objectives, balancing velocity against risk through error budgets, and treating production reliability as an engineering function rather than a reactive support burden.

MTTR (Mean Time to Resolution)

MTTR (Mean Time to Resolution) is the average time it takes to fully remediate a production incident: from initial detection through investigation, remediation, and verification that the issue has been corrected. It is the headline metric for incident response programs.
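As a minimal sketch of the calculation, MTTR is the mean of each incident's duration from detection to verified resolution. The `(detected, resolved)` tuple layout and the sample timestamps are assumptions for illustration; real incident tooling will have its own record format.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each incident is a hypothetical (detected_at, resolved_at) pair.
Incident = Tuple[datetime, datetime]


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from detection to verified resolution across incidents."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Assumed sample data: one 30-minute and one 90-minute incident.
    incidents = [
        (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
        (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),
    ]
    print(mttr(incidents))
```

Because this is a mean, a single long incident can dominate the figure, so it is often worth looking at the distribution alongside it.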

Observability

Observability is the practice of instrumenting production systems to expose enough internal state, through metrics, events, logs, and traces, that engineers can ask new questions about system behavior without needing to ship new code.
