Observability

Observability is the practice of instrumenting production systems to expose enough internal state, through metrics, events, logs, and traces, that engineers can ask new questions about system behavior without needing to ship new code.

Observability emerged as a discipline when distributed systems made traditional monitoring insufficient. Monitoring asks predetermined questions ("Is this service up? Is latency under threshold?"). Observability is meant to answer questions you did not know to ask in advance: the unknown unknowns that surface during real incidents. The raw material for those answers is commonly summarized as MELT (Metrics, Events, Logs, Traces), the four foundational data types that make a system observable.
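The contrast between a predetermined check and an ad hoc question can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: the event fields (`service`, `region`, `build`, `latency_ms`), the sample data, and both function names are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical structured events; field names and values are illustrative only.
EVENTS = [
    {"service": "checkout", "region": "eu", "build": "v41", "latency_ms": 120},
    {"service": "checkout", "region": "eu", "build": "v42", "latency_ms": 910},
    {"service": "checkout", "region": "us", "build": "v42", "latency_ms": 130},
    {"service": "checkout", "region": "eu", "build": "v42", "latency_ms": 870},
    {"service": "checkout", "region": "us", "build": "v41", "latency_ms": 110},
]


def monitor_latency(events, threshold_ms=500):
    """Monitoring: one question, fixed in advance (is mean latency too high?)."""
    return mean(e["latency_ms"] for e in events) > threshold_ms


def mean_latency_by(events, *fields):
    """Observability-style query: group by whichever fields the question needs."""
    groups = defaultdict(list)
    for e in events:
        key = tuple(e[f] for f in fields)
        groups[key].append(e["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # The fixed check stays quiet because the overall mean is under threshold.
    print("alert:", monitor_latency(EVENTS))
    # Slicing by region and build, a question nobody planned for, isolates the problem.
    for key, value in sorted(mean_latency_by(EVENTS, "region", "build").items()):
        print(key, value)
```

In this made-up data the threshold check does not fire, while grouping by region and build shows that only one combination is slow. The second function needed no new instrumentation, only events that already carried enough fields.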

Observability has been a genuine advance. Tools like Splunk, Datadog, Grafana, New Relic, and Honeycomb let teams move from anecdote to evidence during an investigation. Its ceiling becomes visible when you walk through a typical incident: an alert fires, the engineer opens their laptop, scans recent deployments, queries logs, traces requests across services, and gradually narrows toward a hypothesis. The evidence existed all along, but no one had assembled it. The bottleneck is no longer visibility; it is reasoning. As The AI SRE Handbook puts it: "investing in more observability addresses visibility, not the reasoning bottleneck. A system with abundant telemetry will not generally recover faster from a complex multi-hop incident if it gets more telemetry. What it needs is a faster path from evidence to explanation."
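The manual assembly step in that walkthrough can be made concrete with a small sketch. It only gathers evidence that already exists into one timeline; it does not attempt the reasoning step. The record shapes, the 30-minute lookback, the sample data, and the `assemble_timeline` name are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical telemetry already sitting in separate tools.
ALERT_TIME = datetime(2025, 1, 1, 12, 0)
DEPLOYS = [
    {"time": datetime(2025, 1, 1, 9, 0), "service": "search", "version": "v7"},
    {"time": datetime(2025, 1, 1, 11, 45), "service": "payments", "version": "v42"},
]
ERROR_LOGS = [
    {"time": datetime(2025, 1, 1, 10, 0), "service": "search", "message": "cache miss"},
    {"time": datetime(2025, 1, 1, 11, 50), "service": "checkout", "message": "payments timeout"},
]


def assemble_timeline(alert_time, deploys, error_logs, lookback=timedelta(minutes=30)):
    """Collect deploys and errors from the window before the alert, oldest first."""
    start = alert_time - lookback
    evidence = []
    for d in deploys:
        if start <= d["time"] <= alert_time:
            evidence.append((d["time"], "deploy", f'{d["service"]} {d["version"]}'))
    for log in error_logs:
        if start <= log["time"] <= alert_time:
            evidence.append((log["time"], "error", f'{log["service"]}: {log["message"]}'))
    return sorted(evidence)


if __name__ == "__main__":
    for when, kind, detail in assemble_timeline(ALERT_TIME, DEPLOYS, ERROR_LOGS):
        print(when.isoformat(), kind, detail)
```

Even with the timeline in hand, someone still has to decide whether the deploy explains the error, which is the reasoning gap the paragraph above describes.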

The 2025 Grafana Observability Survey found that observability now consumes 17% of total compute infrastructure spend on average, with some organizations spending more on it than on the compute it monitors. AI SRE is positioned to put that investment to work: it takes observability data as input and produces causal explanations as output. It is not a replacement for observability platforms; it is the reasoning layer that turns the data they generate into actionable answers.