Toil

Toil is the manual, repetitive, automatable operational work that scales with service growth: work that produces no enduring engineering value, consumes time that could be spent on higher-leverage activity, and accumulates as a structural tax on engineering capacity.
Toil is one of the most durable concepts from the Google SRE Book. Before toil was named as a category, manual operational work was often accepted as the unavoidable cost of running production systems: engineers spent time on it because someone had to. The Google SRE Book reframed toil as an engineering problem to be solved: identify it, measure it, automate it, and reduce its share of the team's time over time. The discipline introduced explicit toil-reduction targets (Google's original guidance was that toil should consume no more than 50% of an SRE's time) and the practice of measuring toil as rigorously as any other operational metric.
The toil concept has aged well, but the surface has expanded. Classic toil included things like restarting failed services, manually scaling capacity, running deployment scripts, and acknowledging predictable alerts. Modern toil includes much more: maintaining runbook libraries that drift from reality, updating architecture diagrams that no one reads, answering the same engineering questions across a dozen shared platform support channels, manually correlating across observability tools during investigation, and reconstructing dependency graphs the system already knows.
AI SRE targets toil reduction across multiple dimensions. Alert Intelligence removes the per-alert investigation cost for transient noise. Chat with Prod answers engineering questions that previously consumed senior SME attention. The Production World Model™ replaces hand-maintained topology documentation with auto-discovered representation. The Knowledge Bank™ externalizes tribal knowledge into queryable form rather than requiring it to live in human memory. In Traversal customer environments, these capabilities have collectively returned hundreds of engineering hours per month to platform and reliability work that toil was previously consuming.