Published
August 25, 2025
DigitalOcean (DO), a leading public cloud provider serving over 600,000 customers, partnered with Traversal to strengthen infrastructure resilience and streamline on-call operations. Traversal enabled engineers to identify and resolve incidents faster and with greater confidence — without constant context-switching across fragmented tools.
What began as a proof-of-concept with a few teams quickly expanded across the engineering organization and into customer-facing systems — accelerating recovery, reducing operational stress, and freeing up engineering time for higher-impact work.
DO operates a global cloud infrastructure platform with hundreds of thousands of customers and over a dozen data centers worldwide. Like many scaled organizations, they had built mature incident response processes with structured alerting, ticketing, and triaging workflows.
But the combination of a large, distributed engineering organization and a complex technology stack spanning diverse services and multiple observability systems meant that incident response often involved navigating vast amounts of fragmented data. Even with strong processes in place, triaging was time-consuming, requiring engineers, each with limited context, to manually connect disparate signals to find the root cause of issues.
DO partnered with Traversal to reduce this manual operational workload by building an agentic AI SRE system that could handle both the most complex production incidents they faced and the noisy alerts that continually distracted engineers.
Traversal was initially deployed at DO as a proof-of-concept, focused on evaluating the AI’s ability to accurately analyze a broad sample of historical and live incidents. Within days, Traversal demonstrated rapid and consistent root cause identification across multiple scenarios:
“We have a very thorough evaluation process in these situations. We ran [Traversal] through multiple use cases, and every single time, they knocked it out of the park. We took real customer incidents that used to take our engineers an hour or more to resolve — and Traversal’s agents were identifying root causes in under a minute.”
- Bratin Saha, CTO & CPO of DigitalOcean
Based on a successful pilot, DO deployed Traversal on-prem with read-only access to its full observability stack. Traversal now provides comprehensive incident analysis across nine distinct systems — including Grafana, Elastic, VictoriaMetrics, GitHub, Alertmanager, Confluence, and Slack — surfacing root causes that might otherwise remain buried across fragmented tools. To do so, it routinely processes massive amounts of observability data — for example, between 30 million and 300 million logs per incident.
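To make that scale of data access concrete, the sketch below shows what read-only log retrieval for an incident window can look like against one of the systems listed above (Elastic), using the official Python client. The host, credentials, index pattern, and time window are assumptions for illustration only, not details of DO's deployment or of Traversal's integration.

```python
# Minimal sketch: read-only retrieval of logs for an incident time window
# from Elasticsearch. Host, API key, index pattern, and timestamps are
# hypothetical; a real deployment would page through far larger volumes
# (tens to hundreds of millions of logs per incident).
from elasticsearch import Elasticsearch

es = Elasticsearch("https://elastic.internal:9200", api_key="...")


def fetch_incident_logs(index: str, start: str, end: str, page_size: int = 10_000):
    """Yield log documents with timestamps in [start, end), paging via search_after."""
    search_after = None
    while True:
        params = {
            "size": page_size,
            "query": {"range": {"@timestamp": {"gte": start, "lt": end}}},
            # _doc as a tiebreaker keeps paging simple for the sketch; a
            # production version would use a point-in-time for consistency.
            "sort": [{"@timestamp": "asc"}, {"_doc": "asc"}],
        }
        if search_after:
            params["search_after"] = search_after
        resp = es.search(index=index, **params)
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit["_source"]
        search_after = hits[-1]["sort"]


# Example: pull an hour of logs around a hypothetical incident window.
for doc in fetch_incident_logs("app-logs-*", "2025-08-25T16:00:00Z", "2025-08-25T17:00:00Z"):
    pass  # hand off to downstream analysis
```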
Traversal is now in General Availability across DO’s entire engineering organization, and is used by 50+ engineers per month.
DO engineers can engage with Traversal in two ways: (1) by initiating an investigation directly in our UI, or (2) via automated triggers in Slack.
In Slack, Traversal auto-triggers in new incident channels within seconds — eliminating the need for engineers to manually gather context or ping others to get started. Traversal immediately begins working to identify the root cause, drawing on the observability stack without requiring direction. Traversal also runs continuously in key alert channels, launching its own investigation or deferring to DO’s automated runbook, depending on the nature of the alert.
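For readers curious what this kind of Slack automation can look like, here is a minimal sketch built with Slack's Bolt framework for Python: it watches for newly created incident channels, joins them, and kicks off an investigation. The channel-naming pattern and the start_investigation call are hypothetical placeholders, not Traversal's actual trigger logic.

```python
# Sketch of an auto-trigger for new incident channels, using Bolt for Python.
# The incident channel naming convention and the investigation hand-off are
# assumptions for illustration.
import re
from slack_bolt import App

app = App(token="xoxb-...", signing_secret="...")

INCIDENT_CHANNEL = re.compile(r"^incident-\d+")  # assumed naming convention


@app.event("channel_created")
def handle_new_channel(event, client, logger):
    channel = event["channel"]
    if not INCIDENT_CHANNEL.match(channel["name"]):
        return  # ignore non-incident channels

    # Join the channel so the bot can post findings there.
    client.conversations_join(channel=channel["id"])

    # Hypothetical hand-off to an investigation backend; in practice this
    # would pass the channel context to the AI SRE system.
    investigation_url = start_investigation(channel_id=channel["id"])

    client.chat_postMessage(
        channel=channel["id"],
        text=f"Investigation started automatically: {investigation_url}",
    )


def start_investigation(channel_id: str) -> str:
    """Placeholder for the real investigation trigger."""
    return f"https://example.internal/investigations/{channel_id}"


if __name__ == "__main__":
    app.start(port=3000)
```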
Engineers can ask follow-up questions and use the UI to explore the full investigation — including blast radius, response timeline, alternate hypotheses, system maps, and the supporting evidence behind each conclusion. These features were developed in close collaboration with the DO engineering team to align with their workflows.
Over the course of three months, Traversal correctly identified the relevant logs, metrics, or PRs in 73% of all in-scope incidents. Each investigation also includes a lightweight feedback loop, allowing engineers to quickly signal whether the output was useful — helping improve accuracy both during the incident and over time.
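A feedback loop like the one described above can be as lightweight as a pair of buttons attached to the investigation summary in Slack. The sketch below, again using Bolt for Python, is an assumed illustration rather than Traversal's implementation; record_feedback stands in for whatever store the signal actually lands in.

```python
# Sketch of a thumbs-up / thumbs-down feedback loop on an investigation
# summary posted to Slack. Illustrative only.
from slack_bolt import App

app = App(token="xoxb-...", signing_secret="...")


def post_summary_with_feedback(client, channel_id: str, summary: str, investigation_id: str):
    """Post the investigation summary with two feedback buttons attached."""
    client.chat_postMessage(
        channel=channel_id,
        text=summary,  # fallback text for notifications
        blocks=[
            {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Useful"},
                        "action_id": "feedback_useful",
                        "value": investigation_id,
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Not useful"},
                        "action_id": "feedback_not_useful",
                        "value": investigation_id,
                    },
                ],
            },
        ],
    )


@app.action("feedback_useful")
def on_useful(ack, body):
    ack()
    record_feedback(body["actions"][0]["value"], useful=True)


@app.action("feedback_not_useful")
def on_not_useful(ack, body):
    ack()
    record_feedback(body["actions"][0]["value"], useful=False)


def record_feedback(investigation_id: str, useful: bool):
    """Hypothetical sink for feedback signals (metrics store, database, etc.)."""
    print(f"investigation={investigation_id} useful={useful}")
```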
Around 4:54 p.m., DigitalOcean customers noticed they couldn’t select a data center region when creating Droplets — the virtual machines used to run their apps. A support ticket was filed, and by 5:45 p.m. the issue had escalated to CloudOps.
At 5:52 p.m., an incident Slack channel was created and the Traversal Slackbot joined immediately. By 6:03 p.m., Traversal had identified the root cause: a recent change in a deeply nested, non-obvious service.
With the cause clear, the DO team rolled back the PR and fully resolved the issue by 6:20 p.m.
An issue that would’ve taken 10–15 engineers over an hour to run down was traced within 11 minutes of the incident channel opening and fixed within 28, preventing prolonged downtime and limiting customer impact.