Published
October 7, 2025
Eventbrite, a global leader in event management and ticketing serving millions of users worldwide, faced the challenge of evolving its infrastructure over time — blending a legacy PySOA monolith with a growing set of microservices. This mix often left engineers short on context and tangled in circular dependencies when diagnosing incidents.
To address these challenges, Traversal’s platform leveraged LLMs, causal statistical techniques, and partial tracing to build a comprehensive knowledge graph that surfaced a far more granular map of dependencies across Eventbrite’s systems than had been possible before. This capability enabled Traversal’s AI SRE to surface accurate, deep root cause analyses for incidents, even when starting from very vague context.
Eventbrite engineers also proactively use Traversal’s knowledge graph to understand dependencies outside of incident response, and it has been particularly helpful for junior engineers lacking tribal knowledge.
Eventbrite’s infrastructure blended legacy and modern systems. At its core was a PySOA monolith, a structure with many interdependent services, libraries, and configurations. To integrate with traditional out-of-the-box observability tools like Datadog, the team instrumented the monolith in creative ways to get granular observability. However, that instrumentation surfaced dozens of newer microservices in the system map along with unusual circular dependencies, creating novel challenges when troubleshooting and even in simply understanding the Eventbrite system. Several custom dashboards were then created to improve visibility, which only added to the complexity of monitoring and incident response.
This is a common reality for mature companies that have evolved their infrastructure over time. But this hybrid approach creates challenges that traditional observability tools struggle to address, and in some cases even exacerbate.
To address these issues, Eventbrite partnered with Traversal to cut through the complexity and gain clearer visibility into its infrastructure, with the ultimate goal of automating incident response.
To operate effectively in Eventbrite's environment, Traversal's AI SRE integrated across multiple systems — Datadog, GitHub, Slack, and FireHydrant — while accommodating the specific requirements of their PySOA-based services.
Most observability tools struggle with the vague starting points and circular dependencies inherent in distributed monoliths. To address this challenge, Traversal collaborated with Eventbrite to build a comprehensive service map that revealed dependencies beyond what standard tools expose. To do so, Traversal’s AI SRE platform applied proprietary causal machine learning tools paired with reasoning models to connect evidence across fragmented systems and identify true root causes, even when context is limited.
The map served dual purposes: powering accurate root cause analysis for Traversal's AI SRE, and giving engineers clearer visibility into service relationships and system topology — particularly helpful for junior engineers, reducing the learning curve and reliance on senior SREs.
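To make the idea of a dependency map concrete, here is a minimal, hypothetical sketch of how partial call data might be aggregated into a service graph and queried during an incident. It is illustrative only; the service names, the networkx-based representation, and the traversal logic are assumptions, not Traversal’s implementation.

```python
# Illustrative sketch only (not Traversal's implementation): aggregating
# partial trace data into a service dependency graph and querying it for
# candidate root causes. Service names and call data are hypothetical.
import networkx as nx

# (caller, callee) pairs recovered from partial traces, logs, or configs
observed_calls = [
    ("checkout-frontend", "orders-api"),
    ("orders-api", "payments-service"),
    ("orders-api", "dynamodb"),
    ("payments-service", "dynamodb"),
]

graph = nx.DiGraph()
for caller, callee in observed_calls:
    if graph.has_edge(caller, callee):
        graph[caller][callee]["weight"] += 1  # how often the dependency is seen
    else:
        graph.add_edge(caller, callee, weight=1)

def upstream_candidates(g: nx.DiGraph, degraded: str) -> list[str]:
    """Everything the degraded service depends on, directly or transitively,
    i.e. the first places to look for a root cause."""
    return sorted(nx.descendants(g, degraded))

print(upstream_candidates(graph, "orders-api"))
# ['dynamodb', 'payments-service']
```

Traversal’s actual knowledge graph goes well beyond this structural view, layering the causal statistical techniques and LLM reasoning described above on top of it; the sketch only shows why an explicit dependency graph helps narrow the search space during an incident.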
Traversal has allowed Eventbrite engineers to run parallel searches across more than 50 dashboards and microservices, eliminating the need to manually piece together context from different parts of the system.
These efficiency gains already translate to an estimated ~$740K in annual savings from avoided downtime and roughly 2,400 hours of engineering time saved per year, while also reducing on-call stress.
Across incidents, Traversal has accurately identified root causes more than 70% of the time, with the remainder largely attributable to integration or access constraints. Even in those cases, Traversal narrowed the search space, giving engineers a faster path to resolution.
“Traversal is a 24/7 expert AI SRE companion — it learns deep system context across all of our complex and mature microservices architecture. When an incident occurs, Traversal autonomously surfaces the blast radius, the key bottleneck services, and candidate root causes with supporting evidence within minutes, which would have otherwise taken our engineers hours to find and escalate.”
- Luca Valtulina, Staff SRE, Eventbrite
Between 22:10 UTC and 01:23 UTC on July 3, Eventbrite’s order-processing APIs experienced elevated latencies and timeouts. An on-call engineer noted the issue at 01:23 UTC and raised it in a Slack channel with ~40 colleagues. At the same time, the engineer engaged Traversal. By 01:26 UTC, Traversal surfaced the root cause — a DynamoDB service degradation — with supporting evidence tracing increased DynamoDB latency through to downstream slowdowns and upstream failures. The engineer quickly validated this finding against the AWS status page. In this single incident, Traversal saved the on-call engineer roughly 45 minutes of manual investigation — not including time that the other engineers in the Slack channel might have spent if they’d been pulled in to provide context or tribal knowledge.