The AI SRE Landscape

Blog

TABLE OF CONTENTS

After hundreds of interviews with engineers across vastly different verticals, observability stacks, and engineering team sizes, we have found that people often think about AI SRE solutions along two axes: (i) the complexity of the troubleshooting workflow they tackle, and (ii) the scope of data searched.

‍
The Current Focus: Simple, High Volume Alerts

Most current AI solutions have squarely focused on dealing with simple troubleshooting workflows, i.e., routine, noisy alerts that occur frequently. These are the types of alerts that companies often have pre-existing runbooks for, with procedures like:

Check for a service-specific dashboard.
Correlate them with recent error logs.
Correlate both with recent deployments.

Even though it’s simple and the monetary value of such troubleshooting is low, it requires toil from on-call engineers due to the large volume of such alerts. The process kills productivity with paper cuts. LLMs and AI agents can automate a lot of this toil without too much innovation required on the AI side (e.g., a straightforward ReAct agent architecture works). This is where there has been a lot of activity from incumbent observability platforms, startups, and even companies building in-house.

Why Incumbents Will Fall Short

AI solutions from incumbent platforms are easy to set up, but they only provide insight on the data that is stored on their platform. There is an opportunity for AI agents to execute more complicated runbooks today that require interacting with disparate observability data sources and avoiding vendor lock-in, which we believe is ready for market and likely to be taken on by startups or those building in-house.

The Real Opportunity: Cross-Platform, High-Impact Incidents

There is significantly more value to be gained by having AI agents troubleshoot the most complex, hairy, and painful incidents. These are the incidents that can take days to solve and disrupt multiple teams. We’ve seen 50-100+ engineers huddled in incident war rooms, with tens of thousands of customers affected, and it draws the attention and ire of leadership and users.

Such incidents are, by definition, extremely difficult to troubleshoot, and LLM-powered runbook automation alone does not suffice. They are some of the hardest problems for AI agents to automate. Every incident is a “snowflake” and it doesn’t neatly fit into a routine workflow, requiring traversing through petabytes of data across disparate parts of a lengthy observability stack; this same fragmentation of observability data is why we believe incumbents’ AI SRE solutions will not be able to handle complex incidents.

Our Take: What Matters Most

At Traversal, this is why we focus on helping engineering organizations deal with real production incidents—customers have repeatedly emphasized this is where the most value is in terms of downtime, reputational damage, and developer productivity. It’s a hard problem. We’ve made major technical leaps with our AI agent architecture to get here, finding ways to get swarms of distributed parallel agents to execute complicated statistical tests over petabytes of heterogeneous data, all while respecting query guardrails for system health.

For us, handling alerts is table stakes, but the specialization and advantage of our technology is how it runs on the real, severe, and complicated incidents that hit teams the hardest.

A Framework That Applies Beyond AI SRE

More broadly, we believe this framework — single vs. cross-platform, and complexity of workflow an AI agent is expected to handle — applies well beyond this domain. We see the same pattern emerging in security, product analytics, networking, and beyond.

‍

If this framework resonates, check out our demo to see how Traversal approaches it in practice. You can also read our evaluation guide for a deeper look at what matters when choosing an AI SRE solution.

Learn More