The Determinism Trap

Matt Schoenbauer

The future of enterprise AI isn’t making models give the same answer every time. It’s learning when to use judgment, when to use software, and how to make the two work together.

Mark Cuban wrote on X:

I'm coming to the conclusion that the biggest challenge for Enterprise AI, and AI in general, as of now, is that it's still impossible to make sure that everyone gets the same answer to the same question, every time.

Am I wrong?

Yes, Mark. You're wrong.

Stochasticity is not the problem

Yes, LLMs are technically non-deterministic. Same prompt, same parameters, different outputs. But the reason is surprisingly mundane.

Mira Murati's Thinking Machines Lab published a deep dive on this. The root cause isn't some fundamental property of neural networks, or even the token sampling methods at the final stage of inference; it's a lack of batch invariance in GPU kernels. Dynamic batching means your request gets grouped with other active requests, and because floating-point arithmetic is not associative, different batch sizes produce slightly different numerical results that cascade into different outputs. Thinking Machines Lab fixed these issues and “defeated” nondeterminism in LLMs.
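To make the floating-point part concrete, here is a minimal Python sketch. It is not the GPU kernel code that the Thinking Machines post analyzes: `reduce_in_chunks` is a hypothetical stand-in for a reduction whose grouping depends on batch size, and the input values are picked to exaggerate the effect.

```python
# Floating-point addition is not associative: grouping changes the rounding.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False


def reduce_in_chunks(values, chunk_size):
    """Add values chunk by chunk, then add the partial sums.

    Hypothetical stand-in for a kernel whose reduction order
    depends on how the work was batched.
    """
    total = 0.0
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for value in values[start:start + chunk_size]:
            partial += value
        total += partial
    return total


# Same numbers, different grouping, different answer.
values = [1e16, 1.0, -1e16, 1.0]
print(reduce_in_chunks(values, 4))  # 1.0
print(reduce_in_chunks(values, 2))  # 0.0
```

Neither result matches the exact sum of 2.0. What matters for the argument is that the two groupings disagree with each other, which is the kind of discrepancy that batch size can introduce.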

Cool technical achievement. But I'd argue it doesn't really matter.

The real issue is unpredictability

What people are actually struggling with isn't that the same input gives different outputs. It's that slight variations in input produce large variations in output.

When you're using an LLM for a real problem—investigating an incident, triaging an alert, analyzing a document—you're almost never seeing the same input twice. The inputs are messy, contextual, and always a little different. Change a few words in a prompt, add or remove a tool parameter, and the model might go in a completely different direction.

This is what makes people uncomfortable. And they call it "stochasticity" because that's the technical-sounding word that feels right. But what they're describing is unpredictability, and making LLMs deterministic doesn't fix it.

This is actually fine

Here's the thing people miss when they complain about unpredictability: the standard they're holding LLMs to is the standard of human activity. And human activity isn't predictable either.

Ask two engineers to investigate the same production incident. They'll look at different data, take different paths, and write different summaries. We don't call them "unreliable" for this. We call it judgment.

LLMs operate in the same space. The tasks where they're valuable (tasks that require interpretation, synthesis, reasoning about novel situations) are inherently tasks where there isn't a single correct output. Unpredictability isn't a bug here. It's the nature of the work.

The real distinction: procedural vs. cognitive work

This brings me to the thing I think the industry hasn't fully thought through.

Every task in a business falls into one of two categories:

	Procedural work	Cognitive work
Inputs	Known, structured	Novel, contextual
Outputs	Specified, verifiable	Contextual, judgment-based
Right tool	Traditional software	LLMs
Examples	Parse a date, transform data, enforce a schema	Investigate an incident, triage an alert, summarize a doc
Cost	Cheap	Expensive

If you have a known, structured set of inputs and a fixed expectation for the outputs, you should be using traditional software. Full stop. That's what software does. It's been doing it reliably for decades.

If you have a task where the inputs are always different and the "right" output requires judgment, that's cognitive work, and that's where LLMs belong.

The mistake is trying to use LLMs for procedural work and then being surprised when the results aren't deterministic. Even a perfectly deterministic LLM would be the wrong tool for that job.
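As a rough illustration of the procedural column, two of the table's examples (parse a date, enforce a schema) fit in a few lines of ordinary Python. The function names and the `ALERT_SCHEMA` shape are assumptions made up for this sketch; only the standard `datetime` module is real.

```python
from datetime import date, datetime


def parse_iso_date(text: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError if it does not match."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def enforce_schema(record: dict, schema: dict) -> dict:
    """Check that every field in `schema` is present with the expected type."""
    for field, expected_type in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return record


# Illustrative schema for an alert record (an assumption for this sketch).
ALERT_SCHEMA = {"service": str, "opened": date, "severity": int}

alert = enforce_schema(
    {"service": "checkout", "opened": parse_iso_date("2025-11-03"), "severity": 2},
    ALERT_SCHEMA,
)
print(alert["opened"].isoformat())
```

Given the same input, this code returns the same result or raises the same error on every run, and no prompt is involved.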

The interplay that makes it all work

LLMs are very good at producing software. And software is very good at invoking LLMs. These aren't competing paradigms; they're complementary. And the real unlock is in how they compose.

Right now when we want an AI agent to do something procedural—calculate a number, parse a format, enforce a rule—we mostly just ask the LLM to do it inline. Sometimes it works, sometimes it doesn't, and then people say "see, LLMs are unreliable."

The better pattern: have the LLM write code to do the procedural thing. Compile and execute the code. The LLM handles the cognitive work: understanding context, figuring out what needs to happen. The software handles the procedural work: doing it correctly every time.
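A minimal sketch of that pattern in Python, under loud assumptions: `stub_llm_write_code` is a hypothetical stand-in that returns fixed source instead of calling a model, and `exec` is used only to keep the example self-contained. It is not a sandbox, and a real system would run generated code in an isolated environment.

```python
from typing import Callable


def stub_llm_write_code(task: str) -> str:
    """Hypothetical stand-in for a model call; returns fixed Python source."""
    return (
        "def total_cents(items):\n"
        "    return sum(i['qty'] * i['unit_cents'] for i in items)\n"
    )


def load_generated(source: str, name: str) -> Callable:
    """Compile generated source and return the function called `name`."""
    namespace: dict = {}
    code = compile(source, "<generated>", "exec")
    # exec() is NOT a sandbox; a real system would isolate this step.
    exec(code, namespace)
    return namespace[name]


# Cognitive step (stubbed): decide what code the task needs.
task = "Total these invoice line items in cents."
total_cents = load_generated(stub_llm_write_code(task), "total_cents")

# Procedural step: plain software does the arithmetic.
items = [{"qty": 2, "unit_cents": 1250}, {"qty": 1, "unit_cents": 399}]
print(total_cents(items))  # 2899
```

Once the function exists, the arithmetic no longer depends on the model getting a sum right inline.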

This isn't theoretical. Over a six-month window in late 2025 and early 2026, two of the largest AI labs independently rebuilt their agent APIs around exactly this pattern:

- Anthropic shipped programmatic tool calling in November 2025: the model writes Python that runs in a sandbox, calls tools, and returns only the synthesized output to the LLM. Benchmarks showed 20% higher task success and 37% fewer tokens versus traditional JSON tool calling.
- OpenAI shipped a model-native harness in April 2026, with a shell tool, native sandbox execution, and a built-in agent loop.

Two labs, two sandboxes—same architecture. The atomic unit of agent action stops being "one tool call" and becomes "a code block that orchestrates many." Procedural work runs in the sandbox. Cognitive work runs in the model. The API surface itself has been redrawn around the division.
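The shift from "one tool call" to "a code block that orchestrates many" can be sketched as follows. This does not use either lab's actual API: `fetch_error_rate`, the canned `TOOL_DATA`, and `orchestrate` are all hypothetical names invented for the illustration, with `orchestrate` standing in for a script the model might write.

```python
# Canned metrics standing in for a real backend (an assumption of this sketch).
TOOL_DATA = {"checkout": 0.031, "search": 0.002, "auth": 0.048, "cart": 0.004}


def fetch_error_rate(service: str) -> float:
    """Hypothetical tool; a real one would query a metrics system."""
    return TOOL_DATA[service]


def orchestrate(services, threshold):
    """Stand-in for a model-written script: many tool calls, one summary."""
    rates = {service: fetch_error_rate(service) for service in services}
    offenders = sorted(
        (service for service, rate in rates.items() if rate > threshold),
        key=rates.get,
        reverse=True,
    )
    # Only this small summary goes back to the model, not every raw result.
    return {"checked": len(rates), "over_threshold": offenders}


summary = orchestrate(list(TOOL_DATA), threshold=0.01)
print(summary)  # {'checked': 4, 'over_threshold': ['auth', 'checkout']}
```

Four tool calls happen inside the script, but the model sees a single compact result. That is one plausible reason a design like this would use fewer tokens than feeding each call's output back through the model.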

But the runtime alone doesn't close the loop.

Three problems to solve

In both implementations, the script the LLM writes is ephemeral. It's generated, executed, discarded. The next time the same procedural workflow comes up, the LLM writes the same code again—re-deriving the orchestration from scratch every invocation.

That isn't really procedural-vs-cognitive division of labor. It's the LLM doing both, with the procedural part briefly outsourced to a sandbox. The cognitive work of figuring out the procedure is being done over and over.

There are three problems one needs to solve to close the loop:

Persistence. AI-generated code needs to graduate from ephemeral script to durable procedural work. At the moment we live in a world of two extremes:

- Agents writing code in tool-calling loops to meet user requests and then immediately throwing it away.
- Slow software development lifecycles that were designed for humans but fail to let agents truly shine.

We need a solution that lets agents build up a corpus of procedural capabilities over time without keeping humans heavily in the loop.
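One very small version of persistence might look like the sketch below. Everything in it is an assumption for illustration: `ProcedureStore` is a hypothetical class, `stub_llm_write_code` replaces a real model call, and a production system would add review, versioning, and isolation that are omitted here.

```python
import tempfile
from pathlib import Path


class ProcedureStore:
    """Hypothetical store that keeps generated code on disk for reuse."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_create(self, name, task, write_code):
        path = self.root / f"{name}.py"
        if not path.exists():
            # Cognitive work happens once, the first time the task appears.
            path.write_text(write_code(task))
        namespace: dict = {}
        # exec() is not a sandbox; isolation is out of scope for this sketch.
        exec(compile(path.read_text(), str(path), "exec"), namespace)
        return namespace[name]


model_calls = []


def stub_llm_write_code(task: str) -> str:
    """Stand-in for a model call; records each invocation."""
    model_calls.append(task)
    return "def to_celsius(f):\n    return (f - 32) * 5 / 9\n"


store = ProcedureStore(Path(tempfile.mkdtemp()))
first = store.get_or_create("to_celsius", "Convert F to C", stub_llm_write_code)
second = store.get_or_create("to_celsius", "Convert F to C", stub_llm_write_code)

print(first(212), second(32), len(model_calls))  # 100.0 0.0 1
```

The second request reuses the stored procedure, so the model is consulted once rather than on every invocation.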

User experience. This is Mark Cuban's actual problem, dressed up as a determinism complaint. He's using AI and expecting procedural outputs: the same answer to the same question. The fix isn't making LLMs deterministic; it's building systems where users don't need to know or care whether their answer came from procedural code or cognitive reasoning. Give them what they want. Route the work behind the scenes.
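Routing behind the scenes could be sketched like this. The request `kind` labels, the `PROCEDURES` table, and `stub_llm_answer` are hypothetical, and how a real system would classify a request (possibly with a model) is left out.

```python
from typing import Callable


def invoice_total(params: dict) -> str:
    """Procedural: same inputs always give the same answer."""
    return f"{sum(params['amounts_cents']) / 100:.2f}"


# Hypothetical table of persisted procedures, keyed by request kind.
PROCEDURES: dict[str, Callable[[dict], str]] = {"invoice_total": invoice_total}


def stub_llm_answer(question: str) -> str:
    """Stand-in for a model call on judgment-heavy questions."""
    return f"[model-written answer to: {question}]"


def handle(kind: str, question: str, params: dict) -> str:
    """Route to code when a procedure exists, otherwise to the model."""
    procedure = PROCEDURES.get(kind)
    if procedure is not None:
        return procedure(params)
    return stub_llm_answer(question)


params = {"amounts_cents": [1250, 399]}
print(handle("invoice_total", "What do we owe?", params))  # 16.49
print(handle("incident_summary", "Why did checkout fail?", {}))
```

The caller uses one entry point either way. Questions that have a procedural answer get the same result every time, and everything else falls through to the model.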

Observability. Once procedural code is persisted and runs autonomously, you need to know whether it's still doing the right thing. The systems it depends on shift; edge cases emerge; schemas change. Persisted procedural work needs a monitoring story: is this still producing reasonable outputs, or does the cognitive layer need to step back in?
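A monitoring story could start as small as the sketch below. `MonitoredProcedure`, its thresholds, and the log-line formats are assumptions invented for the example, not an established tool.

```python
from collections import deque


class MonitoredProcedure:
    """Wrap a persisted procedure and track whether outputs look reasonable."""

    def __init__(self, func, is_reasonable, window=20, max_failure_rate=0.2):
        self.func = func
        self.is_reasonable = is_reasonable
        self.outcomes = deque(maxlen=window)  # rolling record of True/False
        self.max_failure_rate = max_failure_rate

    def __call__(self, *args):
        try:
            result = self.func(*args)
            ok = self.is_reasonable(result)
        except Exception:
            result, ok = None, False
        self.outcomes.append(ok)
        return result

    @property
    def needs_review(self) -> bool:
        """True when recent failures suggest the cognitive layer should step in."""
        if not self.outcomes:
            return False
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) > self.max_failure_rate


def parse_latency_ms(line: str) -> int:
    """Persisted procedure written against the old 'latency=120ms' format."""
    return int(line.split("latency=")[1].rstrip("ms"))


monitored = MonitoredProcedure(
    parse_latency_ms, lambda ms: 0 <= ms < 60_000, window=10
)

for line in ["latency=120ms"] * 5:
    monitored(line)
healthy = monitored.needs_review  # False: every call parsed cleanly

# The upstream format changes; the old parser now raises on every line.
for line in ["latency_ms=120"] * 5:
    monitored(line)
drifted = monitored.needs_review  # True: half of the recent calls failed

print(healthy, drifted)
```

When `needs_review` flips to true, a system could hand the procedure back to the model for regeneration instead of letting it fail silently.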

Two labs built the runtime. The interesting work is solving these three problems on top of it.

The most valuable human skills

We are now seeing two common expectations about the future of work that conflict on close examination.

1. Sharp increases in layoffs of software engineers at large companies leave us asking what the future holds for today’s engineers.
2. Many expect that most knowledge work done by humans today will eventually be accomplished by AI agents.

We are right to question the future of software engineering work, since the cognitive work of taking product requirements and converting them to procedural software programs is bread and butter for AI agents.

But to make (2) a reality, we can never rely on agents alone. We will also have to rely on the unprecedented mountains of software that they will produce and maintain to accomplish procedural work. As we make this transition, the most valuable operators will be those who can understand and correctly manage agents, software, and their interplay.

Behind us is a world of software engineering that is never coming back.

Today, we are confused by our misunderstanding of the different modes of work and grappling with the thought of a future of massive AI-driven productivity.

The coming years belong to those who understand the cognitive work that agents do and the procedural work that software does, and who can leverage the interactions between the two to bring us to that future.

Matt Schoenbauer is a Founding Engineer at Traversal. Learn more about Traversal by booking a demo.
