Claude Code Dynamic Workflow: Where the Determinism Boundary Lies

If you have worked on RAG systems, you have probably hesitated between two designs.

One is traditional RAG: the user asks a question, the program turns it into a query, searches the database, feeds the results to the model, and generates an answer. The flow is fixed: search first, then answer. There is no question of “should I search” or “how many times should I search.” The benefit of this approach is stability, predictability, and debuggability. The problem is a hard ceiling: the system can only handle as many situations as the flow designer can foresee. If the first search result is not good enough, it will not try a different keyword on its own.
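The fixed flow can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `fixed_rag`, `toy_search`, and `toy_generate` are hypothetical names, and the search and generation steps are stand-ins for a real retriever and a real model call.

```python
from typing import Callable, List


def fixed_rag(
    question: str,
    search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Process-deterministic RAG: one query, one search, one answer."""
    query = question.strip().lower()      # step 1: turn the question into a query
    documents = search(query)             # step 2: exactly one search, no retry
    return generate(question, documents)  # step 3: answer from whatever came back


# Hypothetical stand-ins for a real retriever and a real model call.
def toy_search(query: str) -> List[str]:
    corpus = {"what is bun": ["Bun is a JavaScript runtime."]}
    return corpus.get(query, [])


def toy_generate(question: str, documents: List[str]) -> str:
    return documents[0] if documents else "no answer found"
```

Calling `fixed_rag("Tell me about Bun", toy_search, toy_generate)` misses the toy corpus, and the function answers from an empty result. Nothing in the flow can rephrase the query and try again.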

The other is agentic RAG: you hand the search tool to an agent and let it decide when to search, what to search, how many times to search, and whether to follow up. This is much more flexible, but the agent now has to keep both the search strategy and the search results in its head at the same time. That is fine for simple tasks. Ask it to run a multi-round, multi-angle, cross-validating deep investigation, though, and the search strategy and intermediate results end up crammed into the same context window, and the agent starts losing information. Earlier judgments are forgotten, and later decisions lose their baseline.
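A toy model makes the crowding visible. The sketch below is an illustration under loud assumptions: the context window is modeled as a bounded queue that drops its oldest entry when full, which is much cruder than how a real model loses information, and `agentic_rag` and `scripted_policy` are hypothetical names standing in for an agent loop and a model.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional

Policy = Callable[[List[str]], Optional[str]]  # returns the next query, or None to stop


def agentic_rag(
    question: str,
    decide: Policy,
    search: Callable[[str], List[str]],
    window: int = 4,
    max_steps: int = 10,
) -> List[str]:
    """The agent picks every search; plans and results share one bounded context."""
    # Toy context window: when it is full, the oldest entry is dropped.
    context: Deque[str] = deque([f"question: {question}"], maxlen=window)
    for _ in range(max_steps):
        query = decide(list(context))
        if query is None:  # the agent decides it has searched enough
            break
        context.append(f"plan: search {query!r}")
        context.extend(f"result: {hit}" for hit in search(query))
    return list(context)


def scripted_policy(queries: Iterable[str]) -> Policy:
    """Hypothetical stand-in for a model: replays a fixed list of queries."""
    remaining = iter(queries)
    return lambda context: next(remaining, None)


final_context = agentic_rag(
    "compare A, B and C",
    scripted_policy(["a", "b", "c"]),
    search=lambda query: [f"hit for {query}"],
)
```

After three searches, `final_context` holds only the plans and results for the last two queries. The original question and everything learned from the first search have been pushed out of the four-entry window.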

The tension between these two options is not about which one is better. They represent two different sources of determinism. Traditional RAG draws its confidence from the process: every step is spelled out clearly in code. Agentic RAG draws its confidence from the result: I do not prescribe the process, but I do prescribe what the end state looks like, and the agent decides how to get there.

This is not a problem unique to RAG. The entire field of agentic systems faces the same choice: what should be locked down in code (process determinism), and what should be left to the agent’s discretion (result determinism). On May 28, Anthropic released Claude Code dynamic workflow (official docs, blog), offering a concrete answer to this question.

More Than Orchestration

First, let’s be clear about what this feature is. A dynamic workflow is a script (officially described as JavaScript) that Claude generates on the fly based on your request and that an independent runtime executes in the background. The script handles control flow: what to do first, what to do next, under what conditions to parallelize, under what conditions to branch. But the script itself does not do the work; it only spawns subagents. Each subagent runs in its own context window, receives a specific task, completes it, and returns the result to the script.

The key difference from a regular Claude Code session: in subagent mode, the answer to “who decides what to do next” is Claude, and all intermediate results live in Claude’s context. In a workflow, the answer is “the script,” and intermediate results live in the script’s variables. Claude is no longer the orchestrator. The orchestrator is a piece of code.

Here is an example. Given a question for deep research, the script can split the search into multiple parallel angles, fetch sources, have different agents cross-validate each other’s claims, and return only the verified results.
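The shape of such a script might look like the following. The real workflow scripts are described as JavaScript and their API is not shown here, so this is a Python analogy only: `deep_research`, the `spawn` interface, and `toy_subagent` are all hypothetical, and the "subagent" is a stub rather than a model call.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical subagent interface: a task string in, a text result out.
Subagent = Callable[[str], Awaitable[str]]


async def deep_research(question: str, angles: List[str], spawn: Subagent) -> List[str]:
    # Fan out: one subagent per angle, all running concurrently.
    claims = await asyncio.gather(
        *(spawn(f"Research '{question}' from the angle: {angle}") for angle in angles)
    )
    # Cross-validate: a separate subagent call checks each claim.
    verdicts = await asyncio.gather(
        *(spawn(f"Verify this claim, answer yes or no: {claim}") for claim in claims)
    )
    # Intermediate results live in script variables, not in any agent's context.
    return [claim for claim, verdict in zip(claims, verdicts) if verdict == "yes"]


async def toy_subagent(task: str) -> str:
    """Hypothetical stand-in for a real subagent call."""
    if task.startswith("Verify"):
        return "no" if "rumor" in task else "yes"
    return f"claim about {task.rsplit(': ', 1)[1]}"


verified = asyncio.run(
    deep_research("Is X safe?", ["benchmarks", "rumor"], toy_subagent)
)
```

The ordering (research, then verify, then filter) is fixed by the script. What each subagent does inside its own call is not.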

But the most illustrative case is not this demo. Jarred Sumner used dynamic workflow to migrate Bun’s runtime from Zig to Rust: roughly 750,000 lines of code, eleven days. The first workflow analyzed the lifetime mappings of all structs. The second workflow wrote all .rs files in parallel (each file with two reviewers for validation). The third workflow looped over build and test until they passed. The fourth workflow optimized unnecessary data copies.
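The third phase, looping over build and test until they pass, is the kind of control flow a script holds well. The sketch below is not the migration's actual script, whose details are not given here. `loop_until_green` is a hypothetical helper, `run_checks` and `fix` stand in for a build-and-test run and a subagent repair step, and the `max_rounds` cap is an added assumption to keep the loop bounded.

```python
from typing import Callable, List


def loop_until_green(
    run_checks: Callable[[], List[str]],
    fix: Callable[[List[str]], None],
    max_rounds: int = 20,
) -> int:
    """Repeat check-and-fix until no failures remain; return the rounds used."""
    for round_number in range(1, max_rounds + 1):
        failures = run_checks()  # e.g. build and test; returns the failing items
        if not failures:
            return round_number
        fix(failures)            # hand the failures to a subagent to repair
    raise RuntimeError(f"still failing after {max_rounds} rounds")


# Toy run: each round "repairs" the first failing file.
broken = ["a.rs", "b.rs"]
rounds = loop_until_green(
    run_checks=lambda: list(broken),
    fix=lambda failures: broken.remove(failures[0]),
)
```

The loop counter and the list of failures belong to the script, so the loop cannot lose track of which round it is on no matter how long it runs.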

Why it was designed this way is more interesting than what it did.

Where the Boundary Lies

Set dynamic workflows beside a regular Claude Code session and you can see that Anthropic has employed both types of determinism in a single product, drawing the line at a specific place.

Control flow is process-deterministic. The JS script dictates every step: what to do, in what order, when to parallelize, when to branch, when to validate. Every run of the same script applies the same ordering and branching rules; none of them depend on what an agent happens to remember. This layer needs to be deterministic because agents lose information over long runs. After dozens of steps, early judgments get pushed to the edge of the context window, and the agent forgets what step it is on or what to check next. The script solves this problem: it does not forget. It is an external, persistent stepper.

The execution layer is result-deterministic. Each subagent, given its task from the script, decides autonomously within its own context window how to proceed: which tools to use, what keywords to search, how many rounds to go, when to stop. The script tells it only “what to produce” and “how to validate,” not “how to do it.” This layer needs to be flexible because the execution path of any single task cannot be exhaustively predetermined. If the agent encounters an unexpected data format, it needs to adjust on its own. If the first search result is wrong, it needs to try a different term.
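One way to picture that interface is as a small data structure with a goal and a check, and no field for method. This is an assumption about how one might model it, not the product's actual task format; `TaskSpec` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """What the script hands to a subagent: a goal and a check, no method."""
    produce: str   # what the subagent must return
    validate: str  # how the result will be judged

    def to_prompt(self) -> str:
        return (
            f"Produce: {self.produce}\n"
            f"It will be validated by: {self.validate}\n"
            "Choose your own tools, queries, and number of rounds."
        )


spec = TaskSpec(
    produce="a list of primary sources on the question",
    validate="each source must be independent of the others",
)
prompt = spec.to_prompt()
```

There is deliberately no `steps` field: anything that would belong there is the subagent's decision.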

The validation layer splits the two. The script prescribes the validation steps (cross-checking, voting, reporting only verified claims), so the protocol is process-deterministic. But the judgment of “is this claim reliable” is expressed in natural language and applied autonomously by the agent, so the standard is result-deterministic. This is fundamentally different from TDD. In TDD, “correct” means the test suite passes: the standard is encoded as an executable program, leaving no room for the agent to judge. In a dynamic workflow, the standard is “is this source independent” or “does this reasoning have gaps,” which are things the agent must judge for itself. Anthropic did not try to encode judgment into code. Instead, it replaced a single agent’s self-confirmation with consensus from multiple independent agents. At its core, this replaces a single continuous chain of reasoning with multiple independent samples, so that validation errors are no longer drawn from the same source.
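The split between protocol and judgment can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `consensus` and the `Reviewer` interface are hypothetical, and the three lambda "reviewers" are crude stubs standing in for independent agents that would each read the natural-language criterion and judge for themselves.

```python
from typing import Callable, List

# Hypothetical reviewer interface: (criterion, claim) -> True if the claim passes.
Reviewer = Callable[[str, str], bool]


def consensus(claim: str, criterion: str, reviewers: List[Reviewer]) -> bool:
    """The protocol is code; each vote is a reviewer's own judgment."""
    votes = [review(criterion, claim) for review in reviewers]
    return sum(votes) * 2 > len(votes)  # strict majority


criterion = "Is this claim backed by an independent source?"
# Stub reviewers: real ones would be separate agents interpreting the criterion.
reviewers: List[Reviewer] = [
    lambda _criterion, claim: "source" in claim,
    lambda _criterion, claim: len(claim) > 20,
    lambda _criterion, claim: not claim.endswith("?"),
]
accepted = consensus("Bun is written in Zig (source: repo)", criterion, reviewers)
rejected = consensus("trust me?", criterion, reviewers)
```

The counting rule is the only part the script decides. The criterion is passed through as plain text and never turned into an executable check.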

More Than an Engineering Decision

Where this boundary is drawn answers a fundamental question: in an agentic system, what should be given to code, and what should be given to the agent?

The answer: take the things agents are bad at out of the agent’s hands. Specifically, agents are bad at long-term state management (they forget), so control flow should be written in the script, not held in the agent’s head. Agents are also bad at self-validation: errors within the same reasoning chain are correlated, so checking your own work is as good as not checking at all. That is why validation should be designed as multi-agent cross-checking, not single-agent self-review.
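A back-of-the-envelope calculation shows why correlation matters. The numbers below are illustrative assumptions, not measurements: each reviewer is taken to be wrong with probability 0.2, and "independent" is idealized, since agents that share a model are unlikely to be fully independent in practice. `majority_error` is a hypothetical helper.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a strict majority of n independent reviewers is wrong,
    when each one is wrong with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


single = majority_error(0.2, 1)  # one reviewer checking alone
three = majority_error(0.2, 3)   # three independent reviewers, majority vote
correlated = 0.2                 # three perfectly correlated reviewers err together
```

Under these assumptions, three independent reviewers cut the error from 0.2 to about 0.104, while three perfectly correlated reviewers stay at 0.2. Voting only helps to the extent that the votes fail for different reasons.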

This is not a question of “code is better than agents,” nor is it a matter of philosophical stance.

Zooming out, the industry seems to fall into two extremes when dealing with agent reliability. One extreme is “encode everything in code” – since agents are unreliable, go back to code. TDD is a canonical example: you put the agent through a TDD workflow (write tests first, then implementation, run tests, fix until green), with code-level checks at every step ensuring the output is correct. But this path has its own ceiling. A recent study identified a “TDD paradox”: when agents were strictly required to follow the TDD process but were not told which tests were relevant to the current change, their regression rate actually increased from 6% to 10%. Pure process constraints, when information is insufficient, can be worse than no constraints at all. The other extreme is “make the agent stronger” – larger context windows, better models, more tools. The bottleneck here is equally structural: no matter how strong the model, it cannot solve the “checking your own work” problem.

Anthropic’s dynamic workflow sits in a middle position. It is not “code replaces agents” nor “make agents do more.” It breaks the problem down: what things does code do better (control flow does not forget, state does not drift), what things do agents do better (exploring within fuzzy boundaries, dynamically adjusting strategy based on intermediate results), and then assigns each to the most appropriate mechanism. Control flow goes to the script, execution goes to the agent, validation goes to multi-agent cross-checking. The relationship between these three is not substitution – it is division of labor.

If this judgment holds, the demands on a builder’s time and skills shift accordingly. Before, you spent a lot of time on prompt engineering: tuning context structure, adding checkpoints, writing longer instructions to compensate for context the agent lost. After, you spend your time designing system boundaries: is the agent good at this? If not, what replaces it? What does the interface between the replacement mechanism and the agent look like? Energy shifts from prompt engineering to system design.

Limited, But Not a Flaw

Of course, this design has its boundaries.

The hardest part has not changed: defining acceptance criteria. The script can precisely tell the agent “do X now,” but how well X needs to be done still has to be written in natural language. These natural language criteria are still interpreted and executed by the same agent. The independence problem in validation is solved at the control flow layer, but it persists at the standard-definition layer. The agent may strictly follow every step of the process, and every criterion may “pass,” but the overall result may not be what you wanted, because the criteria themselves were written wrong.

The workflow script itself is also generated by Claude. If Claude writes a flawed script (cross-validation logic too weak, branching conditions too vague), that script will reliably execute a flawed plan.

There is another limitation: the script cannot adjust its strategy mid-execution beyond the branches written into it in advance. If a finding in the first phase suggests that the search direction for the second phase needs to be completely changed, the script cannot handle that. Currently, dynamic workflows are best suited for tasks where “the plan can be clearly written in advance”: code audits, large-scale migrations, multi-angle cross-validating research. If the task itself requires exploratory, adaptive planning, a dynamic workflow is not enough.

But these limitations do not change a more fundamental signal: Claude Code dynamic workflow is the first major agent platform to turn “at which layer should process determinism and result determinism be applied” into a product feature. The line it draws (control flow in code, execution in agents, validation in multi-agent cross-checking) may not be optimal, but it transforms this question from a philosophical discussion into a concrete design that can be debated, compared, and improved.

Next time you build an agentic system and the agent forgets its plan after dozens of steps, or confirms a wrong conclusion to itself, do not rush to swap the model. Ask yourself: does this fall into the category of things agents are fundamentally not good at? If so, could the answer be not to make the agent better, but to use a different mechanism instead?