Your Pipeline Is Laundering: The Real Danger in Multi-Agent Systems Isn't Agent Mistakes

At the start of 2026, the AI engineering world is squeezed between two opposing forces. On one side, Kimi K2’s Agent Swarm spawned 300 agents collaborating across 4,000 steps, pushing BrowseComp from 78.4% to 86.3% on Moonshot’s published benchmarks; Claude Code’s Dynamic Workflow spent 11 days porting Bun from Zig to Rust, 750,000 lines of code, 99.8% of tests passing. On the other side, the MAST taxonomy analyzed over 1,600 execution traces across seven major frameworks and found that 78.7% of failures stem from specification issues and coordination problems; Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027.

Both sets of evidence hold simultaneously. The contradiction is not about whether multi-agent systems work. It is about why the same pattern delivers stunning results in some places and fails to guarantee even basic reliability in others. The answer lies not in the quantity or quality of the agents, but in a more hidden mechanism.

Your Pipeline Is Laundering

This is not about money laundering in the criminal sense. It describes a mechanism: a mistaken assumption passes through multiple layers of agent processing, and no step intercepts it. Instead, each step makes it look more credible. The more processing steps the output goes through, the more it appears to have undergone rigorous scrutiny. Money laundering has the same structure: after enough transactions, dirty cash looks like legitimate income.

Consider a concrete bug. Anthropic’s April postmortem documented a defect that passed all automated checks, human code review, unit tests, end-to-end tests, and dogfooding, and took over a week to discover. The original text reads: “made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” The agent that wrote the code made a mistaken assumption, and that assumption went through at least five processing stages: the agent writing it, the agent reviewing it, a human reviewing it, automated testing, and end-to-end testing. All passed. None caught it.

Mistaken assumptions surviving review chains is something both single-agent and multi-agent systems encounter. But multi-agent systems make the problem qualitatively worse. In a single-agent scenario, a mistaken assumption hides inside one output. A human reviewer, if sufficiently familiar with the domain, might spot the crack between the assumption and the facts it contradicts. In a multi-agent scenario, the mistaken assumption does not merely hide inside one output: other agents receive that output and build on top of it. Agent 2 writes a middleware based on the mistaken assumption. Agent 3 writes tests against the middleware’s interface. Agent 4 fills in the documentation. By the time a human reviews it, what they see is an internally consistent system: middleware, tests, documentation, all corroborating each other. They corroborate each other not because the assumption itself is correct, but because they all trace back to the same flawed root. This is not a defect hidden inside an output. It is a self-consistent, test-passing, well-documented house of cards.
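One way to make the "same flawed root" point concrete is to track provenance. The sketch below is a minimal illustration, not something any of the systems discussed here is known to implement: the `Artifact` schema and the helper names are hypothetical. It counts how many distinct underived roots stand behind a set of artifacts that appear to corroborate each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One pipeline output plus the artifacts it was derived from (hypothetical schema)."""
    name: str
    derived_from: tuple["Artifact", ...] = ()


def root_assumptions(artifact: Artifact) -> set[str]:
    """Return the names of the underived artifacts this one ultimately rests on."""
    if not artifact.derived_from:
        return {artifact.name}
    roots: set[str] = set()
    for parent in artifact.derived_from:
        roots |= root_assumptions(parent)
    return roots


def independent_support(artifacts: list[Artifact]) -> int:
    """Count the distinct roots behind a set of mutually corroborating artifacts."""
    roots: set[str] = set()
    for artifact in artifacts:
        roots |= root_assumptions(artifact)
    return len(roots)


# The chain from the paragraph above: one assumption, three artifacts built on it.
assumption = Artifact("agent-1 assumption")
middleware = Artifact("middleware", (assumption,))
tests = Artifact("tests", (middleware,))
docs = Artifact("docs", (middleware,))

support = independent_support([middleware, tests, docs])
```

Three artifacts agree, but `support` is 1: the reviewer is looking at one assumption seen three times, not three independent confirmations.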

Imagine a mistaken assumption. An agent generates an output based on it. A second agent takes that output, reorganizes it, adds some citations, adjusts the formatting. A third agent analyzes the structured version and produces a conclusion with tables and classifications. You, the human, see the third agent’s output: clean formatting, clear categorization, rich citations. You judge that this product has been through three agents and therefore must be well-founded.

The problem is that neither the second agent nor the third agent verified whether the first agent’s assumption was correct. They simply took the assumption and refined the presentation. What the first agent’s original assumption was, what was missing from its context, what leap it made at that moment: none of this was corrected in steps two and three. It was erased. Each laundering step moves the output further from the original reasoning flaw and closer to a product that looks trustworthy. This is why I call it laundering: the more processing steps, the more credible the output appears, but its distance from the truth does not shrink.

Claude Code Dynamic Workflow’s verification mechanism offers a precise example. Its documentation describes it as: “agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge.” This sounds like the scientific method. But the workers and the verifiers use the same model. Blind spots produced by the same training process are shared by all agents. When a verifier’s reasoning bias aligns with the worker’s, what it checks is not the assumption itself, but whether the assumption’s expression is internally consistent. A consistent output, after multiple rounds of cross-checking, becomes tighter, more coherent, more uniform, and more right-looking. But consistency is not correctness. Uniformity is not independence. In this system, each additional round of cross-checking adds only another laundering step.

This mechanism reveals the deeper meaning of the MAST taxonomy. MAST attributes 78.7% of failures to specification problems and coordination failures, but this classification does not distinguish between two different things.

One is agents failing to understand each other’s intent. You tell the first layer to search for a keyword; it searches, it understands. The second layer misunderstands and reads a different document. This is a pure coordination failure.

The other is agents understanding each other’s intent, but laundering away the original assumption’s uncertainty during handoff. The first layer says, “I suspect it might be like this.” The second layer picks it up and directly says, “It is like this.” The third layer builds a classification on top of “It is like this.” Every step followed the correct process. Every step lost the original word “might.”

In multi-agent systems where agents communicate via natural language in free-form dialogue, the second case is the default behavior. Each agent receives the previous agent’s summary, not the previous agent’s reasoning process. Summaries strip away the fragility of the assumption (“I suspect it might be like this”) and keep only its hardness (“It is like this”).
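A structured handoff is one way to keep the word “might” from disappearing. The following is a minimal sketch under stated assumptions: the `Claim` and `Status` types and the `derive` rule are hypothetical, and the cache example is invented for illustration. The rule is simply that a downstream claim can never be more certain than the weakest claim it rests on. It deliberately does not check whether the derivation step itself is sound.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"  # checked against a source outside the pipeline
    ASSUMED = "assumed"    # a guess that nobody has checked yet


@dataclass(frozen=True)
class Claim:
    text: str
    status: Status
    premises: tuple["Claim", ...] = ()


def derive(text: str, *premises: Claim) -> Claim:
    """Build a downstream claim that inherits the weakest status among its premises."""
    if not premises:
        raise ValueError("a derived claim needs at least one premise")
    if any(p.status is Status.ASSUMED for p in premises):
        status = Status.ASSUMED
    else:
        status = Status.VERIFIED
    return Claim(text, status, premises)


def open_assumptions(claim: Claim) -> list[str]:
    """List the unverified root claims that this claim still depends on."""
    if not claim.premises:
        return [claim.text] if claim.status is Status.ASSUMED else []
    found: list[str] = []
    for premise in claim.premises:
        for text in open_assumptions(premise):
            if text not in found:
                found.append(text)
    return found


# Three layers, as in the example above (the subject matter is made up).
guess = Claim("The failures might come from the cache layer", Status.ASSUMED)
restated = derive("The failures come from the cache layer", guess)
taxonomy = derive("Cache failures fall into three classes", restated)
```

Even though the wording of `restated` and `taxonomy` has lost the hedge, `taxonomy.status` is still `Status.ASSUMED`, and `open_assumptions(taxonomy)` returns the original guess verbatim for whoever reviews the final output.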

Oh My OpenAgent’s Sisyphus framework is one of the few systems that defend against this at the design level. It runs three independent review layers across three different models: Prometheus (Claude Opus) handles planning; Metis (Claude Opus) does gap detection, looking for what Prometheus missed; Momus (GPT-5.2) performs final plan verification, rejecting plans that do not meet standards with no retry limit. The design intent here is not for three agents to reach consensus, but to exploit the inevitable judgment differences across three different models to expose assumption flaws that a single model would miss, before the plan enters execution. Sisyphus does not rely on “multiple agents verifying each other” to guarantee quality. It relies on different models having different blind spots to increase the probability that an assumption gets intercepted.

But even then, it does not guarantee that Momus’s review is more correct than Prometheus’s plan. It only adds one more filter, made of a different material, before the mistaken assumption enters the execution layer.
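Why a filter “made of a different material” helps, and why it is still no guarantee, can be shown with a toy probability model. This is an assumption-laden sketch, not a measurement of any real system: `m`, `k`, and `shared` are hypothetical parameters, and the mixture model is chosen only because it is easy to reason about.

```python
def miss_probability(m: float, k: int, shared: float) -> float:
    """Probability that all k reviewers miss a flawed assumption.

    Toy mixture model (an assumption, not a measurement):
      - with probability `shared`, the reviewers have a common blind spot and
        behave like a single reviewer, missing with probability m;
      - otherwise they miss independently, each with probability m.
    """
    if not (0.0 <= m <= 1.0 and 0.0 <= shared <= 1.0 and k >= 1):
        raise ValueError("m and shared must be in [0, 1], k must be >= 1")
    return shared * m + (1.0 - shared) * m ** k


# Illustrative numbers only: each reviewer misses half the time, three reviewers.
same_model = miss_probability(0.5, 3, shared=1.0)      # 0.5
fully_diverse = miss_probability(0.5, 3, shared=0.0)   # 0.125
partly_diverse = miss_probability(0.5, 3, shared=0.5)  # 0.3125
```

Under this model, adding reviewers that share every blind spot leaves the miss probability unchanged, while diverse reviewers lower it without ever driving it to zero.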

Every Architecture Has a Blind-Spot Distance

Translated into engineering language, every architecture has a blind-spot distance: the number of intermediate processing steps a mistaken assumption travels from its origin to the moment human eyes see it. This distance is not a technical detail. It is the upper bound on the system’s integrity.

Kimi Swarm has the longest blind-spot distance. An RL-trained coordinator decomposes tasks, dispatches agents, merges results; humans only provide the goal at the start and see the output at the end. In a 300-agent, 4,000-step swarm, a mistaken assumption travels from origin to human visibility through every operation in the entire swarm. Each operation moves the assumption further from its original reasoning and closer to a finished product. When the final result is wrong, the human must reconstruct the context of the entire swarm to locate the problem, and the coordinator’s RL decision logic is uninterpretable. The infinite loops reported in GitHub issues are just one symptom.

Claude DW shortens the blind-spot distance by one step. The human reviews the script before execution. If a flaw in the assumption is visible at the script logic level (cross-validation too weak, branch conditions too vague, task decomposition ignoring a class of edge cases), the human has a chance to intercept it here. But this depends on two conditions. First, the human must have sufficient cognitive bandwidth to seriously review a JS orchestration script that Claude itself wrote and that may contain its own mistaken assumptions. Second, if the flaw is not at the script logic level but in the worker execution layer’s cross-validation (agents spawned by the same model share the same reasoning blind spots), the script review cannot block it. Claude DW’s design only shortens the blind-spot distance from “the entire pipeline” to “the worker layer after script review.”

Pi has the shortest blind-spot distance. Mario Zechner’s design removes the orchestration layer, removes sub-agent black boxes, removes plan mode. Four tools, a system prompt of under 1,000 tokens. The user sees the agent’s read, bash, and edit at every step. A mistaken assumption, once generated, gets exposed the fastest, provided the human is actually watching. Pi’s trade-off is not at the architectural level. It is at the human-machine rhythm level: zero blind-spot distance, but the highest frequency of human involvement.

There is no absolute ranking among these three systems. Systems with short blind-spot distance demand high cognitive engagement from humans and depend heavily on human quality. Systems with long blind-spot distance depend on the interception quality of intermediate layers, and when those intermediate layers launder rather than intercept, the system produces not more reliable results, but results that look more reliable.

The 78.7% of failures that MAST attributes to specification and coordination issues fundamentally reflects not coordination quality, but blind-spot distance. When agents communicate freely in natural language, every handoff step is an implicit laundering operation: it summarizes the previous agent’s intent, strips away its uncertainty, and turns it into material the next agent can keep processing. This process is not a design flaw. It is the inherent behavior of natural language communication. You do not need to design a laundering mechanism to launder. You only need to let agents hand off using natural language.

How to Use This Lens for Architecture Decisions

In your pipeline diagram, how many processing stages are in the longest chain between two human checkpoints? Find that number. That number is your architecture’s amplification factor for mistaken assumptions.

When finding this number, do not count automated testing, linting, or type checking as interception. Automated testing catches deviations in code behavior, not deviations in assumptions. If an agent writes correct code based on an incorrect premise, the tests pass. Tests fail on implementation errors, not premise errors. Intercepting premise errors requires judgment that differs from the execution layer: independent review by a different model, human intervention, or explicitly surfacing the assumption itself as an intermediate artifact that can be verified.
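A small example makes the difference between an implementation error and a premise error tangible. Everything here is hypothetical: suppose the agent assumed upstream timestamps are in seconds, while the real upstream sends milliseconds. The function and its test are both written from the same premise, so the test can only confirm that the code matches the premise.

```python
# Hypothetical premise adopted by the agent: timestamps are in seconds.
# Suppose the real upstream sends milliseconds; nothing in this file knows that.


def is_expired(created_at: int, now: int, ttl_seconds: int = 3600) -> bool:
    """Return True when the record is older than the TTL (assumes seconds)."""
    return now - created_at > ttl_seconds


def test_is_expired() -> None:
    # The test data encodes the same premise as the code under test.
    assert is_expired(created_at=0, now=3601)
    assert not is_expired(created_at=0, now=3600)


test_is_expired()  # passes: the implementation is correct for the premise

# Feed it what the upstream actually sends in this scenario: 4 seconds of
# real time expressed in milliseconds. The record is reported as expired.
premise_violation = is_expired(created_at=0, now=4000)
```

The test suite is green, and `premise_violation` is still `True`. Only something that questions the unit itself, such as a reviewer with different context or the premise written down as a checkable artifact, would catch it.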

If your current workflow is “let the agent run freely for a stretch, then have a human check the result once,” your blind-spot distance is the number of steps in that stretch. If the stretch includes both planning and execution, and these two agents hand off via natural language, that is roughly 2. If there is more than one worker and the workers’ outputs cross-reference each other, each additional layer adds one more processing stage. In Claude DW’s script review mode, if your script review is thorough, the blind-spot distance can be compressed to the layer where workers execute in parallel. Flaws at the script logic level get intercepted during review; assumption biases at the worker layer may still pass through. In Sisyphus’s mode, multiple layers of independent review by different models push the interception rate to its upper bound, but you must accept the additional token cost and longer planning phase.
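The counting described above can be sketched as a longest-run computation over a pipeline graph. This is one possible reading of the definition, with assumptions stated plainly: the pipeline is acyclic, stage kinds are free-form labels I made up, only kinds listed in `checkpoints` reset the count, and automated tests are counted as ordinary stages because they do not intercept premise errors.

```python
from functools import lru_cache


def blind_spot_distance(
    stages: dict[str, str],
    edges: dict[str, list[str]],
    checkpoints: frozenset[str] = frozenset({"human"}),
) -> int:
    """Longest run of non-checkpoint stages along any path of an acyclic pipeline.

    stages maps stage name -> kind (e.g. "agent", "test", "human").
    edges maps stage name -> list of downstream stage names.
    """

    @lru_cache(maxsize=None)
    def longest_run_from(name: str) -> int:
        # Length of the longest unbroken run of non-checkpoint stages
        # that starts at `name`. A checkpoint ends the run.
        if stages[name] in checkpoints:
            return 0
        downstream = [longest_run_from(nxt) for nxt in edges.get(name, [])]
        return 1 + max(downstream, default=0)

    return max((longest_run_from(name) for name in stages), default=0)


# Planner and executor run freely, then a human checks the result once.
simple = blind_spot_distance(
    {"planner": "agent", "executor": "agent", "review": "human"},
    {"planner": ["executor"], "executor": ["review"]},
)

# Same pipeline with a test stage in between: tests do not reset the count.
with_tests = blind_spot_distance(
    {"planner": "agent", "executor": "agent", "unit_tests": "test", "review": "human"},
    {"planner": ["executor"], "executor": ["unit_tests"], "unit_tests": ["review"]},
)
```

Here `simple` is 2, matching the planning-plus-execution case above, and `with_tests` is 3. If you decide that a different-model review counts as interception in your setup, pass its kind in `checkpoints` and see how far the number drops.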

These choices are not a matter of competing technical approaches. They are about answering a concrete question in your specific context: in the agent’s workflow, at which step do mistaken assumptions typically arise, what is the fastest way to intercept them, and how much are you willing to pay for that interception.

If mistaken assumptions frequently appear right in the task description you give the agent (you were too vague, so the agent filled in a nonexistent constraint), then having multiple agents discuss it among themselves will only accelerate the laundering. What you need is a mechanism that forces you to clarify your requirements: Prometheus’s interview mode, or Pi’s “every line is visible” approach, which lets you see the moment the agent veers off course. If mistaken assumptions tend to appear during execution (the agent understood your goal but chose the wrong implementation path), then having a different-model agent review its intermediate outputs may help. If mistaken assumptions are rare, your harness is already mature, and the agent’s outputs are directly verifiable, then you need no intermediate layers at all: a single agent plus a strong harness is the optimal solution.

One scenario warrants special caution: your multi-agent system performs beautifully in demos, but once in production it starts producing errors that are hard to pin down, and the errors are different every time. Check whether your intermediate layers are intercepting or laundering. If an agent’s output, before being passed to the next agent, goes through a natural language summary that strips away qualifiers and expressions of uncertainty, then that summary is not helping the next agent make a better judgment. It is sparing the next agent from having to reconstruct the full context of the previous agent. Saving context is the right thing to do. But if what you omit happens to be exactly what needs to be traced later (the confidence level of a claim, the alternatives on a path, the boundary conditions of an assumption), then you are not saving context. You are burning evidence.
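A crude first check for laundering at a handoff is to compare the uncertainty markers in an agent’s raw output with those in the summary passed downstream. The sketch below is a deliberately simple heuristic under obvious assumptions: the `HEDGES` list is a small hypothetical sample, matching is by whole lowercase word, and it will miss paraphrased uncertainty entirely.

```python
import re

# Hypothetical, deliberately small list of uncertainty markers.
HEDGES = ("might", "may", "suspect", "probably", "possibly", "unclear", "assume")


def hedges_in(text: str) -> set[str]:
    """Return the hedge words from HEDGES that occur in `text` as whole words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {hedge for hedge in HEDGES if hedge in words}


def lost_uncertainty(upstream: str, summary: str) -> bool:
    """True when the upstream text hedged but the summary carries no hedge at all."""
    return bool(hedges_in(upstream)) and not hedges_in(summary)


raw = "I suspect it might be like this."
laundered = lost_uncertainty(raw, "It is like this.")
preserved = lost_uncertainty(raw, "It may be like this.")
```

In this example `laundered` is `True` and `preserved` is `False`. A flag from a check like this is a prompt to look at the handoff, not proof that the summary is wrong.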

The multi-agent discussion comes back to one foundational question: at every step of your pipeline, is it making errors more visible, or is it making errors more presentable? Visible errors get caught by humans. Presentable errors get released.