You send a request to an API endpoint, thinking you’re calling some model. The model on the other end barely answers anything itself. It convenes GPT, Claude, and Gemini, assigns roles — you brainstorm, you execute, you verify — then hands back an answer after a few rounds of discussion and validation. From your side, it’s always been a single API. What you’re calling is a coordinator trained specifically to run meetings.
Fugu, released by Sakana AI, is exactly such a system. Under the hood are two ICLR 2026 papers: TRINITY trains a tiny coordinator using evolutionary strategies (TRINITY arXiv), while Conductor trains a 7-billion-parameter orchestrator with reinforcement learning (Conductor arXiv). What genuinely sets it apart from existing approaches is that whom to call, how to divide the work, and who verifies were all learned by the model from a massive corpus of problems, with not a single if-else hard-coded by a human. Fugu takes multi-agent orchestration out of developer-written workflows and turns it into a capability trained into model weights. Along the evolutionary arc of multi-agent systems, this represents a genuine turning point.
It resembles an AI that learned to manage: whom to deploy, how to divide the work, who signs off. Decisions that used to rely on human judgment are now trained into model weights. A learned coordinator is inherently a black box, but at this level, trusting it is not fundamentally different from trusting a human manager. The real question worth pressing is something else: beyond the manager's opaque reasoning, Sakana also hides whom the manager called and what tasks were assigned, concealing more than a manager in any real organization would, and this layer of opacity can be fixed. Fugu is the first product that lets an AI learn how to manage other AIs.
At its core, Fugu is a trained coordinator. When a request arrives, it judges whether one model can handle it alone or whether a team is needed. Which models to call, what roles to assign, who brainstorms, who executes, who verifies, how to synthesize — all decided by the model itself.
The role division follows a separation-of-powers approach. Thinker brainstorms, Worker executes, Verifier checks. If Verifier says no, the task goes back for a redo. This behavior emerges from reward-driven training (reinforcement learning in Conductor, evolutionary strategies in TRINITY), where reward signals compress coordination strategy into weights: solve problems en masse, get rewarded for correct answers, and the model learns on its own what team to assemble for what kind of problem.
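To make the Thinker, Worker, Verifier pattern concrete, here is a minimal hand-written sketch of the loop. It only illustrates the shape of the behavior: in Fugu the pattern is learned, not coded like this. The `Agent` type, the `run_team` function, the "OK" approval convention, and the toy agents in `demo` are all hypothetical; the cap of 5 rounds mirrors the limit reported in the papers.

```python
from typing import Callable, Tuple

# Hypothetical agent signature: takes a prompt string, returns a text reply.
Agent = Callable[[str], str]


def run_team(task: str, thinker: Agent, worker: Agent, verifier: Agent,
             max_rounds: int = 5) -> Tuple[str, int]:
    """Plan once, then execute and verify until approved or out of rounds."""
    plan = thinker(task)
    feedback = ""
    answer = ""
    for round_no in range(1, max_rounds + 1):
        answer = worker(f"task: {task}\nplan: {plan}\nfeedback: {feedback}")
        verdict = verifier(f"task: {task}\nanswer: {answer}")
        # Assumed convention: the verifier replies "OK" to approve.
        if verdict.strip().upper().startswith("OK"):
            return answer, round_no
        feedback = verdict  # a rejection sends the task back for a redo
    return answer, max_rounds  # rounds exhausted: return the last attempt


def demo() -> Tuple[str, int]:
    """Toy agents: the worker gets it wrong once, then corrects itself."""
    attempts = []

    def thinker(prompt: str) -> str:
        return "add the two numbers"

    def worker(prompt: str) -> str:
        attempts.append(prompt)
        return "5" if len(attempts) == 1 else "4"

    def verifier(prompt: str) -> str:
        return "OK" if prompt.endswith("answer: 4") else "REJECT: recheck the sum"

    return run_team("2 + 2", thinker, worker, verifier)
```

In the toy run, the first attempt is rejected and the second is approved, so `demo()` returns the answer `"4"` after 2 rounds.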
Recursive self-calling is the most human-like design in this mechanism. When Fugu is uncertain, it can call its own clone; if the clone is uncertain, it calls another clone. It’s like a manager who, upon receiving a complex request, first calls a meeting; then someone in the meeting feels unsure about a particular part, and the manager says, go hold a breakout session on that. The official documentation explicitly states that Fugu’s agent pool includes itself, supporting recursive calls (Sakana AI).
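The breakout-session idea can be sketched as a function that calls itself on the parts it is unsure about. This is an illustration under assumptions, not Fugu's mechanism: `coordinate`, `is_confident`, `solve`, and `split` are hypothetical names, and the `max_depth` cap is an assumption added here so the recursion always terminates (the source does not say how recursion is bounded).

```python
from typing import Callable, List


def coordinate(task: str,
               is_confident: Callable[[str], bool],
               solve: Callable[[str], str],
               split: Callable[[str], List[str]],
               depth: int = 0,
               max_depth: int = 3) -> str:
    """Answer directly when confident; otherwise delegate parts to clones."""
    if depth >= max_depth or is_confident(task):
        return solve(task)
    # Each sub-task goes to a "clone": the same function, one level deeper.
    sub_answers = [
        coordinate(part, is_confident, solve, split, depth + 1, max_depth)
        for part in split(task)
    ]
    return " + ".join(sub_answers)


# Toy stand-ins: a task is "uncertain" while it still contains a comma.
def toy_confident(task: str) -> bool:
    return "," not in task


def toy_split(task: str) -> List[str]:
    return task.split(",", 1)  # peel off the first item, keep the rest together


def toy_solve(task: str) -> str:
    return task.upper()
```

With these toy helpers, `coordinate("a,b,c", toy_confident, toy_solve, toy_split)` recurses twice and returns `"A + B + C"`; with `max_depth=1` the second level is cut off and the unresolved remainder is solved as a whole.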
The user sees only a single API; internally, a coordinator deploys a group of AIs that divide the work and collaborate, synthesizing an answer only after verification passes — the entire process is invisible to the user.
Multi-model tools on the market can be roughly grouped into a few categories. OpenRouter helps you pick a model, calling only one at a time. LangGraph gives you building blocks and has you draw flowcharts to orchestrate agents yourself. OpenRouter’s Fusion follows a fixed pipeline: dispatch in parallel, judge the outputs, aggregate into an answer (OpenRouter). Fugu differs from all of these: the orchestration itself is something it learned, not relying on rules or fixed pipelines. You can’t edit its coordination strategy the way you edit a LangGraph flowchart — you can only retrain it with different data.
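For contrast, a fixed pipeline of the dispatch, judge, aggregate kind looks roughly like the sketch below. This is not OpenRouter's API or Fusion's actual implementation; `fixed_pipeline`, the `Model` type, and the scoring `judge` are hypothetical, and aggregation is simplified to keeping the top-scoring draft. The point is that the three steps are spelled out in code and run the same way for every prompt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical model signature: prompt in, draft answer out.
Model = Callable[[str], str]


def fixed_pipeline(prompt: str,
                   panel: Dict[str, Model],
                   judge: Callable[[str, str], float]) -> str:
    """Run the same three hard-coded steps regardless of the prompt."""
    # Step 1: dispatch the prompt to every panel model in parallel.
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        drafts = {name: future.result() for name, future in futures.items()}
    # Step 2: judge each draft with a scoring function.
    scores = {name: judge(prompt, draft) for name, draft in drafts.items()}
    # Step 3: aggregate. Simplified here to "keep the best-scoring draft".
    best = max(scores, key=scores.get)
    return drafts[best]
```

Changing the strategy here means editing the function body. With a learned coordinator there is no such function to edit, which is why the only lever left is retraining.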
A technically inclined reader will naturally push back: GPT can already reason, analyze, and decompose tasks — couldn’t you just write a detailed task description, have it plan and dispatch other models, and call it a day?
What this pushback is really pointing at is where coordination capability lives. Historically, that location has been migrating. Early on, AI did internal division of labor: in Mixture-of-Experts models, for example, different parameter regions handle different computations, but the model only interacts with itself. Later, people started choosing among different pre-trained models: cheap ones for easy questions, expensive ones for hard ones. That's tool selection, not tool coordination. Still later, developers hand-wrote multi-agent collaboration flowcharts, whose quality depended on the judgment of the person who drew them; change the task, and the flowchart has to be redrawn. Fugu takes the next step along this path: coordination capability is no longer written in code but trained into the model itself. Reasoning models (the o1 class) moved reasoning ability out of human prompt engineering and into training; what Fugu does is structurally the same. With GPT as an orchestrator, the coordination intelligence comes from your prompt; with Fugu, it comes from a purpose-trained coordinator whose strategy lives in its weights. Control over coordination moves from code into weights. That is what actually changed in this round of evolution.
In principle, this direction holds up. Both papers also test it directly, setting up prompted LLM orchestrators as control groups and giving them favorable conditions.
Conductor’s appendix Table 11 compares three orchestrators: GPT-5-as-orchestrator, Gemini-as-orchestrator, and the trained 7B Conductor, all using the same prompting framework (Conductor arXiv). To give LLM orchestrators the best possible showing, they were even granted automatic resampling on formatting failures and doubled output tokens. Results: the trained version averaged 75.65, Gemini-as-orchestrator 71.59, GPT-5-as-orchestrator (managing three models) 70.05. The trained version leads overall, but that lead is not evenly distributed. The advantage is heavily concentrated on coding tasks: LiveCodeBench shows a 13–33 point lead, with a smaller lead on BigCodeBench. On math and science, the results are tied or nearly so: AIME 93.30 for the trained version versus 93.30 for GPT-5-as-orchestrator, GPQA-D 87.50 versus 86–87. The paper’s explanation: prompted orchestrators cannot understand that Claude is weaker on coding while GPT-5 is weaker on another class of problems; they rely on prior biases that don’t match actual downstream performance.
TRINITY’s appendix Table 8 did the same comparison: directly prompting Gemini 2.5 Pro as coordinator, selecting models and roles each round (TRINITY arXiv). The trained version averaged 70.44, the prompted version only 53.76, a 16.7-point gap. The paper explicitly states that prompted LLMs struggle to understand and manage the individual properties of seven agents.
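The gaps behind these comparisons are plain subtractions, and a few lines of Python make them easy to recheck. The numbers are the averages quoted in the text; the dictionary labels and the `lead` helper are just illustrative names. The two tables come from different papers with different setups, so the gaps should not be compared with each other directly.

```python
# Averages as quoted in the text (Conductor Table 11, TRINITY Table 8).
conductor_avg = {
    "trained 7B Conductor": 75.65,
    "Gemini-as-orchestrator": 71.59,
    "GPT-5-as-orchestrator": 70.05,
}
trinity_avg = {
    "trained coordinator": 70.44,
    "prompted Gemini 2.5 Pro": 53.76,
}


def lead(scores: dict, winner: str) -> dict:
    """Point gap between the winner and every other entry, rounded to 2 places."""
    return {
        name: round(scores[winner] - value, 2)
        for name, value in scores.items()
        if name != winner
    }


conductor_lead = lead(conductor_avg, "trained 7B Conductor")
trinity_lead = lead(trinity_avg, "trained coordinator")
```

The trained Conductor leads the two prompted orchestrators by 4.06 and 5.60 points on average, while the TRINITY gap is 16.68 points, which rounds to the 16.7 cited above.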
An honest qualifier is needed here. The papers also acknowledge that frontier models themselves already possess decent meta-orchestrator capability: GPT-5 as orchestrator outperforms its own single-model results, and also exceeds the untrained 7B base model. Training widens the gap further on top of this foundation, with the incremental gains concentrated on instructionally complex coding and engineering tasks and on cost efficiency. The accurate formulation is: training can reliably open up a lead in specific dimensions while tying in others. A trained orchestrator does not comprehensively crush a prompted one. If the tasks you care about happen to be coding — instructionally complex and sensitive to individual model differences — the training payoff is real; if you’re just doing simple math and science routing, a prompted orchestrator is already good enough.
Fugu currently comes in two versions, trained differently, each corresponding to one paper.
Standard Fugu follows the TRINITY route (TRINITY arXiv). The coordinator is extremely small: its backbone is Qwen3-0.6B, topped with a tiny head of about 10,000 parameters, bringing total trainable parameters to under 20,000. The head only outputs logits for decision-making and does not generate text, so inference cost is negligible. Training uses evolutionary strategies — specifically, an algorithm called sep-CMA-ES. The idea: randomly generate a large pool of candidate parameters, have them solve problems, keep those that perform well, iterate for dozens of generations — survival of the fittest. The reward signal is stripped to a single rule: solved the problem, get a 1; didn’t solve it, get a 0. Each candidate parameter is evaluated 16 times and the average fitness is taken, with a maximum of 5 coordination rounds. The paper explains why RL and SFT were not chosen: with 10,000 parameters, each one’s contribution to the final reward is vanishingly small — REINFORCE-style per-parameter gradient methods would have a signal-to-noise ratio too low to be stable; SFT would require first generating demonstration labels for multi-round coordination, which the paper estimates would need 87 billion queries in the 5-round setting, making it cost-prohibitive. The paper also ran a direct comparison: sep-CMA-ES comprehensively beats REINFORCE, SFT, and random search across four benchmarks — LiveCodeBench, MATH500, MMLU, and RLPR.
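The generate, evaluate, keep-the-best loop can be sketched in a few lines. This is a simplified evolution strategy with a diagonal Gaussian (one step size per coordinate), in the spirit of sep-CMA-ES but not the algorithm itself, and it runs on a toy problem rather than on a real coordinator. `TARGET`, `solve_once`, the population size, and the elite count are all assumptions made for illustration; the binary reward and the 16 evaluations per candidate follow the description above.

```python
import math
import random
from typing import List

TARGET = [0.5, -1.0, 1.0]  # hypothetical optimum of the toy problem


def solve_once(params: List[float], rng: random.Random) -> int:
    """Toy episode: 1 if 'solved', 0 otherwise; success is likelier near TARGET."""
    dist2 = sum((p - t) ** 2 for p, t in zip(params, TARGET))
    return 1 if rng.random() < math.exp(-dist2) else 0


def fitness(params: List[float], rng: random.Random, n_evals: int = 16) -> float:
    """Average binary reward over n_evals noisy episodes."""
    return sum(solve_once(params, rng) for _ in range(n_evals)) / n_evals


def evolve(generations: int = 40, pop_size: int = 32, elite: int = 8,
           seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    dim = len(TARGET)
    mean = [0.0] * dim
    sigma = [1.0] * dim  # diagonal covariance: one step size per coordinate
    for _ in range(generations):
        # Sample a population of candidate parameter vectors.
        population = [[rng.gauss(m, s) for m, s in zip(mean, sigma)]
                      for _ in range(pop_size)]
        # Rank candidates by averaged binary reward and keep the best.
        ranked = sorted(population, key=lambda p: fitness(p, rng), reverse=True)
        best = ranked[:elite]
        # Refit the sampling distribution to the survivors.
        mean = [sum(p[i] for p in best) / elite for i in range(dim)]
        sigma = [max(0.05, math.sqrt(sum((p[i] - mean[i]) ** 2 for p in best) / elite))
                 for i in range(dim)]
    return mean
```

No gradient is ever computed: the only signal is which candidates solved more problems. That is why this family of methods stays usable when each parameter's individual effect on the reward is too faint for per-parameter gradient estimates.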
Fugu Ultra follows the Conductor route (Conductor arXiv). The coordinator is Qwen2.5-7B, trained with GRPO (Group Relative Policy Optimization). For each problem, 64 coordination plans are sampled, run, and checked for correctness: full points for correct answers, zero for formatting errors, partial credit for correct format but wrong answer. Training runs for 200 iterations, with the KL penalty turned off, on two H100 GPUs. The coordination workflow is capped at 5 steps, though in practice only 3 steps are used on average. Token consumption is roughly one-sixth that of hand-written multi-agent topologies.
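The core of GRPO is that each sampled plan is scored relative to the other plans drawn for the same problem. The sketch below shows that step only, under assumptions: the partial-credit value of 0.5 is a placeholder (the text does not give the number), `reward` and `group_advantages` are hypothetical names, and a group of 4 stands in for the 64 samples used in training.

```python
from statistics import mean, pstdev
from typing import List


def reward(format_ok: bool, correct: bool, partial: float = 0.5) -> float:
    """Three-tier reward as described in the text; `partial` is an assumed value."""
    if not format_ok:
        return 0.0
    return 1.0 if correct else partial


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its own group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sd + eps) for r in rewards]


# Four sampled coordination plans for one problem:
# correct, well-formed but wrong, malformed, well-formed but wrong.
group = [
    reward(True, True),
    reward(True, False),
    reward(False, False),
    reward(True, False),
]
advantages = group_advantages(group)
```

Plans that beat their group's average get a positive advantage and are reinforced; plans below it are pushed down. If every plan in a group earns the same reward, all advantages are zero and that problem contributes no learning signal.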
Both research papers use the same pool of 7 models: GPT-5, Gemini-2.5-Pro, Claude-Sonnet-4, plus four open-source models: DeepSeek-R1-Distill-Qwen-32B, Gemma-3-27B, and Qwen3-32B in both direct and reasoning modes (counted as two separate agents). The production-grade Fugu Ultra upgrades this pool to the next generation: Gemini-3.1-Pro, Claude-Opus-4.8, and GPT-5.5 (Fugu Technical Report).
On the data side, neither paper used labeled coordination demonstration datasets. They adopted an RLVR approach: take a batch of benchmark problems with verifiable answers, use answer correctness as the reward signal, and let coordination strategies emerge spontaneously in the process of maximizing reward. Training used LiveCodeBench V1, testing used V6, deliberately separated to prevent leakage. Those emergent behaviors in the papers — trivial questions handled by a single direct attempt, hard questions automatically generating pipelines from planning to execution to verification, independent attempts followed by final debate — all bubbled up from the reward signal, not a single one written into any prompt by a human.
As noted earlier, a learned coordinator is inherently a black box. This statement needs unpacking: where exactly is it opaque, and compared to a human manager, is the degree of opacity really the same?
First layer: Fugu’s coordinator is itself a language model. Why it decides on a team for a given problem instead of handing it to a single model, why it picks model A over model B: these decisions are baked into weights and can’t be traced back one by one. But at this level, it’s not fundamentally different from trusting a human manager. You can’t formalize why a manager pulled three people into a meeting instead of deciding alone either. You judge a manager by outcomes, not by reverse-engineering their decision algorithm. This layer is acceptable.
From the second layer onward, the differences appear. A human manager’s opacity is in reasoning, not in actions. Who the work was assigned to and what budget was approved are visible in the organization, recorded, and auditable. Fugu is different. At each internal step, its coordinator produces a coordination sequence: which agent was called, what sub-task was given, and how context was shared. This output is machine-readable and could naturally be recorded and passed through (VentureBeat). But Sakana chooses not to provide it. VentureBeat cites the company’s stance: routing information is proprietary and hidden from the user by design. GIGAZINE’s reporting also confirms users cannot verify which models were actually used for each call (GIGAZINE).
Third layer: what raw content each underlying agent produced, which provider data flowed to, which model contributed which part of the final answer — all of this could technically be surfaced. Sakana likewise chooses to hide it.
What truly blocks debugging, compliance audits, and cost attribution is not the inherently opaque reasoning layer — that layer is the same with a human manager. The problem lies in the second and third layers: Fugu also hides the coordination actions, concealing more thoroughly than a manager in any real organization would. In a normal company, you at least know whom the boss assigned work to and how much it cost. With Fugu, you don’t even know that.
Hiding these two layers is a business choice, not a technical inevitability. The motive, protecting learned coordination strategies and blended pricing from being reverse-engineered, is understandable, but the choice itself is reversible. The value of the manager analogy is that it pinpoints the extra layer Fugu hides beyond what a human manager does, and that layer can be opened. If Sakana gives enterprise customers access to coordination sequence logs and underlying model identities, the compliance hurdle is largely cleared.
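As a rough picture of what an exposed coordination log could enable, the sketch below defines a per-step record and uses it for attribution. The schema is entirely hypothetical: Sakana publishes no such format, and `CoordinationStep`, its fields, and the placeholder model names are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CoordinationStep:
    """One hypothetical log record per coordinator action."""
    step: int
    agent: str     # which underlying model was called
    role: str      # e.g. "thinker", "worker", "verifier"
    subtask: str   # what the agent was asked to do
    tokens: int    # tokens consumed by this call


def tokens_by_agent(trace: List[CoordinationStep]) -> Dict[str, int]:
    """Attribute token usage to each underlying model."""
    totals: Dict[str, int] = defaultdict(int)
    for record in trace:
        totals[record.agent] += record.tokens
    return dict(totals)


def agents_used(trace: List[CoordinationStep]) -> List[str]:
    """Distinct models touched by the request, in first-call order."""
    return list(dict.fromkeys(record.agent for record in trace))


# Placeholder trace for one request; model names are not real identifiers.
trace = [
    CoordinationStep(1, "model-a", "thinker", "outline an approach", 1200),
    CoordinationStep(2, "model-b", "worker", "write the code", 3400),
    CoordinationStep(3, "model-a", "verifier", "check the result", 800),
]
```

With a trace like this, `tokens_by_agent(trace)` answers the cost-attribution question and `agents_used(trace)` answers the which-provider-saw-my-data question. Neither requires any insight into why the coordinator made its choices.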
A direction that holds up academically doesn’t mean the product is production-ready. Actual performance is a set of mixed signals.
Sakana’s research shows that RL-trained coordination strategies beat hand-written baselines, including Mixture-of-Agents and similar methods. Workflows average 3 steps, and token costs are about 20x lower than hand-written topologies (VentureBeat). Fugu Ultra scores 93.2 on LiveCodeBench, beating Fable 5’s 89.8; on GPQA-D, Fugu again leads, 95.5 to 94.6 (Sakana AI).
On agentic coding, which the industry values more, Fugu scores only 73.7 on SWE-Bench Pro, while Fable 5 scores 80. On HLE, the hardest reasoning test, Fugu also trails, 50 to Fable 5’s 53.3 (Sakana AI). These two scores in particular require careful interpretation: Fable 5 is not in Fugu’s agent pool. The official page states in black and white that Fable 5 and Mythos are not in the agent pool because they are not publicly accessible (Sakana AI). In other words, Fugu uses its coordinator plus the other models in the pool to go up against Fable 5 as a single model, an essentially asymmetric comparison favoring Fable 5. Yet Sakana’s page simultaneously describes Fugu as standing shoulder to shoulder with Fable 5.
Two critical pieces are missing. First, there is no head-to-head comparison with OpenRouter Fusion. Fusion, using a panel of cheaper models, also claims to exceed frontier performance, and even self-fusing the same model can boost scores (OpenRouter). Second, there is no independent third-party reproduction; all current data is self-reported by Sakana. Latency characteristics have also not been disclosed.
Sakana packages model swappability as a sovereignty selling point. David Ha says Fugu bypasses vendor lock-in through its fully swappable agent pool (VentureBeat), and the official page mentions immunity to export control risks (Sakana AI). The backdrop: Anthropic’s Fable 5 and Mythos were suspended globally under a US Commerce Department export control directive (CNBC). Swappability maps to supply resilience: if a particular model is cut off, the system can route around it and keep working. But it doesn’t answer the sovereignty question: who decides whether models and data continue to exist. The regulated European enterprises this pitch most wants to attract are precisely the ones least well served by the current opacity choices. That said, this tension is a business trade-off, can be addressed layer by layer, and is hardly an irreconcilable deadlock.
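Supply resilience, in its simplest form, is a fallback over a ranked list. In the sketch below the ranking is hand-written and the names are placeholders; in Fugu the preference would come from the learned coordinator, and nothing here reflects its real interface.

```python
from typing import Dict, List, Optional


def pick_agent(preferences: List[str], available: Dict[str, bool]) -> Optional[str]:
    """Return the highest-ranked agent that is still reachable, or None."""
    for name in preferences:
        # Agents missing from the availability map are treated as unreachable.
        if available.get(name, False):
            return name
    return None


# Hypothetical ranking for one role, and a pool where the top choice is cut off.
ranking = ["model-a", "model-b", "model-c"]
pool_status = {"model-a": False, "model-b": True, "model-c": True}
chosen = pick_agent(ranking, pool_status)
```

Here `chosen` is `"model-b"`: the request still gets served after `model-a` disappears. What the fallback cannot do is tell the customer that the switch happened or where the data went, which is the gap between resilience and sovereignty.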
Fugu has one thing solidly right: learned orchestration does have an academic edge over hand-written coordination. But whether it’s worth stuffing your workflow into this black box is still unclear — both the Fusion comparison and independent verification are missing. Fugu may not be the final answer to AI multi-model coordination, but the idea of models learning to manage other models isn’t going away. It just needs to address transparency first.