When people build agent pipelines, the default intuition is usually: use the strongest model for the critical steps, and cheaper models for the less important ones. Planning uses Opus, execution uses Haiku, and the final pass uses Sonnet. This allocation matches common sense, and it is also the most common setup in the community today.
The experimental results in AgentOpt (arXiv:2604.06296, April 2026) offer a sharp counterexample to that intuition: in the planner-solver pipeline on HotpotQA, putting Claude Opus in the planner slot yields 31.71% accuracy, near the bottom across 81 model combinations. Put Ministral 8B in the planner slot and Opus in the solver slot, and accuracy reaches 74.27%. A model that is an order of magnitude smaller handles planning, and delivers more than double the performance of Opus-as-planner.
What matters about this result is the decision framework behind it, because it transfers to many other settings: model quality is a function of role and pipeline interaction, not a property you can carry around independent of context.
AgentOpt is a collaboration between Microsoft Research, Columbia DAPLab, and Cornell. It treats model selection in agent pipelines as a combinatorial optimization problem: given a multi-step pipeline, try different models for each role, then search for the best accuracy-cost combination. The paper evaluates 9 models, 4 benchmarks, and up to 81 combinations. The framework is open source (pip install agentopt-py) and uses httpx transport interception to capture API calls in a framework-agnostic way.
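The search itself is conceptually simple. Here is a minimal sketch of that kind of combinatorial selection, with an invented cost table and a stub accuracy oracle (in practice the oracle is a benchmark run); none of these names come from the AgentOpt codebase:

```python
# Sketch of AgentOpt-style model selection: enumerate model assignments
# per role, score each, keep the best under a cost budget.
# MODELS prices and eval_pipeline scores are illustrative, not measured.
from itertools import product

MODELS = {"opus": 15.0, "sonnet": 3.0, "haiku": 0.25, "ministral-8b": 0.10}

def eval_pipeline(planner: str, solver: str) -> float:
    """Stub accuracy oracle; a real run would execute the benchmark."""
    scores = {("ministral-8b", "opus"): 0.7427, ("opus", "opus"): 0.3171}
    return scores.get((planner, solver), 0.5)

def search(budget_per_mtok: float):
    best = None
    for planner, solver in product(MODELS, repeat=2):  # 4^2 = 16 combos
        cost = MODELS[planner] + MODELS[solver]
        if cost > budget_per_mtok:
            continue  # prune combinations over budget
        acc = eval_pipeline(planner, solver)
        if best is None or acc > best[0]:
            best = (acc, planner, solver, cost)
    return best
```

With 2-3 roles and 9 candidate models the space stays small enough (81 or 729 combinations) that exhaustive enumeration with pruning is feasible; the interesting engineering is in making each evaluation cheap.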
Opus’s reasoning ability did not change. What changed is what the role demands of that ability. The planner’s job is to decompose the task and hand subproblems to the solver, but when Opus acts as planner it tends to answer the question itself, skipping the downstream tool call to the solver. In 7 out of 9 cases, the solver was never called. The result is that the planner outputs a mediocre direct answer, and the full reasoning chain breaks.
Ministral 8B performs better as planner for exactly the opposite reason: its capability boundary is clearer. It knows it cannot answer the full question, so it does the boring but correct thing, decomposes the task, calls the tool, and passes the subproblems downstream.
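This failure mode is cheap to detect at runtime. A minimal guard, assuming OpenAI-style message dicts where delegation shows up as a `tool_calls` field (the shape and names are illustrative, not from the paper):

```python
# Hypothetical guard for the failure mode described above: reject planner
# turns that answer directly instead of delegating via a tool call.
def planner_delegated(message: dict) -> bool:
    """True iff the planner emitted at least one tool call."""
    return bool(message.get("tool_calls"))

def validate_planner_turn(message: dict) -> dict:
    if planner_delegated(message):
        return message
    # The Opus-as-planner case: a direct answer with no delegation.
    raise ValueError("planner answered directly; expected a solver tool call")
```

A caller could retry the planner turn (or fall back to a more compliant model) when validation fails, rather than letting a direct answer silently short-circuit the pipeline.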
That mechanism is itself a reusable framework for judgment: in a multi-step pipeline, each role does not need the strongest model; it needs the model that behaves best under that role’s constraints. The stronger a model is, the more likely it is to break role boundaries and try to help. In a pipeline context, that kind of help is destructive.
This judgment also holds up under cross-checks. In the solver-critic pipeline on MathQA, when the answerer is fixed to Opus, the accuracy spread across 9 critic models is only 2.9 percentage points; the critic role barely moves the result. On BFCL, Opus, Kimi K2.5, and Qwen3 Next all land at 70%, but Qwen3 Next costs only 1/32 as much as Opus. At a higher level, optimizing model allocation cuts cost by 13x to 32x while preserving accuracy.
AgentOpt is not an isolated result. Over the last few months, multiple independent studies have pointed in the same direction from different angles: the performance bottleneck in agent systems is shifting away from the model itself and toward the system design around the model.
Stanford’s Meta-Harness (arXiv:2603.28052, March 2026) found that optimizing only the harness code around a model can produce a 6x score difference for the exact same model. The model did not change. The prompt did not change. What changed was the working environment the model was placed into. W4S (Nie et al., COLM 2025) uses a weak model as a meta-agent to design workflows for a strong model, and concludes that workflow design quality matters more to final performance than the capability gap between execution models. Select-then-Solve (arXiv:2604.06753, April 8, 2026) focuses on choosing the reasoning paradigm, and again shifts the optimization target away from what model you use and toward the conditions under which the model works.
Put these papers together and the trend is clear: once models are strong enough, the marginal contribution of system design (role allocation, pipeline topology, harness engineering, workflow design) starts to exceed the marginal contribution of model upgrades themselves. AgentOpt’s contribution is to quantify one of those dimensions with controlled experiments.
This is v0.1. It only evaluates 4 QA and function-calling benchmarks, with no coding tasks, no long-horizon agentic tasks, and pipeline topologies limited to 2 or 3 roles. We do not yet know whether the Opus-as-planner failure mode also holds in code generation pipelines. The paper also does not test an obvious engineering fix: force the planner, via the system prompt, to emit a tool call instead of a direct answer. If a good prompt can repair the behavior, then the root cause is prompt design rather than model selection.
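For completeness, that untested fix is straightforward to try. A sketch of the request construction, assuming an OpenAI-style chat API where `tool_choice="required"` forbids plain-text turns; the solver tool schema and prompt wording here are invented:

```python
# Hypothetical "force delegation" fix: constrain the planner so it
# cannot return a direct answer. Tool schema and prompt are illustrative.
SOLVER_TOOL = {
    "type": "function",
    "function": {
        "name": "solve_subproblem",
        "description": "Delegate one decomposed subproblem to the solver.",
        "parameters": {
            "type": "object",
            "properties": {"subproblem": {"type": "string"}},
            "required": ["subproblem"],
        },
    },
}

def planner_request(question: str) -> dict:
    return {
        "messages": [
            {"role": "system",
             "content": "You are a planner. Decompose the task and ALWAYS "
                        "delegate via solve_subproblem; never answer directly."},
            {"role": "user", "content": question},
        ],
        "tools": [SOLVER_TOOL],
        "tool_choice": "required",  # hard constraint, not just a prompt hint
    }
```

If the hard constraint recovers most of the lost accuracy, the paper's result would say more about default behavior than about planning ability.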
Three directions. First, run one cheap test on your own agent pipeline: replace the planner or orchestrator with a smaller model and see whether performance actually drops. If it does not, you are paying for a capability tier you do not need. Second, for roles where the model must obey a division-of-labor protocol, compliance matters more than raw intelligence. Third, model selection is an ongoing tuning process, not a one-time procurement decision. Once the pipeline design changes, the optimal combination probably changes with it.
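The first direction above, the cheap swap test, can be sketched as a tiny A/B harness. `run_pipeline` stands in for your own agent entry point; everything here is illustrative:

```python
# Minimal swap test: run the same eval set with two planner choices
# and compare accuracy. Exact-match scoring keeps the sketch simple.
from typing import Callable, Iterable

def swap_test(run_pipeline: Callable[[str, str], str],
              planners: tuple[str, str],
              cases: Iterable[tuple[str, str]]) -> dict[str, float]:
    cases = list(cases)
    results = {}
    for planner in planners:
        correct = sum(run_pipeline(planner, q) == gold for q, gold in cases)
        results[planner] = correct / len(cases)
    return results
```

If the small-planner accuracy is within noise of the large-planner accuracy on a few dozen cases, you are paying for a capability tier that role does not need.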
The core shift in thinking is this: the unit of analysis for model selection moves from which model is strongest to role × pipeline. AgentOpt validates that shift with data. Even if the coverage of this dataset is still limited, the mindset itself is already ready for day-to-day engineering practice.