2026-04-24 survey.
Here are two things that can easily burn you right now.
Scenario one. Your agent pipeline handles a lot of long-document work: contract analysis, long-codebase reasoning, feeding in 8-hour meeting transcripts to generate structured summaries. You’ve been using Claude Opus 4.6 because it scored 91.9% on the 1M needle-in-haystack test, the best in the industry for long-context retrieval. On April 16 Opus 4.7 ships, price unchanged, leading across SWE-Bench Pro and the AA Intelligence Index. You flip production over without thinking twice. Two days later monitoring starts firing: on tasks above 600K tokens, output quality drops noticeably, and key facts from the back half of the document routinely go missing. You dig into Anthropic’s System Card and find the line: Opus 4.7 scores 59.2% on the 1M needle-in-haystack, 32 percentage points below 4.6. Anthropic deliberately traded retrieval capacity for agentic reasoning. If your workflow is retrieval-heavy RAG, this upgrade is a clear regression. The long-document retrieval slot now belongs to GPT-5.5: Graphwalks BFS at 1M went from GPT-5.4’s 9.4% to 45.4%. The reflex of picking the Opus flagship as the default for long-document work no longer holds in spring 2026.
Scenario two. OpenAI presented computer use as one of the headline new capabilities in the GPT-5.5 launch, with the eye-catching OSWorld-Verified score of 78.7%, and the announcement specifically called out browser interaction and file operations. It sounds like something you could drop into your agent product right away, so you decide that next week you’ll switch your agent over to GPT-5.5 and let it drive a user’s browser for automation tasks. Then you open the Responses API docs and discover that the computer-use-preview tool is still at the GPT-5.4 capability level; the 78.7% figure for 5.5 has not been synced into the API. Where does the 78.7% actually live? The answer is the Codex desktop app on macOS: users have to install Codex, install a plugin, and grant macOS accessibility permissions, and users in the EU and UK can’t use it at all. You thought you were getting GPT-5.5’s new capability; what you can actually access is a product form locked to a specific OS, a specific app, and specific regions.
Both of these are landmines you only discover after using the models. Spring 2026 has seen a dense burst of frontier releases. Each vendor’s strengths, weaknesses, access paths, and pricing break-points are different. Using them well takes time — poking at the edges, cross-referencing system cards, reading release notes, following community signal. This article collects those landmines so you don’t have to walk the path yourself.
GPT-5.5, released on 2026-04-23, is positioned as the all-rounder: it sits in the top tier simultaneously on agentic coding, long context, and reasoning, but takes a clear hit on factual reliability. On launch day it scored 60 on the Artificial Analysis Intelligence Index, ahead of the three-way tie at 57 (AA subsequently re-ran its backlog and currently has GPT-5.4, Opus 4.7, and Gemini 3.1 Pro tied at 57; GPT-5.5’s full AA v4.0 number has not yet been published). Its biggest single-metric lead is in agentic coding loops: Terminal-Bench 2.0 at 82.7%, 13 percentage points above Opus 4.7’s 69.4%, the widest single-benchmark gap across this whole release cycle. Long context became meaningfully stable this time: Graphwalks BFS at 1M scale jumped from GPT-5.4’s 9.4% to 45.4% (OpenAI benchmark table). Pure reasoning is also top-tier: FrontierMath Tier 1-3 at 51.7%, Tier 4 at 35.4%.
The cost is mostly in price. API per-token pricing doubled to $5/$30 per 1M tokens; AA’s measurement has per-task cost rising about 20% versus GPT-5.4 (per-token prices doubled, while token consumption fell about 40%). Factual reliability is its biggest weakness: AA-Omniscience measured a hallucination rate of 86%, near the bottom of the market, compared to Opus 4.7’s 36% (AA Omniscience report). OpenAI’s own phrasing in System Card §6.1 is that “individual claims are 23% more likely to be factually correct” versus GPT-5.4, but that figure comes from OpenAI’s chosen evaluation set and points in a different direction from AA’s independent measurement (GPT-5.5 System Card). If you dispatch GPT-5.5 to fact-sensitive tasks, you have to accept its tendency to guess hard when it doesn’t know.
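The +20% per-task figure follows directly from the two numbers AA reports. A minimal sketch of that arithmetic, using only the doubling and the 40% token reduction stated above:

```python
# Sanity check on AA's per-task cost claim for GPT-5.5: per-token price
# doubles, while the model uses ~40% fewer tokens per task than GPT-5.4.
def per_task_cost_ratio(price_multiplier: float, token_reduction: float) -> float:
    """Relative per-task cost versus the previous model."""
    return price_multiplier * (1.0 - token_reduction)

ratio = per_task_cost_ratio(price_multiplier=2.0, token_reduction=0.4)
print(f"GPT-5.5 per-task cost vs GPT-5.4: {ratio:.2f}x")  # 2.0 * 0.6 = 1.20x, i.e. ~+20%
```

The same two-factor check is worth running on any release that pairs a price change with a claimed efficiency change; the two rarely cancel exactly.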
Claude Opus 4.7, released on 2026-04-16, took the opposite route. Unlike GPT-5.5’s all-rounder positioning, Anthropic made a visibly specialist trade this cycle, pushing Opus into the senior-code-engineer seat: real GitHub issue resolution, tool-use precision, and factual reliability are all market-first, while long-document RAG and general conversational warmth were deliberately sacrificed. SWE-Bench Pro at 64.3% is the GA market first, 5.7 percentage points ahead of GPT-5.5’s 58.6%. MCP-Atlas tool use at 77.3% beats GPT-5.5’s 75.3%. AA’s GDPval-AA comes in at 1753 Elo, 79 Elo ahead of second-place Sonnet 4.6 at 1674. The most visible change is on hallucination: the rate fell from Opus 4.6’s 61% to 36%. The mechanism is worth decomposing. AA’s report shows accuracy is unchanged; the attempt rate dropped from 82% to 70%, which means 12 percentage points of hard guessing got converted into admitting “I don’t know” (AA Opus 4.7 report). The 4.6-to-4.7 improvement is about being willing to refuse, not about being more often correct.
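The attempt-rate decomposition above can be made concrete. In the sketch below the accuracy value is a placeholder assumption (AA reports it unchanged but I am not quoting a published number); note that the 12-point result does not depend on what accuracy actually is:

```python
# Decomposing the Opus 4.6 -> 4.7 hallucination change via AA's attempt
# rates. Accuracy (correct/total) is held fixed, per AA's finding; the
# 0.54 below is a placeholder assumption, not a published number.
def incorrect_share(attempt_rate: float, accuracy: float) -> float:
    """Share of all questions answered incorrectly (hard guesses)."""
    return attempt_rate - accuracy

opus_46 = incorrect_share(attempt_rate=0.82, accuracy=0.54)
opus_47 = incorrect_share(attempt_rate=0.70, accuracy=0.54)
print(round(opus_46 - opus_47, 2))  # 0.12 -> 12 points of guessing became refusals
```

Since accuracy cancels in the subtraction, the 12-point conversion from guesses to refusals holds for any fixed accuracy level.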
Opus 4.7 paid two prices for this positioning. First, 1M long-context retrieval regressed from 91.9% to 59.2% (Anthropic disclosed this in the System Card itself), a deliberate trade-off and the cause of scenario one. Second, creative writing and conversational warmth. On launch day Reddit lit up with “Opus 4.7 is dogshit” posts; community feedback was that it’s verbose and templated, and that the old warmth is gone. Yet boringbot ran 5 structured PM tasks in a blind test (PRD, exec summary, RICE, user research, GTM) and Opus 4.7 swept all five. So this isn’t a model regression; it’s Anthropic actively repositioning. If your user base is creative writing, brainstorming, or role-play, that’s a negative signal; if it’s structured knowledge work, a positive one.
Gemini 3.1 Pro is Google’s flagship reasoning model for spring 2026. It sits in the same 57 tier as Opus 4.7 on the AA Intelligence Index, with cost-performance as its main edge: AA’s Cost to Run Index comes out to only $892, versus $2,304 for GPT-5.2 and $2,486 for Opus 4.6, roughly 60% cheaper than either. It’s the clear market first on a handful of specific tasks: Video-MMMU 87.6%, ScreenSpot-Pro 72.7%, OmniDocBench edit distance 0.115. Video understanding, UI screenshot operation, and PDF document extraction are its home turf. BrowseComp search integration at 85.9% is also market first.
Its main flaw is what the community has named “parametric hubris”: when it doesn’t know the answer it almost never refuses, and instead confidently fabricates. Its AA-Omniscience hallucination rate is 88%, slightly above GPT-5.5’s 86%. But its accuracy is also first (55.9%): it knows the most and fabricates the most. The most telling signal comes from behavior: Google’s own Antigravity IDE positions it as a fallback for when Claude isn’t available, and developer feedback is that over long agent loops it hallucinates code until the project is unsalvageable (critical Antigravity Medium post). That’s Google’s own product voting with its feet. The other weakness is stability over long-running agent loops: Terminal-Bench 2.0 is only 68.5%, and VERTU measured memory leaks in long sessions. Gemini 3.1 Pro fits short tasks, high-value work, and multimodal or search-heavy scenarios. It doesn’t fit 24/7 autonomous agents.
DeepSeek V4 was released on 2026-04-23, the same day as GPT-5.5. DeepSeek’s own positioning in the launch material is direct: 3 to 6 months behind the frontier, 9 to 30 times cheaper (via Gizmodo). V4-Pro is priced at $1.74/$3.48 per 1M tokens and V4-Flash at $0.14/$0.28, roughly 1/9 and 1/30 of GPT-5.5 respectively. Both are open-sourced under the MIT license and self-hostable.
The “3-6 months behind” line averages across all tasks and obscures the real benchmark spread. On short tasks with verifiable answers, V4 is market first: LiveCodeBench 93.5 and Codeforces 3206 Elo, beating Opus 4.7, GPT-5.4, and Gemini 3.1 Pro. Chinese-language performance is a long-standing lead: Chinese-SimpleQA 84.4, with US frontier models around 76, an 8-point gap. The technical report discloses one understated but important data point: V4-Pro uses only 27% of DeepSeek V3.2’s inference FLOPs and 10% of the KV cache at 1M context (DeepSeek V4 Technical Report). In other words, V4’s low price comes in large part from architectural efficiency, not from burning money on subsidies.
The real weakness is long-horizon agentic work. Terminal-Bench 2.0 at 67.9% versus GPT-5.5’s 82.7% is a 15-point deficit, V4’s widest on any benchmark. SWE-Bench Pro at 55.4% is also nearly 9 points below Opus 4.7’s 64.3%. Vision is absent entirely; the V4 preview is text-only.
V4’s biggest challenge for commercial deployment isn’t benchmarks, it’s compliance. Italy, Taiwan, Australia, and South Korea have banned the DeepSeek app; NASA and the US Navy prohibit employee use; South Korea’s PIPC publicly reported that personal data of 1 million Koreans flowed to China without authorization (CSIS analysis). For finance, healthcare, legal, and government applications, “we use DeepSeek’s official API” won’t pass audit. The issue isn’t the model itself; it’s that the data has to flow under PRC jurisdiction. The open-weights release unlocks self-hosting as a path around this, which is the core difference separating V4 from other Chinese closed-source models: you can run V4-Pro inside your own VPC, data never leaves, and compliance risk drops to the level of any open-weight model.
The capability profile is only half the story. The other half is what you hit when you actually integrate. Here are the landmines that have shown up repeatedly in the community within days of release.
GPT-5.5 computer use is locked to the Codex desktop app. Already covered in scenario two: GPT-5.5’s OSWorld 78.7% capability currently only works inside the macOS Codex desktop app. The computer-use tool in the Responses API is at GPT-5.4 level, and the EU/UK aren’t enabled. If you’re building an API-based agent, you can’t get to it. Opus 4.7’s computer use, in contrast, goes through a standard Messages API tool, computer_20251124, with 3.75 MP resolution and 1:1 coordinate mapping, so integration cost is much lower. OSWorld scores differ by less than a point between the two (78.0% versus 78.7%), but whether you can actually use the capability is a completely different question.
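The Anthropic path can be sketched as a plain request payload. This is a hypothetical sketch, not a verified schema: the tool type string comes from the release notes above, but the field names (display_width_px and so on) follow Anthropic’s earlier computer-use tool versions, and the model id is illustrative. Check both against the current docs before relying on them.

```python
# Sketch of an Opus 4.7 computer-use request via the Messages API.
# No network call is made here; this just shows the payload shape.
computer_tool = {
    "type": "computer_20251124",   # tool version quoted in the release notes above
    "name": "computer",
    "display_width_px": 1920,      # example values; 3.75 MP resolution ceiling
    "display_height_px": 1080,
}

request = {
    "model": "claude-opus-4-7",    # illustrative model id
    "max_tokens": 1024,
    "tools": [computer_tool],
    "messages": [{"role": "user", "content": "Open the downloads folder."}],
}

print(request["tools"][0]["type"])  # computer_20251124
```

The point of the sketch is the integration surface: a dict in an API call, versus a desktop app you cannot reach programmatically.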
Opus 4.7 doubles per-token pricing above 200K prompt length. Opus 4.7’s sticker price of $5/$25 looks competitive with GPT-5.5’s $5/$30, but above 200K the price jumps to $10/$37.50, while GPT-5.5 stays a flat $5/$30 across the full 1M. On long prompts, Opus 4.7 input therefore costs double GPT-5.5’s, and its output runs 25% higher. Anthropic also changed the tokenizer in the same release; the same content now counts as 1.0 to 1.35× as many tokens, with code and non-English content expanding most. If your workload involves large documents or large codebases, budget for roughly 20% above the sticker.
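The break-point is easy to encode. A minimal cost function using the tiered prices quoted above, under one stated assumption: as with Anthropic’s earlier long-context pricing, a request whose prompt exceeds 200K is assumed to bill entirely at the higher rate (adjust if actual billing turns out to be marginal per tier):

```python
def opus_47_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost for one Opus 4.7 request with the 200K price cliff.
    Assumes the whole request bills at the >200K rate once the prompt
    crosses 200K input tokens."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 5.00, 25.00    # $/1M tokens
    else:
        in_rate, out_rate = 10.00, 37.50
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(opus_47_cost(150_000, 4_000))  # 0.85
print(opus_47_cost(300_000, 4_000))  # 3.15
```

Note the jump: doubling the prompt from 150K to 300K quadruples the bill, not doubles it, because the rate tier changes at the same time as the volume.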
Gemini 3.1 Pro’s 200K cost cliff. Gemini is $2/$12 at ≤200K and jumps to $4/$18 above it. Verdent AI’s engineering writeup recorded a classic trap: an agent feeds its own outputs back into the next turn’s context, stays under 200K for the first three turns, crosses the line in the fourth, and every subsequent token is billed at the higher rate (Verdent AI writeup). It’s not a model problem; it’s a cost break-point that has to be considered explicitly in agent design.
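The Verdent trap can be reproduced in a few lines. The loop below simulates an agent whose context grows each turn by its own output; the starting context and per-turn output sizes are illustrative, and the whole turn is assumed to bill at the higher tier once context crosses 200K:

```python
def gemini_turn_cost(context_tokens: int, output_tokens: int) -> float:
    """Per-turn USD cost with the 200K break-point quoted above."""
    if context_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # $/1M tokens
    else:
        in_rate, out_rate = 4.00, 18.00
    return (context_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# An agent that feeds its own output back in: context grows each turn.
context = 60_000
for turn in range(1, 6):
    output = 40_000
    cost = gemini_turn_cost(context, output)
    print(f"turn {turn}: context={context:>7}  cost=${cost:.2f}")
    context += output  # next turn's prompt includes this turn's output
```

With these numbers the agent crosses 200K on turn 5 and that turn’s cost nearly doubles relative to turn 4, even though the context only grew by the usual increment. Truncation or summarization before the break-point is a design decision, not an optimization.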
GitHub Copilot’s multiplier is not tied to sticker price. Copilot set both GPT-5.5 and Opus 4.7 at a 7.5× premium-request multiplier, where GPT-5.4 sits at 1.0× and Opus 4.6 sat at 3.0×. In other words, Copilot’s quota model treats this generation as 7.5× more expensive to serve, not 2× as the sticker would suggest. The typical r/GithubCopilot reaction is “time to move to an Anthropic direct sub”. Middleman platform pricing often reflects real serving cost better than stickers do.
DeepSeek V4 has three access paths with different risk profiles. The official API (api.deepseek.com) is the cheapest, but data resides in China, so it won’t pass compliance-sensitive scenarios. Third-party inference (OpenRouter, DeepInfra, Fireworks, Together) matches official pricing and routes around Chinese data residency, but you have to verify where each provider actually hosts the weights and whether your traffic gets re-routed across additional downstream providers. Self-hosting (MIT license) is the cleanest compliance path: V4-Flash runs on a single H200 (FP4+FP8, ~158GB), while V4-Pro needs 8x H100 80GB (~862GB). WaveSpeed’s measurement: the official API wins below about 50M tokens/day; self-hosting breaks even somewhere above 300M/day.
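A rough break-even sketch between the official API and self-hosting V4-Pro on 8x H100, using the official per-token prices quoted above. The GPU rental rate and the input/output token split are placeholder assumptions, not measured numbers; plug in your own, and note that real self-hosting also pays for throughput ceilings, ops time, and redundancy, which is one way to reconcile this toy model with WaveSpeed’s higher break-even:

```python
def api_cost_per_day(tokens_per_day: float, input_share: float = 0.8) -> float:
    """Official V4-Pro API cost per day; input_share is an assumption."""
    in_rate, out_rate = 1.74, 3.48        # V4-Pro $/1M tokens
    return tokens_per_day / 1e6 * (input_share * in_rate + (1 - input_share) * out_rate)

def selfhost_cost_per_day(gpu_hour_usd: float = 2.5, gpus: int = 8) -> float:
    """Fixed daily GPU rental; rate is a placeholder assumption."""
    return gpu_hour_usd * gpus * 24

for daily_tokens in (50e6, 150e6, 300e6):
    print(f"{daily_tokens / 1e6:>4.0f}M tokens/day  "
          f"api=${api_cost_per_day(daily_tokens):8.2f}  "
          f"selfhost=${selfhost_cost_per_day():.2f}")
```

The useful property of the model is its shape, not its exact crossover: API cost scales linearly with volume while self-hosting is a step function, so low-volume teams should not self-host for cost reasons alone.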
Opus 4.7 extended thinking removed entirely. The old thinking={"type": "enabled", "budget_tokens": ...} now returns a 400 error directly. You have to migrate to thinking={"type": "adaptive"} plus output_config={"effort": ...}. If your code hardcodes a thinking budget, you need to rewrite it before switching. Caylent specifically flagged this as a breaking change.
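A small migration helper for that breaking change, sketched under assumptions: the two parameter shapes are the ones quoted above, but the budget-to-effort thresholds and the "low"/"medium"/"high" effort values are placeholder heuristics, not an official equivalence.

```python
# Translate removed Opus 4.6-style thinking params into the 4.7 form.
def migrate_thinking(params: dict) -> dict:
    thinking = params.get("thinking", {})
    if thinking.get("type") != "enabled":
        return params                      # nothing to migrate
    budget = thinking.get("budget_tokens", 0)
    # Placeholder mapping from token budget to effort level.
    effort = "low" if budget < 4_000 else "medium" if budget < 16_000 else "high"
    migrated = dict(params)
    migrated["thinking"] = {"type": "adaptive"}
    migrated["output_config"] = {"effort": effort}
    return migrated

old = {"model": "claude-opus-4-7",
       "thinking": {"type": "enabled", "budget_tokens": 8000}}
print(migrate_thinking(old)["output_config"])  # {'effort': 'medium'}
```

Running every outbound request through a shim like this during the cutover window is cheaper than hunting down each hardcoded budget by hand.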
| Scenario | Example | Recommended model | Rationale |
|---|---|---|---|
| Agentic coding loop | 30+ step shell / build / test / fix | Primary: GPT-5.5. Backup: Opus 4.7 | On Terminal-Bench 2.0, GPT-5.5 at 82.7% leads Opus 4.7’s 69.4% by 13 percentage points — the widest single-benchmark lead for GPT-5.5. DeepSeek V4 not dispatched (67.9%, 15 points behind) |
| Real GitHub issue resolution / cross-file refactor / production-grade PR resolution | Live bug regression, dependency upgrades, cross-module rewrites | Primary: Opus 4.7. Backup: GPT-5.5 | SWE-Bench Pro 64.3% is the GA market first, 5.7 points ahead of GPT-5.5’s 58.6%. Real-world behavioral evidence: Cursor’s internal CursorBench 58% → 70%, Rakuten reports 3× more production tasks resolved. DeepSeek V4-Pro as alternate (SWE-Pro 55.4%, 9 points behind, but 9× cheaper) |
| Fact-sensitive reports | Regulatory analysis, financial summaries, medical-text cleanup | Primary: Opus 4.7 | AA-Omniscience hallucination rate 36% is the only one on the market below 50%. GPT-5.5 is 86%, Gemini 3.1 Pro is 88%. The gap is too large to mix |
| Pure reasoning / math | FrontierMath, AIME, proof tasks | Primary: GPT-5.5. Backup: Gemini 3.1 Pro | FrontierMath Tier 4 GPT-5.5 35.4% versus Opus 4.7’s 23%, stable gap. Gemini 3.1 Pro AIME 95% can substitute at the same tier |
| Long-document retrieval / RAG | Contract analysis, long meeting transcript lookup, precise long-codebase locating | Primary: GPT-5.5 or Opus 4.6 | Graphwalks BFS 1M GPT-5.4 9.4% → GPT-5.5 45.4%. Opus 4.7 not dispatched: 1M needle-in-haystack 59.2%, 32 points below 4.6’s 91.9% |
| Long-document reasoning | Multi-hop reasoning across 500K+ context, long-codebase architecture analysis | Primary: Opus 4.7 | Reasoning quality at long context (not retrieval) stays top-tier for Opus 4.7, consistent with the 1753 Elo GDPval-AA agentic score |
| Multimodal | Video comprehension, UI screenshot operation, PDF document extraction | Primary: Gemini 3.1 Pro | Video-MMMU 87.6%, ScreenSpot-Pro 72.7%, OmniDocBench 0.115 — clean leads on all three. GPT-5.5’s launch barely touched multimodal; DeepSeek V4 preview is text-only |
| Search-heavy agentic deep research | Multi-round retrieval + report synthesis | Primary: Gemini 3.1 Pro. Backup: GPT-5.5 | BrowseComp 85.9% first; Search Grounding 5,000 queries/month free is an extra edge. Opus 4.7 not dispatched (BrowseComp 79.3%, the only regression this release) |
| Computer use / RPA / screen automation | Let the agent drive a user’s browser or desktop apps | Primary: Opus 4.7 | OSWorld 78.0% versus GPT-5.5 78.7% is effectively a tie, but Anthropic’s computer use is a standard Messages API tool with low integration cost. GPT-5.5 currently only ships via the macOS Codex desktop app, not portable |
| High-throughput, low-complexity API calls | Classification, summarization, extraction, form filling | Primary: DeepSeek V4-Flash or Sonnet 4.6 | V4-Flash $0.14/$0.28 + 90% cache-hit discount gets effective input cost down to $0.028/M. GPT-5.5 not dispatched: capability over-spec + doubled pricing will eat margin directly |
| Chinese-language tasks | Chinese customer service, Chinese content production, Chinese QA | Primary: DeepSeek V4-Pro or Gemini 3.1 Pro | Chinese-SimpleQA DeepSeek 84.4, Gemini 3.1 Pro 85.9, US frontier models around 76 — the 8-point gap is a long-term gap |
| Cost-sensitive side projects / academic research | Personal projects, hackathons, experiments without commercial margin constraints | Primary: DeepSeek V4 | Open weights, 1M context, $0.14/M input — the cost structure allows for high-throughput experimentation. Cline CEO’s anchor: if Uber used DeepSeek instead of Claude, its 2026 AI budget could last 7 years instead of 4 months |
| Finance / healthcare / legal / government compliance | Production pipelines that need to clear audit, GDPR, HIPAA, SOC 2 | Primary: Opus 4.7 via Bedrock, GPT-5.5 via Azure, Gemini via Vertex | DeepSeek V4 official API not dispatched: banned by multiple governments, GDPR/CCPA cross-border risk, audit won’t pass. If DeepSeek is a must, the only route is self-hosting inside your own VPC |
| Creative writing / dialogue / brainstorming | Fiction, role-play, ideation | Primary: Gemini 3 Pro or stay on Opus 4.6 | Opus 4.7 not dispatched: Zvi Mowshowitz’s “literal instruction following” observation, the Reddit “dogshit” threads, and boringbot’s PM blind test all point to the same conclusion — 4.7 is strong on structured output and weak on conversational warmth |
What’s most informative about the table above isn’t any individual dispatch rule — it’s the pattern you see when you look at the whole thing: in spring 2026, no single model is optimal across every scenario. GPT-5.5 is first on agentic coding and reasoning, but near market-bottom on hallucination. Opus 4.7 is first on long-horizon coding and factual reliability, but regressed on 1M retrieval and on creative writing. Gemini 3.1 Pro is first on multimodal and browsing and has the best cost-performance, but shares the hallucination problem. DeepSeek V4 is first on price and short-task reasoning, but 15 points behind on long agentic and hits compliance as a hard blocker.
This distribution looks less like temporary market noise and more like a necessary consequence of the capability level these models have reached. Each vendor’s training budget, data, product positioning, and compliance boundaries push them to different trade-off points. OpenAI turned Codex into a superapp and locked computer use inside the desktop app. Anthropic swapped 1M retrieval capacity for agentic reasoning. Google pushed pricing to a third of competitors while sacrificing factual reliability. DeepSeek differentiates on 9-30× cheaper plus MIT license. Each of these is a deliberate position, not a bug: every vendor is defining the task types most profitable to them and then building the product, pricing, and compliance combo around it.
Binding all your AI needs to any one vendor means continuously carrying risk on the dimensions you didn’t select for. If 80% of your pipeline is long-horizon coding plus real issue resolution, Opus 4.7 is obviously the workhorse; but for the remaining 20% of multimodal, search, and high-throughput batch work, forcing Opus 4.7 through those tasks is just burning money. Conversely, hanging everything on GPT-5.5 means accepting an 86% hallucination rate on fact-sensitive tasks, paying a doubled sticker price on high-throughput work, and essentially giving up on multimodal.
The right move is to treat this like assembling a team. Different tasks go to different models, and each model does the slice it’s best at: long-horizon coding to Opus 4.7, agentic shell to GPT-5.5, multimodal and browsing to Gemini, high-throughput batch to DeepSeek V4-Flash, Chinese or self-hosted sensitive data to DeepSeek V4-Pro. This isn’t over-engineering — it’s the natural fit given each model’s strengths and weaknesses. The price is maintaining multi-provider integration complexity. The return is using market-best on every task type, instead of paying extra or taking quality hits in some vendor’s weak zone.
Adopting this line of thinking changes two concrete things. One, your cost model has to be layered by task type, not computed on a single per-token basis. A penny-per-call high-frequency low-complexity task and a five-dollar-per-call long-horizon agent are completely different things and can’t share one budget. Two, evaluation has to be built in. Define 5 to 10 representative cases per task type and run them any time you consider switching models. This beats looking at AA Intelligence Index or SWE-Bench Pro numbers — those benchmarks express an average across general cases, and your production traffic distribution likely doesn’t match any of them.
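The built-in evaluation described above needs very little machinery to start. A minimal per-task-type harness, where the cases and the exact-substring pass check are illustrative stand-ins for your real cases and scoring, and call_model is a stub for your actual provider client:

```python
from typing import Callable

# A handful of representative (prompt, expected) cases per task type.
CASES = {
    "fact_sensitive": [
        ("What year did GDPR take effect?", "2018"),
    ],
    "extraction": [
        ("Extract the total from: 'Invoice total: $42.00'", "$42.00"),
    ],
}

def evaluate(call_model: Callable[[str], str]) -> dict:
    """Fraction of cases passed per task type (exact-substring check)."""
    scores = {}
    for task_type, cases in CASES.items():
        passed = sum(expected in call_model(prompt) for prompt, expected in cases)
        scores[task_type] = passed / len(cases)
    return scores

# Stub model that happens to answer both cases, to show the output shape.
print(evaluate(lambda prompt: "2018 ... $42.00"))
```

Run this against the old model and the candidate, per task type, and the switch decision becomes a diff of two score dicts instead of an argument about leaderboard numbers.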
Two longer-term implications follow. First, this jagged frontier (Ethan Mollick’s term) isn’t going to converge in the short term. The four vendors’ trade-off points pull in different directions, and each is actively reinforcing its differentiation: OpenAI’s superapp route, Anthropic’s agentic-worker route, Google’s cost-performance-plus-multimodal route, DeepSeek’s open-weight-plus-price route. By 2027 the distribution will be clearer, but it won’t return to a “one model rules all” state.
Second, your system architecture therefore needs an abstraction layer. Don’t hardwire your business code into any single vendor’s SDK, and don’t hardcode a specific model into your product. By “abstraction layer” I don’t mean a specific tool; it’s an architectural principle: your system needs switching room on the model-selection dimension, with an explicit routing layer between task type and model, where each task type can be evaluated and switched independently. This kind of principle-level decision matters more than picking today’s winner. Which model is strongest today is something you have to re-evaluate every three to six months; being able to switch by task is something you only have to architect once and can live with for years.
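The routing principle fits in a dozen lines. A sketch of the explicit task-type-to-model mapping, where the model id strings are illustrative labels rather than verified API names:

```python
# Routing table: task type -> model id. Swapping a model is a config
# change here, never a change to business code.
ROUTES = {
    "long_horizon_coding": "claude-opus-4-7",
    "agentic_shell":       "gpt-5.5",
    "multimodal":          "gemini-3.1-pro",
    "batch_extraction":    "deepseek-v4-flash",
    "default":             "gpt-5.5",
}

def pick_model(task_type: str, routes: dict = ROUTES) -> str:
    """Resolve a task type to a model id, falling back to the default."""
    return routes.get(task_type, routes["default"])

print(pick_model("multimodal"))    # gemini-3.1-pro
print(pick_model("unknown_task"))  # gpt-5.5 (falls back to default)
```

In production this table would live in config, each route would carry its own eval suite from the harness above it in the stack, and the function behind it would handle per-vendor request shaping; the principle is that model choice is data, not code.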
The teams that do this well are the ones that, every time a new release drops, first run it through their own task matrix, then decide which tasks to switch and which not to. That’s the baseline skill for AI selection in 2026.
Official primary sources:
- OpenAI: Introducing GPT-5.5
- GPT-5.5 System Card (PDF)
- Anthropic: Claude Opus 4.7 release
- Anthropic Opus 4.7 System Card (PDF)
- Google DeepMind: Gemini 3.1 Pro model card
- DeepSeek V4 Technical Report (PDF)
- DeepSeek V4 API pricing

Independent benchmarks and analysis:
- Artificial Analysis: GPT-5.5 is the new leading AI model
- Artificial Analysis: Opus 4.7 deep dive
- Artificial Analysis: Gemini 3.1 Pro Preview
- Artificial Analysis: DeepSeek V4-Pro
- LLM-Stats: GPT-5.5 vs Opus 4.7
- Vellum: Opus 4.7 benchmarks explained
- Epoch AI: Opus 4.7 ECI score

Independent technical commentary:
- Simon Willison: GPT-5.5
- Simon Willison: DeepSeek V4 — almost on the frontier
- Ethan Mollick: Sign of the future
- Zvi Mowshowitz: Opus 4.7 Part 1
- Zvi Mowshowitz: Opus 4.7 Part 2
- Jake Handy: DeepSeek V4

Ecosystem and compliance:
- CSIS: Delving into Dangers of DeepSeek
- WIRED: DeepSeek data flow to China
- GitHub Copilot premium request billing table
- Lovable production data
- Decrypt: DeepSeek V4 + Cline CEO Uber anchor
- Verdent AI: Gemini 3.1 Pro engineering writeup