2026-04-24 survey.
Here are two things that can easily burn you right now.
Scenario one. Your agent pipeline handles a lot of long-document work: contract analysis, long-codebase reasoning, feeding in 8-hour meeting transcripts to generate structured summaries. You’ve been using Claude Opus 4.6 because it scored 91.9% on the 1M needle-in-haystack test, the best in the industry for long-context retrieval. On April 16 Opus 4.7 ships, price unchanged, leading across SWE-Bench Pro and the AA Intelligence Index. You flip production over without thinking twice. Two days later monitoring starts firing: on tasks above 600K tokens, output quality drops noticeably, and key facts from the back half of the document routinely go missing. You dig into Anthropic’s System Card and find the line: Opus 4.7 scores 59.2% on the 1M needle-in-haystack, 32 percentage points below 4.6. Anthropic deliberately traded retrieval capacity for agentic reasoning. If your workflow is retrieval-heavy RAG, this upgrade is a clear regression. The long-document retrieval slot now belongs to GPT-5.5: Graphwalks BFS at 1M went from GPT-5.4’s 9.4% to 45.4%. The reflex of picking the Opus flagship as the default for long-document work no longer holds in spring 2026.
Scenario two. OpenAI presented computer use as one of the headline new capabilities in the GPT-5.5 launch, with the eye-catching OSWorld-Verified score of 78.7%, and the announcement specifically called out browser interaction and file operations. It sounds like something you could drop into your agent product right away, so you decide that next week you’ll switch your agent over to GPT-5.5 and let it drive a user’s browser for automation tasks. Then you open the Responses API docs and discover that the computer-use-preview tool is still at the GPT-5.4 capability level; the 78.7% figure for 5.5 has not been synced into the API. Where does the 78.7% actually live? The answer is the Codex desktop app on macOS: users have to install Codex, install a plugin, and grant macOS accessibility permissions, and users in the EU and UK can’t use it at all. You thought you were getting GPT-5.5’s new capability; what you can actually access is a product form locked to a specific OS, a specific app, and specific regions.
Both of these are landmines you only discover after using the models. Spring 2026 has seen a dense burst of frontier releases. Each vendor’s strengths, weaknesses, access paths, and pricing break-points are different. Using them well takes time — poking at the edges, cross-referencing system cards, reading release notes, following community signal. This article collects those landmines so you don’t have to walk the path yourself.
GPT-5.5, released on 2026-04-23, is positioned as the all-rounder: it sits in the top tier simultaneously on agentic coding, long context, and reasoning, but takes a clear hit on factual reliability. On launch day it scored 60 on the Artificial Analysis Intelligence Index, ahead of the three-way tie at 57 (AA subsequently re-ran its backlog and currently has GPT-5.4, Opus 4.7, and Gemini 3.1 Pro tied at 57; GPT-5.5’s full AA v4.0 number has not yet been published). Its biggest single-metric lead is in agentic coding loops: Terminal-Bench 2.0 at 82.7%, 13 percentage points above Opus 4.7’s 69.4%, the widest single-benchmark gap across this whole release cycle. Long context became meaningfully stable this time: Graphwalks BFS at 1M scale jumped from GPT-5.4’s 9.4% to 45.4% (OpenAI benchmark table). Pure reasoning is also top-tier: FrontierMath Tier 1-3 at 51.7%, Tier 4 at 35.4%.
The cost is mostly in price. API per-token pricing doubled to $5/$30 per 1M tokens; AA’s measurement has per-task cost rising about 20% versus GPT-5.4 (per-token prices doubled, while token consumption fell about 40%). Factual reliability is its biggest weakness: AA-Omniscience measured a hallucination rate of 86%, near the bottom of the market, compared to Opus 4.7’s 36% (AA Omniscience report). OpenAI’s own phrasing in System Card §6.1 is that “individual claims are 23% more likely to be factually correct” versus GPT-5.4, but that figure comes from OpenAI’s chosen evaluation set and points in a different direction from AA’s independent measurement (GPT-5.5 System Card). If you dispatch GPT-5.5 to fact-sensitive tasks, you have to accept its tendency to guess hard when it doesn’t know.
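The +20% per-task figure follows directly from the two numbers AA reports. A minimal sketch of that arithmetic, using only the doubling and the 40% token reduction stated above:

```python
# Sanity check on AA's per-task cost claim for GPT-5.5: per-token price
# doubles, while the model uses ~40% fewer tokens per task than GPT-5.4.
def per_task_cost_ratio(price_multiplier: float, token_reduction: float) -> float:
    """Relative per-task cost versus the previous model."""
    return price_multiplier * (1.0 - token_reduction)

ratio = per_task_cost_ratio(price_multiplier=2.0, token_reduction=0.4)
print(f"GPT-5.5 per-task cost vs GPT-5.4: {ratio:.2f}x")  # 2.0 * 0.6 = 1.20x, i.e. ~+20%
```

The same two-factor check is worth running on any release that pairs a price change with a claimed efficiency change; the two rarely cancel exactly.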
Claude Opus 4.7, released on 2026-04-16, took the opposite route. Unlike GPT-5.5’s all-rounder positioning, Anthropic made a visibly specialist trade this cycle, pushing Opus into the senior-code-engineer seat: real GitHub issue resolution, tool-use precision, and factual reliability are all market-first, while long-document RAG and general conversational warmth were deliberately sacrificed. SWE-Bench Pro at 64.3% is the GA market first, 5.7 percentage points ahead of GPT-5.5’s 58.6%. MCP-Atlas tool use at 77.3% beats GPT-5.5’s 75.3%. AA’s GDPval-AA comes in at 1753 Elo, 79 Elo ahead of second-place Sonnet 4.6 at 1674. The most visible change is on hallucination: the rate fell from Opus 4.6’s 61% to 36%. The mechanism is worth decomposing. AA’s report shows accuracy is unchanged; the attempt rate dropped from 82% to 70%, which means 12 percentage points of hard guessing got converted into admitting “I don’t know” (AA Opus 4.7 report). The 4.6-to-4.7 improvement is about being willing to refuse, not about being more often correct.
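The attempt-rate decomposition above can be made concrete. In the sketch below the accuracy value is a placeholder assumption (AA reports it unchanged but I am not quoting a published number); note that the 12-point result does not depend on what accuracy actually is:

```python
# Decomposing the Opus 4.6 -> 4.7 hallucination change via AA's attempt
# rates. Accuracy (correct/total) is held fixed, per AA's finding; the
# 0.54 below is a placeholder assumption, not a published number.
def incorrect_share(attempt_rate: float, accuracy: float) -> float:
    """Share of all questions answered incorrectly (hard guesses)."""
    return attempt_rate - accuracy

opus_46 = incorrect_share(attempt_rate=0.82, accuracy=0.54)
opus_47 = incorrect_share(attempt_rate=0.70, accuracy=0.54)
print(round(opus_46 - opus_47, 2))  # 0.12 -> 12 points of guessing became refusals
```

Since accuracy cancels in the subtraction, the 12-point conversion from guesses to refusals holds for any fixed accuracy level.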
Opus 4.7 paid two prices for this positioning. First, 1M long-context retrieval regressed from 91.9% to 59.2% (Anthropic disclosed this in the System Card itself), a deliberate trade-off and the cause of scenario one. Second, creative writing and conversational warmth. On launch day Reddit lit up with “Opus 4.7 is dogshit” posts; community feedback was that it’s verbose and templated, and that the old warmth is gone. Yet boringbot ran 5 structured PM tasks in a blind test (PRD, exec summary, RICE, user research, GTM) and Opus 4.7 swept all five. So this isn’t a model regression; it’s Anthropic actively repositioning. If your user base is creative writing, brainstorming, or role-play, that’s a negative signal; if it’s structured knowledge work, a positive one.
Gemini 3.1 Pro is Google’s flagship reasoning model for spring 2026. It sits in the same 57 tier as Opus 4.7 on the AA Intelligence Index, with cost-performance as its main edge: AA’s Cost to Run Index comes out to only $892, versus $2,304 for GPT-5.2 and $2,486 for Opus 4.6, roughly 60% cheaper than either. It’s the clear market first on a handful of specific tasks: Video-MMMU 87.6%, ScreenSpot-Pro 72.7%, OmniDocBench edit distance 0.115. Video understanding, UI screenshot operation, and PDF document extraction are its home turf. BrowseComp search integration at 85.9% is also market first.
Its main flaw is what the community has named “parametric hubris”: when it doesn’t know the answer it almost never refuses, and instead confidently fabricates. Its AA-Omniscience hallucination rate is 88%, slightly above GPT-5.5’s 86%. But its accuracy is also first (55.9%): it knows the most and fabricates the most. The most telling signal comes from behavior: Google’s own Antigravity IDE positions it as a fallback for when Claude isn’t available, and developer feedback is that over long agent loops it hallucinates code until the project is unsalvageable (critical Antigravity Medium post). That’s Google’s own product voting with its feet. The other weakness is stability over long-running agent loops: Terminal-Bench 2.0 is only 68.5%, and VERTU measured memory leaks in long sessions. Gemini 3.1 Pro fits short tasks, high-value work, and multimodal or search-heavy scenarios. It doesn’t fit 24/7 autonomous agents.
DeepSeek V4 was released on 2026-04-23, the same day as GPT-5.5. DeepSeek’s own positioning in the launch material is direct: 3 to 6 months behind the frontier, 9 to 30 times cheaper (via Gizmodo). V4-Pro is priced at $1.74/$3.48 per 1M tokens and V4-Flash at $0.14/$0.28, roughly 1/9 and 1/30 of GPT-5.5 respectively. Both are open-sourced under the MIT license and self-hostable.
The “3-6 months behind” line averages across all tasks and obscures the real benchmark spread. On short tasks with verifiable answers, V4 is market first: LiveCodeBench 93.5 and Codeforces 3206 Elo, beating Opus 4.7, GPT-5.4, and Gemini 3.1 Pro. Chinese-language performance is a long-standing lead: Chinese-SimpleQA 84.4, with US frontier models around 76, an 8-point gap. The technical report discloses one understated but important data point: V4-Pro uses only 27% of DeepSeek V3.2’s inference FLOPs and 10% of the KV cache at 1M context (DeepSeek V4 Technical Report). In other words, V4’s low price comes in large part from architectural efficiency, not from burning money on subsidies.
The real weakness is long-horizon agentic work. Terminal-Bench 2.0 at 67.9% versus GPT-5.5’s 82.7% is a 15-point deficit, V4’s widest on any benchmark. SWE-Bench Pro at 55.4% is also nearly 9 points below Opus 4.7’s 64.3%. Vision is absent entirely; the V4 preview is text-only.
V4’s biggest challenge for commercial deployment isn’t benchmarks, it’s compliance. Italy, Taiwan, Australia, and South Korea have banned the DeepSeek app; NASA and the US Navy prohibit employee use; South Korea’s PIPC publicly reported that personal data of 1 million Koreans flowed to China without authorization (CSIS analysis). For finance, healthcare, legal, and government applications, “we use DeepSeek’s official API” won’t pass audit. The issue isn’t the model itself; it’s that the data has to flow under PRC jurisdiction. The open-weights release unlocks self-hosting as a path around this, which is the core difference separating V4 from other Chinese closed-source models: you can run V4-Pro inside your own VPC, data never leaves, and compliance risk drops to the level of any open-weight model.
The capability profile is only half the story. The other half is what you hit when you actually integrate. Here are the landmines that have shown up repeatedly in the community within days of release.
GPT-5.5 computer use is locked to the Codex desktop app. Already covered in scenario two: GPT-5.5’s OSWorld 78.7% capability currently only works inside the macOS Codex desktop app. The computer-use tool in the Responses API is at GPT-5.4 level, and the EU/UK aren’t enabled. If you’re building an API-based agent, you can’t get to it. Opus 4.7’s computer use, in contrast, goes through a standard Messages API tool, computer_20251124, with 3.75 MP resolution and 1:1 coordinate mapping, so integration cost is much lower. OSWorld scores differ by less than a point between the two (78.0% versus 78.7%), but whether you can actually use the capability is a completely different question.
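The Anthropic path can be sketched as a plain request payload. This is a hypothetical sketch, not a verified schema: the tool type string comes from the release notes above, but the field names (display_width_px and so on) follow Anthropic’s earlier computer-use tool versions, and the model id is illustrative. Check both against the current docs before relying on them.

```python
# Sketch of an Opus 4.7 computer-use request via the Messages API.
# No network call is made here; this just shows the payload shape.
computer_tool = {
    "type": "computer_20251124",   # tool version quoted in the release notes above
    "name": "computer",
    "display_width_px": 1920,      # example values; 3.75 MP resolution ceiling
    "display_height_px": 1080,
}

request = {
    "model": "claude-opus-4-7",    # illustrative model id
    "max_tokens": 1024,
    "tools": [computer_tool],
    "messages": [{"role": "user", "content": "Open the downloads folder."}],
}

print(request["tools"][0]["type"])  # computer_20251124
```

The point of the sketch is the integration surface: a dict in an API call, versus a desktop app you cannot reach programmatically.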
Opus 4.7 doubles per-token pricing above 200K prompt length. Opus 4.7’s sticker price of $5/$25 looks competitive with GPT-5.5’s $5/$30, but above 200K the price jumps to $10/$37.50, while GPT-5.5 stays a flat $5/$30 across the full 1M. On long prompts, Opus 4.7 input therefore costs double GPT-5.5’s, and its output runs 25% higher. Anthropic also changed the tokenizer in the same release; the same content now counts as 1.0 to 1.35× as many tokens, with code and non-English content expanding most. If your workload involves large documents or large codebases, budget for roughly 20% above the sticker.
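The break-point is easy to encode. A minimal cost function using the tiered prices quoted above, under one stated assumption: as with Anthropic’s earlier long-context pricing, a request whose prompt exceeds 200K is assumed to bill entirely at the higher rate (adjust if actual billing turns out to be marginal per tier):

```python
def opus_47_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost for one Opus 4.7 request with the 200K price cliff.
    Assumes the whole request bills at the >200K rate once the prompt
    crosses 200K input tokens."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 5.00, 25.00    # $/1M tokens
    else:
        in_rate, out_rate = 10.00, 37.50
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(opus_47_cost(150_000, 4_000))  # 0.85
print(opus_47_cost(300_000, 4_000))  # 3.15
```

Note the jump: doubling the prompt from 150K to 300K quadruples the bill, not doubles it, because the rate tier changes at the same time as the volume.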
Gemini 3.1 Pro’s 200K cost cliff. Gemini is $2/$12 at ≤200K and jumps to $4/$18 above it. Verdent AI’s engineering writeup recorded a classic trap: an agent feeds its own outputs back into the next turn’s context, stays under 200K for the first three turns, crosses the line in the fourth, and every subsequent token is billed at the higher rate (Verdent AI writeup). It’s not a model problem; it’s a cost break-point that has to be considered explicitly in agent design.
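The Verdent trap can be reproduced in a few lines. The loop below simulates an agent whose context grows each turn by its own output; the starting context and per-turn output sizes are illustrative, and the whole turn is assumed to bill at the higher tier once context crosses 200K:

```python
def gemini_turn_cost(context_tokens: int, output_tokens: int) -> float:
    """Per-turn USD cost with the 200K break-point quoted above."""
    if context_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # $/1M tokens
    else:
        in_rate, out_rate = 4.00, 18.00
    return (context_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# An agent that feeds its own output back in: context grows each turn.
context = 60_000
for turn in range(1, 6):
    output = 40_000
    cost = gemini_turn_cost(context, output)
    print(f"turn {turn}: context={context:>7}  cost=${cost:.2f}")
    context += output  # next turn's prompt includes this turn's output
```

With these numbers the agent crosses 200K on turn 5 and that turn’s cost nearly doubles relative to turn 4, even though the context only grew by the usual increment. Truncation or summarization before the break-point is a design decision, not an optimization.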
GitHub Copilot’s multiplier is not tied to sticker price. Copilot set both GPT-5.5 and Opus 4.7 at a 7.5× premium-request multiplier, where GPT-5.4 sits at 1.0× and Opus 4.6 sat at 3.0×. In other words, Copilot’s quota model treats this generation as 7.5× more expensive to serve, not 2× as the sticker would suggest. The typical r/GithubCopilot reaction is “time to move to an Anthropic direct sub”. Middleman platform pricing often reflects real serving cost better than stickers do.
DeepSeek V4 has three access paths with different risk profiles. The official API (api.deepseek.com) is the cheapest, but data resides in China, so it won’t pass compliance-sensitive scenarios. Third-party inference (OpenRouter, DeepInfra, Fireworks, Together) matches official pricing and routes around Chinese data residency, but you have to verify where each provider actually hosts the weights and whether your traffic gets re-routed across additional downstream providers. Self-hosting (MIT license) is the cleanest compliance path: V4-Flash runs on a single H200 (FP4+FP8, ~158GB), while V4-Pro needs 8x H100 80GB (~862GB). WaveSpeed’s measurement: the official API wins below about 50M tokens/day; self-hosting breaks even somewhere above 300M/day.
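A rough break-even sketch between the official API and self-hosting V4-Pro on 8x H100, using the official per-token prices quoted above. The GPU rental rate and the input/output token split are placeholder assumptions, not measured numbers; plug in your own, and note that real self-hosting also pays for throughput ceilings, ops time, and redundancy, which is one way to reconcile this toy model with WaveSpeed’s higher break-even:

```python
def api_cost_per_day(tokens_per_day: float, input_share: float = 0.8) -> float:
    """Official V4-Pro API cost per day; input_share is an assumption."""
    in_rate, out_rate = 1.74, 3.48        # V4-Pro $/1M tokens
    return tokens_per_day / 1e6 * (input_share * in_rate + (1 - input_share) * out_rate)

def selfhost_cost_per_day(gpu_hour_usd: float = 2.5, gpus: int = 8) -> float:
    """Fixed daily GPU rental; rate is a placeholder assumption."""
    return gpu_hour_usd * gpus * 24

for daily_tokens in (50e6, 150e6, 300e6):
    print(f"{daily_tokens / 1e6:>4.0f}M tokens/day  "
          f"api=${api_cost_per_day(daily_tokens):8.2f}  "
          f"selfhost=${selfhost_cost_per_day():.2f}")
```

The useful property of the model is its shape, not its exact crossover: API cost scales linearly with volume while self-hosting is a step function, so low-volume teams should not self-host for cost reasons alone.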
Opus 4.7 extended thinking removed entirely. The old thinking={"type": "enabled", "budget_tokens": ...} now returns a 400 error directly. You have to migrate to thinking={"type": "adaptive"} plus output_config={"effort": ...}. If your code hardcodes a thinking budget, you need to rewrite it before switching. Caylent specifically flagged this as a breaking change.
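A small migration helper for that breaking change, sketched under assumptions: the two parameter shapes are the ones quoted above, but the budget-to-effort thresholds and the "low"/"medium"/"high" effort values are placeholder heuristics, not an official equivalence.

```python
# Translate removed Opus 4.6-style thinking params into the 4.7 form.
def migrate_thinking(params: dict) -> dict:
    thinking = params.get("thinking", {})
    if thinking.get("type") != "enabled":
        return params                      # nothing to migrate
    budget = thinking.get("budget_tokens", 0)
    # Placeholder mapping from token budget to effort level.
    effort = "low" if budget < 4_000 else "medium" if budget < 16_000 else "high"
    migrated = dict(params)
    migrated["thinking"] = {"type": "adaptive"}
    migrated["output_config"] = {"effort": effort}
    return migrated

old = {"model": "claude-opus-4-7",
       "thinking": {"type": "enabled", "budget_tokens": 8000}}
print(migrate_thinking(old)["output_config"])  # {'effort': 'medium'}
```

Running every outbound request through a shim like this during the cutover window is cheaper than hunting down each hardcoded budget by hand.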
| Scenario | Example | Recommended model | Rationale |
|---|---|---|---|
| Agentic coding loop | 30+ step shell / build / test / fix | Primary: GPT-5.5. Backup: Opus 4.7 | On Terminal-Bench 2.0, GPT-5.5 at 82.7% leads Opus 4.7’s 69.4% by 13 percentage points — the widest single-benchmark lead for GPT-5.5. DeepSeek V4 not dispatched (67.9%, 15 points behind) |
| Real GitHub issue resolution / cross-file refactor / production-grade PR resolution | Live bug regression, dependency upgrades, cross-module rewrites | Primary: Opus 4.7. Backup: GPT-5.5 | SWE-Bench Pro 64.3% is the GA market first, 5.7 points ahead of GPT-5.5’s 58.6%. Real-world behavioral evidence: Cursor’s internal CursorBench 58% → 70%, Rakuten reports 3× more production tasks resolved. DeepSeek V4-Pro as alternate (SWE-Pro 55.4%, 9 points behind, but 9× cheaper) |
| Fact-sensitive reports | Regulatory analysis, financial summaries, medical-text cleanup | Primary: Opus 4.7 | AA-Omniscience hallucination rate 36% is the only one on the market below 50%. GPT-5.5 is 86%, Gemini 3.1 Pro is 88%. The gap is too large to mix |
| Pure reasoning / math | FrontierMath, AIME, proof tasks | Primary: GPT-5.5. Backup: Gemini 3.1 Pro | FrontierMath Tier 4 GPT-5.5 35.4% versus Opus 4.7’s 23%, stable gap. Gemini 3.1 Pro AIME 95% can substitute at the same tier |
| Long-document retrieval / RAG | Contract analysis, long meeting transcript lookup, precise long-codebase locating | Primary: GPT-5.5 or Opus 4.6 | Graphwalks BFS 1M GPT-5.4 9.4% → GPT-5.5 45.4%. Opus 4.7 not dispatched: 1M needle-in-haystack 59.2%, 32 points below 4.6’s 91.9% |
| Long-document reasoning | Multi-hop reasoning across 500K+ context, long-codebase architecture analysis | Primary: Opus 4.7 | Reasoning quality at long context (not retrieval) stays top-tier for Opus 4.7, consistent with the 1753 Elo GDPval-AA agentic score |
| Multimodal | Video comprehension, UI screenshot operation, PDF document extraction | Primary: Gemini 3.1 Pro | Video-MMMU 87.6%, ScreenSpot-Pro 72.7%, OmniDocBench 0.115 — clean leads on all three. GPT-5.5’s launch barely touched multimodal; DeepSeek V4 preview is text-only |
| Search-heavy agentic deep research | Multi-round retrieval + report synthesis | Primary: Gemini 3.1 Pro. Backup: GPT-5.5 | BrowseComp 85.9% first; Search Grounding 5,000 queries/month free is an extra edge. Opus 4.7 not dispatched (BrowseComp 79.3%, the only regression this release) |
| Computer use / RPA / screen automation | Let the agent drive a user’s browser or desktop apps | Primary: Opus 4.7 | OSWorld 78.0% versus GPT-5.5 78.7% is effectively a tie, but Anthropic’s computer use is a standard Messages API tool with low integration cost. GPT-5.5 currently only ships via the macOS Codex desktop app, not portable |
| High-throughput, low-complexity API calls | Classification, summarization, extraction, form filling | Primary: DeepSeek V4-Flash or Sonnet 4.6 | V4-Flash $0.14/$0.28 + 90% cache-hit discount gets effective input cost down to $0.028/M. GPT-5.5 not dispatched: capability over-spec + doubled pricing will eat margin directly |
| Chinese-language tasks | Chinese customer service, Chinese content production, Chinese QA | Primary: DeepSeek V4-Pro or Gemini 3.1 Pro | Chinese-SimpleQA DeepSeek 84.4, Gemini 3.1 Pro 85.9, US frontier models around 76 — the 8-point gap is a long-term gap |
| Cost-sensitive side projects / academic research | Personal projects, hackathons, experiments without commercial margin constraints | Primary: DeepSeek V4 | Open weights, 1M context, $0.14/M input — the cost structure allows for high-throughput experimentation. Cline CEO’s anchor: if Uber used DeepSeek instead of Claude, its 2026 AI budget could last 7 years instead of 4 months |
| Finance / healthcare / legal / government compliance | Production pipelines that need to clear audit, GDPR, HIPAA, SOC 2 | Primary: Opus 4.7 via Bedrock, GPT-5.5 via Azure, Gemini via Vertex | DeepSeek V4 official API not dispatched: banned by multiple governments, GDPR/CCPA cross-border risk, audit won’t pass. If DeepSeek is a must, the only route is self-hosting inside your own VPC |
| Creative writing / dialogue / brainstorming | Fiction, role-play, ideation | Primary: Gemini 3 Pro or stay on Opus 4.6 | Opus 4.7 not dispatched: Zvi Mowshowitz’s “literal instruction following” observation, the Reddit “dogshit” threads, and boringbot’s PM blind test all point to the same conclusion — 4.7 is strong on structured output and weak on conversational warmth |
What’s most informative about the table above isn’t any individual dispatch rule — it’s the pattern you see when you look at the whole thing: in spring 2026, no single model is optimal across every scenario. GPT-5.5 is first on agentic coding and reasoning, but near market-bottom on hallucination. Opus 4.7 is first on long-horizon coding and factual reliability, but regressed on 1M retrieval and on creative writing. Gemini 3.1 Pro is first on multimodal and browsing and has the best cost-performance, but shares the hallucination problem. DeepSeek V4 is first on price and short-task reasoning, but 15 points behind on long agentic and hits compliance as a hard blocker.
This distribution looks less like temporary market noise and more like a necessary consequence of the capability level these models have reached. Each vendor’s training budget, data, product positioning, and compliance boundaries push them to different trade-off points. OpenAI turned Codex into a superapp and locked computer use inside the desktop app. Anthropic swapped 1M retrieval capacity for agentic reasoning. Google pushed pricing to a third of competitors while sacrificing factual reliability. DeepSeek differentiates on 9-30× cheaper plus MIT license. Each of these is a deliberate position, not a bug: every vendor is defining the task types most profitable to them and then building the product, pricing, and compliance combo around it.
Binding all your AI needs to any one vendor means continuously carrying risk on the dimensions you didn’t select for. If 80% of your pipeline is long-horizon coding plus real issue resolution, Opus 4.7 is obviously the workhorse; but for the remaining 20% of multimodal, search, and high-throughput batch work, forcing Opus 4.7 through those tasks is just burning money. Conversely, hanging everything on GPT-5.5 means accepting an 86% hallucination rate on fact-sensitive tasks, paying a doubled sticker price on high-throughput work, and essentially giving up on multimodal.
The right move is to treat this like assembling a team. Different tasks go to different models, and each model does the slice it’s best at: long-horizon coding to Opus 4.7, agentic shell to GPT-5.5, multimodal and browsing to Gemini, high-throughput batch to DeepSeek V4-Flash, Chinese or self-hosted sensitive data to DeepSeek V4-Pro. This isn’t over-engineering — it’s the natural fit given each model’s strengths and weaknesses. The price is maintaining multi-provider integration complexity. The return is using market-best on every task type, instead of paying extra or taking quality hits in some vendor’s weak zone.
Adopting this line of thinking changes two concrete things. One, your cost model has to be layered by task type, not computed on a single per-token basis. A penny-per-call high-frequency low-complexity task and a five-dollar-per-call long-horizon agent are completely different things and can’t share one budget. Two, evaluation has to be built in. Define 5 to 10 representative cases per task type and run them any time you consider switching models. This beats looking at AA Intelligence Index or SWE-Bench Pro numbers — those benchmarks express an average across general cases, and your production traffic distribution likely doesn’t match any of them.
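The built-in evaluation described above needs very little machinery to start. A minimal per-task-type harness, where the cases and the exact-substring pass check are illustrative stand-ins for your real cases and scoring, and call_model is a stub for your actual provider client:

```python
from typing import Callable

# A handful of representative (prompt, expected) cases per task type.
CASES = {
    "fact_sensitive": [
        ("What year did GDPR take effect?", "2018"),
    ],
    "extraction": [
        ("Extract the total from: 'Invoice total: $42.00'", "$42.00"),
    ],
}

def evaluate(call_model: Callable[[str], str]) -> dict:
    """Fraction of cases passed per task type (exact-substring check)."""
    scores = {}
    for task_type, cases in CASES.items():
        passed = sum(expected in call_model(prompt) for prompt, expected in cases)
        scores[task_type] = passed / len(cases)
    return scores

# Stub model that happens to answer both cases, to show the output shape.
print(evaluate(lambda prompt: "2018 ... $42.00"))
```

Run this against the old model and the candidate, per task type, and the switch decision becomes a diff of two score dicts instead of an argument about leaderboard numbers.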
Two longer-term implications follow. First, this jagged frontier (Ethan Mollick’s term) isn’t going to converge in the short term. The four vendors’ trade-off points pull in different directions, and each is actively reinforcing its differentiation: OpenAI’s superapp route, Anthropic’s agentic-worker route, Google’s cost-performance-plus-multimodal route, DeepSeek’s open-weight-plus-price route. By 2027 the distribution will be clearer, but it won’t return to a “one model rules all” state.
Second, your system architecture therefore needs an abstraction layer. Don’t hardwire your business code into any single vendor’s SDK, and don’t hardcode a specific model into your product. By “abstraction layer” I don’t mean a specific tool; it’s an architectural principle: your system needs switching room on the model-selection dimension, with an explicit routing layer between task type and model, where each task type can be evaluated and switched independently. This kind of principle-level decision matters more than picking today’s winner. Which model is strongest today is something you have to re-evaluate every three to six months; being able to switch by task is something you only have to architect once and can live with for years.
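The routing principle fits in a dozen lines. A sketch of the explicit task-type-to-model mapping, where the model id strings are illustrative labels rather than verified API names:

```python
# Routing table: task type -> model id. Swapping a model is a config
# change here, never a change to business code.
ROUTES = {
    "long_horizon_coding": "claude-opus-4-7",
    "agentic_shell":       "gpt-5.5",
    "multimodal":          "gemini-3.1-pro",
    "batch_extraction":    "deepseek-v4-flash",
    "default":             "gpt-5.5",
}

def pick_model(task_type: str, routes: dict = ROUTES) -> str:
    """Resolve a task type to a model id, falling back to the default."""
    return routes.get(task_type, routes["default"])

print(pick_model("multimodal"))    # gemini-3.1-pro
print(pick_model("unknown_task"))  # gpt-5.5 (falls back to default)
```

In production this table would live in config, each route would carry its own eval suite from the harness above it in the stack, and the function behind it would handle per-vendor request shaping; the principle is that model choice is data, not code.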
The teams that do this well are the ones that, every time a new release drops, first run it through their own task matrix, then decide which tasks to switch and which not to. That’s the baseline skill for AI selection in 2026.
Official primary sources:
- OpenAI: Introducing GPT-5.5
- GPT-5.5 System Card (PDF)
- Anthropic: Claude Opus 4.7 release
- Anthropic Opus 4.7 System Card (PDF)
- Google DeepMind: Gemini 3.1 Pro model card
- DeepSeek V4 Technical Report (PDF)
- DeepSeek V4 API pricing

Independent benchmarks and analysis:
- Artificial Analysis: GPT-5.5 is the new leading AI model
- Artificial Analysis: Opus 4.7 deep dive
- Artificial Analysis: Gemini 3.1 Pro Preview
- Artificial Analysis: DeepSeek V4-Pro
- LLM-Stats: GPT-5.5 vs Opus 4.7
- Vellum: Opus 4.7 benchmarks explained
- Epoch AI: Opus 4.7 ECI score

Independent technical commentary:
- Simon Willison: GPT-5.5
- Simon Willison: DeepSeek V4 — almost on the frontier
- Ethan Mollick: Sign of the future
- Zvi Mowshowitz: Opus 4.7 Part 1
- Zvi Mowshowitz: Opus 4.7 Part 2
- Jake Handy: DeepSeek V4

Ecosystem and compliance:
- CSIS: Delving into Dangers of DeepSeek
- WIRED: DeepSeek data flow to China
- GitHub Copilot premium request billing table
- Lovable production data
- Decrypt: DeepSeek V4 + Cline CEO Uber anchor
- Verdent AI: Gemini 3.1 Pro engineering writeup