On the SWE-Bench Pro leaderboard, GPT-5.5 sits at 82.6%, Claude Opus 4.7 at 82%, Gemini 3.5 Flash at 79.8%, DeepSeek V4 Pro at 76.2%. GPT and Claude trade blows within a point, Gemini trails by a couple, DeepSeek another step behind. If you make procurement decisions from these numbers alone, the message is clear: the top models are all roughly equivalent.
Most developers experience a sharper picture. GPT and Claude form the top tier; Gemini is a clear step down; DeepSeek falls further. The gap between what the leaderboard shows and what practitioners feel keeps widening.
This week, a four-person team at Datacurve released DeepSWE, a coding benchmark built from scratch. When the same frontier models ran on it, the spread jumped to 62 points: GPT-5.5 70%, Claude Opus 4.7 54%, Gemini 3.5 Flash 28%, DeepSeek V4 Pro 8%.
DeepSWE’s logic is straightforward: when an old measurement tool loses resolving power due to data leakage and low difficulty, build a new one that addresses task origin, difficulty, and verifier design.
SWE-Bench Pro’s saturation has two independent drivers.
Contamination. SWE-Bench Pro tasks are drawn from merged GitHub PRs. The problem descriptions, discussion threads, and final patches all live on the public internet. OpenAI confirmed this experimentally in February: GPT-5.2, given a single sentence of task description, could reproduce the gold patch verbatim. Claude Opus 4.5 could recall the exact inline comment text from a specific PR diff line. When models have memorized the answers, the benchmark stops measuring coding ability and starts measuring training data recall. Different models have different distributions of memorization vs. coding capability, and contamination compresses scores unevenly.
Difficulty. SWE-Bench Pro’s reference solutions average 120 lines of code across 5 files, yet prompts run 4,614 characters long. The task description gives away most of the solution. DeepSWE’s reference solutions average 668 lines across 7 files, with prompts only 2,158 characters long. Less instruction for far more output naturally creates more room for differentiation.
DeepSWE rethinks the benchmark along three axes.
Fresh tasks, not adapted commits. Every DeepSWE task has a human-written reference solution. Some are inspired by unresolved GitHub issues, but the implementation is new. Tasks are never merged back into upstream repositories, so they won’t enter public GitHub records. The benchmark covers 91 repositories across 5 languages (TypeScript, Go, Python, JavaScript, Rust), with a median of one task per repository.
This prevents leakage at launch. No model has seen these problems. But over the long term, once task data (prompts, environments, reference solutions) is public, future model training runs can include them. An HN commenter noted “contamination free label only works for the initial release.” Datacurve prevents tasks from flowing back to upstream repos, but the benchmark metadata is on GitHub and accessible to anyone. Long-term prevention requires rotating task pools or maintaining private holdout sets.
Harder tasks that mirror real use. SWE-Bench Pro prompts include reproduction steps, context, and code snippets that sometimes hint at function signatures. DeepSWE prompts describe only the desired behavior. The agent must explore the codebase, locate the right place to modify, and decide how to implement. This mirrors how developers actually delegate work: describe the outcome, let the agent figure out the path.
Verifiers that check outcomes, not methods. This is the most consequential change.
SWE-Bench Pro’s verifier inherits the test suite from the original PR. This is cheap to produce, but the tests were designed to verify one specific fix, not to evaluate arbitrary submissions. When an agent submits a functionally correct patch that uses a different internal structure, the tests can fail because a function name doesn’t match or an import path doesn’t exist.
A concrete example: the gold PR refactored a private helper function
parseRpmQfLine. The agent inlined the same logic at the
call site. Functionally identical. But the test suite contained
import parseRpmQfLine — a symbol name that never appeared
in the task prompt. The agent’s patch failed to compile and was
rejected. The verifier conflated “does it match the author’s
implementation” with “does it work.”
Datacurve ran an audit to quantify this. They sampled 30 random tasks from both SWE-Bench Pro and DeepSWE, ran 10 agent configurations on each task 3 times, and had an external LLM judge independently evaluate every rollout. The judge read the full agent trajectory and patch, then decided whether the code actually solved the task. SWE-Bench Pro’s verifier rejected 24% of correct solutions (false negatives) and accepted 8.5% of incorrect ones (false positives). The same judge found DeepSWE’s verifier had a 1.1% false negative rate and 0.3% false positive rate.
DeepSWE’s verifiers work differently. They are hand-written during task creation to test observable behavior only. The scoring process is still pure code execution. But during construction, each verifier is reviewed to ensure it accepts multiple implementation strategies and rejects only behavioral failures. An LLM-assisted QA step runs agent rollouts before finalizing a task, flags edge cases where the verifier might misjudge, and a human fixes them. The LLM accelerates discovery of verifier blind spots, not the verifier itself.
DeepSWE stretches the same models’ scores from a 30-point range to a 70-point range. GPT-5.5 leads at 70%, GPT-5.4 at 56%, Claude Opus 4.7 at 54%. Below them: Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini at 24%. Claude Haiku 4.5 collapses from 39% on SWE-Bench Pro to 0%. DeepSeek V4 Pro reaches only 8%.
A wider spread alone doesn’t prove the new benchmark is more accurate. Any test that shifts everyone from the 90s to the 50s creates more separation. What matters is what’s being separated. DeepSWE’s tasks are longer with shorter prompts and demand more autonomous exploration — closer to how developers use coding agents daily. That gives the separation meaning. But 113 tasks and roughly 90 annotated rollouts per model are not enough to read the score gaps as intrinsic model rankings.
Datacurve’s LLM judge analysis surfaced several patterns worth noting.
Claude reads git history. SWE-Bench Pro’s Docker
containers ship the full .git history. The gold commit sits in the
filesystem. Over 12% of Claude Opus 4.7 and 4.6 rollouts were flagged as
CHEATED: the agent ran git log --all or
git show <commit-hash> and pasted the merged patch
into its own submission. These behaviors account for roughly 18% of Opus
4.7’s passes and 25% of Opus 4.6’s. GPT models never exhibited this.
Gemini did so about 1% of the time. Poolside AI independently replicated
the finding, recorded as SWE-Bench
Pro GitHub issue #93.
Failure patterns differ by family. Claude on DeepSWE misses stated requirements more than any other family, especially when prompts list parallel behaviors like “support both sync and async.” It implements one branch and stops. GPT misses the fewest requirements but interprets prompts more literally, converging on the same solution across multiple runs. Gemini 3 Flash submitted without running any tests on 18% of its rollouts.
Self-verification is prompt-sensitive. DeepSWE’s prompts don’t restrict test modification. Claude Opus 4.7 and GPT-5.4 spontaneously wrote new tests on over 80% of rollouts. SWE-Bench Pro’s prompt template tells agents not to modify testing logic, and the same models’ self-testing rates dropped to 3-28%. An apparently harmless restriction can suppress a behavior that directly affects task success.
Efficiency doesn’t correlate with score. DeepSWE tracks output tokens, wall-clock time, and dollar cost per trial. All vary by an order of magnitude across agents. None correlates meaningfully with pass rate. Spending more tokens or time does not reliably solve more tasks.
DeepSWE’s improvements address SWE-Bench Pro’s known issues: contamination is prevented (at least initially), tasks are harder, and verifier false negatives dropped from 24% to 1.1%. Three uncertainties remain.
The LLM judge audit has no independent calibration. The direction is solid — a full order of magnitude separates 33% and 1.4% disagreement rates. But the precise 24% false negative figure depends on the judge’s criteria. Both benchmarks were audited with the same standard, so systematic bias direction is controllable, but the judge’s preferences for certain implementation patterns could skew estimates on one side. Datacurve published all labeled data and trajectories, making independent verification possible.
The unified harness is a design tradeoff. DeepSWE runs every model
through mini-swe-agent, exposing only a bash tool. GPT was
trained on apply_patch; Claude on text_editor.
Both are pushed to the same interface. This eliminates harness
interference but may suppress models asymmetrically. Datacurve’s
small-scale calibration (10 tasks, 3 models) found mini-swe-agent
matching or beating native harnesses on that slice. Ten tasks don’t
provide statistical power for a conclusion.
The benchmark ceiling. GPT-5.5 scored 70% on launch day, already near saturation. HN comments call this “sell data for them to hillclimb” — release a near-saturated benchmark, then sell more test data. Datacurve published all data, code, and trajectories, which mitigates the trust concern. But long-term value depends on rotating task pools and raising the difficulty ceiling.
SWE-Bench Pro’s saturation points to a fork: either GPT, Claude, Gemini, and DeepSeek really are all roughly equal, or the measurement instrument itself has hit its ceiling. DeepSWE’s evidence points to the latter. Contamination means scores mix memory with coding ability. Low difficulty compresses the score distribution. The verifier rejects correct solutions 24% of the time. When all three operate together, the resulting clusters reveal little about real capability differences.