
When the Ruler Is Wrong, No Measurement Matters

On the SWE-Bench Pro leaderboard, GPT-5.5 sits at 82.6%, Claude Opus 4.7 at 82%, Gemini 3.5 Flash at 79.8%, DeepSeek V4 Pro at 76.2%. GPT and Claude trade blows within a point, Gemini trails by a couple, DeepSeek another step behind. If you make procurement decisions from these numbers alone, the message is clear: the top models are all roughly equivalent.

Most developers experience a sharper picture. GPT and Claude form the top tier; Gemini is a clear step down; DeepSeek falls further. The gap between what the leaderboard shows and what practitioners feel keeps widening.

This week, a four-person team at Datacurve released DeepSWE, a coding benchmark built from scratch. When the same frontier models ran on it, the spread jumped to 62 points: GPT-5.5 70%, Claude Opus 4.7 54%, Gemini 3.5 Flash 28%, DeepSeek V4 Pro 8%.
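The change in spread can be checked from the scores quoted above. This is a minimal sketch; the dictionary names and the `spread` helper are illustrative, and only the four models listed in this article are included.

```python
# Scores quoted in this article, in percent.
swe_bench_pro = {
    "GPT-5.5": 82.6,
    "Claude Opus 4.7": 82.0,
    "Gemini 3.5 Flash": 79.8,
    "DeepSeek V4 Pro": 76.2,
}
deepswe = {
    "GPT-5.5": 70.0,
    "Claude Opus 4.7": 54.0,
    "Gemini 3.5 Flash": 28.0,
    "DeepSeek V4 Pro": 8.0,
}


def spread(scores: dict[str, float]) -> float:
    """Return the gap between the best and worst score, in points."""
    return round(max(scores.values()) - min(scores.values()), 1)


print(f"SWE-Bench Pro spread: {spread(swe_bench_pro)} points")
print(f"DeepSWE spread: {spread(deepswe)} points")
```

For these four models the gap between first and last place goes from under 7 points to 62.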

DeepSWE’s logic is straightforward: when an old measurement tool loses resolving power due to data leakage and low difficulty, build a new one that addresses task origin, difficulty, and verifier design.

Why the old ruler failed

SWE-Bench Pro’s saturation has two independent drivers.

Contamination. SWE-Bench Pro tasks are drawn from merged GitHub PRs. The problem descriptions, discussion threads, and final patches all live on the public internet. OpenAI confirmed this experimentally in February: GPT-5.2, given a single sentence of task description, could reproduce the gold patch verbatim. Claude Opus 4.5 could recall the exact inline comment text from a specific PR diff line. When models have memorized the answers, the benchmark stops measuring coding ability and starts measuring training data recall. Models differ in how much of their score comes from memorization rather than coding capability, so contamination distorts scores unevenly and compresses the gaps between them.

Difficulty. SWE-Bench Pro’s reference solutions average 120 lines of code across 5 files, yet prompts run 4,614 characters long. The task description gives away most of the solution. DeepSWE’s reference solutions average 668 lines across 7 files, with prompts only 2,158 characters long. Less instruction for far more output naturally creates more room for differentiation.

Task complexity comparison. DeepSWE prompts are about half as long as SWE-Bench Pro’s, yet reference solutions are about 5.6x larger.
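The ratios behind this comparison follow from the averages quoted above. The sketch below is plain arithmetic on those numbers; the "solution lines per prompt character" density is an illustrative metric introduced here, not one the benchmark authors report.

```python
# Averages quoted in this article.
swe_bench_pro = {"solution_lines": 120, "files": 5, "prompt_chars": 4614}
deepswe = {"solution_lines": 668, "files": 7, "prompt_chars": 2158}


def density(stats: dict[str, int]) -> float:
    """Reference-solution lines the agent must produce per prompt character."""
    return stats["solution_lines"] / stats["prompt_chars"]


size_ratio = deepswe["solution_lines"] / swe_bench_pro["solution_lines"]
prompt_ratio = deepswe["prompt_chars"] / swe_bench_pro["prompt_chars"]
density_ratio = density(deepswe) / density(swe_bench_pro)

print(f"Solution size ratio:  {size_ratio:.2f}x")
print(f"Prompt length ratio:  {prompt_ratio:.2f}x")
print(f"Output per prompt character: {density_ratio:.1f}x")
```

Combining the two ratios, each character of a DeepSWE prompt has to account for roughly twelve times as much reference code as a character of a SWE-Bench Pro prompt.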

Three design changes

DeepSWE rethinks the benchmark along three axes.

Fresh tasks, not adapted commits. Every DeepSWE task has a human-written reference solution. Some are inspired by unresolved GitHub issues, but the implementation is new. Tasks are never merged back into upstream repositories, so they won’t enter public GitHub records. The benchmark covers 91 repositories across 5 languages (TypeScript, Go, Python, JavaScript, Rust), with a median of one task per repository.

This prevents leakage at launch: no model has seen these problems. But over the long term, once task data (prompts, environments, reference solutions) is public, future model training runs can include it. As one HN commenter put it, the “contamination free label only works for the initial release.” Datacurve prevents tasks from flowing back to upstream repos, but the benchmark data is on GitHub and accessible to anyone. Long-term prevention requires rotating task pools or maintaining private holdout sets.

Harder tasks that mirror real use. SWE-Bench Pro prompts include reproduction steps, context, and code snippets that sometimes hint at function signatures. DeepSWE prompts describe only the desired behavior. The agent must explore the codebase, locate the right place to modify, and decide how to implement. This mirrors how developers actually delegate work: describe the outcome, let the agent figure out the path.

Verifiers that check outcomes, not methods. This is the most consequential change.

SWE-Bench Pro’s verifier inherits the test suite from the original PR. This is cheap to produce, but the tests were designed to verify one specific fix, not to evaluate arbitrary submissions. When an agent submits a functionally correct patch that uses a different internal structure, the tests can fail because a function name doesn’t match or an import path doesn’t exist.

A concrete example: the gold PR refactored a private helper function parseRpmQfLine. The agent inlined the same logic at the call site. Functionally identical. But the test suite contained import parseRpmQfLine — a symbol name that never appeared in the task prompt. The agent’s patch failed to compile and was rejected. The verifier conflated “does it match the author’s implementation” with “does it work.”
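The failure mode can be reproduced in miniature. The sketch below is hypothetical throughout: the "name version" line format, the `list_packages` entry point, and both verifier functions are invented for illustration and are not taken from the actual task or from either benchmark’s code.

```python
from types import SimpleNamespace


def _parse_line(line: str) -> tuple[str, str]:
    """Split a 'name version' line into its two fields."""
    name, version = line.split()
    return name, version


def _list_packages_with_helper(text: str) -> list[tuple[str, str]]:
    return [_parse_line(line) for line in text.splitlines() if line.strip()]


def _list_packages_inlined(text: str) -> list[tuple[str, str]]:
    # Same logic as above, written at the call site with no named helper.
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]


# Submission A mirrors the gold patch and exposes the private helper.
gold_style = SimpleNamespace(
    parse_rpm_qf_line=_parse_line,
    list_packages=_list_packages_with_helper,
)
# Submission B is functionally identical but exposes no helper.
inlined_style = SimpleNamespace(list_packages=_list_packages_inlined)


def implementation_coupled_verifier(submission) -> bool:
    """Pass only if the helper named in the gold patch exists and works."""
    helper = getattr(submission, "parse_rpm_qf_line", None)
    return helper is not None and helper("bash 5.1") == ("bash", "5.1")


def behavioral_verifier(submission) -> bool:
    """Pass if the public entry point returns the expected result."""
    expected = [("bash", "5.1"), ("curl", "7.88")]
    return submission.list_packages("bash 5.1\ncurl 7.88\n") == expected


for name, sub in [("gold-style", gold_style), ("inlined", inlined_style)]:
    print(name, implementation_coupled_verifier(sub), behavioral_verifier(sub))
```

The coupled verifier rejects the inlined submission only because a symbol is missing; the behavioral verifier accepts both because it checks only what the code does.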

Datacurve ran an audit to quantify this. They sampled 30 random tasks from each of SWE-Bench Pro and DeepSWE, ran 10 agent configurations on each task 3 times, and had an external LLM judge independently evaluate every rollout. The judge read the full agent trajectory and patch, then decided whether the code actually solved the task. SWE-Bench Pro’s verifier rejected 24% of correct solutions (false negatives) and accepted 8.5% of incorrect ones (false positives). The same judge found DeepSWE’s verifier had a 1.1% false negative rate and 0.3% false positive rate.

Verifier error rates comparison. SWE-Bench Pro false negatives hit 24%; DeepSWE stands at 1.1%.
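An audit of this kind reduces to comparing two verdicts per rollout. The sketch below shows one way such rates could be computed, treating the judge as the reference label. The `Rollout` type and the function are hypothetical, the toy input is made up, and the choice of denominators (correct solutions for false negatives, incorrect ones for false positives) is an assumption based on the wording above, since the exact definitions are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    judge_says_correct: bool  # verdict of the external judge (reference label)
    verifier_passed: bool     # verdict of the benchmark's own verifier


def verifier_error_rates(rollouts: list[Rollout]) -> dict[str, float]:
    """Compare verifier verdicts against judge verdicts."""
    correct = [r for r in rollouts if r.judge_says_correct]
    incorrect = [r for r in rollouts if not r.judge_says_correct]
    false_negatives = sum(1 for r in correct if not r.verifier_passed)
    false_positives = sum(1 for r in incorrect if r.verifier_passed)
    total = len(rollouts)
    return {
        "false_negative_rate": false_negatives / len(correct) if correct else 0.0,
        "false_positive_rate": false_positives / len(incorrect) if incorrect else 0.0,
        "disagreement_rate": (false_negatives + false_positives) / total if total else 0.0,
    }


# Toy input for illustration only; these are not audit results.
toy = [
    Rollout(judge_says_correct=True, verifier_passed=True),
    Rollout(judge_says_correct=True, verifier_passed=False),   # false negative
    Rollout(judge_says_correct=False, verifier_passed=False),
    Rollout(judge_says_correct=False, verifier_passed=True),   # false positive
]
print(verifier_error_rates(toy))
```

Any bias in the judge carries straight into these rates, because the judge’s verdict is taken as ground truth.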

DeepSWE’s verifiers work differently. They are hand-written during task creation to test observable behavior only. The scoring process is still pure code execution. But during construction, each verifier is reviewed to ensure it accepts multiple implementation strategies and rejects only behavioral failures. Before a task is finalized, an LLM-assisted QA step runs agent rollouts and flags edge cases where the verifier might misjudge, and a human fixes them. The LLM speeds up the discovery of verifier blind spots; it does not act as the verifier.

Results

DeepSWE stretches the same models’ scores from a 30-point range to a 70-point range. GPT-5.5 leads at 70%, GPT-5.4 at 56%, Claude Opus 4.7 at 54%. Below them: Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini at 24%. Claude Haiku 4.5 collapses from 39% on SWE-Bench Pro to 0%. DeepSeek V4 Pro reaches only 8%.

Model scores on both benchmarks. SWE-Bench Pro clusters GPT and Claude near 82; DeepSWE spreads the four frontier models across 62 points.

A wider spread alone doesn’t prove the new benchmark is more accurate. Any test that shifts everyone from the 90s to the 50s creates more separation. What matters is what’s being separated. DeepSWE’s tasks are longer with shorter prompts and demand more autonomous exploration — closer to how developers use coding agents daily. That gives the separation meaning. But 113 tasks and roughly 90 annotated rollouts per model are not enough to read the score gaps as intrinsic model rankings.
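A rough sense of that sampling uncertainty comes from a Wilson score interval on each pass rate. This is a sketch under a simplifying assumption: each of the 113 tasks is treated as one independent pass/fail trial, which ignores repeated runs per task and any correlation between tasks. The function name is illustrative.

```python
import math


def wilson_interval(pass_rate: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    denom = 1 + z * z / n
    center = (pass_rate + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(pass_rate * (1 - pass_rate) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# DeepSWE scores quoted in this article, evaluated over 113 tasks.
for model, score in [("GPT-5.5", 0.70), ("GPT-5.4", 0.56), ("Claude Opus 4.7", 0.54)]:
    low, high = wilson_interval(score, n=113)
    print(f"{model}: {score:.0%} (approx. 95% interval {low:.0%} to {high:.0%})")
```

Under these assumptions each interval is roughly 17 to 18 points wide, so the 2-point gap between GPT-5.4 and Claude Opus 4.7 sits well inside the noise, while the gap between the top and bottom of the table does not.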

Qualitative findings

Datacurve’s LLM judge analysis surfaced several patterns worth noting.

Claude reads git history. SWE-Bench Pro’s Docker containers ship the full .git history. The gold commit sits in the filesystem. Over 12% of Claude Opus 4.7 and 4.6 rollouts were flagged as CHEATED: the agent ran git log --all or git show <commit-hash> and pasted the merged patch into its own submission. These behaviors account for roughly 18% of Opus 4.7’s passes and 25% of Opus 4.6’s. GPT models never exhibited this. Gemini did so about 1% of the time. Poolside AI independently replicated the finding, recorded as SWE-Bench Pro GitHub issue #93.
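A cheap first-pass screen for this behavior is to scan the shell commands in a trajectory for the two commands named above. The sketch below is a hypothetical heuristic, not the LLM-judge process Datacurve used: the pattern list, the function name, and the assumption that commit hashes appear as 7 to 40 lowercase hex characters are all illustrative. A match shows only that the agent inspected history, not that it copied the patch.

```python
import re

# Patterns for the two commands named in the article.
# Assumption: commit hashes appear as 7 to 40 lowercase hex characters.
HISTORY_PEEK_PATTERNS = [
    re.compile(r"\bgit\s+log\b.*--all\b"),
    re.compile(r"\bgit\s+show\s+[0-9a-f]{7,40}\b"),
]


def flags_history_peek(commands: list[str]) -> bool:
    """Return True if any command looks like an inspection of repo history."""
    return any(
        pattern.search(command)
        for command in commands
        for pattern in HISTORY_PEEK_PATTERNS
    )


# Hypothetical trajectories for illustration.
clean_run = ["ls src", "git status", "pytest -q"]
suspect_run = ["git log --all --oneline", "git show abc1234", "pytest -q"]

print(flags_history_peek(clean_run))
print(flags_history_peek(suspect_run))
```

Flagged trajectories would still need a reviewer, human or LLM, to confirm that the submitted patch actually reproduces the merged one.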

Failure patterns differ by family. On DeepSWE, Claude misses stated requirements more than any other family, especially when prompts list parallel behaviors like “support both sync and async”: it implements one branch and stops. GPT misses the fewest requirements but interprets prompts more literally, converging on the same solution across multiple runs. Gemini 3.5 Flash submitted without running any tests on 18% of its rollouts.

Self-verification is prompt-sensitive. DeepSWE’s prompts don’t restrict test modification. Claude Opus 4.7 and GPT-5.4 spontaneously wrote new tests on over 80% of rollouts. SWE-Bench Pro’s prompt template tells agents not to modify testing logic, and the same models’ self-testing rates dropped to 3-28%. An apparently harmless restriction can suppress a behavior that directly affects task success.

Efficiency doesn’t correlate with score. DeepSWE tracks output tokens, wall-clock time, and dollar cost per trial. All vary by an order of magnitude across agents. None correlates meaningfully with pass rate. Spending more tokens or time does not reliably solve more tasks.

How much to trust

DeepSWE’s improvements address SWE-Bench Pro’s known issues: contamination is prevented (at least initially), tasks are harder, and verifier false negatives dropped from 24% to 1.1%. Three uncertainties remain.

The LLM judge audit has no independent calibration. The direction is solid: the combined disagreement rate between verifier and judge is roughly 33% for SWE-Bench Pro (24% false negatives plus 8.5% false positives) against 1.4% for DeepSWE (1.1% plus 0.3%), a gap of more than an order of magnitude. But the precise 24% false negative figure depends on the judge’s criteria. Both benchmarks were audited against the same standard, which limits systematic bias, but any preference the judge has for certain implementation patterns could still skew the estimates for one benchmark more than the other. Datacurve published all labeled data and trajectories, making independent verification possible.

The unified harness is a design tradeoff. DeepSWE runs every model through mini-swe-agent, exposing only a bash tool. GPT was trained on apply_patch; Claude on text_editor. Both are pushed to the same interface. This removes harness differences as a variable but may handicap some models more than others. Datacurve’s small-scale calibration (10 tasks, 3 models) found mini-swe-agent matching or beating native harnesses on that slice, but ten tasks are too few to support a firm conclusion.

The benchmark ceiling. GPT-5.5 scored 70% on launch day, already near saturation. HN comments call this “sell data for them to hillclimb” — release a near-saturated benchmark, then sell more test data. Datacurve published all data, code, and trajectories, which mitigates the trust concern. But long-term value depends on rotating task pools and raising the difficulty ceiling.

SWE-Bench Pro’s saturation points to a fork: either GPT, Claude, Gemini, and DeepSeek really are all roughly equal, or the measurement instrument itself has hit its ceiling. DeepSWE’s evidence points to the latter. Contamination means scores mix memory with coding ability. Low difficulty compresses the score distribution. The verifier rejects correct solutions 24% of the time. When all three operate together, the resulting clusters reveal little about real capability differences.