Long Context Benchmarks: All Three Hit 1M — Now What?

Date: 2026-03-15

By March 2026, all three frontier model providers have finally crossed the 1M context window threshold. Google was the earliest: Gemini 1.5 Pro has supported 1M since February 2024. Anthropic moved slowest, keeping 200K as its standard window for two years and only bringing a 1M beta to the Opus tier with Claude Opus 4.6 in February 2026, with GA following on March 13. OpenAI's path was the most winding: GPT-4.1 briefly touched 1M as an API-only model, GPT-5.2 pulled back to 256K, and it wasn't until GPT-5.4 (March 2026) that 1M returned via a premium tier, the first time a production model available to ChatGPT users joined the 1M club.

The context window arms race is, for now, settled. But when we pull the long context benchmark numbers side by side, a fact worth documenting emerges: the same nominal 1M means very different things in practice. On MRCR v2 8-needle (currently the most demanding long context retrieval test), Claude Opus 4.6 scores 76% at 1M, GPT-5.4 scores 36.6% in the 512K–1M range, and Gemini 3 Pro manages only 24.5% at 1M.

MRCR v2 Performance Curve

This chart is perhaps the most intuitive explanation. GPT-5.2’s needle-in-haystack scores previously looked near-perfect (98.2% in the 4K–8K range), and I initially assumed it was untouchable in this dimension. A closer look reveals that its context window caps at 256K — the curve simply ends there. Only after GPT-5.4 extended to 1M did the degradation beyond 256K become visible. The red shaded area represents the territory GPT-5.2 was never tested on.

This report compiles cross-model data from MRCR v2, Graphwalks, LongBench, and other benchmarks to answer one question: when it comes to 1M context windows, how far has each provider gotten from “having it” to “it actually working”?

Data and Methodology

Primary Benchmark: MRCR v2 (Multi-Round Coreference Resolution)

MRCR v2 is the current gold standard for long context cross-evaluation, developed and open-sourced by OpenAI (GitHub). The test inserts multiple identically formatted “needle” user requests and responses into a large volume of distractor text (haystack), then asks the model to accurately retrieve the content of the nth needle.

The 8-needle variant is the hardest configuration: the model must simultaneously track and distinguish 8 different target pieces of information. Compared to the 2-needle and 4-needle variants, this places far greater demands on attending to many targets at once and retrieving each one precisely. All three frontier providers (OpenAI, Anthropic, Google) use 8-needle as their core evaluation standard for long context capability.
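To make the setup concrete, here is a minimal sketch of how an MRCR-style sample could be constructed. This is a simplification of the actual OpenAI harness; the function and field names are mine, and real MRCR uses formatted conversation turns rather than bare strings.

```python
import random

def build_mrcr_sample(needles, distractors, seed=0):
    """Interleave identically formatted 'needle' turns into distractor turns,
    then ask the model to reproduce the content of a randomly chosen needle."""
    rng = random.Random(seed)
    turns = [("distractor", d) for d in distractors]
    # Choose random insertion points, preserving the needles' relative order.
    positions = sorted(rng.sample(range(len(turns) + 1), len(needles)))
    for offset, (pos, needle) in enumerate(zip(positions, needles)):
        turns.insert(pos + offset, ("needle", needle))
    # The question targets the i-th needle by order of appearance.
    i = rng.randrange(len(needles))
    prompt = f"Return the exact text of needle #{i + 1}, in order of appearance."
    answer = needles[i]
    return turns, prompt, answer
```

With 8 needles, the model cannot succeed by matching format alone: it must count occurrences across the whole haystack and keep the 8 near-identical targets separate, which is exactly what degrades as context grows.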

Data Sources

This report draws from three tiers of sources, ranked by reliability: official vendor publications (system cards, blog posts, technical reports), third-party standardized evaluations (contextarena.ai), and community reports.

An important caveat: MRCR v2 scores from different sources cannot be directly compared. OpenAI’s official numbers use xhigh reasoning effort, Anthropic uses max thinking, and contextarena.ai has its own standardized settings. This report notes the source and evaluation conditions for each comparison.

Deconstructing the “Invincible” Illusion: GPT-5.2’s 256K Boundary

The opening chart deserves a second look. The two curves nearly overlap in the 0–256K range, showing that GPT-5.2 and GPT-5.4 have no fundamental performance difference in this interval. The real information is in the red shaded area: once GPT-5.4 enters the 256K+ zone, scores drop sharply from 79.3% to 57.5% at 256K–512K, and further to 36.6% at 512K–1M. This isn’t GPT-5.4 regressing — it’s a challenge GPT-5.2 never had to face.

GPT-5.2 vs GPT-5.4 by Context Range

| Context Range | GPT-5.2 | GPT-5.4 | Delta |
|---------------|---------|---------|-------|
| 4K–8K | 98.2% | 97.3% | -0.9 |
| 8K–16K | 89.3% | 91.4% | +2.1 |
| 16K–32K | 95.3% | 97.2% | +1.9 |
| 32K–64K | 92.0% | 90.5% | -1.5 |
| 64K–128K | 85.6% | 86.0% | +0.4 |
| 128K–256K | 77.0% | 79.3% | +2.3 |
| 256K–512K | N/A | 57.5% | N/A |
| 512K–1M | N/A | 36.6% | N/A |

Source: OpenAI GPT-5.2 Blog, OpenAI GPT-5.4 Blog

In the overlapping range (4K–256K), the two models differ by no more than about 2 points, confirming this is not a generational regression; GPT-5.2 simply was never tested beyond 256K. Extending to 1M exposes the long context degradation problem that all models face.
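The overlap claim can be checked directly from the per-range scores in the table above (transcribed from the two OpenAI blog posts cited there):

```python
# Per-range MRCR v2 8-needle scores from the OpenAI GPT-5.2 and GPT-5.4 blog posts.
gpt52 = {"4K-8K": 98.2, "8K-16K": 89.3, "16K-32K": 95.3,
         "32K-64K": 92.0, "64K-128K": 85.6, "128K-256K": 77.0}
gpt54 = {"4K-8K": 97.3, "8K-16K": 91.4, "16K-32K": 97.2,
         "32K-64K": 90.5, "64K-128K": 86.0, "128K-256K": 79.3}

# Per-range deltas over the shared 4K-256K interval.
deltas = {r: round(gpt54[r] - gpt52[r], 1) for r in gpt52}
max_gap = max(abs(d) for d in deltas.values())  # largest gap is 2.3 points
```

The largest gap in the shared interval (+2.3 at 128K–256K) is within normal run-to-run variation, versus a drop of more than 40 points once GPT-5.4 crosses into untested territory.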

Cross-Model Comparison: Who Can Actually Use 1M?

Cross-model Comparison

The chart above draws from Table 2.16.A in the Anthropic Sonnet 4.6 system card (Claude models use internal evaluation + max thinking; Gemini and GPT data come from contextarena.ai third-party evaluation), OpenAI GPT-5.4 official data, and the Google Gemini 2.5 technical report.

Aggregate Scores at 256K

| Model | Score | Source |
|-------|-------|--------|
| Claude Opus 4.6 | 93.0% | Anthropic system card (max thinking) |
| Claude Sonnet 4.6 | 90.3% | Anthropic system card (max thinking) |
| GPT-5.2 | 70.0% | OpenAI self-reported (xhigh reasoning) |
| Gemini 3 Flash | 58.5% | contextarena.ai |
| Gemini 3 Pro | 45.4% | contextarena.ai |
| Claude Sonnet 4.5 | 10.8% | Anthropic system card |

Aggregate Scores at 1M

| Model | Score | Source |
|-------|-------|--------|
| Claude Opus 4.6 | 76.0% | Anthropic system card (max thinking) |
| Claude Sonnet 4.6 | 65.8% | Anthropic system card (max thinking) |
| GPT-5.4 | 36.6% | OpenAI official (512K–1M range) |
| Gemini 3 Flash | 32.6% | contextarena.ai |
| Gemini 3 Pro | 24.5% | contextarena.ai |
| Claude Sonnet 4.5 | 18.5% | Anthropic system card |
| Gemini 2.5 Pro | 16.4% | Google technical report (PDF) |

Claude Opus 4.6's 76.0% at 1M represents a qualitative shift: a 4x improvement over Sonnet 4.5's 18.5% just five months earlier. Anthropic described it in the Opus 4.6 launch announcement as "a qualitative shift in how much context a model can actually use while maintaining peak performance."

Context Window ≠ Context Reliability

Context Window vs Reliability

This scatter plot visualizes a counterintuitive reality: a larger context window does not necessarily mean better performance.

GPT-5.2 has only a 256K context window, yet scores 77% at its farthest tested point (128K–256K). Meanwhile, Gemini 2.5 Pro, despite claiming a 1M context window, scores only 16.4% at 1M. Gemini 3 Pro has improved steadily but still sits at just 24.5% at 1M. In contrast, Claude Opus 4.6 is the only model maintaining 70%+ performance at 1M.

This phenomenon has been systematically validated in academic research. The Michelangelo evaluation in arXiv 2409.12640 found that GPT and Claude models perform better at short contexts (below 8K) but degrade faster, while Gemini models start lower at short contexts but degrade more gradually, potentially catching up at ultra-long contexts (1M). This “crossover effect” reveals an inherent tension between short-context and long-context performance — models struggle to excel at both.

Long Context Strategy Retrospectives

Anthropic: Conservative but Effective — “Last to Move, First to Arrive”

Anthropic’s context window expansion timeline:

| Date | Model | Context Window | Notes |
|------|-------|----------------|-------|
| 2024.3 | Claude 3 Opus/Sonnet | 200K | Technically supported 1M but not offered |
| 2024.6 | Claude 3.5 Sonnet | 200K | Still 200K |
| 2025.8 | Claude Sonnet 4 | 200K + 1M beta | First 1M beta, Tier 4 users only |
| 2025.9 | Claude Sonnet 4.5 | 200K + 1M beta | Poor 1M performance (MRCR 18.5%) |
| 2025.11 | Claude Opus 4.5 | 200K | Opus tier still no 1M |
| 2026.2.5 | Claude Opus 4.6 | 200K + 1M beta | First Opus-tier 1M, MRCR 76% |
| 2026.2.17 | Claude Sonnet 4.6 | 200K + 1M beta | MRCR 65.8% at 1M |
| 2026.3.13 | Opus 4.6 + Sonnet 4.6 | 1M GA | 1M generally available, standard pricing (source) |

Source: Anthropic Release Notes, Claude Opus 4.6 launch

From March 2024 to March 2026, Anthropic kept 200K as its standard context window, offering 1M only as a limited beta starting in August 2025. A September 2025 Anthropic engineering blog post explicitly laid out the company's philosophy:

“Waiting for larger context windows might seem like an obvious tactic. But it’s likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns. The solution isn’t more capacity; it’s better management of existing capacity.”

This signaled Anthropic's belief that expanding the context window wasn't the answer; better context management was. In retrospect, the strategy paid off: when Anthropic finally shipped 1M, its performance (76%) far exceeded that of Gemini (24.5%), which had offered 1M for two years.

Google Gemini: The Pioneer’s Dilemma

| Date | Model | Context Window | MRCR v2 8-needle |
|------|-------|----------------|------------------|
| 2024.2 | Gemini 1.5 Pro | 1M (later expanded to 2M) | 8-needle not reported |
| 2025.2 | Gemini 2.0 Flash | 1M | ≤128K: 18.4%; 1M: 10.2% |
| 2025.6 | Gemini 2.5 Pro | 1M (2M planned) | ≤128K: 58.0%; 1M: 16.4% |
| 2025.11 | Gemini 3 Pro | 1M | 128K: 77.0%; 1M: 26.3% |
| 2026.2 | Gemini 3.1 Pro | 1M | 128K: 84.9%; 1M: 26.3% |

Source: Google Gemini Blog, Gemini 2.5 Tech Report, Gemini 3.1 Pro Model Card

Google was the first to offer 1M context, and for a time the only one. But its 8-needle MRCR v2 performance has been consistently underwhelming. Gemini 2.0 Flash scored only 10.2% at 1M; by Gemini 3 Pro this improved to 26.3%, but the gap to Claude Opus 4.6’s 76% remains enormous.

Interestingly, Gemini’s improvement at 128K has been rapid: from 58.0% with 2.5 Pro to 84.9% with 3.1 Pro. At 128K, Gemini 3.1 Pro ties Claude Sonnet 4.6 (84.9%), indicating convergence at moderate context lengths (source). But the chasm at 1M persists.

The developer community has also voiced frustrations with Gemini’s long context reliability. Reddit users report noticeable performance degradation with Gemini 3 Pro after using just 15–20% of the nominal context window (source).

Notably, Latenode claims Gemini 2.5 Pro achieves 100% recall within 530K tokens and 99.7% at 1M (source). However, this data could not be verified in Google’s official technical report and likely comes from a different test methodology (e.g., 2-needle rather than 8-needle, or simple passkey retrieval rather than MRCR).

OpenAI: Precision First, Range Second

OpenAI's strategy resembles Anthropic's: conservative. GPT-5.2 opted for a 256K context window (400K in some configurations) and focused on maximizing performance within that range. With GPT-5.4's expansion to 1.1M, 272K became the standard mode, and 1M is offered as a paid premium mode (2x pricing) (source).

GPT-5.4 supports 1M context window in Codex (experimental), requiring explicit configuration via model_context_window and model_auto_compact_token_limit.
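A minimal sketch of what that opt-in might look like, assuming Codex's `~/.codex/config.toml` format; the model name and token values are illustrative, not official defaults:

```toml
# Hypothetical Codex config sketch; values are illustrative.
model = "gpt-5.4"

# Opt in to the experimental 1M window explicitly.
model_context_window = 1000000

# Trigger auto-compaction before the window fills.
model_auto_compact_token_limit = 900000
```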

Additional Long Context Benchmarks

MRCR v2 is the most important long context benchmark today, but not the only one. Key findings from other benchmarks:

Graphwalks (multi-hop graph reasoning): Claude Opus 4.6 and GPT-5.2 are nearly tied on the Parents task (71.1% vs 72.0% at 1M), but both struggle on BFS tasks (~40% at 1M). Source: Sonnet 4.6 System Card

LongBench v2 (long document understanding, 503 questions): Gemini 2.5 Pro leads at 63.3%, surpassing the human baseline (53.7%). GPT-4o 46.0%, Claude 3.5 Sonnet 41.0%. Source: LongBench v2 Leaderboard

RULER (retrieval/aggregation/reasoning, 13 tasks): At 128K, only Gemini 1.5 Pro (94.4%) and Jamba-1.5-large (95.1%) maintain 90%+. GPT-4 drops to 81.2%. Source: NVIDIA RULER

LongBench Pro (updated long document evaluation): Top three are Gemini 2.5 Pro (73.42), GPT-5 (72.61), and Claude-4-Sonnet (69.87). Gemini 2.5 Pro shows remarkable insensitivity to context length: its 256K score (71.77) is nearly identical to its 8K score (74.50). Source: arXiv 2601.02872

Methodological Caveats

Evaluation Condition Differences

The same benchmark can yield very different scores under different evaluation conditions. GPT-5.2's MRCR v2 8-needle score at 256K is a case in point: the same model on the same benchmark gets three different numbers from three different sources. Reasons include different reasoning effort settings, different aggregation methods (per-range average vs sample-weighted average), temperature differences, and more.
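To see how aggregation method alone moves the headline number, here is a toy illustration; the scores and sample counts are invented for the example, not real benchmark data:

```python
# Hypothetical (score, sample_count) pairs per context range.
per_range = {"4K-8K": (98.2, 50), "8K-16K": (89.3, 100), "128K-256K": (77.0, 400)}

# Unweighted average over ranges: every range counts equally.
range_avg = sum(s for s, _ in per_range.values()) / len(per_range)

# Sample-weighted average: ranges with many samples dominate.
total = sum(n for _, n in per_range.values())
weighted_avg = sum(s * n for s, n in per_range.values()) / total
```

On this toy data the two aggregates differ by about 7 points (88.2 vs 81.2) with identical underlying scores, which is why this report always notes the source and conditions alongside each number.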

8-needle vs Other Variants

Google’s frequently cited “MRCR 91.5% at 128K” figure in Gemini blog posts is likely from a 4-needle variant or MRCR v1, not 8-needle. In Google’s own technical report, Gemini 2.5 Pro scores only 54.3–58.0% on 8-needle at ≤128K. This discrepancy shows that needle count has a massive impact on difficulty — when comparing MRCR data from different sources, it is essential to confirm the variant.

Conclusion

The “context window arms race” framing is obsolete. The March 2026 Frontier article puts it well:

“The context-window arms race is over. The context-reliability race is the real story now. And it’s a harder problem. Stuffing a million tokens into a window is engineering. Getting the model to actually use what’s buried at token 600,000 is science.”

Industry research supports this: Awesome Agents’ analysis notes that a model’s effective context capacity is typically only 60–70% of its nominal value — beyond that range, performance degradation becomes non-trivial.
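That 60–70% rule of thumb can be turned into a rough budgeting helper; the function name and the 0.65 midpoint are mine, and the factor is the article's figure, not a measured constant:

```python
def effective_context(nominal_tokens: int, factor: float = 0.65) -> int:
    """Rule-of-thumb usable capacity: roughly 60-70% of the nominal window."""
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")
    return int(nominal_tokens * factor)

# A nominal 1M window yields roughly 600K-700K of reliably usable context,
# so plan retrieval and compaction around that budget, not the headline number.
budget = effective_context(1_000_000)
```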

For practitioners, the key takeaways: