Long Context Benchmarks: All Three Hit 1M — Now What?

Date: 2026-03-15

By March 2026, all three frontier model providers have finally crossed the 1M context window threshold. Google was the earliest: Gemini 1.5 Pro has supported 1M since February 2024. Anthropic moved slowest, keeping 200K as its standard window for two years and only bringing a 1M beta to the Opus tier with Claude Opus 4.6 in February 2026, with GA following on March 13. OpenAI's path was the most winding: GPT-4.1 briefly touched 1M as an API-only model, GPT-5.2 pulled back to 256K, and it wasn't until GPT-5.4 (March 2026) that 1M returned via a premium tier, the first time a production model available to ChatGPT users joined the 1M club.

The context window arms race is, for now, settled. But when we pull the long context benchmark numbers side by side, a fact worth documenting emerges: the same nominal 1M means very different things in practice. On MRCR v2 8-needle (currently the most demanding long context retrieval test), Claude Opus 4.6 scores 76% at 1M, GPT-5.4 scores 36.6% in the 512K–1M range, and Gemini 3 Pro manages only 24.5% at 1M.

MRCR v2 Performance Curve

This chart is perhaps the most intuitive explanation. GPT-5.2’s needle-in-haystack scores previously looked near-perfect (98.2% in the 4K–8K range), and I initially assumed it was untouchable in this dimension. A closer look reveals that its context window caps at 256K — the curve simply ends there. Only after GPT-5.4 extended to 1M did the degradation beyond 256K become visible. The red shaded area represents the territory GPT-5.2 was never tested on.

This report compiles cross-model data from MRCR v2, Graphwalks, LongBench, and other benchmarks to answer one question: when it comes to 1M context windows, how far has each provider gotten from “having it” to “it actually working”?

Data and Methodology

Primary Benchmark: MRCR v2 (Multi-Round Coreference Resolution)

MRCR v2 is the current gold standard for long context cross-evaluation, developed and open-sourced by OpenAI (GitHub). The test inserts multiple identically formatted “needle” user requests and responses into a large volume of distractor text (haystack), then asks the model to accurately retrieve the content of the nth needle.

The 8-needle variant is the hardest configuration: the model must simultaneously track and distinguish 8 different target pieces of information. Compared to the 2-needle and 4-needle variants, this places far greater demands on attending to many targets at once and retrieving each one precisely. All three frontier providers (OpenAI, Anthropic, Google) use 8-needle as their core evaluation standard for long context capability.
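To make the setup concrete, here is a minimal sketch of how an MRCR-style sample could be constructed. This is a simplification of the actual OpenAI harness; the function and field names are mine, and real MRCR uses formatted conversation turns rather than bare strings.

```python
import random

def build_mrcr_sample(needles, distractors, seed=0):
    """Interleave identically formatted 'needle' turns into distractor turns,
    then ask the model to reproduce the content of a randomly chosen needle."""
    rng = random.Random(seed)
    turns = [("distractor", d) for d in distractors]
    # Choose random insertion points, preserving the needles' relative order.
    positions = sorted(rng.sample(range(len(turns) + 1), len(needles)))
    for offset, (pos, needle) in enumerate(zip(positions, needles)):
        turns.insert(pos + offset, ("needle", needle))
    # The question targets the i-th needle by order of appearance.
    i = rng.randrange(len(needles))
    prompt = f"Return the exact text of needle #{i + 1}, in order of appearance."
    answer = needles[i]
    return turns, prompt, answer
```

With 8 needles, the model cannot succeed by matching format alone: it must count occurrences across the whole haystack and keep the 8 near-identical targets separate, which is exactly what degrades as context grows.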

Data Sources

This report draws from three tiers of sources, ranked by reliability: official vendor publications (system cards, blog posts, technical reports), third-party standardized evaluations (contextarena.ai), and community reports.

An important caveat: MRCR v2 scores from different sources cannot be directly compared. OpenAI’s official numbers use xhigh reasoning effort, Anthropic uses max thinking, and contextarena.ai has its own standardized settings. This report notes the source and evaluation conditions for each comparison.

Deconstructing the “Invincible” Illusion: GPT-5.2’s 256K Boundary

The opening chart deserves a second look. The two curves nearly overlap in the 0–256K range, showing that GPT-5.2 and GPT-5.4 have no fundamental performance difference in this interval. The real information is in the red shaded area: once GPT-5.4 enters the 256K+ zone, scores drop sharply from 79.3% to 57.5% at 256K–512K, and further to 36.6% at 512K–1M. This isn’t GPT-5.4 regressing — it’s a challenge GPT-5.2 never had to face.

GPT-5.2 vs GPT-5.4 by Context Range

| Context Range | GPT-5.2 | GPT-5.4 | Delta |
|---------------|---------|---------|-------|
| 4K–8K | 98.2% | 97.3% | -0.9 |
| 8K–16K | 89.3% | 91.4% | +2.1 |
| 16K–32K | 95.3% | 97.2% | +1.9 |
| 32K–64K | 92.0% | 90.5% | -1.5 |
| 64K–128K | 85.6% | 86.0% | +0.4 |
| 128K–256K | 77.0% | 79.3% | +2.3 |
| 256K–512K | N/A | 57.5% | N/A |
| 512K–1M | N/A | 36.6% | N/A |

Source: OpenAI GPT-5.2 Blog, OpenAI GPT-5.4 Blog

In the overlapping range (4K–256K), the two models differ by no more than about 2 points, confirming this is not a generational regression; GPT-5.2 simply was never tested beyond 256K. Extending to 1M exposes the long context degradation problem that all models face.
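The overlap claim can be checked directly from the per-range scores in the table above (transcribed from the two OpenAI blog posts cited there):

```python
# Per-range MRCR v2 8-needle scores from the OpenAI GPT-5.2 and GPT-5.4 blog posts.
gpt52 = {"4K-8K": 98.2, "8K-16K": 89.3, "16K-32K": 95.3,
         "32K-64K": 92.0, "64K-128K": 85.6, "128K-256K": 77.0}
gpt54 = {"4K-8K": 97.3, "8K-16K": 91.4, "16K-32K": 97.2,
         "32K-64K": 90.5, "64K-128K": 86.0, "128K-256K": 79.3}

# Per-range deltas over the shared 4K-256K interval.
deltas = {r: round(gpt54[r] - gpt52[r], 1) for r in gpt52}
max_gap = max(abs(d) for d in deltas.values())  # largest gap is 2.3 points
```

The largest gap in the shared interval (+2.3 at 128K–256K) is within normal run-to-run variation, versus a drop of more than 40 points once GPT-5.4 crosses into untested territory.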

Cross-Model Comparison: Who Can Actually Use 1M?

Cross-model Comparison

The chart above draws from Table 2.16.A in the Anthropic Sonnet 4.6 system card (Claude models use internal evaluation + max thinking; Gemini and GPT data come from contextarena.ai third-party evaluation), OpenAI GPT-5.4 official data, and the Google Gemini 2.5 technical report.

Aggregate Scores at 256K

| Model | Score | Source |
|-------|-------|--------|
| Claude Opus 4.6 | 93.0% | Anthropic system card (max thinking) |
| Claude Sonnet 4.6 | 90.3% | Anthropic system card (max thinking) |
| GPT-5.2 | 70.0% | OpenAI self-reported (xhigh reasoning) |
| Gemini 3 Flash | 58.5% | contextarena.ai |
| Gemini 3 Pro | 45.4% | contextarena.ai |
| Claude Sonnet 4.5 | 10.8% | Anthropic system card |

Aggregate Scores at 1M

| Model | Score | Source |
|-------|-------|--------|
| Claude Opus 4.6 | 76.0% | Anthropic system card (max thinking) |
| Claude Sonnet 4.6 | 65.8% | Anthropic system card (max thinking) |
| GPT-5.4 | 36.6% | OpenAI official (512K–1M range) |
| Gemini 3 Flash | 32.6% | contextarena.ai |
| Gemini 3 Pro | 24.5% | contextarena.ai |
| Claude Sonnet 4.5 | 18.5% | Anthropic system card |
| Gemini 2.5 Pro | 16.4% | Google technical report (PDF) |

Claude Opus 4.6's 76.0% at 1M represents a qualitative shift: a 4x improvement over Sonnet 4.5's 18.5% just five months earlier. Anthropic described it in the Opus 4.6 launch announcement as "a qualitative shift in how much context a model can actually use while maintaining peak performance."

Context Window ≠ Context Reliability

Context Window vs Reliability

This scatter plot visualizes a counterintuitive reality: a larger context window does not necessarily mean better performance.

GPT-5.2 has only a 256K context window, yet scores 77% at its farthest tested point (128K–256K). Meanwhile, Gemini 2.5 Pro, despite claiming a 1M context window, scores only 16.4% at 1M. Gemini 3 Pro has improved steadily but still sits at just 24.5% at 1M. In contrast, Claude Opus 4.6 is the only model maintaining 70%+ performance at 1M.

This phenomenon has been systematically validated in academic research. The Michelangelo evaluation in arXiv 2409.12640 found that GPT and Claude models perform better at short contexts (below 8K) but degrade faster, while Gemini models start lower at short contexts but degrade more gradually, potentially catching up at ultra-long contexts (1M). This “crossover effect” reveals an inherent tension between short-context and long-context performance — models struggle to excel at both.

Long Context Strategy Retrospectives

Anthropic: Conservative but Effective — “Last to Move, First to Arrive”

Anthropic’s context window expansion timeline:

| Date | Model | Context Window | Notes |
|------|-------|----------------|-------|
| 2024.3 | Claude 3 Opus/Sonnet | 200K | Technically supported 1M but not offered |
| 2024.6 | Claude 3.5 Sonnet | 200K | Still 200K |
| 2025.8 | Claude Sonnet 4 | 200K + 1M beta | First 1M beta, Tier 4 users only |
| 2025.9 | Claude Sonnet 4.5 | 200K + 1M beta | Poor 1M performance (MRCR 18.5%) |
| 2025.11 | Claude Opus 4.5 | 200K | Opus tier still no 1M |
| 2026.2.5 | Claude Opus 4.6 | 200K + 1M beta | First Opus-tier 1M, MRCR 76% |
| 2026.2.17 | Claude Sonnet 4.6 | 200K + 1M beta | MRCR 65.8% at 1M |
| 2026.3.13 | Opus 4.6 + Sonnet 4.6 | 1M GA | 1M generally available, standard pricing (source) |

Source: Anthropic Release Notes, Claude Opus 4.6 launch

From March 2024 to March 2026, Anthropic kept 200K as its standard context window, offering 1M only as a limited beta starting in August 2025. A September 2025 Anthropic engineering blog post explicitly laid out the company's philosophy:

“Waiting for larger context windows might seem like an obvious tactic. But it’s likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns. The solution isn’t more capacity; it’s better management of existing capacity.”

This signaled Anthropic's belief that expanding the context window wasn't the answer; better context management was. In retrospect, the strategy paid off: when Anthropic finally shipped 1M, its performance (76%) far exceeded that of Gemini (24.5%), which had offered 1M for two years.

Google Gemini: The Pioneer’s Dilemma

| Date | Model | Context Window | MRCR v2 8-needle |
|------|-------|----------------|------------------|
| 2024.2 | Gemini 1.5 Pro | 1M (later expanded to 2M) | 8-needle not reported |
| 2025.2 | Gemini 2.0 Flash | 1M | ≤128K: 18.4%; 1M: 10.2% |
| 2025.6 | Gemini 2.5 Pro | 1M (2M planned) | ≤128K: 58.0%; 1M: 16.4% |
| 2025.11 | Gemini 3 Pro | 1M | 128K: 77.0%; 1M: 26.3% |
| 2026.2 | Gemini 3.1 Pro | 1M | 128K: 84.9%; 1M: 26.3% |

Source: Google Gemini Blog, Gemini 2.5 Tech Report, Gemini 3.1 Pro Model Card

Google was the first to offer 1M context, and for a time the only one. But its 8-needle MRCR v2 performance has been consistently underwhelming. Gemini 2.0 Flash scored only 10.2% at 1M; by Gemini 3 Pro this improved to 26.3%, but the gap to Claude Opus 4.6’s 76% remains enormous.

Interestingly, Gemini’s improvement at 128K has been rapid: from 58.0% with 2.5 Pro to 84.9% with 3.1 Pro. At 128K, Gemini 3.1 Pro ties Claude Sonnet 4.6 (84.9%), indicating convergence at moderate context lengths (source). But the chasm at 1M persists.

The developer community has also voiced frustrations with Gemini’s long context reliability. Reddit users report noticeable performance degradation with Gemini 3 Pro after using just 15–20% of the nominal context window (source).

Notably, Latenode claims Gemini 2.5 Pro achieves 100% recall within 530K tokens and 99.7% at 1M (source). However, this data could not be verified in Google’s official technical report and likely comes from a different test methodology (e.g., 2-needle rather than 8-needle, or simple passkey retrieval rather than MRCR).

OpenAI: Precision First, Range Second

OpenAI's strategy resembles Anthropic's: conservative. GPT-5.2 opted for a 256K context window (400K in some configurations) and focused on maximizing performance within that range. With GPT-5.4's expansion to 1.1M, 272K became the standard mode, and 1M is offered as a paid premium mode (2x pricing) (source).

GPT-5.4 supports 1M context window in Codex (experimental), requiring explicit configuration via model_context_window and model_auto_compact_token_limit.
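A minimal sketch of what that opt-in might look like, assuming Codex's `~/.codex/config.toml` format; the model name and token values are illustrative, not official defaults:

```toml
# Hypothetical Codex config sketch; values are illustrative.
model = "gpt-5.4"

# Opt in to the experimental 1M window explicitly.
model_context_window = 1000000

# Trigger auto-compaction before the window fills.
model_auto_compact_token_limit = 900000
```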

Additional Long Context Benchmarks

MRCR v2 is the most important long context benchmark today, but not the only one. Key findings from other benchmarks:

Graphwalks (multi-hop graph reasoning): Claude Opus 4.6 and GPT-5.2 are nearly tied on the Parents task (71.1% vs 72.0% at 1M), but both struggle on BFS tasks (~40% at 1M). Source: Sonnet 4.6 System Card

LongBench v2 (long document understanding, 503 questions): Gemini 2.5 Pro leads at 63.3%, surpassing the human baseline (53.7%). GPT-4o 46.0%, Claude 3.5 Sonnet 41.0%. Source: LongBench v2 Leaderboard

RULER (retrieval/aggregation/reasoning, 13 tasks): At 128K, only Gemini 1.5 Pro (94.4%) and Jamba-1.5-large (95.1%) maintain 90%+. GPT-4 drops to 81.2%. Source: NVIDIA RULER

LongBench Pro (updated long document evaluation): Top three are Gemini 2.5 Pro (73.42), GPT-5 (72.61), and Claude-4-Sonnet (69.87). Gemini 2.5 Pro shows remarkable insensitivity to context length: its 256K score (71.77) is nearly identical to its 8K score (74.50). Source: arXiv 2601.02872

Methodological Caveats

Evaluation Condition Differences

The same benchmark can yield very different scores under different evaluation conditions. GPT-5.2's MRCR v2 8-needle score at 256K is a case in point: the same model on the same benchmark gets three different numbers from three different sources. Reasons include different reasoning effort settings, different aggregation methods (per-range average vs sample-weighted average), temperature differences, and more.
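To see how aggregation method alone moves the headline number, here is a toy illustration; the scores and sample counts are invented for the example, not real benchmark data:

```python
# Hypothetical (score, sample_count) pairs per context range.
per_range = {"4K-8K": (98.2, 50), "8K-16K": (89.3, 100), "128K-256K": (77.0, 400)}

# Unweighted average over ranges: every range counts equally.
range_avg = sum(s for s, _ in per_range.values()) / len(per_range)

# Sample-weighted average: ranges with many samples dominate.
total = sum(n for _, n in per_range.values())
weighted_avg = sum(s * n for s, n in per_range.values()) / total
```

On this toy data the two aggregates differ by about 7 points (88.2 vs 81.2) with identical underlying scores, which is why this report always notes the source and conditions alongside each number.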

8-needle vs Other Variants

Google’s frequently cited “MRCR 91.5% at 128K” figure in Gemini blog posts is likely from a 4-needle variant or MRCR v1, not 8-needle. In Google’s own technical report, Gemini 2.5 Pro scores only 54.3–58.0% on 8-needle at ≤128K. This discrepancy shows that needle count has a massive impact on difficulty — when comparing MRCR data from different sources, it is essential to confirm the variant.

Conclusion

The “context window arms race” framing is obsolete. The March 2026 Frontier article puts it well:

“The context-window arms race is over. The context-reliability race is the real story now. And it’s a harder problem. Stuffing a million tokens into a window is engineering. Getting the model to actually use what’s buried at token 600,000 is science.”

Industry research supports this: Awesome Agents’ analysis notes that a model’s effective context capacity is typically only 60–70% of its nominal value — beyond that range, performance degradation becomes non-trivial.
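That 60–70% rule of thumb can be turned into a rough budgeting helper; the function name and the 0.65 midpoint are mine, and the factor is the article's figure, not a measured constant:

```python
def effective_context(nominal_tokens: int, factor: float = 0.65) -> int:
    """Rule-of-thumb usable capacity: roughly 60-70% of the nominal window."""
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in (0, 1]")
    return int(nominal_tokens * factor)

# A nominal 1M window yields roughly 600K-700K of reliably usable context,
# so plan retrieval and compaction around that budget, not the headline number.
budget = effective_context(1_000_000)
```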

For practitioners, the key takeaways: