Survey Date: 2026-03-15
1M context is no longer rare. What is rare is the ability to reliably find the right information within those 1M tokens. Claude Opus 4.6 is currently the only model that maintains high reliability at 1M. Gemini and GPT degrade more noticeably. The long-context competition has shifted from a capacity race to a reliability race.
March 2026 marks the moment all three frontier model providers finally crossed the 1M context window threshold. Google got there first: Gemini 1.5 Pro has supported 1M since February 2024. Anthropic held off for two full years, only opening 1M beta on Claude Opus 4.6 in February 2026 and graduating to general availability on March 13. OpenAI’s path was more winding. GPT-4.1 briefly entered 1M as an API-only model, but the follow-up GPT-5.2 pulled back to 256K. GPT-5.4, released in March 2026, restored 1M support as a premium tier, making it the first 1M-context model available to ChatGPT users in production.
The context window arms race has thus reached a natural pause. But when we pull up the long-context benchmark data across providers, a fact worth documenting emerges: models claiming the same 1M label deliver wildly different levels of actual reliability. On MRCR v2 8-needle, currently the most stringent long-context retrieval test, Claude Opus 4.6 scores 76% at 1M. GPT-5.4 scores 36.6% in the 512K-1M range. Gemini 3 Pro scores 24.5% at 1M.
This chart may be the most direct illustration of that gap. GPT-5.2 previously posted results on needle-in-a-haystack tests that looked nearly perfect (98.2% in the 4K-8K range). I once assumed it was unbeatable on this metric. A closer look revealed that its context window was only 256K, and the curve stopped right there. When GPT-5.4 expanded to 1M, the decay beyond 256K finally came into view. The red shaded region marks the territory GPT-5.2 was never tested on.
This survey compiles cross-model benchmark data from MRCR v2, Graphwalks, LongBench, and others to answer one question: as providers move from “having” 1M to actually “using” 1M, how far has each of them gotten?
MRCR v2 is the central benchmark for long-context cross-model evaluation, developed and open-sourced by OpenAI (GitHub). The testing method inserts multiple identically formatted “needle” user requests and responses into a large body of distractor text (the haystack) and asks the model to accurately retrieve the content of the n-th needle.
The 8-needle variant is the most difficult configuration: the model must track and distinguish eight different target pieces of information simultaneously. Compared to 2-needle or 4-needle variants, this places far higher demands on multi-span attention and precise retrieval. All frontier model providers (OpenAI, Anthropic, Google) use 8-needle as their core evaluation standard for long-context capability.
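To make the setup concrete, here is a toy sketch of how an MRCR-style multi-needle case can be assembled and scored. This is not OpenAI’s open-sourced harness; the helper names (build_haystack, grade) and the fuzzy-match grading are illustrative assumptions.

```python
"""Toy sketch of an MRCR-style multi-needle test case (illustrative only)."""
import random
from difflib import SequenceMatcher


def build_haystack(needles: list[str], distractors: list[str], seed: int = 0) -> tuple[str, int]:
    """Scatter identically formatted needle exchanges through distractor turns."""
    rng = random.Random(seed)
    turns = [f"User: write a poem about {topic}\nAssistant: ...\n" for topic in distractors]
    # Every needle uses the same surface form, so position is the only distinguishing cue.
    needle_turns = [f"User: write a poem about penguins\nAssistant: {n}\n" for n in needles]
    positions = sorted(rng.sample(range(len(turns) + 1), len(needle_turns)))
    for offset, (pos, turn) in enumerate(zip(positions, needle_turns)):
        turns.insert(pos + offset, turn)
    target = rng.randrange(len(needles))  # the model must reproduce the target-th needle
    return "".join(turns), target


def grade(response: str, expected: str) -> float:
    """Fuzzy string similarity in [0, 1]; the real harness's grading differs in detail."""
    return SequenceMatcher(None, response, expected).ratio()
```

The 8-needle configuration simply passes eight needles to a builder like this; the difficulty comes from the needles being distinguishable only by position.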
The data in this report comes from three categories of sources: official system cards and technical reports, official blog posts and release announcements, and third-party standardized evaluations such as contextarena.ai.
One important caveat: MRCR v2 scores from different sources cannot be directly compared. OpenAI’s official data uses xhigh reasoning effort, Anthropic uses max thinking, and contextarena.ai applies its own standardized settings. This report notes the data source and evaluation conditions in each comparison.
The opening chart deserves a second look. The two curves nearly overlap in the 0-256K range, indicating no fundamental capability difference between GPT-5.2 and GPT-5.4 within that window. The real information lives in the red shaded zone. Once GPT-5.4 moves past 256K, the 256K-512K range drops sharply from 79.3% to 57.5%, and 512K-1M falls further to 36.6%. This is not GPT-5.4 regressing. It is GPT-5.2 never having faced these challenges.
| Context Range | GPT-5.2 | GPT-5.4 | Difference (pp) |
|---|---|---|---|
| 4K-8K | 98.2% | 97.3% | -0.9 |
| 8K-16K | 89.3% | 91.4% | +2.1 |
| 16K-32K | 95.3% | 97.2% | +1.9 |
| 32K-64K | 92.0% | 90.5% | -1.5 |
| 64K-128K | 85.6% | 86.0% | +0.4 |
| 128K-256K | 77.0% | 79.3% | +2.3 |
| 256K-512K | N/A | 57.5% | — |
| 512K-1M | N/A | 36.6% | — |
Source: OpenAI GPT-5.2 Blog, OpenAI GPT-5.4 Blog
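For reference, this curve can be reproduced directly from the table. Below is a minimal matplotlib sketch (a simplified stand-in, not the assets/long_context_benchmark_viz.py script listed at the end of this survey):

```python
import matplotlib.pyplot as plt

# Per-range MRCR v2 8-needle scores from the table above (percent).
ranges = ["4K-8K", "8K-16K", "16K-32K", "32K-64K",
          "64K-128K", "128K-256K", "256K-512K", "512K-1M"]
gpt_5_2 = [98.2, 89.3, 95.3, 92.0, 85.6, 77.0]            # stops at its 256K window
gpt_5_4 = [97.3, 91.4, 97.2, 90.5, 86.0, 79.3, 57.5, 36.6]

plt.plot(range(len(gpt_5_2)), gpt_5_2, marker="o", label="GPT-5.2 (256K)")
plt.plot(range(len(gpt_5_4)), gpt_5_4, marker="o", label="GPT-5.4 (1M)")
plt.axvspan(5.5, 7.5, color="red", alpha=0.15)             # range GPT-5.2 was never tested on
plt.xticks(range(len(ranges)), ranges, rotation=45)
plt.ylabel("MRCR v2 8-needle score (%)")
plt.legend()
plt.tight_layout()
plt.show()
```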
In the overlapping range (4K-256K), the two models differ by at most ±2%, confirming this is not a generational regression but simply the fact that GPT-5.2 was never tested beyond 256K. GPT-5.4 stays essentially on par with GPT-5.2 within 256K, but extending to 1M reveals the long-context decay that all models face.
The chart above uses data from Anthropic’s Sonnet 4.6 System Card Table 2.16.A (Claude models using internal evaluation + max thinking; Gemini and GPT data from contextarena.ai third-party evaluation), as well as OpenAI’s official GPT-5.4 data and Google’s Gemini 2.5 technical report.
| Model | Score | Source |
|---|---|---|
| Claude Opus 4.6 | 93.0% | Anthropic System Card (max thinking) |
| Claude Sonnet 4.6 | 90.3% | Anthropic System Card (max thinking) |
| GPT-5.2 | 70.0% | OpenAI self-reported (xhigh reasoning) |
| Gemini 3 Flash | 58.5% | contextarena.ai |
| Gemini 3 Pro | 45.4% | contextarena.ai |
| Claude Sonnet 4.5 | 10.8% | Anthropic System Card |
MRCR v2 8-needle at 1M:
| Model | Score | Source |
|---|---|---|
| Claude Opus 4.6 | 76.0% | Anthropic System Card (max thinking) |
| Claude Sonnet 4.6 | 65.8% | Anthropic System Card (max thinking) |
| GPT-5.4 | 36.6% | OpenAI official (512K-1M range) |
| Gemini 3 Flash | 32.6% | contextarena.ai |
| Gemini 3 Pro | 24.5% | contextarena.ai |
| Claude Sonnet 4.5 | 18.5% | Anthropic System Card |
| Gemini 2.5 Pro | 16.4% | Google Technical Report (PDF) |
Claude Opus 4.6’s 76.0% at 1M is roughly a 4x improvement over the previous generation’s Sonnet 4.5 at 18.5%. Anthropic’s release announcement for Opus 4.6 described it as “a qualitative shift in how much context a model can actually use while maintaining peak performance.”
This scatter plot makes a counterintuitive reality visually obvious: a larger context window does not necessarily mean better performance.
GPT-5.2 has only a 256K context window, yet its farthest test point within that range (128K-256K) still reaches 77%. Gemini 2.5 Pro nominally supports a 1M context window but scores only 16.4% at 1M. Gemini 3 Pro has been steadily improving, but still only reaches 24.5% at 1M. In contrast, Claude Opus 4.6 is the only model that sustains 70%+ at 1M.
This phenomenon has been systematically validated in academic research. The Michelangelo evaluation in arXiv 2409.12640 found that GPT and Claude models perform better at short contexts (under 8K) but decay more rapidly, while Gemini models start from a lower baseline at short contexts but decay more gradually, potentially overtaking others at ultra-long contexts (1M). This “crossover effect” suggests an inherent tension between short-context and long-context performance. Models seem to struggle to excel at both.
Anthropic’s context window expansion timeline:
| Date | Model | Context Window | Notes |
|---|---|---|---|
| 2024.3 | Claude 3 Opus/Sonnet | 200K | Technically capable of 1M but not enabled |
| 2024.6 | Claude 3.5 Sonnet | 200K | Still at 200K |
| 2025.8 | Claude Sonnet 4 | 200K + 1M beta | First 1M beta, Tier 4 only |
| 2025.9 | Claude Sonnet 4.5 | 200K + 1M beta | Poor 1M performance (MRCR 18.5%) |
| 2025.11 | Claude Opus 4.5 | 200K | Opus tier still no 1M |
| 2026.2.5 | Claude Opus 4.6 | 200K + 1M beta | First Opus-level 1M, MRCR 76% |
| 2026.2.17 | Claude Sonnet 4.6 | 200K + 1M beta | MRCR 65.8% at 1M |
| 2026.3.13 | Opus 4.6 + Sonnet 4.6 | 1M GA | 1M general availability, standard pricing (source) |
Sources: Anthropic Release Notes, Claude Opus 4.6 Release
From March 2024 to March 2026, Anthropic stayed at 200K for two full years. A September 2025 article on their engineering blog laid out their philosophy explicitly:
“Waiting for larger context windows might seem like an obvious tactic. But it’s likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns. The solution isn’t more capacity; it’s better management of existing capacity.”
This passage signals that Anthropic viewed context window expansion not as a solution but as a secondary concern. Context management was the real problem. In retrospect, the strategy worked. When they finally shipped 1M, the performance (76%) far exceeded that of Gemini (24.5%), which had offered 1M two years earlier.
| Date | Model | Context Window | MRCR v2 8-needle |
|---|---|---|---|
| 2024.2 | Gemini 1.5 Pro | 1M (later expanded to 2M) | 8-needle not published |
| 2025.2 | Gemini 2.0 Flash | 1M | ≤128K: 18.4%, 1M: 10.2% |
| 2025.6 | Gemini 2.5 Pro | 1M (2M planned) | ≤128K: 58.0%, 1M: 16.4% |
| 2025.11 | Gemini 3 Pro | 1M | 128K: 77.0%, 1M: 26.3% |
| 2026.2 | Gemini 3.1 Pro | 1M | 128K: 84.9%, 1M: 26.3% |
Sources: Google Gemini Blog, Gemini 2.5 Tech Report, Gemini 3.1 Pro Model Card
Google was the first to ship a 1M context window in early 2024 and was for a time the only player in the field. But its 8-needle MRCR v2 results have consistently fallen short. Gemini 2.0 Flash scored only 10.2% at 1M. Gemini 3 Pro improved to 26.3%, but the gap with Claude Opus 4.6’s 76% remains massive.
What is interesting is how quickly Gemini has improved at 128K: from 2.5 Pro’s 58.0% to 3.1 Pro’s 84.9%. At 128K, Gemini 3.1 Pro ties Claude Sonnet 4.6 (84.9%), suggesting that medium-context performance has converged across providers (source). But the 1M gap remains.
The developer community has voiced frustrations with Gemini’s long-context reliability. Reddit users have reported that Gemini 3 Pro exhibits noticeable performance degradation after using only 15-20% of its nominal context window (source).
It is worth noting that Latenode claims Gemini 2.5 Pro achieves 100% recall within 530K tokens and 99.7% at 1M (source). However, this data could not be verified in Google’s official technical report and likely comes from a different testing methodology (such as 2-needle rather than 8-needle, or simple passkey retrieval rather than MRCR).
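To illustrate why those methodologies diverge so sharply, here is a toy passkey-style case for contrast with the MRCR sketch earlier: a single, uniquely worded needle with exact-match grading. This is an assumption about what “simple passkey retrieval” looks like, not Latenode’s or Google’s actual protocol.

```python
import random


def build_passkey_case(paragraphs: list[str], seed: int = 0) -> tuple[str, str]:
    """Hide one uniquely worded passkey sentence inside filler paragraphs."""
    rng = random.Random(seed)
    passkey = str(rng.randint(100000, 999999))
    needle = f"The pass key is {passkey}. Remember it."
    pos = rng.randrange(len(paragraphs) + 1)
    return "\n\n".join(paragraphs[:pos] + [needle] + paragraphs[pos:]), passkey


def grade_passkey(response: str, passkey: str) -> bool:
    # Exact containment suffices because the needle is unique and unambiguous,
    # which is why passkey scores saturate long before MRCR 8-needle does.
    return passkey in response
```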
OpenAI’s strategy resembles Anthropic’s conservative approach. GPT-5.2 chose a 256K context window (400K in some configurations), focusing on making performance as close to perfect as possible within that range. When GPT-5.4 expanded to 1.1M, 272K became the standard mode and 1M a paid premium tier (2x billing) (source).
In Codex, GPT-5.4 supports a 1M context window (experimental), enabled through the model_context_window and model_auto_compact_token_limit configuration options.
MRCR v2 is the most important long-context cross-model benchmark today, but it is not the only one. Below are key findings from other benchmarks.
Graphwalks (multi-hop graph reasoning): Claude Opus 4.6 and GPT-5.2 are nearly tied on the Parents task (71.1% vs. 72.0% at 1M), but both struggle on the BFS task (~40% at 1M). Source: Sonnet 4.6 System Card
LongBench v2 (long-document understanding, 503 questions): Gemini 2.5 Pro leads at 63.3%, surpassing the human baseline (53.7%). GPT-4o 46.0%, Claude 3.5 Sonnet 41.0%. Source: LongBench v2 Leaderboard
RULER (retrieval, aggregation, reasoning; 13 tasks): At 128K, only Gemini 1.5 Pro (94.4%) and Jamba-1.5-large (95.1%) maintain above 90%. GPT-4 drops to 81.2%. Source: NVIDIA RULER
LongBench Pro (updated long-document evaluation): Top three are Gemini 2.5 Pro (73.42), GPT-5 (72.61), Claude-4-Sonnet (69.87). Gemini 2.5 Pro shows remarkable insensitivity to context length: its 256K score (71.77) is nearly flat against its 8K score (74.50). Source: arXiv 2601.02872
The same benchmark can yield very different scores under different evaluation conditions. GPT-5.2’s MRCR v2 256K 8-needle result is a case in point: the three source categories above report three different numbers for the same model on the same benchmark. The reasons include differing reasoning effort settings, different aggregation methods (range-level averaging vs. sample-weighted averaging), temperature parameter differences, and more.
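One of those factors, the aggregation method, is easy to underestimate. A small sketch with made-up sample counts (illustrative numbers only, not any provider’s reported data) shows how range-level and sample-weighted averaging diverge:

```python
# Illustrative only: two common aggregation schemes over the same per-bucket results.
# Each tuple is (number_of_samples, mean_score) for one context-length bucket.
buckets = [
    (50, 0.95),  # short contexts: many samples, high scores
    (30, 0.80),
    (10, 0.55),  # near the context limit: few samples, low scores
]

range_level = sum(score for _, score in buckets) / len(buckets)
sample_weighted = sum(n * score for n, score in buckets) / sum(n for n, _ in buckets)

print(f"range-level average:     {range_level:.1%}")      # 76.7%
print(f"sample-weighted average: {sample_weighted:.1%}")  # 85.6%
```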
The figure Google frequently cites in Gemini blog posts, “MRCR 91.5% at 128K,” may come from a 4-needle variant or MRCR v1 rather than 8-needle. In Google’s own technical report, Gemini 2.5 Pro’s 8-needle ≤128K score is only 54.3-58.0%. This discrepancy shows how dramatically the needle count affects difficulty. When comparing MRCR data across sources, one must confirm they refer to the same variant.
The “context window arms race” narrative has run its course. As the March 2026 Frontier article (source) aptly put it:
“The context-window arms race is over. The context-reliability race is the real story now. And it’s a harder problem. Stuffing a million tokens into a window is engineering. Getting the model to actually use what’s buried at token 600,000 is science.”
Industry research supports this assessment. An analysis from Awesome Agents (source) points out that a model’s effective context capacity is typically only 60-70% of its nominal value, beyond which performance degradation becomes impossible to ignore.
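Taken at face value, that estimate suggests budgeting well below the nominal window before triggering compaction or retrieval. A minimal sketch of that rule of thumb (the 0.65 ratio and 0.9 headroom are assumptions derived from the 60-70% figure, not a provider recommendation):

```python
def usable_context_budget(nominal_tokens: int, effective_ratio: float = 0.65) -> int:
    """Conservative budget, assuming only ~60-70% of the nominal window is reliable."""
    return int(nominal_tokens * effective_ratio)


def should_compact(current_tokens: int, nominal_tokens: int, headroom: float = 0.9) -> bool:
    """Trigger summarization/compaction before reaching the effective limit."""
    return current_tokens >= headroom * usable_context_budget(nominal_tokens)


print(usable_context_budget(1_000_000))    # 650000 usable tokens out of a nominal 1M
print(should_compact(600_000, 1_000_000))  # True: compact before quality degrades
```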
For practitioners, the key takeaways are:
- A nominal 1M window says little about reliability; check long-context retrieval benchmarks such as MRCR v2 8-needle at the lengths you actually intend to use.
- Never compare scores across sources without matching the evaluation conditions: needle count, reasoning effort, aggregation method.
- Budget for an effective capacity well below the nominal window and manage context actively rather than relying on raw capacity.
Visualization Files: imgs/mrcr_v2_performance_curve.png, imgs/mrcr_v2_cross_model.png, imgs/context_window_vs_reliability.png
Visualization Script: assets/long_context_benchmark_viz.py