
Shocking! Google AI Weather Forecasting Dominates the Leaderboard! Three Years Later, How Much Was Real?

Published Jun 29, 2026

DeepMind published the GraphCast paper in Science on November 14, 2023, and declared on its blog that the system had achieved world-leading accuracy for 10-day global weather forecasting. Chinese tech media erupted in coverage the very next day. ScienceNet ran the headline “Outperforming Supercomputers! Google’s New AI Model Predicts Weather Faster and More Accurately.” 36Kr and Synced followed with “Crushing Industry SOTA.” QbitAI and IT Home claimed it surpassed the strongest human models on 90% of metrics, and another 36Kr piece wrote that it had defeated the world’s best forecasting system. Of the fifteen Chinese-language reports we reviewed, eleven carried headlines saturated with combative vocabulary (crush, dominate, defeat, outperform), as if AI had once again effortlessly disrupted an entire traditional industry. By contrast, Xinhua News Agency’s headline, “Google-affiliated team develops AI model for medium-range weather forecasting,” was notably restrained and used no adversarial metaphors. Precisely because of that restraint, the piece barely registered in the information stream: measured voices rarely gain traction on traffic-driven social networks.

Three years have now passed. Suppose we grant that AI really did deliver a leapfrog breakthrough over traditional weather forecasting back then. How much, in that case, have the forecasts we actually use in daily life improved?

The most direct way to answer this is to look at operational data. Weather forecasting is one of the very few fields with a clear ground truth and fully public data. We don’t need to take anyone’s word for it — we can simply let the historical curves speak for themselves.

Weather forecasting accuracy has been tracked all along, and nothing changed in 2023

Global weather forecasting skill has been monitored continuously by authoritative international bodies, and the records are open to the public. The European Centre for Medium-Range Weather Forecasts (ECMWF), for example, has maintained a rigorous benchmark of effective forecast days since the 1980s: the forecast lead time, in days, at which the anomaly correlation coefficient (ACC) between predicted and observed 500 hPa geopotential height falls below 80%. This metric is not an internal closed-door assessment. It is a headline figure published openly, maintained in near real time at charts.ecmwf.int, and reported every year in its Technical Memorandum with detailed comparison results. The World Meteorological Organization (WMO) and the U.S. National Oceanic and Atmospheric Administration (NOAA) also operate similar open monitoring platforms that the public can access and audit at any time.
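To make the definition concrete, here is a minimal Python sketch of how an ACC-based lead time could be computed. It is an illustration under simplifying assumptions, not ECMWF’s operational procedure: the function names are hypothetical, the ACC is uncentered and skips the area weighting a real verification would apply, and the sample ACC values are made up.

```python
import math

def anomaly_correlation(forecast, analysis, climatology):
    """Simplified (uncentered, unweighted) ACC over a list of grid values."""
    f = [x - c for x, c in zip(forecast, climatology)]  # forecast anomalies
    a = [x - c for x, c in zip(analysis, climatology)]  # observed anomalies
    num = sum(fi * ai for fi, ai in zip(f, a))
    den = math.sqrt(sum(fi * fi for fi in f) * sum(ai * ai for ai in a))
    return num / den

def lead_time_at_threshold(acc_by_day, threshold=0.8):
    """Lead time (days) at which ACC first drops below the threshold.

    acc_by_day is a list of (lead_day, acc) pairs sorted by lead_day.
    The crossing point is linearly interpolated between adjacent days.
    Returns None if the curve never crosses the threshold.
    """
    for (d0, a0), (d1, a1) in zip(acc_by_day, acc_by_day[1:]):
        if a0 >= threshold > a1:
            return d0 + (a0 - threshold) / (a0 - a1) * (d1 - d0)
    return None

# Made-up ACC curve, for illustration only.
sample_curve = [(8, 0.90), (9, 0.85), (10, 0.81), (11, 0.77)]
print(lead_time_at_threshold(sample_curve))  # roughly 10.25 days
```

The single number that comes out of `lead_time_at_threshold` is what makes the benchmark easy to track over decades: each year’s forecasts collapse to one lead time that can be plotted against the previous year’s.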

Looking at this accuracy curve: in the early 1980s it sat at around 6 to 7 days, and by 2024 it had climbed gently to between 10 and 11 days (data sourced from ECMWF TM 918). That is a gain of about 4 days over roughly four decades, or about 0.1 days per year on average. Across these four decades, the curve has been remarkably smooth, with virtually no sharp inflection points anywhere. There is a well-recognized physical ceiling in meteorological science: the inherent limit of atmospheric predictability due to chaotic dynamics is about 14 days (see Lorenz’s 1963 research). It took four decades of painstaking work to push from 6 days to 10 days, which means that any further progress toward the remaining 4 days will face heavy physical resistance at every step.

Even when the curve touched a historical high in 2023, the ECMWF’s technical report offered a matter-of-fact explanation: this was attributable to several consecutive months of strong performance that year, not a sudden jump at any single point in time. In Figures 17 and 18 of Technical Memorandum TM 918, the researchers plotted GraphCast, Huawei’s Pangu-Weather, and their own AIFS system as separate traces embedded within the ACC trend of the IFS system from 2018 to 2024. The results showed that the AI solutions indeed operated above the traditional high-resolution system (HRES) over the 2023–2024 period, yet HRES’s own evolutionary trajectory continued its gentle upward climb at the same pace as before, without any steep acceleration or discontinuity triggered by the entry of these AI competitors. This new technology did not alter the established gradual upward trend of real-world operational metrics; it merely appeared as a parallel reference line on the chart, not a physical inflection point in the curve itself.
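One rough way to check a skill curve for the kind of discontinuity discussed above is to fit a linear trend to the years before a candidate break, then ask how far the later years fall from that trend. The sketch below does this on synthetic numbers, not ECMWF data. The function names and the 3-sigma rule of thumb are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def break_scores(years, skill, break_year):
    """Distance of each post-break value from the pre-break trend,
    measured in units of the pre-break residual spread."""
    pre = [(y, s) for y, s in zip(years, skill) if y < break_year]
    post = [(y, s) for y, s in zip(years, skill) if y >= break_year]
    slope, intercept = fit_line([y for y, _ in pre], [s for _, s in pre])
    residuals = [s - (slope * y + intercept) for y, s in pre]
    spread = pstdev(residuals)
    return {y: (s - (slope * y + intercept)) / spread for y, s in post}

# Synthetic series: a steady 0.1 days/year climb with a small wiggle.
years = list(range(2010, 2026))
smooth = [9.0 + 0.1 * (y - 2010) + 0.05 * (-1) ** (y - 2010) for y in years]
# The same series with an artificial half-day jump from 2023 onward.
stepped = [s + (0.5 if y >= 2023 else 0.0) for y, s in zip(years, smooth)]

for name, series in (("smooth", smooth), ("stepped", stepped)):
    scores = break_scores(years, series, 2023)
    flagged = any(abs(z) > 3 for z in scores.values())
    print(name, "step change suspected" if flagged else "no step change")
```

On the smooth series the post-2023 points stay within the usual scatter of the earlier trend. On the stepped series they sit far above it. A genuine operational breakthrough would look like the second case.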

Another corroborating data point comes from the U.S. GFS system: by 2025, its effective forecast days had slipped back to 2019 levels, effectively erasing six years of accumulated technical progress (for a detailed analysis, see Balanced Weather). If AI-based weather models had truly restructured the industry landscape from the ground up, such a drastic regression on the operational GFS side would have been theoretically impossible. In essence, the reality never departed from the mundane: a long, gently rising mainline, overlaid with several parallel dashed lines annotated in new colors.

[Figure: Time series of operational weather forecasting skill]

Why readers fell for it back then

Look at the end metric, not the intermediate one

When GraphCast reported its test results, it framed them as outperforming its competitor on 90% of 1,380 specific test targets, rather than as extending the effective forecast lead time by some number of days. Those 1,380 targets are combinations of variable, pressure level, and forecast lead time: per the paper, 69 variable-level pairs, each evaluated at 20 lead times (see arXiv 2212.12794 for details). This means that an advantage on a single variable is counted repeatedly, once for every level and lead time at which it appears in the final scorecard. How the evaluation set is constructed largely determines what win rate comes out. To take a hypothetical example: suppose a prediction system beats the baseline only on temperature but lags on nine other variables (wind direction, humidity, pressure, and so on). If the evaluation set is designed so that temperature dominates the count, with that single temperature advantage tallied across 37 vertical levels and 10 time horizons for 370 wins, and the other nine variables contribute only a few dozen targets between them, the final report card can still show a win rate of about 90%, even though the system underperforms on 90% of the variables that matter for operations.
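The hypothetical above can be reproduced in a few lines of Python. This is a sketch of the counting effect only, with an invented scorecard; the `win_rates` function and the choice of four targets for each losing variable are assumptions made so the arithmetic is easy to follow.

```python
from collections import defaultdict

def win_rates(results):
    """results: list of (variable, won) pairs, one per verification target.

    Returns two shares: the fraction of targets won, and the fraction of
    variables won (a variable counts as won if it wins most of its targets).
    """
    per_variable = defaultdict(list)
    for variable, won in results:
        per_variable[variable].append(won)
    target_rate = sum(won for _, won in results) / len(results)
    variables_won = sum(
        1 for wins in per_variable.values() if sum(wins) > len(wins) / 2
    )
    return target_rate, variables_won / len(per_variable)

# Invented scorecard: temperature wins at all 37 levels x 10 lead times,
# while nine other variables (4 targets each) lose everywhere.
results = [("temperature", True)] * (37 * 10)
for i in range(9):
    results += [(f"other_{i}", False)] * 4

target_rate, variable_rate = win_rates(results)
print(f"{target_rate:.1%} of targets, {variable_rate:.0%} of variables")
# prints "91.1% of targets, 10% of variables"
```

The same set of results yields a win rate above 90% when counted per target and 10% when counted per variable, which is why the construction of the denominator matters more than the headline percentage.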

By contrast, real and usable forecast days are a hard metric immune to statistical sleight of hand. Its measurement is straightforward: how many days into the future can the forecasting service provide reliable predictions? In day-to-day life and production, what people need to know is whether tomorrow’s weather forecast is dependable — not how many micro-victories a computer racked up across over a thousand finely sliced sub-dimensions. The most compelling counter-evidence was already displayed in the chart earlier: if a 90% win rate really represented a revolutionary breakthrough, the upward trajectory of global operational forecast accuracy should have experienced a step change in 2023. Yet the real-world trajectory showed no change whatsoever. Going forward, whenever we encounter gaudy win rates in promotional materials, it’s worth asking ourselves two questions: how exactly was the denominator constructed, and how much of that advantage translates into real-world, on-the-ground improvements?

Don’t just look at what’s being said — look at who’s saying it

Among the fifteen core Chinese-language tech media articles we reviewed, not a single one introduced skeptical commentary from weather scientists with no vested interests. The third-party citations that did not originate from Google itself, upon closer inspection, came from only three categories of people. The first was Matthew Chantry from ECMWF, but his institution served in this story both as the technical benchmark being targeted and as a key research collaborator on the paper. The second was Aditya Grover, a computer science researcher at UCLA, whose academic domain has nothing to do with day-to-day operational weather forecasting. The third was Jacob Radford from Colorado, but his sole quoted perspective in the media was limited to praising the model’s computational efficiency. In other words, at the media level at the time, there was absolutely no voice that could scrutinize DeepMind’s astonishing conclusions from the standpoint of practical meteorology. Whether it was self-media powerhouses like QbitAI and 36Kr, or serious outlets like ScienceNet and Xinhua, they all displayed an unexpectedly uniform approach to sourcing — obediently drifting along within the PR framework pre-established by the research team.

This absence of critical perspective is not unique to the Chinese-language ecosystem. Among twelve representative English-language outlets, eleven likewise included no input from professional weather forecasting experts. The sole outlier was New Scientist, which interviewed Ian Renfrew, a professor of meteorology at the University of East Anglia. He pointed out that in numerical weather prediction the most computationally intensive step, data assimilation, accounts for one-half to two-thirds of the total cost, and that GraphCast cannot itself assimilate real-time observations. It depends entirely on high-quality data fields that other systems have already painstakingly assembled. With only one outlet in twelve offering such a corrective, a reader had little chance of encountering it. During those few days in November 2023, the communication channel in effect formed a closed loop: information flowed one way out of the research lab, technical partners added their endorsements, and by the time the story reached the general audience it had never once passed an independent scrutinizer. The absence of any outside scrutiny free of conflicts of interest is precisely what should raise the loudest alarm.

From now on, when we come across promotional material declaring yet another epoch-defining AI breakthrough, we should proactively do a headcount: among the interviewees quoted in the article, how many faces have no private ties to the tech company in question? If the tally turns out to be zero, it is best to put the excitement on hold and maintain rational restraint.

Stay wary of claims of sudden technological leaps

Moving the weather forecasting system’s skill from 6 days to 10 days took a full four decades, an average of only about 0.1 days of improvement per year. This kind of incremental progress is in fact the most common pattern of evolution in hard technology. Narratives that sell a silver-bullet story, in which some brilliant algorithm upends an entire legacy industry overnight, may sound gratifying, but on the historical record they almost always turn out to be false.

It is undeniable that AI has had genuinely dramatic moments. The emergence of GPT-3, for instance, brought large language models into millions of households, and AlphaGo’s defeat of top Go players is treated as a watershed. But a look at the history shows that both sensations rested on deep technical foundations laid long in advance: the evolution of underlying architectures, the exploration of massive-scale pre-training regimes, and the practical deployment of layered reinforcement learning. Each piece had been fitted into place by engineering teams years before the headline moment. What gets called a leap is usually the point at which accumulated incremental gains cross a critical threshold; it is rarely a creation ex nihilo. Any change genuinely worthy of being called a milestone will also leave traces: the domain in question will have a rigorous operational benchmark maintained year after year without interruption, and that benchmark will show a visible change in slope on the time axis.

Yet in retrospect, the story of GraphCast is different. DeepMind made an absolute claim in its official announcement, declaring it the world’s most accurate 10-day global weather prediction system. But when we look back through three years of the industry’s aggregate records, we cannot detect any trace of this transformation at the macro level. The critical point is not that it failed to deliver a sudden leap. It is that it left no visible mark even on the steady incremental gains, yet clever narrative packaging turned it into a generational miracle. Tellingly, when the research team released GenCast in 2024 and WeatherNext 2 in 2025, it followed the same promotional playbook, each time deploying superlatives like state-of-the-art and frontier. The reported win rate climbed from 90% to 97%, and then to 99%. But a closer look reveals that the reference opponent had been quietly downgraded from ECMWF’s real operational system to the team’s own previous-generation model. A win rate that rises in a straight line against ever-weaker opponents is, in itself, a highly instructive detail.

The next time you see “crush” or “dominate,” here are three things you can do

GraphCast’s role in advancing scientific exploration is undeniable. Its benchmark results are backed by solid evidence, and its open-sourced code has tangibly catalyzed the broader trend of AI-powered weather modeling. However, impact at the research level is not the same as outcomes delivered in the real world. Outperforming an old algorithm on an academic dataset and making the forecasts we receive in daily life more accurate are separated by a long and complex engineering distance, which includes operational adoption, integration into complex systems, and ultimately end-to-end real-world accuracy. The research team’s marketing blurred the boundary between the two, and Chinese media, in passing the story along, uncritically layered on even more combative vocabulary like crush and dominate.

We have been able to reconstruct this story piece by piece largely because weather forecasting happens to have a decades-long open operational benchmark. For the vast majority of AI PR we encounter, the public has no such convenient verification mechanism. The relevant domain-specific metrics are often absent, not publicly disclosed, or controlled by the single vendor that developed the product. We cannot use the same approach to check the win rates of AI legal assistants in actual judicial practice, or the missed-diagnosis rates of AI diagnostic software in real outpatient settings, because in these fields no objective, neutral institution currently takes responsibility for multi-decade longitudinal tracking under a consistent methodology.

Precisely for this reason, when we do come across a case with transparent, publicly available verification sources, taking it apart thoroughly yields a clear set of methods for telling substance from hype. The next time we encounter headlines full of claims of crushing, sweeping, or surpassing the state of the art, there are at least three concrete things we can do: first, seek out the tangible end-performance metric rather than fixating on constructed percentage win rates; second, unpack the quoted sources and count how many independent observers with no business entanglements are actually present; and third, check whether the field has publicly accessible monitoring data that can be consulted at any time. Applied consistently, these three checks will deflate most of the PR bubbles inflated by wordplay.