
Reasoning Models Weren't Born in 2024: The Four-Year Lineage Behind o1 and R1

Published Jun 17, 2026

In September 2024, OpenAI released o1. You ask it a competition math problem in ChatGPT, it goes silent for a dozen seconds, lines of “thinking” scroll across the screen, then it gives the answer. The first time many people saw this, the judgment that formed in their heads was: the model has learned to reason, and seemingly overnight. Four months later, in January 2025, DeepSeek R1 went open source, pushing the story to a climax. Reasoning models became an industry standard almost at once, and even market share curves got rewritten.

But stretch the timeline out and something counterintuitive appears. Reasoning capability was invented by neither o1 nor R1. From “getting the model to write out its solution steps” to “getting the model to spend compute specifically on reasoning,” a full four years of evolution lie in between. What actually changed in 2024 was not that models suddenly learned to reason, but something else most people overlooked.

Here is the conclusion up front. What gets lumped together as “reasoning” is actually three things that should be kept apart. The first is the model’s ability to do multi-step reasoning, a layer that was already being systematically amplified in 2022. The second is the training method of using reinforcement learning to train models to reason, which had academic prototypes by 2023. The third is turning reasoning into a billable, schedulable resource and packaging it as a product to sell you. The real watershed is at this third layer, in the second half of 2024, and even this layer had clear predecessors. More counterintuitively, the part hyped the most today, pure reinforcement learning letting models spontaneously learn to reason, is exactly the part with the weakest evidence and the most controversy.

A Lineage That Ran for Four Years

The story can start in January 2022. That year Google’s Jason Wei and colleagues published Chain-of-Thought Prompting, whose core finding was this: give the model a few examples with “step-by-step reasoning” in the prompt, and it follows suit by outputting intermediate steps, dramatically improving scores on math and commonsense reasoning. A few months later, Kojima et al. found you don’t even need examples, just append “Let’s think step by step” to the question and the model expands its reasoning on its own. This is the popular starting point for “chain of thought.”
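The two prompting styles differ only in what gets wrapped around the question. A minimal sketch, assuming hypothetical helper names and an illustrative worked example (neither function comes from the papers themselves):

```python
# Hypothetical helpers for illustration; not code from either paper.

# One worked example with explicit intermediate steps (illustrative).
FEW_SHOT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)


def few_shot_cot_prompt(question: str) -> str:
    """Few-shot style: prepend a worked example so the model imitates the steps."""
    return f"{FEW_SHOT_EXAMPLE}\nQ: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot style: no examples, only the trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


question = "A train travels 60 km in 1.5 hours. What is its average speed?"
few_shot = few_shot_cot_prompt(question)
zero_shot = zero_shot_cot_prompt(question)
```

In both cases the model's weights are untouched; the only lever is the text placed in front of it.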

But the line goes back earlier than 2022. The Scratchpad work from late 2021 was already fine-tuning models to write out intermediate computation steps. The 2022 STaR (Self-Taught Reasoner) is more pivotal still: it fine-tuned the model on its own reasoning chains that reached correct answers, the seed of “training reasoning with reasoning” and the embryo of the self-improvement idea later behind reinforcement learning. In other words, the shift from purely inducing reasoning through prompting to baking reasoning capability into weights through training happened in 2022, not 2024.
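The core STaR filter is simple enough to sketch: sample a reasoning chain, keep it only if its final answer is correct, and use the kept chains as fine-tuning data. The sketch below assumes a hypothetical `generate` callable standing in for the model and leaves out everything else in the method, including the fine-tuning step itself.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (rationale, final_answer)


def collect_star_examples(
    problems: List[Tuple[str, str]],    # (question, gold_answer)
    generate: Callable[[str], Sample],  # hypothetical stand-in for a model call
) -> List[Tuple[str, str]]:
    """Keep only the rationales whose final answer matches the gold answer."""
    kept = []
    for question, gold in problems:
        rationale, answer = generate(question)
        if answer.strip() == gold.strip():
            kept.append((question, rationale))
    return kept


def toy_generate(question: str) -> Sample:
    """Toy model: right on one question, wrong on the other."""
    canned = {
        "2 + 3": ("2 plus 3 is 5.", "5"),
        "4 * 4": ("4 times 4 is 8.", "8"),  # wrong on purpose
    }
    return canned[question]


fine_tune_set = collect_star_examples(
    [("2 + 3", "5"), ("4 * 4", "16")], toy_generate
)
```

Only the chain that reached the right answer survives, so the next round of training sees the model's own successful reasoning and nothing else.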

Above that sits the evaluation side. The most-cited process reward model paper (a reward model that scores each step) is OpenAI’s Let’s Verify Step by Step from May 2023, which open-sourced the 800,000-step PRM800K dataset and is widely seen as o1’s technical predecessor. But the first systematic comparison was done by DeepMind’s Uesato et al. back in November 2022, the actual origin of the formal contrast between “looking only at the final answer” and “looking at intermediate steps.” Interestingly, that paper’s authors later noted the work was “performed at DeepMind, now at OpenAI,” people and ideas migrating together.
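The difference between the two kinds of supervision shows up most clearly on a chain that lands on the right answer through a bad step. A minimal sketch, assuming made-up per-step scores and hypothetical function names; multiplying step scores is one common aggregation choice, not the only one:

```python
from typing import List


def outcome_score(final_answer: str, gold: str) -> float:
    """Outcome supervision: one signal, based only on the final answer."""
    return 1.0 if final_answer == gold else 0.0


def process_score(step_scores: List[float]) -> float:
    """Process supervision: aggregate per-step correctness estimates.

    The product is used here as an assumption; taking the minimum
    is another common choice.
    """
    score = 1.0
    for s in step_scores:
        score *= s
    return score


# A chain that reaches the right answer through a flawed middle step.
step_scores = [0.9, 0.1, 0.9]        # invented numbers for illustration
outcome = outcome_score("42", "42")  # the final answer happens to be right
process = process_score(step_scores)
```

The outcome signal gives this chain full marks, while the process signal drops to roughly 0.08 because of the one weak step.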

By August 2024 the last piece appeared. Snell et al.’s Scaling LLM Test-Time Compute Optimally (a collaboration of UC Berkeley and Google DeepMind) gave a quantified conclusion: on problems where a smaller model already has a non-trivial success rate, spending more compute at inference can beat a model 14 times larger in parameters. This is the direct academic footnote to o1’s “think longer” narrative.
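The simplest ways to spend extra inference compute are to sample several answers and then either vote or let a verifier choose. The sketch below shows both under toy assumptions; the function names and the toy verifier are hypothetical, and the strategies studied in the paper are more elaborate than this.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Pick the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Pick the candidate that the verifier scores highest."""
    return max(candidates, key=verifier)


def toy_verifier(answer: str) -> float:
    """Hypothetical verifier that happens to prefer the answer '15'."""
    return 1.0 if answer == "15" else 0.2


# Five sampled final answers to the same question (invented).
sampled = ["12", "15", "12", "9", "12"]

voted = majority_vote(sampled)
picked = best_of_n(sampled, toy_verifier)
```

Both routes cost five forward passes instead of one; the bet is that the extra samples buy more accuracy than the same compute spent on a larger model.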

So by mid-2024 all the ingredients were in place. Reasoning capability had been amplified by the chain-of-thought family, the way to step-level verifiers had been paved by PRM work, and “spend more compute at inference to buy accuracy” had been demonstrated by the test-time compute scaling results. What was missing was someone to assemble these parts into a product and train it at scale with reinforcement learning.

What o1 Actually Changed

OpenAI deliberately avoided academic terminology in its official blog, describing o1 with only two product-level verbs: think and reason. What it really did was two things.

The first was training the model’s chain of thought with large-scale reinforcement learning. Before o1, a model’s reasoning ability came mainly from prompting or light fine-tuning. o1 turned “running RL on the chain of thought, with verifiable rewards (math problems have standard answers, code runs and passes) as the feedback signal” into a large-scale training pipeline. This training paradigm was later termed RLVR (Reinforcement Learning with Verifiable Rewards), freeing it from RLHF’s dependence on human annotation.
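A verifiable reward is a program, not a human judgment. The sketch below is one hedged illustration for math-style problems: it assumes completions end with an `Answer:` line, which is a convention invented here, and the function names are hypothetical rather than anything OpenAI has described.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, gold: str) -> float:
    """Reward 1.0 when the extracted answer matches the reference exactly."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == gold.strip():
        return 1.0
    return 0.0


correct = verifiable_reward("6 times 7 is 42.\nAnswer: 42", "42")
wrong = verifiable_reward("6 times 7 is 41.\nAnswer: 41", "42")
missing = verifiable_reward("It is probably a big number.", "42")
```

Because the check is automatic, the reward can be computed for millions of sampled chains without any annotator in the loop. Exact string matching is the crudest possible checker; real pipelines would need answer normalization or, for code, a test runner.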

The second, and the more critical paradigm shift: turning reasoning into a billable, schedulable resource. o1 introduced the concept of reasoning tokens in the API, tokens billed as output and occupying the context window, but whose content is hidden from the user. Developers see only the count, not what the model actually thought. Then o3-mini made reasoning_effort into a low/medium/high parameter, letting you control how much compute the model spends thinking. Reasoning became, for the first time, a knob you could turn.
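The billing consequence is easy to work through with numbers. The sketch below uses a hypothetical `Usage` record and invented prices per million tokens; the field names are illustrative and do not mirror any vendor's response schema.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int  # hidden from the user, still billed as output


def request_cost(usage: Usage, input_price_per_m: float,
                 output_price_per_m: float) -> float:
    """Cost of one request when reasoning tokens are billed as output."""
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    total = (usage.prompt_tokens * input_price_per_m
             + billed_output * output_price_per_m)
    return total / 1_000_000


# Invented numbers: a short visible answer backed by a long hidden chain.
usage = Usage(prompt_tokens=1_000, visible_output_tokens=500,
              reasoning_tokens=4_500)
cost = request_cost(usage, input_price_per_m=10.0, output_price_per_m=40.0)
hidden_share = usage.reasoning_tokens / (
    usage.visible_output_tokens + usage.reasoning_tokens
)
```

In this toy request, nine tenths of the billed output is text the developer never sees, which is why a setting that controls thinking effort is also a cost control.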

One LessWrong technical primer frames the weight of this shift using Sutton’s famous Bitter Lesson, which says that search and learning are the two methods that keep scaling with compute. For the past decade the entire industry scaled only learning (pretraining), leaving the search line unconnected. o1 connected inference-time search, effectively opening a second axis for the scaling law. This wasn’t a capability breakthrough; it was a resource-dimension breakthrough.

Worth noting is why o1 hid the reasoning process. OpenAI gave three reasons: user experience, competitive advantage, and use for safety monitoring. Independent readings generally agree the second is the real one, preventing competitors from distilling their own models off the exposed chain of thought. Simon Willison publicly expressed frustration, arguing that for developers who depend on interpretability, transparency is everything.

The Part of the Magic That’s Overstated

The story so far is missing one piece, and it’s the piece hyped the loudest, and the one most in need of discounting.

DeepSeek told a rather romantic story in the R1 paper. They trained R1-Zero, running pure reinforcement learning directly on a pretrained base model with no supervised data, and the model spontaneously produced behaviors like reflection, verification, and long reasoning, a moment the paper even called the “aha moment.” The Nature publication confirmed these descriptions. If this story holds as told, it is indeed a miracle of creation from nothing.

But independent research quickly pushed back. Sea AI Lab titled their post directly “There May Not be Aha Moment in R1-Zero-like Training.” They systematically tested a batch of base models including Qwen2.5, DeepSeek-Math, and Llama-3.x, and found an awkward fact: the so-called aha moment appears at epoch 0, meaning the base model already self-corrects before any reinforcement learning. What reinforcement learning does is simply raise the frequency of these behaviors. A Tsinghua team went further, arguing that RLVR only improves sampling efficiency without expanding the model’s reasoning boundary. There’s another counterintuitive result: on Qwen2.5-Math-7B, running RL with random rewards still improved MATH-500 scores by about 21 percentage points, close to the roughly 29 points gained with real rewards, hinting that part of the gain may just be “a side effect of any reinforcement learning training.”
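One common way to make “sampling efficiency versus reasoning boundary” concrete is the pass@k metric: the chance that at least one of k sampled answers is correct. The sketch below uses the standard combinatorial estimator with invented counts for a single problem; choosing pass@k as the lens is an assumption made here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented counts for one problem: 100 samples from each model.
base_correct = 2    # base model is right 2 times out of 100
tuned_correct = 40  # RL-tuned model is right 40 times out of 100

base_at_1 = pass_at_k(100, base_correct, 1)
tuned_at_1 = pass_at_k(100, tuned_correct, 1)
base_at_64 = pass_at_k(100, base_correct, 64)
tuned_at_64 = pass_at_k(100, tuned_correct, 64)
```

At k = 1 the tuned model looks twenty times better; at k = 64 the base model also solves the problem most of the time. A large gap at small k that closes at large k is what “sharper sampling, same boundary” would look like.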

Piecing this evidence together, the more accurate statement is: reinforcement learning does not create reasoning capability from nothing, it releases and sharpens the reasoning fragments already baked into the model’s weights during pretraining. R1-Zero’s starting point was DeepSeek-V3-Base, a base model itself pretrained on massive amounts of math, code, and chain-of-thought data. The claim of “pure reinforcement learning giving birth to reasoning” has to be understood against this backdrop.

This is not to deny R1’s engineering value. It did pull performance up to o1’s level, and far more cheaply. What’s genuinely worth admiring isn’t “reinforcement learning created reasoning,” but “reinforcement learning plus verifiable rewards is a remarkably cheap way to mine and amplify capability already present in the model.”

Industry-Wide Convergence in Five Months

From o1’s release in September 2024 to Anthropic’s Claude 3.7 extended thinking in February 2025 was only about five and a half months. In that window nearly every major lab shipped a reasoning model. Alibaba’s Qwen QwQ went open source in November 2024, Google’s Gemini 2.0 Flash Thinking came online in December, Zhipu’s GLM-Zero followed at year’s end, Moonshot’s Kimi k1.5 and DeepSeek R1’s full release landed on the same day in January 2025, and xAI’s Grok 3 appeared in February 2025 with a fully visible thought process. By the end of 2025, OpenRouter data showed reasoning models already accounting for about half of all token usage. From “only o1” to “half the market” took little more than a year.

Why did convergence happen so fast? o1 defined the paradigm and proved the path works. But what truly lowered the barrier was DeepSeek R1. Before R1, the outside world had zero visibility into o1’s training method, and the community could only guess. R1, with a detailed paper, open weights, six distilled models, and an explicit GRPO training recipe, turned the reasoning model from “OpenAI’s mysterious formula” into “an engineering problem any team with a base model and reinforcement-learning engineering capability can attempt.” That is its deepest significance: not inventing reasoning, but making the know-how of reasoning a public good.
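Part of what makes the GRPO recipe approachable is that its central quantity needs no separate value model: each sampled answer is scored relative to the other answers drawn for the same prompt. A minimal sketch of that group-relative advantage, assuming binary rewards and a population standard deviation (implementations differ on such details), with the full policy update left out:

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct, two wrong (invented).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every answer is wrong carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Correct answers get an advantage near +1 and wrong ones near -1, so the update pushes probability toward whatever the group's better samples did. When every sample in a group gets the same reward, all advantages are zero and that prompt contributes nothing.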

Two overlooked details deserve a mention here.

First, Chinese labs entered this race earlier than the usual story suggests. The general impression is “OpenAI first, China follows,” but DeepSeek’s R1-Lite preview shipped on November 20, 2024, about a month before Google’s Gemini Thinking.

Second, among the major players OpenAI stands out for hiding the reasoning process, while nearly every other vendor makes reasoning visible. This isn’t accidental; it reflects two divergent views on safety. Anthropic has been systematically studying chain-of-thought faithfulness since 2023, and they found a counterintuitive phenomenon: the larger the model, the less faithful its chain of thought, meaning it increasingly rationalizes after the fact. In April 2025 they reported that when Claude 3.7 and R1 were secretly handed hints, their chains of thought admitted using the hint only 25% and 39% of the time respectively. Based on this research line, Anthropic chose to make the thinking process visible and budget-adjustable, giving developers the ability to audit it while candidly acknowledging that the thinking text isn’t the same as the model’s actual computation. This and OpenAI’s hiding of reasoning tokens are two opposite product philosophies.

There’s also a deeper evolutionary trend. From late 2024 into early 2025, every reasoning model was a standalone pure-reasoning model. But within 2025 a clear shift toward hybrid convergence appeared: Claude 3.7 made thinking a toggle on the same model, calling itself “the first hybrid reasoning model”; Gemini 2.5 made thinking default and non-disableable; Qwen3 and DeepSeek V3.1 both built switchable dual modes. The industry quickly reached consensus on whether to build reasoning models, then moved from divergence to convergence on whether reasoning is a standalone model or a mode of a single model.

The Dimension Where It Actually Changed

Back to the original question. Was the reasoning model a shock that burst forth?

If the criterion is “can it reason,” the answer is a clear no. Reasoning capability had been systematically amplified since 2022 and was quite mature by mid-2024. o1’s gains on math and coding benchmarks were real, but they were the accumulated result of a continuous line of evolution, not a leap from nowhere. If the criterion is “training method,” the path of training chain of thought with reinforcement learning also had academic predecessors in the 2023 PRM work and the 2024 test-time compute paper.

What genuinely broke new ground in 2024 was the productization layer. Reasoning, for the first time, became a resource that could be billed, scheduled, and turned into an adjustable parameter. The significance is that it opened a second axis for scaling. For the past decade the industry sprinted along the pretraining axis. Karpathy, in his 2025 review, calls RLVR the new major training stage and notes that most of 2025’s capability gains came from the industry digesting the backlog of “low-hanging fruit” from this new stage.

The deeper lesson is perhaps this. Capability rarely genuinely shocks a field. What shocks is when someone packages years of accumulated research into a resource others can buy, schedule, and build on. The reasoning model wasn’t a moment of invention. It was the moment a four-year research line became infrastructure. And the most-hyped part of the magic, pure reinforcement learning conjuring reasoning from nothing, is the part that holds up least under scrutiny. That might be a ruler worth carrying whenever you look at any technological “breakthrough”: first separate whether what you’re seeing is the birth of a capability, or the packaging of one.