Anyone who has used a reasoning model has probably seen this. You give DeepSeek or GPT a medium-difficulty coding problem, and it spends more than 1,000 tokens walking through each constraint you gave it, another 2,000 tokens listing three possible implementation approaches and comparing their trade-offs, and finally 500 tokens writing the answer. The answer is correct, but you did not want to read the first 3,000-plus tokens, and you still have to pay for them.
This is not a quirk of one specific model. It is the default way reasoning models work. After OpenAI released o1 in 2024, the whole industry converged on one conclusion: if you let the model think longer, it can do harder things. Accuracy and reasoning token count are positively correlated, and there is solid data behind that relationship. So DeepSeek’s R1 let long chains of reasoning emerge freely during RL training, Google put major engineering effort into extending thinking time in Gemini Deep Think, and Anthropic added extended thinking to Claude. Everyone has been competing along the same axis: how to make models think longer.
But by the middle of 2025, things started to shift.
Multiple research groups discovered almost at the same time that if you tell the model during training to write as little as possible when accuracy is the same, the model does not get dumber. In some cases, it actually gets better.
NVIDIA used a simple length penalty with the right optimization algorithm and cut response length by more than 70% with almost no change in accuracy. Another paper, Draft-Thinking, implemented two modes: the fast mode cut 76.7% of tokens with less than a 2% accuracy loss, while the careful mode improved accuracy by 14.68% and still reduced token usage by 42.7%.
The intuition behind these numbers is actually simple. Models write that much not because they need that many tokens to reason, but because nobody ever told them they did not have to. The training objective rewards accuracy, not brevity. The model found the easiest way to improve accuracy: think for more steps. Some of that extra reasoning is genuinely useful, but a meaningful share of it is redundant self-checking, over-explanation of obvious points, and repeated weighing of equivalent options.
Once you add a length penalty, the model is forced to balance accuracy against conciseness. Then it starts doing something it did not do before: identifying which reasoning steps are truly necessary and which ones can be skipped.
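The shape of such an objective can be sketched in a few lines. This is a minimal illustration, not any specific paper's reward; the linear form and the `penalty_weight` value are assumptions for the example.

```python
def reward(is_correct: bool, num_tokens: int,
           penalty_weight: float = 0.001) -> float:
    """Accuracy-first reward with a linear length penalty.

    penalty_weight is a hypothetical hyperparameter: too small and
    nothing changes, too large and the model starts dropping
    essential reasoning steps to save tokens.
    """
    accuracy_term = 1.0 if is_correct else 0.0
    return accuracy_term - penalty_weight * num_tokens

# A correct 500-token answer now outscores a correct 3,000-token one,
# while a wrong answer still scores worse than either.
short_correct = reward(True, 500)    # 1.0 - 0.5 = 0.5
long_correct = reward(True, 3000)    # 1.0 - 3.0 = -2.0
```

With accuracy held fixed, the only way to raise the reward is to cut tokens, which is exactly the pressure that triggers compression.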
Meta showed a more complete version of this effect in Muse Spark's technical blog post. They ran experiments on AIME, the American Invitational Mathematics Examination, and gradually increased the weight of the thinking-time penalty during RL training. What they saw was a three-stage dynamic.
The first stage matched expectations exactly: the model improved accuracy by thinking longer. That lines up with the consensus from the o1 era.
The second stage introduced a turning point. Once the penalty weight crossed a certain threshold, the model suddenly switched strategies. It started solving the same problems with fewer tokens and no drop in accuracy. This was not gradual optimization. It was a jump. Meta calls this thought compression.
The third stage is even more interesting. After the compression phase finished, the model started expanding its reasoning chain again, but this time from a higher accuracy baseline. The end result was better performance with fewer total tokens.
The human analogy is straightforward. When you first learn calculus, you need to write every derivation step on paper because you are not fluent yet. After enough practice, you can look at an integral and know the answer immediately, skipping all the intermediate steps. But when you run into a genuinely new problem, you pull out paper again and write out the derivation. The difference is that now you do it much more efficiently because you have already internalized the basic techniques.
This three-stage pattern has shown up in other research as well. Wang et al. described a similar phase transition from a theoretical perspective: early in RL training, the model optimizes procedural correctness, but after a critical point the bottleneck shifts to strategy exploration. The famous aha moment in DeepSeek-R1-Zero is the behavioral signature of that transition.
At this point you might be thinking: GPT has a reasoning_effort parameter with low, medium, and high. Claude has extended thinking. Both let users control how long the model thinks. How is that different from thought compression?

The difference is the layer where it happens. reasoning_effort is an inference-time knob: with the same model, you can ask it to think less or think more, but the model's underlying reasoning efficiency has not changed. Thought compression happens during training. The model is penalized for redundant reasoning during RL training and learns to skip unnecessary steps, so its reasoning naturally becomes shorter. Muse Spark's Instant, Thinking, and Contemplating modes look similar to reasoning_effort on the surface, but underneath, one reflects efficiency learned in training and the other is just budget allocation at inference time.

For end users, the experience may feel similar. The real difference shows up in token billing. If Muse Spark can actually reach the same accuracy with fewer tokens, the cost advantage will matter. But Meta has not released a public API or pricing yet, so that advantage cannot be independently verified.
If redundant reasoning can be compressed away, where should the saved compute budget go?
Meta’s other answer is Contemplating mode. Instead of letting 1 agent think for 60 seconds, let 16 agents think for 10 seconds in parallel and then combine the results. That mode reached 58.4% on Humanity’s Last Exam.
The intuition here is also direct. Suppose you are facing a hard problem and have two options. One is to sit alone at your desk and think for an hour. The other is to send the problem to 16 coworkers at the same time, let each person think for 5 minutes, and then vote on the best answer. For many problems, the second strategy works better because different people attack the problem from different angles, while one person thinking too long can get stuck in a dead end.
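The combine-and-vote step is trivial to sketch. This assumes the simplest aggregation, plain majority voting over final answers; the sample strings below are made up for illustration.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the final answer that appears most often across samples."""
    return Counter(answers).most_common(1)[0][0]

# 16 parallel samples reduced to their final answers.
samples = ["42"] * 9 + ["41"] * 4 + ["40"] * 3
winner = majority_vote(samples)  # "42"
```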
But there is one critical question here: who does the choosing?
Majority voting works well in areas like math where answers are deterministic. The correct answer is simply correct, and the one that appears most often is probably right. But in open-ended problems, all 16 answers may sound plausible, or all 16 may be wrong, and they may even be wrong in the same direction. That is where majority voting breaks down. You need a stronger judge.
That is exactly why one of the most active research directions over the past year has been training a dedicated verifier to replace majority voting.
DeepSeek’s Efficient Reasoning via Reward Model is a representative example. They trained a Conciseness Reward Model that specifically scores how concise a reasoning path is. The paper points out a problem with naive length penalties: if the penalty is too weak, nothing changes; if it is too strong, the model starts skipping critical steps just to save tokens, which leads to length collapse, and in the worst case the training run collapses entirely. Their fix is to train a reward model that explicitly understands what good concise reasoning looks like. What gets removed should be redundant steps, not essential reasoning.
DeepSeekMath-V2 approaches the problem from a different angle. It trains an independent verification model to evaluate the quality of math proofs and then uses that model to guide the training of the proof generator. That creates a new problem: if the generator becomes too strong, the verifier loses its ability to discriminate. Their solution is to keep training the verifier on harder proofs so the capability gap between the two models remains useful.
Another paper, Self-PRM, found an interesting side effect. In pure RL training, DeepSeek-R1 implicitly learned not just how to solve problems, but also how to judge the quality of a solution process. Generation ability and verification ability seem to share underlying mechanisms. But the accuracy of this implicit verifier stays below 10% on hard problems, which means a lot more training is still needed if you want to separate verification from generation and make it reliably useful.
These papers all point in the same direction: generation is getting cheap, and verification is becoming the bottleneck. Once a large model can easily generate 16 reasoning paths, final accuracy no longer depends mainly on whether one path is better than the others. It depends on whether you can reliably identify which path is best.
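Swapping majority voting for verifier-based selection changes one line of the aggregation logic. The verifier below is a hypothetical stand-in (in practice it would be a trained reward model), and the sample paths are invented; note that a majority vote over these paths would pick the wrong answer.

```python
def best_of_n(paths: list[str], score) -> str:
    """Pick the path the verifier scores highest, not the most common one."""
    return max(paths, key=score)

# Toy stand-in for a trained verifier: here it simply knows the
# correct final answer. A real verifier scores reasoning quality.
def toy_verifier(path: str) -> float:
    return 1.0 if path.endswith("17") else 0.0

# Two of three paths agree on a wrong answer; majority voting fails,
# verifier selection does not.
paths = ["long derivation -> 13", "short derivation -> 17", "guess -> 13"]
best = best_of_n(paths, toy_verifier)  # "short derivation -> 17"
```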
Reasoning token billing is one of the biggest variable costs in API usage. If your application calls a reasoning model 10,000 times a day and each call burns an extra 3,000 redundant tokens on average, that adds up to a real expense at current pricing. So reasoning efficiency is not an academic topic. It directly affects your product economics.
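The arithmetic is worth making concrete. The price below is an assumed placeholder; actual output-token prices vary widely by provider and model.

```python
calls_per_day = 10_000
redundant_tokens_per_call = 3_000
usd_per_million_output_tokens = 2.00  # assumed price, varies by provider

wasted_tokens_per_day = calls_per_day * redundant_tokens_per_call  # 30M
daily_cost = wasted_tokens_per_day / 1_000_000 * usd_per_million_output_tokens
monthly_cost = daily_cost * 30
# 30M redundant tokens/day -> $60/day, $1,800/month at this price
```

At higher frontier-model prices the same token waste scales proportionally, which is why a 70% length reduction translates directly into margin.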
But the deeper change is that reasoning efficiency is shifting from being an intrinsic model property to being a system property you can design for. You have many more choices now than you did a year ago.
For routine tasks with moderate complexity, you can use compression-trained models or draft modes and cut most reasoning tokens with only a small accuracy hit. For truly difficult tasks, instead of letting one model think for 60 seconds, you are often better off combining parallel sampling with a verifier and getting better results with lower latency. For core business scenarios, it may even make sense to train a dedicated verifier. It does not need to be huge. It only needs to reliably judge reasoning quality on the task types you actually care about.
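These three options amount to a routing policy. The sketch below is a hypothetical dispatcher, not any vendor's API; the strategy names and the difficulty labels are assumptions for illustration.

```python
def choose_strategy(difficulty: str, core_business: bool) -> str:
    """Hypothetical router mirroring the three options above."""
    if core_business:
        # Worth training a small task-specific verifier.
        return "parallel sampling + dedicated verifier"
    if difficulty == "hard":
        # N short samples plus a verifier beats one long chain.
        return "parallel sampling + verifier"
    # Routine tasks: compressed reasoning is good enough.
    return "compression-trained model or draft mode"
```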
Snell et al. provide a useful quantitative reference point: a small model with a verifier, using parallel sampling, can outperform a model that is 14 times larger. For teams with limited budgets, that means it may be more effective to optimize the reasoning architecture than to chase ever-larger models. Multiple cheap samples plus one strong verifier may beat a direct call to the most expensive frontier model on both cost and quality.
The industry consensus in 2024 was to make models think longer. In 2025, it became making models think less while staying just as good. In 2026, it is becoming making models think smarter instead of longer. The direction of travel is already clear.