Model Architecture: Industry and Competition

What "Distillation" Actually Does for Chinese AI Companies

Survey date: April 14, 2026
Sources: academic papers (ACL, EMNLP, AAAI, ICLR, ICML, Nature), technical reports, independent benchmarks


In early 2026, Anthropic and OpenAI accused Chinese AI companies of using “distillation” to extract model capabilities at scale. Anthropic’s report cited roughly 16 million API calls across 24,000 accounts. Media coverage was extensive, and “distillation” became a buzzword in the U.S.-China AI rivalry.

But if you know a bit about machine learning, something about this accusation feels off.

Two contradictions

“Distillation” has a well-established meaning in machine learning. Hinton’s 2015 knowledge distillation is about training a smaller model on the full probability distribution of a larger model, not just its final answer. For example, when a large model classifies an image as “cat,” its internal probability output might be {cat: 0.7, leopard: 0.15, dog: 0.1, horse: 0.05}. This distribution itself carries rich information: cats and leopards are more similar than cats and horses. Hinton called this “dark knowledge.” The widely cited DistilBERT was built this way: it used BERT’s full probability distributions, intermediate-layer features, and even directly copied a subset of its weights, ultimately achieving a 40% parameter reduction while retaining 97% of the accuracy.
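To make the "dark knowledge" point concrete, here is a minimal sketch of the soft-target loss from Hinton et al. (2015) in pure Python. The logits are made-up illustrative values chosen to roughly match the cat/leopard example; a real implementation operates on full logit tensors with a gradient framework.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-T softmax; higher T flattens the distribution."""
    z = [x / T for x in logits]
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs --
    the soft-target term of Hinton-style knowledge distillation."""
    p = softmax(teacher_logits, T)           # teacher's soft targets
    q = softmax(student_logits, T)           # student's predictions
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Illustrative teacher logits over [cat, leopard, dog, horse]:
teacher = [4.0, 2.5, 2.1, 1.4]
print([round(x, 3) for x in softmax(teacher)])
# A hard label would collapse this distribution to [1, 0, 0, 0],
# discarding the cat-vs-leopard similarity structure ("dark knowledge").
```

The loss goes to zero only when the student reproduces the teacher's entire distribution, not just its argmax; that is exactly the signal a text-only API never exposes.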

But when Chinese companies call Claude’s or GPT-4’s API, what do they get? Only the final text response. No probability distributions, no intermediate layers, no weights. DistilBERT’s method simply does not apply here.

This is the first contradiction: a conceptual one. What people call “distillation” and what actually happened are two different things. What actually happened is closer to copying homework at scale: collecting frontier model responses and using them to train your own model. The academic terms for this are imitation learning or SFT on synthetic data, which are technically almost unrelated to Hinton’s distillation.
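Operationally, "copying homework at scale" reduces to something like the sketch below. `teacher_api` is a hypothetical stand-in for a frontier-model API call; the point is what it returns: final text only, no logits, no intermediate layers, no weights.

```python
def teacher_api(prompt: str) -> str:
    # Stand-in for a frontier-model API call: returns text and nothing else.
    return f"[teacher's answer to: {prompt}]"

def collect_imitation_dataset(prompts):
    """Build (prompt, response) pairs for SFT on synthetic data."""
    return [{"prompt": p, "response": teacher_api(p)} for p in prompts]

dataset = collect_imitation_dataset([
    "Explain quicksort step by step.",
    "Summarize the causes of World War I.",
])
print(len(dataset))  # 2
```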

The second contradiction is about effectiveness. If all that happened was “collecting a batch of high-quality Q&A data and training on it,” what did the latecomers actually save? Pretraining still has to happen (or you use an open-source model), and training compute scales with model size, regardless of data source. More critically, an ICLR 2024 paper, “The False Promise of Imitating Proprietary LLMs”, ran the experiment directly: scaling imitation data from 25M to 150M tokens produced essentially no improvement for a 13B model on MMLU, HumanEval, or GSM8K. The conclusion was that imitation learning primarily transfers style and tone, not reasoning ability.

If classical distillation techniques don’t apply and pure imitation has limited effectiveness, how exactly did “distillation” help latecomers?

First, let’s distinguish three different operations

Before answering, we need to draw some lines, because the word “distillation” is used to describe three operations with entirely different mechanisms.

The first: classical distillation. This requires opening up the teacher model’s internals to access full probability distributions and intermediate states. DistilBERT used this approach. It cannot be done with closed-source APIs.

The second: output imitation. This uses only the teacher model’s final text responses for training. Alpaca (roughly $500 in API costs, 52K outputs from OpenAI’s text-davinci-003), Vicuna ($300 cost, 70K ChatGPT conversations), and OpenAI’s 2024 “distillation” product feature all fall into this category. As noted above, this approach primarily transfers style, not capability. This is consistent with what I found in my earlier fine-tuning article: fine-tuning is good for teaching a model your tone of voice, but it cannot teach it new skills.

The third: chain-of-thought transfer. The training data includes not just the teacher’s final answer but the full reasoning process. The teacher doesn’t just say “the answer is 42”; it says “first we know X, from which we derive Y, and combining with condition Z, we get 42.” Microsoft’s Orca and DeepSeek R1’s distilled variants use this approach. This direction has evolved rapidly in 2025-2026, moving from one-shot answer copying followed by training (off-policy) to an iterative method where the student first attempts the problem and then the teacher grades it (on-policy). The latter works substantially better: Qwen3’s technical report shows on-policy distillation reaching 74.4% on AIME’24 compared to 55.0% for off-policy, a gap of nearly 20 points, at one-tenth the compute cost of RL (Qwen3 Technical Report). Microsoft Research’s GAD (2025) further demonstrated that this on-policy method works under fully black-box conditions, using a discriminator to distinguish student and teacher outputs in lieu of access to the teacher’s internal states.
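The off-policy/on-policy difference can be sketched as follows. Every name here is an illustrative stand-in; real on-policy distillation scores the student's tokens under the teacher's log-probs (or, in GAD, a discriminator), not the crude token match used below.

```python
# Toy "tokens" are whitespace-split words, just enough to show where the
# training signal comes from in each regime.
TEACHER_CHAIN = "first compute 2 + 2 = 4 then answer 4".split()

def off_policy_target(prompt):
    """Off-policy: train on the chain the teacher wrote (prompt ignored
    in this toy). The student never gets feedback on its own mistakes."""
    return TEACHER_CHAIN

def on_policy_feedback(student_attempt):
    """On-policy: the student attempts first; the teacher scores the
    student's own tokens, so feedback lands on the states the student
    actually visits -- including its errors."""
    return [1.0 if s == t else 0.0
            for s, t in zip(student_attempt, TEACHER_CHAIN)]

attempt = "first compute 2 + 2 = 5 then answer 5".split()
print(on_policy_feedback(attempt))  # zeros exactly where the student erred
```

The intuition for the AIME gap: off-policy training only ever shows the student correct states, so it has no signal about how to recover from its own characteristic mistakes.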

Based on the behavioral patterns Anthropic described (bulk extraction of reasoning chains, tool-use patterns, code generation data), the operations at the center of the U.S.-China distillation controversy are primarily a mix of the second and third types.

What latecomers actually gained

Most articles list a string of benefits: saved training costs, free alignment, acquired reasoning ability. But examining each one individually reveals that their actual value varies widely. Some are overestimated. One is underestimated.

Response format and tone: acquired, but cheap

A raw pretrained model (say, Llama’s base version) that receives an instruction will most likely continue the text rather than answer the question. Ask it to “write a poem about autumn leaves” and it might continue with “write an essay about winter snow, write a poem about spring flowers…” because that is the pattern of continuous text it saw during pretraining.

After one round of training on ChatGPT responses, the model learns: “oh, when a human asks me a question, I should answer the question.” This sounds basic, but it genuinely needs to be learned. InstructGPT’s data showed that just learning to “answer questions instead of continuing text” allowed a 1.3B model to surpass the 175B GPT-3 in user satisfaction (Ouyang et al., NeurIPS 2022).

So latecomers do acquire this layer through imitation: instruction following, conversational structure, Markdown output formatting, multi-turn context management. These have nothing to do with values alignment; they are part of basic product usability.
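Mechanically, this format layer is simple. A sketch of what a single SFT example looks like after templating; the template tokens below are illustrative, not any particular model's actual special tokens.

```python
def format_sft_example(user_msg: str, assistant_msg: str) -> str:
    """Wrap a (user message, desired reply) pair in a chat template.
    Training on such strings teaches a base model that text after the
    assistant marker should answer, not continue, the user's text."""
    return (
        "<|user|>\n" + user_msg + "\n"
        "<|assistant|>\n" + assistant_msg + "\n<|end|>"
    )

print(format_sft_example(
    "Write a poem about autumn leaves",
    "Crimson drifts on a cooling wind...",
))
```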

But this layer is very cheap. The LIMA paper (Zhou et al., NeurIPS 2023) ran an extreme experiment: training with only 1,000 curated examples, a 65B LLaMA was rated better than GPT-4 in 43% of blind comparisons. The paper’s conclusion is blunt: a model’s knowledge and capabilities are almost entirely acquired during pretraining; subsequent training merely teaches it “what format to use when replying to users.” Something that 1,000 examples can teach is not worth large-scale API calls to acquire.

Safety alignment: mostly irrelevant to latecomers

Many articles cite “skipping alignment costs” as a major benefit of distillation. Frontier models do invest heavily in safety alignment (InstructGPT used roughly 40 annotators, 33K-50K preference comparisons; frontier RLHF costs run in the $5-20M range). But there is an overlooked premise: that money was spent on safety alignment within a Western political and cultural context. Chinese models need content filtering for an entirely different set of topics, and that work has to be done from scratch. Conversely, Western models’ over-moderation may actually need to be removed for Chinese deployments.

For latecomers competing primarily on performance and price, fine-grained safety alignment ranks behind core capability development. So this $5-20M alignment cost is something latecomers neither need nor can save through distillation.

Skipping thinking trace construction: the truly underestimated benefit

As discussed above, chain-of-thought transfer is far more effective than pure output imitation. But its real value goes beyond “learning problem-solving strategies.” The greater value is that latecomers skip the entire R&D process of thinking trace construction.

What is thinking trace construction? It is the process of teaching a model to reason from scratch. DeepSeek R1’s technical report (published in Nature) documents how difficult this is: they ran pure RL (GRPO) on a 671B-parameter base model, generating 512 candidate responses per step over thousands of steps, until the model spontaneously developed self-verification and reflection (the so-called “aha moment”). They then used this RL model’s outputs for rejection sampling, filtering down to 600K reasoning samples and 200K general samples for supervised fine-tuning, followed by a final round of RL. The entire four-stage pipeline requires world-class RL infrastructure, a 671B-parameter base model, and extensive trial-and-error engineering.
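For intuition, the group-relative advantage at the core of GRPO can be sketched in a few lines. Rewards here are made-up pass/fail values; the full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted.

```python
import statistics

def grpo_advantages(rewards):
    """Each candidate's advantage is its z-score within its own sampling
    group -- no learned value model, unlike PPO."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:                        # all candidates equally good/bad
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# One prompt, 8 sampled candidate responses, 3 of which pass an answer check:
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print([round(a, 2) for a in grpo_advantages(group)])
```

Correct candidates get positive advantage, incorrect ones negative, and the group baseline is free, which is part of why this approach is compute-hungry in sampling but cheap in machinery.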

Latecomers can skip all of this through distillation.

DeepSeek’s own data provides the most compelling evidence. They compared two training paths for the same 32B base model (DeepSeek R1 paper, Section 4.1): pure RL from scratch versus distillation from R1. Distillation outperformed RL-from-scratch by roughly 25 points on AIME 2024, at a fraction of the compute cost. Even more extreme data comes from Hu et al. (arXiv:2505.21067): with only 920 distilled samples, a distilled 32B model surpassed multiple 32B models trained with RL from scratch (DAPO-32B at 34.8%, SimpleRL-32B at 9.4%).

An intuition: distillation is like hiring an exceptional tutor and copying down all their problem-solving methods. Frontier labs invested enormous R&D resources (RL infrastructure, 671B models, four-stage pipelines) to teach their models to reason. Latecomers can obtain these reasoning chains directly through distillation, skipping the entire R&D process. This is a far more specific and convincing narrative than “saved labeling costs” or “got alignment for free.”

But distillation carries a fundamental cost: the absence of generalization

If the story ended here, distillation would indeed be a silver bullet. But an important 2025 paper revealed a fundamental limitation.

Chu et al. published “SFT Memorizes, RL Generalizes” at ICML 2025 (Google / Berkeley). Their experimental design is straightforward: compare SFT (the underlying mechanism of distillation) and RL on the same set of tasks, then test both on out-of-distribution (OOD) tasks.

The results are clear: RL models showed 3.5-11% positive transfer on OOD tasks, while SFT models degraded by up to 79.5% on OOD.

Back to the tutoring analogy: a student who copied problem-solving methods may score higher within the exam’s scope than a self-taught student (DeepSeek’s data confirms this). But if the questions go beyond the exam’s scope, the self-taught student holds up better. Distilled models memorize solution templates without acquiring the underlying ability to generalize.

DeepSeek’s own data corroborates this limitation. The full R1 (RL-trained) scores 79.8% on AIME 2024; the distilled version tops out at 72.6% (32B). The gap widens on LiveCodeBench and GPQA Diamond. Moreover, the distilled version drops 6-14% in accuracy when problem phrasing is slightly altered (MathGPT.AI analysis), a classic signature of memorization rather than understanding.

A June 2025 study further confirmed the domain limitation: R1’s distilled version scored 14.7% lower than Qwen-Base on medical tasks (arXiv:2506.02126). Mathematical reasoning frameworks can be transferred, but medical knowledge cannot.

Mistral’s Magistral (2025) provides an illustrative contrast: pure RL training (no distillation) achieved 73.6% on AIME 2024, and capabilities gained from math RL training spontaneously transferred to the coding domain. This kind of cross-domain transfer is something distillation cannot achieve.

Agentic data: the new battleground of 2026

The analysis above has focused on reasoning chains, but Anthropic’s accusations also mentioned another category of data: tool use, agent reasoning, and code generation. This agentic data has become especially important in 2026, as agent capabilities have emerged as a primary competitive dimension for model providers.

Why agentic trajectories differ from reasoning chains

Reasoning chains teach “how to decompose a math problem.” Agentic trajectories teach “how to interact with the external world”: when to call a search engine, how to decide the next step after receiving results, how to handle errors and branching in multi-step execution, how to orchestrate across multiple tools. This is a different type of capability.

Moreover, producing high-quality agentic training data is itself a full engineering pipeline. Kimi K2’s technical report (arXiv:2602.02276) documents how Moonshot did it: building synthetic tool environments with persistent state, creating simulated users with diverse communication styles, generating multi-turn tool_call/tool_response interaction trajectories, filtering with an LLM judge to keep only successful trajectories, then using this data for SFT + RL. The entire pipeline is independent of any external API.
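A hypothetical sketch of the filtering step in such a pipeline: keep only trajectories that a judge marks successful. The trajectory schema and judge below are stand-ins; the report describes an LLM judge, not a field check.

```python
def judge(trajectory) -> bool:
    """Stand-in for an LLM judge: here, success means the final tool
    call in the trajectory came back ok."""
    last_step = trajectory["steps"][-1]
    return last_step["tool_response"].get("ok", False)

def filter_trajectories(trajectories):
    """Keep only successful multi-turn tool-use trajectories for SFT + RL."""
    return [t for t in trajectories if judge(t)]

raw = [
    {"steps": [{"tool_call": "search", "tool_response": {"ok": True}}]},
    {"steps": [{"tool_call": "search", "tool_response": {"ok": False}}]},
]
print(len(filter_trajectories(raw)))  # 1
```

Note that every stage of this pipeline, from environment simulation to judging, must be built and debugged in-house; that engineering cost is exactly what API-based trajectory collection skips.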

If latecomers can collect millions of tool-use trajectories directly from frontier models’ APIs, they can skip building this entire pipeline. The logic mirrors reasoning-chain distillation: distilling reasoning chains lets you skip the R&D cost of RL; distilling agentic trajectories lets you skip the cost of building synthetic environments and trajectory pipelines.

Timeline cross-check: three companies, three very different stories

Anthropic’s report groups three companies together, but a close reading of the data reveals that what they were doing differs substantially. More importantly, aligning the accusation timelines with each company’s model release dates shows very different strengths of correlation.

MiniMax: strongest correlation. MiniMax accounted for 81% of total call volume (13 million calls), primarily extracting agentic coding and task orchestration capabilities. The behavioral patterns Anthropic described are highly specific: they caught MiniMax’s distillation activity before M2.5 had been released. Within 24 hours of Claude Opus 4.6 going live on February 5, MiniMax shifted half its traffic to the new model. M2.5 was ultimately released on February 12 and was characterized as “near Claude Opus 4.6 performance at much lower cost.” The scale, timing, and behavioral patterns of MiniMax’s distillation activity show a direct temporal correlation with M2.5’s development.

Moonshot: independent pipeline documented, but timeline overlap. Moonshot’s 3.4 million calls targeted agentic reasoning, tool use, and computer vision. However, Kimi K2’s technical report (published on arXiv in July 2025) documents a self-developed trajectory generation pipeline in detail. K2.5 was released in January 2026, featuring Agent Swarm (100 parallel sub-agents) and PARL (Parallel-Agent Reinforcement Learning), both independent innovations absent from Claude. Anthropic’s report states the distillation activity “followed K2.5’s January release,” meaning this data was likely feeding subsequent models rather than K2.5 itself. Moonshot’s situation looks more like: independent R&D capability exists, but they were also supplementing with data from Claude.

DeepSeek: weakest correlation. DeepSeek’s call volume was the smallest (150,000 calls, less than 1% of the total), and the target was reasoning chains and RL reward signals rather than agentic data. More critically, the timeline: DeepSeek’s major capability jumps (V3 in December 2024, R1 in January 2025) occurred before the alleged Anthropic distillation activity began (around July 2025). The R1 paper was published in Nature in September 2025, with peer review confirming GRPO as an independent methodological innovation. The progression of subsequent releases, V3.1 (August 2025) and V3.2 (December 2025, with Thinking-in-Tool-Use), in both magnitude and timing, aligns more closely with continuous iteration than distillation-driven leaps.

OpenAI’s investigation is a separate story: Microsoft detected suspicious activity from DeepSeek-associated accounts on OpenAI’s systems in fall 2024. This is a different event, a different timeline, and a different target from Anthropic’s accusations.

The window for agentic distillation

Viewing the three companies’ stories together, a pattern emerges: the moment when distilling agentic data is most valuable is the phase before a latecomer’s own agentic product has accumulated enough user data. Once a company has its own product and user traffic, it has its own source of trajectory data.

Cursor is a positive example. It took the open-source Kimi K2.5 as a base and used its own user coding data for RL post-training (Phil Schmid analysis, 2026). It did not need to distill Claude’s or GPT-4’s agentic trajectories because it already had millions of user coding sessions as training data. This is the legitimate path to agentic data: generating training data from your own product’s traffic.

By early 2026, Kimi K2.5 had Agent Swarm, GLM-5 reached 77.8% on SWE-Bench, and DeepSeek V3.2 had Thinking-in-Tool-Use. These companies already had their own agentic products and user bases. The value of distilling frontier models’ agentic trajectories as a cold-start strategy had diminished significantly by this point.

So what was actually saved

Back to the original question. How distillation helps latecomers is different from what most articles describe.

The commonly cited benefits mostly don’t hold up. Distillation does not save training compute. “Free alignment” is largely a mirage: safety-alignment requirements don’t transfer across political and cultural contexts, and behavioral formatting is too cheap to be worth extracting at scale.

But one benefit is larger than initially expected: skipping the entire R&D process of thinking trace construction and agentic trajectory pipeline development. Frontier labs made enormous investments to teach models reasoning and tool use; latecomers can obtain these results directly through distillation. Chain-of-thought distillation from just 920 samples can outperform models trained with RL from scratch. On-policy methods (the latest advances from 2025-2026) have further improved effectiveness, enabling a 14B student model to approach GPT-5 teacher-level performance. And Anthropic’s own accusation data shows that agentic data (tool use, agent reasoning) was the primary target of distillation activity.

But this benefit has two fundamental limitations.

The first is the absence of generalization: distilled models perform well within the training distribution but become brittle outside it. SFT memorizes templates; RL learns to generalize.

The second is temporal decay: the value of distillation as a cold-start strategy diminishes as latecomers’ own products mature. Once a company has its own user traffic and agentic products, it has its own source of trajectory data. The differentiated timelines of the three companies in Anthropic’s accusations bear this out: DeepSeek’s capability leap occurred before the alleged distillation (Nature confirmed independent innovation), Moonshot has publicly documented self-developed pipelines, and MiniMax shows the strongest correlation but its story also illustrates that distillation provides catch-up acceleration rather than sustained advantage.

This means distillation is a rapid catch-up strategy, not a long-term competitive strategy. The choice latecomers face is not “to distill or not to distill,” but “distill first to catch up quickly, then at what point to start building your own RL pipeline and agentic data flywheel.”

DeepSeek itself is the best example of this trajectory. Its R1 paper was confirmed by Nature as independent innovation. Its competitive advantage comes not from distilling anyone’s outputs but from its own post-training methodological innovations (GRPO, four-stage pipeline, R1-Zero experiments). Distillation can help you catch up to the current frontier, but to move beyond the frontier, you need your own RL pipeline.


Source index

Academic papers (peer-reviewed)

Technical blogs and independent analyses