The Base Model Controversy Behind Composer 2, and Model Strategy in AI Coding Tools

2026-03-20

Cursor’s Technical Trajectory Through Three Blog Posts

To understand the base model controversy around Composer 2, you need to look at what Cursor has published over the past five months. Three blog posts and one accompanying research note form a complete technical evolution timeline.

In October 2025, Composer 1 was released. It was an MoE model, and Cursor never publicly disclosed its base model origin. When someone directly asked Sasha Rush (Cursor’s head of research) whether the model was fine-tuned from an open-source base model, Rush’s response was evasive: the team’s focus, he said, is on RL post-training, which they believe is the best path to turning a model into a strong interactive agent. Composer 1’s entire technical narrative revolved around RL: in an agent harness simulating Cursor’s production environment, the model was given access to tools like file editors, terminals, and semantic search, then trained via RL to make more efficient tool-calling decisions. The model itself had no chain-of-thought, speed was the main selling point, and most interactions completed within 30 seconds.
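The harness described above follows a standard shape: the model proposes tool calls, the environment executes them, and an outcome reward at the end of the episode drives the RL update. The sketch below illustrates that loop under stated assumptions; all names (`ToolCall`, `run_episode`, the toy environment) are illustrative and do not reflect Cursor’s actual API.

```python
# Minimal sketch of an outcome-reward agent loop of the kind described above.
# The policy picks a tool call, the environment executes it, and the episode's
# final reward (e.g. tests passing) is what RL would optimize.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str   # e.g. "edit_file", "terminal", "semantic_search"
    args: dict

@dataclass
class Episode:
    transcript: list = field(default_factory=list)
    reward: float = 0.0

def run_episode(policy, env, max_steps=10):
    """Roll out one agent episode; only the outcome reward is scored."""
    ep = Episode()
    obs = env.reset()
    for _ in range(max_steps):
        call = policy(obs)            # model decides which tool to call
        obs, done = env.step(call)    # harness executes the call
        ep.transcript.append((call, obs))
        if done:
            break
    ep.reward = env.outcome_reward()  # e.g. 1.0 if the task's tests pass
    return ep

class ToyEnv:
    """Trivial stand-in environment: done once the 'terminal' tool is called."""
    def reset(self):
        self.calls = 0
        return "start"
    def step(self, call):
        self.calls += 1
        return f"obs{self.calls}", call.tool == "terminal"
    def outcome_reward(self):
        return 1.0 if self.calls <= 3 else 0.0
```

In a real harness the environment would be a sandboxed repository with a real terminal and file system, and the policy would be the language model itself; the structure of the loop stays the same.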

In February 2026, Composer 1.5 was released. The base model was identical to version 1, with no continued pretraining whatsoever. All changes came from post-training: RL compute was scaled up 20x, and post-training consumed more compute than the base model pretraining itself. That’s a number worth remembering. Two new training behaviors were introduced simultaneously: thinking tokens (adaptive-depth reasoning — fast responses for simple questions, extended reasoning chains for complex ones) and self-summarization (automatic compression of history when context approaches length limits, with the compressed result itself participating in RL reward signals). The key judgment in the blog post: “RL for coding can be continually scaled with predictable intelligence improvements.” At this stage, Cursor’s core argument was that RL scaling laws hold in the coding domain — invest more post-training compute, and model intelligence will keep improving.
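The self-summarization behavior can be sketched as a simple context-budget check: when accumulated history exceeds the limit, older turns are folded into a model-written summary that stays in context. This is a hedged illustration of the mechanism as the blog describes it; `summarize` stands in for a model call, and the function names and token accounting are assumptions, not Cursor’s implementation.

```python
# Sketch of self-summarization: when history approaches the context budget,
# replace older turns with a compressed summary that remains in context.
def summarize(turns):
    # Placeholder for a model-generated summary; in Cursor's training the
    # compressed result itself participates in RL reward signals.
    return "[summary of %d earlier turns]" % len(turns)

def maybe_compress(history, budget, keep_recent=2, cost=len):
    """Fold older turns into a summary if total cost exceeds `budget`.

    `cost` maps a turn to its token cost (character count here, for brevity);
    the most recent `keep_recent` turns are always kept verbatim.
    """
    total = sum(cost(t) for t in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

Because the summary re-enters the context and later reward depends on it, the model is trained not just to compress but to compress in a way that preserves task-relevant information.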

On March 17, 2026, Cassano and Rush published a research note on self-summarization, detailing the compression mechanism’s implementation and training integration. Two days later, Composer 2 was released.

Composer 2 has a fundamental difference from the previous two versions: it introduced continued pretraining. The blog post states: “these quality improvements come from our first continued pretraining run, which provides a far stronger base to scale our reinforcement learning.” This sentence implies two things. First, Cursor acknowledges that Composer 2 has a pretraining base, and that this base underwent continued pretraining. Second, this is their first time doing continued pretraining, meaning Composer 1 and 1.5 did not take this step.

The benchmark data shows this move paid off significantly. CursorBench improved by 6.2 points from 1 to 1.5 (38.0 → 44.2), and by 17.1 points from 1.5 to 2 (44.2 → 61.3) — nearly three times the former. Considering that 20x more RL compute was already invested between 1 and 1.5, this comparison suggests a possibility: the RL-only route was approaching diminishing returns at the 1.5 stage, while continued pretraining provided a new starting point that allowed subsequent RL to achieve higher marginal returns again. Cursor hasn’t published ablation studies to prove this, but the data points in a consistent direction.

The Evidence Chain for the Base Model Controversy

With the technical trajectory established, the base model controversy has a framework for discussion.

When someone probed OpenAI’s base URL in Cursor, they saw this internal path:

accounts/anysphere/models/kimi-k2p5-rl-0317-s515-fast

kimi-k2p5 points to the Kimi K2.5 model family, rl corresponds to reinforcement learning, and 0317 and s515 resemble date plus training step internal markers. This kind of naming is common for intermediate artifact tagging in engineering systems.
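The article’s reading of that internal name can be expressed as a pattern over its hyphen-separated fields. The field meanings (model family, training stage, date, step, serving variant) follow the community’s inference and are not confirmed by Cursor; the pattern below is an illustration of that decomposition, not a documented naming scheme.

```python
# Decompose an internal model name per the community's inferred convention.
import re

NAME_RE = re.compile(
    r"(?P<family>[a-z]+-k\dp\d)"  # model family, e.g. kimi-k2p5 (K2.5)
    r"-(?P<stage>rl)"             # training-stage marker
    r"-(?P<date>\d{4})"           # MMDD-style date, e.g. 0317
    r"-s(?P<step>\d+)"            # training step, e.g. s515
    r"-(?P<variant>\w+)"          # serving variant, e.g. fast
)

def parse_model_name(name):
    """Return the inferred fields, or None if the name doesn't fit the pattern."""
    m = NAME_RE.fullmatch(name)
    return m.groupdict() if m else None
```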

Meanwhile, Moonshot’s pretraining lead Yulun Du publicly commented on tokenizer similarity and questioned licensing compliance, although the relevant posts were later deleted. Two other Moonshot employees confirmed that Cursor had not obtained authorization for this type of use (also deleted).

But on March 20, the official Kimi / Moonshot account added what is now the strongest first-party evidence. The post states directly that “Kimi-k2.5 provide[s] the foundation,” and further says that Cursor accesses Kimi K2.5 through Fireworks AI’s hosted RL and inference platform as part of an authorized commercial partnership. That adds three things at once. First, Kimi K2.5 as the base of Composer 2 is no longer just a community-side inference; it becomes a public acknowledgment from the model provider. Second, Cursor’s method is described as continued pretraining plus high-compute RL, which matches Cursor’s own technical narrative. Third, Fireworks AI is named for the first time as the infrastructure and hosting intermediary.

Taken together: Cursor officially confirms the continued pretraining path, the community-discovered internal path strongly points to Kimi K2.5, and Moonshot’s early insider reactions and later official account statement point in the same direction. The Kimi K2.5 lineage has now moved beyond high-confidence inference into something closer to semi-official confirmation. The caveat that still matters: the X post was edited, the original version is not publicly visible, and “authorized commercial partnership” is PR language rather than a disclosed legal structure.

There’s a common oversimplification that needs correcting. Saying Composer 2 is just Kimi K2.5 plus RL omits the continued pretraining step. In Cursor’s own technical narrative, this step is precisely the core change in Composer 2 relative to 1.5. Continued pretraining adjusts the model’s task distribution and capability focus — it determines where subsequent RL starts from. Leave it out, and you can’t understand why the performance jump from 1.5 to 2 is so large.

This Route Is Not an Isolated Case

The route Cursor is taking has already produced at least two direct parallel cases in the AI coding tool space.

Cognition’s SWE-1.5 (the model behind Windsurf) adopted an almost identical approach. Their blog states: “after careful evals and ablations, we selected a strong open-source model as the base for our post-training.” Community analysis points to Zhipu’s GLM-4.6. Like Cursor’s first two Composer releases, SWE-1.5 skipped continued pretraining, running RL directly on the base model, and the RL environment was Windsurf’s own Cascade agent harness — the model was already using the product’s specific tools during training. Cognition showed an even more granular approach in another blog post: they separately trained a small model, SWE-grep-mini (based on Phi-3-mini), specifically optimizing parallel tool-calling capability in the context retrieval stage. This shows that targeted, component-level RL optimization is also emerging.

The base model providers’ strategies are shifting too. Zhipu adopted a fully open MIT license for GLM-5, with no user volume limits, encouraging downstream products to build on top. This is more of an infrastructure-layer strategy: gaining ecosystem position through open source, capturing commercial value through inference APIs and enterprise deployments. Moonshot’s Kimi K2 chose a modified MIT license, requiring interface attribution for derivative products exceeding 100 million MAU or $20 million monthly revenue. Two licensing strategies correspond to two ecosystem positioning approaches.

Why This Route Works

A natural question: why does RL on an existing base model produce such large capability gains, rather than just learning surface patterns?

An ICML 2025 paper from HKU, UC Berkeley, NYU, and Google DeepMind provides a useful explanatory framework. The paper is titled SFT Memorizes, RL Generalizes. The core finding is that SFT tends to memorize training samples and has limited generalization under distribution shift, while outcome-reward-based RL promotes deeper capability generalization — up to 61% out-of-distribution improvement on vision tasks. RL improves underlying perception and reasoning capabilities, not just task performance.

Another related finding is RL’s implicit regularization effect. Among multiple high-reward solutions, on-policy RL naturally favors solutions close to the base model in KL divergence. This means RL stacks domain skills while preserving the base model’s general capabilities, rather than overwriting them. This property makes RL post-training on open-source base models an efficient capability stacking method, not simple behavioral cloning.
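The implicit-regularization point can be stated against the familiar KL-regularized objective used in RL post-training. Even when the penalty term is not written down explicitly, the observation is that on-policy updates behave as if a bias toward the reference policy were present. Symbols here are standard: $\pi_\theta$ is the trained policy, $\pi_{\mathrm{ref}}$ the base model, $r$ the outcome reward, and $\beta$ the regularization strength.

```latex
% KL-regularized policy objective: maximize expected outcome reward while
% staying close (in KL divergence) to the reference (base) policy.
J(\theta) \;=\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\; \beta \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \;\|\; \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

Among solutions with equal reward, this objective is maximized by the one nearest the base model, which is exactly the “stack domain skills while preserving general capabilities” behavior described above.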

Moonshot’s own Kimi K2 technical report has a clear statement on the relationship between pretraining and post-training: “Pre-training is the crucial foundation for Agentic Intelligence, establishing the priors that makes reinforcement learning exploration tractable, efficient, and generalizable.” Pretraining establishes priors, RL performs efficient exploration over those priors. This framework explains why Composer 2’s continued pretraining delivered such large returns: it changed the quality of RL’s exploration starting point.

From a compute allocation trend perspective, this route aligns with industry direction. FundaAI’s analysis notes that by 2025, OpenAI was allocating 70-80% of training compute to mid-training and RL, rather than the pretraining stage. The center of gravity in training is shifting backward.

The Boundary Between Licensing and Governance

Kimi K2.5’s modified MIT license has a clear trigger condition: when a derivative product exceeds 100 million MAU or $20 million monthly revenue, it must prominently display the Kimi K2.5 identifier in its interface. Cursor’s ARR is estimated at around $2 billion. Composer 2’s interface shows no Kimi-related attribution.
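The trigger condition is a simple disjunction, which the toy check below makes explicit. The thresholds are the figures stated in the license summary above; the function name and the “exceeds” reading (strictly greater than) are my assumptions, and none of this is legal advice.

```python
# Toy expression of the modified-MIT attribution trigger described above:
# attribution is required once EITHER threshold is exceeded.
MAU_THRESHOLD = 100_000_000       # monthly active users
REVENUE_THRESHOLD = 20_000_000    # USD monthly revenue

def attribution_required(mau, monthly_revenue_usd):
    """True if the derivative product must display the Kimi K2.5 identifier."""
    return mau > MAU_THRESHOLD or monthly_revenue_usd > REVENUE_THRESHOLD
```

At a roughly $2 billion ARR (about $167 million per month), the revenue condition alone would trip the trigger regardless of user count.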

At this point, the earlier framing of “if the Kimi K2.5 lineage is confirmed” no longer needs to carry as much weight. Moonshot’s official account has publicly described the arrangement as an authorized commercial partnership and named Fireworks AI as the hosting platform. That shifts part of the discussion: the key open question is no longer simply whether the base is Kimi K2.5, but what the exact technical and commercial boundaries of the arrangement are.

Several critical pieces of information are still missing from public view: whether Cursor and Moonshot signed a separate commercial agreement, whether Fireworks is acting only as a hosting layer or also as part of the authorization path, whether the product of continued pretraining plus RL still legally counts as a derivative work under the original license, and what Cursor’s own official position is. With those details absent, characterizing this as a governance risk is still more accurate than calling it a license violation outright. The new X post weakens the “unauthorized use” reading, but it does not fully clarify the legal structure.

Windsurf’s situation provides an interesting contrast. Cognition is believed to use Zhipu’s GLM-4.6, and GLM uses a standard MIT license with no user volume thresholds. Base model selection directly determines the licensing constraint strength faced by downstream products. This makes base model selection a decision with both technical and commercial dimensions.

Back to Composer 2

Putting the technical evolution, base model evidence, industry parallels, and research support together, Composer 2 can be fairly accurately described as: starting from a Kimi K2.5-level MoE base model, undergoing targeted continued pretraining to adjust task distribution, then training long-task behavior through long-horizon RL and self-summarization, and finally deep integration into Cursor’s editor toolchain and agent execution environment.

Over the five months from Composer 1 to 2, Cursor’s technical narrative underwent a noteworthy shift. In the 1 and 1.5 stages, the claim was that RL post-training was the sole source of differentiation — RL scaling laws held, and investing more post-training compute would continuously improve intelligence. At stage 2, continued pretraining was introduced, delivering nearly triple the benchmark improvement of the previous iteration. This at least suggests that the RL-only route encountered diminishing returns in the coding domain, and the amplification effect of base model quality on post-training is being re-emphasized.

This observation has reference value for the entire AI coding tool space. It means the relative weight of four components — base model selection, continued pretraining, RL post-training, and product integration — may be continuously shifting with product maturity. Early on, RL can quickly create differentiation, but at a certain stage, the base model’s own quality and targeted modifications become the bottleneck again. The ability to continuously iterate on the combination of these four components may matter more than pouring more compute into any single layer.

A few things worth watching going forward: whether Cursor will do full base model pretraining in a future version, whether Windsurf and other competitors will also shift from RL-only to continued pretraining, and as more products share the same batch of open-source base models, how frequently base model attribution, licensing compliance, and source transparency will appear in discussions of product competition.