2026-04-19
When news broke on April 17 that Cursor was raising $2B+ at a $50B valuation, the most scrutinized detail was that enterprise customers had already reached positive gross margins, primarily by using self-trained Composer models to cut third-party API costs.
Meanwhile, LLM inference costs are dropping at roughly 10x per year. Data from the Stanford HAI 2025 report shows that the cost of achieving GPT-3.5-equivalent performance fell 280x in 18 months. At that rate, API calls could be cheaper than self-hosted models within two to three years. So why is Cursor investing heavily in building its own?
Lay out the major players in a single table, and a pattern emerges:
| Company | Model Strategy | ARR | Background |
|---|---|---|---|
| Cursor | Base + vertical customization | $2B | Independent tool company |
| GitHub Copilot | Base + vertical customization | $450-850M | Platform company (Microsoft) |
| Cognition (Devin+Windsurf) | Full-stack self-trained | $150M | Independent tool company |
| Claude Code | Pure API (own models) | $1B+ | Model company (Anthropic) |
| Codex | Pure API (own models) | Not separately disclosed | Model company (OpenAI) |
| Augment Code | Pure API (abandoned self-training) | $20M | Independent tool company |
Among independent tool companies, every one that reached significant scale is either self-training or deeply customizing models. The largest pure-API independent player, Augment, has only $20M ARR — and it tried self-training before giving up. Claude Code and Codex reached scale, but they are model companies whose models are their own. No independent tool company running purely on third-party APIs has crossed $100M ARR.
Here is the counterintuitive part: API prices are falling this fast, yet nobody in the space is profitable. Three mechanisms are at work simultaneously.
First, cheaper inference drives explosive usage growth. Cursor’s agent usage grew 15x in the past year, and agent mode consumes 5-30x more tokens per task than standard code completion. Unit price drops 10x, volume rises 15x, total cost goes up.
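The arithmetic above can be sketched in a few lines. This is a back-of-envelope illustration using the figures in the text (10x unit-price drop, 15x volume growth, the low end of the 5-30x per-task token range); the absolute units are arbitrary.

```python
# Back-of-envelope sketch of the usage-growth dynamic described above.
# All numbers are illustrative assumptions taken from the text, not reported figures.

def total_cost(unit_price: float, tasks: int, tokens_per_task: float) -> float:
    """Total inference spend = price per token * tasks * tokens per task."""
    return unit_price * tasks * tokens_per_task

# Year 0: baseline (arbitrary units)
year0 = total_cost(unit_price=1.0, tasks=100, tokens_per_task=1.0)

# Year 1: unit price falls 10x, task volume grows 15x,
# and agent mode uses ~5x more tokens per task (low end of the 5-30x range)
year1 = total_cost(unit_price=0.1, tasks=1500, tokens_per_task=5.0)

print(f"year 0 cost: {year0:.0f}")
print(f"year 1 cost: {year1:.0f}")
print(f"cost grew {year1 / year0:.1f}x despite a 10x unit-price drop")
```

Even at the conservative end of the token-consumption range, total spend grows several-fold while unit price collapses.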
Second, users want this year’s strongest model, not last year’s cheap one. The 280x cost decline refers to GPT-3.5-equivalent performance, but product competition demands the latest Claude Opus or GPT-5, whose prices have not dropped proportionally. Newcomer.co reports that Cursor’s heaviest losses come from power users consuming Anthropic’s most expensive models.
Third, model companies are subsidizing coding tools with their own margins. Claude Code’s $200/month Pro plan reportedly consumes approximately $5,000 in compute, with Anthropic absorbing the gap from its $30B annualized revenue. Independent tool companies competing on user experience against this pricing must either lose money themselves or build their own models to bring costs down.
The result of these three layers: unit inference cost is falling, but the volume and quality of inference that products require keeps rising, and competition forces companies to pass savings to users. Margins do not stay with tool companies. In this landscape, the value of self-trained models goes beyond saving on API fees — it transforms marginal cost from linear (pay-per-API-call) to sublinear (owned GPU marginal cost approaching zero). The larger the scale, the greater the advantage. This explains why every independent company that reached scale chose to self-train.
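The linear-versus-sublinear distinction can be made concrete with a toy cost model: API spend scales linearly with tokens, while an owned cluster is mostly fixed cost with near-zero marginal cost. All parameters here are illustrative assumptions, not Cursor's actual rates.

```python
# Toy model of the cost-structure shift described above.
# Rates and fixed costs are assumed for illustration only.

API_RATE = 10.0          # $ per 1M tokens via third-party API (assumed)
CLUSTER_FIXED = 400_000  # $ per month for an owned GPU cluster (assumed)
SELF_RATE = 0.5          # $ per 1M tokens marginal cost when self-hosting (assumed)

def api_cost(million_tokens: float) -> float:
    """Pure-API spend: strictly linear in volume."""
    return API_RATE * million_tokens

def self_hosted_cost(million_tokens: float) -> float:
    """Self-hosted spend: large fixed cost, tiny marginal cost."""
    return CLUSTER_FIXED + SELF_RATE * million_tokens

# Breakeven volume: fixed cost / (API rate - self-hosted marginal rate)
breakeven = CLUSTER_FIXED / (API_RATE - SELF_RATE)
print(f"breakeven at ~{breakeven:,.0f}M tokens/month")

for m in (10_000, 100_000, 1_000_000):
    print(f"{m:>9,}M tokens: API ${api_cost(m):>12,.0f} "
          f"vs self-hosted ${self_hosted_cost(m):>12,.0f}")
```

Below breakeven, the API is cheaper; above it, the gap widens without bound as volume grows, which is exactly why the advantage compounds with scale.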
This pattern yields an analytical framework: whether self-training is necessary depends on a company’s position in the value chain. Model companies building coding tools own the models themselves — API cost is not their problem. Platform companies have ecosystem moats, and model customization is one cost-reduction lever among several. Independent tool companies have neither, making self-trained models almost a prerequisite for survival at scale.
The sections below expand on this framework, starting with the specific predicament Cursor faces as an independent tool company.
The numbers first. Cursor reached $2B ARR in February 2026, with roughly 300 employees and zero marketing spend. Zero to $2B in under three years — one of the fastest growth curves in B2B SaaS history.
But the flip side of that growth is losses. Newcomer.co’s reporting revealed the full cost picture: Cursor operates at negative gross margins overall, with the largest source of losses being individual developers whose Anthropic API bills far exceed the $20/month subscription. Independent analyst firm Foundamental estimated Cursor pays Anthropic approximately $650M annually against roughly $500M in annualized revenue at the time, implying a -30% gross margin.
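The -30% figure follows directly from the two estimates cited above:

```python
# Sanity check of the gross-margin figure from the Foundamental estimate above.
revenue = 500_000_000   # annualized revenue at the time, $
api_spend = 650_000_000 # estimated annual payment to Anthropic, $

gross_margin = (revenue - api_spend) / revenue
print(f"gross margin: {gross_margin:.0%}")
```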
The root cause: when an AI coding tool outsources its core capability to a third-party API, every additional power user deepens the loss. A $20 monthly fee cannot cover a single heavy user’s daily Claude API consumption.
Cursor’s solution came in two steps.
Step one was repricing. In June 2025, Cursor switched from “500 requests/month” to a token-based credit system. This triggered severe user backlash. Forum complaints centered on unpredictability: one user burned 30% of monthly quota in 90 minutes, a team exhausted a $7,000 annual subscription in one day, and users reported $119 surprise charges. Cursor ultimately acknowledged the communication failure and issued refunds, but the trust damage was done.
Step two was building its own inference engine. In October 2025, Cursor shipped its first self-trained model, Composer. Composer 2 followed in March 2026. The economic motivation was to eliminate API middleman markup and replace general-purpose frontier models with specialized ones to reduce per-inference cost.
Composer’s technical approach was covered in detail in a previous analysis. Brief recap: Composer 2 uses Moonshot AI’s Kimi K2.5 as its base, with continued pretraining (shifting task distribution and capability focus) and large-scale RL post-training (training tool-calling behavior in Cursor’s own editor environment). Moonshot officially confirmed this partnership, with Fireworks AI providing the hosting platform.
Cursor did not train a foundation model from scratch (that would require hundreds of millions of dollars and thousands of GPUs). Instead, it vertically customized an existing open-weight base. This approach costs one to two orders of magnitude less than full pretraining, while the resulting model achieves faster inference (4x faster per official claims) and lower token consumption in the coding vertical.
The Composer 2 technical report shows the payoff in benchmarks: CursorBench improved 6.2 points from version 1 to 1.5 (20x RL compute), and 17.1 points from 1.5 to 2 (with continued pretraining added) — nearly three times the gain. This suggests the RL-only approach was hitting diminishing returns at version 1.5, and that base model quality has a multiplier effect on post-training outcomes.
The opening table already shows the divergence in model strategies. This section fills in what each route actually involves, along with the details the table cannot capture.

Cursor’s approach was covered above. GitHub Copilot follows the same path: taking OpenAI models as the base and applying continued pretraining, SFT, and RL on top. Copilot’s custom model currently covers code completion (20% acceptance rate lift, 3x throughput, 35% latency reduction), but completion is a high-frequency, low-cost operation. The agent scenarios that actually consume tokens (Copilot Workspace, cross-file editing) still rely on frontier models like GPT-4o and Claude. This is one reason the $10/month individual Copilot plan remains unprofitable despite having a custom model.
The economic logic of vertical customization is converting cost structure from variable (pay-per-API-call) to fixed (owned GPU cluster), but only if custom models also serve agent scenarios. Independent analysis shows self-hosted H100 clusters can save approximately 76% versus pure API, but monthly consumption must be large enough (benchmarked against GPT-5 API pricing, breakeven at roughly 256M tokens/month). Cursor processes nearly 1 billion lines of code daily, well above this threshold — and Composer directly serves the agent scenario, placing cost savings where spending is highest.
Speed matters too. Composer’s inference speed is approximately 200 tokens/second, nearly 3x Claude API’s roughly 70 tokens/second. Faster responses keep developers in flow state, which actually increases usage. A self-hosted inference stack also enables more aggressive KV cache and speculative decoding optimizations that are impossible when calling third-party APIs.
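Using the cited throughput figures, the per-response latency gap is easy to estimate. This sketch assumes decoding speed dominates and ignores prompt processing and network overhead:

```python
# Rough latency comparison from the throughput figures cited above.
# Assumes decoding time dominates; prompt processing and network ignored.

COMPOSER_TPS = 200  # tokens/second (cited)
CLAUDE_TPS = 70     # tokens/second (cited)

def response_seconds(tokens: int, tps: float) -> float:
    """Time to stream a response of the given length at a decode rate."""
    return tokens / tps

for tokens in (500, 2000):
    fast = response_seconds(tokens, COMPOSER_TPS)
    slow = response_seconds(tokens, CLAUDE_TPS)
    print(f"{tokens}-token response: Composer {fast:.1f}s vs Claude API {slow:.1f}s")
```

For a long agent response, the difference is tens of seconds per turn, which compounds across a multi-turn session.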
Cognition (parent company of Devin + Windsurf) goes further: the SWE-1 series is pretrained from scratch at tens-of-billions parameter scale, using only permissively licensed code data. Devin runs a multi-model coordination architecture internally, with Planner, Coder, Critic, and Browser each independently trained. Amazon Q Developer belongs to this category as well, trained on billions of internal code lines and deeply optimized for AWS infrastructure scenarios.
Full-stack self-training requires the largest investment but provides the strongest control, with no risk of upstream suppliers simultaneously building competing products.
Claude Code uses Anthropic’s general-purpose Claude models directly (Opus for planning, Sonnet for execution), with no coding-specific model training. Anthropic’s position is that general reasoning capability sufficiently covers coding scenarios, with Opus 4.6 reaching 80.8% on SWE-bench Verified.
Claude Code can operate at scale because of the model company’s overall economic structure: the product itself generates roughly $500M in annualized revenue, backstopped by Anthropic’s $30B total annualized revenue supporting inference infrastructure. The $200/month Pro plan reportedly consumes approximately $5,000 in compute — a subsidy ratio only a model company can sustain.
Augment Code provides a counterexample. This $252M-funded company tried self-training and explicitly abandoned it. The CEO’s rationale: models turn over every few months, and fine-tuned models are quickly surpassed by the next generation of general-purpose models. Augment bets its differentiation on Context Engine (RAG retrieval over enterprise codebases) and intelligent model routing. Its current ARR is $20M.
Each route’s economic logic holds internally, but the self-training route carries a risk that rarely appears in fundraising narratives: to cut costs, tool companies are incentivized to route more requests to cheaper self-trained models, and users may perceive quality degradation.
Multiple users on Cursor’s forum reported quality regression. One three-person team described it: “Over the past two months, Cursor has become extremely stupid, making too many mistakes and unnecessary assumptions.” Another user’s observation was more specific: “Quality might be about the same, but token consumption has gone insane. Output is bloated, and costs accumulate fast.”
Claude Code’s user feedback provides a contrast. One developer migrating from Cursor to Claude Code wrote: “Claude Code is very concise, efficient, and accurately completes tasks.” Pragmatic Engineer’s survey of 906 developers showed Claude Code ranking first with a 46% “most loved” score.
A Hacker News commenter identified the mechanism behind this tension: “If Cursor’s cost optimization leads it to retain 20% fewer context tokens than a model provider’s coding agent, all else equal, it will perform worse.” If inference cost savings are spent in the form of user churn, the economic advantage of self-trained models gets discounted.
As noted in the opening, self-trained models derive their value from converting marginal cost from linear to sublinear. But with inference costs dropping 10x annually, that advantage window is narrowing. API costs two to three years from now could be lower than today’s self-hosted GPU costs.
Yet what self-training accumulates does not disappear when API prices fall. The editing behavior data from Cursor’s nearly 1 billion daily lines of code is continuous fuel for model improvement. Once this flywheel spins up, latecomers with equally cheap API access still lack equivalent-scale training signals. The same applies to inference stack control: KV cache strategies, speculative decoding, and context management require end-to-end tuning from the inference layer to the product layer — something pure API consumers cannot do.
There is also supplier risk. Cursor’s largest API supplier, Anthropic, is simultaneously building Claude Code. OpenAI attempted to acquire Windsurf for $3B — the deal fell through, but the signal was clear enough. The biggest supplier can become the biggest competitor at any time. In this sense, self-trained models are a survival guarantee, just as important as cost optimization.
The core bet behind the $50B valuation: whether Cursor can use the margin window that self-trained models provide to build sufficiently deep data moats and product moats before inference costs trend toward zero. Cursor’s models are not smarter than Claude in many scenarios, but if the time bought by cost advantages is enough to build the flywheel, the absolute capability gap becomes secondary.