
Token Pricing Squeeze: Buyers Flee, Sellers Double Down, Builders Absorb the Gap

Published Jul 2, 2026

On June 30, 2026, Anthropic launched Claude Sonnet 5. According to Anthropic’s official Sonnet 5 documentation, the promotional input and output prices are $2 and $10 per million tokens; starting in September, these rise to $3 and $15. According to Finout’s analysis, the new tokenizer produces roughly 30% more tokens for the same text. The actual cost of the same task nearly doubles compared with the previous-generation Sonnet 4.6.

The next day, July 1, Palantir CEO Karp publicly blasted this metered pricing on a television program. According to CNBC’s coverage, he said enterprise customers are furious about the current billing model.

The prevailing sentiment in enterprise circles, as Karp described it, is: “I’m going to relax, waste my time on tokens, get no value, and they’ll take my intellectual property.”

Karp argues that charging by the word signals that AI’s entire logic has gone off the rails:

“I’m not throwing shade at them, but something has gone completely wrong.”

He posed a rhetorical question: if frontier models truly create value, they should charge a share of the outcome, not by the word. As Karp put it, “If this thing is actually valuable — say I can make you $10 billion tomorrow — shouldn’t I say ‘I made you $10 billion, I want 30%’? If it’s so valuable, why charge by the token?”

Within the same week, buyers are fleeing metered pricing while sellers are doubling down on it. The two sides are pulling in opposite directions, and the contradiction is surfacing in plain sight.

The token bill is laying bare a problem that flat subscription pricing once obscured. Customers buy outcomes but pay by the token, and there is still no reliable conversion between the two. Buyers reject unpredictable naked token bills; sellers, under the pressure of losses, must make their costs explicit. The resulting friction lands on the product developers caught in the middle. Token billing won’t disappear, but it will recede into the background. The unit that remains in the foreground for direct billing must be closer than the token to the outcome the customer is buying.

What Buyers Are Fleeing

Buyers are fleeing unpredictable, variable compute costs that are decoupled from business outcomes.

The most direct evidence comes from the mobility giant Uber. In December 2025, Uber deployed Claude Code and Cursor to roughly 5,000 engineers. Internally, it set up an AI tool usage leaderboard that accelerated consumption. Within four months, the entire fiscal 2026 AI coding budget was burned through.

According to Fortune’s coverage of Uber, COO Andrew Macdonald said publicly on a podcast that the company can’t make sense of the ROI on AI spending:

“that link is not there yet”

Macdonald explained that even when code output increases, finance struggles to draw a direct causal line between incremental code and user-facing features. Without outcome support, high compute spending cannot justify itself on the books.

To prevent budget runaway, Uber imposed usage caps. According to Simon Willison’s analysis of the caps, Uber capped per-engineer monthly tool spending at $1,500. Unpredictable variable costs were thus converted into governable fixed caps.

Microsoft took the same path. According to The Verge’s report, on the last day of its fiscal year (June 30), Microsoft revoked Claude Code licenses for thousands of engineers, routing them back to its own Copilot. The official line was toolchain consolidation, but sources explicitly stated that the decision was also financially motivated — a way to trim fiscal year operating costs.

Small and mid-sized developers are voting with their feet as well. On June 1, 2026, GitHub Copilot switched from flat monthly subscriptions to per-token usage billing. Under the old model, overages would silently fall back to cheaper models; under the new model, overages either stop service or incur additional charges at API rates.

According to GitHub community discussions, many users who previously consumed only 20% to 30% of their allowance burned through it in just one or two days under the new rules. To avoid excess charges, a wave of developers canceled subscriptions and switched to other tools.

These three cases point to the same problem. Enterprises and developers cannot predict how large the next bill will be, and cannot articulate what the money already spent actually bought. Under metered pricing, per-person monthly compute costs range from $150 to $2,000, a roughly 13x spread. That budget unpredictability has become a serious barrier to metered pricing in enterprise settings.

Why Sellers Can’t Stop

Sellers are not doubling down on metered pricing out of greed. Crushing losses are forcing them to make compute costs visible.

Even leading players face existential challenges. According to The Information’s report, on roughly $13 billion in revenue, OpenAI is projected to lose $14 billion in 2026, three times its 2024 loss.

Anthropic is in the same boat. Annualized revenue has surged to $47 billion, but steady-state monthly compute costs reach $1.25 billion ($15 billion annualized). In Q2 it briefly posted $559 million in operating profit, but the profitable window cannot be sustained under depreciation pressure.

Massive capital expenditure is shifting the settlement rules between cloud giants and frontier labs. A textbook example is Amazon and Anthropic. According to AI Weekly’s report on Amazon’s billing terms shift, starting in 2027, the unit Amazon uses to pay Anthropic for model usage is switching from per-hour to per-token billing.

This change imposes direct cost pressure on Amazon. To keep its bill in check, Amazon engineering teams began distilling Claude into smaller versions. These smaller models feed into product lines like Alexa, reducing token consumption ahead of the change.

Beneath this lies a misalignment of interests between cloud giants and closed-source labs. AWS, Google Cloud, and Microsoft Azure resell third-party closed-source models on a per-token basis and must split revenue with the model vendors. If customers switch to the cloud giants’ own low-cost models (such as Nova or Gemini), the cloud giants keep all the revenue. This arithmetic forces every party to guard the cost of every single token.

Under this double squeeze, model developers are stealth-raising prices through tokenizer upgrades while keeping nominal prices flat.

According to CloudZero’s analysis of Claude Opus pricing, the new tokenizer deployed in updated models splits the same text into more tokens. The inflation is most extreme for code, JSON, and XML at up to 35%. Nominal unit prices haven’t changed, but enterprise bills for running the same workloads have quietly risen.
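The arithmetic behind a tokenizer-driven price rise is simple enough to sketch. The snippet below is a minimal illustration, assuming the bill scales linearly with token count; the function name is hypothetical, and the only figure taken from the text is the 35% inflation cited for code, JSON, and XML.

```python
def effective_price_multiplier(token_inflation: float,
                               nominal_price_change: float = 0.0) -> float:
    """Factor by which the bill for identical text changes.

    token_inflation: fractional increase in tokens for the same text
                     (0.35 means 35% more tokens).
    nominal_price_change: fractional change in the sticker price per token
                          (0.0 means the sticker price is flat).
    Assumes cost is linear in token count.
    """
    return (1.0 + token_inflation) * (1.0 + nominal_price_change)


# Flat sticker price, 35% more tokens for the same payload.
code_multiplier = effective_price_multiplier(0.35)
print(f"Same workload now costs {code_multiplier:.2f}x the old bill")
```

With the sticker price unchanged, the whole increase comes from the token count, which is why it does not show up on a price sheet.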

Forbes’ analysis calls this shift AI’s first genuine price discovery mechanism. Flat subscriptions created the illusion that costs were low; metered pricing exposes the real costs to the finance department. Once the CFO can see the line item, “what did all this money buy?” becomes a question that must be answered.

Karp Is the Plaintiff, Not the Judge

Once the buyer-seller dynamic is clear, Karp’s remarks reveal a distinct commercial interest behind them.

On June 29, 2026, Palantir announced a partnership with NVIDIA to launch an AI operating system for sovereign security environments.

According to the BusinessWire announcement, the product lets customers run open-weight models on their own hardware while retaining ownership of the model weights. On June 30, Palantir released a nine-point Sovereign AI Manifesto, in which it declared:

“Controlling your weights is controlling your fate.”

On July 1, Karp went on live television and tore into the absurdity of token billing. Three steps in three days — product launch, narrative priming, media amplification — forming a complete commercial loop.

Palantir charges by seat and deployment contract. Revenue scales with customer count and deployment depth, not linearly with token usage. By publicly disparaging tokens, Karp directly boosts the appeal of Palantir’s self-built model product line.

That said, a plaintiff with a conflict of interest doesn’t make the case false. Stripped of marketing spin, the economic problem Karp has identified still holds.

Separate Karp’s political rhetoric from the objective facts. His allegation that closed-source labs are stealing customers’ core assets lacks public evidence so far. Elevating token pricing to the political level of a wealth tax penalty is a logical leap.

The retreats by Uber and Microsoft, moreover, happened independently of Karp. These enterprises’ budget crises confirm the same thing from the buyer’s perspective: the token billing model is severely decoupled from the actual business value enterprises receive. The plaintiff is biased, but the case stands.

The Gap Lands on Builders

Buyers want cost certainty; sellers need to pass costs on. This translation gap between value and consumption can only be bridged by product developers.

To repackage an uncontrollable naked token bill into pricing that enterprises can accept, developers are reworking the underlying architecture.

[Figure: Buyers and sellers at odds over metered billing; product developers caught in the middle absorb the translation cost through routing and caching architecture.]

On the API aggregation platform OpenRouter, the technical migration is most pronounced. According to the OpenRouter programming model leaderboard, the top four models by traffic share for programming are all open-weight: MiMo, MiniMax, Tencent Hy3, and Zhipu. The previously dominant closed-source flagship Claude Opus has shrunk to 4.7% traffic share.

The traffic shift is driven by the extreme cost-efficiency of open-weight models. Take DeepSeek V4 Flash as an example: its input token price ranges from $0.09 to $0.14 per million tokens, one thirty-sixth of the closed-source flagship Opus. The output price is as low as one eighty-ninth of Opus.

That said, good architects don’t bet everything on a single open-weight model. They’ve made AI gateways and multi-model routing layers standard infrastructure.

The router offloads simple queries, data formatting, and pre-classification tasks to cheap open-weight models. It calls the priciest closed-source flagship only when complex multi-step reasoning or deep coding is needed. Public data shows this kind of routing can cut the share of requests sent to closed-source frontier models to 26% of the total. The rest is absorbed cheaply by open-weight models.
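A minimal sketch of such a routing rule, assuming each request has already been labeled with a task type and a flag for multi-step reasoning. The model names and task categories are hypothetical placeholders, not identifiers from any real gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical tier names, not real endpoint identifiers.
CHEAP_MODEL = "open-weight-small"
FRONTIER_MODEL = "closed-frontier"

# Hypothetical labels for work the text describes as cheap to serve.
SIMPLE_TASKS = {"format", "classify", "lookup"}


def route(task_type: str, needs_multistep_reasoning: bool) -> Route:
    """Send a request to the frontier tier only when it needs it."""
    if task_type in SIMPLE_TASKS and not needs_multistep_reasoning:
        return Route(CHEAP_MODEL, "simple task")
    if needs_multistep_reasoning or task_type == "deep_coding":
        return Route(FRONTIER_MODEL, "complex reasoning or deep coding")
    return Route(CHEAP_MODEL, "default to cheap tier")


def frontier_share(requests: list[tuple[str, bool]]) -> float:
    """Fraction of requests that end up on the frontier tier."""
    if not requests:
        return 0.0
    hits = sum(route(t, r).model == FRONTIER_MODEL for t, r in requests)
    return hits / len(requests)
```

Tracking `frontier_share` over real traffic is one way a team could check whether its own routing gets anywhere near the 26% figure; the rule itself would normally be a classifier rather than a hand-written set.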

Beyond multi-model routing, a suite of token-consumption interception techniques has rapidly proliferated within the developer community.

According to the arXiv benchmark paper “Don’t Break the Cache”, front-loading prompt caching can reduce input costs for agent tasks by 41% to 80%, with the greatest impact in multi-turn conversations.

Developers can also install a semantic cache at the very front of the system. Frequent queries that are equivalent in meaning to earlier ones are intercepted locally and never sent to the cloud. This alone can cut total API spend by another 40% to 70%.
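The idea can be sketched in a few lines. This is a toy version under stated assumptions: a bag-of-words cosine similarity stands in for a real embedding model, the 0.9 threshold is arbitrary, and `call_model` is a hypothetical callable representing the paid API request.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    # Stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def get_or_call(self, query: str, call_model) -> str:
        """Return a cached answer for a near-duplicate query, else pay for a call."""
        vec = _vectorize(query)
        for cached_vec, answer in self._entries:
            if _cosine(vec, cached_vec) >= self.threshold:
                self.hits += 1
                return answer
        self.misses += 1
        answer = call_model(query)
        self._entries.append((vec, answer))
        return answer
```

The hit rate (`hits / (hits + misses)`) is what determines the saving, so any real deployment would need to measure it on its own traffic and tune the threshold against the risk of returning a stale or mismatched answer.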

In addition, developers are making extensive use of providers’ offline batch processing channels. Tolerating a delay of up to 24 hours earns a 50% discount, which cuts the budget for non-urgent batch tasks.
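How much the batch discount saves depends on how much of the workload can wait. A minimal sketch, assuming the 50% discount from the text applies uniformly to the deferrable share; the function name and the example figures are hypothetical.

```python
BATCH_DISCOUNT = 0.50  # the batch discount cited in the text


def blended_spend(baseline_spend: float,
                  deferrable_fraction: float,
                  discount: float = BATCH_DISCOUNT) -> float:
    """Spend after moving the deferrable share of work to the batch channel."""
    if not 0.0 <= deferrable_fraction <= 1.0:
        raise ValueError("deferrable_fraction must be between 0 and 1")
    realtime = baseline_spend * (1.0 - deferrable_fraction)
    batched = baseline_spend * deferrable_fraction * (1.0 - discount)
    return realtime + batched


# Hypothetical example: $10,000 a month, 40% of it tolerant of delay.
print(blended_spend(10_000, 0.4))
```

In this hypothetical case the bill falls by a fifth, not a half, because only the deferrable portion earns the discount.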

Providers themselves are also beginning to hand decision-making levers back to developers. Effort dials, per-task credit budget caps, and other explicit knobs allow product managers to directly participate in fine-tuning the balance between cost and experience.

[Figure: Enterprise engineer monthly AI cost range, and per-task actual cost compared across different models.]

The catch is that sticker prices for open-weight models are padded, and builders have to be honest about it. Zhipu’s new-generation open-weight model GLM-5.2 is a case in point. Its per-million-token sticker price is extremely low, but in actual tasks, output token count per run has grown by roughly 65% compared to the previous generation.

Open-weight models frequently trade more tokens for similar scores. From the perspective of per-task total cost, their real advantage over closed-source models is smaller than what the unit price ratio suggests. But the direction is clear — developers are using architectural means to reshape the naked token bill into pricing that enterprises can accept. OpenAI and Anthropic are also experimenting with success-based pricing, tying charges to task completion, which indirectly confirms that Karp has diagnosed a real problem.
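The gap between unit price and per-task cost can be made concrete. The sketch below is illustrative only: the function names are hypothetical, the 10x price ratio is an assumed round number, and the 65% token growth is borrowed from the GLM-5.2 example as a stand-in for how much more verbose a cheaper model might be.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one task given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def real_advantage(unit_price_ratio: float, token_growth: float) -> float:
    """How many times cheaper a model is per task once extra tokens count.

    unit_price_ratio: the sticker price advantage (10.0 means 10x cheaper per token).
    token_growth: fractional extra tokens the cheaper model emits for the same task.
    """
    return unit_price_ratio / (1.0 + token_growth)


# Assumed 10x cheaper per token, but 65% more output tokens per task.
print(f"{real_advantage(10.0, 0.65):.1f}x cheaper per task")
```

Under these assumptions a 10x sticker advantage shrinks to roughly 6x per task: still large, but noticeably smaller than the price sheet implies.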

Even with some efficiency loss, that direction is unlikely to reverse. According to Axios, one enterprise customer ran up a $500 million Claude bill in a single month because it had no spending cap in place. The report comes from a single anonymous consulting source and has not been confirmed by the vendor. Even unconfirmed, the extreme case has pushed developers to treat cost-control architecture as a top priority.
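The simplest piece of that cost-control architecture is a hard cap checked before each call. A minimal in-memory sketch, assuming the caller can estimate a request’s cost up front; the class and method names are hypothetical, and the $1,500 default mirrors the per-engineer cap reported for Uber.

```python
from collections import defaultdict


class SpendCapExceeded(Exception):
    """Raised when a charge would push a user past the monthly cap."""


class MonthlySpendGuard:
    def __init__(self, cap_usd: float = 1500.0):
        self.cap_usd = cap_usd
        # (user, "YYYY-MM") -> dollars spent so far
        self._spent = defaultdict(float)

    def charge(self, user: str, month: str, cost_usd: float) -> float:
        """Record a charge and return the remaining budget, or refuse it."""
        key = (user, month)
        if self._spent[key] + cost_usd > self.cap_usd:
            raise SpendCapExceeded(
                f"{user} would exceed ${self.cap_usd:.2f} in {month}"
            )
        self._spent[key] += cost_usd
        return self.cap_usd - self._spent[key]
```

A production version would need persistent storage and a policy for what happens on refusal, such as falling back to a cheaper model instead of stopping service.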

The Token Recedes to the Background

As a physical measure of consumption, the token won’t disappear. It will stay in the background as a meter, a unit for measuring underlying server performance.

But its role in billing will recede from the foreground. The unit that remains in the foreground for issuing enterprise bills must be tied to what the enterprise actually values: per-task pricing, per-seat monthly subscriptions, or annual budget commitments.

In the short term, what has already happened is a trust crisis in metered pricing. Tokenizer inflation and the subscription exodus triggered by GitHub Copilot’s switch to usage-based billing have brought this issue to the surface.

In the medium term, from the second half of 2026 through mid-2027, leading closed-source providers will cut prices by 30% to 50% under defection pressure. To preserve the gross margin story they need to tell Wall Street for IPOs, these cuts may be packaged as new efficient tiers rather than direct headline price drops.

In the long term, the legitimacy of using per-million-token as a front-facing billing unit will itself be shaken.

The incremental intelligence that closed-source frontier models offer over open-weight models cannot justify a 10x premium across ninety percent of routine tasks. Naked token billing is being pushed out of the foreground, back into the background, to serve as a quietly working dashboard.