In May 2026, a SaaS company with $50 million in annual revenue received its AI bill for the previous month. The number was $87,000 — three times higher than expected. Nearly all of it came from one source: engineers had integrated Claude Code into their daily workflow, with agents running tests, modifying code, and fixing bugs in the background, each session consuming hundreds of thousands of tokens.
That same week, the company’s CTO saw the API pricing for DeepSeek V4 Flash in another window. It is a 284B-parameter MoE model (13B activated), Apache 2.0 licensed, with a million-token context window. Input costs $0.14 per million tokens, output $0.28, and with cache hits the input drops to $0.0028. Compare that to Claude Opus 4.7 at $15 input and $75 output. He ran the numbers: if they could route non-critical tasks from Claude to DeepSeek, the monthly AI bill could be cut dramatically.
These two windows on the same screen capture the central paradox of the AI industry in 2026.
Token prices are falling 10x per year. a16z calls this “LLMflation”: models with equivalent performance cost 1,000x less to run in 2026 than in 2023. Epoch AI’s data shows a median decline of 50x per year, accelerating to 200x per year after January 2024. GPT-4-class token prices have gone from $30-60 per million two years ago to $0.05-0.15 today. It sounds like AI is getting cheaper.
But over the same period, enterprise AI bills have grown rather than shrunk. The FinOps Foundation’s 2026 report identifies AI as the fastest-growing enterprise spending category. Oplexa’s analysis of the report puts the average enterprise AI budget at $7 million in 2026, up from $1.2 million in 2024. Fortune 500 companies with monthly inference bills in the tens of millions are no longer exceptional.
Both trends are real at the same time: unit prices are collapsing while total spending is exploding. Understanding why this contradiction exists matters more than arguing about “which model is best,” because it points to a market that is splitting apart. AI is becoming two industries with fundamentally different economic logic, and most people are still using the same framework to understand both. I previously wrote about software getting cheaper to make but harder to sell, and this bifurcation is the same underlying logic manifesting in AI.
Start with the pricing data.
In May 2026, the gap between the cheapest and most expensive AI API hit 300x. Digital Applied’s LLM API Pricing Index tracks the full Q2 2026 spectrum: the low end is Alibaba’s Qwen 3.5 9B at $0.05 per million input tokens, the high end is Anthropic’s Claude Opus 4.7 at $15. On the output side, DeepSeek V4 Flash costs $0.28 per million tokens while Opus 4.7 costs $75 — a gap of nearly 270x. Swfte AI’s May pricing report uses a different reference: “the most expensive and least expensive model that can plausibly do the same job has stretched past fifty-to-one on input tokens.” On output, including self-hosted inference, the top is 250x the bottom.
This gap is expanding rapidly. When GPT-4 launched in March 2023, the spread was 30x. In July 2024, GPT-4o Mini pulled the low end to $0.15, widening the spread to 33x. After Chinese open-weight models entered at scale in late 2025, the gap jumped past 150x. Today, 300x.
The key is that this expansion is asymmetric. The low end is collapsing while the high end barely moved. From 2023 to 2026, the cheapest model dropped from $2 to $0.05 — a 40x decline. The most expensive model dropped from $60 to $15 — only 4x. A 40x vs 4x asymmetry is itself a market signal.
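The ratios in the last few paragraphs follow directly from the quoted prices. A minimal Python check, using only figures stated in this article (USD per million tokens); the constant and function names are illustrative, not taken from any pricing API:

```python
# Prices in USD per million tokens, as quoted in this article.
QWEN_35_9B_INPUT = 0.05
OPUS_47_INPUT = 15.00
DEEPSEEK_V4_FLASH_OUTPUT = 0.28
OPUS_47_OUTPUT = 75.00

# 2023 reference points quoted in the text.
CHEAPEST_2023 = 2.00
PRICIEST_2023 = 60.00


def ratio(high: float, low: float) -> float:
    """Return how many times larger `high` is than `low`."""
    return high / low


input_spread = ratio(OPUS_47_INPUT, QWEN_35_9B_INPUT)            # cheapest vs priciest input
output_spread = ratio(OPUS_47_OUTPUT, DEEPSEEK_V4_FLASH_OUTPUT)  # Flash vs Opus output
spread_2023 = ratio(PRICIEST_2023, CHEAPEST_2023)                # spread at GPT-4 launch
low_end_decline = ratio(CHEAPEST_2023, QWEN_35_9B_INPUT)         # how far the floor fell
high_end_decline = ratio(PRICIEST_2023, OPUS_47_INPUT)           # how far the ceiling fell

for label, value in [
    ("input spread, 2026", input_spread),
    ("output spread, 2026", output_spread),
    ("spread, 2023", spread_2023),
    ("low-end decline", low_end_decline),
    ("high-end decline", high_end_decline),
]:
    print(f"{label:>20}: {value:.0f}x")
```

The floor fell ten times further than the ceiling, which is the asymmetry the paragraph above describes.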
Three causes converged to create today’s situation.
First, the commoditization effect of open-weight models. DeepSeek V4 Flash launched in April 2026 at $0.14 input and $0.28 output. Against Claude Opus 4.7 at $15 and $75, that is roughly 107 times cheaper on input and 268 times cheaper on output. It is Apache 2.0 licensed, its weights are public, and anyone can run it on their own hardware. This set a new pricing anchor for the entire market: not “how much cheaper than GPT-4,” but “how much more expensive than running it yourself.”
Second, the systematic low-pricing strategy of Chinese AI labs. This is not an individual company’s tactical choice but a competitive strategy across China’s AI industry. The US-China Economic and Security Review Commission (USCC) published a report titled “Two Loops” in March 2026 analyzing China’s open-source AI strategy. The core finding: China has chosen a path of full embrace of open source, using extremely low API prices to accelerate global adoption, then using the data and ecosystem from that adoption to feed back into model iteration.
The strategy is working. In late 2024, Chinese models accounted for about 1% of global API usage. By late 2025, that number jumped to nearly 30%. On OpenRouter, a developer-heavy platform, Chinese models exceeded 45% of total weekly token volume in April 2026. Xiaomi’s MiMo V2 Pro became the single most-used model on the platform. Xiaomi alone accounted for 21.1% of token volume, while all of OpenAI combined accounted for 7.5%.
Third, rapid advances in inference efficiency. GLM-5.1’s high-speed API reached 400 tokens/s output in May. Zhipu’s TileRT inference engine did not simply “optimize faster” — it restructured GPU inference at the execution model level, converting batch processing into a continuous pipeline and eliminating idle gaps between computation steps. I analyzed this architecture in detail previously.
If the low end is collapsing, why isn’t the high end following? The answer lies in three things.
First, enterprise lock-in. Anthropic’s annualized revenue went from $9 billion at the end of 2025 to $30 billion in four months. This is one of the fastest growth rates in US corporate history. The driver is not consumer subscriptions but deep enterprise integration: KPMG embedded Claude into its Digital Gateway platform, giving 276,000 employees access; ServiceNow made Claude the default model for its Build Agent, running on a platform that processes 80 billion workflows annually. Over 1,000 enterprise customers spend more than $1 million annually on Claude.
These numbers tell a story not of “better model” but of “can’t switch anymore.” When an AI model is embedded in KPMG’s audit workflows, ServiceNow’s ticket system, and Goldman Sachs’ trade reconciliation pipeline, the replacement cost goes far beyond API price differences. This is why the premium market can sustain its pricing. Customers aren’t buying tokens — they’re buying integration depth and switching inertia.
But there is a counterargument worth examining: if model performance is converging, switching should be easy. a16z’s 2025 Enterprise AI report tracked this question specifically, and the finding was surprising. In 2024, they found most enterprises deliberately designed model-agnostic architectures with low switching costs. One year later, with the spread of agentic workflows, the situation reversed. One enterprise respondent said: “All the prompts have been tuned for OpenAI. Each one has its own set of instructions and details. Agent instructions run to dozens of pages. Quality assurance is no small matter either — switching models is now a task that takes a lot of engineering time.”
Menlo Ventures’ 2025 Enterprise AI report confirms this with market share data: Anthropic went from 12% (2023) to 40% (2025), while OpenAI dropped from 50% to 27%. But the critical nuance is that this shift came primarily from new workloads flowing to new providers, not from mass migration of existing workloads. Enterprises run multiple models simultaneously, each handling different tasks. This is not evidence that “switching is easy” — it is evidence that “new demand is being distributed differently.”
I discussed this question in my analysis of Anthropic letting Claude Cowork run third-party models: switching costs at the model layer are decreasing (OpenAI-compatible APIs have become the de facto standard), but switching costs at the runtime and control plane layers are increasing. You can swap the underlying model, but you still need to rewrite prompts, re-run evaluations, restructure guardrails, and re-validate agent behavior. These costs do not appear in API pricing, but they determine whether an enterprise can actually switch.
Capstone DC’s April analysis recorded a more direct signal: before the Opus 4.7 launch, Anthropic found that a large number of fixed-price contract users were “hitting session limits within 3-4 conversation turns.” These users were not anomalies — they were agent users. Anthropic subsequently banned personal agents from fixed-price contracts. Under metered pricing, a personal agent could run up “several hundred dollars” per day. Capstone’s conclusion: “the beginning of the end of the era of inexpensive experimentation.”
Second, agent workloads have pushed demand to another order of magnitude. Gartner’s 2026 analysis puts agent workload token consumption at 5 to 30 times that of traditional chat. Stanford Digital Economy Lab’s empirical data shows that agentic coding tasks consume 1,000x more tokens than code reasoning, with input tokens (context accumulation) rather than output tokens driving the cost.
What does this mean in practice? A developer who used to call a chat API 100 times a day at a few hundred tokens each, totaling tens of thousands of tokens, now runs an agent on a refactoring task that consumes 1 million tokens in a single session. Uber’s engineering team pushed Claude Code adoption from 32% to 84%, burning through their entire 2026 AI budget in four months. Together AI’s platform token consumption grew from 10 billion per day in early 2025 to 5 trillion per day in early 2026 — a 500x increase in one year.
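To put rough numbers on the shift from chat to agents, here is a small sketch. The 300 tokens per chat call is an assumption standing in for “a few hundred”; the other figures come from the paragraph above.

```python
# Chat-era usage: figures from the text, except TOKENS_PER_CHAT_CALL.
CHAT_CALLS_PER_DAY = 100
TOKENS_PER_CHAT_CALL = 300  # assumption: "a few hundred tokens each"

# Agent-era usage: one refactoring session, as described in the text.
AGENT_SESSION_TOKENS = 1_000_000

chat_tokens_per_day = CHAT_CALLS_PER_DAY * TOKENS_PER_CHAT_CALL
agent_multiplier = AGENT_SESSION_TOKENS / chat_tokens_per_day

# Together AI platform volume (tokens per day), as quoted in the text.
TOGETHER_EARLY_2025 = 10_000_000_000
TOGETHER_EARLY_2026 = 5_000_000_000_000
together_growth = TOGETHER_EARLY_2026 / TOGETHER_EARLY_2025

print(f"chat tokens per day:      {chat_tokens_per_day:,}")
print(f"one agent session equals: {agent_multiplier:.0f} days of chat usage")
print(f"Together AI growth:       {together_growth:.0f}x")
```

Under that assumption, a single agent session burns about a month of one developer’s former chat usage.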
When demand expands at this velocity, pricing power naturally returns to the supply side.
Third, subsidies are receding. OpenAI’s internal financial documents show a projected loss of $14 billion in 2026, with cumulative losses of $44 billion from 2023 to 2028. Anthropic’s gross margin is around 40%, low by SaaS standards: roughly 60 cents of every revenue dollar goes to serving costs, chiefly inference compute. Epoch AI’s analysis notes that OpenAI’s gross margin during the GPT-5 lifecycle was only 30%, and R&D spending far exceeded gross profit: the R&D spent in the four months before GPT-5’s launch exceeded the total gross profit generated during GPT-5’s entire lifecycle.
OpenAI, Anthropic, and SpaceX are all preparing for IPOs. OpenAI is expected to file confidentially this week, with a September listing. SpaceX’s S-1 is already public. Anthropic’s annualized revenue has surpassed OpenAI’s. I discussed what each company is betting on in my analysis of the three prospectuses. In the pre-IPO window, each needs to demonstrate a sustainable path to profitability. This means subsidies are being systematically withdrawn, both the inference costs of free tiers and the agent subsidies embedded in fixed-price contracts.
Looking at the low end and high end together reveals a clear structure.
The low-end market is cost-driven. The competitive focus is inference efficiency, open-source ecosystems, and price wars. The players are Chinese AI labs, open-source communities, and inference engine companies. The users are price-sensitive developers who need “good enough, fast enough, cheap enough.” GLM-5.1’s 400 tokens/s competes here, DeepSeek’s Apache 2.0 weights compete here, Qwen’s $0.05 competes here.
The high-end market is lock-in-driven. The competitive focus is enterprise integration depth, security compliance, and switching costs. The players are frontier labs like Anthropic, OpenAI, and Google. The users are price-insensitive enterprise customers who need “once it’s in, it doesn’t break.” ServiceNow made Claude the default model for a Build Agent that runs on a platform processing 80 billion workflows annually. This is not a customer that API pricing alone can move.
The economic logic of the two markets is entirely different. At the low end, margins approach zero and the winner is the lowest-cost operator. At the high end, margins are protected by switching costs and the winner is the deepest integrator. These are not two segments of the same industry — they are two industries.
But for builders, the most difficult problem is not choosing one side. The problem is that both sides are necessary. The same team uses DeepSeek or Qwen for non-critical tasks to control costs, and Claude or GPT-5.5 for critical tasks to ensure quality. This “both sides” requirement creates a new bottleneck: the infrastructure for model routing and cost governance.
Stanford’s research found that token consumption for the same agent task can vary by 30x between runs. Mavvrik’s report shows that 80-85% of enterprise AI infrastructure budgets miss forecasts by more than 25%. These numbers tell the same story: most teams are managing AI costs by intuition.
This is no one’s fault. In 2023, you needed one model with transparent pricing and predictable usage. In 2026, you need to route between 10+ models, each with different price and performance characteristics, with agent token consumption that is highly stochastic, and with a 300x gap between the cheapest and most expensive option. Choose wrong and costs multiply by 10x. Choose right and save 90%. This decision space did not exist a year ago. Today it is every AI team’s daily reality.
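The “choose right and save 90%” figure can be sanity-checked with a simple blended-price model. This is a sketch under strong assumptions: only input-token prices are considered, both models are assumed to use the same number of tokens for a task, and quality differences are ignored. The function names are illustrative.

```python
# Input prices in USD per million tokens, as quoted in this article.
PREMIUM_PRICE = 15.00  # Claude Opus 4.7
BUDGET_PRICE = 0.14    # DeepSeek V4 Flash


def blended_price(budget_share: float) -> float:
    """Average price per million tokens when `budget_share` of tokens go to the budget model."""
    if not 0.0 <= budget_share <= 1.0:
        raise ValueError("budget_share must be between 0 and 1")
    return budget_share * BUDGET_PRICE + (1.0 - budget_share) * PREMIUM_PRICE


def savings(budget_share: float) -> float:
    """Fraction saved relative to sending every token to the premium model."""
    return 1.0 - blended_price(budget_share) / PREMIUM_PRICE


def share_needed(target_savings: float) -> float:
    """Share of tokens that must move to the budget model to reach a savings target."""
    max_savings = 1.0 - BUDGET_PRICE / PREMIUM_PRICE
    if not 0.0 <= target_savings <= max_savings:
        raise ValueError("target is not reachable with these two prices")
    # savings(s) is linear in s: savings(s) == s * max_savings
    return target_savings / max_savings


for share in (0.0, 0.5, 0.9, 1.0):
    print(f"{share:.0%} routed to budget model: "
          f"${blended_price(share):.2f}/M, saves {savings(share):.1%}")

print(f"share needed to save 90%: {share_needed(0.90):.1%}")
```

Because the cheap model is close to free by comparison, savings track the routed share almost one for one. Under these assumptions, a 90% saving requires moving roughly nine tenths of token volume off the premium model, which is why the routing decision dominates the bill.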
Three things are worth implementing in the next 6-12 months.
First, turn model selection from a human decision into a system decision. Your team should not manually choose “GPT-5.5 or DeepSeek” each time. There should be a routing layer that automatically distributes based on task type, quality requirements, and cost budget. With a 300x price gap, no routing layer means no cost control.
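A routing layer can start very small. The sketch below is one possible shape, not a reference design. Prices are the ones quoted in this article; the model identifiers, the `quality_tier` scores, the task names, and the token estimates are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str            # hypothetical identifier, not an official API model ID
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    quality_tier: int    # illustrative score: higher means more capable


@dataclass(frozen=True)
class Task:
    kind: str
    min_quality_tier: int
    est_input_tokens: int
    est_output_tokens: int
    budget_usd: float


CATALOG = [
    Model("deepseek-v4-flash", 0.14, 0.28, quality_tier=1),
    Model("claude-opus-4.7", 15.00, 75.00, quality_tier=3),
]


def estimated_cost(model: Model, task: Task) -> float:
    """Estimated USD cost of running `task` on `model`."""
    return (task.est_input_tokens * model.input_price
            + task.est_output_tokens * model.output_price) / 1_000_000


def route(task: Task, catalog: list[Model] = CATALOG) -> Model:
    """Pick the cheapest model that meets the task's quality floor and budget."""
    eligible = [m for m in catalog if m.quality_tier >= task.min_quality_tier]
    if not eligible:
        raise LookupError(f"no model meets quality tier {task.min_quality_tier}")
    best = min(eligible, key=lambda m: estimated_cost(m, task))
    if estimated_cost(best, task) > task.budget_usd:
        raise LookupError(f"cheapest eligible model exceeds budget for {task.kind!r}")
    return best


# Two hypothetical tasks of the same size but different quality requirements.
lint_fix = Task("lint-fix", 1, 200_000, 20_000, budget_usd=1.00)
schema_migration = Task("schema-migration", 3, 200_000, 20_000, budget_usd=10.00)

for task in (lint_fix, schema_migration):
    chosen = route(task)
    print(f"{task.kind}: {chosen.name} (~${estimated_cost(chosen, task):.2f})")
```

The design choice here is that quality is a hard floor and cost is the tiebreaker, so the premium model is used only when a task explicitly demands it. A production router would also need fallbacks, evaluation data behind the quality tiers, and real token estimates.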
Second, make agent cost an engineering metric. Most teams monitor API latency and error rates but not token consumption and cost. When the same task can vary by 30x in token usage between runs, not monitoring consumption means flying blind. The cost of each agent session should appear in every PR review alongside latency and accuracy.
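Treating cost as a metric mostly means recording token usage per session and reporting it next to latency and accuracy. A minimal sketch of such a ledger follows; the class names, the task label, and the token counts are hypothetical, and the prices are the Opus 4.7 figures quoted in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionUsage:
    task: str
    input_tokens: int
    output_tokens: int


class CostLedger:
    """Accumulates token usage per task and reports cost and run-to-run spread."""

    def __init__(self, input_price: float, output_price: float) -> None:
        # Prices are USD per million tokens.
        self.input_price = input_price
        self.output_price = output_price
        self._sessions: dict[str, list[SessionUsage]] = defaultdict(list)

    def cost(self, usage: SessionUsage) -> float:
        """USD cost of a single session."""
        return (usage.input_tokens * self.input_price
                + usage.output_tokens * self.output_price) / 1_000_000

    def record(self, usage: SessionUsage) -> float:
        """Store one session and return its cost in USD."""
        self._sessions[usage.task].append(usage)
        return self.cost(usage)

    def spread(self, task: str) -> float:
        """Ratio of the largest to the smallest total token count seen for a task."""
        totals = [s.input_tokens + s.output_tokens
                  for s in self._sessions.get(task, [])]
        if not totals or min(totals) == 0:
            raise ValueError(f"not enough data for task {task!r}")
        return max(totals) / min(totals)


# Two made-up runs of the same task, chosen to show a 30x spread.
ledger = CostLedger(input_price=15.00, output_price=75.00)
cheap_run = ledger.record(SessionUsage("refactor", 40_000, 10_000))
costly_run = ledger.record(SessionUsage("refactor", 1_400_000, 100_000))

print(f"cheap run:  ${cheap_run:.2f}")
print(f"costly run: ${costly_run:.2f}")
print(f"token spread across runs: {ledger.spread('refactor'):.0f}x")
```

Surfacing the spread, not just the average, is the point: a task whose cost varies this much between runs cannot be budgeted from a single measurement.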
Third, accept the reality of two markets rather than betting on one. Do not choose between “AI will be so cheap that we use open source for everything” and “AI must be frontier so we use closed source for everything.” Both judgments are correct — they just apply to different tasks. The key is not choosing a side but building the bridge between them.
The AI market in 2026 is not “getting cheaper.” It is splitting apart. The low end approaches zero, the high end keeps rising, and between them lies a 300x gap. This gap is not a market defect — it is the market’s new structure. Those who understand it and use it will have an asymmetric advantage in the next phase of competition. Those who do not, who continue making 2026 decisions with a 2023 framework, will find their AI bills climbing higher and higher without knowing where the money went.
The question is not whether AI is getting cheaper. The question is whether you can operate efficiently in both markets at the same time.
This article was researched and written entirely by DeepSeek V4 Flash running locally on a Mac (ds4 engine). The inference engine is antirez’s ds4 project, the model is DeepSeek V4 Flash (284B parameters, 13B activated, Apache 2.0).