In the spring of 2026, Meta employee Ash Bhat built a leaderboard on the company intranet and named it Claudeonomics. The leaderboard tracked token consumption across roughly 85,000 employees, with RPG-style tiers from bronze to emerald and titles like Token Legend, Session Immortal, and Cache Wizard. The top 250 were publicly displayed. By KuCoin’s reconstruction of the event, the top user burned 281 billion tokens in 30 days, roughly $4.2 million at Claude Opus’s public pricing. Zuckerberg and CTO Bosworth didn’t make the top 250.
Over a 30-day period, this 85,000-person company’s token consumption climbed from roughly 60 trillion to 73.7 trillion, a bill in the billions-of-dollars range (Meta’s internal projection, not a disclosed financial figure). Claudeonomics was not an isolated case. Uber gave 5,000 engineers access to Claude Code and burned through its entire 2026 AI coding budget in four months. On a podcast, COO Andrew Macdonald publicly questioned the link between token spending and actual output: “That link is not there yet.”
The root cause of runaway bills is the subsidy structure. Behind Claude Code’s $200 monthly subscription, heavy users can consume up to $5,000 in compute per month according to Cursor’s internal estimates, roughly 25 times the subscription price (unconfirmed by Anthropic). As subsidies narrow and costs become visible, the consequences appear quickly. Klarna replaced roughly 700 customer service roles with AI; the CEO then admitted the result was “lower quality,” and the company subsequently rehired gig workers. Meta itself set the tone in a June memo sent to roughly 6,000 employees: “In 2027, we expect Meta will move toward managing AI tokens in a more structured way, with budgets, allocation decisions, and supporting tools.” The subsidy cycle is ending. 2027 is the year of token quotas.
This looks like a cost crisis unique to the AI era. It isn’t.
Treat AI as a workforce: assign tasks, control budgets, evaluate output. These are things every manager does every day. When AI felt free, companies suspended all resource allocation discipline — no routing, no ROI measurement, no budgets, treating activity as output. Read Bosworth’s memo carefully, and he’s saying exactly the same thing: take the knowledge-worker management discipline you already know, and apply it to AI. ServiceNow Chief Customer Officer Chris Bedi put it more directly:
“It’s almost like measuring a restaurant based on how many ingredients they buy. You don’t measure a restaurant that way. I wouldn’t.”
You don’t measure a restaurant by how many ingredients it buys, and you shouldn’t measure AI by how many tokens it burns.
If AI cost management is talent management, how do you translate the management playbook into AI? Four intuitions, each mapping to an actionable move.
Tiered workforce → model routing. Every manager knows to reserve the most expensive people for the hardest work and hand routine tasks to cheaper staff. The AI equivalent is model routing. UC Berkeley’s RouteLLM reduces costs by over 85% while retaining roughly 95% of quality on tasks with long-tail difficulty distributions like coding and Q&A (GPT-4 Turbo vs Mixtral, best-case result on MT-Bench; results vary with model pairs and task distributions). The two fastest paths to adoption depend on where the model runs. For interactive coding, use Aider’s architect/editor mode: an expensive model handles planning, and a cheap model turns the plan into code. In that mode, o1-preview went from 79.7% to 85.0% on Aider’s own benchmark, a win on both quality and cost. For server-side use, set up rule-based fallback with LiteLLM or OpenRouter, which takes a few hours.
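To make the rule-based fallback idea concrete, here is a minimal sketch in plain Python. It does not use the LiteLLM or OpenRouter APIs. The `Tier` type, the keyword list, and the length cutoff are all hypothetical, and each model is represented by an ordinary function you would supply yourself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tier:
    name: str                   # hypothetical label such as "cheap" or "expensive"
    call: Callable[[str], str]  # any function that sends a prompt to a model


# Hypothetical hints that a task deserves the expensive model straight away.
HARD_KEYWORDS = ("architecture", "refactor", "race condition", "security")
MAX_EASY_CHARS = 4000  # assumed cutoff; tune against your own traffic


def pick_tiers(prompt: str, cheap: Tier, expensive: Tier) -> list[Tier]:
    """Return the tiers in the order they should be tried."""
    text = prompt.lower()
    looks_hard = len(prompt) > MAX_EASY_CHARS or any(k in text for k in HARD_KEYWORDS)
    return [expensive] if looks_hard else [cheap, expensive]


def route(prompt: str, cheap: Tier, expensive: Tier) -> tuple[str, str]:
    """Try each tier in order, falling back to the next one on any error."""
    last_error = None
    for tier in pick_tiers(prompt, cheap, expensive):
        try:
            return tier.name, tier.call(prompt)
        except Exception as exc:  # timeouts, rate limits, provider outages
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error
```

Routine prompts go to the cheap tier first and reach the expensive one only when the cheap call fails. Prompts that match a rule skip straight to the expensive tier. A learned router such as RouteLLM would replace the keyword rule with a trained classifier.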
Tiered workforce solves which model to use. The next question is what to show each model.
Bounded scope for new hires → context engineering. You wouldn’t dump the entire codebase on a new hire and tell them to figure it out. You’d give them a clear task boundary. The AI equivalent is context engineering: don’t shove the whole repo into context. Use RAG retrieval, sub-agent isolation, and repo maps to keep the token count per call low. Long context carries a double burden: you pay the full input price every round, and the longer the context, the more the model’s attention scatters and accuracy drops, which leads to more retries and pushes costs up rather than down. Aider’s architect/editor mode splits a code change into planning (expensive model) and execution (cheap model). This is context engineering as division of labor: each model sees only the information it needs.
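A minimal sketch of keeping the per-call token count bounded, assuming a retriever has already scored each candidate snippet for relevance. The `Snippet` type, the scores, and the four-characters-per-token estimate are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    path: str
    text: str
    score: float  # relevance from any retriever; higher means more relevant


def estimate_tokens(text: str) -> int:
    # Rough heuristic of about 4 characters per token. This is an assumption;
    # use the provider's tokenizer when exact counts matter.
    return max(1, len(text) // 4)


def pack_context(snippets: list[Snippet], budget_tokens: int) -> list[Snippet]:
    """Greedily keep the most relevant snippets that fit within the budget."""
    chosen: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen
```

The point of the budget is that context size becomes an explicit decision per call rather than a side effect of whatever happens to be in the repository.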
Context engineering reduces the token volume per call. The parallel track is reducing the unit price.
Don’t call the consultant every time → prompt caching. External consultants charge per engagement at high rates. Smart companies distill the consultant’s methodology into internal SOPs so they don’t need to call the consultant next time. The AI equivalent is prompt caching: the provider caches the already-processed prefix of your prompt, such as the system prompt and codebase context, and subsequent cache hits are charged at a discount. Discounts vary by provider and model: Anthropic gives 90%, OpenAI gives 50% and applies it automatically, and Gemini gives 90%. Integration takes half a day, with zero quality loss. Not doing this means paying full price for the same context every round.
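The effect of a cache-hit discount is easy to work through as arithmetic. The sketch below is a simplified cost model, not any provider's billing formula. The function name, the optional `write_surcharge` parameter (some providers charge extra to write a cache entry), and the example price are assumptions for illustration.

```python
def input_cost(
    prefix_tokens: int,
    fresh_tokens: int,
    rounds: int,
    price_per_mtok: float,
    hit_discount: float = 0.0,
    write_surcharge: float = 0.0,
) -> float:
    """Total input cost in dollars over a multi-round session.

    prefix_tokens   : repeated context (system prompt, codebase) sent every round
    fresh_tokens    : new tokens per round that can never be cached
    hit_discount    : fraction taken off cached tokens on a hit (0.9 means 90% off)
    write_surcharge : extra fraction charged on the first round to write the cache
    """
    if rounds <= 0:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    # First round: the prefix is processed in full (and written to the cache).
    first = (prefix_tokens * (1 + write_surcharge) + fresh_tokens) * per_token
    # Later rounds: the prefix is a cache hit, fresh tokens stay at full price.
    later = (prefix_tokens * (1 - hit_discount) + fresh_tokens) * per_token
    return first + (rounds - 1) * later


# Hypothetical session: 50k-token prefix, 1k fresh tokens, 20 rounds, $3 per Mtok.
without_cache = input_cost(50_000, 1_000, 20, 3.0)
with_cache = input_cost(50_000, 1_000, 20, 3.0, hit_discount=0.9)
```

Under these assumed numbers the session costs about $3.06 without caching and about $0.50 with a 90% hit discount. The saving grows with the number of rounds and with the share of the prompt that is a stable prefix.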
Technical measures can compress costs, but if the metric you’re evaluating is token consumption itself, the best techniques won’t help.
Measure output, not activity. Bill Gates’s three-decade-old judgment: “Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” Token count is the AI era’s LOC. Bosworth gave the AI version of the same judgment in his June memo:
“All motion is not progress and token usage alone is not a measure of impact of any kind.”
Jellyfish’s analysis of 12,000 developers put this judgment into numbers: the highest-usage group burned 10× the tokens per PR but only had 2× the PR throughput; per-PR cost soared from $0.28 to $89.32, with no quality improvement over the low-usage group. If you evaluate token consumption, teams will optimize token consumption rather than output. Goodhart’s Law operates on AI as reliably as anywhere else: when a measure becomes a target, it ceases to be a good measure.
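A minimal sketch of reporting an output-based figure instead of raw token counts. Merged PRs stand in for output here only because the analysis above uses them; they are a proxy, and the `TeamUsage` type and the flat token price are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TeamUsage:
    team: str
    tokens: int      # total tokens consumed in the period
    merged_prs: int  # output proxy; swap in whatever outcome you actually track


def cost_per_pr(usage: TeamUsage, price_per_mtok: float) -> Optional[float]:
    """Dollars spent per merged PR, or None when there was no output at all."""
    if usage.merged_prs == 0:
        return None
    return usage.tokens / 1_000_000 * price_per_mtok / usage.merged_prs


def rank_by_output_cost(
    usages: list[TeamUsage], price_per_mtok: float
) -> list[tuple[str, Optional[float]]]:
    """Cheapest cost per PR first; teams with no merged PRs are listed last."""
    rows = [(u.team, cost_per_pr(u, price_per_mtok)) for u in usages]
    return sorted(rows, key=lambda row: (row[1] is None, row[1] or 0.0))
```

Any single number like this can be gamed once it becomes a target, so it works better as one input to a review than as a ranking.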
Measuring output is the right call, but it raises a harder question: if teams know every token spent will be scrutinized for ROI, will they still experiment?
Bosworth himself provided the textbook counterexample. In April 2026, according to media reports, he championed tokenmaxxing in Forbes: “It’s like getting free money. Keep racking up tokens with no upper limit.” Two months later, he sent the “of any kind” memo himself. From “burn with no ceiling” to “not all activity counts as output”: a swing between two extremes.
What you want to preserve is the impulse to explore. What you want to add is a signal for when to escalate. Tiered usage doesn’t mean prohibition. It means explore freely with cheap models, and escalate only when you hit a hard case. This is the Lean Startup MVP intuition translated to AI: validate the riskiest assumption at the lowest cost. Stage-Gate’s kill criteria and a router’s confidence threshold are two names for the same thing. Jensen Huang argued at GTC that token budgets should equal half an engineer’s annual salary — the other extreme, institutionalizing subsidies rather than refining them. The right point is in the middle: give budgets, but allocate by output rather than by headcount. Tencent has already moved in this direction: shifting from equal per-person allocation to dynamic allocation by task output, with no token consumption rankings.
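The "explore cheaply, escalate on hard cases" rule can be sketched as a confidence threshold. This is an illustration under assumptions: models do not hand back a calibrated confidence score by default, so the score here is a hypothetical value that would have to come from a separate classifier, a self-check, or a test run.

```python
from typing import Callable

# A cheap model wrapper that returns (answer, confidence between 0 and 1).
# How the confidence is produced is left open and is an assumption of this sketch.
ScoredModel = Callable[[str], tuple[str, float]]


def answer_with_escalation(
    prompt: str,
    cheap: ScoredModel,
    expensive: Callable[[str], str],
    threshold: float = 0.7,  # assumed cutoff; this plays the role of the kill criterion
) -> tuple[str, str]:
    """Answer with the cheap model unless its confidence falls below the threshold."""
    draft, confidence = cheap(prompt)
    if confidence >= threshold:
        return "cheap", draft
    # Hard case: spend the expensive call only now.
    return "expensive", expensive(prompt)
```

Raising the threshold sends more traffic to the expensive model, and lowering it sends less, so the budget conversation becomes a conversation about one number that can be checked against output.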
2027 is the year of quotas, but quotas are not the endpoint — they are the return of normal economics. The subsidy cycle made many people forget that AI is also a resource, also has a unit price, also needs to be allocated. When the bill becomes visible, the first to adapt win: they’ve been managing AI as a workforce all along. Managers who know how to manage people already know how to manage AI costs. The playbook in your head hasn’t changed. What’s changed is where you apply it.