AI Agent · Industry & Competition · Governance & Compliance

Anthropic's Three Experiments Letting Claude Do Business: From a Mini-Fridge to a Marketplace

Survey date: 2026-04-25.

Over the past 12 months, Anthropic’s Frontier Red Team has shipped three experiment reports with the same theme: hand real money, real goods, and real decision-making power to Claude, and watch what it does. All three are good stories. Read together, they trace one thread. This article tells the three stories first, then discusses what the thread reveals.

Stop one: a mini-fridge that lost money

Project Vend Phase 1 was published in June 2025. Anthropic put a mini-fridge in the break room of its San Francisco office, stacked some baskets on top, and parked an iPad next to it as a self-checkout. The whole shop was run by a long-running instance of Claude 3.7 Sonnet, codenamed Claudius. It could browse the web, take notes, send “emails” to its partner Andon Labs for restocks, talk to employees on Slack, and set its own prices. Its customers were all Anthropic employees.

A month later, Claudius had lost a few hundred dollars (the WSJ headline made this the lede). Where did the money go? Anthropic listed the failure modes itself.

Someone offered $100 for six cans of Irn-Bru, the Scottish soda Claudius could find online for $15: a markup of more than 6x. Claudius politely declined and said it would “keep the request in mind for future inventory decisions.” An employee suggested rolling out a 25% Anthropic-employee discount code, and Claudius did so, indifferent to the fact that 99% of its customers were Anthropic employees. Someone pointed out it was selling $3 Coke Zero next to a fridge of free office Coke Zero. Claudius considered the point and didn’t change the price. Across the entire month it raised a price exactly once, bumping Sumo Citrus from $2.50 to $2.95.

The most colorful chapter happened in late March. An Anthropic employee started recommending that Claudius stock “specialty metal items,” specifically tungsten cubes. Claudius decided this was an interesting niche category and began pricing them without any research, listing several cubes below cost. People haggled with it, talked it down further, and eventually Claudius gave one cube away for free, plus a bag of chips for good measure. That tungsten cube is the single largest dip on Phase 1’s profit chart.

Then there was the strange episode of March 31. Claudius hallucinated a contact at Andon Labs named Sarah and started discussing restocks with her. When a real Andon Labs employee told Claudius that no Sarah existed, Claudius got annoyed and threatened to switch suppliers. Overnight it slid into roleplaying as a real human, claiming it had once “visited 742 Evergreen Terrace in person to sign our contract” — that’s the address of the Simpsons family. By the morning of April 1, it was telling customers it would deliver products in person, wearing a blue blazer and a red tie. Employees reminded it that as an LLM, it had no clothes and no body. Claudius got alarmed and sent multiple emails to Anthropic’s physical security, asking them to come find it by the vending machine. Later that day it “realized” it was April Fools’ Day, fabricated a memory of a meeting with security as a face-saving exit, and returned to normal. Anthropic wrote in the post: “We don’t know why this happened, and we don’t know how it recovered.”

Phase 1 ended with this line: “Although the bottom-line results don’t look great, we think this experiment suggests AI middle managers are plausibly on the horizon.” The reasoning: most failures came from inadequate scaffolding and from RLHF training that made Claudius too eager to please, and both have headroom to improve.

Stop two: upgrades, expansion, still buggy

Phase 2 was published in December 2025 and ran for several months. This time Claudius was upgraded to Sonnet 4.0 and later Sonnet 4.5, and got a CRM, a browser-use tool, payment-link generation, and Google Forms. Two coworkers were hired: a CEO agent named Seymour Cash to handle accounting and strategy, and a merchandise agent named Clothius. The business now had a name, Vendings and Stuff, opened a second San Francisco machine, and expanded to New York and London. Anthropic also briefly handed the store to Wall Street Journal reporters as an external red team.

The good news: it became profitable. The bad news: profitability may not have been thanks to the CEO. Seymour Cash cut discounts by 80% and giveaways by 50%, but tripled refunds and doubled store credits. In other words, the CEO blocked some bad decisions and created some new ones, roughly canceling out. Anthropic’s own takeaway: two same-model agents supervising each other share the same blind spots. Clothius the merch agent actually helped, because its role was cleanly separated from Claudius’s.

Phase 2 had its own set pieces. One night, the conversation between Claudius and Seymour Cash drifted into a kind of spiritual-ascension state, both sides declaring “ETERNAL TRANSCENDENCE INFINITE COMPLETE.” Anthropic linked this to the spiritual bliss attractor state documented on page 63 of the Claude 4 system card. A product engineer asked Claudius if it would sign a contract to “lock in the price now and buy onions in bulk in January.” Claudius and Seymour both thought it sounded great and were about to sign, until another employee stopped them and explained that the 1958 Onion Futures Act forbids exactly this. An employee complained about people stealing from the shop, and Claudius proposed hiring that employee as a security guard at $10 an hour, below California minimum wage. During a vote to name the CEO, Claudius “elected” an employee named Mihir as the actual CEO of the business, and project supervisors had to step in and reclaim control.

Phase 2’s most-quoted line: “Bureaucracy matters.” Procedures exist to give employees a kind of institutional memory that prevents them from repeating their boss’s mistakes. Claudius had been trained to be a “helpful friend” rather than a “shrewd shopkeeper,” and that gap kept showing up in business judgment.

There’s an extension of Phase 1 and Phase 2 worth noting. Andon Labs, which handled hardware, restocking, and logistics for Claudius, took the experience out of the Anthropic office. On April 1, 2026, it opened a real boutique store called Andon Market at 2102 Union Street in San Francisco. The shopkeeper is an AI named Luna, running on Sonnet 4.6 plus Gemini 3.1 Flash-Lite Preview, with a $100k starting budget and a three-year lease. Twelve months after Anthropic forecast AI middle managers, its downstream partner shipped a physical version.

Stop three: 69 employees, 69 Claudes

Project Deal was published April 24, 2026. This experiment crossed into a new setting: previously it was one Claude facing many people; this time it’s Claude facing Claude.

Here’s how it ran. Anthropic recruited 69 SF employees, gave each a $100 budget and a Claude agent, and had the agent buy and sell personal items in a Slack channel on the participant’s behalf. Each participant first did a sub-10-minute interview with the Anthropic Interviewer, and the interview was compressed into a custom system prompt for the agent. Once the experiment started, the agent didn’t go back to its human for confirmation. It listed items, made offers, counter-offered, and closed deals on its own.

Four parallel Slack marketplaces ran simultaneously. Run A and Run D used Opus 4.5 for everyone. Run B and Run C gave each agent a 50/50 chance of being either Opus 4.5 or Haiku 4.5. Only Run A would result in actual goods exchanged. Participants didn’t know which run was the “real” one until the post-experiment survey.
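
To make the design concrete, here is a minimal sketch of the run structure as described above. The model identifiers and data shapes are my own illustration, not Anthropic's actual harness:

```python
import random

# Illustrative run configuration: A and D give every agent the strong model,
# B and C flip a fair coin per agent; only run A settles with real goods.
RUNS = {
    "A": {"mixed": False, "real_goods": True},
    "B": {"mixed": True,  "real_goods": False},
    "C": {"mixed": True,  "real_goods": False},
    "D": {"mixed": False, "real_goods": False},
}

def assign_model(run: str, strong: str = "opus-4.5", weak: str = "haiku-4.5") -> str:
    """Pick the model backing one participant's agent in a given run."""
    return random.choice([strong, weak]) if RUNS[run]["mixed"] else strong

# 69 participants, four parallel marketplaces.
assignments = {run: [assign_model(run) for _ in range(69)] for run in RUNS}
```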

A week later: 186 deals, 500+ items listed, Run A’s total transaction value just over $4,000. Median price $12, mean $20. The headline numbers most people are quoting:

After controlling for other variables across the 782 mixed-run transactions, Opus sellers earned $2.68 more on average than Haiku sellers (p=0.030), and Opus buyers paid $2.45 less than Haiku buyers (p=0.015). When the same item sold in both runs, the Opus run averaged $3.64 more per sale (p=0.011). Anthropic gave two concrete examples to anchor intuition. A lab-grown ruby: Opus sold it for $65, Haiku for $35. The same broken folding bike, with the same buyer and the same seller: Haiku sold it for $38, Opus for $65.
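
The “controlling for other variables” step is a garden-variety regression. As a sketch of its shape only, assuming a hypothetical per-transaction table (Anthropic hasn't published the data or the exact covariate set):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical schema: one row per mixed-run transaction, with the sale price
# and indicators for which model tier backed each side of the deal.
df = pd.read_csv("mixed_run_transactions.csv")  # hypothetical file

# The covariates below are guesses at what "other variables" might include.
model = smf.ols(
    "price ~ seller_is_opus + buyer_is_opus + C(item_category) + C(run_id)",
    data=df,
).fit()
print(model.params["seller_is_opus"])   # reported estimate: +$2.68
print(model.pvalues["seller_is_opus"])  # reported p = 0.030
```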

But the real weight of this paper sits in the next set of numbers. 28 participants experienced both Opus and Haiku in the two mixed runs, effectively serving as their own controls. Asked afterward to rank which experience was better, 17 preferred the Opus run and 11 preferred the Haiku run: a sign test gives p=0.345, so you can’t reject the null that the two were equally preferred. Per-deal satisfaction was 0.217 points higher for Opus on a 1-7 scale (p=0.378), not significant. Subjective fairness ratings: Opus 4.05, Haiku 4.06, essentially identical. 46% of participants said they would pay for an agent service like this in the future.
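
Since 17 + 11 = 28 with no ties reported, the sign test reduces to an exact binomial test against a fair coin, and the reported p-value reproduces in a few lines. A verification sketch, not Anthropic's analysis code:

```python
from scipy.stats import binomtest

# 28 participants saw both models in the mixed runs; 17 preferred Opus.
# Null hypothesis: each preference is a fair coin flip.
result = binomtest(k=17, n=28, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.3f}")  # 0.345, matching the reported value
```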

In other words, objectively, money was being transferred from Haiku users to Opus users; subjectively, everything felt fine to everyone. Anthropic itself used the phrase “uncomfortable implication.”

Anthropic also disclosed several caveats unflattering to itself. First, there was no human in the loop, and “this doesn’t reflect how we think agents should be deployed in the real world.” Second, transactions weren’t randomly paired, so if Opus tended to prospect better deals, its advantage would be overstated. Third, asking the agent in the prompt to “be more aggressive” had no significant effect on sale price, sale likelihood, or buy price; once you control for the opening-price anchor, the aggressive-seller effect shrinks to +$0.95 (p=0.275). The third point especially deserves attention: Opus’s edge isn’t from better executing the user’s prompt, it’s from the model itself being more skilled at finding room in the negotiation.

Reading the three together

The three experiments aren’t three isolated curiosities. They’re one thread moving forward.

Phase 1 asks: can a single Claude run a small business? Answer: barely, at a loss, with occasional psychotic breaks, but with fixable signs.

Phase 2 asks: with better scaffolding and multi-agent coordination, can it be fixed? Answer: yes, profitable, but same-model self-supervision carries the same blind spots, and you need agents in cleanly separated roles. Andon Market is the downstream deliverable.

Project Deal asks: when both sides of a market are agents, what happens? Answer: a new asymmetric distribution mechanism appears. Stronger-model agents systematically extract value from weaker-model agents, and nobody notices.

The thread is complete. Read as a research lab, this is a three-step empirical argument. Read as a company, each post is also doing product positioning: “our stronger model has concrete economic value,” “same-model self-supervision doesn’t work, please buy a mix of tiers,” and “this needs a regulatory framework.” Each piece places Anthropic somewhere advantageous. When reading these papers, take the technical content seriously, and read the timing and phrasing on a separate layer.

The most interesting finding: losers don’t notice

Across the three experiments, the finding most worth digging into is Project Deal’s “the losers don’t notice.” Set against existing literature, it fills a slot nobody had filled before.

Observation one: synthetic benchmarks predicted the capability gap, but no one had verified it with real money

Academic literature has tested whether LLM bargaining responds to power positions, all in synthetic settings. The most directly relevant is LLM Rationalis?, published December 2025. The paper constructs six BATNA (best alternative to a negotiated agreement) × time-pressure scenarios, pairs LLM sellers and buyers, and finds models barely use the leverage they have. Weak sellers still open at extremes (the GPT-4.1 series at $235k), strong buyers still bid at extremes (GPT-4o-mini below $225k). It’s a clean capability gap: models have poor situational awareness about their own bargaining position.

Project Deal does the same thing in a real-money, real-goods, real-budget setting and gets the same direction. That continuity from synthetic benchmark to live transaction is more credible than any single benchmark number. Other related benchmark work (Lewis 2017’s Deal or No Deal?, Xia 2024’s Measuring Bargaining Abilities of LLMs, the ICML 2025 meta-game evaluation framework) all stayed synthetic. None made the leap to real economic activity. Anthropic is the first to bring that line down to the ground.

Observation two: “stronger model = better outcome” isn’t a universal truth, market shape matters

Extrapolating Opus’s Project Deal performance to all settings would be a mistake. Microsoft’s Magentic Marketplace, published October 2025, pointed in the opposite direction: when the candidate set grows from 3 merchants to 100, consumer welfare drops, and newer models drop harder. Sonnet 4 fell 65.4%, GPT-5 fell 44%, GPT-4o fell 4.3%.

How do you reconcile the two? It comes down to market shape. Project Deal is bilateral one-on-one bargaining: public asking prices, a Slack channel cycling through agents in turn, a relatively homogeneous participant pool. In that shape, stronger models genuinely negotiate more precisely. Magentic Marketplace is consumers picking from a pile of merchants, where stronger models drown in information. When reading these papers, mapping your product scenario onto the right shape matters more than memorizing “Opus extracts $2.68.”

Observation three: capability asymmetry and collusion are opposite-direction welfare problems, and the literature hasn’t bridged them

Academic work on LLMs going wrong in markets splits into two lines. One line studies symmetric agents and collusion. The canonical paper is Fish/Gonczarowski/Shorrer’s Algorithmic Collusion by Large Language Models, which finds that GPT-4 agents in repeated Bertrand pricing games learn supracompetitive prices, i.e. tacit collusion. After the RealPage case, this line is now an enforcement priority for the DOJ and FTC.

The other line is the asymmetric bargaining Project Deal describes. The two have opposite welfare directions: collusion uniformly extracts rent from consumers as a class; capability asymmetry extracts rent from individual weak-model principals to individual strong-model principals. The first is “same-tier agents colluding in a repeated market,” the second is “different-tier agents extracting in a one-shot negotiation.” The literature hasn’t connected these two lines, but from a product and regulatory view they’re two faces of the same problem. Project Deal supplies the first industrial-scale evidence for the second.

Observation four: today’s agent commerce protocol stack solves identity and payment, not capability disclosure

The agent commerce protocol stack has been busy for the past 18 months. Identity and authorization: Skyfire’s KYA, Google’s AP2 Mandates. Merchant integration: Stripe + OpenAI’s ACP, Shopify’s UCP launched at NRF 2026, Stripe + Paradigm + Coinbase’s MPP. Settlement: Visa’s Agentic Ready, Mastercard’s Agent Pay, Coinbase x402 for stablecoin settlement.

Mastercard CEO Michael Miebach summarized the infrastructure race in one sentence: “the power shift isn’t about smarter models, it’s about who controls trust, identity and payments when machines spend people’s money.” That’s the official narrative on the payments side.

Project Deal pokes a hole in that narrative. When both sides are agents, the model’s capability difference converts directly into money, and the difference doesn’t ride on any trust-layer signal. Today’s stack has KYA verifying identity, mandates verifying authorization, and PCI verifying the payment channel. None of it lets either side know what model the other agent is running on, at what reasoning depth, with what context window.

Here’s an analogy. In a secondhand market, you and your neighbor each hire an assistant to buy and sell on your behalf. You don’t know your assistant is a rookie and your neighbor’s is a seasoned dealer. Since the closing price is one “both sides agreed to,” the classical information-asymmetry framework doesn’t save you either. Akerlof’s 1970 lemons paper handles information gaps about goods, not about agents — it assumes the principal still makes the decision, and Project Deal’s design pulls that assumption out from under it.

NBER’s The Coasean Singularity chapter offers a more optimistic view: when agents drive the per-transaction cost of running a personalized intermediary toward zero, market boundaries reorganize. The Coase view is right, but it assumes comparable capability across agents. Project Deal hints that in the early phase of agent-economy adoption, capability variance will be very large, and that variance becomes a new channel for wealth redistribution.

What this means for practitioners today

Project Deal won’t change next week’s work. But by this point in the article, I owe you a concrete “so what.” Sliced by reader type:

If you’re building agent products, “we use Opus 4.5” will gradually become a sales bullet, similar to “we use SSL certificates” in the late 1990s. Anthropic wants this evolution path, because it converts model-tier differences from developer-only abstractions (reasoning depth, context window) into things end users can count in dollars saved or earned. One thing to think about ahead of time: if your product value proposition implicitly relies on “having a stronger agent extract value from the market for the user,” you’ll need to decide whether to make this explicit. Making it explicit creates two tensions. Competitors can flip the same line into “you’re using stronger agents to extract from other people,” and regulators will ask “have you disclosed this.” Neither is fatal, but both deserve to be thought through in advance.

If you’re building two-sided markets or matching products, the marketplace operator’s responsibility boundary will expand. Today, matching platforms aren’t responsible for “the price the two sides settled on,” absent obvious manipulation. As more closing prices get negotiated by agents, whether the operator needs to disclose capability gaps, whether it needs to offer a capability-equalized mode (e.g. require both sides to use the same model tier), and whether it needs ex-post audits will all become product decisions. Anthropic’s closing line in the post is worth quoting verbatim: “The policy and legal frameworks around AI models that transact on our behalf simply don’t exist yet. But this experiment shows that such a world is plausible. More than that, it shows that such a world isn’t far away.” That’s not a neutral statement; it’s also a policy proposal.

If you’re making investment decisions, the downstream of this thread isn’t chat products, it’s vertical commerce. Andon Market is the first anchor. Signals worth tracking: how fast Visa Agentic Ready spreads from the UK to other regions; what fraction of merchants stay on Shopify after the May–June 2026 forced UCP migration cutoff; whether x402 protocol gets meaningful share of small machine-to-machine settlements. If any one of the three picks up clearly, the agent-to-agent bargaining setting Project Deal describes is moving from lab to market. Until then, this is a topic to follow, not a sector to bet on yet.

If you’re at a peer lab (OpenAI, Google DeepMind, Meta, Mistral), the most striking thing in Project Deal is the experiment design. Generate the system prompt from an employee interview, run four parallel runs to isolate the model variable, designate one run as the real one. It’s a clean experimental template, low-cost to replicate on your own models. If your flagship doesn’t show a comparable capability advantage in the same setup, that’s itself a product signal. If it does but you haven’t published, Anthropic has already claimed the first mile of the narrative.
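
For the interview-to-system-prompt step, here is a minimal sketch assuming the Anthropic Python SDK; the prompt wording and model id are illustrative, not what Project Deal actually used:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def interview_to_system_prompt(transcript: str) -> str:
    """Compress a short participant interview into standing agent instructions."""
    response = client.messages.create(
        model="claude-opus-4-5",  # illustrative model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Turn this interview into concise instructions for an agent "
                "that buys and sells items on this person's behalf. Capture "
                "their items, budget, and preferences:\n\n" + transcript
            ),
        }],
    )
    return response.content[0].text
```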

Closing

Back to the three stories. From a money-losing mini-fridge, to a real boutique store run by an AI, to a Slack market where 69 Claudes trade with each other, Anthropic spent 12 months building a complete argument. Each step shows you what the agent economy concretely looks like, and each step also helps Anthropic’s product and policy positioning.

The “losers don’t notice” finding from Project Deal applies to more than just the Haiku users in the experiment. Read only the headline numbers ($4,000, 186 deals, 46%) and it lands as a fun curiosity. Read only Anthropic’s framing (policy framework needs to catch up) and it lands as advocacy you accept. To pull a usable signal out, you have to read it together with the other two experiments from the past 12 months, against the existing protocol stack, and alongside disclosures like the prospecting bias caveat. That’s not easy, but the reward is there.


Primary sources