In mid-May, Vercel Labs released a programming language called Zero. The most
interesting part isn’t the syntax — syntax is shared between humans and
AI — it’s the compiler output. Running zero check --json
doesn’t return prose error messages. It returns structured JSON with
stable error codes and repair IDs. Each diagnostic has both a
message field (for humans) and code +
repair.id fields (for agents). Same output, two tracks.
Why would a compiler do this? You could call it a gimmick: Zero has 981 stars and two releases, sits under Vercel Labs rather than Vercel’s main line, and the HN thread is split between “brings nothing new” and “agent repair loop infrastructure.” Or you could ask a different question: when an AI agent writes code, the compiler throws an error, and the agent tries to fix it, in a loop that runs millions of times a day, what should compiler output actually look like?
The traditional answer: it should be readable by humans. Zero’s answer: it should be matchable by agents. The difference isn’t technical — it’s a change in the premise of who consumes compiler output.
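To make the two tracks concrete, here is a minimal Python sketch of how a harness might read such output. Only the field names message, code, and repair.id come from the description above. The surrounding JSON shape, the sample message text, and the helper names human_track and agent_track are assumptions for illustration, and Zero's real output may be structured differently.

```python
import json

# Hypothetical payload: the overall shape is an assumption, not Zero's documented schema.
SAMPLE_OUTPUT = """
{
  "diagnostics": [
    {
      "code": "NAM003",
      "message": "unknown name 'total'",
      "repair": {"id": "declare-missing-symbol"}
    }
  ]
}
"""


def human_track(raw: str) -> list[str]:
    """Lines a person would read: the code followed by the prose message."""
    diagnostics = json.loads(raw)["diagnostics"]
    return [f"{d['code']}: {d['message']}" for d in diagnostics]


def agent_track(raw: str) -> list[tuple[str, str]]:
    """Stable (code, repair id) pairs an agent can match on exactly."""
    diagnostics = json.loads(raw)["diagnostics"]
    return [(d["code"], d["repair"]["id"]) for d in diagnostics]


print(human_track(SAMPLE_OUTPUT))
print(agent_track(SAMPLE_OUTPUT))
```

Both functions read the same payload. The difference is which fields they treat as authoritative.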
Cloudflare ran into the same problem from a different angle. Their
API has 2,594 endpoints. If the MCP approach is to create one tool per
endpoint, the tool definitions alone would consume 244,047 tokens —
before any conversation even begins, the context window is already
blown. Their solution: compress the entire API into two tools —
search and execute — and let the agent write
JavaScript to call a typed API client, running code in isolated
sandboxes. Tokens dropped from 1,170,523 to 1,069 — a 99.9% reduction
(Cloudflare Code Mode MCP).
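The two-tool shape can be sketched in a few lines. This is not Cloudflare's implementation: the source describes agent-written JavaScript running in isolated sandboxes, while the sketch below uses Python, a four-entry catalog, and a FakeClient class, all of which are hypothetical stand-ins. The point is only that the tool surface stays at two entries however large the catalog grows.

```python
# Hypothetical endpoint catalog; a real one would be generated from an API spec.
CATALOG = {
    "zones.list": "List zones in an account",
    "zones.get": "Get one zone by ID",
    "dns.records.list": "List DNS records for a zone",
    "dns.records.create": "Create a DNS record in a zone",
}


class FakeClient:
    """Stand-in for a typed API client; records calls instead of sending requests."""

    def __init__(self):
        self.calls = []

    def call(self, operation: str, **params):
        if operation not in CATALOG:
            raise KeyError(f"unknown operation: {operation}")
        self.calls.append((operation, params))
        return {"operation": operation, "params": params}


def search(query: str, limit: int = 5) -> list[str]:
    """Tool 1: return only the catalog entries that mention a query word."""
    words = query.lower().split()
    hits = [
        f"{name}: {summary}"
        for name, summary in CATALOG.items()
        if any(w in name.lower() or w in summary.lower() for w in words)
    ]
    return hits[:limit]


def execute(code: str, client: FakeClient):
    """Tool 2: run agent-written code against the client.

    WARNING: exec() with a trimmed namespace is NOT a security boundary.
    A real deployment needs an isolated sandbox.
    """
    namespace = {"client": client, "__builtins__": {}}
    exec(code, namespace)
    # By convention in this sketch, the agent's code assigns its answer to `result`.
    return namespace.get("result")


client = FakeClient()
print(search("dns records"))
print(execute('result = client.call("dns.records.list", zone_id="z1")', client))
```

The agent first calls search to learn which operations exist, then submits a short program to execute. The model's context never holds the full catalog.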
Put these two cases side by side, and you see something in common: they’re both working with AI’s capability boundaries. Zero handles ambiguity — the same error can be worded differently across compiler versions, and agents guess wrong. The fix is giving them stable identifiers instead of natural language. Cloudflare handles choice — an agent’s accuracy drops sharply when picking from hundreds of similar tools (Anthropic’s own data: at 134K tokens of tool definitions, Opus 4 achieved only 49% accuracy). The fix is letting it write code instead of picking from a menu.
Neither of these is “exposing existing features to AI.” That’s not design — that’s thin wrapping.
Most MCP servers are thin wrappers. They map API endpoints one-to-one to MCP tools. The format is right, the content unchanged. AI can call them, but the calling context doesn’t tell it when to use which, in what order, or how to recover from errors. The guidance knowledge — the operational knowledge that should ship alongside the tools — is missing. The leverage tools — those that encapsulate error-prone tasks into deterministic operations — are also absent. I discussed this framework in another article: when the consumer shifts from human to AI, what you ship shouldn’t just be the core API. It should also include the knowledge system for guiding AI use, and tools for bypassing AI’s weaknesses. I used Stripe as an example then; looking at compilers and MCP now, the logic is the same.
Set these two kinds of design side by side, and what really distinguishes them isn’t “did you think about AI” — it’s whether you’ve taken AI’s fundamental characteristics seriously. These characteristics are completely different from human ones, yet most “AI-first” products don’t address them in their design at all.
Human engineers accumulate knowledge — you’ve worked at this company for three years, you know which modules are problematic, you know where the last refactor went wrong. AI agents don’t. Every session starts from zero. Humans can fill in context through experience, colleagues, and code review; agents can only rely on what you feed them at startup.
Zero ships its guidance knowledge with the compiler: the
zero skills get zero --full command lets the agent read a
Markdown-format operations guide directly from the compiler — syntax
rules, build processes, common pitfalls — packaged with the compiler
version, always precisely matched. The agent won’t read documentation
describing an API that doesn’t match the installed version, the way it
would with web docs. AGENTS.md follows
the same logic: a file at the repository root that injects project
background, build commands, and code conventions into every session’s
context. Matt Pocock
has cited Humanlayer’s “instruction budget” concept — frontier LLMs
can reliably follow 150-200 instructions. This means AGENTS.md can’t be
allowed to bloat; every extra rule competes for the model’s attention in
understanding the task. This is completely unlike humans reading a README:
humans can skip, scan, and selectively ignore, while agents treat every line
you write as an instruction to follow.
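One way to keep that budget visible is a small check run against the file. The sketch below is an assumption-heavy heuristic, not an established tool: it treats each bullet or numbered item as one instruction, and it uses 150, the low end of the range cited above, as the default limit. The names count_instructions and check_budget are hypothetical.

```python
import re

# Assumed budget: the low end of the 150-200 range cited above.
INSTRUCTION_BUDGET = 150

# Heuristic (an assumption): each bullet or numbered list item counts as one instruction.
INSTRUCTION_LINE = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+\S")


def count_instructions(markdown_text: str) -> int:
    """Count lines that look like individual rules in an AGENTS.md-style file."""
    return sum(1 for line in markdown_text.splitlines() if INSTRUCTION_LINE.match(line))


def check_budget(markdown_text: str, budget: int = INSTRUCTION_BUDGET) -> str:
    """Report whether the file stays within the instruction budget."""
    used = count_instructions(markdown_text)
    if used > budget:
        return f"over budget: {used} instructions, limit {budget}"
    return f"within budget: {used} of {budget} instructions"


SAMPLE = """# Project notes
- Run tests with make test
- Never edit generated files
1. Use tabs in Makefiles

Plain background paragraph, not counted.
"""

print(check_budget(SAMPLE))
```

A line count is a crude proxy, since one bullet can hide several rules, but it is enough to flag a file that has drifted far past the range.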
This limitation also explains why Salesforce Headless 360 isn’t just “adding an API” — it’s encoding business context that previously required a human to log in and navigate the UI (whether a customer has an open escalation, a renewal due in 30 days, a violated SLA) as data the agent can access while writing code. It’s not that the agent got smarter — it’s that information that previously only lived in human memory and UI navigation paths now has an interface the agent can directly consume.
Faced with a menu of hundreds of items, a human can scan and find what they need. An agent can’t. Give it 100 tools, and its accuracy on the first five is already low; by the fiftieth, it’s essentially guessing. Anthropic recommends keeping Claude Code’s core toolset at around 12. It’s not that the model isn’t good enough — it’s that tool selection as a task doesn’t match how LLMs make decisions.
Cloudflare’s response is extreme: rather than optimizing tool descriptions
to make selection easier, it eliminates selection entirely, giving the agent
search and execute and letting it write code to call the API. Agent code
generation accuracy is far higher than tool selection accuracy.
Stripe’s Agent Toolkit uses a gentler version: curated surface area. Stripe has hundreds of endpoints; exposing all of them to an agent is like asking it to blind-pick from an enormous menu. The Toolkit selects the dozen or so operations an agent most likely needs, each with precise schemas and descriptions. These tools do exactly the same thing as the traditional Stripe SDK — they all call the payment API. What changed is the design assumption of the interface layer: traditional SDKs face human developers who read documentation; the Agent Toolkit faces AI systems that discover capabilities at runtime.
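Curated surface area can be expressed as an explicit allowlist over the full operation catalog. The sketch below is a guess at the general shape, not the Agent Toolkit's actual code: the Tool dataclass, the operation names, and the allowlist contents are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    required_params: tuple[str, ...]


# Hypothetical full surface; names are illustrative, not Stripe's actual tool list.
ALL_OPERATIONS = [
    Tool("customers.create", "Create a customer record", ("email",)),
    Tool("customers.delete", "Permanently delete a customer", ("customer_id",)),
    Tool("invoices.create", "Create a draft invoice for a customer", ("customer_id",)),
    Tool("refunds.create", "Refund a charge, fully or partially", ("charge_id",)),
    Tool("webhook_endpoints.update", "Change webhook endpoint settings", ("endpoint_id",)),
]

# The curated set is an explicit allowlist chosen by a person, not inferred at runtime.
AGENT_ALLOWLIST = {"customers.create", "invoices.create", "refunds.create"}


def curated_tools(operations, allowlist):
    """Return only allowlisted operations, failing loudly on stale allowlist entries."""
    known = {op.name for op in operations}
    missing = allowlist - known
    if missing:
        raise ValueError(f"allowlist names unknown operations: {sorted(missing)}")
    return [op for op in operations if op.name in allowlist]


for tool in curated_tools(ALL_OPERATIONS, AGENT_ALLOWLIST):
    print(tool.name, tool.required_params)
```

The underlying operations are unchanged. Only the set the agent gets to see shrinks, and each remaining entry carries its own schema and description.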
This is the aggressive transparency principle discussed in Beyond DRY. Traditional API design faces human developers and centers on protective abstraction — hiding complexity, providing clean interfaces, preventing user mistakes. AI-native design is the reverse: an agent won’t be scared off by complex error messages, but it will get stuck on vague ones.
A concrete example: when a traditional API catches a low-level
network timeout, it typically throws an abstracted
APIFailureError with a line saying “Operation failed,
please try again later.” This is friendly to humans — they don’t need to
know whether it was a TCP handshake timeout or a DNS resolution failure.
But it’s fatal to an agent. An agent’s effectiveness depends on a
“try-feedback-fix” loop. Vague error messages break this loop — the
agent doesn’t know what specifically went wrong and can only flail
randomly before retrying.
The correct approach is to preserve the original
ConnectTimeoutError with the full stack trace and context.
The agent immediately sees that a specific step timed out and can retry
with backoff or switch endpoints. The information volume is excessive
for a human; it’s exactly right for an agent.
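The contrast can be sketched in Python. Both exception classes are defined locally so the example is self-contained. The fetch function simulates a timeout rather than making a network call, and the payload fields (error_type, retryable, and so on) are assumptions about what an agent-facing error might carry, not a standard.

```python
import traceback


class ConnectTimeoutError(Exception):
    """Low-level failure; defined here so the sketch is self-contained."""


class APIFailureError(Exception):
    """The abstracted error from the example above."""


def fetch(endpoint: str):
    # Simulated transport failure; no real network call is made.
    raise ConnectTimeoutError(f"connect to {endpoint} timed out after 5.0s")


def call_opaque(endpoint: str):
    """Human-friendly wrapper: hides the cause entirely."""
    try:
        return fetch(endpoint)
    except ConnectTimeoutError:
        # `from None` suppresses the original exception in the reported chain.
        raise APIFailureError("Operation failed, please try again later.") from None


def call_transparent(endpoint: str) -> dict:
    """Agent-friendly wrapper: reports the original error type, message, and trace."""
    try:
        return {"ok": True, "data": fetch(endpoint)}
    except ConnectTimeoutError as exc:
        return {
            "ok": False,
            "error_type": type(exc).__name__,
            "message": str(exc),
            "endpoint": endpoint,
            "retryable": True,  # assumption: timeouts are safe to retry with backoff
            "traceback": traceback.format_exc(),
        }


report = call_transparent("api.example.test")
print(report["error_type"], "-", report["message"])
```

With the transparent version, the agent can branch on error_type and retryable instead of guessing from a generic sentence.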
Zero’s repair IDs are the same principle applied differently. Natural
language error messages have ambiguity — the same error can be worded
differently across versions, and the same wording can be parsed into
different fixes by the agent. The NAM003 →
declare-missing-symbol mapping is stable. Certainty at
every step of the repair loop is worth more than “letting the AI
understand natural language.”
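The wording-drift problem is easy to reproduce. In the sketch below, V1 and V2 stand for two hypothetical compiler versions that report the same problem in different prose. The message strings, the REPAIRS table, and both helper functions are invented for illustration. Only the NAM003 code and the declare-missing-symbol ID come from the text above.

```python
import re

# Two hypothetical compiler versions reporting the same problem with different prose.
V1 = {"code": "NAM003", "message": "unknown name 'total'",
      "repair": {"id": "declare-missing-symbol"}}
V2 = {"code": "NAM003", "message": "cannot find symbol `total` in this scope",
      "repair": {"id": "declare-missing-symbol"}}

# A matcher written against the v1 wording.
PROSE_PATTERN = re.compile(r"unknown name '(\w+)'")

# Hypothetical repair table keyed by repair ID.
REPAIRS = {"declare-missing-symbol": "insert a declaration for the missing symbol"}


def repair_from_prose(diagnostic):
    """Brittle: depends on the exact wording of the message."""
    if PROSE_PATTERN.search(diagnostic["message"]):
        return REPAIRS["declare-missing-symbol"]
    return None


def repair_from_id(diagnostic):
    """Stable: depends only on the repair ID."""
    return REPAIRS.get(diagnostic["repair"]["id"])


for label, diag in (("v1", V1), ("v2", V2)):
    print(label, repair_from_prose(diag), "|", repair_from_id(diag))
```

The prose matcher silently stops working once the wording changes, while the ID lookup returns the same repair for both versions.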
Progress varies dramatically by layer. The platform layer is moving fastest: Salesforce, Stripe, Atlassian, and AWS are shipping agent-first products as core roadmap deliveries. The protocol layer is standardizing. The security layer is still early. Whether the compiler-layer experiments survive is unknown.
There’s one common criticism worth taking seriously. Tom Bedor, in MCP is a Fad, argues that instead of building new protocols for agents, we should make human interfaces clearer. This criticism hits a real problem — a lot of what’s done for agents is indeed just thin wrapping. But it misses another category: compiler output, CRM platforms, payment systems — products where decades of design have baked in the assumption that “there’s a human on the other end.” You can’t solve that with a better README.