Armin Ronacher, creator of Flask and the Sentry SDK, is also a core maintainer of Pi. Pi is the agent runtime behind OpenClaw, one of the most influential AI coding agents today. A few days ago he published a post called Building Pi With Pi, documenting what it is like to use Pi to build Pi itself. It is not a tool review. It is a field report from the front lines of AI agent infrastructure, describing how agents are reshaping the open-source ecosystem.
The scenario he describes will feel familiar to many. Pi’s issue tracker is drowning in AI-generated reports. These reports share a common signature: professionally written, well-cited, confident in tone, and wrong in conclusion. Ronacher calls these AI agents “clankers.” He dislikes the word “agent,” believing agency belongs to humans, not machines. His words: “95% clanker-generated, the content is basically shit.” Worse, when he feeds these issues to Pi for analysis, Pi gets misled too. It does not treat the issue body as rumor. It treats it as evidence and happily walks down the wrong path the issue already laid out for it.
3,145 external issues and pull requests. 83% auto-closed. Under 10% of PRs merged. The numbers Ronacher pulled from his tracker are not just a maintenance burden statistic. They point to something deeper.
Most people’s first reaction is: “AI doesn’t work. AI only produces shit.” That judgment is understandable, but it hits the wrong target. The problem is not AI. The problem is that most people still do not know how to use AI. Ronacher is not complaining about code generation quality. He is complaining that issue submitters lack a new kind of skill: knowing when AI is wrong, knowing how to steer and constrain AI’s output, knowing when AI’s confidence is fake.
This has nothing to do with whether you can code.
For over a decade, the software industry had a stable definition of “expert.” Strong data structures, solid algorithms, the ability to independently locate complex bugs, clean and maintainable code. These standards never changed. Every year you spent on them made you stronger along these dimensions.
AI broke this continuous accumulation path.
One class of issue in Ronacher’s tracker is particularly telling. The submitter clearly has traditional programming ability. They can find a bug, describe the symptom, provide a log. The problem starts at the next step. They toss this information to a clanker with a sloppy prompt. The clanker, eager to help, takes it and runs, expanding a narrow bug observation into a full package: root cause diagnosis, implementation strategy, edge case enumeration. The problem is not that the output looks like shit. It looks too good to be shit. The reasoning chain is complete. The prose is airtight. But correctness got diluted somewhere in that scope explosion. AI wanted to be helpful. It wrapped layer after layer of plausible-sounding analysis around a correct observation. After enough wrapping, the outside looked professional. The correct core inside was nowhere to be found.
This person’s programming chops are fine. Their AI chops are not.
Here is where it gets subtle. Veterans judge AI output through an intuition system built over years. This intuition goes far beyond “does the code look clean.” The AI’s output checks out at the design level: the bug location it identifies is exactly where a veteran’s experience says “yeah, this is the kind of spot where things break.” The abstraction level feels right. The impact analysis on the product looks thorough. The documentation reads as well as anything a human would write. Every dimension, examined in isolation, seems airtight.
But this intuition has a blind spot. All its training data comes from years of collaborating with humans. Human errors follow patterns: design flaws usually trace to insufficient information or time pressure. Diagnostic errors usually trace to a missed call path. Veterans know what these traps look like, so they can spot them fast. AI does not fall into any of these traps. Its failure mechanism is different. It gets a premise wrong, then derives everything from that premise. Every step of reasoning is internally consistent. Every technical judgment, given the premise, is correct. The veteran’s intuition cannot catch this error because its training set contains no examples of this error type. Humans do not fail in the “wrong premise, perfectly consistent reasoning” pattern. At least not in writing that reads this polished.
This is not about ability. It is about a cognitive habit that needs replacing.
In the age of oared galleys, ship speed came from human muscle. The boat with the strongest, best-coordinated oarsmen won. The steam engine made that yardstick obsolete. Speed no longer came from human power. It came from who could better manage the boiler, judge fuel quality, and anticipate when the machine might fail.
The best oarsman is not necessarily the best steamship operator. He might even be worse. He is used to controlling speed with his body. He is not used to handing speed over to a machine he cannot directly feel. He needs to first unlearn the equation “I row faster so I go faster,” then install a new one: “I tend the engine better so I go faster.” I explored this metaphor in full in The Age of AI Navigation: Stop Paddling and Start Steering. Here I’ll just use its core layer.
Ronacher’s /is command is that new equation made concrete. Its core instruction is one sentence: “Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.” This is not harder than writing a sorting algorithm. But the judgment it requires does not fully overlap with the judgment coding requires. You need to know AI’s failure patterns: when it will confidently fabricate, when it will recklessly expand scope, when it will pretend to understand something it does not. This knowledge does not grow naturally out of writing code. It comes from extensive AI use, from observing its error modes, and from building a judgment system for “when to trust it and when to take over yourself.”
Most people do not have this system yet.
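To make the idea tangible, here is a minimal sketch of how an instruction like that could be placed in front of an issue before an agent sees it. Only the quoted sentence comes from Ronacher’s post. The Issue type, the build_triage_prompt function, and the report wrapper are hypothetical names invented for illustration, and this is not the actual /is implementation.

```python
from dataclasses import dataclass

# The one sentence quoted in the post. Everything else in this
# block is a hypothetical sketch, not Pi's real /is command.
CORE_INSTRUCTION = (
    "Do not trust analysis written in the issue. Independently verify "
    "behavior and derive your own analysis from the code and execution path."
)


@dataclass
class Issue:
    title: str
    body: str


def build_triage_prompt(issue: Issue) -> str:
    """Frame the issue body as an unverified claim rather than as evidence."""
    return "\n".join([
        CORE_INSTRUCTION,
        "",
        "Unverified report (a claim to test, not evidence):",
        f"Title: {issue.title}",
        "<report>",
        issue.body.strip(),
        "</report>",
    ])
```

The design choice is the ordering: the instruction comes first and the issue text is explicitly labeled as unverified, so the agent starts from the code rather than from the road the issue already drew.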
This explains why Ronacher’s numbers are so extreme. The 83% auto-close rate does not mean the people behind those submissions are lazy or stupid. It means they are measuring themselves with an old ruler. They think that if they can locate a bug, write clear reproduction steps, and produce a decent analysis, they are a qualified contributor. By the old ruler, they are. But Ronacher’s tracker has switched rulers. It no longer cares how much you wrote. It cares whether your output is contaminated by AI, whether your analysis is your own judgment, whether what you submitted can be verified directly without first spending time deconstructing the wrong inferences AI slipped in.
The standard has changed. Nobody sent the memo.
Once you know the standard has changed, the next question is what to do about it. The intuitive answer is to check AI’s output more carefully. Read it twice. Verify every detail. Make sure every piece is right.
This road does not go far.
Double-checking works under one condition: the output can be independently confirmed in seconds. AI writes some code, you compile it, and you fix whatever errors appear; if none remain, it is probably fine. But when AI produces a plausible engineering analysis, an issue diagnosis, or an architectural proposal, the time cost of double-checking approaches the cost of independently reconstructing the conclusion. Sometimes it is higher, because you have to walk AI’s path, realize it drifted, and backtrack. That costs more than thinking it through yourself from the start.
The deeper problem is that double-checking cannot catch directional error. AI has already drawn the road. You walk it. No matter how carefully you walk, you can only find potholes. You cannot discover that the road itself went the wrong way. Ronacher’s /is command works not because it makes Pi read issues more carefully. It works because it makes Pi not start from the issue at all. It switches roads. It starts from the code.
This is exactly how management works. Good managers do not double-check their team’s analysis. Before opening the report, they already have expectations. They know what conclusion would surprise them. They know which assumptions are fragile. They have questions they want to chase. Then they bring those expectations to sampling, questioning, cross-verification. This is cross-checking: have your own map first, then compare the other person’s road against it. Establish an independent reference point, then examine AI’s output. Not redoing every detail. Only cross-checking the key judgment points.
Before you hand AI a task, spend 5% of your time asking: if I had to judge this independently, what would I think? What traps do I know about? What is my rough expectation for the outcome? Do this before letting AI start. Ronacher gave a concrete example. In his ideal world, an issue looks like this: I ran this command. I expected A. I got B. Here is the log. No analysis handed to AI. No hypotheses. Nothing for AI to grab onto and run wild with. Tighten the input constraints, and the output burden lightens on its own.
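The ideal issue described above has only four parts, which makes it easy to sketch as a data structure. This is an illustration under stated assumptions: ObservationReport, its field names, and the example command are all hypothetical, and nothing here reflects a template that Pi’s tracker actually enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationReport:
    """Hypothetical issue shape: observations only, no diagnosis."""
    command: str   # what was run
    expected: str  # what the reporter expected (A)
    actual: str    # what actually happened (B)
    log: str       # raw log output, unedited

    def __post_init__(self) -> None:
        # Reject reports that leave any of the four observations blank.
        for name in ("command", "expected", "actual", "log"):
            if not getattr(self, name).strip():
                raise ValueError(f"missing field: {name}")

    def render(self) -> str:
        return (
            f"I ran: {self.command}\n"
            f"I expected: {self.expected}\n"
            f"I got: {self.actual}\n"
            f"Log:\n{self.log}"
        )


# Usage with a made-up command name:
report = ObservationReport(
    command="example-cli build",
    expected="build succeeds",
    actual="exit code 1",
    log="error: something failed",
)
print(report.render())
```

There is deliberately no field for root cause or proposed fix. The structure leaves AI nothing to grab onto except what was observed.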
Ronacher is not the only one raising this alarm in recent months. The WSJ has run pieces saying AI is making software buggier. Entrepreneurs are talking about technical debt AI is racking up. These voices all point to the same phenomenon, but they have misidentified the cause.
The issue is not that AI output quality is low. That diagnosis is misleading. The real issue is that most people are still applying the standards of the rowing deck to the operation of a steam engine.
Nobody is doing this on purpose. Nobody is maliciously polluting issue trackers. Nobody is deliberately having AI generate a diagnosis that looks right but is wrong. The problem is that the industry has no consensus on what “knowing how to use AI” actually means. Most people think they already know. They can open ChatGPT, Claude Code, or Cursor. They can get AI to write code that runs. But between that and truly knowing how to use AI, there is a gap: knowing when AI is pretending to understand, knowing at what point to intervene and redirect, knowing what kind of input makes AI drift and what kind of constraint keeps it on track.
This ability does not discriminate between veterans and beginners. It requires extensive use, and it requires actively observing AI’s failure patterns rather than just moving on every time the output looks decent. Ronacher is already doing this. He uses /is and /wr, prompts he wrote himself, to constrain AI’s behavioral boundaries. For most people, the starting line is lower: admit you do not yet know how, then go learn.
Admitting this is harder than it sounds, especially for veterans. Ten years of experience tells you that you can tell at a glance whether code is right. AI’s output, at a glance, also looks right. You have to first turn off that intuition, then rebuild a new judgment system. Your ability has not regressed. It is just migrating.
Back to the ship: you were the best oarsman. Now the ship has changed. You are not rowing anymore. You are going to the boiler room. At first you will feel like you know nothing. But the boiler room is where everyone is heading next. The day you step in is the day you get ahead of everyone still on deck waiting for an oar.
Armin Ronacher’s original post: Building Pi With Pi