
The Claude Code Nerf: An Invisible, Unilateral Downgrade at the Runtime Layer

A new word has been making the rounds in Chinese AI circles recently: 降智, literally “intelligence reduction.” It refers to the situation where some AI tool quietly gets worse at its job, even though the model name is unchanged, the parameters are unchanged, and the subscription price is unchanged. The tasks you give it start coming back lazier. It edits files without reading them first. It loops in long sessions. It starts reaching for the simplest possible fix instead of solving the actual problem. Every user’s first reaction is the same: check the prompts, blame the workflow, blame the run of bad luck. Then a few weeks later, somebody in the community produces evidence, and the vendor eventually concedes that there was an infra bug or a default value change. English-speaking communities reach for words like “nerf” or “lobotomy” for the same phenomenon.

What happened to Claude Code between February and April of 2026 is the fifth time in eight months that Anthropic has faced this kind of accusation. The previous four times all played out the same way: users complain, the vendor stays quiet, and several weeks later there is an acknowledgment of an infrastructure bug. This time was a little different, because AMD’s AI Director Stella Laurenzo did something nobody had done before. She took 6,852 local Claude Code session files, 17,871 thinking blocks, and 234,760 tool calls and ran a reverse audit on her own machine, turning “I think it got worse” into “here is exactly when, how, and on which behavioral metrics it got worse.” The analysis pushed the Hacker News discussion to 1,147 points and 630-plus comments, The Register picked it up on April 6, and Anthropic closed the issue as completed within twenty-four hours.

This piece is not going to recite all the numbers from that analysis. The issue itself and The Register's report already cover the data in full. What I want to do is something lighter than that: use this nerf event to articulate the one intuition a builder should walk away with. The specific statistics, the mechanism names, the configuration flags will all be obsolete in a few months. The intuition will stick, and it will apply to every AI agent tool you use today and every one you will use in the future. And if you are using Claude Code right now, are already frustrated by the time you reach this paragraph, and want to stop the bleeding immediately: there are two settings you can apply today that will bring back most of the lost reasoning depth. They are at the end of the article.

The short version up front: the real cause of this nerf is not at the model layer. The weights of Claude Opus 4.6 did not change, and the standard benchmark pass rates did not noticeably move. What changed is a layer that sits between the model and you, a layer I am going to call the runtime layer. Until recently, builders almost never evaluated this layer as a separate dimension. From now on it is the layer that decides what your AI tool actually does for you on any given day.

The runtime layer is something that did not exist before

To make sense of this event, you first have to internalize one image: when you use Claude Code or any other agentic AI tool today, the model and you are separated by an opaque intermediate layer.

Using LLMs used to be straightforward. You wrote a prompt, called the API, and got a result back. There was nothing in the middle that needed your attention. The model version and a few API parameters were the entire surface area. But agentic AI tools introduced a new kind of intermediate layer: the model is no longer the thing you call directly; it is wrapped by an agent runtime. That runtime decides, on every call, how deeply the model should think, how many tokens it gets, how much context to load, whether to auto-compact, and when to retry. These decisions shape the output of your work every single day, yet they appear in no release notes and no changelog; to the user, they are effectively opaque.

Every one of the three changes Stella’s analysis surfaced lives at this layer. On February 9, Anthropic added a mechanism called adaptive thinking to Claude Opus 4.6, letting the model decide for itself how long to think on each turn. On March 3, the default reasoning effort level dropped from high to medium. On March 5, Anthropic began rolling out a UI header that defaulted to no longer returning thinking content to local transcripts. Boris Cherny, the lead on the Claude Code team, acknowledged all three things in his response under the issue, even conceding that on certain turns adaptive thinking allocated zero reasoning budget. None of these changes appeared in any release notes. There was no advance notice. For every daily user, the bottom line is one sentence: the Claude Code you use today is not the same product as the Claude Code you used six months ago.

The model layer did not change. The protocol layer did not change. Everything that changed lives at the runtime layer. But because the runtime layer is invisible to users, the way it surfaces in your experience is “the model got dumber.”

The first intuition worth fixing in your head is this. Starting today, when you make a tool selection decision, the model layer, the protocol layer, and the runtime layer should be three independent evaluation dimensions. Asking “what model does this tool use” is no longer enough. You also have to ask: who controls the runtime layer’s defaults, when is that party going to change those defaults, and how will you know when they do.

Why the runtime layer is destined to be quietly turned down

The second and deeper intuition is about why this kind of downgrade happened in the first place. The short answer is subscription economics.

Stella included a cost estimate in her report. She took her actual March usage and priced the same API requests at Anthropic's Bedrock on-demand rates. The number that came out the other end was $42,121 a month. The Claude Code subscription she actually paid for was $400 a month. That is a ratio of roughly 105 to one.

The logic behind that number is simple. The “full-speed reasoning” that power users want is genuinely too expensive in compute terms for any subscription price to cover. Anthropic is not technically incapable of giving Stella the quality she expected. They are economically incapable. Sustaining that level of reasoning for a power user under a $400 subscription means each power user is dragging down the entire subscription business. When Boris says in the issue that they will “test defaulting Teams and Enterprise users to high effort,” what you are watching is the productization of that economic conflict. Full-speed reasoning gets repackaged as a more expensive paid tier. Subscription users and enterprise users get different runtime defaults inside the same product.

This pattern is not specific to Anthropic. Cursor switching to credit-based pricing and introducing auto mode in mid-2025 was the same wall hit a different way. OpenAI uses reasoning effort tiers to handle the same pressure. Any LLM tool on a fixed-price subscription model will eventually walk into this in the power user segment. The marginal cost of the model weights is too high, and the elasticity of the subscription price is too low. The runtime layer becomes the only knob the vendor can quietly turn.

This is the second intuition worth fixing. Whatever AI agent tool you use comfortably today is, with high probability, going to be quietly turned down in ways you cannot see within six months. This is not a moral problem, it is an economic one. Building your workflow on the assumption that the tool will get quietly worse is a much more stable foundation than building it on the assumption that today’s quality is locked in. The specific form will vary. The default reasoning may drop. The context window may tighten. Retry counts may shrink. Some hidden cap may slip in. But the direction is fixed.

Between the silent change and “you’re imagining it” sits one audit log

The third intuition is about why this kind of runtime layer downgrade was able to last as long as it did before getting properly exposed.

Rewind to November 2024. Lex Fridman asked Dario Amodei a very specific question on his podcast: why do users keep saying Claude has gotten dumber? Dario’s answer at the time was that this kind of complaint is constant, and the models mostly do not change. That answer was reasonable in context, because users did not have the evidence to push back. Almost every “got dumber” complaint in history existed in one of three evidentiary forms: a vibes-based description, an isolated screenshot, or third-party benchmark scores. The first two are too soft. The third has saturation and contamination problems. Vendors could comfortably classify all of it as user perception drift or prompt quality issues.

But Anthropic itself wrote something very revealing in a postmortem from September 2025: their internal evaluations failed to catch the degradation that users were reporting. The reason is that Claude is good at recovering from individual step failures, so outcome metrics like pass rate look smooth, but in long sessions and multi-step tasks users accumulate a large number of small intermediate quality losses. This says something important. "Benchmark looks normal" and "user experience feels worse" can be simultaneously true, and that is in fact the most common failure mode of this generation of agentic AI tools. Right around the same window when Stella's issue went up, the Marginlab Claude Code daily benchmark tracker was still showing Nominal status, which is exactly that gap playing out in real time.

The reason this round finally broke differently is that Stella did not use a benchmark, and she did not use vibes. She used the audit log on her own machine. Claude Code stores the complete transcript of every session as JSONL files in ~/.claude/projects/. Every thinking block, every tool call, every prompt, every model response is right there. This is not an audit interface that Anthropic deliberately exposed. It is a product decision Claude Code made early on for the sake of debugging convenience, and the side effect is that every daily user now holds in their hands a complete behavioral record of how the AI tool is acting. What Stella did was, in essence, run a reverse statistical analysis on that local audit log, surfacing the runtime layer’s silent changes for the first time in a form that is reproducible, time-stamped, and spans thousands of sessions.
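The crude version of this kind of reverse audit fits in a short script. This is a minimal, unofficial sketch: the JSONL field names below ("message", "content", "timestamp", a block "type" of "thinking") are assumptions about the transcript schema inferred from the description above, not a documented API, so check them against your own files before trusting the output.

```python
import json
from collections import Counter
from pathlib import Path

def tally_thinking_blocks(projects_dir):
    """Count thinking blocks per day across local Claude Code transcripts.

    Assumed schema (unofficial): each JSONL line is one event, assistant
    events carry an ISO "timestamp" and a message.content list whose
    items have a "type" key, with "thinking" marking a reasoning block.
    """
    per_day = Counter()
    for path in Path(projects_dir).rglob("*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or corrupt lines
            msg = event.get("message")
            if not isinstance(msg, dict) or not isinstance(msg.get("content"), list):
                continue
            n = sum(
                1
                for block in msg["content"]
                if isinstance(block, dict) and block.get("type") == "thinking"
            )
            if n and isinstance(event.get("timestamp"), str):
                per_day[event["timestamp"][:10]] += n  # ISO date prefix
    return per_day
```

Pointed at ~/.claude/projects/ and plotted over time, a count like this is the poor man's version of what Stella's analysis did: a sudden, sustained drop around a specific date is the runtime layer's fingerprint, visible in data you already own.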

This trick does not work on tools like OpenAI ChatGPT or Cursor, which keep their session data in the cloud. Codex CLI is similar to Claude Code in that it stores sessions locally, so in principle the same approach would work there. But for now, Claude Code is the first agentic AI tool to be reverse-audited this way, and Stella is the first person to make the methodology actually run end to end.

The third intuition therefore is this. What really determines whether an AI tool can be held accountable from outside is not how transparent its changelog is, and not how responsive its support team is. It is whether the tool puts behavioral data into the user’s hands. If you are choosing AI tools, treating “does it keep a complete, readable audit log locally” as an evaluation dimension matters more than how many benchmarks it scores well on. You will not feel the value of this when the tool is working well. You will discover its value when the tool starts to quietly get worse, because that audit log is the only piece of hard evidence you can produce.

A few less important facts that are still worth knowing

With those three intuitions out of the way, the remaining details are things you can pass through quickly.

One is about Boris Cherny’s response. On the GitHub issue, he attributed the regression to the effort=85 default and suggested Stella use /effort high to dial it back. Stella replied that her team had been running effort=max for some time and it did not help. Boris later shifted to a different framing on Hacker News, conceding that adaptive thinking was allocating zero reasoning to certain turns and recommending CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 instead. At the same time, the issue was closed as completed. The combination of “concede the bug but close the issue” got summed up in a Reddit thread like this: Anthropic should spend less energy making this kind of decline harder to see, and more energy actually fixing the model.

Another is that Stella was not alone. Starting in early March, Yan Gao’s Substack had already been compiling the timeline. Independent blogs and Reddit megathreads were all accumulating evidence in the same direction. But all of it stayed at the descriptive level. Stella was the first to quantify it. There is one line from Greyhound Research’s analyst quoted in InfoWorld that says it more precisely than I can: this kind of degradation does not cause users to walk away overnight, the danger is that it is a quiet shift, slowly corroding the trust that teams placed in the system for serious multi-step work.

The last is about the historical pattern. Anthropic has been through this loop many times now. August through September 2025 produced the three-bug infrastructure postmortem affecting roughly 30% of Claude Code users. December 2025 saw five separate incidents in a single month. January 26 of 2026 was a harness rollback. Now this. Every time it has been a real bug. Every acknowledgment has trailed community complaints by weeks or months. The pattern is stable enough that the next time it happens you should already know what to do.

What you can do today

Before continuing with the more abstract judgments, let me finish the most practical part. Anthropic is unlikely to take further action on this (the issue is closed as completed, and nothing appears in the changelog), but Boris himself gave away two settings on Hacker News, one environment variable and one session command, that recover most of the lost reasoning depth.

export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

This is the important one. It disables the adaptive thinking mechanism and forces the model to use a fixed reasoning budget every turn instead of deciding on its own. The worst cases in Stella’s data were exactly the turns where adaptive thinking allocated zero reasoning tokens, and Boris confirmed this behavior on HN in so many words. Once the flag is set, the “some turns just don’t think” behavior stops happening. Multiple users on HN have confirmed this flag alone makes a noticeable difference.

The second thing is a session-level command to set the effort ceiling to max:

/effort max

This one you have to type at the start of each new Claude Code session (it does not persist). It raises the reasoning budget ceiling from medium (effort=85) back to max. The two settings are orthogonal: /effort max raises the ceiling, and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 stops the model from allocating below the ceiling at random. Doing only /effort max is not enough, because Stella’s data was collected with effort=max already set and the problem still happened. You need both to get what used to be the default behavior.
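Concretely, only the environment variable can be made permanent; a sketch of the shell config fragment, assuming a POSIX-style shell (~/.zshrc here stands in for whatever file your shell actually sources):

```shell
# In ~/.zshrc or ~/.bashrc: disable adaptive thinking for every future
# Claude Code session, so no turn gets a zero reasoning budget.
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

# /effort max is a command typed inside a Claude Code session itself;
# it is not an environment variable and cannot be set here.
```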

Both workarounds come straight from Anthropic, not some community hack. What is worth noticing, though, is what this means: the real power-user default for Claude Code is now something you have to opt into, while new users still get the quietly turned-down defaults. This is the thing worth remembering from this whole event. The runtime layer defaults are not the same as they were six months ago, and they will probably shift again. Next time something feels off, start by checking whether Anthropic has added a new flag that needs to be opted into.

Closing

Stella’s issue is closed, Boris’s explanation has been partially accepted, and the adaptive thinking bug will probably be quietly fixed in the next Claude Code release. Inside Anthropic, this whole thing will likely be metabolized as one postmortem document or one line in a release note. But for everyone who uses AI agent tools heavily, this nerf event should leave more behind than the impression that “Anthropic messed up again.”

What is worth carrying away are three judgments, in descending order of importance. First, today’s AI agent tools place a runtime layer between you and the model. This layer is opaque by nature, and it deserves its own evaluation dimension when you make tool selection decisions. Second, subscription-priced LLM tools are guaranteed to hit the wall of compute economics in the power user segment, and a silent runtime layer downgrade is the most common form that wall takes, so do not assume the tool you are comfortable with today is the same product six months from now. Third, whether a tool puts behavioral data on your local disk in a readable format is the only lever you will ever get for retroactive accountability, and that matters more than any benchmark.

None of these three intuitions is actually about Claude Code. They apply to Cursor, Codex CLI, Aider, Cline, OpenCode. They apply to the next generation of tools that have not been built yet. The biggest thing Stella’s event accomplished was not catching one nerf. It was leaving every builder with a new mental coordinate: there is now one extra layer between you and the AI model, and you should start treating it that way.