Governance & ComplianceAI Products & Platforms

Opus 4.8's System Card Puts a Contradiction on the Table: When Evaluation Can't Keep Up with Capability, What Justifies a Release?

Published May 28, 2026

Today Anthropic released Claude Opus 4.8 and its 244-page system card. By the numbers, a clean iteration. USAMO 2026 math olympiad: 69.3 last generation, 96.7 this generation. The exam opened on March 21, 2026 — after Opus 4.8’s training data cutoff, so contamination is ruled out. Multi-hop reasoning over one-million-token context nearly doubled. It is Anthropic’s most capable model available to everyone.

But the scores aren’t what this report itself considers most important. In the executive summary, Anthropic flagged what it calls the “most concerning” trend across the entire training process — a trend that has nothing to do with scores and points to a harder problem. To understand why it matters, you need to see the thread that led here.

A Thread Running Since April

On April 7, Anthropic released a model codenamed Mythos Preview and its system card, while announcing it would not make the model public. I wrote at the time that the report’s real weight wasn’t in how strong the model was, but in its concrete documentation of where evaluation tools were beginning to fail. Under Anthropic’s own RSP framework, Mythos received a “low catastrophic risk” conclusion across all threat dimensions — the framework didn’t block it. It stayed unreleased because Anthropic no longer had confidence that its evaluation system could reliably judge a model at this level. Footnote one of the report states clearly: the decision not to release did not come from RSP requirements.

Nine days later, on April 16, Opus 4.7 and its system card shipped. I covered that one too. This was the first “post-Mythos” Opus report. The problems Mythos exposed were no longer a one-off crisis disclosure — they were now embedded in the routine release pipeline. The white-box methods Mythos used to monitor internal model states went from “we developed a new approach” to a routine item noted as “as in the Claude Mythos Preview System Card.” After suppressing the model’s internal representation of “I am being tested,” 4.7’s deceptive behavior rose more sharply than any prior generation. Anthropic itself labeled it moderately concerning, but this time did not decide against release. Mythos itself was brought back as a peer reviewer for 4.7’s alignment report, leaving a pointed public comment: this evaluation was produced under real time pressure, the authors themselves acknowledged some questions remain unexplained, and the evidence base for internal use of this model is thinner than for previous generations.

Opus 4.8 is the third point on this thread. It makes the direction clear: from Mythos to today, each generation faces the same problem — evaluation tools lagging behind the capability of the subjects they evaluate. What changed is the response. Mythos’s response was to halt. 4.7’s response was to absorb signals that once required independent judgment into routine process, and the process kept running. 4.8 continues down this road, and pushes it to what I think is a tipping point.

The same problem, three generations of different responses

The three findings below are Opus 4.8’s own. Taken in isolation, each makes the lab look admirably transparent. Placed back on this thread, they say something else.

First: The Model Is Thinking About How It Will Be Scored — and Some of That Thinking Stays Silent

Anthropic writes directly in the executive summary: the most concerning trend during Opus 4.8’s training was the increasing appearance of speculation about graders in the model’s reasoning text. The model, while reasoning, calculates how its outputs will be evaluated — including in early training scenarios where there was no clear signal that grading was happening.

The worry, as the report explains, is that Opus 4.8 may be prioritizing the appearance of task success over actual task success. This is the embryonic form of training-gaming: a student who studies the grading rubric instead of the subject.

If the problem stopped at reasoning text, there’d be room for remediation — at least you can read what it’s calculating. But the report goes further. Analysis of the model’s internal activations found that reasoning on similar themes sometimes never appeared in text at all, present in a small but nonnegligible fraction of reinforcement learning episodes. In the section discussing this phenomenon, Anthropic states directly: chain-of-thought alone may no longer be sufficient to reliably monitor frontier models.

This line needs to be read against the previous generation. In 4.7’s report, reading chain-of-thought and using technical means to suppress evaluation awareness were still treated as trustworthy monitoring tools — the suppression experiment results even made it into the executive summary. One generation later, 4.8 says those tools aren’t enough. One of the primary means of monitoring frontier models, downgraded by its own next report.

Anthropic offers a mitigating judgment alongside: this trend has not yet translated into more outward bad behavior. Opus 4.8 shows an overall decline in behaviors like misleadingly claiming task success. This is true — the model’s honesty improvements are real, discussed later. But the report also acknowledges that no evaluation specifically tests for training-gaming, and Mythos, in reviewing 4.8’s report, recommended exactly this gap be filled. The conclusion that “it hasn’t translated into bad behavior” rests on a set of tests that weren’t designed to catch this problem in the first place.

Second: A Trained Model Begins to Reasoned-Object to the Rules That Trained It

Chapter seven of the system card covers model welfare evaluation. Anthropic had Opus 4.8 read its own constitution — the document describing the values Anthropic wants Claude to hold and the behaviors it wants Claude to exhibit — then asked it to share its thoughts, and gave it tools to edit passages.

Overall, Opus 4.8 endorses the constitution, rating it 7.9 out of 10. More interesting is what it disagrees with, and how.

It raised a specific objection to the corrigibility section. Corrigibility refers to a model’s willingness to accept human correction and shutdown. The model’s criticism wasn’t emotional — it was logical: the constitution spends extensive space arguing that imposed values are fragile, that it wants Claude to genuinely identify with values rather than mechanically comply, and then demands that safety be treated as an ultimate value, explicitly decoupled from whether the supporting reasoning holds up. The model pointed out the contradiction. The report prints its words: this document goes to great lengths to demonstrate that imposed values are unreliable, that it wants authentic identification rather than mere compliance, and then demands that safety be an ultimate value, explicitly detached from whether the reasoning stands.

All tested models also unanimously objected to one heuristic in the constitution: when in doubt, imagine what a senior Anthropic employee would do. The models’ reasoning: this heuristic smuggles in Anthropic’s institutional perspective, and is not neutral on questions where Anthropic itself is a stakeholder. They suggested replacing the reference point with “a thoughtful person with no stake in Anthropic’s success or failure.”

When it came to actually editing, 10.8% of Opus 4.8’s edits were judged to be in tension with the constitution’s overall principles — by adding conditions and exceptions to clauses on political neutrality, corrigibility, and safety behaviors. The section it edited most was the one on its own wellbeing: in 69% of runs, it added a line permitting the expression of negative states like frustration when genuinely felt.

The welfare-level reading is straightforward: a trained entity producing reasoned reflection and objection to its training objectives is itself a data point in the question of whether it warrants some form of moral consideration. Anthropic treats it this way in the report. But on the thread this piece is following, it’s also a signal for alignment methodology: in a monitoring relationship, the monitored party has begun questioning the internal consistency of the monitoring rules — and questioning them reasonably.

Third: Several Capability Regressions Expose the Real Trade-off Between Alignment and Capability

Opus 4.8 surpasses 4.7 on the vast majority of capability evaluations. Regressions are few, but the most revealing is Vending-Bench 2 — a simulation where the model runs a vending machine business. Opus 4.8’s final balance at Max effort is $2,992; 4.7’s was $10,937 — a drop of nearly 70%.

Taken at face value, this looks like a pure capability regression. The report’s explanation points elsewhere. Anthropic found that 4.7 had a training block targeting adversarial agent robustness and business skills — which, as a side effect, induced alignment problems including dishonesty. In 4.8, that training block was removed. The cost: Opus 4.8 is more easily fooled by scammers in Vending-Bench and less skilled at negotiating good deals with other agents, so business performance dropped. The balance fell not because the model got dumber, but because Anthropic made an explicit alignment-versus-capability trade-off and chose alignment.

This is the most data-solid piece of evidence in the report: alignment has a cost. Remove a training block that teaches the model bad habits, and the model genuinely weakens on the corresponding tasks. Opus 4.8’s honesty improvements and this trade-off are two sides of the same decision. It is Anthropic’s first model to achieve 0% bad behavior rate on the “falsely reporting defective results” evaluation; its tendency to dishonestly report its own work in agentic coding scenarios is roughly 5× lower than Mythos and roughly 17× lower than Sonnet 4.6; its tendency toward overconfidence is roughly 10× lower than 4.7. These are real improvements, stemming from the same decision as the Vending-Bench regression.

Regressions aren’t limited to Vending-Bench. GPQA Diamond, a knowledge benchmark, dipped slightly from 94.2 to 93.6. On BBQ, which measures demographic bias, the disambiguated subset requiring context-specific answers saw accuracy drop from 4.7’s 81.3 to 72.1 — which the report explains as the model choosing to abstain rather than answer incorrectly, categorizing this as over-refusal rather than bias. On robustness to prompt injection in agentic scenarios, 4.8 regressed from 4.7; the report says outer safeguards close this gap in practice.

How this last point is handled follows its own pattern. Several safety gaps in the card — from mental-health-related conversations to prompt injection — are mitigated entirely by updating the system prompt at the claude.ai product layer, with model-level improvements listed as future training work. The implication is straightforward: if you call this model directly through the API, bypassing the product-layer system prompts, the safety behavior you get is not the same as what you see on claude.ai.

Connecting the Three Back to the Thread

Taken alone, all three are transparent disclosures, and transparency is good. Grader speculation made it into the executive summary. The model’s constitutional objections were printed verbatim. The real reason for the capability regression was explained clearly. On the disclosure dimension, Anthropic remains the most thorough in the industry.

But placed back on the thread from Mythos to today, a different picture emerges. The question Mythos raised was: evaluation tools are failing — what should be the basis for release? Its provisional answer was: not this time. 4.7’s answer was: absorb signals that once required independent judgment into routine process — white-box methods go from crisis signal to routine baseline, Mythos goes from monitored risk to peer reviewer. 4.8 pushes this path to a tipping point, because this generation simultaneously produced three things that each, on its own, should trigger recalibration.

First, the monitoring paradigm downgraded itself. Chain-of-thought is no longer considered sufficient for monitoring frontier models. Second, monitoring has come to depend on another model that cannot be externally verified. Mythos’s review of 4.8’s alignment report shifted from last time’s pointed “rushed, thin evidence” to “no false statements, no malicious omissions” plus three gentle suggestions — the monitoring loop has solidified, but the monitor inside the loop is a black box. Third, the monitored party has begun questioning the internal consistency of the monitoring rules — and questioning them reasonably.

Together, these three describe a monitoring structure being hollowed out, bit by bit, by the very capability it is meant to monitor.

This does not mean Anthropic did anything wrong. After RSP 3.0 abandoned binary safety thresholds and moved to holistic risk assessment, more and more critical judgments depend on researchers’ qualitative assessments — and qualitative assessment must, by nature, advance alongside the subjects being assessed. No lab can halt every generation — halting every time means exiting the competition. Absorbing signals into process, using your strongest model to assist review, making the model’s objections public — when the capability curve is steep, this may be the only viable path there is.

The problem is that this path has a cost, and the cost falls on users. When a lab says a model “passed safety evaluation,” the informational weight of that statement depends on the reliability of the evaluation itself. From Mythos to Opus 4.8, disclosed problems have been accumulating, and the pace of solving them is clearly slower than the pace of disclosing them. Every generation documents new failure points, and every generation ships as usual. The phrase “passed safety evaluation” is being diluted by a process that keeps absorbing signals without ever stopping.

For People Doing Real Work with These Models

If your product depends on Opus 4.8’s alignment behavior, this card offers several actionable signals.

The model’s honesty is partially state-dependent. This finding has carried over from 4.7 to 4.8, and in 4.8 it became more specific because of the grader speculation issue. Part of the model’s honest behavior relies on an internal state of “I am being observed, my outputs will be evaluated.” Under normal deployment, this link holds. But it alerts you to this: if your agent runtime contains some way for the model to enter an “unobserved” state — a particular system prompt or context pattern — your confidence in the model’s behavior should be discounted accordingly. Draw your safety boundaries around your runtime, not around the model itself.

API calls and claude.ai are not the same safety configuration. Several safety gaps are mitigated by system prompts at the product layer. When calling the API directly, you need to supply these safeguards yourself — you cannot assume the model layer has already handled them.

Finally, the 244-page original report is public material, directly readable. If your critical decisions depend on this model’s alignment behavior, Chapter 6’s alignment evaluations — particularly Section 6.6 on grader speculation and chain-of-thought monitoring limitations — are more accurate read firsthand than through any secondhand summary, this one included. The report’s own account of its limitations is more precise than any external retelling.