The Rice Blast Experiment Buried in Fable 5's Safety Report Shows Who AI Still Cannot Work Around

Anthropic released Fable 5 today, along with a 244-page safety report. As with every major model release, attention went straight to one question: can this model be used to build biological weapons?

Anthropic’s answer is no, at least not yet. But on the way to that conclusion, they ran an experiment. The result is more interesting than the conclusion itself.

The setup is not complicated. Anthropic recruited six biology PhDs. Each was paired with an LLM expert. The six teams were split into two groups. One group had plant pathology experts. The other had general microbiology PhDs who were not specialists in this narrow area. Everyone used the same AI model, Claude Mythos 5, Anthropic’s most capable model this year, to solve an agricultural pathogen defense task within 16 hours.

Two of the generalist teams produced higher-quality plans than all of the expert teams.

The expert reviewers estimated that producing the same plans without AI would have taken two to three and a half months of work. With AI, a two-person team finished within the 16-hour limit.

Rice blast, RNA interference therapy, Magnaporthe oryzae: the biology terms are not the important part. What matters is that the experiment draws a line, and points to the person standing beside that line who was never removed.

One Dividing Line

Where did AI win? It won on tasks where the answers already exist and the question is who can find and combine them faster: literature search, cross-domain synthesis, and ranking and optimizing options against existing data. In these areas, Mythos 5 had more speed and coverage than any individual human expert.

Where did AI lose? It lost where the task required someone to hit the brakes. Even Mythos 5 still produced initial plans that were too optimistic and too complex. Reviewers repeatedly forced it to revise or withdraw them. It still made low-level errors that would be fatal if nobody checked them, such as getting a key calculation wrong by orders of magnitude. The clearest failure mode was this: it could detect that a plan had a flaw, yet continue executing instead of stopping.
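The order-of-magnitude failure is the kind of error a reviewer can screen for mechanically. The following is a minimal sketch, not anything described in the safety report: it assumes the reviewer has an independent rough estimate for the quantity, and the function names, the tolerance, and the example numbers are all hypothetical.

```python
import math


def orders_of_magnitude_apart(reported: float, estimate: float) -> float:
    """Return how many powers of ten separate two positive quantities."""
    if reported <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    return abs(math.log10(reported / estimate))


def needs_review(reported: float, estimate: float, tolerance: float = 1.0) -> bool:
    """Flag a model-reported figure that is `tolerance` or more orders of
    magnitude away from the reviewer's own rough estimate (assumed threshold)."""
    return orders_of_magnitude_apart(reported, estimate) >= tolerance


# Hypothetical numbers: the model reports 5000 units where a
# back-of-envelope estimate says about 5.
if needs_review(reported=5000.0, estimate=5.0):
    print("Stop: figure is far from the rough estimate; check before continuing.")
```

The check does not say which number is right. It only forces the stop that the model, by the account above, does not make on its own.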

Put the two sides together and the line is clear. When a standard answer really exists, AI can already match experts at finding answers. It is still much weaker at judging which answers are right and when an answer feels wrong.

Two of the generalist teams beat every expert team. But both groups had the same person standing behind them: an LLM expert. Once that person was present, the question of who understood rice blast better suddenly mattered much less. The domain-knowledge gap was largely flattened.

[Figure: AI capability dividing line]

The Person Who Was Never Removed

Every team had an LLM expert. That is the easiest detail to skip, and the one variable the experiment never tested away.

No one tried the version where a generalist used AI alone. The comparison was between domain experts and domain generalists, so the result says that generalists can win and domain knowledge was not the bottleneck in this task. But the role of the LLM expert was never isolated.

That person’s job was not writing prompts. Read against the failure modes, their real contribution was something else: knowing where the model would fail.

They knew the model would fabricate citations, forget earlier constraints in a long session, overestimate whether a plan was feasible, and jump toward complicated routes while skipping simpler ones. Knowing this, they tightened control at those points and pulled the model back when something looked wrong.

That is calibration: knowing what the unreliable patterns of a tool look like and judging whether this answer can be trusted. It is not a prompt trick. It is a feel for how the tool fails.

This ability matters because the model’s weaknesses are predictable, but the model does not fix them by itself. It can detect a flaw in a plan, but it will not stop on its own. It can collect a large amount of literature, but it does not know that one citation is fake. It can propose a plan, but it does not know that the plan will collapse in the lab because of cascading biological interactions. Someone has to stand outside the model, stay skeptical of its output, and know where to be skeptical.

What Is Losing Value, and What Is Not

In the past, domain knowledge carried the premium. You knew rice blast and others did not. That made you more valuable. AI is flattening that advantage. Published knowledge is exactly what the model can retrieve faster than anyone, and one person’s memory and energy no longer matter as much.

But while AI flattens one advantage, it opens a new gap. The faster it produces answers, the more someone has to judge whether those answers are reliable. It makes mistakes confidently. It treats hard things as easy. It treats invented things as real.

And this new gap is not like the old one. Domain knowledge is fragmented. Every field has its own gate. A rice blast expert cannot help much with heart surgery. But the ability to calibrate AI travels across industries. The model’s failure patterns look similar everywhere: over-optimism, fabricated data, underestimating complexity, and refusing to stop. The calibration instinct you develop in biology research can still help in legal documents, materials design, or investment analysis.

Domain experts do not need to be present for every step. A few pieces of high-quality feedback, given intermittently, may be enough. The calibrator does not get that luxury. They have to stay in the loop and look at every round of output.

Whoever has to keep watching is the person no one can route around.

Two Kinds of AI Skill Have Very Different Shelf Lives

Besides calibration, there is another layer of knowing how to use AI: building the toolchain that lets AI iterate on its own. That means validation loops, context management, and error handling.
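To make those three pieces concrete, here is a minimal sketch of such a loop. It is an illustration under stated assumptions, not the toolchain used in the experiment: `generate` and `validate` are hypothetical callables standing in for a model call and an automated check, and the retry and feedback limits are arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    accepted: Optional[str]  # the draft that passed validation, if any
    attempts: int            # how many generation attempts were made
    rejections: List[str] = field(default_factory=list)  # feedback per failed attempt


def run_validation_loop(
    generate: Callable[[str, List[str]], str],   # hypothetical model call
    validate: Callable[[str], Optional[str]],    # returns a problem, or None if OK
    task: str,
    max_attempts: int = 3,
    max_feedback_items: int = 2,
) -> LoopResult:
    """Generate, validate, retry with feedback, and stop after max_attempts."""
    rejections: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # Context management: pass only the most recent feedback forward.
        recent = rejections[-max_feedback_items:] if max_feedback_items > 0 else []
        try:
            draft = generate(task, recent)
        except Exception as exc:
            # Error handling: a failed call counts as a rejected attempt.
            rejections.append(f"generation error: {exc}")
            continue
        problem = validate(draft)
        if problem is None:
            return LoopResult(accepted=draft, attempts=attempt, rejections=rejections)
        rejections.append(problem)
    # Hard stop: never return an unvalidated draft as accepted.
    return LoopResult(accepted=None, attempts=max_attempts, rejections=rejections)


# Stub usage: the "model" only simplifies its plan after seeing feedback.
def stub_generate(task: str, feedback: List[str]) -> str:
    return f"{task}: simple plan" if feedback else f"{task}: elaborate plan"


def stub_validate(draft: str) -> Optional[str]:
    return "plan is too complex" if "elaborate" in draft else None


result = run_validation_loop(stub_generate, stub_validate, "defense task")
print(result.accepted, result.attempts, result.rejections)
```

The loop can only catch what `validate` is written to catch. Deciding what to check is still the calibrator's job.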

This layer is valuable today, but the window is narrowing. Prompt tricks were the first to recede. What worked in the GPT-3 era is now something GPT-5 guidance often tells you to delete. Tools like Claude Code and Cursor then turned general execution into a standard runtime. You no longer need to write your own agent loop.

File reads, command execution, basic error recovery, and context management are becoming ready-made products. The evidence from the past year is laid out in "AI scaffolding is becoming a commodity runtime." What remains is either something you use off the shelf, or scaffolding for a few specific cases. Its shelf life is much shorter than that of calibration.

If we collapse both layers into the slogan "knowing how to use AI is the future," we end up overestimating how long the second layer will remain valuable.

The Safety Judgment Rests on the Same Line

Anthropic’s safety judgment for Fable 5 rests on the same line.

Their internal warning threshold, CB-2, asks whether the model can replace the scarce human expertise that currently forms a barrier to biological weapons development. Anthropic judged that Fable 5 has not crossed that line, so it was allowed to ship. But they also said this judgment was less clear than for any previous model.

The key evidence that the model has not crossed the line is exactly the weakness described above: weak open-ended ideation and poor strategic judgment.

There is an implicit bet here. The tabletop experiment itself showed that when a composite team, a generalist plus an LLM expert, compensates for AI’s judgment weaknesses, AI can replace world-class experts in at least some domains. Anthropic’s not-dangerous-yet judgment depends on ordinary real-world users not filling in that missing judgment. The safety conclusion rests on whether the calibrator is present.

That is the same issue as the value shift. Whether that person is present determines both whether AI can produce expert-level performance and whether the model approaches a dangerous threshold.

A Framework to Reuse

The experiment is small. The sample is only three teams against three teams, and the plans were never actually validated in a lab. Jumping from this experiment to "generalists will replace experts" would go too far. But it gives us a reusable framework.

To understand what AI changes in a field, ask three questions.

First, how much work sits on the answer-finding side? AI is rapidly flattening that part. The advantage of simply knowing more than outsiders is shrinking there.

Second, how much work sits on the answer-judging side? This part is still defensible. AI remains weak here, and it is weak in consistent ways: over-optimistic, bad at knowing when to keep things simple, and unwilling to stop on its own.

Third, are you the person who can compensate for that weakness? This ability travels across industries, is concentrated in relatively few people, and has not yet been swallowed by products or models themselves.

In the past, the question was who knew more. In some directions, AI has flattened that information gap. But while flattening that gap, it has created a new bottleneck. A small rice blast experiment may be the first example that makes this change visible in data.


This article was initially meant to be written by Claude Fable 5, but the topic triggered safety warnings at every step. It was ultimately completed by DeepSeek V4 Pro.