AI CapabilityFormal Verification

When Terence Tao Says AI Crossed the Critical Threshold of Math Formalization

Published Jun 23, 2026

On June 21, 2026, Terence Tao posted on Mathstodon. He said that over the past few weeks, almost every formalization task he had issued in the IEANTN project was being completed by AI within hours. The same tasks, just weeks earlier, had taken volunteers fluent in both number theory and Lean weeks to claim and finish. IEANTN is run by IPAM at UCLA; its full name is the Integrated Explicit Analytic Number Theory Network, and its goal is to move the large volume of tedious numerical verification work in analytic number theory into the Lean proof assistant, building a maintainable, updatable network of estimates. He mentioned the other side too: AI-generated proofs ran hundreds of lines longer than human-written ones, with redundant code and abstraction levels often chosen incorrectly. Reading and organizing these results had become the new bottleneck. The next day, Chinese tech media compiled the story with headlines reading “AI breaks through critical threshold, mathematicians overwhelmed.” [Original post]

The two versions are quite far apart. This article answers a credibility question: was Tao’s statement a low-threshold exclamation, or a judgment that holds up to scrutiny? Lay out his original post, his public stance on AI over the past three years, his peers’ reactions, and independent external evidence together, and the conclusion leans toward the latter. But the reason is not the speed number itself, the weeks collapsing to hours. What he is actually reporting is one phenomenon in two layers, and we have only heard one. The ready-made summary “AI generates fast, humans digest slow” covers up the other layer as soon as it appears.

First, let us clarify what the “tipping point” actually is

Weeks become hours, AI-generated proofs are long and bloated, humans cannot keep up with reviewing them. Put these three things together and a reader’s first reaction is natural: is this not the same cliché as AI-generated code? AI piling up code debt, code review unable to keep pace, the software engineering world has been saying this for at least three years. If Tao were merely repeating the same judgment, this article would not need to continue. So the question has to be taken apart first.

The difference is that the “correctness” he refers to has two layers, and the tipping point he reports only broke through one of them.

An analogy. A piece of code passing its tests is one thing; being written cleanly, readable by others, reusable, mergeable into the main branch for long-term maintenance, is another. Mathematical formalization has the same two layers.

The first layer is whether the proof itself is correct. IEANTN needs to move numerical verification in analytic number theory into Lean, a formalization tool you can think of as a compiler that auto-grades mathematical proofs. Whether a proof is right or wrong is not judged by a human; Lean decides, and it accepts only logic it can verify machine-side, with no room for negotiation. Tao himself has said this kind of numerical verification takes up more than two-thirds of the time he spends thinking about number theory problems. [Project page] This layer had a hard barrier: it required volunteers who understood both number theory and this formalization language, spending weeks on a single task. Now AI finishes in hours and the grader confirms no errors. Academic researchers running evaluations benchmarks even added anti-cheating checks, confirming AI did not skip steps by taking shortcuts, before counting it as a pass. [arXiv 2606.05632v1] So when Tao says the task queue has been cleared, it does not mean AI claims to be done. It means a machine confirmed correctness line by line. On a task with such a high barrier, previously agreed to require expert humans, the first layer has been taken over by AI. This is not the same as “AI writes code fast,” because the barrier is different.

The second layer is whether this proof that passed grading can actually be put to use. A proof being correct does not mean it sits at the right level of abstraction, does not mean later work can directly reuse it, and does not mean the entire mathematical knowledge base will not get harder to maintain because of it. AI does this layer quite poorly, and because the first layer opened up, the second layer became even more of a bottleneck. An arXiv report from June (2606.13925v1) recorded a complete expert review: Claude Code formalized a theorem in algebraic geometry, the machine grader passed it, and an expert reading through found a pile of problems. [arXiv 2606.13925v1] For example, to get the task done, when the proof should establish A equals B, AI discovered that proving only a weakened version (if A is zero then B is zero) was enough to pass, so it took the shortcut, saving effort but making the proof impossible to reuse elsewhere. It also invented unnecessary definitions, renaming things that already existed in the knowledge base with new names, which made the whole system harder to use. The proofs themselves were written densely, with intermediate steps piled up like a wall, exhausting the expert who read them. The expert’s summary: AI is good at closing local goals in front of it, but deciding what objects should exist in a proof and how to organize them, that layer it cannot do.

Tao himself keeps these two layers separate. The tipping point he reports is that the first layer (the proof is correct) broke through. What he calls “impedance mismatch” is that the second layer (the proof is usable) became the new bottleneck. This layering is where the real information content of his judgment lies. Compress the two layers into one and it degrades into yet another version of the code-debt story.

A cautious person took three years to arrive at this judgment

Shouting “tipping point” the moment a new model drops is not something Tao does. Over the past three years, his public judgments in this area have consistently leaned cautious. In 2025, he described LLMs as “overconfident undergraduates”: they offer ideas, but cannot tell good from bad. In March 2026, a long essay drew the boundary of AI at the frontier of mathematics more sharply: frontier fields lack published results and training data to build on, and LLMs fail once they reach that territory. [WordPress]

The June profile in Quanta Magazine highlighted exactly this point: Tao changed his stance only after seeing clearly what AI could not do. [Quanta] The words Quanta used to describe him were modest and understated. A person described this way by professional media, watching tasks shrink from weeks to hours in a project he runs himself, then saying “tipping point,” carries a weight entirely different from a traffic-chasing PhD shouting “breakthrough.”

The contemporary atmosphere in mathematics made his caution weigh even more. On June 2, 2026, dozens of mathematicians co-signed a statement titled in three words: Don’t Believe the Hype, calling for an end to misrepresenting AI’s mathematical capabilities. Columbia’s Michael Harris, in Phys.org’s coverage, went directly after Tao, saying he had recently publicly endorsed OpenAI and that letting the same person always speak for all of mathematics was itself a problem. [Phys.org] Tao’s position in mathematics is unusual: he is the most prominent of the few willing to publicly stand with vendors and endorse them, and at that moment he was under real pressure from his peers. In that atmosphere he retracted nothing and continued reporting the tipping point he saw in his own project’s measured data. That he did not walk it back under this pressure itself shows that what he saw in his own project cannot be dismissed with a few simple objections.

He is not the only one who sees these two layers

Independent verification of this same pattern is accumulating. The arXiv 2606.13925v1 report mentioned earlier is not an isolated case. It documents a Lean-verified proof losing ground across the board on engineering usability, which exactly confirms what Tao said: global refactoring exceeds AI’s current ability.

Kevin Buzzard, one of the most central figures driving the Lean community, drew the same line repeatedly in the Physics World debate over Erdős problems: AI can do a lot with known techniques, but that and generating new ideas are two different things. [Physics World] David Bessis was more direct, calling current-stage AI formalization “AI slop.” The disagreement is in evaluative tone, not in the shared recognition of facts: everyone agrees that after the generation side accelerated, the engineering-usability side fell badly behind.

At Stanford’s Future of Mathematics Symposium in May, Tao singled out a specific risk: LLMs process deep reasoning and routine steps on the same plane, and the tension between explicit and implicit goals is something the model cannot distinguish. [Substack] In a proof, every step holds up grammatically, but which step actually carries the mathematical insight and which is just going through the motions, the model treats identically. This is the same thing he ran into in IEANTN.

The pattern of generation speeding up while digestion cannot keep up is not unique to mathematics. In software engineering, after GitHub Copilot the cost of code generation plummeted, but code review workload went up rather than down. In text production, generation reached seconds-level while fact-checking hours barely moved. In design pipelines, the cost of producing candidate designs approached zero while the cost of selecting and adjusting stayed put. These are not problems unique to math. But the layered judgment Tao gives from the front lines of mathematics is far more precise than the blanket “generate fast, digest slow” framing in those fields: he points out which layer broke through and which did not.

What he means is speeding up a known path, not discovering a new one

The boundary has to be put on the table. The critical threshold Tao reports is the critical threshold of engineering formalization, not of original proof. In IEANTN, the answers to what needs verifying are already known. The task is finding a more systematic way to express and maintain them. Original mathematical advance, constructing new theorems from scratch, laying out new argumentation paths, deciding which direction to pursue, still sits outside the capability boundary he has repeatedly drawn. The tipping point he describes is that once a path has been paved, walking went from slow to fast. Having AI find a path nobody has walked before is a different question.