
Beyond Vibe Coding: The Industrialization of AI Programming

MAI built a pipeline to mass-produce training environments from open-source PRs — 265K usable problems from 4.87M candidates. That infrastructure, not better models, is what changed AI coding.

MAI-Thinking-1 Deep Dive · Part 3 of 3 · [1] Training LLMs Is Rock Climbing, Not Rocket Science · [2] Getting Models to Think Is Easy. Sustaining It Is Hard. · [3] Beyond Vibe Coding: The Industrialization of AI Programming (this article)


Two years ago, having AI write code looked like this: open Cursor, describe what you want in natural language, AI generates some code, you copy-paste it, run it, tweak it, done. People called this “vibe coding”: you give a vibe, AI spits out code.

What about now? Open GitHub, find an issue. Paste the issue description to a coding agent. The agent clones the repo on its own, reads the code on its own, edits files on its own, runs tests on its own, and if tests fail, fixes them on its own — until every test goes green.

What changed in those two years?

Most people would say “models got better.” The numbers back that up. But this explanation misses something more interesting. The real shift in AI coding didn’t happen inside the models. It happened somewhere far less glamorous: the industrialization of training infrastructure.

This is the third article in the MAI-Thinking-1 deep-dive series. The first discussed MAI’s pre-training philosophy: the climbing machine, Efficiency Gain, and watching trends rather than points. The second discussed its RL training discipline: thermostats, circuit breakers, and self-distillation. This one covers the final piece: how the problems and reward signals used in training were built through industrial-scale processes.


Before getting into the numbers, it’s worth understanding why this system matters. The core of agentic training is a feedback loop: a model takes actions in an environment (writing code, running tests, calling tools), the environment returns a reward signal (right or wrong, good or bad), the model adjusts itself based on that signal, then enters the next round. The environment is the teacher; the reward is the teacher’s feedback. More teachers, more precise feedback — the model learns faster.
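The loop described above can be sketched in a few lines. This is a toy illustration, not MAI's training code: `ToyEnvironment`, `ToyPolicy`, the action names, and the running-average update rule are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical environment: only one action makes the tests pass."""
    passing_action: str

    def step(self, action: str) -> float:
        # Reward signal: 1.0 if the tests would go green, else 0.0.
        return 1.0 if action == self.passing_action else 0.0


@dataclass
class ToyPolicy:
    """Hypothetical policy that keeps a running value estimate per action."""
    actions: list[str]
    learning_rate: float = 0.5
    values: dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.values = {action: 0.0 for action in self.actions}

    def act(self, round_index: int) -> str:
        # Try every action once, then exploit the best estimate so far.
        if round_index < len(self.actions):
            return self.actions[round_index]
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action: str, reward: float) -> None:
        # Move the estimate toward the observed reward.
        self.values[action] += self.learning_rate * (reward - self.values[action])


def train(policy: ToyPolicy, env: ToyEnvironment, rounds: int) -> ToyPolicy:
    for round_index in range(rounds):
        action = policy.act(round_index)  # the model acts
        reward = env.step(action)         # the environment grades
        policy.update(action, reward)     # the model adjusts
    return policy


policy = train(
    ToyPolicy(actions=["edit_unrelated_file", "add_debug_print", "patch_the_bug"]),
    ToyEnvironment(passing_action="patch_the_bug"),
    rounds=10,
)
best_action = max(policy.values, key=policy.values.get)
print(best_action, policy.values)
```

Real agentic training replaces the lookup table with a language model and the one-line `step` with a sandboxed repository, but the act, grade, adjust cycle has the same shape.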

What MAI did was turn “producing teachers” and “standardizing feedback” from manual labor into an assembly line. Let’s first look at what one of these environments looks like (think of it as a self-grading exam paper), then at how they were mass-produced.


What a Self-Grading Exam Looks Like

Before understanding what MAI did, you need a reference point.

Each problem in SWE-bench is essentially three layers nested together. The innermost layer is a Docker container holding a codebase with a bug. The middle layer is a test suite — fix the bug and these tests all pass. The outermost layer is the grading logic: tests all pass, you pass.

That’s a standardized exam: a question (the bug), a workspace (the codebase), a correct answer (all tests green), and a grading rule (run the tests and check). You drop a model into this exam and let it figure things out.
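As a data structure, the exam described above might look like the following minimal sketch. The `Exam` class, its field names, and the image name are hypothetical; SWE-bench's actual schema is not reproduced here, and the sketch takes test results as input rather than running a container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exam:
    issue_text: str                  # the question: a description of the bug
    container_image: str             # the workspace: an image holding the buggy codebase
    required_tests: tuple[str, ...]  # the correct answer: tests that must go green


def grade(exam: Exam, test_results: dict[str, bool]) -> bool:
    """Grading rule: pass only if every required test ran and passed."""
    return all(test_results.get(name, False) for name in exam.required_tests)


exam = Exam(
    issue_text="Parser crashes on empty input",
    container_image="example/buggy-parser:latest",  # hypothetical image name
    required_tests=("test_empty_input", "test_regular_input"),
)

all_green = grade(exam, {"test_empty_input": True, "test_regular_input": True})
one_red = grade(exam, {"test_empty_input": False, "test_regular_input": True})
print(all_green, one_red)
```

A test that never ran counts as a failure in this sketch, so a model cannot pass by breaking the test harness.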

SWE-bench itself is for evaluation, and it only has a few thousand problems. But RL training needs something structurally identical: an executable environment plus a grading logic. The only difference is scale. A few thousand problems work for evaluation. Training needs hundreds of thousands. What MAI did was industrially produce these training problems from GitHub.

So if you want to use these exams to train a coding agent at scale, how many do you need? Not ten. Not a hundred. Hundreds of thousands. Where do they come from?


How 265,000 Problems Were Refined

MAI’s answer: from GitHub.

The numbers go like this: start by pulling 4.87 million merged PRs from GitHub, each paired with its issue description. In theory, every PR is a potential exam. The issue is the question, the code diff is the answer key, and the passing tests are the grading criteria.

But “potential” and “actually usable” are separated by three cuts.

The first cut: can it even run? Can this PR be loaded into a Docker sandbox with the right dependencies and toolchain, and actually execute? MAI used an automated pipeline to test this. Out of 4.87 million, 2.08 million survived this cut. Nearly 60% gone.

The second cut: can it be graded? The environment runs, but that doesn’t mean you can give the model a clear grading signal. Did the tests pass or not? This judgment has to strictly correspond to what the original PR was trying to do — you can’t have the grading fail because some unrelated component in the environment broke. Out of 2.08 million, 745,000 passed this step. Another 60%+ gone.

The third cut: can it run stably in a training environment? The training sandbox is far more demanding than the validation environment. Network is cut off. Git history is wiped. Commit hashes, PR numbers, discussion links — all scrubbed. All non-deterministic behavior is eliminated. The same exam, run at different times, must produce the same result. Out of 745,000, the final number that passed all three checkpoints was 265,000.
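The "same exam, same result" requirement can be checked by simply rerunning the tests and comparing outcomes. The sketch below assumes a hypothetical `run_tests` callable standing in for a sandboxed test run; the repeat count and the flaky example are illustrative, not MAI's actual procedure.

```python
from typing import Callable

Outcome = dict[str, bool]  # test name -> passed?


def is_deterministic(run_tests: Callable[[], Outcome], repeats: int = 5) -> bool:
    """Return True only if every repeated run yields the same outcome."""
    first = run_tests()
    return all(run_tests() == first for _ in range(repeats - 1))


def stable_run() -> Outcome:
    # Stand-in for a sandboxed run whose result never changes.
    return {"check_parse": True, "check_render": True}


def make_flaky_run() -> Callable[[], Outcome]:
    calls = {"count": 0}

    def flaky_run() -> Outcome:
        # Passes on odd-numbered calls and fails on even-numbered ones.
        calls["count"] += 1
        return {"check_parse": True, "check_render": calls["count"] % 2 == 1}

    return flaky_run


stable_ok = is_deterministic(stable_run)
flaky_ok = is_deterministic(make_flaky_run())
print(stable_ok, flaky_ok)
```

A problem whose outcome flips between runs would feed the model contradictory rewards for identical behavior, which is why it is dropped rather than repaired.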

4.87 million in, 265,000 out: a yield of about 5.4%. For every usable exam paper that emerges, roughly seventeen are thrown away.

That 5.4% yield is the cost of industrialization. Environments don’t fall from the sky. They’re industrial products, filtered into existence.
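The funnel arithmetic can be recomputed directly from the four stage counts quoted above. Only the counts come from the article; the stage labels and variable names are shorthand introduced here.

```python
funnel = [
    ("merged PRs pulled from GitHub", 4_870_000),
    ("runs in a Docker sandbox", 2_080_000),
    ("can be graded", 745_000),
    ("stable in the training sandbox", 265_000),
]

# Survival rate of each cut, relative to the stage before it.
stage_survival = [
    (name, count / previous_count)
    for (_, previous_count), (name, count) in zip(funnel, funnel[1:])
]

start, final = funnel[0][1], funnel[-1][1]
overall_yield = final / start
discarded_per_kept = (start - final) / final

for name, rate in stage_survival:
    print(f"{name}: {rate:.1%} survive, {1 - rate:.1%} cut")
print(f"overall yield: {overall_yield:.1%}")
print(f"discarded per usable problem: {discarded_per_kept:.1f}")
```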


Why the Exam Hall Has No Internet

The third cut hides a crucial issue. These exam papers all come from public PRs on GitHub. If a model can access the internet during training, it can search for the original PR. No reasoning required — just search, copy the answer.

MAI’s response went beyond cutting the network. They scrubbed every clue in the sandbox that could reveal the answer. Commit hashes, gone. PR numbers, gone. References to the original repo, gone. Discussion links, gone. What’s left is a clean problem description plus a test suite, with no trail of breadcrumbs.
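A minimal sketch of what scrubbing text might involve, assuming regular-expression rules. The patterns and placeholder strings are illustrative assumptions, not MAI's actual rules, and a real pipeline would also have to clean git metadata and files on disk, not just strings.

```python
import re

# Order matters: URLs go first because they can contain
# PR numbers and commit hashes of their own.
SCRUB_RULES = [
    (re.compile(r"https?://github\.com/\S+"), "[link removed]"),
    # Full 40-character SHA-1 hashes only; abbreviated hashes are not caught.
    (re.compile(r"\b[0-9a-f]{40}\b"), "[commit removed]"),
    # Issue and PR references such as "#123".
    (re.compile(r"(?<!\w)#\d+\b"), "[reference removed]"),
]


def scrub(text: str) -> str:
    """Replace identifying clues with neutral placeholders."""
    for pattern, placeholder in SCRUB_RULES:
        text = pattern.sub(placeholder, text)
    return text


raw = (
    "Fixes #4821, see https://github.com/example/repo/pull/4821 "
    "(commit " + "a" * 40 + ")"
)
clean = scrub(raw)
print(clean)
```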

This reveals something subtle but important. Environment design and reward signal design aren’t two separate steps. They’re the same task, bound together by the shared goal of preventing cheating. In an industrial training pipeline, writing the exam and preventing cheating are two sides of the same coin.


265,000 Problems. Each One Needs a Grade.

Building the environment is only half of industrialization. The other half is grading: after the model has struggled inside the sandbox, what tells it whether it got things right or wrong?

For SWE-type tasks, the answer is unit tests. All green means correct. Most of MAI’s 265,000 problems follow this structure: write code to fix a bug, tests all pass, you pass. This is what’s called a “verifiable reward signal” — deterministic, clean, no human review needed.

But agentic training includes a more complex category of tasks: tool use. For example: give the model a database and a set of tools, and ask it to “query all customers who spent more than $500 in the past three months.” The model has to decide which tool to use, what parameters to pass, how to chain multiple queries. It returns a list of results. How do you judge whether that list is correct?

MAI built a three-tier scoring system for these tasks.

The first tier is an environment-specific grading script. Each task environment comes with its own grader that checks whether the final state is correct, whether the right tools were called, and whether the returned results are right. It’s the same logic as unit tests; what changes is what’s being checked: not “do the tests pass,” but “is the tool-use workflow correct.”
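For the customer-spend example, a first-tier grader could compare the model's returned list against a reference computed from the database. The sketch below uses Python's built-in `sqlite3`; the schema, the sample rows, the fixed window start date, and the reading of "spent more than $500" as total spend within the window are all assumptions. It checks only the returned result, not which tools were called.

```python
import sqlite3


def build_database() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("alice", 300.0, "2025-05-10"),
            ("alice", 250.0, "2025-06-01"),  # alice totals 550 inside the window
            ("bob", 900.0, "2025-01-15"),    # large order, but outside the window
            ("carol", 120.0, "2025-06-20"),
            ("dave", 700.0, "2025-04-15"),
        ],
    )
    return conn


def expected_customers(conn: sqlite3.Connection, window_start: str,
                       threshold: float) -> set[str]:
    # ISO-formatted dates compare correctly as plain strings.
    rows = conn.execute(
        "SELECT customer FROM orders WHERE order_date >= ? "
        "GROUP BY customer HAVING SUM(amount) > ?",
        (window_start, threshold),
    ).fetchall()
    return {row[0] for row in rows}


def grade(conn: sqlite3.Connection, model_answer: list[str],
          window_start: str = "2025-04-01", threshold: float = 500.0) -> float:
    """Return 1.0 only if the model's list matches the reference set exactly."""
    reference = expected_customers(conn, window_start, threshold)
    return 1.0 if set(model_answer) == reference else 0.0


conn = build_database()
correct_score = grade(conn, ["dave", "alice"])
wrong_score = grade(conn, ["alice", "bob"])
print(correct_score, wrong_score)
```

Comparing sets rather than lists means the order of the returned customers does not affect the grade.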

The second tier is an AI judge. It decomposes complex tool-use tasks into sub-tasks, scores each sub-task individually, then aggregates. For example, “query the past three months” is one sub-task, “over $500” is another, and “return a correctly formatted list” is a third. Each sub-task is judged independently, so an error in one part doesn’t tank the entire score. Multiple scoring runs are averaged to reduce variance from individual judgments.
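The decompose, score, and average steps can be sketched as follows. The judge itself is not modeled: each entry in `judge_runs` stands in for one scoring run by a hypothetical AI judge, and the sub-task names and equal weighting are assumptions.

```python
from statistics import mean


def aggregate_judge_runs(
    runs: list[dict[str, float]],
) -> tuple[dict[str, float], float]:
    """Average each sub-task across runs, then average the sub-tasks."""
    subtasks = list(runs[0])
    per_subtask = {name: mean([run[name] for run in runs]) for name in subtasks}
    overall = mean(list(per_subtask.values()))
    return per_subtask, overall


# Three scoring runs of the same response (hypothetical scores in [0, 1]).
judge_runs = [
    {"time_window": 1.0, "amount_filter": 0.0, "output_format": 1.0},
    {"time_window": 1.0, "amount_filter": 0.0, "output_format": 0.5},
    {"time_window": 1.0, "amount_filter": 0.0, "output_format": 0.0},
]

per_subtask, overall = aggregate_judge_runs(judge_runs)
print(per_subtask, overall)
```

In this example the response gets the amount filter wrong in every run but still earns partial credit for the time window, which is the point of scoring sub-tasks independently.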

The third tier is a reward model trained specifically on human preference data. This model doesn’t judge “right or wrong” — it judges “good or better.” Both queries may return the correct customer list, but one used a single tool call while another used five. Both are correct, but they differ in efficiency. The reward model’s job is to tell the model, among equally correct answers, which approach deserves encouragement.

Three tiers combined: verifiable signals ensure deterministic judgments, the AI judge ensures task coverage, and the reward model ensures quality preferences. During RL training, the weights of these three are dynamically adjusted based on task type.
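One simple way to picture the combination is a weighted sum whose weights depend on task type. The weight values, tier names, and task-type labels below are invented for illustration, and a static table is a simplification of the dynamic adjustment the article describes.

```python
# Hypothetical weights per task type; each row sums to 1.0.
TIER_WEIGHTS = {
    "swe": {"verifiable": 1.0, "judge": 0.0, "preference": 0.0},
    "tool_use": {"verifiable": 0.5, "judge": 0.3, "preference": 0.2},
}


def combined_reward(task_type: str, signals: dict[str, float]) -> float:
    """Blend the three tier scores using the weights for this task type."""
    weights = TIER_WEIGHTS[task_type]
    return sum(weights[tier] * signals[tier] for tier in weights)


# A bug-fix task relies entirely on the unit-test signal.
swe_reward = combined_reward(
    "swe", {"verifiable": 1.0, "judge": 0.2, "preference": 0.9}
)

# A tool-use task blends all three tiers.
tool_reward = combined_reward(
    "tool_use", {"verifiable": 1.0, "judge": 0.5, "preference": 0.5}
)
print(swe_reward, tool_reward)
```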

This scoring system works well in most scenarios. But it has a blind spot.


Which Problems Are Hardest to Grade

MAI’s three-tier scoring system covers backend code and tool use. But there’s one category of tasks that all three tiers fail to reach: frontend. What a button looks like — unit tests have no idea. Whether the layout is broken — unit tests have no idea. Whether responsive breakpoints are falling apart — unit tests have no idea.

This is also why you may have noticed a phenomenon: some AI models produce frontend code that looks surprisingly polished and handles interactions smoothly, while others, though functionally correct, have interfaces that feel “a bit off.” That gap isn’t innate. It’s a training artifact.

Training a model to write frontend code and training a model to write good-looking frontend code use the same algorithm but different reward signals. If your reward signal is only “all unit tests pass,” the model only cares about functional correctness, not visual quality. If you add a layer of visual inspection to the reward signal — is the button right, is the layout broken, does the interaction flow — the model learns those things during training.

GLM-5 includes a system called Agent-as-a-Judge that automates this visual inspection. Every generated frontend project is first verified to build successfully, then deployed to a preview page. A multimodal agent opens the page in a browser and simulates a human’s visual and interaction inspection workflow. This agent judge’s assessments have been calibrated against human professional reviewers and fall within an acceptable range.
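The build-then-inspect order suggests a gated reward: a project that fails to build scores zero, and only a buildable one is scored on inspection. The sketch below is an assumption about how such gating could work, not GLM-5's implementation; the check names are hypothetical, and the inspection report is hard-coded where a multimodal agent would produce it.

```python
def frontend_reward(build_succeeded: bool, inspection: dict[str, bool]) -> float:
    """Gate on the build, then score the fraction of inspection checks passed."""
    if not build_succeeded or not inspection:
        return 0.0
    return sum(inspection.values()) / len(inspection)


# Stand-in for the report a browser-driving agent judge might return.
report = {
    "button_renders": True,
    "layout_intact": True,
    "responsive_breakpoints": False,
    "interaction_flows": True,
}

built_score = frontend_reward(True, report)
broken_score = frontend_reward(False, report)
print(built_score, broken_score)
```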

The value of this case isn’t in GLM-5’s specific implementation. It demonstrates something broader: review tasks previously thought to require humans can be automated. Once automated, they can be fed into RL training at scale. Models aren’t born knowing how to write good-looking frontends. They’re shaped by whatever reward signals you train them with.


One Exam Hall Is Enough

Look at “producing exams” and “grading” together, and you start to understand what really changed in AI coding over the past two years.

Exam production has been industrialized. A training environment is no longer a single line of prompt handwritten by an engineer; it’s an automatically generated exam with a deterministic grading signal. MAI’s funnel tells us that refining 265,000 of these exams from 4.87 million GitHub PRs takes three quality checkpoints, each discarding roughly 57% to 64% of what reaches it. Training scale is determined by how many of these exams you can produce.

Grading has been industrialized too, and its industrialization determines the quality of what the model produces. If you use only unit tests as a reward, the model learns only to write functionally correct code. If you add tool-use efficiency to the reward, the model learns to reach the same result with fewer tool calls. If you add frontend visual quality judgment to the reward, the model learns to write better-looking interfaces. The richness of the reward signal is the upper bound on the quality of the model’s output.

Both things run on the same logic: the bottleneck in agentic engineering isn’t the model itself — it’s the training infrastructure. The quality and quantity of exam production determine how much the model can train; the precision and dimensionality of grading determine how well the model can train. Neither of these is new, but turning them from manual labor into an assembly line — that’s what actually happened in the last two years.

Let me close with a metaphor. Think of agentic engineering as an automated exam hall. Two years ago, you had a student who could write code, but no exam papers and no graders. Now, the exam production line is built and the grading pipeline is running. You have hundreds of thousands of problems, and the scoring standard for each isn’t just “it runs.” Passing unit tests is the floor. Tool-use efficiency is the baseline. Visual and interaction quality is the ceiling. These standards weren’t bolted on at evaluation time. They were baked into the reward signal during training. The model didn’t just learn to test well — it was trained into the shape you wanted from the start.

The real change over those two years wasn’t that the student got smarter. It’s that the entire system around it was built from scratch for the first time.


Knowing What to Look For

The next time you see a coding agent posting a new SWE-bench high score, don’t start by asking “what model is it using.” Ask three questions instead.

First, where did the training environments come from? A few dozen problems handwritten by engineers, or industrially filtered from real repositories? What funnel was used? How much got cut at each stage?

Second, what’s the reward signal? Is “all unit tests pass” enough, or is there an agent watching the frontend in a browser?

Third, was anti-cheating done? Can the model access the internet? Can it search for the answer key? Were the git history and PR references in the sandbox cleaned out thoroughly?

These three questions have nothing to do with the model’s architecture. But they determine whether that string of benchmark scores you’re looking at was propped up by training infrastructure. Learn to ask these three questions, and you’ll learn to tell the difference between “this model can write code” and “someone built it a proper exam hall.”


This article is based on Microsoft AI’s MAI-Thinking-1 Technical Report and Zhipu AI’s GLM-5 Technical Report, both published in 2026.