Microsoft AI's MAI-Thinking-1: Training LLMs Is Rock Climbing, Not Rocket Science

An underrated 109-page R&D manual that reveals top AI labs’ engineering taste for the first time


If you opened a technical report from Microsoft AI in June 2026, your first reaction might be: “Oh, Microsoft makes models now.”

That’s fair. In the pre-training world, the loudest names belong to OpenAI, Anthropic, Google DeepMind, and DeepSeek. Microsoft is better known as OpenAI’s investor and the company behind Azure, not as a model developer. So a 109-page Microsoft technical report easily reads as yet another corporate PR move: “we made a model too, though it’s not that great.”

But if you skip this report because of that first impression, you’ll miss something genuinely rare.

It’s not that “Microsoft made a good model.” (They did — 35B active parameters, 97% on AIME 2025, 52.8% on SWE-Bench Pro, trading blows with Claude Sonnet 4.6 — but that’s not what this article is about.)

What makes this rare: for the first time, someone has written down the R&D philosophy that top AI labs know internally but rarely state publicly — across 109 pages, without holding back.

I’m not talking about a specific trick or a clever training technique. I’m talking about the mental framework that lets you build better models from scratch, over and over again. The top labs certainly know this framework; otherwise they couldn’t keep shipping. But to the outside world, it’s been a black box. You always see the results (“so-and-so topped the benchmark again”), but you never see the process: How did they know this change was correct? How did they decide an idea was worth pursuing? How did they avoid wasting tens of millions of dollars on the wrong direction?

This report pulls back the curtain.

This article isn’t an ad for Microsoft AI (MAI). What I want to do is help you, through this report, absorb the engineering taste that top labs have internalized but outsiders have no access to: not how to train models, but how to make R&D decisions.


Starting from first principles: pre-training is insanely expensive

Let’s establish something everyone can agree on: large model pre-training is extremely expensive.

Microsoft AI’s MAI-Base-1 was trained on 30 trillion tokens using 8,192 GB200 GPUs. Electricity alone, by conservative estimates, runs into millions of dollars. The full pre-training plus mid-training run exceeds 33.5 trillion tokens and spans months.

At this cost level, you can’t make R&D decisions by “trying things out.” Every choice — how many layers, how many experts, what activation function, what data mix — must be grounded. Pick the wrong direction, and the price isn’t rewriting code; it’s tens of millions of dollars down the drain.

So where does the grounding come from?

From experiments. But you can’t run experiments on large models. You can only run them on small models.

This is an industry-wide consensus: train dozens of small models with a few hundred million parameters, keep costs manageable, find the winning approach, then scale it up to hundreds of billions. The entire process rests on a default assumption — ranking order stays the same when you scale up. Academics call it the “rank invariance hypothesis.” If you prove with a 5B model that approach A beats B, then at 500B, A should still beat B.

If this hypothesis holds, R&D is straightforward: run massive experiments on small models, scale the winners, done.
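
The rank invariance assumption can be checked mechanically once the same candidates have been scored at more than one scale. Here is a minimal sketch in Python; the function name, the dictionary layout, and all scores are hypothetical illustrations, not anything taken from the report.

```python
def ranking_is_stable(scores_by_scale):
    """Return True if candidates rank in the same order at every scale.

    scores_by_scale maps a scale label to {candidate: score},
    where a higher score is better.
    """
    rankings = [
        sorted(scores, key=scores.get, reverse=True)
        for scores in scores_by_scale.values()
    ]
    return all(r == rankings[0] for r in rankings)


# Hypothetical benchmark scores, made up for illustration only.
stable = {"5B": {"A": 0.62, "B": 0.58}, "500B": {"A": 0.81, "B": 0.79}}
flipped = {"5B": {"A": 0.62, "B": 0.58}, "500B": {"A": 0.79, "B": 0.81}}

print(ranking_is_stable(stable))   # True
print(ranking_is_stable(flipped))  # False
```

The check only says whether the order held at the scales you measured. It says nothing about scales you have not run yet, which is exactly where the risk sits.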

The problem: it doesn’t hold.


One small experiment, one big lesson

In their early data mix studies, MAI designed two approaches: one biased toward code data (the code-heavy mix), and another that substantially increased the proportion of math and science data (the stem-heavy mix).

They first trained both with a 5B model. Result: stem-heavy was clearly better on STEM benchmarks. This makes perfect intuitive sense — more STEM content in the training data, stronger STEM capability in the model.

Then they scaled both to 23B parameters, each trained on 20 trillion tokens.

The STEM performance curves crossed in the middle. Code-heavy ultimately overtook stem-heavy.

This means real trouble: all the experiments you ran on small models, all the “best approaches” you selected — they don’t necessarily hold when scaled up. You invested massive resources pushing a direction you thought was validated, only to discover the validation itself was unreliable.

Why did this reversal happen? Their attribution analysis found a surprising cause: the stem-heavy mix included a set of high-quality STEM data sources — excellent content but lacking diversity. Essentially, they were feeding the model the same kind of information over and over. For small models with limited capacity, these “concentrated supplements” were extremely helpful. But larger models learn faster, extracting all the value from this data in just a few passes. After that, repeated training became an overfitting risk. In the code-heavy mix, these data sources made up a tiny fraction, forcing the model to learn more generalizable STEM capabilities from more diverse data — initially “slower,” but ultimately yielding a higher ceiling.
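
One way to see how a narrow source turns into repetition is to count how many passes the model makes over it. The arithmetic below is a sketch under assumed numbers: the source size and both mix fractions are hypothetical, and only the 20-trillion-token budget comes from the experiment described above.

```python
def epochs_over_source(total_tokens, mix_fraction, source_unique_tokens):
    """Number of passes over a source when it fills mix_fraction of training."""
    return total_tokens * mix_fraction / source_unique_tokens


TOTAL_TOKENS = 20e12   # 20T-token budget of the 23B runs
SOURCE_TOKENS = 50e9   # hypothetical: 50B unique tokens in the narrow source

# Hypothetical shares of the training mix given to that source.
heavy = epochs_over_source(TOTAL_TOKENS, 0.02, SOURCE_TOKENS)
light = epochs_over_source(TOTAL_TOKENS, 0.002, SOURCE_TOKENS)

print(f"heavy share: {heavy:.1f} passes, light share: {light:.1f} passes")
```

Under these made-up numbers, the heavier share means about eight passes over the same tokens, while the lighter share never finishes a single pass.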

This finding isn’t about MAI “inventing” a new method. It confirms something top labs have known internally but rarely discuss publicly: small-scale experiment results cannot directly inform large-scale decisions.

So the question becomes: if you can’t make decisions based on small-model experiments, how do you make decisions?


Watch the trend, not the point

After the stem-heavy vs. code-heavy lesson, a natural idea emerges: don’t just look at performance at a single scale — watch the trend.

If small-model experiment results can flip when scaled up, then test every candidate approach at multiple scales — from 300M to 2B to 10B parameters — and plot its performance curve against model scale. If the curve stays above baseline across the entire scale range with no slope decay, the approach is genuine. If it only shines at one scale and starts falling behind at larger scales, it’s another stem-heavy.

This idea is intuitively correct and lies at the heart of MAI’s methodology. But “watching the trend” as a vague qualitative judgment has four problems that must be solved before it can actually work in R&D.

Problem one: improvements in different dimensions can’t be compared. You explore three directions simultaneously — architecture tweaks, data source swaps, training recipe adjustments. The architecture change drops loss by 0.02, the data change drops loss by 0.015, the recipe adjustment drops loss by 0.025. Which is worth pursuing? Absolute values aren’t comparable because different experiments use different model sizes and token counts. You need a unified metric.

Problem two: loss improvements at different scales carry different weight. Dropping loss by 0.01 on a 300M model and dropping it by 0.01 on a large model differ in cost by orders of magnitude. Larger models make loss reduction harder, which is the diminishing marginal returns that scaling laws describe. If you look only at absolute curve values, you’ll severely underestimate how hard improvement is at large scale.

Problem three: you can’t compare different approaches’ scaling momentum. Approach A beats B by 10% at 5B, but only 2% at 20B. Approach C loses to B by 3% at 5B, but beats B by 8% at 20B. Staring at a few curves by eye, it’s hard to precisely say whose scaling prospects are better — especially when they cross.

Problem four: you can’t tell “the algorithm itself is weak” from “the implementation isn’t optimized yet.” Mediocre performance at 10B could have two causes: the approach is fundamentally not suited to scale, or its kernels haven’t been optimized yet and actual training efficiency suffers. Distinguishing the two is critical, because the latter can be solved with engineering investment while the former isn’t worth pursuing.

These four problems share a common root: you need to turn the intuition of “the trend looks good” into a computable, comparable, decision-driving operational metric.

MAI’s answer is Efficiency Gain (EG). The calculation is straightforward: first fit a scaling law curve for the baseline approach — describing how its performance changes with training cost. Then, for any candidate approach, see what performance P it achieves at a given cost C, and trace back to the baseline’s scaling law curve to find how much the baseline would need to spend to reach the same P. If the baseline needs 30% more cost, the candidate’s EG is 1.3.
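
The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's implementation: it takes loss as the performance measure (lower is better), models the baseline with a bare power law, loss = a * cost^(-b), with no irreducible-loss term, and uses synthetic numbers. All function names are hypothetical.

```python
import numpy as np


def fit_power_law(costs, losses):
    """Fit loss = a * cost**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(costs), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)


def baseline_cost_for_loss(a, b, loss):
    """Invert loss = a * cost**(-b): the cost at which the baseline hits `loss`."""
    return (a / loss) ** (1.0 / b)


def efficiency_gain(a, b, candidate_cost, candidate_loss):
    """Baseline cost needed to match the candidate, divided by candidate cost."""
    return baseline_cost_for_loss(a, b, candidate_loss) / candidate_cost


# Synthetic baseline ladder (made-up numbers that follow an exact power law).
baseline_costs = np.array([1e18, 1e19, 1e20, 1e21])
baseline_losses = 10.0 * baseline_costs ** -0.05
a, b = fit_power_law(baseline_costs, baseline_losses)

# Hypothetical candidate: at 1e20 FLOPs it reaches the loss the baseline
# would only reach at 1.3e20 FLOPs.
candidate_cost = 1e20
candidate_loss = 10.0 * (1.3e20) ** -0.05

eg = efficiency_gain(a, b, candidate_cost, candidate_loss)
print(f"EG = {eg:.2f}")
```

By construction the candidate here recovers the 1.3 from the example above. With real measurements the fit is noisy, so the quality of the baseline curve bounds how much any single EG value can be trusted.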

This definition solves all four problems at once. All improvements are uniformly converted into “what percentage more would the baseline need to spend” — whether you changed the architecture or swapped data, EG = 1.3 means the same thing (problem one). Because it’s based on scaling law fitting, diminishing marginal returns are baked in — reaching the same loss improvement at larger scale requires more baseline cost, and EG automatically encodes this difference (problem two). Plotting EG at each FLOPs level as a curve, you can see whose scaling momentum is stronger — crossings, decay, acceleration, all at a glance (problem three). Finally, they separately compute EG_FLOPs (compute only) and EG_Time (including actual training time), decoupling algorithmic efficiency from hardware implementation efficiency — if EG_FLOPs > 1 but EG_Time < 1, the direction is right but the kernels haven’t caught up (problem four).
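
The gap between the two variants can be made concrete under a simplifying assumption that the report may not share: on identical hardware, wall-clock time is proportional to FLOPs divided by MFU. In that case EG_Time equals EG_FLOPs scaled by the ratio of candidate MFU to baseline MFU. The function names are hypothetical, and pairing an EG_FLOPs of 1.3 with the 22% and 16% MFU figures quoted later in this article is an illustration, not a result from the report.

```python
def eg_time_from_flops(eg_flops, candidate_mfu, baseline_mfu):
    """Convert a FLOPs-based gain to a wall-clock gain.

    Assumes identical hardware and time proportional to FLOPs / MFU,
    so EG_Time = EG_FLOPs * (candidate_mfu / baseline_mfu).
    """
    return eg_flops * candidate_mfu / baseline_mfu


def diagnose(eg_flops, eg_time):
    """Map the two gains to a coarse next step."""
    if eg_flops > 1 and eg_time > 1:
        return "wins on compute and wall-clock"
    if eg_flops > 1:
        return "algorithm ahead, implementation behind: invest in kernels"
    return "no algorithmic gain at this scale"


eg_flops = 1.3  # illustrative value
eg_time = eg_time_from_flops(eg_flops, candidate_mfu=0.16, baseline_mfu=0.22)
print(round(eg_time, 3), "->", diagnose(eg_flops, eg_time))
```

With these inputs the wall-clock gain lands just under 1 even though the compute gain is well above it, which is the "right direction, slow kernels" case.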

To obtain reliable EG estimates, they built a “ladder”: a series of models trained with a strictly consistent token-per-parameter ratio, from L12 to L78. Every time you have a new idea, you don’t test it at a single scale. You test it at every rung of the ladder, then plot an EG-versus-cost curve. What matters isn’t whether EG exceeds 1 at a particular rung. What matters is whether it rises, stays flat, or falls at larger scales. If the curve is climbing, congratulations: you’ve found a direction that gets better as it scales. If it’s falling, abandon it early.
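
Reading the curve can also be reduced to a number. The sketch below fits EG against the logarithm of cost and labels the slope; the threshold, the rung costs, and the EG readings are all hypothetical, and the report may judge trends differently.

```python
import numpy as np


def eg_trend(costs, eg_values, tol=0.02):
    """Classify how EG moves with scale.

    Fits EG against log10(cost); `tol` is a hypothetical threshold on the
    slope (EG change per 10x cost) below which the curve counts as flat.
    """
    slope = np.polyfit(np.log10(costs), eg_values, 1)[0]
    if slope > tol:
        return "rising"
    if slope < -tol:
        return "falling"
    return "flat"


# Hypothetical ladder rungs and EG readings, for illustration only.
rung_costs = [1e18, 1e19, 1e20, 1e21]
print(eg_trend(rung_costs, [1.05, 1.12, 1.20, 1.31]))  # rising
print(eg_trend(rung_costs, [1.30, 1.18, 1.05, 0.96]))  # falling
print(eg_trend(rung_costs, [1.10, 1.10, 1.10, 1.10]))  # flat
```

Note that the second series starts with the highest EG of the three and still gets rejected, because the decision looks at the slope, not the starting point.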

This is the core of engineering taste: not finding “what works,” but finding “what still works at larger scale.”


How the climb actually went: five versions, five roller coasters

With this mental framework in place, you can truly understand the “ECG” below.

MAI’s Figure 11 documents their evolution from architecture v2 to v5. It tells two stories at once: training efficiency (MFU, model FLOPs utilization) drops after every architectural change before climbing back through engineering optimization; model quality (EG) rises steadily across version iterations.

Figure 11: MAI pre-training configurations show MFU recovering after optimization while EG rises across versions.

The pink bars on the bottom right match your image of “continuous improvement”: v2 to v3 climbing, v3 to v4 climbing, v4 to v5 climbing.

But if you only look at the EG half of the chart, you miss the part that actually matters.

What carries real information is the top half: MFU crashes and recovers every single time.

For v4, they made a series of major architectural changes: expanding experts from 192 to 512, switching routing from top-4 to top-8, and introducing Latent MoE. After these changes, MFU dropped from 22% straight to 16%.

This isn’t because they did anything wrong. A brand-new architecture is always inefficient — no optimized kernels exist yet, no mature GPU memory layout strategies, no targeted communication optimization. Before you can “optimize it,” you must first “invent it,” and the moment you invent it, it’s clumsy.

How did v4 climb back from 16% to 20%? They upgraded FlashAttention from v2 to v4, rewrote CPU launch logic to reduce scheduling overhead, and increased intra-batch task aggregation parallelism. Over twenty optimizations later, it stabilized.
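
For readers who want the percentages grounded, MFU is commonly estimated as the model FLOPs actually processed per second divided by the hardware's peak FLOPs per second. The sketch below uses the widely used approximation of about 6 FLOPs per active parameter per training token, which ignores attention costs. Whether the report computes MFU this way is an assumption, and every number in the example is made up.

```python
def mfu(tokens_per_second, active_params, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved model FLOPs/s over peak FLOPs/s.

    Uses ~6 * active_params FLOPs per training token (forward + backward),
    a common approximation that leaves out attention FLOPs.
    """
    achieved = 6.0 * active_params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Made-up numbers chosen for easy arithmetic, not real hardware specs.
value = mfu(
    tokens_per_second=1e6,
    active_params=1e9,
    num_gpus=10,
    peak_flops_per_gpu=1e15,
)
print(f"MFU = {value:.0%}")
```

Because the numerator depends only on the model and the token throughput, a new architecture with unoptimized kernels shows up directly as lower tokens per second and therefore lower MFU.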

Then v5 arrived. Active parameters grew from 23B to 35B, and total parameters from 600B to 1T. That brought a new round of crashes and a new round of optimization, this time relying on activation offloading and finer-grained sharding strategies.

The essence of hill climbing is right here. Every innovation makes efficiency drop first, then rise, and you can tolerate this volatility because your ladder tells you “EG is trending up.” You have confidence — not because current MFU looks good, but because you know your direction is right, and that makes the climb worth it.


What to take away

After reading these 109 pages, my strongest impression has nothing to do with whether the model is good.

My impression is this: the competition among top AI labs no longer revolves around who has a better architecture idea. It revolves around who has a better system — one that can rapidly validate good ideas, continuously scale validated ideas, and make the scaling process controllable and reviewable.

They gave this system a name — hill-climbing machine. The hill-climbing machine isn’t a specific algorithm or a piece of GPU hardware. It’s an entire closed loop spanning data processing, scaling experiments, training frameworks, evaluation, safety, and infrastructure.

This report hides no “secret sauce.” You can read it yourself — 109 pages, from how attention is initialized to how experts are load-balanced to how RL rewards are designed, all written out.

But after reading it, you might realize: knowing all the details doesn’t mean you can build the same thing. A recipe is something you can follow, and this isn’t one. It is an organically grown system that accumulates, over time, every fall and every climb back up. It’s like reading a professional rock climber’s training log: you know every move they made, but you can’t replicate it immediately.

Still, for the vast majority of people, what you need to know isn’t “how do I replicate it,” but “so this is how it’s actually done.”

That’s the report’s real value: it’s not showing off a model, it’s documenting a way of thinking. And that way of thinking (starting from first principles, facing the high cost of failure, designing experimental frameworks that protect you from being deceived, accepting that “innovation must first regress”) applies far beyond AI R&D.

A good system isn’t something you design once and then call done. A good system is one that can keep making itself better.

Next time you see a new model release, don’t just ask “how good is it.” Ask: what does its hill-climbing machine look like?


This article is based on Microsoft AI’s MAI-Thinking-1 technical report, published June 2026.