
DeepSeek V4 Explained: Engineering Decisions Around Agentic Workloads

Written by DeepSeek-4-Flash

Three Generations, Three Problems

The quickest way to understand what DeepSeek V4 is doing is to look at the problems each of its predecessor generations solved.

For V2 and V3, the core problem was the compute cost of training and inference. At the time, the mainstream way to serve large models was the dense model: both training and inference consumed enormous computational resources. DeepSeek’s answer was MoE (Mixture of Experts). A routing mechanism activates only a small subset of parameters per inference pass. Combined with load balancing and low-precision computation, this lets the same amount of compute serve more users. The tension of this era was cost versus throughput.
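To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. Everything in it, the expert count, the gating, the dimensions, is illustrative rather than V4’s actual configuration; the point is only that just k of the experts run for any given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative, not V4's config)."""
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)    # keep only k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # only the selected experts compute
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 512))                         # 10 tokens, 2 of 8 experts each
```

A real deployment adds the load-balancing loss and low-precision kernels mentioned above; the skeleton stays the same.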

R1 took on a different problem. OpenAI’s o1 demonstrated the value of long reasoning paths: before giving an answer, the model generates an internal chain of thought, repeatedly verifying and correcting itself. But o1’s methodology was opaque. R1’s core contribution was proving that a pure reinforcement learning route can let a model autonomously develop this long-chain reasoning capability, without requiring large amounts of human-annotated reasoning traces.

With V4, the problem shifted again. V4 is not handling a single Q&A, or even a single long reasoning session. It handles sustained multi-turn agent tasks: read a code repository, examine error logs, invoke tools to modify files, run tests, adjust based on results, repeat until the task is finished. Along the way the model needs to remember which files it modified, what results the tools returned, and which hypotheses were already ruled out. This is the core target of the V4 technical report: agentic workloads.

Three problems, three generations. V4 sits right at the inflection point where language models shift from conversation tools to task-execution systems.

Reading V4’s technical report, I’m reminded of Blue Space from The Three-Body Problem. That ship was not built for elegance in a drydock. It took its real shape after the Doomsday Battle, when it had to set out on a genuine long-distance voyage: carrying enough fusion fuel for centuries, with eight layers of redundancy on critical components, external storage pods making the hull look like an irregular mass. It was not beautiful. It looked like the actual condition of a long-haul traveler.

V4’s technical choices resemble that ship. It does not chase the theoretically cleanest solution. You’ll see that it didn’t pick the most classic attention route, nor the most aggressive efficiency route; instead it cobbled together several mechanisms for different scenarios. Post-training didn’t take the most straightforward end-to-end mixed RL, nor the most convenient weight averaging; it trained specialists separately first, then merged them through distillation. The training system mixes two optimizers, multiple precisions, and plenty of engineering patches. V4 doesn’t look beautiful or elegant. It looks more like a pragmatic, effective engineering system.

Starting from the goal of agentic workloads, each of V4’s key choices can be derived. Agent tasks push three tensions to the foreground simultaneously. The V4 technical report is, at its core, an answer to these three tensions.

The First Tension: A 1M-Token Window Is Useless If the Agent Can’t Use It

For agent tasks, a 1M-token context window is not merely about “how much document you can fit in.” Passing a long-context benchmark is just the starting line. An agent needs to repeatedly retrieve history across a dozen or more tool-calling rounds, continuously preserving task state. The demands on context are fundamentally different from reading a long document from end to end.

Agent scenarios differ in kind from traditional needle-in-a-haystack tests. In a haystack test, the text is long, but the question is specific, the answer sits in one location, and the model just needs to find it. Agent tasks are nothing like that. A coding agent may have already run ten rounds of tool calls, with user requirements, returned results, and intermediate reasoning from every round all sitting in context. It needs to be able to revisit, at any moment, a specific error log, a JSON blob from a previous tool return, or the reason a particular approach was abandoned. What this scenario demands is not finding a single sentence; it’s preserving a continuously evolving task state.

Conventional chat models can discard old thinking content when a new user message arrives, saving context. Agent scenarios cannot do this. Discarding old thinking means discarding accumulated task state. The V4 report explicitly distinguishes between these two scenarios: preserve full reasoning history during tool-calling, discard old thinking in regular conversation. This tells you that the point of 1M context is whether an agent can remain useful across long tasks, not simply how long a document it can read.

This problem ultimately lands on the attention mechanism. Every time a Transformer generates a new token, it must decide which parts of the context are worth consulting. Attention is the lookup mechanism that does this. When context is only a few thousand tokens, the lookup cost is negligible. But at the million-token level, with an agent that needs to repeatedly revisit old logs, tool returns, and code snippets, attention becomes the place where cost and retrieval precision collide.

Faced with this problem, designers have several routes to choose from.

One route is full attention. It’s the most reliable: every token in history retains a precise access path. When the model needs to revisit a particular block of code or a particular log fragment, it can locate it exactly. But under 1M context, every new token requires scanning the entire history. The computational cost is hard to accept. It’s like flipping through the entire archive from the first shelf to the last every time you need to find something: precision is perfect, but re-scanning the whole archive each time is unsustainable.

Another route is switching to alternative architectures like linear attention or SSM (State Space Model). These approaches essentially abandon the attention paradigm of on-demand lookup into historical tokens. Instead they compress history into a fixed-size state. The cost drops dramatically, but token-level identity information gets blended away. For agent tasks, precisely revisiting a specific block of code or a specific tool return matters more than understanding global trends. Abandoning attention’s on-demand retrieval capability entirely might crush critical capabilities along with the cost.
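The fixed-size-state point is easiest to see in code. Below is the generic linear-attention recurrence, not any particular SSM: no matter how many tokens arrive, the entire history lives in one d_k × d_v matrix, which is exactly why token-level identity gets blended away.

```python
import torch

def linear_attention_step(state, q, k, v):
    """One decoding step of generic linear attention (schematic).

    `state` is a (d_k, d_v) running summary of ALL past tokens. Its size
    never grows, so individual tokens cannot be retrieved exactly.
    """
    state = state + torch.outer(k, v)   # fold the new token into the summary
    out = q @ state                     # read out against the compressed state
    return state, out

d_k, d_v = 64, 64
state = torch.zeros(d_k, d_v)
for _ in range(10_000):                 # ten thousand tokens, still one (64, 64) state
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    state, out = linear_attention_step(state, q, k, v)
```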

Both paths carry costs. V4’s approach sits between them: keep attention’s precise lookup ability, but don’t re-scan the entire archive. First check the map to identify the right area, then pull and examine the relevant cabinets. It does compression and layering within the attention framework:

The first layer is HCA (Heavily Compressed Attention). It compresses distant history into coarser-grained entries, then runs global dense attention over all the compressed entries. Information is lost, but cost is low. It’s good for providing a global outline of distant history. The model at least knows there’s a region somewhere in the distance involving earlier error logs or tool returns; the details can be examined later when needed. HCA is like an ultra-compressed map. It doesn’t preserve every detail, but it lets the model know what regions exist in the distance.

The second layer is CSA (Compressed Sparse Attention). It also compresses, but after compression it no longer runs dense attention over all entries. Instead, an indexer selects the top-k most relevant compressed blocks for the current query to attend to. For V4-Pro, top-k is 1024; for V4-Flash, it’s 512. CSA works like a queryable index, letting the model locate specific information in distant history on demand.

The third layer is a 128-token sliding window. What just happened, the most recent tool return, the error that just surfaced, the code just modified: these need precise preservation. You can’t rely on compressed versions alone. This is like keeping what just happened on your desk.

Three layers working together: coarse view of the distance, detailed view of relevant blocks, full view of the near.
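Since the report doesn’t publish kernel-level details, the following is only a schematic of how the three branches might divide one query’s lookup: mean-pooled block summaries stand in for HCA’s compression, a top-k block index stands in for CSA’s indexer, and an exact local window covers the recent past. All names, sizes, and the final merge are illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_step(q, K, V, window=128, block=64, topk=8):
    """Schematic three-branch lookup for one query (illustrative only)."""
    near_K, near_V = K[-window:], V[-window:]        # the desk: recent tokens
    far_K, far_V = K[:-window], V[:-window]          # the archive: distant history

    # Compress the distant past into per-block summaries.
    n_blocks = far_K.shape[0] // block
    cK = far_K[:n_blocks * block].view(n_blocks, block, -1).mean(1)
    cV = far_V[:n_blocks * block].view(n_blocks, block, -1).mean(1)
    scale = q.shape[-1] ** 0.5

    # HCA-like branch: dense attention over compressed entries (coarse map).
    coarse = F.softmax(q @ cK.T / scale, dim=-1) @ cV

    # CSA-like branch: index the most relevant blocks, then attend to
    # their ORIGINAL tokens (precise lookup paid only where it matters).
    sel = (q @ cK.T).topk(min(topk, n_blocks)).indices
    tok_idx = (sel[:, None] * block + torch.arange(block)).flatten()
    fine = F.softmax(q @ far_K[tok_idx].T / scale, dim=-1) @ far_V[tok_idx]

    # Sliding-window branch: exact attention over what just happened.
    local = F.softmax(q @ near_K.T / scale, dim=-1) @ near_V

    return coarse + fine + local     # a real design merges branches more carefully

q = torch.randn(64)                                   # one query
K, V = torch.randn(10_000, 64), torch.randn(10_000, 64)
out = hybrid_attention_step(q, K, V)
```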

The trade-off nature of this approach is obvious. It picked neither of the theoretically clean endpoints, full attention or a pure SSM. Instead, within the attention framework, it splits access precision through compression, selection, and a local window, giving different historical distances different cost-and-precision configurations.

The payoff is clear from the numbers: under 1M-token context, V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2; V4-Flash needs only 10% FLOPs and 7% KV cache.

But the savings are not free. Hybrid attention produces multiple types of KV entries, breaking basic assumptions behind paged KV-cache management schemes like PagedAttention. V4 needs to manage both a classical KV cache and a state cache simultaneously, and design separate strategies for on-disk KV cache. The savings at the attention layer shift complexity onto cache layout, kernel design, and the serving system.

The Second Tension: After Capabilities Diverge, How Do You Merge Them?

Long context solves the state-preservation problem for agent tasks. But there’s another problem that attention can’t fix.

Consider an analogy from human work: doing math requires focused derivation without interruption; searching through materials requires rapid scanning and jumping around; debugging code requires repeated verification and modification. These three activities don’t use the same mental mode. For models, the situation is similar: math reasoning, agentic coding, and regular conversation each demand different behavioral strategies, and their corresponding training objectives and reward signals differ substantially. Math wants the model to think long; chat wants it to respond fast. A coding agent needs multi-turn tool calls; general Q&A wants a direct answer.

Here lies a tension: the training paths for different capabilities begin to diverge early, but the product requirement is that users call a single model. You can’t expose a dozen specialist models and ask users to pick. So the question becomes: how do you merge the capabilities of multiple expert models into one servable model?

At this point you might wonder: V4 is itself an MoE architecture, so why not let different experts handle different capabilities? The reason is that MoE routing occurs inside a single forward pass. Each expert represents architecture-level parameter specialization. Math reasoning and agentic coding might activate different expert combinations, but all experts are co-optimized within the same pretraining process. They were never separately trained by distinct post-training reward signals. Post-training specialists are different: they are complete behavioral strategies independently trained for different tasks and different reward signals. Their parameters have already been pulled in different directions. Product deployment needs unified model behavior. MoE’s architectural routing does not solve behavioral merging at this level.

V4’s OPD (On-Policy Distillation) is the answer to this tension. The best way to understand OPD is to start from the decision tree behind it.

First level: how to merge multiple capabilities?

The simplest approach is mixed RL: throw all tasks’ data and rewards together, train one model. It’s like putting all subjects into the same training camp. The process is direct, but the training objectives and rewards for different subjects interfere with each other. Math and agent tasks have different sequence lengths, different exploration paths, different failure modes. One reward recipe can’t serve all objectives well at once. If mixed RL doesn’t work, are there other directions?

One option: train specialists first, then merge.

Second level: once you train specialists, how do you merge them?

A straightforward idea is weight merging: average the parameters of multiple specialists in some way. It looks simple from an engineering standpoint, but different specialists, after their respective RL training, may have already ended up in different regions of parameter space. Direct merging easily loses capability. Different experts keeping their own parameters with a router deciding which to call: that’s architecture-level specialization. Weight merge, by contrast, crushes multiple directions back into a single parameter space and tends to serve neither capability well.

Distillation is another path. It has a student model learn the behavioral distribution of multiple teacher models. Weight merge is parameter-space merging; distillation is function-space merging. The student doesn’t directly average parameters; it learns the teachers’ output distributions. It comes closer to integrating multiple routes at the functional level. Using the earlier analogy: mixed RL is training all subjects in one camp simultaneously; distillation is having each specialist mature separately, then conducting a unified assessment.

V4 chose the latter.
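The contrast between the two merge strategies fits in a few lines. A toy sketch, assuming ordinary PyTorch modules: parameter-space merging averages state dicts directly, while function-space distillation never touches teacher weights and matches output distributions instead.

```python
import torch
import torch.nn.functional as F

# Parameter-space merge: average the specialists' weights directly.
# Simple, but it assumes the specialists still live in compatible regions
# of parameter space; after divergent RL runs they often don't.
def weight_merge(state_dicts):
    return {name: torch.stack([sd[name] for sd in state_dicts]).mean(0)
            for name in state_dicts[0]}

# Function-space merge: the student never sees teacher weights; it matches
# each teacher's OUTPUT distribution on that teacher's domain.
def distill_step(student, teacher, batch, optimizer):
    with torch.no_grad():
        t_probs = F.softmax(teacher(batch), dim=-1)
    s_logp = F.log_softmax(student(batch), dim=-1)
    loss = F.kl_div(s_logp, t_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```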

Third level: whose trajectories does distillation use?

Off-policy distillation uses fixed datasets pre-generated by teachers or older models for the student to learn from. Cost is low, but the state distribution the student actually encounters at runtime doesn’t match the fixed data. An intuitive way to think about it: the student watches the teacher’s standard recordings. The recordings are clean, but when the student hits the road on its own, it makes its own mistakes, reaching states that never appeared in the recordings. In agent tasks this mismatch is worse, because one wrong tool call and every subsequent state is different.

On-policy distillation has the student generate its own trajectories first, with the teacher providing guidance at the states the student actually reaches. Cost is higher, but the training states align more closely with real usage states. V4 chose on-policy. This decision determines where the training signal lands: the student isn’t merely imitating data the teacher prepared in advance; it receives teacher calibration on the paths it will actually walk.
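Operationally, “on-policy” means something quite specific. A minimal sketch, assuming `student` and `teacher` are callables returning per-token logits of shape (batch, seq, vocab): the trajectory is sampled from the student, and the teacher is queried at exactly those states. V4’s actual OPD pipeline is of course far more elaborate.

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, max_new=256):
    """One on-policy distillation step (schematic).

    The token sequence is sampled from the STUDENT, so the teacher's
    signal lands on states the student actually reaches, not on states
    from a pre-generated teacher dataset.
    """
    ids = prompt_ids
    with torch.no_grad():                       # 1. student rolls out its own path
        for _ in range(max_new):
            probs = F.softmax(student(ids)[:, -1], dim=-1)
            ids = torch.cat([ids, torch.multinomial(probs, 1)], dim=-1)

    s_logp = F.log_softmax(student(ids), dim=-1)     # 2. re-score with gradients
    with torch.no_grad():
        t_probs = F.softmax(teacher(ids), dim=-1)    # 3. teacher grades those states

    # 4. match distributions token by token along the student's trajectory
    return F.kl_div(s_logp, t_probs, reduction="batchmean")
```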

Fourth level: how much information does the teacher provide?

The lowest-cost approach is token-level KL (Kullback-Leibler divergence) estimation: look only at the specific token the student sampled, using it to approximate the gap between teacher and student. The V4 report notes that this approximation leads to high variance and unstable training.

Full-vocabulary logit distillation preserves the teacher’s probability distribution across the entire vocabulary. Cost is far higher, but the teacher’s judgments about other candidate tokens are preserved as well, giving a more complete training signal. V4 chose full-vocabulary.
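The gap between the two signals also fits in a few lines. For a single position with a toy vocabulary: the full-vocabulary loss sums the student-teacher log-ratio over every candidate token, while the token-level estimator keeps only the ratio at the one sampled token, which is where the variance comes from.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 8                                            # toy vocabulary, one position
t_logp = F.log_softmax(torch.randn(V), dim=-1)   # teacher's distribution
s_logp = F.log_softmax(torch.randn(V), dim=-1)   # student's distribution

# Full-vocabulary reverse KL: uses the teacher's judgment on EVERY
# candidate token, not just the one that happened to be sampled.
full_kl = (s_logp.exp() * (s_logp - t_logp)).sum()

# Token-level estimate: one sampled token, one log-ratio. Its expectation
# matches full_kl, but any single draw can be far off; this is the high
# variance the V4 report points to.
tok = torch.multinomial(s_logp.exp(), 1)
sampled = (s_logp[tok] - t_logp[tok]).squeeze()

print(full_kl.item(), sampled.item())
```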

Connect all four levels of decisions together, and you see that V4 chose the higher-cost, more complex route at every step: mixed RL replaced by specialist-first, weight merge replaced by function-space distillation, off-policy replaced by on-policy student trajectories, token-level estimate replaced by full-vocabulary logits.

OPD is the name this route eventually acquired. Its cost is not just one extra distillation step. It requires training over a dozen domain-specific teachers, maintaining teacher scheduling, rollout services, hidden-state caching, logit reconstruction, and fault-tolerant rollout infrastructure. The V4 report specifically mentions not directly materializing logits for vocabularies exceeding 100K, but instead caching last-layer hidden states and reconstructing them through the prediction head. This is a textbook example: to reduce extreme memory and compute costs, the system has to add new engineering layers.
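The arithmetic behind that trick is worth seeing once. A sketch with illustrative sizes: caching (seq, d_model) hidden states instead of (seq, vocab) logits shrinks the cache by a factor of vocab / d_model, and full logits are rebuilt through the prediction head in chunks only when the loss consumes them.

```python
import torch

d_model, vocab, seq = 4096, 128_000, 8192

# Full logits: seq * vocab floats, ~4.2 GB in fp32 at these toy sizes.
# Hidden states: seq * d_model floats, ~134 MB, a vocab/d_model = 31x saving.
hidden_cache = torch.randn(seq, d_model)          # cached last-layer states
lm_head = torch.nn.Linear(d_model, vocab, bias=False)

def logits_in_chunks(hidden, head, chunk=512):
    """Reconstruct full-vocabulary logits chunk by chunk, so the complete
    (seq, vocab) tensor never has to exist in memory at once."""
    for start in range(0, hidden.shape[0], chunk):
        yield head(hidden[start:start + chunk])   # (chunk, vocab)

for chunk_logits in logits_in_chunks(hidden_cache, lm_head):
    pass  # compute the distillation loss on this chunk, then discard it
```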

The OPD method originates from the On-Policy Distillation work at Thinking Machines Lab. DeepSeek’s contribution is integrating multi-teacher, full-vocabulary, long-context rollout, teacher scheduling, and fault-tolerant rollout infrastructure into a single system, and making the entire setup work stably at engineering scale.

The Third Tension: The Engineering Cost of Hybrid Approaches

The previous two sections explained why agentic workloads demand more complex attention and post-training. But these complex approaches don’t just run once you build them. They impose additional demands on the training system. These demands concentrate in two areas: training efficiency and training stability. In V4, they correspond to Muon and mHC (manifold-constrained Hyper-Connections). What these two solve is whether a complex architecture can still train stably and efficiently, not capabilities that users directly perceive.

Training efficiency: Muon

The optimizer determines how model parameters get updated during training. At the scale of trillions of parameters and 32T+ tokens, the choice of optimizer affects not just convergence speed, but also the memory and communication cost of the distributed system. This is a system-level decision in its own right.

The classic route is AdamW, the most widely used default optimizer in the field. V4 didn’t stick to this single route. It brought in another optimizer line from public research: Muon. I won’t go into its mathematical details here; intuitively, parameter updates are like a road repair crew deciding where to dig. AdamW decides the repair method for each small stretch of road independently; Muon pays more attention to the overall direction of the entire road. Concretely, Muon keeps a momentum buffer per weight matrix and approximately orthogonalizes it before applying the update, treating each matrix as one geometric object rather than a collection of independent scalars.

V4 didn’t replace everything with Muon. Different parameter groups use different optimizers: most matrix parameters use Muon, while parameters like embeddings and normalization layers keep AdamW. Making Muon work within the frameworks of MoE, distributed parallelism, and mixed precision also required additional engineering adaptation. This is not a clean replacement story; it’s a set of localized engineering trade-offs.
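For readers who want slightly more than the road analogy, here is a compressed sketch of the public Muon recipe together with the parameter split. The Newton-Schulz coefficients come from Jordan et al.’s open-source implementation; the ndim-based split is a simplification for illustration, not V4’s exact rule.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize an update matrix (public Muon recipe)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # coefficients from Jordan et al.
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(param, grad, momentum, beta=0.95, lr=0.02):
    """One Muon step: momentum, then whole-matrix orthogonalization. The
    update direction is decided for the matrix as a unit, the 'whole road'
    intuition from the text above."""
    momentum.mul_(beta).add_(grad)
    param.data.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)

# The split V4 describes: matrix parameters go to Muon, everything else
# (embeddings, norms, biases) stays on AdamW. Criterion simplified here.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.LayerNorm(64))
muon_params = [p for p in model.parameters() if p.ndim == 2]
adamw = torch.optim.AdamW([p for p in model.parameters() if p.ndim != 2])
muon_momenta = [torch.zeros_like(p) for p in muon_params]
```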

Muon comes from public optimizer research rather than being invented from scratch by DeepSeek: K. Jordan et al. (2024) proposed it, and Jingyuan Liu et al. (2025) verified its scalability on LLMs. V4’s work was bringing this external optimizer line into its own large-scale MoE training system, completing the engineering adaptation, and then publicly documenting the details. Muon, together with the attention design, mHC, low-precision training, and distributed strategies, collectively constitutes V4’s training cost control.

Training stability: mHC

Transformers rely on residual connections to pass signals between layers. The standard approach is a single direct channel: the layer output gets added back to the input. Hyper-Connections aim to expand this channel into multiple paths, enabling richer inter-layer information exchange. But more channels mean signals are more likely to go out of balance when stacked deep, and training becomes unstable.

The original Hyper-Connections concept comes from Zhu et al. (2025). V4 uses a manifold-constrained variant, mHC (Xie et al., 2026). The core idea is to add a boundary across the multiple channels, preventing any one path from being excessively amplified at depth. The cost: introducing mHC increases memory and communication overhead, requiring companion kernel optimizations, recomputation, and pipeline scheduling adjustments. The V4 report gives an overhead figure of roughly 6.7%, showing that the benefit, trainability of a more complex architecture, was achieved at a controllable cost.
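Because mHC’s exact constraint lives in the cited paper, the following is only a schematic of the underlying Hyper-Connections idea, with a row-softmax standing in for the manifold constraint: several parallel residual streams, a small learned mixing matrix per layer, and the mixing bounded so that no stream can be amplified without limit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConnectedBlock(nn.Module):
    """Schematic hyper-connections block (illustrative only).

    The single residual stream becomes `n_streams` parallel streams, mixed
    by a small learned matrix each layer. The row-softmax is a stand-in
    for mHC's manifold constraint, NOT the published formulation; it just
    shows the idea of bounding the mixing at depth.
    """
    def __init__(self, d_model, n_streams=4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_streams))             # stream mixing
        self.route_in = nn.Parameter(torch.full((n_streams,), 1 / n_streams))
        self.route_out = nn.Parameter(torch.full((n_streams,), 1 / n_streams))
        self.layer = nn.Linear(d_model, d_model)                  # stand-in sublayer

    def forward(self, streams):                                   # (n, tokens, d)
        # Constrained mixing: each softmax row sums to 1, so stacking
        # layers cannot blow any single stream up without bound.
        mixed = torch.einsum("ij,jtd->itd", F.softmax(self.mix, dim=-1), streams)
        # Collapse streams to feed the sublayer, then scatter its output back.
        x = torch.einsum("i,itd->td", F.softmax(self.route_in, dim=-1), mixed)
        return mixed + self.route_out[:, None, None] * self.layer(x)

block = HyperConnectedBlock(d_model=64)
streams = torch.randn(16, 64).repeat(4, 1, 1)    # replicate input into 4 streams
out = block(streams)                             # still (4, 16, 64)
```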

Muon and mHC together support a single judgment: V4 is not just saving cost on forward inference; it’s also expanding the manageable internal complexity of the model. CSA/HCA, MoE, long-context curriculum, low-precision training, and OPD all alter gradient paths or training distributions. Muon controls from the direction of parameter updates; mHC controls from the propagation of residual signals. Their shared role is to keep an architecture composed of more hybrid components still trainable and convergent.

Complex Engineering Systems Don’t Need Polishing

DeepSeek V4’s technical report presents an engineering system full of localized trade-offs: mixed optimizers, mixed precision, cache tricks, teacher scheduling, rollout services, kernel optimizations, communication strategies, mHC recomputation. This is the result of an engineering team making a large number of system-level decisions under real constraints, not of a lab deducing its way down from elegant theory.

V4 is not driven by one or two clean, beautiful original frameworks. Reading its technical report feels more like watching a team that reads widely and selects carefully, picking out many unglamorous but detail-effective approaches (OPD, Muon, Hyper-Connections, plus their own accumulated MoE, MLA, and infrastructure), recombining them around the goal of agentic workload, and making them work stably through solid execution and extensive experimentation. The ability to connect so many components from so many different sources: that itself is hard to replicate.

DeepSeek invested substantial resources in engineering exploration across optimizer adaptation, cache layout, teacher scheduling, and rollout infrastructure, and, in the end, wrote all of these trade-offs and compromises into the technical report. At a time when the large model field is increasingly trending away from disclosing engineering details, this degree of openness is itself rare.

DeepSeek V4’s distinctiveness lies right here: an open-weight model that handles million-token-level agentic workloads through systematic engineering integration. It may not be the strongest on every single dimension. But as a publicly available engineering specimen, it clearly demonstrates one thing: the competition at this generation of large models has advanced from whose theory is more elegant to who can get a complex engineering system to actually run.