DeepSeek DSpark: Speculative Decoding Comes Down to Hardware Scheduling

Published Jun 29, 2026

Why Large Models Are Slow, and How Speculative Decoding Accelerates Them

Anyone building products on top of large model APIs has almost certainly felt the pain of slow responses. The physical cause is well understood: large models generate one token at a time, and producing each token requires a full forward pass through the model. A reply that runs hundreds of tokens means hundreds of sequential forward passes, and latency simply accumulates with each step. In Agent workflows the situation is even worse — multi-turn tool calls force you to wait for the model to finish at every step, layering latency on top of latency. This doesn’t just degrade user experience; it directly blocks entire business models.

DeepSeek recently published a paper on DSpark. According to the paper, it reduces inference latency by 60% to 85% and can boost overall throughput by up to 5×. Those numbers sound impressive, but is this a genuine breakthrough or just another academic paper that looks great on benchmarks but shrinks under production workloads? To gauge its substance, we need to understand which acceleration approach it takes.

DSpark follows the speculative decoding route. The idea is straightforward: if large models are slow because they’re serial, what if we find a small, fast draft model that guesses the next several tokens ahead of time, then have the large model verify the entire batch at once? Verifying a sequence of tokens takes only a single forward pass through the large model. As long as the draft model guesses accurately, one forward pass replaces several rounds of serial generation, and latency drops. If guesses are wrong, we discard everything from the point of error onward, the main model fills in the correct token, and the next round begins.
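The draft-then-verify round described above can be sketched in a few lines. This is a minimal illustration for greedy decoding only, not DSpark's implementation: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the verification loop merely emulates what a real engine obtains from one batched forward pass of the large model.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    k: int,
) -> List[int]:
    """Run one draft-then-verify round and return the newly committed tokens."""
    # 1) The cheap draft model guesses k tokens ahead.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) The target model checks every position. A real engine gets all of
    #    these predictions from ONE batched forward pass; this loop emulates it.
    committed, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            committed.append(expected)
            return committed
        committed.append(tok)
        ctx.append(tok)

    # Every guess was accepted, so the same pass also yields one bonus token.
    committed.append(target_next(ctx))
    return committed


# Toy models (hypothetical): the target counts upward, the draft errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
print(speculative_step([1], draft, target, k=4))  # [2, 3, 4]
```

In the toy run, two guesses are accepted, the third is wrong, and the target's correction is committed in its place, so one verification pass still yields three tokens.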

We previously discussed DeepSeek V3’s multi-token prediction in From General Hospital to Smart Triage, using the analogy of a doctor diagnosing and dispensing medication in parallel. The attending physician spots common symptoms, orders lab tests while simultaneously asking the pharmacy to prepare medication in advance. If the test results confirm the diagnosis, the patient takes the medicine and leaves. If not, the pre-prepared medication is discarded. Speculative decoding is essentially the systems-engineering version of this metaphor.

Under High Concurrency, the Core Challenge Is How Many Tokens to Verify

For speculative decoding to actually work, the draft model must be accurate — that’s the prerequisite. But fixating solely on accuracy misses the other half of the cost equation: verification by the main model is not free. Every token sent for verification consumes GPU batch capacity during the forward pass. In single-user, single-request scenarios, many GPU compute units sit idle, so verifying a few extra tokens costs almost nothing. But in high-concurrency production environments, every unnecessary verification token steals batch slots from other requests, slowing everyone down. The real question, then, is not how accurate the draft model can be, but how to dynamically decide how many tokens to verify at each step based on real-time system load. This cost-benefit tradeoff shifts continuously, and a one-size-fits-all static rule simply cannot handle it.

This is the hard problem DSpark tackles. Its algorithmic modification to a semi-autoregressive model is only a means to an end; the true breakthrough is at the system level. DSpark turns verification length from a fixed choice (verify everything the draft model produces) into a global optimization that the scheduler adjusts dynamically based on real-time load. This shift also explains why DeepSeek could push it into production while most academic work remains on paper. Getting this scheduling to work requires modifying the draft model, the system scheduler, and the underlying inference kernels together. Only a team that owns the entire inference stack has both the incentive and the capability to pull off this kind of deep optimization.

This also signals that speculative decoding is transitioning from a nice-to-have acceleration add-on into a standard component of the large model inference stack. For developers who call APIs directly, DSpark is entirely transparent because DeepSeek already runs it server-side. But for engineering teams building their own inference stack, this scheduling logic is required reading. It fits DeepSeek’s consistent pattern: rather than betting on isolated theoretical innovations, they integrate proven peripheral techniques through heavy systems engineering to gain an absolute edge in cost and speed. We’ve discussed this in Reflections on Using DeepSeek R1 and Understanding DeepSeek V4’s Agentic Workload — their habit of filtering out the most effective surrounding approaches, then honing complex systems through intense engineering integration. DSpark is simply another confirmation of this roadmap.

The Prefix Matching Rule in Speculative Decoding

Recall the medical metaphor from From General Hospital to Smart Triage: an experienced doctor who, seeing a patient with classic symptoms, orders lab tests while also asking the pharmacy to prepare common medications ahead of time. If the test results match, the patient takes the medicine and leaves, saving queue time. If not, everything must be re-evaluated and the pre-prepared medication discarded.

In engineering terms, speculative decoding operationalizes this metaphor. A computationally cheap, fast draft model guesses several tokens ahead, and then the massive main model verifies the entire batch in a single forward pass. Correct tokens are accepted; when a mismatch occurs, all tokens from that position onward are discarded — even those that later happen to be correct.

The rigid constraint underlying this verification is prefix matching. The large model verifies tokens sequentially from left to right. Once a mismatch occurs at any position, all subsequent draft tokens are doomed. This is why the draft model’s accuracy on the very first token directly determines speculative decoding’s acceleration effect. If the first step is wrong, the main model must discard everything that follows — even if the rest would have been correct — and roll back to recompute properly. This means that even if the draft model’s overall token acceptance rate is decent, frequent errors on the first step will cause speculative decoding to repeatedly fail, possibly even making it slower than standard step-by-step decoding.
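A small calculation makes the weight of the first token concrete. The sketch below assumes each position has a fixed acceptance probability conditional on all earlier positions having been accepted; the function name and the example rates are illustrative, not figures from the paper.

```python
from itertools import accumulate
from operator import mul
from typing import Sequence


def expected_accepted_length(accept_probs: Sequence[float]) -> float:
    """Expected number of accepted draft tokens under prefix matching.

    accept_probs[k] is the chance that position k is accepted GIVEN that all
    earlier positions were accepted. Position k only counts if the whole
    prefix survives, so its contribution is the running product.
    """
    survival = accumulate(accept_probs, mul)
    return sum(survival)


# Same four rates, different order (hypothetical numbers).
strong_first = expected_accepted_length([0.9, 0.8, 0.8, 0.8])
strong_last = expected_accepted_length([0.8, 0.8, 0.8, 0.9])
print(strong_first > strong_last)  # True
```

Because the first position's rate multiplies into every later term, moving the strongest rate to the front raises the expected accepted length even though the set of rates is unchanged.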

The prefix matching mechanism in speculative decoding, with accuracy decay curves for parallel and autoregressive draft models at different positions.

Why Parallel Prediction Beats Autoregressive

On the choice of draft model architecture, Section 4.3.1 of the DSpark paper reports a counterintuitive empirical finding. Common sense suggests that an autoregressive draft model — which feeds each previously generated token as input to predict the next step — should be more accurate than a parallel draft model. The parallel model doesn’t condition on its own predictions; it produces all subsequent steps in one shot. An autoregressive draft model, when predicting the next token, at least references the context of the tokens it just generated.

But the empirical data shows that the parallel draft model achieves higher accuracy on the first token. For example, on math tasks, the parallel draft model’s first-token accuracy reaches 0.88, while the autoregressive model only manages 0.81.

The reason lies in fundamentally different latency constraints. The parallel draft model requires only a single forward pass, so its latency cost is fixed. This allows its network to be designed wider and deeper, with stronger representational capacity. The autoregressive model, by contrast, must run a full forward pass for every extra step it guesses. To prevent the draft prediction latency from canceling out the time saved by the main model, the autoregressive model’s parameter count and depth are forced to shrink drastically, capping its performance on the first token.
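A back-of-envelope model shows why the number of draft passes constrains model size. It assumes one verification costs a single large-model pass regardless of draft length (reasonable only under light load), and expresses draft cost as a fraction of that pass. The function name and all numbers are hypothetical.

```python
def round_speedup(tokens_per_round: float, draft_passes: int, cost_ratio: float) -> float:
    """Estimated speedup of one speculative round over plain decoding.

    tokens_per_round: tokens committed per round (accepted drafts plus one).
    draft_passes:     forward passes the draft model runs per round.
    cost_ratio:       cost of one draft pass relative to one target pass.
    """
    round_cost = 1.0 + draft_passes * cost_ratio  # in units of target passes
    return tokens_per_round / round_cost


# A draft model costing 20% of a target pass, committing 4 tokens per round:
parallel = round_speedup(4.0, draft_passes=1, cost_ratio=0.2)
autoregressive = round_speedup(4.0, draft_passes=7, cost_ratio=0.2)
print(parallel > autoregressive)  # True
```

At the same per-pass cost the autoregressive drafter pays for seven passes instead of one, so to stay competitive its per-pass cost, and with it its size, has to shrink.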

Since prefix matching makes first-token accuracy paramount, the parallel model’s advantage there directly compensates for its weaknesses on later tokens, winning on average acceptance length.

That said, the parallel model has a hard flaw of its own: the further out the prediction position, the more likely it is to produce conflicts. Because parallel prediction treats each subsequent position as an independent task, the positions don’t consider each other’s sampling results. If a sentence can continue with “of course” down one path or “no problem” down another, both continuations are grammatically valid. But the parallel draft model, with no communication between positions, may splice words from different paths together, producing “of problem” or “no course.” This multi-modal conflict causes accuracy to drop sharply after the second step, whereas the autoregressive model — able to see what has been sampled — becomes more stable on later predictions.
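The splicing failure is easy to reproduce with a toy distribution. The sketch below samples each position independently from its own marginal, as a purely parallel drafter would, and counts how often the result mixes the two valid continuations. The probabilities are invented for illustration.

```python
import random
from typing import Dict, List

# Per-position marginals for two equally likely continuations:
# "of course" and "no problem" (hypothetical probabilities).
MARGINALS: List[Dict[str, float]] = [
    {"of": 0.5, "no": 0.5},
    {"course": 0.5, "problem": 0.5},
]
VALID = {("of", "course"), ("no", "problem")}


def sample_independent(marginals: List[Dict[str, float]], rng: random.Random) -> tuple:
    """Sample every position on its own, ignoring what other positions drew."""
    return tuple(
        rng.choices(list(m.keys()), weights=list(m.values()))[0] for m in marginals
    )


rng = random.Random(0)
samples = [sample_independent(MARGINALS, rng) for _ in range(10_000)]
conflict_rate = sum(s not in VALID for s in samples) / len(samples)
print(f"spliced outputs: {conflict_rate:.1%}")
```

With two equally likely paths, roughly half of the independent samples come out as "of problem" or "no course", even though each position's marginal is exactly right.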

At the algorithmic level, DSpark’s problem is precisely how to retain the parallel model’s high first-token accuracy while preventing accuracy from decaying too quickly on later positions.

Trading a First-Order Markov Model for the Lossless Constraint

The parallel model’s biggest structural weakness is that each prediction position is isolated: the further out, the faster accuracy drops. To pull accuracy back up on later positions, DSpark must attach a lightweight module to the parallel backbone that lets later prediction positions see the actual token already sampled at earlier positions.

But this module faces a hard mathematical constraint: its output probabilities must be exact. Speculative decoding guarantees that the generation distribution remains entirely lossless solely through rejection sampling — a verification mechanism that compares the draft model’s and main model’s probabilities to decide whether to accept a token. Rejection sampling requires the draft model to produce an exact probability distribution p_d(x_k) for the token at each position. Without this exact single-step probability distribution, the subsequent rejection sampling verification cannot proceed. This rules out most common sequence modeling approaches.
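To see why the exact draft probability is indispensable, here is the standard speculative-sampling acceptance rule for a single token, written as a minimal NumPy sketch. The function name is illustrative; the rule itself (accept with probability min(1, p_target/p_draft), otherwise resample from the normalized positive residual) is the usual construction from the speculative decoding literature, not something specific to DSpark.

```python
import numpy as np


def verify_token(x: int, p_draft: np.ndarray, p_target: np.ndarray,
                 rng: np.random.Generator) -> tuple:
    """Accept or replace one draft token so the output follows p_target exactly.

    x was sampled from p_draft, so p_draft[x] > 0. Both arrays must be exact
    probability distributions over the vocabulary.
    """
    accept_prob = min(1.0, p_target[x] / p_draft[x])
    if rng.random() < accept_prob:
        return True, x
    # Rejected: resample from the part of p_target that p_draft under-covers.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(residual), p=residual))
```

Both branches read `p_draft` directly: the acceptance ratio divides by it and the residual subtracts it. A drafter that can only produce sequence-level scores has nothing valid to plug in here.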

In non-autoregressive generation, to introduce dependencies between positions, people commonly use CRF (Conditional Random Field, a probabilistic graphical model that introduces global constraints among neighboring tokens) or CTC (Connectionist Temporal Classification, an algorithm that marginalizes over all possible alignment paths for sequence alignment). But neither approach works in speculative decoding. CRF requires global normalization to compute the partition function across all paths. CTC requires marginalizing over the hidden variables of all alignment paths. These computations can yield sequence-level scores or marginal probabilities, but they cannot be reduced to the exact single-step softmax probability of an individual token given the current context.

DSpark ultimately chose to step back and adopt a first-order Markov relationship to break the deadlock. A first-order Markov model means the probability of the next token depends only on the immediately preceding token, ignoring more distant history. Mathematically, the probability of each token can be written as p(x_k | x_{k-1}) = softmax(base_logit_k + transition_bias(x_{k-1}, x_k)). Here, base_logit_k comes from the parallel backbone network, and transition_bias comes from a lookup operation — using the preceding token to index a V×V low-rank transition matrix.

Because the dependency is limited to first order, DSpark needs neither global normalization nor path hidden-variable marginalization. Each token’s probability remains a single efficient softmax computation. While this scheme’s modeling capacity is weaker than that of complex recurrent neural networks or CRFs, it lands exactly on the sweet spot: able to model dependencies while still producing exact probabilities. In real-world engineering, a weak model that satisfies all hard constraints is far more useful than a strong model that breaks probability consistency.

In engineering terms, this first-order Markov mechanism costs almost nothing. It consists of a single V×V transition matrix with a low-rank parameter dimension of r=256. At runtime, you simply look up the preceding token and add the result to the current logit. With negligible computation, the additional latency overhead DSpark incurs is only 0.2% to 1.3% — essentially a cheap correction patch on top of the parallel backbone.
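A minimal NumPy sketch of such a head follows. It assumes the low-rank V×V matrix is stored as two factors, `U` (V×r) and `W` (r×V), and that the first draft position is biased by the last committed token; both are assumptions for illustration, as are the class and function names. The paper's actual parameterization may differ.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


class MarkovHead:
    """First-order transition bias, factored as U @ W (assumed layout)."""

    def __init__(self, vocab_size: int, rank: int, rng: np.random.Generator):
        self.U = rng.normal(scale=0.1, size=(vocab_size, rank))
        self.W = rng.normal(scale=0.1, size=(rank, vocab_size))

    def probs(self, base_logits_k: np.ndarray, prev_token: int) -> np.ndarray:
        # Row lookup by the previous token, then one small matrix product.
        transition_bias = self.U[prev_token] @ self.W
        return softmax(base_logits_k + transition_bias)


def draft_tokens(base_logits: np.ndarray, first_prev: int, head: MarkovHead,
                 rng: np.random.Generator):
    """Sample left to right; each position sees only the token just sampled."""
    prev, tokens, dists = first_prev, [], []
    for logits_k in base_logits:  # base_logits comes from ONE parallel pass
        p = head.probs(logits_k, prev)
        tok = int(rng.choice(len(p), p=p))
        tokens.append(tok)
        dists.append(p)  # exact per-token distribution, kept for verification
        prev = tok
    return tokens, dists
```

The backbone still runs once; only the cheap lookup-and-add is sequential, and every position ends up with an exact softmax distribution that rejection sampling can consume.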

The paper shows that with this correction patch, DSpark improves acceptance length over Eagle3 on Qwen3-4B, 8B, and 14B models by 30.9%, 26.7%, and 30.0% respectively. Compared to DFlash, acceptance length is higher by 16.3%, 18.4%, and 18.3%.

System Layer: Turning Verification Length into a Global Optimization Problem

If the effort stopped at model architecture and algorithmic tweaks, DSpark would likely remain stuck outside real production environments, like most other speculative decoding schemes. Its real breakthrough lies not only in the algorithm itself but in the system-level confidence-based scheduling. This design transforms the verification length decision from a static one-size-fits-all rule into dynamic scheduling informed by hardware throughput.

In traditional speculative decoding, verification decisions are mostly static. For instance, if testing shows that a draft length of 7 works best, the draft model is told to guess 7 tokens every time. This static rule causes few issues under single-user conditions, but hits two major snags in high-concurrency production.

One snag is that different task types have divergent characteristics. For logically rigorous, structurally constrained tasks like code generation or math, the large model’s output is highly deterministic, and the draft model’s acceptance rate is typically high. But for open-ended conversation, where word choice varies wildly, the acceptance rate drops quickly. Using the same fixed length to verify both types of tasks wastes substantial compute.

The other snag is that system load fluctuates constantly. We discussed this in our earlier Prompt Caching analysis — whether you can keep running under high concurrency depends directly on the scheduling mechanism. See Prompt Caching as a First-Class Constraint. Under light load, GPU compute is underutilized, and verifying a few extra tokens — even ones that turn out to be wrong — carries negligible overhead. But under heavy load or request queuing, GPU resources are saturated. If the draft model blindly guesses more tokens and sends them for verification, it clogs the batch channel, leaves other requests waiting, and drags down system throughput.

DSpark changes the verification length from a fixed number to a dynamic decision based on real-time load. To make this work, they restructured the scheduling pipeline.

The first step is predicting the actual acceptance probability of draft tokens. They attach a lightweight confidence head to the tail of the draft model, specifically trained to predict the survival probability of each draft token — that is, the probability that this token and all preceding tokens will be accepted by the main model. Since the raw predicted values are often imprecise, they use a calibration algorithm called STS to apply temperature smoothing to the confidence output on an offline dataset, aligning the estimated probabilities with real acceptance rates.

The final step integrates hardware throughput into scheduling. The scheduler pre-measures a throughput curve SPS(B), which describes how many steps per second the inference cluster can execute at different batch sizes. When high-concurrency traffic arrives, the scheduler collects all draft tokens generated by the current batch’s requests and ranks them by calibrated survival probability. It selects the highest-probability tokens into the verification pool, continuing until adding more would increase the batch size past the inflection point on the SPS(B) throughput curve — the point where total steps per second begin to drop. The scheduler stops just before that threshold.
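The admission logic can be sketched as follows. The objective used here, expected committed tokens per second, computed as (requests + sum of admitted survival probabilities) × SPS(batch tokens), is an assumption chosen to make the trade-off concrete; the paper's exact criterion may differ. `schedule_verification` and the toy throughput curve are hypothetical.

```python
from typing import Callable, List


def schedule_verification(survival: List[List[float]],
                          sps: Callable[[int], float]) -> List[int]:
    """Choose how many draft tokens each request gets to verify.

    survival[r][k]: calibrated probability that draft token k of request r
                    AND its whole prefix are accepted (non-increasing in k).
    sps(b):         measured steps per second at b tokens in the batch.
    """
    n = len(survival)
    # Global ranking; ties go to the earlier position so each request's
    # admitted tokens always form a contiguous prefix.
    ranked = sorted(
        ((p, k, r) for r, probs in enumerate(survival) for k, p in enumerate(probs)),
        key=lambda c: (-c[0], c[1]),
    )
    # With no drafts, every request still commits one token per step.
    expected, best_m, best_rate = float(n), 0, n * sps(n)
    for m, (p, _, _) in enumerate(ranked, start=1):
        expected += p
        rate = expected * sps(n + m)
        if rate > best_rate:
            best_m, best_rate = m, rate
    lengths = [0] * n
    for _, _, r in ranked[:best_m]:
        lengths[r] += 1
    return lengths


# Toy step-shaped throughput curve (hypothetical numbers).
def toy_sps(batch_tokens: int) -> float:
    return 100.0 if batch_tokens <= 4 else 60.0 if batch_tokens <= 8 else 30.0


print(schedule_verification([[0.95, 0.9, 0.85], [0.4, 0.1, 0.02]], toy_sps))  # [2, 0]
```

Scanning every cut-off instead of stopping at the first drop keeps the search correct on step-shaped curves. In the toy run the confident request receives both slots that fit before the throughput cliff, and the uncertain request verifies no draft tokens this step.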

Through this approach, DSpark successfully reframes “how many tokens to verify” into a system-level global throughput optimization problem, dynamically adjusted based on real-time concurrency, request type, and GPU hardware characteristics.

The hardware-aware scheduler: from draft confidence estimation, STS calibration, and global ranking, to dynamic admission control based on the engine’s throughput curve.

Two Engineering Challenges in Production Deployment

Getting this confidence-based scheduling into a real inference engine requires solving two engineering problems.

One is the hardware substrate. When running batched inference on real GPUs, the throughput curve is not continuous and smooth. Due to physical constraints from Tensor Core alignment and hardware caches, the throughput curve SPS(B) exhibits stepwise discontinuities. This means the optimum cannot be solved analytically with a formula. Even more troublesome: to perform global confidence ranking for the current inference step, you need the confidence scores of all requests in the current step. This creates a causal dependency between model computation and scheduling logic, introducing non-negligible latency and interfering with CUDA graph capture and replay. Furthermore, if handled improperly, such real-time decisions could leak information about future tokens, violating the lossless verification requirement.

DeepSeek adopts an asynchronous design. When deciding how many tokens the current batch can verify, the scheduler does not use the confidence scores being computed right now, but instead uses confidence information lagged by two steps to predict the batch space the system might face. When choosing which specific tokens to admit into the pool, it still ranks by current real confidence scores. The step offset between these two operations provides a safety buffer — avoiding information leakage while allowing the GPU to replay CUDA graphs normally.
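The split between a lagged budget and a current ranking can be sketched with a short history buffer. This is only an illustration of the decoupling described above, not DeepSeek's scheduler: `LaggedScheduler`, `plan_budget`, and the warm-up behavior are hypothetical.

```python
from collections import deque
from typing import Callable, List, Sequence


class LaggedScheduler:
    """Size the batch from stale confidences, fill it using current ones."""

    def __init__(self, plan_budget: Callable[[Sequence[float]], int],
                 lag: int = 2, warmup_budget: int = 0):
        self.plan_budget = plan_budget
        self.warmup_budget = warmup_budget
        self.history = deque(maxlen=lag)  # confidences of the last `lag` steps

    def step(self, current_conf: Sequence[float]) -> List[int]:
        # HOW MANY: decided from the snapshot taken `lag` steps ago, so the
        # batch shape is known before the current scores exist.
        if len(self.history) == self.history.maxlen:
            budget = self.plan_budget(self.history[0])
        else:
            budget = self.warmup_budget
        self.history.append(list(current_conf))
        # WHICH: ranked by the scores computed in the current step.
        order = sorted(range(len(current_conf)), key=lambda i: -current_conf[i])
        return sorted(order[:budget])


sched = LaggedScheduler(plan_budget=lambda conf: sum(c > 0.5 for c in conf))
for conf in ([0.9, 0.8, 0.1], [0.2, 0.3, 0.1], [0.1, 0.7, 0.6], [0.9, 0.9, 0.9]):
    print(sched.step(conf))
```

In the toy run the third step admits two tokens because two scores were high two steps earlier, while the fourth step admits none despite its high current scores. That is the price of the lag, and it is why a sudden change in traffic is the stress case for this design.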

The other challenge is the performance penalty that variable-length verification imposes on low-level operators. Because the number of draft tokens verified per request changes continuously, the token lengths sent for verification vary across requests — they form variable-length queries. Using standard static decode kernels would require padding with zeros to align sequences. This wastes massive compute on zero-matrix multiplications, visibly lowering GPU utilization.

To solve this, DeepSeek rewrote two kernels at the bottom of the inference stack: the index-attention operator and the compress operator. Instead of padding inputs, they flatten all variable-length tokens across the batch into a one-dimensional flat tensor. Positional dependencies and causality within each sequence are conveyed to the underlying self-attention computation through a set of position-marker tensors. This also means DSpark is not a plug-in you can download as an open-source model and slot into a generic vLLM. It is a system deeply co-designed between algorithms and inference engine kernels, specifically tuned for particular hardware.
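The flattened layout can be illustrated in NumPy. The marker tensors below (cumulative offsets, per-token positions, per-token sequence ids) follow the general pattern used by variable-length attention kernels; they are an assumption about the shape of the idea, not the actual tensors in DeepSeek's kernels, and the function names are hypothetical.

```python
import numpy as np
from typing import List, Sequence


def flatten_queries(queries: Sequence[Sequence[int]]):
    """Pack variable-length verification queries into one flat tensor.

    Each request is assumed to contribute at least one token.
    """
    lengths = np.array([len(q) for q in queries])
    flat = np.concatenate([np.asarray(q) for q in queries])
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])  # request r owns [cu[r], cu[r+1])
    positions = np.concatenate([np.arange(n) for n in lengths])  # offset inside its request
    seq_ids = np.repeat(np.arange(len(queries)), lengths)  # owning request per token
    return flat, cu_seqlens, positions, seq_ids


def padding_waste(lengths: List[int]) -> float:
    """Fraction of slots that would be padding in a padded layout."""
    return 1.0 - sum(lengths) / (len(lengths) * max(lengths))


flat, cu, pos, sid = flatten_queries([[11, 12, 13, 14, 15], [21], [31, 32]])
print(cu.tolist())   # [0, 5, 6, 8]
print(pos.tolist())  # [0, 1, 2, 3, 4, 0, 0, 1]
print(f"{padding_waste([5, 1, 2]):.0%} of a padded batch would be zeros")
```

A kernel that receives the offsets and positions can apply the causal mask per request without ever materializing the padded rectangle. In this toy batch, nearly half of that rectangle would have been padding.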

DSpark performance overview: left panel shows per-user speedup at matched throughput (V4-Flash 60%-85%, V4-Pro 57%-78%), right panel shows accepted length improvement (vs Eagle3 and vs DFlash), bottom callout notes Markov head latency overhead of only 0.2%-1.3%.

Limitations and What DSpark Means for AI Builders

DSpark has three limitations worth noting.

The first is overhead in low-acceptance-rate scenarios. Although DSpark uses STS calibration and confidence scheduling to control wasted verification, in naturally divergent, low-acceptance workloads — such as open-ended conversation — the forward pass overhead of the draft model cannot be eliminated. Under these tasks, speculative decoding not only fails to accelerate but may slightly increase overall token cost.

Another limitation is STS calibration under distribution shift. Confidence scheduling works only if the estimated confidence aligns with the actual acceptance rate. But calibration depends on an offline validation set. When online traffic encounters unfamiliar data formats or domain-specific syntax, confidence estimates are likely to drift. If the scheduler consequently admits tokens it shouldn’t, or misses tokens it should, performance degrades. The paper does not yet provide robustness data under distribution shift.

Additionally, the asynchronous scheduling relies on a strong assumption: that online traffic and request states do not change drastically within the very short window of two generation steps. Whether this assumption holds under sudden spikes of high-concurrency traffic, and whether a violation could block the scheduling queue, still needs testing in real production environments.

As for what DSpark means for the average AI builder, we need to consider different patterns of compute resource access.

If you’re an engineer building applications on top of large model APIs, DSpark is essentially transparent. You don’t need to change your code or understand low-level operators; DeepSeek has already deployed this scheduling system server-side on its API. The tangible benefits are lower API latency and headroom for future price reductions. In particular, for Agent scenarios requiring fast responses and multi-step interactions, reduced latency noticeably improves the interaction experience.

But if you’re an engineering team building your own LLM inference stack, DSpark delivers a very clear message: the bulk of speculative decoding’s gains lie at the system scheduling level, not at the draft model’s algorithmic level. If you build an in-house inference stack by simply mounting open-source Eagle3 on your engine, without writing a hardware-aware dynamic scheduler tuned to the specific throughput curve of your GPU cluster, you’ll find that under high-concurrency production loads, whatever benefits speculative decoding provides will quickly be consumed by GPU queuing delays. To unlock its potential, you have to go all the way down, like DeepSeek, and rebuild scheduling and operators tailored to your own hardware.