
How LLM Inference Works: Following the SGLang Omni Team's Design Thinking

Published May 30, 2026

SGLang recently published a technical article about SGLang Omni (Zhihu, GitHub). It is a rare thing — it lays out an elite inference system team’s complete decision-making chain in the open: from problem definition, to computational characteristic analysis, to system design. The prerequisites for every choice, the rejected alternatives, and the rationale are all written down.

Most system teams’ public output consists of architecture docs and benchmark numbers. The decision process rarely gets shared. The author, Chenyang Zhao, is the RL Lead of the SGLang community. He moved from the RL group to the Omni group in late February this year. He openly admits he does not come from a systems background — he was tormented by dynamic programming and intro to computer systems in college to the point of nearly giving up on CS. It is precisely this background that makes his writing friendly to readers without systems experience: he is learning too, so every judgment is carefully explained.

This article reads like an internal team design doc made public. If you are curious about inference systems, it is an excellent entry point — not because of advanced techniques, but because it shows how a great team analyzes problems and makes design choices.

Below is the content, organized by reader interest into three layers. If you are simply curious about how LLM inference runs at a basic level, Part 1 is enough — it explains autoregressive decoding, the difference between prefill and decode, what compute-bound and memory-bound mean, and what the scheduler and KV cache do. These concepts are the foundation for understanding any inference system. If you are interested in the engineering challenges of multi-modal models (especially speech input/output), Part 2 explains why speech cannot be fed directly to an LLM the way text can, how codec tokens are compressed, and why the Thinker-Talker architecture is inherently two decode loops. If you care more about low-level infrastructure design, Parts 3 and 4 are the focus — starting from the three computational challenges that multi-stage brings (heterogeneous scheduling, divergent dependencies, memory contention), and seeing how the SGLang Omni team made their architectural choices step by step.

First, Understand One Thing: How Standard LLM Inference Runs

The problem SGLang Omni aims to solve is fundamentally a redesign of standard LLM inference systems. So to understand what it does, you first need to know what the original approach looks like.

Models Generate Text One Token at a Time

When a large language model receives user input, it does not output the entire response at once. It generates one token at a time — as each token (a word, a punctuation mark, or part of a character) is generated, it is appended back to the input as context for the next step, and the next token is generated. This continues until a special “end” token is produced. This process is called autoregressive decoding.

In engineering terms, this process is split into two phases.

The first phase is called prefill. All tokens in the user’s prompt are fed to the model at once. The model performs attention layer by layer — each token “looks at” every token before it, computing how they relate semantically — and stores these computation results as a KV cache (Key-Value cache). The KV cache is an intermediate result cache: all the attention information computed during prefill is saved, so that during the subsequent decode phase, each step does not need to recompute the entire history — it only needs to query this cache with the newly generated token. Prefill is a one-shot computation: the matrix multiplication is as large as the input length, and the GPU’s compute units are mostly saturated — this phase is compute-bound, bottlenecked by GPU compute power.

The second phase is called decode. After prefill completes, the model already has the prompt’s KV cache and begins generating tokens one at a time. With each new token generated, it performs an attention computation against the stored KV cache to produce the next token. The key point: each step only processes one new token, so the matrix multiplication is very small, and most of the GPU’s compute power sits idle. The real bottleneck is not computation — it is reading. That KV cache grows longer and longer, and each step must read it from various locations in VRAM into the compute units. This phase is memory-bound, bottlenecked by VRAM bandwidth.
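The two phases can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: `toy_next_token` is a made-up function, and the "cache" is just a list of token ids rather than per-layer key/value tensors. What the sketch keeps from the description above is the shape of the work: prefill touches the whole prompt once, and decode adds exactly one entry per step while reading everything stored so far.

```python
from typing import List

EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(kv_cache: List[int]) -> int:
    """Stand-in for one forward pass: reads the whole cache, returns one token."""
    return (sum(kv_cache) * 31 + len(kv_cache)) % 7


def prefill(prompt: List[int]) -> List[int]:
    """Process all prompt tokens in one shot and return the populated cache."""
    return list(prompt)  # a real cache stores keys and values for every layer


def decode(kv_cache: List[int], max_new_tokens: int) -> List[int]:
    """Generate one token per step; each step reads the cache and appends to it."""
    generated: List[int] = []
    for _ in range(max_new_tokens):
        tok = toy_next_token(kv_cache)
        generated.append(tok)
        kv_cache.append(tok)  # the cache grows by one entry per decode step
        if tok == EOS:
            break
    return generated


cache = prefill([3, 1, 4, 1, 5])
generated = decode(cache, max_new_tokens=16)
```

Note that the amount of data read per decode step grows with the cache, while the amount of new computation stays at one token. That asymmetry is what makes decode memory-bound.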

Why does this distinction matter? Because the optimization directions are completely different. For the compute-bound prefill phase, you want to increase batch size and parallelize prompt chunks — keep the GPU busy. For the memory-bound decode phase, you want to store the KV cache more compactly and read it faster — compress memory access patterns, not computation volume.
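A rough way to see the compute-bound versus memory-bound split is arithmetic intensity: floating-point operations per byte moved. The sketch below looks at a single d×d weight matrix in 16-bit precision and ignores attention's KV cache reads entirely. The hardware figures (300 TFLOP/s, 2 TB/s) are hypothetical round numbers, not the specs of any particular GPU.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for multiplying n_tokens activations by one d x d weight matrix."""
    flops = 2 * n_tokens * d_model * d_model             # one multiply-add = 2 FLOPs
    weight_bytes = d_model * d_model * bytes_per_value   # weights are read once
    act_bytes = 2 * n_tokens * d_model * bytes_per_value  # activations in and out
    return flops / (weight_bytes + act_bytes)


PEAK_FLOPS = 300e12   # hypothetical GPU: 300 TFLOP/s
PEAK_BANDWIDTH = 2e12  # hypothetical GPU: 2 TB/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte needed to saturate compute


def regime(n_tokens: int, d_model: int = 4096) -> str:
    """Classify a workload against the ridge point of the hypothetical GPU."""
    if arithmetic_intensity(n_tokens, d_model) > RIDGE:
        return "compute-bound"
    return "memory-bound"


prefill_regime = regime(2048)  # a 2048-token prompt processed at once
decode_regime = regime(1)      # one new token per step
```

Under these assumptions a 2048-token prefill lands far above the ridge point, while a single-token decode step performs about one FLOP per byte and sits far below it.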

One GPU Must Serve Many Users Simultaneously

An inference service does not handle one request at a time. Dozens or hundreds of users send requests simultaneously — how does the GPU allocate compute?

The solution is called continuous batching. Traditional batching waits for one batch of requests to fully complete before starting the next, but different requests have different output lengths: some finish in three sentences while others write three thousand words. Those that finish early must wait. Continuous batching instead re-forms the batch at every decode step: whenever a new request arrives it joins that round’s forward pass, and whenever a request produces its end token it exits. This keeps the number of tokens the GPU processes per forward pass high, improving compute utilization.
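The difference can be made concrete with a small simulation that counts decode steps. This is a sketch under simplifying assumptions: every request is already waiting at time zero, each request is described only by its output length, and `batch_size` is a made-up cap on concurrent requests.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the waiting queue at every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]       # one decode step for everyone
        running = [r for r in running if r > 0]  # finished requests exit
        steps += 1
    return steps


lengths = [3, 50, 4, 5, 40, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these six requests and two slots, static batching needs 95 steps while continuous batching needs 52, which is also the lower bound for 104 total tokens on two slots.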

But KV cache management becomes more complex. Each request has its own KV cache, and this cache grows continuously as decoding progresses. If you pre-allocate a large contiguous chunk of VRAM for each request based on maximum possible length, most of the space sits empty but reserved. The solution is called paged KV cache — VRAM is divided into fixed-size pages (like virtual memory pages in an operating system). Whenever a request needs new space, a new page is allocated; when a request finishes, its pages are reclaimed. Who decides which request gets how many pages, which pages stay in VRAM, and which get swapped out? That is the Scheduler’s job.
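A minimal sketch of the paging idea follows. The class name `PagedKVAllocator` and its methods are hypothetical and do not mirror SGLang's actual allocator. Pages are just integers here, whereas a real system maps them to regions of a preallocated tensor.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy page allocator: hands out fixed-size pages on demand."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.page_table: Dict[str, List[int]] = {}  # request id -> its pages
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, req_id: str) -> None:
        """Reserve room for one more token, taking a new page only when needed."""
        n = self.num_tokens.get(req_id, 0)
        if n % self.page_size == 0:  # first token, or all current pages are full
            if not self.free_pages:
                raise MemoryError("no free KV pages: the scheduler must queue or evict")
            self.page_table.setdefault(req_id, []).append(self.free_pages.pop())
        self.num_tokens[req_id] = n + 1

    def release(self, req_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_table.pop(req_id, []))
        self.num_tokens.pop(req_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need two 16-token pages
alloc.append_token("b")      # 1 token needs one page
free_before = len(alloc.free_pages)
alloc.release("a")
free_after = len(alloc.free_pages)
```

The waste per request is bounded by one partially filled page, instead of the gap between actual and maximum sequence length.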

What SGLang Main Already Has

SGLang’s main repository (SGLang main) has accumulated a series of battle-tested optimization techniques for LLM inference. Prefill/decode disaggregation puts prefill and decode on different GPUs, each scheduling independently without blocking each other. Chunked prefill splits a very long prompt’s prefill into small chunks interleaved with decode steps — preventing one extremely long input from stalling everyone else’s decode. RadixAttention identifies shared prompt prefixes across different requests (e.g., the same system prompt), sharing KV cache to save both VRAM and computation. CUDA Graph records the repeated kernel call sequence in the decode loop as a graph for later replay, eliminating the overhead of launching each kernel individually.
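Of the four techniques, prefix sharing is the easiest to illustrate in a few lines. The sketch below uses a plain per-token trie, which is an assumption made for brevity: it is not SGLang's radix tree implementation, and it stores no KV tensors. It only reports how many leading tokens of a new request were already seen.

```python
from typing import Dict, List


class PrefixCache:
    """Toy prefix index: a trie keyed by token id."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def match_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for t in tokens:
            if t in node:
                matched += 1   # this token's KV could be reused
            else:
                node[t] = {}   # first miss: everything after it is new too
            node = node[t]
        return matched


prefix_cache = PrefixCache()
system_prompt = [1, 2, 3]
first = prefix_cache.match_and_insert(system_prompt + [4, 5])
second = prefix_cache.match_and_insert(system_prompt + [9])
third = prefix_cache.match_and_insert(system_prompt + [9, 9])
```

The first request matches nothing, the second reuses the three system-prompt tokens, and the third reuses four. Those reused tokens are prefill work and VRAM that do not have to be spent again.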

All of these optimizations rest on a shared assumption: every request’s computation process is homogeneous — prefill once, then decode token by token, nothing more. This assumption held during the LLM era, because LLMs only process text. But with Omni models, things changed.

Speech Output Complicates Everything

Speech Is Not Text — It Is a Waveform

Text is the most economical information encoding humanity has invented. A short utterance, spoken at 3–5 characters per second, might come to just 3–10 discrete tokens. Each token comes from a fixed vocabulary: token number N corresponds to a specific character, with clear boundaries.

Speech is a different beast. It is a continuous waveform, sampled at 16,000 to 48,000 points per second, where each sample is a float representing the amplitude of the sound wave at that instant. One second of speech is at least 16,000 numbers. If you fed raw waveforms directly into a Transformer, a 10-second utterance at 48,000 samples per second would take 480,000 scalar steps, far exceeding most LLMs’ context windows.

So the first step in speech processing is always compression.

How Speech Gets Compressed into Tokens

Compression happens in two steps.

Step one: convert the high-frequency waveform into a low-frequency frame sequence. Raw waveform at tens of thousands of samples per second goes through an encoder and becomes roughly a dozen frames per second (e.g., 12.5 frames/s), each frame being a 128-dimensional continuous vector. This compresses the time axis by over a thousand times, but the output is still continuous values — you cannot look up “number N” in a fixed vocabulary.

Text tokens can travel the standard LLM pipeline of “finite vocabulary + cross-entropy + sampling” because each token is discrete, corresponding to exactly one entry in a vocabulary. So a second step is needed — discretizing the continuous vectors.

This discretization process is called residual vector quantization. In simple terms: match each continuous frame vector against vectors in a discrete codebook, finding the closest index. Matching once per frame is not precise enough, so you match multiple layers — the first layer finds an approximation, the residual error goes to the second layer, the second layer’s error goes to the third, and so on. For example, each frame quantized into 8 layers, each layer one index. So 1 second of speech = 12.5 frames × 8 layers = 100 discrete tokens. These tokens are called codec tokens — “codec” being short for “coder-decoder.”
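Residual vector quantization is short enough to write out in full. The sketch below uses two layers and 2-dimensional frames so the numbers can be checked by hand. The codebooks are made up for illustration; real codecs learn them, and use far more entries and dimensions.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codebook entry with the smallest squared distance to v."""
    def dist2(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def rvq_encode(frame: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize layer by layer; each layer encodes the previous layer's residual."""
    residual = list(frame)
    indices: List[int] = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices


def rvq_decode(indices: List[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the frame by summing the chosen entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],          # coarse layer
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],      # fine layer
]
codes = rvq_encode([1.2, 0.3], codebooks)
approx = rvq_decode(codes, codebooks)
```

The frame `[1.2, 0.3]` becomes the two indices `[1, 3]`, which decode to `[1.25, 0.25]`. Each added layer shrinks the reconstruction error, at the cost of one more token per frame.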

The compression ratio: 1 second of speech goes from 48,000 sample points to 100 discrete tokens. Nearly 500× compression.
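The figures quoted in this section follow from three numbers, so it is worth checking the arithmetic once. The constants below are the example values used in the text (48 kHz audio, 12.5 frames/s, 8 codebook layers), not the parameters of any specific codec.

```python
SAMPLE_RATE_HZ = 48_000     # raw samples per second
FRAME_RATE_HZ = 12.5        # encoder frames per second
NUM_CODEBOOK_LAYERS = 8     # RVQ layers, one token each per frame

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ        # time-axis compression
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOK_LAYERS   # codec tokens per second
compression_ratio = SAMPLE_RATE_HZ / tokens_per_second    # samples per codec token
raw_samples_10s = SAMPLE_RATE_HZ * 10                     # a 10-second utterance
codec_tokens_10s = tokens_per_second * 10
```

Ten seconds of audio therefore shrinks from 480,000 raw samples to 1,000 codec tokens, which fits comfortably in an ordinary context window.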

This “waveform → continuous frames → discrete tokens” process is called audio encoding. The reverse — converting codec tokens back to waveform — has a corresponding process and is called vocoding. These two steps are relatively stable across different models; the main difference is codec selection (some codecs use 8 layers, others 4; some produce more tokens per frame, others fewer).

From Understanding to Speaking — A Four-Stage Pipeline

With compression and discretization as the foundation, an Omni model that “understands speech and responds in speech” naturally breaks into four stages.

Stage 1, Audio Encoding: compress the user’s speech waveform into discrete codec tokens. After this step, speech information becomes a token sequence that can be fed to a Transformer — mixed with text tokens, the model can process everything together.

Stage 2, Understanding (Thinker): an LLM or multimodal LLM reads these tokens (possibly alongside images and video), understands the user’s intent, and generates a text response. This is no different in essence from a standard chat model doing prefill + decode — the input sequence now contains audio tokens, but the computation pattern is unchanged.

Stage 3, Speech Synthesis (Talker): convert the text response and acoustic features (prosody, timbre, emotion — information that cannot be encoded in text alone) into output speech codec tokens. The Thinker only generated the words — “nice weather today” — but the pitch, speed, and emotional color of how this sentence is spoken must be determined by the Talker, using hidden states extracted by the Thinker while processing the audio input.

Stage 4, Audio Decoding (Vocoder): decode the codec tokens back into a playable waveform.

Encoding and Vocoder are relatively stable across different models. What truly causes architectural divergence across Omni models is the middle two stages — how Thinker and Talker are coupled.

Back to the Decision: Why Thinker and Talker Are Two Decode Loops

Now we can unpack the key conceptual leap.

In the four-stage pipeline above, both Thinker and Talker are “generating tokens,” but that is only a surface similarity — the tokens they generate are entirely different kinds, and their computation patterns are entirely different.

Thinker generates text tokens. Text tokens are sparse and low-frequency — roughly 3–5 tokens per second, a complete short sentence perhaps 20–50 tokens. Each token selects one index from a vocabulary of tens of thousands. The computation pattern is identical to standard LLMs: compute-bound during prefill, memory-bound during decode, bottlenecked by KV cache reads and writes.

Talker generates codec tokens. Codec tokens are dense and high-frequency — 12.5 frames × 8 layers = 100 tokens per second. You can think of them as audio “pixels”: each codec token does not directly correspond to a “word” or “character” concept, but rather to the acoustic state of the sound at a particular instant. Saying “nice weather today” requires only 6–8 tokens on the text side (Thinker), but potentially 200–300 tokens on the speech side (Talker).

And Talker does not work independently. After Talker generates the 0-th codec token for each timestep, a module called MTP (multi-token prediction) immediately steps in: conditioned on this 0-th token, it completes the remaining codec tokens for that timestep in parallel (layers 1 through 7), then writes the completed frame back to Talker as input for the next step. Talker’s next step strictly depends on this write-back, so this is a tight per-step feedback loop.
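The shape of this feedback loop can be sketched with two toy functions. Both `talker_step` and `mtp_complete` are hypothetical stand-ins that produce arbitrary integers. The point is only the data dependency: step t+1 cannot start until the full frame from step t has been written back.

```python
from typing import List

NUM_LAYERS = 8  # codec layers per frame, as in the example above


def talker_step(prev_frame: List[int], step: int) -> int:
    """Hypothetical Talker: emits the layer-0 codec token for this timestep."""
    return (sum(prev_frame) + step) % 1024


def mtp_complete(layer0: int) -> List[int]:
    """Hypothetical MTP: fills in layers 1..7 conditioned on the layer-0 token."""
    return [(layer0 * (k + 1) + k) % 1024 for k in range(1, NUM_LAYERS)]


def generate_frames(num_steps: int) -> List[List[int]]:
    """Run the per-step loop: Talker, then MTP, then write-back."""
    frames: List[List[int]] = []
    prev_frame = [0] * NUM_LAYERS
    for step in range(num_steps):
        layer0 = talker_step(prev_frame, step)
        frame = [layer0] + mtp_complete(layer0)
        frames.append(frame)
        prev_frame = frame  # write-back: the next Talker step depends on this
    return frames


frames = generate_frames(25)  # 25 frames = 2 seconds at 12.5 frames/s
total_codec_tokens = sum(len(f) for f in frames)
```

Two seconds of speech already costs 200 codec tokens and 25 strictly sequential Talker/MTP round trips.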

So while Thinker and Talker are conceptually two links in the same pipeline, computationally they are two different models, running two independent decode loops, each with its own weights and KV cache, each with its own generation cadence. Shoving them into the same scheduling loop is not going to work — this problem drives all the design that follows.

The Essence of This Problem: Choosing a Classification Axis

Once you understand the computational differences between Thinker and Talker, the SGLang Omni team’s first core decision becomes natural.

The definition of “Omni model” was never unified to begin with. Some teams consider any VLM that supports speech input to be Omni; others require simultaneous speech output; still others have achieved fully omni-modal output (text + speech + images). Classifying by input/output modality leaves blurry boundaries.

The original team’s judgment: classify not by modality, but by whether decoding is multi-stage. The quality of this judgment is a concrete example of what we mean by “how an elite team makes decisions” — a good classification axis does not follow surface features but finds the dimension that reveals computational essence.

Slice by this axis, and models naturally fall on two sides. One side is single-stage decode — MiMo Omni, Nemotron Omni, etc., where the decode process is prefill once then decode token by token, just like a standard LLM. SGLang main has already pushed these to the limit.

The other side is multi-stage decode — Qwen3-Omni, FishAudio S2 Pro (pure TTS with Dual-AR), Ming Omni (fully omni-modal output), and others. Their commonality: decoding is split into multiple heterogeneous stages, each with distinct computational characteristics. These are SGLang Omni’s targets.

Where is the engineering taste in this classification axis? First, it makes system boundaries predictable — the number of stages directly determines the computational topology, and topology is the direct input to system design. Second, it avoids redundant work — single-stage goes to SGLang main, SGLang Omni only handles multi-stage. Third, it has conceptual simplicity — one judgment cleanly partitions the entire model landscape. Good engineering classifications tend to be dichotomies, with only one dimension varying at the boundary.

Multi-Stage Brings Three System-Level Problems

Standard LLM inference is one Scheduler, one KV cache pool, one homogeneous decode loop — all requests share the same computation pattern. Multi-stage decode is two or more heterogeneous decode loops running in parallel — each with its own weights, its own KV cache, and real-time data dependencies between them.

Problem 1: Each Stage Has a Completely Different Compute Bottleneck

To understand this problem, let us revisit the compute-bound vs. memory-bound distinction. Here is an analogy to help remember.

Think of a GPU as an assembly line. A worker at the station has two tasks: operating machines for computation (matrix multiplication, attention), and fetching raw materials from the warehouse (reading KV cache). Prefill-phase work is “heavy compute, light fetching” — the worker spends most of the time operating machines, the machines run at full capacity: this is compute-bound. Decode-phase work is “light compute, heavy fetching” — the worker spends most of the time running to the warehouse, fetching, returning, the machines frequently idle: this is memory-bound.

A good scheduler is like a foreman, allocating machine time based on the type of work — heavy-compute jobs (prefill) are batched to saturate the machines, fetch-heavy jobs (decode) have warehouse layout optimized to reduce transit time. Under standard LLMs, this management approach works well because all jobs are either “heavy compute” or “fetch heavy” — two types, clear patterns.

Now let us look at what type of work each stage is under the multi-stage scenario.

Thinker needs no elaboration — exactly the same as before: prefill is “heavy compute” (compute-bound), decode is “fetch heavy” (memory-bound). All of SGLang main’s existing optimizations apply directly.

Talker + MTP is a type never seen in standard LLMs. Talker does autoregressive generation too, but each step does not read a long-sequence KV cache — its input is only the Thinker’s embedding for the current step (“what text should be generated at this moment”) and MTP’s feedback from the previous step (“the speech features just produced”). Attention is extremely light, trips to the warehouse are few and the amount fetched is small — it is not memory-bound. What MTP does resembles a small prefill, but each call only processes a few codec tokens, the computation volume is too small — the machines are far from saturated, it is not compute-bound.

So where is the bottleneck? The original article gives a very precise characterization: “kernel launch overhead and synchronization overhead become the dominant factors.”

This phenomenon barely exists in LLM inference. Why? Because the per-step matrix multiplications in LLM decode are large enough. For example, when a 7B model runs one decode step, the GEMM computation time might be several hundred microseconds, while launching a GPU kernel costs roughly ten to a few tens of microseconds, a small fraction. But Talker’s per-step computation is inherently tiny (the entire forward pass might take only tens of microseconds), at which point kernel launch itself becomes the dominant cost. It is like a worker making a small part that takes only 5 seconds, while turning the machine on, turning it off, and checking the process adds another 3 seconds every time: the overhead of the procedure rivals the work itself.
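A back-of-the-envelope model shows why the same launch cost is negligible in one case and dominant in the other. The sketch assumes launch time and compute time simply add up, which is pessimistic: real GPUs hide part of the launch cost behind long-running kernels. All microsecond figures are hypothetical, chosen to match the orders of magnitude quoted above.

```python
def launch_overhead_share(num_launches: int, launch_us: float, compute_us: float) -> float:
    """Fraction of step time spent launching kernels, under a simple additive model."""
    launch_total = num_launches * launch_us
    return launch_total / (launch_total + compute_us)


# One large GEMM in an LLM decode step: launch cost is a rounding error.
llm_kernel = launch_overhead_share(num_launches=1, launch_us=10.0, compute_us=400.0)

# A tiny Talker-style step split into 20 small kernels, launched one by one.
talker_eager = launch_overhead_share(num_launches=20, launch_us=10.0, compute_us=40.0)

# The same step replayed as one captured graph, idealized as a single launch.
talker_graph = launch_overhead_share(num_launches=1, launch_us=10.0, compute_us=40.0)
```

Under these assumptions launch overhead is about 2% of the large kernel's time, over 80% of the eagerly launched small step, and 20% once the step is replayed as a single graph.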

This insight is not isolated. Behind GLM-5.1’s high-speed API (400 tokens/s), Zhipu’s TileRT inference engine ran into the exact same bottleneck. TileRT’s scenario is real-time interactive LLM decode at batch size = 1: with only one token generated at a time, per-step computation drops sharply and kernel launch overhead’s share surges in the same way. On the surface, the two teams are doing completely different things: SGLang Omni in the multimodal speech scenario, TileRT in the text LLM scenario. But strip it down to the root, and the conclusion is the same: when per-step computation becomes light enough, the bottleneck shifts from “not enough compute” or “not enough bandwidth” to “idling between kernels.” The two teams independently arrived at this recognition, and their chosen solutions are remarkably similar: TileRT compiles the entire model into one continuous pipeline at compile time, removing the hand-offs between workstations; SGLang Omni encloses Talker and MTP within a single forward call, using piecewise CUDA Graph to smooth out the gaps.

There is also the Vocoder. It is not a Transformer, it has no KV cache. It is a ConvNet, “heavy compute” type (compute-bound) — but fortunately, its bottleneck resource (compute) is different from LLM decode’s bottleneck resource (bandwidth). The two can run in parallel on the same GPU via CUDA MPS without interfering with each other.

These three types (compute-bound, memory-bound, kernel-launch-bound) will hurt each other if crammed into the same Scheduler. Thinker’s throughput (how many requests it can process) will be disrupted by Talker’s frequent fine-grained operations. Talker’s latency (how long the user waits to hear the first syllable) will grow when Thinker’s large-batch prefill monopolizes the GPU and causes queues. It is like putting aircraft carrier builders, screw-tighteners, and part-polishers on the same assembly line: the fast ones wait for the slow ones, and the slow ones get interrupted by the fast ones.

The original article’s conclusion is simple: scheduling decoupling is not optional, it is necessary. Thinker and Talker can only be scheduled asynchronously by two independent Schedulers.

Behind this problem, the team’s analytical style is on display. They did not vaguely say “multi-stage is more complex.” They took each stage individually, analyzed what type of computation it is, where its bottleneck resource lies, and how it differs from other stages. Once the local analysis is thorough, the system-level problems emerge on their own.

Problem 2: Two Decode Loops Have Completely Different Connection Patterns

The analysis of this problem continues the same “local analysis” approach — first look at what each pair of stages’ dependency looks like, then see what communication requirements that dependency imposes.

The first dependency: Thinker and Talker are asynchronously decoupled. Thinker generates text tokens and acoustic hidden states ahead of time, placing them in a shared buffer. Talker consumes from this buffer at its own pace. The two models maintain independent decode loops and do not need to synchronize every step — it is fine if Thinker is a few steps ahead. Under this pattern, the core communication requirement is a low-overhead streaming buffer, allowing slack.
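The asynchronous pattern is an ordinary bounded producer/consumer queue. The sketch below uses Python threads and `queue.Queue` inside one process, purely as an illustration of the slack. It is not how SGLang Omni moves tensors between stages, and the chunk strings stand in for text tokens plus hidden states.

```python
import queue
import threading
from typing import List, Optional


def run_pipeline(num_chunks: int, max_lead: int = 4) -> List[str]:
    """Thinker produces ahead of Talker by at most max_lead chunks."""
    buffer: "queue.Queue[Optional[str]]" = queue.Queue(maxsize=max_lead)
    consumed: List[str] = []

    def thinker() -> None:
        for i in range(num_chunks):
            buffer.put(f"chunk-{i}")  # blocks only when Talker is max_lead behind
        buffer.put(None)              # sentinel: no more chunks

    def talker() -> None:
        while True:
            item = buffer.get()       # consumes at its own pace
            if item is None:
                break
            consumed.append(item)

    threads = [threading.Thread(target=thinker), threading.Thread(target=talker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


consumed = run_pipeline(10)
```

Neither side waits for the other on every step. The only coupling is the buffer bound, which limits how far ahead the producer may run.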

The second dependency: Talker and MTP are synchronously tightly coupled. Every time Talker produces a 0-th codec token, MTP must immediately complete and write back — Talker’s next step strictly depends on this write-back. Under this pattern, every step’s latency is critical. If after Talker generates a token it must wait for a cross-process signal to trigger MTP, and after MTP computes it must send another signal back, the accumulated overhead becomes unacceptable.

The same system simultaneously has two communication requirements — one allowing slack, the other demanding ultra-low latency. That is the second problem.

Problem 3: Multiple Models Competing for One GPU’s VRAM

Standard LLM inference’s VRAM allocation is a straightforward two-part split: model weights occupy a fixed chunk, everything else goes to KV cache. To adjust the ratio, change one parameter (called mem_fraction_static in SGLang).

Under multi-stage, this simple logic falls apart immediately. Thinker’s weights are in VRAM (Qwen3-Omni’s Thinker is a 30B-A3B MoE), and Talker’s weights are also in VRAM. Thinker needs its own KV cache pool, Talker needs its own too. The vision encoder and audio encoder, when processing long videos, have enormous temporary activation peaks: the original article notes that the activation peak for one minute of video easily exceeds 30GB, while the encoder weights themselves are only about 2.5GB. Talker and MTP also need a feedback buffer between them.

The problem is not just “more people sharing one pie” — if it were simply a total capacity issue, adding more VRAM would solve it. The deeper problem is that these consumers have different load orders (some initialize first, some later), different offload strategies (some weights can be temporarily swapped out, others cannot), and the number “remaining available VRAM” changes at every step. The original one-dimensional VRAM allocation logic breaks down entirely — it assumed only one main consumer.

SGLang Omni’s Response: Every Design Choice Answers a Question

The three problems are three requirement specifications. Below we look at what choices the original team made for each problem, and why.

Scheduling Decoupling: Unified Interface, Independent Implementations

All stages follow the same inbox/outbox protocol externally, but their internal implementations differ.

Thinker runs on OmniScheduler, directly reusing all of SGLang main’s scheduling capabilities — continuous batching, mixed prefill/decode scheduling, KV cache management, tree cache, overlap scheduling — while dropping modules not yet needed in the Omni scenario such as tokenizer and grammar. Talker also runs on OmniScheduler, maintaining an independent scheduling loop from Thinker, asynchronously decoupled through relay. Stages that need no scheduling (preprocessing, encoder) use SimpleScheduler — a few lines of get → forward → put. Vocoder uses Code2WavScheduler.
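The "get → forward → put" loop for stages that need no scheduling might look roughly like the following. `ToySimpleScheduler` is a hypothetical class written for this article and is not the SimpleScheduler from the SGLang Omni codebase. The inbox and outbox are plain in-process queues.

```python
import queue
from typing import Any, Callable


class ToySimpleScheduler:
    """Toy stage loop: take an item from the inbox, run it, put the result in the outbox."""

    def __init__(self, forward: Callable[[Any], Any],
                 inbox: "queue.Queue[Any]", outbox: "queue.Queue[Any]") -> None:
        self.forward = forward
        self.inbox = inbox
        self.outbox = outbox

    def run_until_empty(self) -> int:
        """Process everything currently queued; return the number of items handled."""
        processed = 0
        while True:
            try:
                item = self.inbox.get_nowait()      # get
            except queue.Empty:
                return processed
            self.outbox.put(self.forward(item))     # forward, then put
            processed += 1


inbox: "queue.Queue[Any]" = queue.Queue()
outbox: "queue.Queue[Any]" = queue.Queue()
for clip in ["a", "bb", "ccc"]:
    inbox.put(clip)

encoder_stage = ToySimpleScheduler(forward=len, inbox=inbox, outbox=outbox)
num_processed = encoder_stage.run_until_empty()
results = [outbox.get_nowait() for _ in range(num_processed)]
```

Because every stage exposes the same inbox/outbox shape, a stage with a full batching scheduler and a stage with this trivial loop look identical from the outside.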

Here is a key choice that shows the team’s judgment: Talker and MTP are merged into the same Stage.

Why? Because the dependency between Talker and MTP is synchronous and per-step extremely lightweight. Splitting them into two Stages with relay and ZMQ signals in between means “the per-step latency would balloon to unacceptable levels” — this is a direct quote from the original article.

So MTP’s completion and feedback write-back are entirely encapsulated within a single forward call of FeedbackARModelRunner. To the upper-layer Coordinator, one timestep of Talker + MTP is just a lightweight decode step — the Coordinator is completely unaware that MTP exists.

The original article specifically explains what this merging changes and what it does not. What changes: only the ordering of kernels and the boundary of the CUDA Graph. What does not change: Talker’s paged KV cache, MTP’s weights, and multi-head completion logic — they remain two complete models, merely sharing the same forward call. The tight feedback loop is confined within the Stage, with no cross-scheduler overhead.

The principle behind this choice can be summarized in one line: Stage boundaries are not determined by the physical boundaries of model modules, but by the tightness of dependencies. Tightly coupled things are kept together to save cross-stage communication latency; loosely coupled things are separated so each can optimize throughput independently. This is not drawing boundaries by “what looks like a module” — it is drawing boundaries by “what communication cost would hurt performance.”

Layered Communication: Tailored to Need, No One-Size-Fits-All

Problem 2 identified two dependency patterns — one asynchronous, one synchronous. SGLang Omni’s response is not a unified communication layer; it splits into a control plane and a data plane.

The control plane uses ZMQ, carrying event notifications (“upstream chunk written,” “new request arrived”). The data plane uses relay, carrying actual large tensor transfers — shared memory or CUDA IPC for zero-copy between GPUs on the same machine, NCCL across nodes.

Thinker and Talker’s asynchronous dependency uses both planes: Thinker writes data to relay, sends a DataReady signal, Talker receives the signal and consumes from relay at its own pace. Talker and MTP’s synchronous dependency goes through no cross-stage communication at all — entirely completed within one forward call of the same ModelRunner.
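The split between a small notification and a large payload can be sketched in-process. Everything below is a stand-in: `ToyRelay` is a dictionary rather than shared memory or CUDA IPC, the control channel is a `queue.Queue` rather than ZMQ, and the `DataReady` fields are assumptions. Only the signal name comes from the text above.

```python
import queue
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataReady:
    """Control-plane message: says where the data is, carries none of it."""
    request_id: str
    chunk_id: int


class ToyRelay:
    """Data-plane stand-in: holds payloads until the consumer collects them."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, int], List[float]] = {}

    def write(self, key: Tuple[str, int], payload: List[float]) -> None:
        self._store[key] = payload

    def read(self, key: Tuple[str, int]) -> List[float]:
        return self._store.pop(key)


relay = ToyRelay()
control: "queue.Queue[DataReady]" = queue.Queue()


def thinker_emit(request_id: str, chunk_id: int, hidden: List[float]) -> None:
    """Write the payload first, then announce it."""
    relay.write((request_id, chunk_id), hidden)
    control.put(DataReady(request_id, chunk_id))


def talker_poll() -> List[float]:
    """Receive a notification, then fetch the payload it points to."""
    event = control.get_nowait()
    return relay.read((event.request_id, event.chunk_id))


thinker_emit("req-1", 0, [0.1, 0.2])
payload = talker_poll()
```

Writing before signalling matters: the consumer must never receive a notification for data that is not yet readable.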

The taste of this design lies in not presupposing structure — how many layers communication should have is not dictated by dogma, but by the requirements themselves. As many dependency patterns as there are, that many communication paths exist.

Cross-Stage VRAM Budgeting: From a Single Global Ratio to Multi-Party Declarations

VRAM management under a single Scheduler could be expressed with a single global parameter. Under multi-stage, that parameter fails.

SGLang Omni’s solution is to switch to a multi-party budgeting system. Each stage declares total_gpu_memory_fraction in its own configuration — “what percentage of this GPU’s total VRAM do I need.” At startup, the system sums all declarations per GPU; if the total exceeds 100%, it rejects outright. If it passes, each AR stage allocates as much KV cache as possible within its own budget.
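The startup check described here amounts to summing declarations per GPU. In the sketch below, `total_gpu_memory_fraction` is the field named in the text, while the tuple layout, the function name, and the stage names and fractions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (stage name, gpu id, total_gpu_memory_fraction)
StageBudget = Tuple[str, int, float]


def validate_budgets(stages: List[StageBudget]) -> Dict[int, float]:
    """Sum declared fractions per GPU and reject any GPU that is oversubscribed."""
    per_gpu: Dict[int, float] = defaultdict(float)
    for name, gpu_id, fraction in stages:
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"stage {name!r}: fraction {fraction} is out of range")
        per_gpu[gpu_id] += fraction
    for gpu_id, total in per_gpu.items():
        if total > 1.0 + 1e-9:  # small tolerance for floating-point sums
            raise ValueError(f"GPU {gpu_id} oversubscribed: {total:.2f} > 1.00")
    return dict(per_gpu)


budgets = validate_budgets([
    ("thinker", 0, 0.55),
    ("talker", 0, 0.30),
    ("vocoder", 0, 0.10),
    ("encoder", 1, 0.60),
])
```

Failing fast at startup turns what would otherwise be an out-of-memory crash mid-request into a configuration error.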

The encoder’s VRAM risk is discussed separately in the original article. Weights are only about 2.5GB, but activation peaks easily exceed 30GB. SGLang Omni’s handling of it reflects a pursuit of consistency: the encoder, like other stages, declares tp_size and GPU placement in StageConfig. TP-splitting activation peaks is not something done specially for the encoder — it is a mechanism shared by all stages. Good engineering is not “opened a special path for a special case” — it is “one mechanism solves a broad class of problems.”

Coordinator: Only Manages Topology, Not Implementation

The Coordinator does three things at the top layer: routes new requests to the entry Stage, collects and merges output from terminal Stages, and broadcasts abort messages to all relevant stages when needed.

Its design principle is also simple — stage-implementation agnostic. Regardless of what model or scheduler runs inside each Stage, it only knows the pipeline topology. This “manages only topology, not implementation” design keeps the Coordinator from bloating as the number of supported model types grows.
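"Only knows the topology" can be read quite literally: the Coordinator's three jobs need nothing but a graph of stage names. The sketch below assumes a simple linear pipeline whose stage names are hypothetical. A real deployment's graph may branch, for example when text and audio are both returned to the user.

```python
from typing import Dict, List, Tuple

# stage name -> names of the stages that consume its output
Topology = Dict[str, List[str]]


def entry_and_terminal(topology: Topology) -> Tuple[List[str], List[str]]:
    """Entry stages have no upstream; terminal stages have no downstream."""
    has_upstream = {d for targets in topology.values() for d in targets}
    entries = [s for s in topology if s not in has_upstream]
    terminals = [s for s, targets in topology.items() if not targets]
    return entries, terminals


def abort_targets(topology: Topology) -> List[str]:
    """An abort is broadcast to every stage; none of them is special-cased."""
    return list(topology)


topology: Topology = {
    "preprocess": ["audio_encoder"],
    "audio_encoder": ["thinker"],
    "thinker": ["talker"],
    "talker": ["vocoder"],
    "vocoder": [],
}
entries, terminals = entry_and_terminal(topology)
```

Nothing in these functions depends on which model or scheduler runs inside a stage, so adding a new model changes the dictionary, not the code.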

Looking Back: A Unified Perspective

The original article opens with a line that reads differently when revisited here: “An ML systems researcher has only one goal — study the computational characteristics of a given computation, and design an efficient and robust system tailored to those characteristics.”

This is not just the author’s personal conviction; it is the organizing logic of the entire article. Computational characteristics are primary: they determine the scope (which models belong to SGLang Omni, which to SGLang main), determine the system requirements (three problems derived from stage-level computational characteristics), determine the design choices (scheduling decoupling, layered communication, VRAM budgeting), and determine extensibility — because Stage abstraction is unified, Scheduler interfaces are unified, and communication and VRAM are framework-level, a new model only needs to declare its computational characteristics, dependency relationships, and VRAM budget.

Returning to what we originally wanted to learn from this article: how an elite inference system team makes decisions. The original provides answers at four levels. Problem definition — not following the trend of classifying by modality, but defining one’s own axis that reveals computational essence. Computational characteristic analysis — decomposing the system to the stage level, analyzing each stage’s bottleneck and dependency characteristics individually, deriving system-level constraints from local analysis. Design choices — every choice has clear prerequisites, rejected alternatives, and engineering rationale. Systematic perspective — all choices point to the same starting point: computational characteristics are primary.

None of these things are complex in isolation, but every step involves making choices, every step involves judgment. We analyze this article not because it introduces some advanced technique, but because the reasoning behind these choices is laid out in the open — for engineers, that is the best kind of textbook.