Inference and Performance: Model Architecture

When Your GPU Runs Out of Memory: How the Offloading School Trains 100B+ Models on a Single GPU

An Underrated Number

Large models have too many parameters to fit in a single GPU’s memory. A simple solution is to store model parameters in the much larger CPU memory. The GPU fetches only the layer it needs for calculation and sends it back when done. This method is called offloading.

It sounds like a detour. Doesn’t moving data back and forth make it slow? It does, but the real question is how much. If the slowdown is minimal, this path is worth taking. Let’s look at the data.

Model Size | Offloaded to CPU (MegaTrain) | All in GPU Memory (Native PyTorch)
7B         | 284 TFLOPS                   | 285 TFLOPS
14B        | 264 TFLOPS                   | OOM (Out of Memory)
32B        | >250 TFLOPS                  | OOM

This table shows real measurements from MegaTrain (April 2026 paper) on a single GH200. The results are clear. When a 7B model fits entirely in GPU memory, offloading and native performance are almost identical. At 14B, native PyTorch hits an out of memory error. The offloading solution not only runs but only sees a 7% drop in throughput. At 32B, while other systems crash, offloading continues to run steadily at over 250 TFLOPS.

What about accuracy? Does offloading make the model less capable? No. At the 7B scale, MegaTrain achieves 88.99% accuracy compared to 88.91% for native PyTorch. At 14B, MegaTrain reaches 92.52%. Native PyTorch cannot run at this size, but other systems of similar scale reach 92.41%, making the difference negligible.

If your impression of offloading is that it works in theory but is painfully slow in practice, you were mostly right before 2024. In the V100 era, ZeRO-Offload only reached about 30 TFLOPS. But as hardware changed, the entire calculation changed.

The key variable is the data path between the CPU and GPU. Standard servers use PCIe with a bandwidth of about 128 GB/s. The GH200 uses NVLink-C2C at 900 GB/s, which is seven times faster. This is not just an incremental improvement. It is a fundamental shift. Using the same offloading logic with a pipe seven times wider turns a slow process into one with almost no noticeable overhead.

There is some nuance. On an H200 with a standard PCIe connection, offloading still works. MegaTrain can train 72B or even 120B models, but the absolute speed is lower than on a GH200. High performance on premium hardware does not always translate directly to all setups.

How did this evolution happen? What is going on behind the scenes? Where are the limits?

The Core Issue: Memory, Not Compute

To understand offloading, you must first understand the problem it solves.

Training large models requires storing three types of data in memory: the model parameters (hundreds of GB), the gradients that tell parameters how to adjust (hundreds of GB), and the optimizer states that track the adjustment history (hundreds of GB more). A 70B parameter model needs about 840 GB for these three categories. The most powerful H200 GPU today has only 141 GB of memory. It simply cannot fit.
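The 840 GB figure is easy to reproduce. A common mixed-precision recipe stores BF16 parameters (2 bytes each), BF16 gradients (2 bytes), and Adam's two FP32 moment buffers (8 bytes); exact byte counts vary by recipe, so treat this as a sketch:

```python
def training_memory_gb(num_params: float,
                       param_bytes: int = 2,    # BF16 parameters
                       grad_bytes: int = 2,     # BF16 gradients
                       optim_bytes: int = 8):   # Adam m and v in FP32
    """Rough per-category memory footprint, ignoring activations."""
    gb = 1e9  # decimal GB, matching the article's round numbers
    return {
        "params_gb": num_params * param_bytes / gb,
        "grads_gb": num_params * grad_bytes / gb,
        "optimizer_gb": num_params * optim_bytes / gb,
        "total_gb": num_params * (param_bytes + grad_bytes + optim_bytes) / gb,
    }

print(training_memory_gb(70e9))  # 70B model: 140 + 140 + 560 = 840 GB total
```

Activations add more on top, but even this lower bound is already six times an H200's 141 GB.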

GPU compute power grows quickly, but memory capacity grows slowly. Model parameters grow even faster, moving from billions to trillions. The bottleneck has shifted from how fast we can calculate to how much we can remember at once.

The most obvious solution is to buy more GPUs and split the model across them. Leading labs do this for pre-training. However, it is expensive and often unnecessary. Post-training tasks like fine-tuning and instruction alignment do not require massive compute, but they still need to load the full model. You might only need to process a few million tokens instead of trillions. The compute demand is modest, but the memory demand is not.

The logic of offloading is straightforward. Since GPU memory cannot hold all the data, and server motherboards have hundreds of GB or even 1.5 TB of CPU memory, why not let the GPU fetch what it needs from the CPU and send it back when finished?

Your Computer Already Does This

This idea is not new. Your operating system does something similar every second.

Your computer has L1 and L2 caches on each CPU core (extremely fast, a few MB), a shared L3 cache (fast, dozens of MB), RAM (medium speed, dozens of GB), and a hard drive (slow, several TB). The operating system uses virtual memory to swap inactive data from RAM to disk and back. The cache hardware moves data between L3, L2, and L1 in cache-line units. Each layer is slower than the one above it but offers an order of magnitude more capacity. The foundation of computer architecture is this trade-off between speed and capacity.

Training neural networks fits this pattern naturally. Models are built in layers. You must finish calculating layer 1 before moving to layer 2, and so on. At any given moment, only the data for the current layer is active. The data for dozens of other layers is just waiting in line. Backpropagation works the same way in reverse.

Why store all layers in the most expensive GPU memory? Why not manage it like virtual memory, keeping only the current layer on the GPU while storing the rest in CPU memory?

The problem is the transfer speed. Internal GPU memory bandwidth is in the 4 TB/s range. The path from CPU to GPU is 900 GB/s at best and 128 GB/s in common cases. If moving data takes longer than the calculation itself, the GPU sits idle. The entire plan becomes pointless.

The core engineering challenge of offloading is to hide the data transfer time behind the calculation time so the GPU never notices the data was not local.

Five Years of Evolution

Since 2020, this field has gone through four generations of evolution. Each generation addresses a different side of the same problem.

Generation 1: Offloading the Memory-Heavy Parts (ZeRO-Offload, 2020)

The Microsoft DeepSpeed team made a clever observation. Training requires three types of data, but they differ in size. Model parameters take up about one-sixth, gradients take another sixth, and optimizer states take up two-thirds. Optimizer states are the largest because they track the history of every parameter and must use high precision for numerical stability.

Since optimizer states take up the most space, why not offload only them? ZeRO-Offload keeps optimizer states and gradients in CPU memory while keeping model parameters in GPU memory. After the GPU calculates gradients, it sends them to the CPU. The CPU uses the optimizer to calculate updates and sends the updated parameters back to the GPU.

The logic is that optimizer calculations are light, involving simple operations for each parameter, but they require reading and writing a lot of memory. This is a memory-bandwidth-bound task, not a compute-bound one. CPUs are well-suited for this. They have enough memory bandwidth and compute power for these tasks, and this approach avoids moving massive amounts of data back and forth constantly.
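To see why the CPU can keep up, here is a minimal pure-Python sketch of one Adam update (hypothetical names and default hyperparameters, not ZeRO-Offload's actual code). Each parameter costs a handful of multiply-adds but a read and a write across four arrays, so throughput is set by memory bandwidth, not arithmetic:

```python
import math

def adam_step_cpu(param, grad, m, v, t, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One elementwise Adam update over CPU-resident optimizer state.

    Every pass streams through all four arrays once: a
    memory-bandwidth-bound workload the CPU handles fine.
    """
    for i in range(len(param)):
        m[i] = beta1 * m[i] + (1 - beta1) * grad[i]        # first moment
        v[i] = beta2 * v[i] + (1 - beta2) * grad[i] ** 2   # second moment
        m_hat = m[i] / (1 - beta1 ** t)                    # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        param[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param
```

In the ZeRO-Offload flow, `grad` arrives from the GPU, this update runs on the CPU, and the refreshed `param` values are copied back.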

This plan cuts memory requirements by about two-thirds. On a single V100, it can train a 10B parameter model with a throughput of about 30 TFLOPS, matching the performance of a 4-GPU setup without offloading.

The limitation is that model parameters must still fit in GPU memory. For larger models, the parameters alone will exceed the limit.

Generation 2: Offloading Parameters with SSD Support (ZeRO-Infinity, 2021)

If parameters also need to be offloaded, is CPU memory enough? For massive models, even CPU memory falls short. ZeRO-Infinity introduced three-tier storage: GPU memory (fastest, smallest), CPU memory (medium), and NVMe SSD (slowest, largest).

The main problem with three-tier storage is knowing what data is needed next so it can be moved before the GPU requires it. ZeRO-Infinity records the full parameter access sequence during the first training step and uses it to pre-fetch data in subsequent steps. It is like driving the same route to work every day. After a few days, you know where the traffic is and can change lanes early.

SSDs are much slower, at about 3.5 GB/s. ZeRO-Infinity uses pipelining to compensate. While the CPU calculates the current block, it reads the next block from the SSD and writes the previous block back. These three operations happen in parallel, hiding the slow SSD speed behind the calculation time.
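The record-and-replay idea can be sketched in a few lines of Python (a toy model, not ZeRO-Infinity's implementation): the first step loads blocks on demand while recording their order, and later steps replay that trace on a background thread so each block is staged before the compute loop asks for it.

```python
import threading, queue

class TracePrefetcher:
    def __init__(self, load_fn):
        self.load_fn = load_fn             # e.g. reads one block from SSD
        self.trace = []                    # access order seen on step 1
        self.buf = queue.Queue(maxsize=2)  # small staging buffer

    def record(self, block_id):
        """Step 1: load on demand, remembering the order."""
        self.trace.append(block_id)
        return self.load_fn(block_id)

    def replay(self):
        """Later steps: a background thread loads blocks ahead of use."""
        worker = threading.Thread(
            target=lambda: [self.buf.put(self.load_fn(b)) for b in self.trace],
            daemon=True)
        worker.start()
        for _ in self.trace:
            yield self.buf.get()  # compute loop consumes prefetched blocks

# Usage with a fake loader standing in for SSD reads:
pf = TracePrefetcher(load_fn=lambda b: f"data-{b}")
for b in [0, 1, 2]:            # step 1: on-demand, order gets recorded
    pf.record(b)
blocks = list(pf.replay())     # step 2: same order, loaded ahead of time
```

The bounded queue is the point: the loader stays at most two blocks ahead, so staging memory stays small while the slow reads overlap with computation.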

This setup can train a 100B parameter model on 32 V100 GPUs, whereas 64 GPUs would be needed without offloading. However, it relies on a fixed access pattern. If the calculation pattern changes dynamically, it fails. Also, PCIe bandwidth was only about 32 GB/s at the time, so the fundamental bottleneck remained.

Generation 3: Flipping the Script, CPU Memory as the Main Actor (MegaTrain, 2026)

Previous generations viewed the GPU as the primary actor and CPU memory as a helper. MegaTrain reverses this. CPU memory is the permanent home for all data, and the GPU is just a compute engine with a cache. Parameters live in CPU memory. The system fetches only the layer needed for calculation and releases it immediately afterward.

This sounds similar to earlier methods, but there is a fundamental difference. Previous systems relied on the standard PyTorch framework, which assumes all data stays on the GPU during training. MegaTrain bypasses this assumption. It manages data movement itself, keeping only the current layer’s logic and a small cache on the GPU.

It uses two main techniques. The first is a double-buffering pipeline. While the GPU uses buffer A to calculate layer i, the CPU fills buffer B with data for layer i+1. When the layer is done, they swap. The GPU never waits for data. The second is a stateless compute template. The GPU stores no persistent model data, only an empty compute framework. Data flows in, gets processed, and flows out. This keeps GPU memory usage below the size of a single layer’s parameters.
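The double-buffering pipeline can be sketched CPU-only (hypothetical names; the real system would use CUDA streams and pinned host memory): while one buffer is being computed on, a background thread fills the other, and the roles swap each layer.

```python
import threading

def train_step(layers, fetch, compute):
    """Double-buffered layer walk: fetch layer i+1 while computing layer i."""
    buffers = [fetch(layers[0]), None]   # buffer A pre-filled with layer 0
    results = []
    for i, _ in enumerate(layers):
        nxt = None
        if i + 1 < len(layers):
            # Background fill of the other buffer (stand-in for the
            # CPU->GPU copy running on a separate stream).
            def _fill(j=i + 1, slot=(i + 1) % 2):
                buffers[slot] = fetch(layers[j])
            nxt = threading.Thread(target=_fill)
            nxt.start()
        results.append(compute(buffers[i % 2]))  # work on the current buffer
        if nxt:
            nxt.join()  # ensure the prefetch finished before swapping
    return results

# Usage with toy fetch/compute standing in for transfers and kernels:
out = train_step(layers=[1, 2, 3],
                 fetch=lambda w: w * 10,    # "copy layer to GPU"
                 compute=lambda x: x + 1)   # "forward through layer"
```

As long as each `fetch` finishes within one `compute`, the transfer is invisible; the moment it doesn't, the `join` is where the GPU would stall.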

On a GH200, offloading a 7B model is as fast as native training (284 vs 285 TFLOPS). A 32B model runs steadily at over 250 TFLOPS while other systems crash. On an H200 with 1.5 TB of CPU memory, a single card can train models up to 120B parameters.

Other Parallel Paths

Several other systems emerged in 2025 and 2026. TERAIO (NeurIPS 2025) noted that only 1% to 2% of data is active at any time. It established a direct data path between the GPU and SSD using GPUDirect Storage, bypassing the CPU. This was about 1.5 times faster than the ZeRO series. MemAscend focused on the fact that the staging area in CPU memory can become a bottleneck for MoE models with many small expert modules. It reduced staging memory usage by 72%. Ratel used a unified scheduler to make global decisions on where to put data, when to recompute, and when to offload, allowing fine-tuning of models up to 135B parameters.

The DeepSpeed team released SuperOffload at ASPLOS 2026, specifically optimized for the GH200. The paper reported about 310 TFLOPS in FP16. Official blogs reported even higher numbers in specific configurations. Regardless of the exact figure, the improvement over the original ZeRO-Offload is significant.

Another Path: Solving the Problem at the Hardware Level

All these systems solve the same problem: CPU memory and GPU memory are separate, connected by a narrow pipe. What if they were not separate?

Apple Silicon uses unified memory. The CPU and GPU share the same physical RAM, so there is no data transfer issue. The latest M5 Max has 128 GB of unified memory with 614 GB/s bandwidth. The M4 Ultra has 192 GB at 819 GB/s. Apple’s MLX framework performs training and inference directly on this architecture without needing offloading.

NVIDIA is taking a similar path in the data center. The GH200 uses NVLink-C2C to package an ARM CPU and a GPU together. The 480 GB of LPDDR5X on the CPU and 96 GB of HBM on the GPU form a unified address space with 900 GB/s of interconnect bandwidth. The next generation GB200 will expand this further. This hardware foundation makes the high performance numbers mentioned earlier possible.

The limitation of Apple Silicon is compute power. Unified memory solves capacity and bandwidth issues, but the raw power of Apple’s GPU cores lags behind NVIDIA’s data center GPUs. Apple’s own research papers acknowledge this. Apple Silicon is competitive for inference and fine-tuning small to medium models, but large-scale training remains NVIDIA’s domain.

The Bandwidth Wall: Why Optimism Has Limits

Despite the impressive numbers, there are reasons for caution.

Offloading is viable only if the time the GPU spends calculating a layer is greater than the time it takes to fetch the next layer from the CPU. Only then can the transfer be hidden.

A 14B parameter model in BF16 takes about 28 GB. On standard PCIe Gen4 with an effective bandwidth of 26 GB/s, moving these parameters takes about one second. The GPU calculates a layer much faster, in about 10 to 20 milliseconds. The transfer time is 50 to 100 times longer than the calculation time. The GPU spends most of its time waiting.

With the GH200’s NVLink-C2C at 900 GB/s, the same 28 GB takes about 31 milliseconds. This is in the same range as the 10 to 20 millisecond calculation time. Double buffering can hide almost all of it. This is why the GH200 results look so good.
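The break-even arithmetic above is simple enough to check directly (the 10 to 20 ms per-layer compute window is the article's rough figure, not a measurement):

```python
def transfer_time_ms(model_params: float, link_gb_s: float,
                     bytes_per_param: int = 2) -> float:
    """Time to move a model's BF16 weights over a CPU-GPU link."""
    return model_params * bytes_per_param / (link_gb_s * 1e9) * 1e3

pcie = transfer_time_ms(14e9, link_gb_s=26)     # PCIe Gen4, ~26 GB/s effective
nvlink = transfer_time_ms(14e9, link_gb_s=900)  # GH200 NVLink-C2C

print(f"PCIe: {pcie:.0f} ms, NVLink-C2C: {nvlink:.1f} ms")
# Against a 10-20 ms per-layer compute window, ~1 s cannot be hidden;
# ~31 ms can, with double buffering.
```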

Larger models actually make the situation better. A layer's transfer time grows with its parameter count, but so does its compute time, and compute per layer additionally scales with the number of tokens processed each step. In absolute terms, a large model spends tens of milliseconds computing each layer, which gives the pipeline a wider window in which to hide the (also larger) transfer. Small models are bandwidth-bound because they finish each layer's computation before the next layer's data can arrive.

This explains why performance for offloading systems degrades gracefully rather than crashing. As models grow, calculation times increase, making it easier to hide transfers and reducing the relative cost of offloading. The trade-off is that each training step is slower, but the slowdown is manageable and decreases as the model size increases.

When Should You Use Offloading?

The clearest use case is post-training on hardware with limited resources. If you have only one or two GPUs and want to fine-tune a 30B to 70B model, offloading is the difference between possible and impossible. Fine-tuning requires far fewer steps than pre-training, so the total time is usually acceptable.

The worst use case is large-scale pre-training. Pre-training involves trillions of tokens, and throughput losses accumulate over millions of steps. Renting more GPUs to fit the model entirely in memory is almost always more cost-effective than offloading on fewer GPUs. Leading labs use multi-GPU clusters with 3D parallelism (tensor, pipeline, and data parallelism) instead of CPU offloading. The latency of NVLink between GPUs is much lower than the PCIe latency between a GPU and a CPU.

There is also a middle ground. Many practitioners offload only the optimizer states, which take up two-thirds of the memory, while keeping parameters on the GPU. This is practical when parameters fit in memory but the optimizer states push it over the limit. Since optimizer calculations are memory-bandwidth-bound, running them on the CPU adds no extra cost.
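In DeepSpeed, for example, this middle ground is roughly a one-entry config change (a sketch of a ZeRO stage 2 setup with optimizer offload; field names and values here should be checked against the current DeepSpeed documentation):

```python
# DeepSpeed-style ZeRO config sketch: parameters stay on the GPU,
# optimizer states and their update run in CPU memory.
ds_config = {
    "train_batch_size": 8,                 # illustrative value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                        # shard grads + optimizer states
        "offload_optimizer": {
            "device": "cpu",               # Adam states live in CPU memory
            "pin_memory": True,            # pinned staging for faster copies
        },
    },
}
```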

The economics are also worth considering. If one GPU with offloading delivers 60% of the throughput of two GPUs without it, you save 50% on hardware but spend 67% more wall-clock time, so total GPU-hours drop from 2 to about 1.67, a net saving of roughly 17%. The more expensive the GPU, the more sense offloading makes. If GPUs are cheap or you are in a hurry, it is less attractive.
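The trade-off generalizes to a one-line formula (a back-of-envelope model that ignores CPU-memory and power costs):

```python
def offload_economics(throughput_ratio: float, gpus_saved_from: int = 2):
    """Fraction of GPU-hours saved by 1 offloading GPU vs. N plain GPUs."""
    time_factor = 1 / throughput_ratio      # extra wall-clock time needed
    gpu_hours = 1 * time_factor             # one GPU, running longer
    baseline = gpus_saved_from * 1.0        # N GPUs at full speed
    return 1 - gpu_hours / baseline

print(f"{offload_economics(0.6):.0%}")  # ~17% net saving at 60% throughput
```

The saving goes negative once throughput falls below `1 / gpus_saved_from`, which is exactly the pre-2024, PCIe-era regime where offloading lost its appeal.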

Where We Are Headed

The evolution of offloading has a clear direction. Early systems treated CPU memory as an overflow buffer, with GPU memory as the star. New systems treat CPU memory as the primary home for data, with GPU memory acting as a compute cache. This shift means that for large models, data actually lives in CPU memory, and the GPU’s role is to process the small portion it can hold at any time.

Hardware trends support this. NVIDIA’s Grace Hopper and Blackwell, along with Apple Silicon, are pushing toward unified memory. The CXL standard is also moving toward shared memory pools across devices.

At the same time, quantization and compression methods address the same problem from a different angle. Instead of moving data, they compress it to fit in GPU memory. The best modern systems combine both paths, using quantization to reduce size and offloading to handle the rest.

The honest conclusion is that offloading is now a mature and practical technology on the right hardware. It will not replace multi-GPU training or allow you to compete with top labs in pre-training. But for most practitioners working on post-training with limited hardware, it expands the feasible model size by an order of magnitude with only a modest drop in throughput. The software is now good enough that the bottleneck is hardware. As tighter CPU-GPU integration becomes standard, the line between GPU training and CPU-offloaded training may eventually disappear.