Most people will never train a large language model. But understanding the difficulty of training has two practical uses. First, it determines who in this industry can do it and who cannot, which in turn shapes the strategic value of open-source models, the underlying logic of compute pricing, and the trajectory of concentration in the AI industry. Second, and perhaps more practically: when someone deploys a pile of technical jargon to dramatize how hard pre-training is (whether for fundraising narratives, compute sales pitches, or technical self-promotion), you can distinguish which claims are genuine engineering constraints and which are exaggeration.
Training a large language model consists of two phases: pre-training (learning language and knowledge from massive text corpora) and post-training (aligning with human preferences). Pre-training consumes the vast majority of compute and capital, and it is the dividing line between who can train a frontier model from scratch and who can only fine-tune existing ones. This article covers pre-training only.
The discussion proceeds from scale intuition, then covers five dimensions in sequence: hardware reliability, compute utilization, numerical stability, data, and resource planning (how large a model, on how much data).
Start with an intuitive sense of pre-training’s magnitude.
Meta trained Llama 3 405B using 16,384 H100 GPUs over 54 days. Google trained PaLM 540B on 6,144 TPU v4 chips. GPT-4’s scale has no official figures; industry estimates put it at roughly 25,000 A100s for approximately 95 days.
On cost, the Stanford AI Index Report 2026 estimates GPT-4 at around $100 million and Google’s flagship model at around $150 million. Meta’s full Llama 3 family (including all sizes and multiple training attempts) is estimated above $500 million. These figures are compute costs only and exclude data labeling, engineering labor, and failed experiments.
Anthropic CEO Dario Amodei stated publicly in April 2024 that training costs for models at the time were approaching $1 billion, and predicted they would reach $5–10 billion in 2025–2026. This statement was recorded in the FTC’s 6(b) study report. According to The Information, Anthropic’s training budget for 2026 is approximately $12 billion. These numbers reflect not just the compute cost of a single run, but also the accumulated cost of many failed experiments. Pre-training rarely succeeds on the first try; along the way, loss divergence, accumulated hardware failures, or data mixing errors may require rolling back to a checkpoint or starting over entirely.
At this scale, every technical problem discussed below shares a common characteristic: the problem itself may not be rare, but at the scale of ten thousand GPUs over months, the cost of handling it is amplified to the level of millions of dollars.
One GPU failing once a year sounds highly reliable. But ten thousand GPUs means 27 failures per day. Pre-training clusters fall into a painful regime: large enough that failures occur every hour, yet unable to do fully stateless fault tolerance the way large cloud services can (because training is stateful — all GPUs must advance in sync).
Meta’s analysis based on 150 million A100 GPU-hours shows that a cluster of 1,024 GPUs has a mean time between failures (MTBF) of about 8 hours; at 16,384 GPUs that drops to roughly 1.8 hours. The Llama 3 405B training run validated this empirically: 54 days produced 419 unexpected failures, averaging one every 3 hours, roughly half of which involved the GPU itself or HBM3 memory. An earlier reference case is the OPT-175B training log: 992 A100s ran for about two months, hardware failures caused at least 35 manual restarts, more than 100 GPU hosts were replaced, and effective GPU utilization was only 52–59%.
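The scaling arithmetic behind these figures is worth making explicit. A minimal sketch (illustrative rates, assuming independent failures) converts a per-GPU annual failure rate into cluster-level failures per day and an approximate cluster MTBF:

```python
def cluster_failure_stats(num_gpus: int, failures_per_gpu_per_year: float = 1.0):
    """Expected failure arithmetic for a large GPU cluster,
    assuming independent, identically distributed failures."""
    failures_per_day = num_gpus * failures_per_gpu_per_year / 365
    mtbf_hours = 24 / failures_per_day  # mean time between any failure in the cluster
    return failures_per_day, mtbf_hours

per_day, mtbf = cluster_failure_stats(10_000)
print(f"{per_day:.1f} failures/day, cluster MTBF ≈ {mtbf:.2f} h")
# 10,000 GPUs at one failure per GPU-year: ~27 failures/day, MTBF under an hour
```

The "one failure per GPU-year" rate is the hypothetical from above, not a measured figure; real rates vary by component and workload, but the inverse scaling with cluster size holds regardless.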
More troublesome than visible failures is Silent Data Corruption (SDC). SDC means a GPU produces incorrect computation results without reporting any error. The erroneous values propagate through gradient aggregation operations (AllReduce) to the entire cluster, ultimately rendering all model weights invalid. Google’s engineering team reported encountering SDC once every one to two weeks during training. The Llama 3 training run recorded 6 SDC events over 54 days. A paper at ACL 2025 specifically studies SDC in LLM training. Meta developed dedicated detection mechanisms for this; Google uses deterministic training to allow replaying and tracing error sources.
High failure frequency does not mean training is inevitably out of control. Meta reports that effective training time for Llama 3 remained above 90%, achieved through a comprehensive system of automatic error detection, SDC monitoring, and asynchronous checkpointing. NVIDIA’s rule of thumb from a USENIX SREcon 2026 talk: in a cluster of ten thousand GPUs, even with a daily per-GPU failure rate of only 0.01%, at least one GPU will fail every day. Failures cannot be eliminated, but engineering can contain their impact to an acceptable range — at the cost of substantial dedicated investment.
The hardware layer addresses whether GPUs can function. The next question is how much of the time functioning GPUs are doing useful computation. The metric is MFU (Model FLOPs Utilization) — the model FLOPs actually achieved, expressed as a fraction of the cluster’s theoretical peak throughput.
For dense models (where all parameters participate in every computation), MFU typically runs 38–46%. Google’s PaLM 540B reached about 46% on TPU v4; Meta’s Llama 3 on H100 achieved roughly 38–43%. To put that concretely: of sixteen thousand GPUs, the equivalent of six to seven thousand are doing actual matrix computation.
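MFU can be estimated from measured throughput using the common approximation that a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward). The sketch below uses hypothetical throughput numbers chosen to land near the reported Llama 3 range; the function itself is the standard formula:

```python
def mfu(params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / theoretical peak.
    Uses the common ~6 * params FLOPs-per-token approximation for a
    dense transformer (forward + backward)."""
    achieved = 6 * params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical Llama-3-scale run: 405B params, H100 BF16 peak ~989 TFLOPs,
# assumed cluster throughput of 2.7M tokens/s
u = mfu(params=405e9, tokens_per_second=2_700_000,
        num_gpus=16_384, peak_flops_per_gpu=989e12)
print(f"MFU ≈ {u:.1%}")  # ≈ 40% with these assumed numbers
```

The tokens-per-second figure here is an assumption picked for illustration, not a published measurement.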
MoE (Mixture of Experts) architecture pushes that number lower still. MoE activates only a fraction of parameters per token, which is efficient at inference time, but training requires frequent data exchange across the entire cluster (All-to-All communication). NVIDIA’s analysis of DeepSeek-V3 training found that, without optimization, the compute-to-communication time ratio for cross-node expert parallelism is roughly 1:1 — GPUs spend half their time waiting for data transfers. DeepSeek-V3’s measured MFU in FP8 precision is approximately 21.4%. ByteDance’s MegaScale-MoE, training a 352B MoE model on 1,440 Hopper GPUs, achieved MFU of only 28–32% even after extensive communication optimization.
Low MFU is not solely a communication problem. Modern pre-training requires simultaneously deploying multiple parallelism strategies to distribute models and data across thousands of GPUs: tensor parallelism (TP, splitting single-layer computation across GPUs), pipeline parallelism (PP, placing different layers on different GPU groups), data parallelism (DP, different GPUs processing different data), context parallelism (CP, splitting long sequences across GPUs), and, for MoE, expert parallelism (EP). One MLPerf configuration for Llama 3.1 405B uses TP=8, PP=9, CP=2, DP=4. These dimensions constrain one another, and different configurations change communication patterns, memory usage, and efficiency in different ways. Finding the optimal configuration requires understanding the hardware topology (which GPUs are interconnected via NVLink at high bandwidth, which communicate across switches) and extensive ablation experiments.
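The parallelism degrees compose multiplicatively into the cluster size, which makes the search space easy to enumerate but expensive to evaluate. A toy sketch (the feasibility constraints here are illustrative simplifications, not a full performance model):

```python
def valid_layouts(num_gpus: int, num_layers: int, gpus_per_node: int = 8):
    """Enumerate (TP, PP, CP, DP) layouts whose degrees multiply to the
    cluster size, under two simplified constraints: TP stays within one
    NVLink domain (one node), and PP must evenly divide the layer count.
    Real planners also model memory, bandwidth, and bubble overhead."""
    layouts = []
    for tp in range(1, gpus_per_node + 1):
        for cp in (1, 2, 4):                      # illustrative CP choices
            for pp in range(1, num_layers + 1):
                used = tp * pp * cp
                if num_gpus % used == 0 and num_layers % pp == 0:
                    layouts.append((tp, pp, cp, num_gpus // used))
    return layouts

# The MLPerf-style layout quoted above is one point in a larger space
# (126 layers matches the Llama 3.1 405B architecture):
layouts = valid_layouts(num_gpus=576, num_layers=126)
print((8, 9, 2, 4) in layouts, len(layouts))
```

Even with these crude constraints, dozens of layouts pass the divisibility checks; picking among them is where the topology knowledge and ablation budget go.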
A useful scale conversion: on a cluster of 16,000 H100s, every 1 percentage point of MFU gain is equivalent to 160 additional GPUs of effective compute. Over a two-month training run, a 1% MFU gap corresponds to roughly $500,000.
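That conversion is simple arithmetic; making the assumed rental rate explicit shows where the dollar figure comes from (the $2/GPU-hour rate below is an illustrative assumption, not a quoted price):

```python
def mfu_point_value(num_gpus: int, days: float, dollars_per_gpu_hour: float):
    """Value of one percentage point of MFU, expressed both as
    equivalent GPUs of effective compute and as dollars over the run."""
    equivalent_gpus = num_gpus * 0.01
    dollars = equivalent_gpus * days * 24 * dollars_per_gpu_hour
    return equivalent_gpus, dollars

gpus, usd = mfu_point_value(num_gpus=16_000, days=60, dollars_per_gpu_hour=2.0)
print(f"1 MFU point ≈ {gpus:.0f} GPUs ≈ ${usd:,.0f}")
# 160 GPUs; $460,800 over 60 days at an assumed $2/GPU-hour
```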
The previous two sections dealt with hardware and network issues. Even when the cluster runs stably and communication is efficient, the training process itself presents another layer of challenges: the precision of numerical computation and the stability of model training.
The lower the floating-point precision used by a GPU, the faster the computation and the lower the memory footprint, but the larger the numerical error. As of early 2026, BF16 (16-bit floating point) remains the default training precision for most frontier models; Llama 3 used BF16 throughout. The industry is transitioning toward FP8 (8-bit floating point), with DeepSeek-V3 as the leading example. DeepSeek-V3’s approach assigns precision granularly by layer and operator: matrix multiplications use FP8 (accumulated in FP32); activations and gradients use BF16; embedding layers, output heads, normalization, and attention operators remain at BF16/FP32; and optimizer states maintain high precision. This is currently the largest-scale FP8 training case on public record, and it makes clear that low-precision training is far from a simple switch flip — it is a layer-by-layer engineering trade-off. The even lower FP4 (4-bit floating point) training exists only in research papers for now. The NeurIPS 2025 Quartet paper demonstrated results on a 1B parameter model, but it requires native hardware support from NVIDIA’s Blackwell architecture and remains far from large-scale engineering deployment.
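To make the precision trade-off concrete, the sketch below simulates FP8 E4M3 rounding (3 explicit mantissa bits, maximum normal value 448). It is a simplified illustration of the rounding granularity involved — normals only, no subnormals, no per-tile scaling — not DeepSeek’s actual kernels. The point it demonstrates: each FP8 value carries a relative error of up to a few percent, which is why accumulations stay in FP32:

```python
import math

def to_fp8_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3-representable value.
    Simplified: normal numbers only (no subnormals, no NaN),
    magnitudes clamped to the E4M3 normal range [2**-6, 448]."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(max(abs(x), 2.0 ** -6), 448.0)
    m, e = math.frexp(mag)          # mag = m * 2**e, with m in [0.5, 1)
    m = round(m * 16) / 16          # keep 3 explicit mantissa bits
    return sign * math.ldexp(m, e)

for v in [0.3, 1.0, 3.14159, 100.0]:
    q = to_fp8_e4m3(v)
    print(f"{v:>8} -> {q:<8} (rel. err {abs(q - v) / v:.1%})")
```

Per-value relative error stays within roughly 6% (half a ULP at 3 mantissa bits); summing thousands of such values in FP8 would compound this, which is the motivation for FP32 accumulation in the matmuls.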
Another persistent challenge during training is loss spikes: the training loss suddenly shoots up, and if not handled promptly, model parameters degrade. Loss spikes are sometimes described as entirely unpredictable random events, but research at COLM 2025 has identified the root cause theoretically: sudden growth in gradient norms, driven by the interaction between residual paths and Layer Normalization. The original PaLM paper also documented the spike-handling procedure: skip the anomalous mini-batch, adjust the learning rate, restart from a checkpoint. ByteDance analyzed 428 production training failure events and categorized root causes into data/algorithm issues, hardware failures, and engineering bugs — each with a corresponding diagnostic path. Loss spikes remain frequent in practice, but they are now a problem with theoretical frameworks and engineering responses.
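The skip-the-batch response is commonly implemented as a gradient-norm guard. A minimal sketch of the idea (the spike factor and EMA decay are illustrative choices, not any lab’s production values):

```python
class SpikeGuard:
    """Decide whether to apply or skip an optimizer step, based on the
    gradient norm relative to its exponential moving average."""
    def __init__(self, spike_factor: float = 3.0, decay: float = 0.9):
        self.spike_factor = spike_factor
        self.decay = decay
        self.ema = None  # running average of recent gradient norms

    def should_step(self, grad_norm: float) -> bool:
        # Skip the batch if the norm spikes far above its recent average;
        # the caller may additionally lower the LR or restore a checkpoint.
        if self.ema is not None and grad_norm > self.spike_factor * self.ema:
            return False
        self.ema = grad_norm if self.ema is None else (
            self.decay * self.ema + (1 - self.decay) * grad_norm)
        return True

guard = SpikeGuard()
decisions = [guard.should_step(g) for g in [1.0, 1.1, 0.9, 12.0, 1.0]]
print(decisions)  # [True, True, True, False, True] — the 12.0 spike is skipped
```

Note that a skipped spike leaves the EMA untouched, so a genuinely shifted gradient scale (rather than a one-off spike) would keep triggering the guard — which is exactly the situation where a checkpoint rollback becomes the right response.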
Hyperparameter selection (especially learning rate) also now has systematic methodology. μP (maximal update parameterization) allows searching for optimal hyperparameters on small models and then transferring them to large models via theoretically derived scaling rules. Research at ICLR 2025 proposed multi-power-law models that can predict loss curves across learning rate schedules. These methods have not fully eliminated tuning uncertainty (2025 research found that the interaction between weight decay and μP is still not fully theorized), but hyperparameter tuning has evolved from pure empirical trial-and-error into a systems engineering discipline with theoretical guidance.
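The core of the μP transfer rule can be stated in one line: for hidden (matrix-like) weights, the learning rate scales inversely with width relative to the small proxy model on which it was tuned. A sketch of that rule alone (embedding and output layers follow different μP rules, omitted here; the widths and base LR are illustrative):

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """μP-style learning-rate transfer for hidden (matrix-like) weights:
    LR scales as 1/width relative to the tuned base width."""
    return base_lr * base_width / width

# LR tuned on a width-256 proxy model, transferred to width 8192:
print(mup_hidden_lr(base_lr=3e-3, base_width=256, width=8192))
# 3e-3 * 256/8192 = 9.375e-05
```

The economic value is that the expensive sweep happens at proxy scale; the large model inherits the result instead of searching at full cost.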
The previous three sections addressed “how to run” questions: hardware must be stable, communication must be efficient, numerics must be controllable. This section and the next turn to the “what to run” questions: where does training data come from, and how large should the model be.
Pre-training requires massive volumes of high-quality text. Epoch AI’s research published at ICML 2024 provides the most systematic quantification to date: the total volume of high-quality, publicly available human-generated text is approximately 300 trillion tokens (90% confidence interval 100T–1000T), and at current growth trends, the median year of exhaustion is 2028, with a range of 2026–2032. The data has not run out yet, but the boundary is visible.
The amount of data consumed by current frontier models is already substantial. Llama 3 used approximately 15 trillion tokens. FineWeb (a pre-training corpus open-sourced by Hugging Face in 2024) extracted 15 trillion tokens from CommonCrawl crawl data spanning 2013–2024, with a curated educational subset, FineWeb-Edu, at approximately 1.3 trillion tokens that outperforms other datasets ten times its size on downstream benchmarks. This shows the leverage of data quality: 1T rigorously cleaned tokens can be more effective than 10T coarsely filtered tokens.
Synthetic data (using existing models to generate training data) is one of the main strategies for delaying the data shortage, but it has costs. Research by Shumailov et al. published in Nature found that when models are repeatedly retrained on previous-generation model outputs, the tails of the original data distribution (low-frequency but informative portions) disappear across generations, and model outputs ultimately degrade. Subsequent research also shows that this degradation can be avoided when synthetic data is mixed with real data and the real data stays above a certain proportion. The industry is already using synthetic data at scale; Meta’s Llama 3.1 incorporates synthetic data as a training component, primarily for narrow-domain tasks such as code and mathematical reasoning.
Multimodal training (simultaneously training on text, images, and video) introduces additional challenges. A systematic study by Zhai et al. in 2023 found that incorporating visual data can reduce a model’s text reasoning capability, but that the effect depends on training data diversity: diverse visual-language instruction data mitigates the forgetting, while homogeneous data is the primary risk.
The quantity and quality of data determine the raw material for training. On top of that sits a higher-level decision: given a compute budget, how large should the model be and how much data should it use?
Hoffmann et al.’s 2022 Chinchilla paper produced an influential conclusion: under compute-optimal conditions, the number of training tokens should be approximately 20 times the number of model parameters. By this ratio, a 70 billion parameter model requires about 1.4 trillion tokens. This conclusion reshaped industry thinking in 2022, making clear that many previous models (including GPT-3) were undertrained, and that training smaller models on more data could achieve better results.
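The 20:1 rule turns a compute budget directly into a model size. Combining the standard cost approximation C ≈ 6·N·D with D = 20·N gives N = √(C/120); a sketch (the example budget is chosen to reproduce the 70B/1.4T figure above):

```python
import math

def chinchilla_optimal(compute_flops: float, ratio: float = 20.0):
    """Compute-optimal model size under C ≈ 6·N·D with D = ratio·N.
    Solving for N: N = sqrt(C / (6·ratio)); then D = ratio·N."""
    n = math.sqrt(compute_flops / (6 * ratio))
    return n, ratio * n

# Budget matching a 70B Chinchilla-optimal run: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")
# params ≈ 70B, tokens ≈ 1.4T
```

The `ratio` argument makes the later departures easy to explore: plugging in 37 or 190 instead of 20 shows how the inference-optimal regimes shift the same budget toward smaller models trained on more data.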
But industrial practice since 2024 has systematically departed from this ratio. Llama 3 trained a 405B parameter model on 15 trillion tokens, a token-to-parameter ratio of about 37:1, far above 20:1. The reason is an inference-optimal strategy: deliberately over-training a smaller model so that its inference performance approaches that of a larger model, reducing deployment costs. The extra compute spent on training is recovered through the smaller model’s lower serving cost once it is deployed at scale.
Scaling law research after Chinchilla has further diverged. MosaicML’s 2023 research argued that if inference efficiency is considered, the ratio should reach 190:1. Research on MoE architectures found that large MoE models need token-to-parameter ratios as low as 8:1. Epoch AI’s follow-up work revisited the original Chinchilla data and gave a revised estimate of about 25.6:1. These divergences indicate that Scaling Laws remain an active research area, with optimal configurations varying substantially across architectures and optimization objectives.
The core difficulty here is that decisions about model size, data volume, and training duration must be made before training begins; that these decisions are mutually coupled; that each validation run costs millions of dollars; and that the optimal point keeps shifting as hardware costs and inference demands evolve.
The preceding six sections each examined a dimension of pre-training difficulty. Viewing them together makes the core character of pre-training clearer: the challenge is not that any single technical problem is exceptionally hard, but that multiple problems of moderate difficulty coexist simultaneously at the scale of ten thousand GPUs over months, and a mistake in any one area can invalidate the entire run.
These problems also differ in maturity. Hardware failures and communication overhead have mature engineering solutions (Meta maintained effective training time above 90% for Llama 3), at the cost of enormous capital and labor investment. Mixed precision, parallelism strategies, and data mixing have systematic methodologies (μP, Scaling Laws), but each experimental validation costs millions of dollars and the search space remains far from fully explored. Data efficiency ceilings, the boundaries of synthetic data, and interference elimination in multimodal training remain open problems where even the evaluation criteria are still evolving.
This stratification provides a practical judgment framework. When a team says they can do pre-training, three progressively deeper questions are worth asking: Can the cluster run stably for weeks? Is there sufficient budget to search parallelism strategies and data configurations? Is there independent research capability on data efficiency and architectural choices? The answers to these three questions roughly determine what level of model they can produce.
Conversely, when someone uses terms like Silent Data Corruption, Activation Spike, or Model Collapse to dramatize pre-training’s barriers, the same framework applies: which layer does this problem belong to? Does it already have engineering solutions, or is it still an open problem? Does the description match the public record? Pre-training is genuinely hard, but its difficulty can be concretely understood. It does not need to be mystified.