AI Coding Model Architecture

Why LLM Code Generation Often Looks Like It Should Work but Still Fails

Bottom line first: Apple’s ML-SSD paper, Embarrassingly Simple Self-Distillation Improves Code Generation, has limited direct relevance to ordinary end users. If you mainly write code through products like Claude Code or Cursor, you never touch decoding knobs like temperature because the product hides them. But it is still worth ten minutes of your attention because it offers a simple frame for understanding internal constraints, failure modes, and current capability limits in LLM code generation. That helps you judge when the model is reliable and when human intervention still matters.

What problem did the paper notice?

Apple’s paper Embarrassingly Simple Self-Distillation Improves Code Generation starts from a familiar observation: the same model can produce perfectly correct code in one attempt, then fail badly when the setting changes or you sample again.

Paper claim: one important source of that behavior is a mismatch between the global decoding policy (one temperature applied across the whole sequence) and the local needs of individual code tokens. The paper distinguishes two kinds of positions in code generation.

Lock positions are places where syntax and semantics narrow the next-token choices to a very small set. For example, after for i in range(, the next token is usually constrained to a number, variable, or function call. At these positions, the model’s distribution should be sharp, but in practice it often carries a distractor tail: tokens that are barely legal syntactically but wrong semantically still get non-trivial probability and can derail sampling.
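The distractor-tail mechanism can be made concrete with a toy sketch. The distribution below is an illustrative assumption, not a measurement from the paper: a lock position where the intended token dominates, but syntactically legal, semantically wrong tokens still carry a small combined mass, which repeated sampling will eventually hit.

```python
import random

# Toy next-token distribution at a "lock" position such as `for i in range(`.
# The probabilities are illustrative assumptions, not measured values.
lock_dist = {
    "n": 0.70,          # the intended loop bound
    "len(xs)": 0.20,    # another reasonable continuation
    "self": 0.04,       # barely-legal distractors: syntactically fine,
    "None": 0.03,       #   semantically wrong in this context
    ")": 0.03,
}

def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against float rounding at the boundary

rng = random.Random(0)
distractors = {"self", "None", ")"}
draws = [sample(lock_dist, rng) for _ in range(10_000)]
bad = sum(t in distractors for t in draws) / len(draws)
print(f"fraction of samples hitting the distractor tail: {bad:.3f}")
```

Even a 10% tail means roughly one in ten samples goes wrong at this single position, and a long program contains many such positions, so the per-position error compounds.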

Fork positions are places where several continuations are genuinely reasonable and correspond to different solution paths. A graph problem may admit either DFS or BFS. At these positions, the model needs enough spread in its distribution to explore multiple valid approaches.

That creates the central conflict: lock positions need precision, fork positions need exploration, but a single global temperature can only impose one fixed tradeoff across the whole sequence. Lowering temperature helps locks and hurts forks. Raising it helps forks and hurts locks. The paper argues that this granularity mismatch is one important bottleneck in code generation.
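The tradeoff described above follows directly from how temperature-scaled softmax works. A minimal sketch with assumed toy logits (not values from the paper): one lock-like distribution with a single strong token plus a distractor tail, and one fork-like distribution with two comparable continuations.

```python
import math

def softmax_with_temperature(logits, T):
    """Standard temperature-scaled softmax: p_i ∝ exp(logit_i / T)."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative logits: a lock position has one strongly preferred token
# plus distractors; a fork position has two genuinely viable options.
lock_logits = [4.0, 1.0, 0.5, 0.5]   # correct token vs. distractor tail
fork_logits = [2.0, 1.5, 0.2, 0.1]   # e.g. a DFS-like vs. BFS-like path

for T in (1.2, 0.3):
    lock = softmax_with_temperature(lock_logits, T)
    fork = softmax_with_temperature(fork_logits, T)
    print(f"T={T}: lock distractor mass={1 - lock[0]:.4f}, "
          f"fork second-path mass={fork[1]:.3f}")
```

Lowering T from 1.2 to 0.3 crushes the lock's distractor mass (good) but also crushes the fork's second path (bad); raising it does the reverse. One global T cannot win at both kinds of positions simultaneously.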

The intuition behind the proposed fix

The SSD recipe is extremely simple: sample a large amount of code from a frozen base model, keep the raw outputs without any verification, fine-tune the same model on those outputs, then evaluate with a separately tuned decoding temperature. No verifier, no teacher model, no RL.
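The recipe described above can be sketched as a three-stage skeleton. This is a structural sketch only: every function name here is a placeholder I introduce for illustration, with the sampling and fine-tuning stages stubbed out; it is not the paper's actual code (the real repo is at github.com/apple/ml-ssd).

```python
# Structural sketch of the SSD recipe as described in the text:
# sample -> keep raw outputs unverified -> fine-tune -> tune eval temperature.
# All names are hypothetical placeholders, not the paper's API.

def sample_completions(model, prompts, n_per_prompt):
    """Stage 1: sample code from the frozen base model (stubbed:
    the model is just a prompt -> completion callable here)."""
    return [(p, model(p)) for p in prompts for _ in range(n_per_prompt)]

def finetune(model, pairs):
    """Stage 3: supervised fine-tuning on (prompt, completion) pairs;
    stubbed as identity so the sketch stays runnable."""
    return model

def ssd(base_model, prompts, n_per_prompt=8):
    data = sample_completions(base_model, prompts, n_per_prompt)
    # Stage 2 is deliberately absent: no verifier, no filtering,
    # no teacher model -- the raw samples are used as-is.
    return finetune(base_model, data)

# Tiny demo with a toy "model":
toy_model = lambda prompt: prompt + "\n    return a + b"
tuned = ssd(toy_model, ["def add(a, b):"], n_per_prompt=4)
```

The notable design choice is what is missing: there is no quality gate between sampling and fine-tuning, which is exactly why the paper's later stress test on wrong-code training data matters.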

Paper claim: this apparently circular process reshapes the model’s distribution in a context-dependent way. After SSD, distributions at lock positions become more concentrated, so the distractor tail gets suppressed; at fork positions, useful diversity is preserved. A coarse tradeoff that used to be handled only by one global temperature is partially moved into the weights, becoming a more local adaptation.

Our interpretation: pretraining gives the model a large library of code patterns, but those patterns are expressed with fairly uniform strength. SSD acts like a refocusing step. It does not obviously teach brand new coding knowledge. Instead, it encourages the model to express what it already knows with different sharpness depending on local context.

The paper also includes one important ablation: even after a broad global temperature sweep on the base model, the best pass@1 still remains meaningfully below the SSD-tuned model on LiveCodeBench v6, 42.4% vs 55.3% for Qwen3-30B-Instruct. That weakens the simpler explanation that SSD merely discovered a better global temperature, and supports the paper’s claim that something more local was changed inside the model.
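For readers unfamiliar with the pass@1 metric used in that comparison, it is typically computed with the standard unbiased pass@k estimator popularized by the HumanEval benchmark: given n samples per problem of which c pass the tests, pass@k is the expected chance that at least one of k drawn samples is correct. The sample counts below are illustrative, not the paper's actual n and c.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples (drawn without replacement from n total, c correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the plain correct fraction c/n.
# Illustrative numbers: 85 of 200 samples pass -> pass@1 = 0.425.
print(pass_at_k(200, 85, 1))
```

Note that pass@1 is just the correct fraction, so the 42.4% vs 55.3% gap means the SSD-tuned model solves meaningfully more problems on its first sampled attempt.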

Why this intuition is useful even if you never tune temperature yourself

Even if you never touch decoding parameters directly, the lock/fork frame is still useful. Some parts of code generation demand near-deterministic precision. Other parts demand real search over multiple viable paths. Many product behaviors can be read through that lens. When a model makes a silly local syntax mistake, or becomes overly conservative when a high-level choice is needed, the failure may come from a decoding tradeoff at the token level rather than from total ignorance.

It also helps explain a common practical experience: in many situations, asking the model to try again works better than trying to over-optimize a single sample. If the bottleneck really sits in the lock/fork tradeoff, multiple samples are one way of compensating for the fact that a single decode path cannot satisfy both demands equally well.
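The "just ask again" effect has a simple idealized model. Under the strong (and admittedly unrealistic) assumption that each retry is an independent draw with the same per-sample success probability p, the chance that at least one of k attempts succeeds grows quickly with k:

```python
def at_least_one_success(p, k):
    """Probability that at least one of k samples is correct, assuming
    each succeeds independently with probability p. Real samples from
    one model are correlated, so this is an optimistic idealization."""
    return 1.0 - (1.0 - p) ** k

# With an assumed per-sample success rate of 40%:
for k in (1, 3, 5, 10):
    print(f"k={k:2d}: {at_least_one_success(0.4, k):.3f}")
```

Even this crude model shows why a few retries often beat heavy tweaking of one sample: a 40% single-shot model clears 90% within five independent attempts, though correlation between real samples makes the true curve flatter.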

Evidence and boundaries

Inside code generation, the paper’s evidence chain is reasonably complete: multi-model gains, a decode-only gap ablation, distribution analysis, and a stress test showing SSD still helps even when the self-generated training data contains wrong code. That supports the paper’s argument that the gain comes from distribution reshaping rather than simply better labels.

The main boundary is external validation. The code repo is public at github.com/apple/ml-ssd, but as of April 2026 the model checkpoints were not yet released, and broad independent replication had not happened. The paper also does not establish that the same mechanism generalizes beyond code generation. So this is best treated as a useful and plausible explanatory frame, not a fully settled conclusion.

Resources: paper on arXiv, HTML version, GitHub repo.