Moonshot AI’s Kimi Team released a technical report on March 15, 2026, challenging a fundamental component of the Transformer architecture that has existed for nearly a decade and is used by every mainstream large model: the residual connection.
In a standard PreNorm Transformer, each layer's update can be
simplified as: add the output of the current layer back to the
accumulated results of all previous layers. Mathematically,
h_l = h_{l-1} + f_l(h_{l-1}), applied layer by layer. This design was
originally introduced by ResNet to enable the training of deep
networks, and it has worked so well that for the past decade it has
been treated as a solved problem and rarely revisited.
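As a minimal sketch of this update rule (with a random projection standing in for a learned block; `layer_fn` is illustrative, not the paper's code):

```python
import numpy as np

def layer_fn(h, l):
    # stand-in for the l-th Transformer block f_l (attention + MLP);
    # a fixed random projection plays the role of learned weights
    rng = np.random.default_rng(l)
    W = rng.standard_normal((h.size, h.size)) / np.sqrt(h.size)
    return np.tanh(W @ h)

h = np.random.default_rng(0).standard_normal(16)  # h_0: embedding output
for l in range(1, 5):
    h = h + layer_fn(h, l)  # h_l = h_{l-1} + f_l(h_{l-1})
```

Note that every layer's output is added with the same implicit weight of 1; nothing in the rule lets a later layer rescale an earlier contribution.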
The outputs of all preceding layers are accumulated in a fixed, equally weighted manner. As the network grows deeper, the magnitude of this accumulated hidden state continues to grow, while the signal contributed by each individual layer accounts for an increasingly smaller proportion of this expanding total. The paper calls this phenomenon “hidden-state dilution.”
In signal-processing terms, the signal-to-noise ratio (SNR) of any individual layer's contribution decreases monotonically with depth. A key feature extracted by layer 3 has, by the time it reaches layer 40, been submerged under the accumulated output of 37 intervening layers, and no mechanism allows layer 40 to selectively amplify the signal from layer 3.
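The dilution is easy to reproduce numerically. In the toy simulation below, layer outputs are modeled as independent random vectors (a simplification), so the norm of the accumulated state grows roughly like √L and layer 3's relative share shrinks accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 40
# pretend each layer contributes an independent random output f_l
contribs = [rng.standard_normal(d) for _ in range(n_layers)]

h = np.zeros(d)
shares = []
for f_out in contribs:
    h = h + f_out
    # fraction of the accumulated state's magnitude that layer 3's
    # output could account for at this depth
    shares.append(np.linalg.norm(contribs[2]) / np.linalg.norm(h))

# layer 3's share keeps shrinking as depth grows: at depth 40 it is a
# small fraction of what it was at depth 5
```

Real layer outputs are of course correlated, but the qualitative trend is the same: with fixed equal weights, any single layer's signal is progressively diluted.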
The research team further points out that this fixed accumulation is structurally equivalent to a compressed, non-selective recurrence. This is precisely the core flaw exposed when Transformers replaced RNNs: RNNs compressed sequence information step by step in a fixed way, losing long-distance signals. Transformers solved this problem in the sequence dimension using attention. In the depth dimension (between layers), however, the same fixed-compression problem has always existed, accepted merely as a collateral cost of residual connections.
The core idea of the paper can be summarized in one sentence: Since attention solved the fixed recurrence problem in the sequence dimension, use the same attention to solve the fixed accumulation problem in the depth dimension.
Specifically: when each layer receives input, it performs a softmax attention over the outputs of all preceding layers, allowing the model to learn “which previous layers’ representations are most important for the current layer’s computation.”
$$\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot \mathbf{v}_i$$
In implementation, each layer has a learnable pseudo-query vector
w_l, and keys and values come from the representations of
previous layers after RMSNorm. This RMSNorm step is crucial because it
prevents layers with large output magnitudes from automatically
dominating the attention weight calculation. All pseudo-query vectors
are initialized to zero, so at the start of training, attention weights
are uniform, equivalent to standard residual connections, avoiding early
training instability.
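A minimal NumPy sketch of the mechanism described above, under those stated assumptions (the function name `attn_residual` and the use of one vector per layer are illustrative, not the paper's code):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def attn_residual(prev_outputs, w_l):
    """Aggregate the outputs of layers 0..l-1 with depth-wise attention.

    prev_outputs: per-layer output vectors v_0..v_{l-1}, each shape (d,)
    w_l: learnable pseudo-query for layer l (zero-initialized)
    """
    # keys and values are the RMS-normalized previous-layer outputs,
    # so no layer dominates purely through output magnitude
    vals = np.stack([rms_norm(v) for v in prev_outputs])  # (l, d)
    scores = vals @ w_l                                   # (l,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                  # softmax over depth
    return alpha @ vals  # weighted mix fed to the current layer

d = 8
prev = [np.random.default_rng(i).standard_normal(d) for i in range(5)]
h = attn_residual(prev, np.zeros(d))
# with a zero pseudo-query, all five weights equal 1/5: a uniform,
# residual-like mix of the normalized previous layers
```

With a nonzero w_l, layers whose normalized outputs align with the pseudo-query receive larger weights; the zero initialization recovers the uniform mix, which is what keeps early training stable.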
The parameter overhead is minimal: only one additional learned vector and one normalization per layer. Inference latency increases by less than 2%, and training overhead is below 4% when using pipeline parallelism.
Full AttnRes requires each layer to attend to all preceding layers, which incurs O(Ld) memory overhead in very deep networks (L layers, hidden size d). The paper proposes Block Attention Residuals: layers are grouped into N blocks, and cross-layer attention is performed over block-level summaries. Experiments found that about 8 blocks capture most of the gains, reducing memory to O(Nd).
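A sketch of the block variant, assuming mean-pooled block summaries (the exact summary operator is not specified here, so the pooling choice is an assumption):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def block_attn_residual(layer_outputs, n_blocks, w_l):
    # group the L per-layer outputs into n_blocks contiguous blocks and
    # keep one summary vector per block: memory is O(n_blocks * d)
    # instead of O(L * d)
    blocks = np.array_split(np.stack(layer_outputs), n_blocks)
    # ASSUMPTION: each block is summarized by the mean of its outputs
    summaries = np.stack([rms_norm(b.mean(axis=0)) for b in blocks])
    scores = summaries @ w_l            # attend over block summaries
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ summaries

d, L = 8, 32
outs = [np.random.default_rng(i).standard_normal(d) for i in range(L)]
h = block_attn_residual(outs, n_blocks=8, w_l=np.zeros(d))
```

As in the full variant, a zero pseudo-query yields a uniform mix over the 8 summaries, so the block version also starts out residual-like.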
Results are reported on the Kimi Linear architecture, a MoE model with 48B total / 3B active parameters trained on 1.4T tokens.
The interesting part of this paper is that the structural analogy is very clear: ten years ago, attention replaced fixed recurrence in the sequence dimension; now, the same tool is applied to the depth dimension to solve an isomorphic problem. Residual connections, being so fundamental and effective, fall into the typical blind spot of “it’s never been revisited because it’s always worked.”
The current validation scale (48B MoE / 3B active) is still small compared to GPT-5 or Claude-class models. Architectural changes often prove effective at medium scale, and history offers many counterexamples of advantages that fail to hold at ultra-large scale. This remains the biggest open question for this work.
Source: Moonshot AI / Kimi Team Technical Report, 2026-03-15