Moonshot AI’s Kimi Team released a technical report on March 15, 2026, challenging a fundamental component of the Transformer architecture that has existed for nearly a decade and is used by every mainstream large model: the residual connection.
In a standard PreNorm Transformer, each layer's update can be
simplified as: add the output of the current layer back to the
accumulated results of all previous layers. Mathematically,
h_l = h_{l-1} + f_l(h_{l-1}), applied layer by layer. This design was
originally introduced by ResNet to enable the training of deep
networks, and it has worked so well that for the past decade it has
been treated as a solved problem and rarely revisited.
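As a minimal sketch of this update rule (with a random projection standing in for a learned block; `layer_fn` is illustrative, not the paper's code):

```python
import numpy as np

def layer_fn(h, l):
    # stand-in for the l-th Transformer block f_l (attention + MLP);
    # a fixed random projection plays the role of learned weights
    rng = np.random.default_rng(l)
    W = rng.standard_normal((h.size, h.size)) / np.sqrt(h.size)
    return np.tanh(W @ h)

h = np.random.default_rng(0).standard_normal(16)  # h_0: embedding output
for l in range(1, 5):
    h = h + layer_fn(h, l)  # h_l = h_{l-1} + f_l(h_{l-1})
```

Note that every layer's output is added with the same implicit weight of 1; nothing in the rule lets a later layer rescale an earlier contribution.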
The outputs of all preceding layers are accumulated in a fixed, equally weighted manner. As the network grows deeper, the magnitude of this accumulated hidden state continues to grow, while the signal contributed by each individual layer accounts for an increasingly smaller proportion of this expanding total. The paper calls this phenomenon “hidden-state dilution.”
In signal-processing terms, the signal-to-noise ratio (SNR) of any individual layer's contribution decreases monotonically with depth. A key feature extracted by layer 3 has, by the time it reaches layer 40, been submerged under the accumulated output of 37 intervening layers, and no mechanism allows layer 40 to selectively amplify the signal from layer 3.
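The dilution is easy to reproduce numerically. In the toy simulation below, layer outputs are modeled as independent random vectors (a simplification), so the norm of the accumulated state grows roughly like √L and layer 3's relative share shrinks accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 40
# pretend each layer contributes an independent random output f_l
contribs = [rng.standard_normal(d) for _ in range(n_layers)]

h = np.zeros(d)
shares = []
for f_out in contribs:
    h = h + f_out
    # fraction of the accumulated state's magnitude that layer 3's
    # output could account for at this depth
    shares.append(np.linalg.norm(contribs[2]) / np.linalg.norm(h))

# layer 3's share keeps shrinking as depth grows: at depth 40 it is a
# small fraction of what it was at depth 5
```

Real layer outputs are of course correlated, but the qualitative trend is the same: with fixed equal weights, any single layer's signal is progressively diluted.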
The research team further points out that this fixed accumulation is structurally equivalent to a compressed, non-selective recurrence. This is precisely the core flaw exposed when Transformers replaced RNNs: RNNs compressed sequence information step by step in a fixed way, losing long-distance signals. Transformers solved this problem in the sequence dimension using attention. In the depth dimension (between layers), however, the same fixed-compression problem has always existed, accepted merely as a collateral cost of residual connections.
The core idea of the paper can be summarized in one sentence: Since attention solved the fixed recurrence problem in the sequence dimension, use the same attention to solve the fixed accumulation problem in the depth dimension.
Specifically: when each layer receives input, it performs a softmax attention over the outputs of all preceding layers, allowing the model to learn “which previous layers’ representations are most important for the current layer’s computation.”
$$\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot \mathbf{v}_i$$
In implementation, each layer has a learnable pseudo-query vector
w_l, and keys and values come from the representations of
previous layers after RMSNorm. This RMSNorm step is crucial because it
prevents layers with large output magnitudes from automatically
dominating the attention weight calculation. All pseudo-query vectors
are initialized to zero, so at the start of training, attention weights
are uniform, equivalent to standard residual connections, avoiding early
training instability.
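A minimal NumPy sketch of the mechanism described above, under those stated assumptions (the function name `attn_residual` and the use of one vector per layer are illustrative, not the paper's code):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def attn_residual(prev_outputs, w_l):
    """Aggregate the outputs of layers 0..l-1 with depth-wise attention.

    prev_outputs: per-layer output vectors v_0..v_{l-1}, each shape (d,)
    w_l: learnable pseudo-query for layer l (zero-initialized)
    """
    # keys and values are the RMS-normalized previous-layer outputs,
    # so no layer dominates purely through output magnitude
    vals = np.stack([rms_norm(v) for v in prev_outputs])  # (l, d)
    scores = vals @ w_l                                   # (l,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                  # softmax over depth
    return alpha @ vals  # weighted mix fed to the current layer

d = 8
prev = [np.random.default_rng(i).standard_normal(d) for i in range(5)]
h = attn_residual(prev, np.zeros(d))
# with a zero pseudo-query, all five weights equal 1/5: a uniform,
# residual-like mix of the normalized previous layers
```

With a nonzero w_l, layers whose normalized outputs align with the pseudo-query receive larger weights; the zero initialization recovers the uniform mix, which is what keeps early training stable.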
The parameter overhead is minimal: only one additional learned vector and one normalization per layer. Inference latency increases by less than 2%, and training overhead is below 4% when using pipeline parallelism.
Full AttnRes requires each layer to attend to all preceding layers, which incurs O(Ld) memory overhead in very deep networks (L layers, hidden size d). The paper proposes Block Attention Residuals: layers are grouped into N blocks, and cross-layer attention is performed over block-level summaries. Experiments found that about 8 blocks capture most of the gains, reducing memory to O(Nd).
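A sketch of the block variant, assuming mean-pooled block summaries (the exact summary operator is not specified here, so the pooling choice is an assumption):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def block_attn_residual(layer_outputs, n_blocks, w_l):
    # group the L per-layer outputs into n_blocks contiguous blocks and
    # keep one summary vector per block: memory is O(n_blocks * d)
    # instead of O(L * d)
    blocks = np.array_split(np.stack(layer_outputs), n_blocks)
    # ASSUMPTION: each block is summarized by the mean of its outputs
    summaries = np.stack([rms_norm(b.mean(axis=0)) for b in blocks])
    scores = summaries @ w_l            # attend over block summaries
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ summaries

d, L = 8, 32
outs = [np.random.default_rng(i).standard_normal(d) for i in range(L)]
h = block_attn_residual(outs, n_blocks=8, w_l=np.zeros(d))
```

As in the full variant, a zero pseudo-query yields a uniform mix over the 8 summaries, so the block version also starts out residual-like.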
Results are reported on the Kimi Linear architecture, a MoE model with 48B total / 3B active parameters trained on 1.4T tokens.
The interesting part of this paper is that the structural analogy is very clear: ten years ago, attention replaced fixed recurrence in the sequence dimension; now, the same tool is applied to the depth dimension to solve an isomorphic problem. Residual connections, being so fundamental and effective, fall into the typical blind spot of “it’s never been revisited because it’s always worked.”
The current validation scale (48B MoE / 3B active) is still small compared to GPT-5 or Claude-class models. Architectural changes often prove effective at medium scale, and history offers many counterexamples of advantages that fail to hold at ultra-large scale. This remains the biggest open question for this work.
Source: Moonshot AI / Kimi Team Technical Report, 2026-03-15