Authored by the Kimi Team at Moonshot AI (including Su Jianlin), this paper introduces a fundamental rethink of the "residual connection" by replacing fixed accumulation with learned softmax attention over the network's depth.

The Core Innovation: Standard Transformers accumulate layer outputs with fixed, equal weight: h_l = h_{l-1} + f_l(h_{l-1}). Attention Residuals (AttnRes) instead uses learned "pseudo-queries" that let each layer softmax-attend over the clean signals of all earlier layers and retrieve just the information it needs, addressing the "PreNorm Dilution" problem in which early signals are drowned out by later noise. The Block AttnRes variant, which groups layers before attending, makes the scheme practical for massive models.
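The contrast between the fixed sum and attention over depth can be sketched as follows. This is an illustrative toy implementation based only on the description above, not the paper's actual code: the function name, the per-layer pseudo-query vectors, and the tanh layer body are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depth_attention_residual(x, layer_weights, pseudo_queries):
    """Toy forward pass: rather than the fixed update h_l = h_{l-1} + f_l(h_{l-1}),
    each layer uses a learned pseudo-query to softmax-attend over the hidden
    states of all earlier depths (the "clean signals") and mixes them into its
    residual input.

    Shapes (illustrative): x is (dim,), layer_weights is a list of (dim, dim)
    matrices standing in for each layer's f_l, pseudo_queries is (num_layers, dim).
    """
    history = [x]  # hidden states from every earlier depth, starting with the input
    for l, W in enumerate(layer_weights):
        H = np.stack(history)            # (l+1, dim): all earlier clean signals
        scores = H @ pseudo_queries[l]   # (l+1,): pseudo-query scored against each depth
        weights = softmax(scores)        # learned attention over the depth axis
        residual = weights @ H           # (dim,): weighted mix replaces the plain h_{l-1}
        history.append(residual + np.tanh(W @ residual))  # toy stand-in for f_l
    return history[-1]
```

Note that a standard residual stream is recovered if the attention weights collapse onto the most recent depth; the learned pseudo-queries are what allow a late layer to reach back to an early, undiluted signal.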

Significance and Impact: The paper reports a 1.25x compute-efficiency gain, allowing models to match standard performance with 25% less training budget. Large reasoning improvements were also observed, including +7.5 points on GPQA-Diamond. Su Jianlin frames the idea as "Sequence-Depth Duality": bringing the learned routing already used along the sequence dimension to the previously fixed "plumbing" of the depth dimension. The community has hailed it as the "final piece of the Transformer puzzle," and the official repository saw immediate viral adoption.

Outputs

- Attention Residuals Paper (arXiv: 2603.15031)
- Attention Residuals Code (GitHub repository)
