"Post-LayerNorm Is Back: Stable, ExpressivE, and Deep." Revisits Post-LN Transformers — previously abandoned in favor of Pre-LN due to training instability at scale — and attributes the failure mode specifically to the ResNet-style residual path. Keel replaces the residual path with a Highway-style connection that preserves gradient flow from top to bottom layers, eliminating signal vanishing.

Keel enables stable Post-LN training at depths beyond 1000 layers without specialized initialization or optimization tricks, recovering Post-LN's expressivity advantage. Foundational architecture work for very deep Transformers. By Chen Chen and Lai Wei (ByteDance Seed).

Paper

foundational, architecture, scaling