Reasoning with Latent Thoughts: On the Power of Looped Transformers
Demonstrates that looped Transformers — where a k-layer model is looped L times — can nearly match a kL-layer non-looped model on reasoning tasks while using far fewer parameters. A 12-layer model looped twice (12⊗2) outperforms a 24-layer 1B baseline on math word problems (34.3% vs 29.3%) and reasoning primitives (51.2% vs 47.5%), despite higher perplexity (7.90 vs 7.40).
Provides theoretical foundations showing looped models implicitly generate intermediate reasoning steps analogous to chain-of-thought: a model with m loops can simulate m steps of CoT reasoning. On synthetic tasks, single-layer models looped to target depth achieve near-perfect accuracy (99.6% on 32-operand addition, 99.5% on 32-hop induction). Accuracy scales logarithmically with effective depth, with looped models showing 1.19× higher scaling coefficient than equivalently deep non-looped models. Published at ICLR 2025.
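The core architectural idea — reusing one k-layer block L times to get effective depth kL at a k-layer parameter budget — can be sketched in a few lines. This is a hypothetical toy (plain numpy matrices standing in for Transformer blocks, names like `looped_forward` are my own), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim):
    """One 'layer': a single weight matrix as a stand-in for a Transformer
    block. Toy illustration only, not the paper's architecture."""
    return rng.standard_normal((dim, dim)) / np.sqrt(dim)

def looped_forward(x, blocks, loops):
    """Apply the same k-block stack `loops` times (weights tied across loops).
    Effective depth is k * loops; parameter count stays at k blocks."""
    for _ in range(loops):
        for w in blocks:
            x = np.tanh(x @ w)
    return x

dim, k, L = 8, 12, 2                      # a 12-layer block looped twice (12 x 2)
blocks = [make_block(dim) for _ in range(k)]
y = looped_forward(rng.standard_normal(dim), blocks, L)

n_params_looped = k * dim * dim           # looped model: k blocks of weights
n_params_deep = k * L * dim * dim         # non-looped k*L-layer model
print(n_params_looped, n_params_deep)     # here the looped model has half the parameters
```

The only change from a standard stack is the outer `for _ in range(loops)` with shared weights; the paper's 12⊗2 configuration corresponds to k=12, L=2.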