A new axis of sparsity for large language models. Builds upon findings from Geva et al., "Transformer Feed-Forward Layers Are Key-Value Memories" (2021).
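For context, Geva et al. cast each feed-forward layer as an unnormalized key-value memory (a sketch of their formulation, not of this paper's method): FF(x) = f(x · K^T) · V, where the rows of K act as keys matched against the input x, the rows of V as value vectors, and f is the activation; the layer output is a sum of values weighted by key activations.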

Paper

arXiv: 2601.07372

architecture · sparsity · research