The paper that introduced the Transformer architecture, which dispenses with recurrence and convolutions entirely in favor of self-attention. Its multi-head attention mechanism lets the model jointly attend to information from different representation subspaces at different positions.
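
As a rough illustration of the mechanism described above, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. The shapes, parameter names (d_model, num_heads), and random weight initialization are illustrative assumptions for the sketch, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 in the paper).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, num_heads, rng):
    # Project X into num_heads lower-dimensional subspaces, attend in each,
    # then concatenate and project back. Weights are random here purely for
    # illustration; a real model learns them.
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    heads = []
    for _ in range(num_heads):
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))

    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 64))   # 10 tokens, model dimension 64
out = multi_head_attention(X, num_heads=8, rng=rng)
print(out.shape)                    # (10, 64)
```

Each head sees only a d_k-dimensional projection, so the heads can specialize on different positional or semantic relationships before their outputs are recombined.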

The Transformer is the single most influential architecture in modern AI: virtually every frontier language model (GPT, Claude, Gemini, Llama, DeepSeek), vision transformer, and multimodal model builds on this foundation. It is one of the most-cited papers in computer science history, with roughly 140,000 citations. Published at NeurIPS 2017 by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin.

Paper

arXiv: 1706.03762

Venue: NeurIPS 2017

Tags: foundational

Related