Attention Is All You Need (Transformer)
The paper that introduced the Transformer architecture, dispensing with recurrence and convolutions entirely in favor of self-attention. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions.
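The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, which multi-head attention applies in parallel over several learned projections. A minimal NumPy sketch of the single-head computation (shapes and names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (seq, seq) similarity
    weights = softmax(scores)                         # rows sum to 1
    return weights @ V                                # weighted sum of values

# Toy example: 4 positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

In the full model, h heads each run this on separately projected Q, K, V slices; their outputs are concatenated and projected back to the model dimension.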
The Transformer is the single most influential architecture in modern AI. Every frontier language model (GPT, Claude, Gemini, Llama, DeepSeek), every vision transformer, and every multimodal model is built on this foundation. It is one of the most-cited papers in computer science history, with over 140,000 citations. Published at NeurIPS 2017 by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin.
Paper
arXiv: 1706.03762
Venue: NeurIPS 2017