Vision Transformer (ViT)
Paper: "An Image Is Worth 16x16 Words." Applies a pure Transformer directly to sequences of image patches for classification, with no convolutional layers. Achieved 88.55% top-1 accuracy on ImageNet when pre-trained on large datasets.
ViT showed that CNNs are not necessary for state-of-the-art vision, opening the door to unified architectures for text and images. Foundational for many modern vision models, including CLIP, DALL-E, Stable Diffusion, and multimodal LLMs. ICLR 2021. 30K+ citations. By Dosovitskiy et al. Apache 2.0.
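The "16x16 words" idea above can be sketched in a few lines: split the image into fixed-size patches, flatten each, and linearly project it into a token embedding. This is a minimal NumPy sketch under assumed shapes (224x224 RGB input, 768-d embeddings), not the authors' implementation; the projection matrix here is random, whereas in ViT it is learned.

```python
import numpy as np

def image_to_patch_tokens(img, patch=16, dim=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into (H/patch)*(W/patch) flattened patches,
    then linearly project each patch to `dim` dimensions."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))
    # Random stand-in for the learned patch-embedding matrix.
    E = rng.standard_normal((patch * patch * C, dim)) * 0.02
    return patches @ E  # (num_patches, dim) token sequence

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each a 768-d token
```

The resulting token sequence (plus a class token and position embeddings, omitted here) is what a standard Transformer encoder consumes.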
Paper
arXiv: 2010.11929
Venue: ICLR 2021