Vision Transformer (ViT)
Paper: "An Image Is Worth 16x16 Words." Applies a pure Transformer directly to sequences of image patches for classification, with no convolutional layers. Achieved 88.55% top-1 accuracy on ImageNet when pre-trained on large datasets.
ViT showed that CNNs are not necessary for state-of-the-art vision, opening the door to unified architectures for text and images. Foundational for many modern vision models, including CLIP, DALL-E, Stable Diffusion, and multimodal LLMs. ICLR 2021. 30K+ citations. By Dosovitskiy et al. Apache 2.0.
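The "16x16 words" idea above can be sketched in a few lines: split the image into fixed-size patches, flatten each, and linearly project it into a token embedding. This is a minimal NumPy sketch under assumed shapes (224x224 RGB input, 768-d embeddings), not the authors' implementation; the projection matrix here is random, whereas in ViT it is learned.

```python
import numpy as np

def image_to_patch_tokens(img, patch=16, dim=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into (H/patch)*(W/patch) flattened patches,
    then linearly project each patch to `dim` dimensions."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))
    # Random stand-in for the learned patch-embedding matrix.
    E = rng.standard_normal((patch * patch * C, dim)) * 0.02
    return patches @ E  # (num_patches, dim) token sequence

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each a 768-d token
```

The resulting token sequence (plus a class token and position embeddings, omitted here) is what a standard Transformer encoder consumes.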
Paper
arXiv: 2010.11929
Venue: ICLR 2021