"Learning Transferable Visual Models From Natural Language Supervision" — Contrastive Language-Image Pre-training (CLIP), trained on 400M image-text pairs collected from the internet. Uses a dual-encoder architecture (a ViT or ResNet image encoder and a Transformer text encoder) with a contrastive objective that aligns visual and textual representations in a shared embedding space.
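The contrastive objective can be sketched as a symmetric cross-entropy over a batch's pairwise cosine similarities: matched image-text pairs sit on the diagonal of the similarity matrix and serve as the targets in both directions. A minimal NumPy sketch (the function name and the fixed temperature of 0.07 are illustrative; CLIP learns the temperature as a parameter):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) embeddings."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Correctly matched pairs drive the loss toward zero; mismatched pairings raise it, which is what pushes paired images and captions together in the shared space.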

CLIP became foundational infrastructure for the text-to-image generation ecosystem: DALL-E 2 and Stable Diffusion condition on CLIP text embeddings, and many other generation systems (reportedly including Midjourney) build on them. Enabled zero-shot image classification competitive with supervised models. ~30K citations. ICML 2021. By Radford, Kim, Hallacy, Ramesh, et al. MIT License.
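Zero-shot classification works by embedding a text prompt per class (e.g. "a photo of a dog") and picking the class whose text embedding is most similar to the image embedding. A minimal sketch assuming the encoder outputs are already computed (the function name and prompt template are illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose prompt embedding is most cosine-similar to the image."""
    # Normalize so dot products equal cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one similarity score per class
    return class_names[int(np.argmax(sims))]
```

No task-specific training is needed: changing the label set only means embedding a new list of prompts, which is what makes the approach zero-shot.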

Paper

arXiv: 2103.00020

Venue: ICML 2021

open-source · vision · foundational · multimodal
