CLIP
Paper: "Learning Transferable Visual Models From Natural Language Supervision" — Contrastive Language-Image Pre-training (CLIP), trained on 400M image-text pairs scraped from the internet. Uses a dual-encoder architecture (ViT or ResNet for images, Transformer for text) with a contrastive objective that aligns visual and textual representations in a shared embedding space.
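A minimal numpy sketch of the symmetric contrastive objective described above: cosine similarities between all image-text pairs in a batch form an N×N logit matrix, and cross-entropy pulls the matched (diagonal) pairs together in both directions. Function names and the fixed temperature are illustrative; the paper learns the temperature as a parameter.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over an (N, N) similarity matrix.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (N, N)

    # Cross-entropy with the diagonal (matched pairs) as targets,
    # averaged over the image->text and text->image directions.
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; mismatched batches score higher.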
CLIP became foundational infrastructure for the text-to-image generation ecosystem: DALL-E 2, Stable Diffusion, Midjourney, and many other image-generation models rely on CLIP (or CLIP-style) embeddings. It also enabled zero-shot image classification competitive with supervised baselines. ~30K+ citations. ICML 2021. By Radford, Kim, Hallacy, Ramesh, et al. MIT License.
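Zero-shot classification follows directly from the shared embedding space: embed one text prompt per class (e.g. "a photo of a {label}"), then pick the class whose text embedding is most similar to the image embedding. A toy sketch with placeholder embeddings; real use would run CLIP's image and text encoders to produce them.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose prompt embedding has the highest
    cosine similarity to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # one similarity per class
    return class_names[int(np.argmax(sims))]
```

No task-specific training is needed: swapping in a different list of class prompts re-targets the same model to a new classification task.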
Paper
arXiv: 2103.00020
Venue: ICML 2021