"Zero-Shot Text-to-Image Generation" — a 12B-parameter autoregressive Transformer that generates images from text descriptions. It uses a discrete variational autoencoder (dVAE) to compress each image into a 32×32 grid of visual tokens, then models the text and image tokens jointly as a single sequence.
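The joint sequence construction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 256-token text budget and the 32×32 = 1024 image-token grid match the paper, but the padding scheme and the `build_sequence` helper are assumptions for illustration.

```python
TEXT_LEN = 256            # text positions (BPE tokens, padded) — per the paper
GRID = 32                 # dVAE latent grid side
IMAGE_LEN = GRID * GRID   # 1024 image tokens per image

def build_sequence(text_tokens, image_tokens, pad_id=0):
    """Concatenate padded text tokens with dVAE image tokens into one
    stream, so a single autoregressive Transformer models them jointly.
    (Illustrative helper; pad_id handling is an assumption.)"""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("text prompt exceeds the text token budget")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError("expected a full 32x32 grid of image tokens")
    padded_text = list(text_tokens) + [pad_id] * (TEXT_LEN - len(text_tokens))
    return padded_text + list(image_tokens)

# A 3-token prompt followed by a full image grid -> 1280 positions total.
seq = build_sequence([5, 17, 42], list(range(IMAGE_LEN)))
assert len(seq) == TEXT_LEN + IMAGE_LEN
```

At generation time the model is conditioned on the text portion and samples the 1024 image tokens one at a time, which the dVAE decoder then maps back to pixels.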

DALL-E pioneered text-to-image generation at scale, demonstrating that Transformers could generate coherent, creative images from natural language prompts — from "an armchair in the shape of an avocado" to photorealistic scenes. ICML 2021. By Ramesh, Pavlov, Goh, Gray et al. Proprietary.

Model Details

Architecture: Dense
Parameters: 12B

Paper

arXiv: 2102.12092

Venue: ICML 2021

Tags: vision, foundational, multimodal
