"Zero-Shot Text-to-Image Generation" — a 12B-parameter autoregressive Transformer that generates images from text descriptions. It uses a discrete variational autoencoder (dVAE) to compress each image into a 32×32 grid of visual tokens, then models the text and image tokens jointly as a single sequence.
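The joint sequence construction can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 256-token text budget and the 32×32 = 1024 image-token grid match the paper, but the padding scheme and the `build_sequence` helper are assumptions for illustration.

```python
TEXT_LEN = 256            # text positions (BPE tokens, padded) — per the paper
GRID = 32                 # dVAE latent grid side
IMAGE_LEN = GRID * GRID   # 1024 image tokens per image

def build_sequence(text_tokens, image_tokens, pad_id=0):
    """Concatenate padded text tokens with dVAE image tokens into one
    stream, so a single autoregressive Transformer models them jointly.
    (Illustrative helper; pad_id handling is an assumption.)"""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("text prompt exceeds the text token budget")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError("expected a full 32x32 grid of image tokens")
    padded_text = list(text_tokens) + [pad_id] * (TEXT_LEN - len(text_tokens))
    return padded_text + list(image_tokens)

# A 3-token prompt followed by a full image grid -> 1280 positions total.
seq = build_sequence([5, 17, 42], list(range(IMAGE_LEN)))
assert len(seq) == TEXT_LEN + IMAGE_LEN
```

At generation time the model is conditioned on the text portion and samples the 1024 image tokens one at a time, which the dVAE decoder then maps back to pixels.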

DALL-E pioneered text-to-image generation at scale, demonstrating that Transformers could generate coherent, creative images from natural language prompts — from "an armchair in the shape of an avocado" to photorealistic scenes. ICML 2021. By Ramesh, Pavlov, Goh, Gray et al. Proprietary.

Model Details

Architecture: Dense
Parameters: 12B

Paper

arXiv: 2102.12092

Venue: ICML 2021

Tags: vision, foundational, multimodal
