Emu3.5 is a large-scale native multimodal world model that predicts the next vision-language state, pre-trained end-to-end on over 10T interleaved vision-language tokens from internet videos. It supports long-horizon vision-language generation, any-to-image (X2I) synthesis, and text-rich image creation. Discrete Diffusion Adaptation (DiDA) makes per-image inference roughly 20x faster, with performance comparable to Gemini 2.5 Flash Image on generation and editing tasks.
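The ~20x speedup from DiDA comes from replacing token-by-token autoregressive decoding with a discrete-diffusion decoder that refines all image tokens in parallel over a fixed number of denoising steps. The sketch below only counts forward passes to show where the saving comes from; the token count and step count are illustrative assumptions, not figures from the model card, and the measured ~20x wall-clock gain is smaller than the raw pass-count ratio because each parallel pass is more expensive than a single-token pass.

```python
# Toy forward-pass count: autoregressive decoding vs. parallel
# discrete-diffusion decoding (the idea behind DiDA).
# All numbers are illustrative assumptions.

def autoregressive_passes(num_image_tokens: int) -> int:
    # One forward pass per generated image token.
    return num_image_tokens

def diffusion_passes(num_denoise_steps: int) -> int:
    # One forward pass per denoising step, independent of token count,
    # since every token is refined in parallel each step.
    return num_denoise_steps

tokens = 4096   # assumed discrete tokens per image
steps = 16      # assumed denoising steps after adaptation

speedup = autoregressive_passes(tokens) / diffusion_passes(steps)
print(f"forward-pass ratio: {speedup:.0f}x")  # 256x in passes, not wall-clock
```

The gap between the 256x pass-count ratio here and the reported ~20x reflects the higher per-pass cost of full-sequence parallel prediction.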

Model Details

Variants

Name                    Parameters  Notes
Emu3.5                  —           General-purpose multimodal, interleaved image-text
Emu3.5-Image            —           Specialized for T2I/X2I tasks
Emu3.5-VisionTokenizer  —           Vision encoding component
Tags: multimodal, generation, open-weight, embodied
