Emu3.5 is a large-scale native multimodal world model that predicts the next vision-language state, pre-trained end-to-end on over 10T interleaved vision-language tokens from internet videos. It supports long-horizon vision-language generation, any-to-image (X2I) synthesis, and text-rich image creation. Discrete Diffusion Adaptation (DiDA) makes per-image inference roughly 20x faster, with performance comparable to Gemini 2.5 Flash Image on generation and editing tasks.
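The ~20x speedup from DiDA comes from replacing token-by-token autoregressive decoding with a discrete-diffusion decoder that refines all image tokens in parallel over a fixed number of denoising steps. The sketch below only counts forward passes to show where the saving comes from; the token count and step count are illustrative assumptions, not figures from the model card, and the measured ~20x wall-clock gain is smaller than the raw pass-count ratio because each parallel pass is more expensive than a single-token pass.

```python
# Toy forward-pass count: autoregressive decoding vs. parallel
# discrete-diffusion decoding (the idea behind DiDA).
# All numbers are illustrative assumptions.

def autoregressive_passes(num_image_tokens: int) -> int:
    # One forward pass per generated image token.
    return num_image_tokens

def diffusion_passes(num_denoise_steps: int) -> int:
    # One forward pass per denoising step, independent of token count,
    # since every token is refined in parallel each step.
    return num_denoise_steps

tokens = 4096   # assumed discrete tokens per image
steps = 16      # assumed denoising steps after adaptation

speedup = autoregressive_passes(tokens) / diffusion_passes(steps)
print(f"forward-pass ratio: {speedup:.0f}x")  # 256x in passes, not wall-clock
```

The gap between the 256x pass-count ratio here and the reported ~20x reflects the higher per-pass cost of full-sequence parallel prediction.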

Model Details

Variants

Name                    Parameters  Notes
Emu3.5                  —           General-purpose multimodal, interleaved image-text
Emu3.5-Image            —           Specialized for T2I/X2I tasks
Emu3.5-VisionTokenizer  —           Vision encoding component
Tags: multimodal, generation, open-weight, embodied
