Emu3
Paper: A native multimodal model trained solely via next-token prediction, demonstrating that a single architecture can match task-specific methods across both generation and perception. Published in the main issue of Nature, making its authors the second Chinese large-model team (after DeepSeek) to achieve this, with China's first Nature paper on the multimodal large-model route.
Outputs (2)
Emu3: Next-Token Prediction is All You Need
Paper: Research showing that native multimodal models can be trained solely via next-token prediction.
arXiv: 2409.18869
Multimodal learning with next-token prediction (Nature)
Paper: The Emu3 research published in the main issue of Nature, demonstrating that multimodal models trained solely via next-token prediction can match task-specific methods across generation and perception. It shows coherent high-fidelity video generation, interleaved vision-language generation, and vision-language-action modelling for robotic manipulation.
arXiv: 2409.18869
Venue: Nature
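The core idea of "next-token prediction is all you need" can be sketched in a few lines. The sketch below is my own illustration, not Emu3's actual code: it assumes images are first quantized into discrete codebook indices (e.g. by a VQ tokenizer), offsets them into a shared vocabulary alongside text tokens, and forms the standard shifted input/target pair that a causal language model trains on. The vocabulary sizes and the begin/end-of-image markers are hypothetical.

```python
# Minimal sketch of a single next-token-prediction objective over
# interleaved text and image tokens (illustrative; not Emu3's code).

TEXT_VOCAB = 32000                # assumed text vocabulary size
IMAGE_VOCAB = 8192                # assumed visual codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB    # hypothetical begin-of-image marker
EOI = BOI + 1                     # hypothetical end-of-image marker

def image_to_shared_ids(vq_codes):
    # Offset visual codebook indices so they don't collide with text ids.
    return [TEXT_VOCAB + c for c in vq_codes]

def interleave(text_ids, vq_codes):
    # One flat token sequence: text, then the image wrapped in markers.
    return text_ids + [BOI] + image_to_shared_ids(vq_codes) + [EOI]

def next_token_targets(seq):
    # Inputs are seq[:-1], targets are seq[1:] — the entire training signal,
    # identical for text tokens and image tokens.
    return seq[:-1], seq[1:]

if __name__ == "__main__":
    text = [5, 17, 302]        # toy text token ids
    image = [0, 4091, 8191]    # toy VQ codebook indices
    seq = interleave(text, image)
    inputs, targets = next_token_targets(seq)
    print(seq)                 # one shared-vocabulary sequence
```

Because both modalities live in one sequence under one loss, the same decoder can be prompted for text-to-image, image-to-text, or interleaved generation without task-specific heads.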