Extensive scaling-laws study spanning 457 trained models that examines native multimodal models (NMMs) trained from scratch on all modalities. Finds no inherent advantage to late-fusion over early-fusion architectures: early fusion performs better at lower parameter counts, is more efficient to train, and is easier to deploy.
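
A minimal sketch of the architectural distinction, assuming a toy PyTorch setup (module names, dimensions, and layer counts are illustrative, not the paper's): early fusion feeds projected image patches and text tokens to one shared transformer, while late fusion runs a separate vision encoder whose features are passed through a connector into the language model.

```python
# Illustrative sketch only; not the paper's code. Dimensions and layers are assumptions.
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """One shared transformer consumes image patches and text tokens jointly."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab=1000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # raw patches -> token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_ids, patches):
        tokens = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        return self.backbone(tokens)  # single sequence, single parameter budget


class LateFusion(nn.Module):
    """A separate vision encoder feeds the language model through a connector."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, vocab=1000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        vis_layer = nn.TransformerEncoderLayer(patch_dim, n_heads, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(vis_layer, n_layers)
        self.connector = nn.Linear(patch_dim, d_model)  # maps vision features to LM space
        txt_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.language_model = nn.TransformerEncoder(txt_layer, n_layers)

    def forward(self, text_ids, patches):
        vis = self.connector(self.vision_encoder(patches))
        tokens = torch.cat([vis, self.text_embed(text_ids)], dim=1)
        return self.language_model(tokens)


if __name__ == "__main__":
    text_ids = torch.randint(0, 1000, (2, 16))
    patches = torch.randn(2, 49, 768)
    print(EarlyFusion()(text_ids, patches).shape)  # (2, 65, 256)
    print(LateFusion()(text_ids, patches).shape)   # (2, 65, 256)
```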

Also shows that incorporating a Mixture of Experts (MoE) allows models to learn modality-specific weights, which significantly benefits multimodal performance. ICCV 2025 Oral. By Shukor, Fini, Turrisi da Costa, Cord, Susskind, and El-Nouby.
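
A minimal sketch of a token-level MoE feed-forward layer with a learned top-1 router, again assuming a toy PyTorch setup (expert count, hidden sizes, and the routing scheme are illustrative, not the paper's): with image and text tokens mixed in one sequence, the router is free to send each modality to different experts, which is the sense in which experts acquire modality-specific weights without explicit supervision.

```python
# Illustrative sketch only; not the paper's implementation.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned routing logits per token
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.size(-1))
        gates = self.router(flat).softmax(dim=-1)   # (tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Image tokens and text tokens may end up routed to different experts,
                # i.e. the layer learns modality-specific weights implicitly.
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(flat[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    tokens = torch.randn(2, 65, 256)  # e.g. fused image + text tokens
    print(MoEFeedForward()(tokens).shape)  # (2, 65, 256)
```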

Paper

arXiv: 2504.07951

Venue: ICCV 2025

Tags: multimodal, research, foundational