Scaling Laws for Native Multimodal Models
Extensive scaling-laws study spanning 457 trained models, examining native multimodal models (NMMs) trained from scratch on all modalities. Finds no inherent advantage to late-fusion over early-fusion architectures: early fusion performs better at lower parameter counts, is more efficient to train, and is easier to deploy.
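For context, a hedged sketch of the parametric form that scaling-law studies of this kind commonly fit (an additive Chinchilla-style loss in parameter count N and training tokens D); the paper's exact parameterization and fitted coefficients may differ and are not reproduced here.

```latex
% Illustrative only: a standard parametric scaling law, not necessarily the
% exact form or coefficients used in this paper.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
where $N$ is the parameter count, $D$ the number of training tokens, and
$E, A, B, \alpha, \beta$ are fitted constants. Minimizing $L$ under a fixed
compute budget $C \approx 6ND$ gives compute-optimal allocations
\[
  N^{*}(C) \propto C^{\beta/(\alpha+\beta)}, \qquad
  D^{*}(C) \propto C^{\alpha/(\alpha+\beta)}.
\]
\end{document}
```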
Shows that incorporating Mixture-of-Experts (MoE) layers allows models to learn modality-specific weights, significantly benefiting multimodal performance. ICCV 2025 Oral. By Shukor, Fini, Turrisi da Costa, Cord, Susskind, and El-Nouby.
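A minimal, hypothetical sketch of the mechanism (not the paper's implementation): a token-level sparse-MoE feed-forward layer in PyTorch. With interleaved image and text tokens, the learned router is free to send tokens of different modalities to different experts, which is how modality-specific weights can emerge without being hard-coded.

```python
# Hypothetical sketch of a sparse MoE feed-forward layer; names and
# hyperparameters are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) containing interleaved image and text tokens
        scores = self.router(x)                             # (B, S, n_experts)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize top-k weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: drop in place of the dense FFN inside a transformer block.
moe = SparseMoE(d_model=512, d_ff=2048)
tokens = torch.randn(2, 16, 512)   # mixed-modality token sequence
print(moe(tokens).shape)           # torch.Size([2, 16, 512])
```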
arXiv: 2504.07951
Venue: ICCV 2025