Single-transformer baseline for multimodal understanding that simplifies the vision-language model architecture. Published at ICML 2025.

Paper

arXiv: 2503.14694

Venue: ICML 2025

Tags: multimodal, vision, research