A single-transformer baseline for multimodal understanding that simplifies vision-language model architecture. Published at ICML 2025.

Paper

Venue: ICML 2025
multimodal, vision, research