InternVL3 introduces native joint multimodal pre-training, variable visual position encoding (V2PE), mixed preference optimization (MPO), and test-time scaling. InternVL3-78B scores 72.2 on MMMU, competitive with GPT-4o and Claude 3.5 Sonnet.

Model Details

Architecture: Dense

Variants

Name            Parameters   Notes
InternVL3-1B    1B
InternVL3-8B    8B
InternVL3-38B   38B
InternVL3-78B   78B

Paper

arXiv: 2504.10479

multimodal, open-weight, vision, frontier, reasoning

Related