Scientists' First Exam
Multimodal evaluation benchmark probing scientific cognitive abilities across three levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Comprises 830 expert-verified VQA pairs spanning 66 multimodal tasks across 5 high-value disciplines: astronomy, chemistry, earth science, life science, and materials science.
Questions are constructed from native disciplinary data formats (molecular structures, spectra, radar charts) with an average of 2.3 scientific images per question (up to 18). Prompts are bilingual (English/Chinese). GPT-o3 scores only 34.08% and InternVL-3 only 26.52% on the benchmark, highlighting a large gap between what current MLLMs can do and what scientific reasoning requires. By the ADLab team at Shanghai AI Laboratory.