MedXpertQA | Lab Index

A challenging benchmark for expert-level medical reasoning and understanding: 4,460 questions spanning 17 specialties and 11 body systems, sourced from professional medical exams and filtered to stress difficulty and reduce data leakage. It has two subsets — Text (text-only) and MM (multimodal, with clinical images) — and two question types, Reasoning (diagnostic/clinical reasoning) and Understanding (medical knowledge).

Frontier models still find it hard: on the multimodal subset OpenAI's o1 reaches only 56.3% (GPT-4o 42.8%), and on the text subset reasoning models such as DeepSeek-R1 (37.8%) and o3-mini (37.3%) lead but remain far from saturation, which makes the benchmark a useful probe for medical reasoning in thinking models. It is integrated into the OpenCompass evaluation framework, appears at ICML 2025, and is MIT-licensed. By the Tsinghua C3I group (lead authors Yuxin Zuo, Shang Qu) and Shanghai AI Laboratory (senior author Bowen Zhou).

Project page · Paper (arXiv) · GitHub · Dataset (HuggingFace)

Paper

Venue ICML 2025


Authors: Yuxin Zuo · Shang Qu · Yifei Li · Zhangren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou

Evaluation Details

Questions 4,460

Scoring Multiple-choice accuracy across the Text and Multimodal (MM) subsets and the Reasoning / Understanding question types

Used in: OpenCompass
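The scoring described above amounts to exact-match accuracy on the chosen option letter, reported separately for each subset and question type. The following is a minimal sketch under stated assumptions, not the official evaluation code and not OpenCompass's implementation: the record fields (`subset`, `question_type`, `label`, `prediction`) and the helper `accuracy_by_group` are hypothetical names chosen for illustration, and the records are toy data rather than benchmark results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(subset, question_type): accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      subset        -- "Text" or "MM"
      question_type -- "Reasoning" or "Understanding"
      label         -- gold option letter, e.g. "A"
      prediction    -- option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["subset"], r["question_type"])
        total[key] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if r["prediction"].strip().upper() == r["label"].strip().upper():
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy records for illustration only; these are not benchmark data.
records = [
    {"subset": "Text", "question_type": "Reasoning", "label": "A", "prediction": "A"},
    {"subset": "Text", "question_type": "Reasoning", "label": "C", "prediction": "B"},
    {"subset": "Text", "question_type": "Understanding", "label": "D", "prediction": "d"},
    {"subset": "MM", "question_type": "Reasoning", "label": "E", "prediction": "E"},
]

scores = accuracy_by_group(records)
for (subset, qtype), acc in sorted(scores.items()):
    print(f"{subset:5s} {qtype:14s} {acc:.1%}")
```

Keeping the groups separate matters here because the Text and MM subsets differ widely in difficulty, so a single pooled accuracy would hide which subset a model struggles with.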

Top Scores

Model	Score	Date
o1 (multimodal subset)	56.3%	2024-12
DeepSeek-R1 (text subset)	37.8%	2025-01
o3-mini (text subset)	37.3%	2025-01


Tags: benchmark, evaluation, reasoning, multimodal, medical
