MedXpertQA
A challenging benchmark for expert-level medical reasoning and understanding: 4,460 questions spanning 17 specialties and 11 body systems, sourced from professional medical exams and filtered to stress difficulty and reduce data leakage. It has two subsets, Text (text-only) and MM (multimodal, with clinical images), and two question types, Reasoning (diagnostic/clinical reasoning) and Understanding (medical knowledge).
Frontier models still find it hard. On the multimodal subset, OpenAI's o1 reaches only 56.3% (GPT-4o: 42.8%); on the text subset, reasoning models such as DeepSeek-R1 (37.8%) and o3-mini (37.3%) lead but remain far from saturation. This makes it a useful probe for medical reasoning in thinking models. The benchmark is integrated into the OpenCompass evaluation framework, appeared at ICML 2025, and is MIT-licensed. It was developed by the Tsinghua C3I group (lead authors Yuxin Zuo and Shang Qu) and Shanghai AI Laboratory (senior author Bowen Zhou).
Top Scores
| Model | Score | Date |
|---|---|---|
| o1 (multimodal subset) | 56.3% | 2024-12 |
| DeepSeek-R1 (text subset) | 37.8% | 2025-01 |
| o3-mini (text subset) | 37.3% | 2025-01 |
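
The scores in this table are reported separately for the Text and MM subsets. As a rough illustration of that bookkeeping, the sketch below computes percent-correct per subset and per question type. It assumes answers are graded by exact match on an option letter, and the record fields (`subset`, `question_type`, `label`, `prediction`) and the toy records are hypothetical; they are not the official data schema or the OpenCompass implementation.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: percent correct}, grouping records by the field `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        # Assumption: grading is a case-insensitive exact match on the option letter.
        if r["prediction"].strip().upper() == r["label"].strip().upper():
            correct[group] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Hypothetical toy records for illustration only; not real MedXpertQA items.
records = [
    {"subset": "Text", "question_type": "Reasoning", "label": "C", "prediction": "C"},
    {"subset": "Text", "question_type": "Understanding", "label": "A", "prediction": "B"},
    {"subset": "MM", "question_type": "Reasoning", "label": "E", "prediction": "e "},
    {"subset": "MM", "question_type": "Reasoning", "label": "B", "prediction": "B"},
]

by_subset = accuracy_by_group(records, "subset")
by_type = accuracy_by_group(records, "question_type")
print(by_subset)
print(by_type)
```

Keeping the grouping field as a parameter lets the same function produce both the per-subset breakdown used in the table and a Reasoning versus Understanding breakdown.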