A challenging benchmark for expert-level medical reasoning and understanding: 4,460 questions spanning 17 specialties and 11 body systems, sourced from professional medical exams and filtered to stress difficulty and reduce data leakage. It has two subsets — Text (text-only) and MM (multimodal, with clinical images) — and two question types, Reasoning (diagnostic/clinical reasoning) and Understanding (medical knowledge).

Frontier models still find it hard. On the multimodal subset, OpenAI's o1 reaches only 56.3% accuracy (GPT-4o: 42.8%). On the text subset, reasoning models such as DeepSeek-R1 (37.8%) and o3-mini (37.3%) lead but remain far from saturation, which makes the benchmark a useful probe for medical reasoning in thinking models. It is integrated into the OpenCompass evaluation framework, was published at ICML 2025, and is MIT-licensed. It was built by the Tsinghua C3I group (lead authors Yuxin Zuo, Shang Qu) and Shanghai AI Laboratory (senior author Bowen Zhou).

Paper

Venue: ICML 2025
Authors: Yuxin Zuo · Shang Qu · Yifei Li · Zhangren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou

Evaluation Details

Questions: 4,460
Scoring: Multiple-choice accuracy, reported for the Text and Multimodal (MM) subsets and for the Reasoning / Understanding question types
Used in: OpenCompass
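
The scoring rule above amounts to exact-match accuracy on the chosen option letter, grouped by subset and by question type. The following is a minimal sketch of that computation, not the benchmark's official scorer or the OpenCompass API. The record fields (`subset`, `question_type`, `label`, `prediction`) and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return exact-match accuracy per subset and per (subset, question type).

    Each record is a dict with hypothetical keys:
    'subset', 'question_type', 'label', 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hit = r["prediction"].strip().upper() == r["label"].strip().upper()
        for key in ((r["subset"],), (r["subset"], r["question_type"])):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}


# Made-up sample rows, not real benchmark data.
sample = [
    {"subset": "Text", "question_type": "Reasoning", "label": "A", "prediction": "A"},
    {"subset": "Text", "question_type": "Reasoning", "label": "C", "prediction": "B"},
    {"subset": "Text", "question_type": "Understanding", "label": "E", "prediction": "e "},
    {"subset": "MM", "question_type": "Reasoning", "label": "D", "prediction": "D"},
]

scores = accuracy_by_group(sample)
for key in sorted(scores):
    print(" / ".join(key), f"{scores[key]:.1%}")
```

Keeping the subset and the question type as separate grouping keys mirrors how the scores on this card are quoted: a figure always refers to one subset, optionally narrowed to one question type.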

Top Scores

Model                        Score    Date
o1 (multimodal subset)       56.3%    2024-12
DeepSeek-R1 (text subset)    37.8%    2025-01
o3-mini (text subset)        37.3%    2025-01
benchmark · evaluation · reasoning · multimodal · medical
