MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

This paper introduces MedXpertQA, a benchmark motivated by the observation that existing medical benchmarks are insufficiently difficult, lack clinical relevance, and suffer from data leakage. The authors collect expert-level questions across 17 specialties and 11 body systems, then apply rigorous filtering and augmentation (including expert review and difficulty/diversity controls) to mitigate leakage and raise the difficulty bar. They also add a multimodal (MM) subset with clinical images, along with Reasoning and Understanding question types.

Evaluating leading models reveals a large gap to expert-level performance, positioning MedXpertQA as a probe for medical reasoning in frontier and reasoning models. The paper appears at ICML 2025; the benchmark is MIT-licensed and integrated into OpenCompass. It comes from the Tsinghua C3I group (lead authors Yuxin Zuo and Shang Qu) and Shanghai AI Laboratory (senior author Bowen Zhou).

Paper (arXiv) · GitHub · Dataset (HuggingFace)

Paper

Venue ICML 2025

Authors: Yuxin Zuo · Shang Qu · Yifei Li · Zhangren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou

Tags: benchmark · evaluation · reasoning · medical · research
