News
[2025-02-20] Leaderboard is on! Check out the results of o3-mini, DeepSeek-R1, o1, and Qwen2.5-VL-72B!
[2025-02-09] We release the MedXpertQA dataset.
[2025-01-31] We introduce MedXpertQA, a highly challenging and comprehensive benchmark for evaluating expert-level medical knowledge and advanced reasoning!
Leaderboard
Main (full MedXpertQA)

| # | Model | Org | Date | Text Reasoning | Text Understanding | Text Avg | MM Reasoning | MM Understanding | MM Avg | AvgR | AvgK | Avg | Open-Source? | Reasoning? | Vision? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o1 | OpenAI | 2024-12 | 46.24 | 39.66 | 44.67 | 52.78 | 65.45 | 56.28 | 49.09 | 52.21 | 49.89 | proprietary | reasoning | LMM |
| 2 | QVQ-72B-Preview | Qwen | 2024-12 | 22.08 | 20.71 | 21.76 | 33.54 | 33.57 | 33.55 | 27.09 | 26.95 | 27.06 | open-source | reasoning | LMM |
| 3 | GPT-4o | OpenAI | 2024-11 | 30.63 | 29.54 | 30.37 | 40.73 | 48.19 | 42.80 | 35.05 | 38.58 | 35.96 | proprietary | vanilla | LMM |
| 4 | Claude-3.5-Sonnet | Claude | 2024-10 | 19.88 | 25.81 | 21.31 | 33.33 | 32.85 | 33.20 | 25.76 | 29.22 | 26.65 | proprietary | vanilla | LMM |
| 5 | Gemini-1.5-Pro | Google | 2024-09 | 19.18 | 21.22 | 19.67 | 32.85 | 37.36 | 34.10 | 25.16 | 29.05 | 26.16 | proprietary | vanilla | LMM |
| 6 | GPT-4o-mini | OpenAI | 2024-07 | 17.09 | 20.20 | 17.84 | 28.22 | 27.62 | 28.05 | 21.95 | 23.80 | 22.43 | proprietary | vanilla | LMM |
| 7 | Gemini-2.0-Flash | Google | 2024-12 | 20.53 | 20.71 | 20.57 | 35.48 | 41.70 | 37.20 | 27.06 | 30.88 | 28.04 | proprietary | vanilla | LMM |
| 8 | Qwen2.5-VL-72B | Qwen | 2025-01 | 17.89 | 18.17 | 17.96 | 29.53 | 31.05 | 29.95 | 22.98 | 24.41 | 23.35 | open-source | vanilla | LMM |
| 9 | Qwen2-VL-72B | Qwen | 2024-08 | 16.39 | 18.68 | 16.94 | 25.86 | 34.84 | 28.35 | 20.53 | 26.51 | 22.07 | open-source | vanilla | LMM |
Text (MedXpertQA Text subset)

| # | Model | Org | Date | Reasoning | Understanding | Avg | Open-Source? | Reasoning? | Vision? |
|---|---|---|---|---|---|---|---|---|---|
| 1 | o3-mini | OpenAI | 2025-01 | 37.63 | 36.21 | 37.30 | proprietary | reasoning | LLM |
| 2 | DeepSeek-R1 | DeepSeek | 2025-01 | 37.88 | 37.35 | 37.76 | open-source | reasoning | LLM |
| 3 | QwQ-32B-Preview | Qwen | 2024-11 | 18.70 | 15.79 | 18.00 | open-source | reasoning | LLM |
| 4 | DeepSeek-V3 | DeepSeek | 2024-12 | 23.91 | 24.96 | 24.16 | open-source | vanilla | LLM |
| 5 | Claude-3.5-Haiku | Claude | 2024-10 | 16.71 | 21.05 | 17.76 | proprietary | vanilla | LLM |
| 6 | LLaMA-3.3-70B | Meta | 2024-12 | 23.86 | 26.49 | 24.49 | open-source | vanilla | LLM |
| 7 | Qwen2.5-72B | Qwen | 2024-09 | 18.54 | 20.03 | 18.90 | open-source | vanilla | LLM |
| 8 | Qwen2.5-32B | Qwen | 2024-09 | 14.02 | 18.34 | 15.06 | open-source | vanilla | LLM |
| 9 | o1 | OpenAI | 2024-12 | 46.24 | 39.66 | 44.67 | proprietary | reasoning | LMM |
| 10 | QVQ-72B-Preview | Qwen | 2024-12 | 22.08 | 20.71 | 21.76 | open-source | reasoning | LMM |
| 11 | GPT-4o | OpenAI | 2024-11 | 30.63 | 29.54 | 30.37 | proprietary | vanilla | LMM |
| 12 | Claude-3.5-Sonnet | Claude | 2024-10 | 19.88 | 25.81 | 21.31 | proprietary | vanilla | LMM |
| 13 | Gemini-1.5-Pro | Google | 2024-09 | 19.18 | 21.22 | 19.67 | proprietary | vanilla | LMM |
| 14 | GPT-4o-mini | OpenAI | 2024-07 | 17.09 | 20.20 | 17.84 | proprietary | vanilla | LMM |
| 15 | Gemini-2.0-Flash | Google | 2024-12 | 20.53 | 20.71 | 20.57 | proprietary | vanilla | LMM |
| 16 | Qwen2.5-VL-72B | Qwen | 2025-01 | 17.89 | 18.17 | 17.96 | open-source | vanilla | LMM |
| 17 | Qwen2-VL-72B | Qwen | 2024-08 | 16.39 | 18.68 | 16.94 | open-source | vanilla | LMM |
MM (MedXpertQA MM subset)

| # | Model | Org | Date | Reasoning | Understanding | Avg | Open-Source? | Reasoning? | Vision? |
|---|---|---|---|---|---|---|---|---|---|
| 1 | o1 | OpenAI | 2024-12 | 52.78 | 65.45 | 56.28 | proprietary | reasoning | LMM |
| 2 | QVQ-72B-Preview | Qwen | 2024-12 | 33.54 | 33.57 | 33.55 | open-source | reasoning | LMM |
| 3 | GPT-4o | OpenAI | 2024-11 | 40.73 | 48.19 | 42.80 | proprietary | vanilla | LMM |
| 4 | Claude-3.5-Sonnet | Claude | 2024-10 | 33.33 | 32.85 | 33.20 | proprietary | vanilla | LMM |
| 5 | Gemini-1.5-Pro | Google | 2024-09 | 32.85 | 37.36 | 34.10 | proprietary | vanilla | LMM |
| 6 | GPT-4o-mini | OpenAI | 2024-07 | 28.22 | 27.62 | 28.05 | proprietary | vanilla | LMM |
| 7 | Gemini-2.0-Flash | Google | 2024-12 | 35.48 | 41.70 | 37.20 | proprietary | vanilla | LMM |
| 8 | Qwen2.5-VL-72B | Qwen | 2025-01 | 29.53 | 31.05 | 29.95 | open-source | vanilla | LMM |
| 9 | Qwen2-VL-72B | Qwen | 2024-08 | 25.86 | 34.84 | 28.35 | open-source | vanilla | LMM |
Click a column header to sort.
- Main is the leaderboard over the full MedXpertQA benchmark.
- Text covers the MedXpertQA Text subset for text-only medical evaluation.
- MM covers the MedXpertQA MM subset for multimodal medical evaluation.
If you'd like to submit to the leaderboard, please check this page.
About
[Overview figure of MedXpertQA]
MedXpertQA includes 4,460 questions spanning 17 medical specialties and 11 body systems. It comprises two subsets: MedXpertQA Text for text-only medical evaluation and MedXpertQA MM for multimodal medical evaluation. The overview figure illustrates MedXpertQA's diverse data sources, image types, and question attributes on the left, and on the right compares a typical example from MedXpertQA MM with one from a traditional multimodal medical benchmark (VQA-RAD). Read more about MedXpertQA in our paper!
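For reference, below is a minimal sketch of how the released dataset could be loaded for evaluation with the Hugging Face `datasets` library. The repository ID (`TsinghuaC3I/MedXpertQA`), the configuration names (`Text`, `MM`), and the split name are assumptions here; verify them against the official dataset page before use.

```python
# Minimal sketch of loading MedXpertQA for evaluation (not the official loader).
# Assumptions to verify: the data is hosted on the Hugging Face Hub under an ID
# such as "TsinghuaC3I/MedXpertQA" with "Text" and "MM" configurations and a
# "test" split; the field names inside each record may differ.
from datasets import load_dataset

text_set = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split="test")
mm_set = load_dataset("TsinghuaC3I/MedXpertQA", "MM", split="test")

# Per the About section, the two subsets together contain 4,460 questions.
print(len(text_set), len(mm_set))

# Inspect one multiple-choice record before wiring up an evaluation loop.
print(text_set[0])
```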
Citation
@article{zuo2025medxpertqa,
  title={MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding},
  author={Zuo, Yuxin and Qu, Shang and Li, Yifei and Chen, Zhangren and Zhu, Xuekai and Hua, Ermo and Zhang, Kaiyan and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2501.18362},
  year={2025}
}
Disclaimer: MedXpertQA is for research purposes only. Models evaluated on MedXpertQA can produce unexpected results. We are not responsible for any damages caused by the use of MedXpertQA, including, but not limited to, any loss of profit, data, or use of data.
Acknowledgements: This website is adapted from the SWE-bench website. We thank the SWE-bench team for granting us permission to use their website template.
License: This project is licensed under the MIT License. See the LICENSE file for details.
Contact: lindsay2864tt@gmail.com, dn97@mail.tsinghua.edu.cn.