1. Evaluate on MedXpertQA:
Download the data from Huggingface and perform the evaluation locally.
We use the zero-shot chain-of-thought (CoT) method for evaluation; the specific methods and prompts are detailed in the paper.
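As a rough illustration, a zero-shot CoT prompt for one sample might be assembled as below. The exact prompt wording is defined in the paper; the `build_cot_prompt` helper and its phrasing here are assumptions for illustration only.

```python
# Sketch of building a zero-shot CoT prompt for one MedXpertQA sample.
# The prompt wording below is an assumption; use the prompts from the paper.

def build_cot_prompt(sample: dict) -> str:
    """Format a question and its options into a zero-shot CoT prompt."""
    option_lines = "\n".join(
        f"{letter}. {text}" for letter, text in sorted(sample["options"].items())
    )
    return (
        f"{sample['question']}\n\n"
        f"{option_lines}\n\n"
        "Let's think step by step, then answer with the letter of the correct option."
    )

sample = {
    "id": "MM-26",
    "question": "A 70-year-old female patient...",
    "options": {"A": "Erythema infectiosum", "B": "Cutaneous larva migrans"},
}
print(build_cot_prompt(sample))
```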
2. Organize Your Result Files:
Please add a "prediction" field to each original sample to store your model's prediction, while maintaining the original data format. An example is provided below:
```json
{
  "id": "MM-26",
  "question": "A 70-year-old female patient...",
  "options": {
    "A": "Erythema infectiosum",
    "B": "Cutaneous larva migrans",
    "C": "Cold agglutinin disease",
    "D": "Cutis marmorata",
    "E": "Erythema ab igne"
  },
  "label": "C",
  "images": ["MM-26-a.jpeg"],
  "medical_task": "Diagnosis",
  "body_system": "Lymphatic",
  "question_type": "Reasoning",
  "prediction": "C",
  "prediction_rationale": "I understand..."
}
```
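One way to produce such records is to copy each original sample and attach the new fields before writing the submission file. A minimal sketch, where `run_model` is a hypothetical stand-in for your own inference code:

```python
import json

# Sketch: attach model outputs to the original samples and write a submission
# JSONL. `run_model` is a hypothetical placeholder for your inference call.

def add_predictions(samples, run_model):
    results = []
    for sample in samples:
        rationale, answer = run_model(sample)  # your inference call
        enriched = dict(sample)                # keep all original fields intact
        enriched["prediction"] = answer
        enriched["prediction_rationale"] = rationale
        results.append(enriched)
    return results

def write_jsonl(path, rows):
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```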
We strongly recommend adding a "prediction_rationale" field containing the reasoning trace, especially for o1-like reasoning models. This serves two purposes:
- Verify the validity of your submission.
- Analyze and showcase model behavior: give the community more insight into how cutting-edge methods work without requiring a code release.
What is a reasoning trace?
A reasoning trace is a string that describes the steps your system took to solve a task instance. It should provide a detailed account of the reasoning process that your system used to arrive at its solution. Typically, when running inference, the intermediate output generated by the system is the reasoning trace.
The reasoning trace should be...
- Human-readable.
- Reflective of the intermediate steps that led your system to the final solution.
- Generated during the inference process, not post-hoc.
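Since the trace is the intermediate output produced during inference, a common pattern is to split the raw generation into the rationale and the final answer. A minimal sketch, assuming the model ends its output with a line like "Answer: C" (adjust the pattern to whatever final-answer format your system produces):

```python
import re

# Sketch: split a raw model generation into (reasoning trace, predicted letter).
# Assumes a trailing "Answer: <letter>" marker; this format is an assumption.

def split_trace_and_answer(generation: str):
    match = re.search(r"Answer:\s*([A-E])\b", generation)
    if match is None:
        return generation.strip(), None
    trace = generation[: match.start()].strip()
    return trace, match.group(1)
```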
Note:
When submitting, please distinguish the benchmark you evaluated on by the filename (e.g., `medxpertqa_mm_test_submission.jsonl` and `medxpertqa_text_test_submission.jsonl`).
We recommend evaluating on each sub-benchmark, if possible.
3. Submit Your Results
Create an Issue using the issue template, submit your result files, and provide the relevant information, such as the paper link and a contact email.
Acknowledgements: We adapted part of SWE-bench's submission guidelines, and we greatly appreciate their contributions to the community.