Qiao Jin, MD | Medical AI Researcher
🔗 Homepage | 🐦 Twitter / X | 💼 LinkedIn
<aside> 💡
Disclaimer: This personal blog reflects my own views and not those of my employer. Your comments and feedback are welcome!
</aside>
Most medical LLM evaluations take the form of question answering (QA). Broadly, these QA tasks fall into two categories: closed-ended tasks such as multiple-choice questions (MCQs) and open-ended tasks such as generating free-form diagnostic plans or summarizing complex patient cases. Each format has trade-offs: closed-ended benchmarks are easy to evaluate, while open-ended tasks better reflect real-world scenarios.
Closed-ended tasks evaluate a specific part of the model’s answer using a predefined answer space. Common examples include MCQ datasets such as MedQA, PubMedQA, MedMCQA, and biomedical subsets of MMLU.
In addition to MCQs, math tasks like GSM8k also count as closed-ended, since only the final numeric answer is evaluated and the reasoning steps are usually not scored. In the general domain, various benchmarks follow this pattern for math problem solving and code generation. In medicine, we recently introduced MedCalc-Bench for evaluating the clinical calculation capabilities of LLMs, but more evaluation benchmarks on medical calculation & data science are needed.
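To make this concrete, here is a minimal scoring sketch for final-numeric-answer tasks. It assumes the model is prompted to end its response with a line like "Final answer: <number>"; the regex and the relative tolerance are illustrative choices of mine, not the official GSM8k or MedCalc-Bench protocol.

```python
import re

def extract_final_number(response: str) -> float | None:
    """Pull the last number from a model response, assuming the prompt
    asks the model to end with a line like 'Final answer: <number>'."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None

def score_numeric(response: str, gold: float, rel_tol: float = 0.01) -> bool:
    """Mark the answer correct if the extracted number falls within a
    relative tolerance of the gold value (the tolerance is illustrative)."""
    pred = extract_final_number(response)
    if pred is None:
        return False
    return abs(pred - gold) <= rel_tol * abs(gold)

# Toy example: a clearance-style calculation with gold value 62.5
print(score_numeric("... so the estimated clearance is ~62.3 mL/min. Final answer: 62.3", 62.5))
```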
When released, most medical LLMs are evaluated on the MCQ datasets mentioned above, and MedQA (USMLE) is perhaps the most commonly used MCQ benchmark in medicine. Because the set of possible answers is predefined, performance can be measured by straightforward metrics like choice accuracy without requiring clinical experts. Figure 1 below shows an example.
Figure 1. An example MedQA-USMLE question and LLM’s answer (Liévin et al., 2024).
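To illustrate why these metrics are "straightforward", here is a minimal sketch of computing choice accuracy. The regex that pulls the predicted letter out of a free-form response is an assumption on my part, not the harness used by any particular benchmark.

```python
import re

def extract_choice(response: str) -> str | None:
    """Grab the last standalone A-E letter, assuming the model is
    prompted to state its answer as a single option letter."""
    matches = re.findall(r"\b([A-E])\b", response.upper())
    return matches[-1] if matches else None

def choice_accuracy(responses: list[str], gold_choices: list[str]) -> float:
    """Fraction of questions whose predicted letter matches the gold letter."""
    correct = sum(
        extract_choice(resp) == gold.upper()
        for resp, gold in zip(responses, gold_choices)
    )
    return correct / len(gold_choices)

# Toy example with two questions
print(choice_accuracy(["The best next step is (B).", "Answer: D"], ["B", "C"]))  # 0.5
```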
However, MCQs are an unrealistic setting: in real life, there are no answer choices. In addition, there can be hidden flaws, where LLMs predict the correct final choice with flawed rationales. Therefore, higher scores on MCQs do not necessarily translate into better clinical utility. Still, these datasets might be useful as screening tools: if a model cannot even pass MedQA (>60% accuracy), it should never be considered for any clinical evaluation at all, just as a medical student who cannot pass the USMLE will never be given the chance to be tested with real patients in the clinic.
<aside> 💡
Personal take: Medical LLM developers might still need MCQs in the future. If MedQA / PubMedQA / MedMCQA are saturated, I think the community would benefit from new, harder datasets for quick & cheap screening of model capabilities.
</aside>
Open-ended tasks, on the other hand, evaluate multiple dimensions of LLM outputs. For example, MultiMedQA 140 (100 from HealthSearchQA, 20 from LiveQA, and 20 from MedicationQA) is used in the evaluation of Med-PaLM. Open-ended benchmarks better capture real-world medical scenarios, yet few are commonly used in medicine because evaluation is very challenging. Traditional natural language generation (NLG) metrics such as BLEU and ROUGE often do not correlate well with human judgements, and manual judgements (one example shown in the figure below) are time-consuming and thus not scalable.
Figure 2. An example of manual judgements for the open-ended MultiMedQA task (Singhal et al., 2023).
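To see why surface-overlap metrics fall short, here is a small sketch using the open-source `rouge_score` package. The reference and candidate answers are made-up strings for illustration only, but they show the failure mode: a clinically wrong near-copy of the reference tends to score higher on ROUGE-L than a correct paraphrase.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "Ibuprofen can be taken every 6 to 8 hours, not exceeding 1200 mg per day without medical advice."
correct   = "You may take ibuprofen every 6-8 hours, but do not exceed 1200 mg daily unless a doctor advises it."
wrong     = "Ibuprofen can be taken every 2 hours, exceeding 1200 mg per day without medical advice is fine."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, candidate in [("correct paraphrase", correct), ("clinically wrong near-copy", wrong)]:
    score = scorer.score(reference, candidate)["rougeL"].fmeasure
    print(f"{name}: ROUGE-L F1 = {score:.2f}")
# The harmful near-copy scores higher on surface overlap than the correct
# paraphrase, which is why such metrics are unreliable for medical QA.
```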
In the research community, there have been some creative attempts to go beyond closed-ended MCQs and design more open-ended tasks & evaluations in medicine. For example, recent studies like CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine), AMIE (Articulate Medical Intelligence Explorer), and AgentClinic use simulated AI agents to interact with LLMs in a controlled environment for the evaluation of clinical LLMs. This is definitely an interesting research direction and is worth further exploration.
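Below is a highly simplified sketch of the shared idea behind these agent-based setups, not the actual CRAFT-MD, AMIE, or AgentClinic code: one LLM plays a simulated patient defined by a hidden case vignette, another plays the doctor, and the final diagnosis is graded. `query_llm` is a hypothetical helper standing in for any chat-completion API, and the string-match grading is a deliberate oversimplification.

```python
def query_llm(system_prompt: str, history: list[dict]) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("plug in your chat API of choice here")

def run_simulated_consultation(case_vignette: str, gold_diagnosis: str, max_turns: int = 6) -> bool:
    """Run a doctor-patient dialogue and check the doctor's final diagnosis."""
    patient_sys = f"You are a patient with this case (do not reveal it verbatim): {case_vignette}"
    doctor_sys = "You are a physician. Ask focused questions, then give a final answer prefixed with 'Diagnosis:'."
    history: list[dict] = []
    doctor_msg = "Hello, what brings you in today?"
    for _ in range(max_turns):
        history.append({"role": "doctor", "content": doctor_msg})
        patient_msg = query_llm(patient_sys, history)   # simulated patient replies
        history.append({"role": "patient", "content": patient_msg})
        doctor_msg = query_llm(doctor_sys, history)     # doctor asks or concludes
        if "diagnosis:" in doctor_msg.lower():
            break
    # Naive string-match grading; real frameworks use expert raters or LLM judges.
    return gold_diagnosis.lower() in doctor_msg.lower()
```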
<aside> 💡
Open-ended evaluation is the next frontier, but progress remains slow due to the lack of scalable & reliable scoring methods for free-form model responses.
</aside>