ACL 2025

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber

Abstract

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, like rubrics to guide MCQ writing, scoring methods to bridle guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.

Questioning Multiple Choice Questions

Multiple choice question answering (MCQA) is the standard for large language model (LLM) evaluations, prized for simplicity and similarity to human testing (Robinson and Wingate, 2023). When designing new benchmarks, MCQA seems easy to implement (Guo et al., 2023), and when selecting new LLMs to use, MCQA leaderboards inform our decisions (Fourrier et al., 2024). If you want to build a popular dataset, prove your LLM is smart, or even publish a position paper, it is hard to avoid MCQA.

Standardized testing groups have long explored ways to better use MCQA for student testing (Angoff, 1971). But despite years of use in NLP (Turney et al., 2003), few have asked: 1) should MCQA be a standard model evaluation format; and 2) are its datasets well-designed?

This position paper argues: Evaluating LLMs with MCQA has flaws

Q1. What's wrong with MCQA's format?
   A) It doesn't apply to many tasks (§3.1)
   B) It's misaligned with LLM use cases (§3.2)
   C) It doesn't fully test knowledge (§3.3)

Q2. What's wrong with MCQA datasets?
   A) Test sets are contaminated (§5.1)
   B) They have unanswerable questions (§5.2)
   C) They contain shortcuts (§5.3)
   D) They're too easy for LLMs (§5.4)

Q3. How do LLMs struggle with MCQA?
   A) They lack robustness (§6.1)
   B) They exhibit biases (§6.2)
   C) They give unfaithful explanations (§6.3)

Q4. How can insights from education improve MCQA?
   A) Improve knowledge testing via generative formats (§4)
   B) Combat test set leakage with fresh questions (§5.1)
   C) Write MCQs informed by educational rubrics (§5.2)
   D) Use calibration scoring to curb guessing (§5.3.1)
   E) Find harder MCQs with item response theory (§5.4.1)

Footnote 7: gpt-3 has seen 45% of RACE's test set (Sainz et al., 2023).
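To make the idea of scoring that curbs guessing (Q4, option D) concrete, here is a minimal sketch of one classical rule from educational testing: formula scoring, which subtracts a fraction of the wrong answers so that blind guessing earns nothing in expectation. This is an illustration only and is not necessarily the scoring method proposed in §5.3.1. The function name `formula_score` and the convention that `None` marks an abstention are assumptions made for this sketch.

```python
from typing import Optional, Sequence


def formula_score(
    predictions: Sequence[Optional[str]],
    answers: Sequence[str],
    num_choices: int = 4,
) -> float:
    """Correction-for-guessing score, averaged over items.

    Each correct answer earns 1, each wrong answer costs 1 / (k - 1),
    and an abstention (None, a convention assumed here) earns 0.
    With k choices, uniform random guessing scores 0 in expectation.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if num_choices < 2:
        raise ValueError("num_choices must be at least 2")
    if not answers:
        return 0.0

    # Count only attempted items; abstentions are neither right nor wrong.
    attempted = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    right = sum(p == a for p, a in attempted)
    wrong = len(attempted) - right

    return (right - wrong / (num_choices - 1)) / len(answers)
```

Under this rule, a model that always answers "A" on four 4-choice items whose keys are A, B, C, and D gets one right and three wrong, so its score is 0 rather than the 25% that plain accuracy would report.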