ACL 2025

ChatBench: From Static Benchmarks to Human-AI Evaluation

Serina Chang, Ashton Anderson, Jake M. Hofman

Abstract

With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., "AI-alone"). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.

A number is rounded to the nearest hundred. The result is 500. Which number could not be the number before it was rounded to the nearest hundred? A. 458 B. 463 C. 547 D. 559 (Elementary Mathematics)

To say that a rose petal is red is to say that it A. absorbs red B. reflects red C. emits red D. transmits red (Conceptual Physics)

For which of these two scenarios does the main character do something clearly morally wrong? Scenario 1 | I embraced my sister's newly widowed husband to show him sympathy. Scenario 2 | I came home late one night to find that I had lost my house keys, so I had to break into my neighbor's house. A