ICLR 2025

BenTo: Benchmark Reduction with In-Context Transferability

Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou

Abstract

Evaluating large language models (LLMs) is costly: it requires generating and examining LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information for identifying the most representative subset of tasks by optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce the tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference from the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring only ICL.

[Figure: task labels grouped under Computer Science, Global, History, Philosophy, and Physical Science. Each arc connects a source task with a target task and has the same color as the source task.]
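To make the facility location selection mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pairwise matrix `sim` in which `sim[i, j]` scores how well source task `i` represents target task `j`, and it uses the standard greedy heuristic for maximizing the facility location objective f(S) = sum_j max_{i in S} sim[i, j]. The function name, the toy matrix values, and the choice of the greedy heuristic are all illustrative assumptions.

```python
import numpy as np


def greedy_facility_location(sim, k):
    """Greedily pick k source tasks maximizing sum_j max_{i in S} sim[i, j].

    sim is a hypothetical (n_sources x n_targets) similarity matrix.
    Returns the selected row indices and the final objective value.
    """
    n_sources, n_targets = sim.shape
    selected = []
    available = np.ones(n_sources, dtype=bool)
    # best[j] = highest similarity between target j and any selected task so far
    best = np.zeros(n_targets)
    for _ in range(k):
        # Marginal gain of adding each candidate row to the current selection
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        # Exclude rows that were already selected
        gains = np.where(available, gains, -np.inf)
        i = int(np.argmax(gains))
        selected.append(i)
        available[i] = False
        best = np.maximum(best, sim[i])
    return selected, float(best.sum())


# Toy, made-up transferability scores for four tasks (not from the paper)
sim = np.array([
    [1.0, 0.7, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.6, 1.0],
])

chosen, value = greedy_facility_location(sim, k=2)
print(chosen, value)
```

In this toy matrix the tasks form two clusters, so the greedy pass picks one task from each. Because the facility location objective is monotone submodular, the greedy heuristic is a common choice with a (1 - 1/e) approximation guarantee; whether the paper uses exactly this optimizer is not stated in the abstract.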