ACL 2025

Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

Roland Daynauth, Christopher Clarke, Krisztián Flautner, Lingjia Tang, Jason Mars

Abstract

Evaluating large language models (LLMs) is a complex task. Pairwise ranking, where humans compare LLM outputs based on predefined criteria, has become a leading approach. By aggregating these comparisons through algorithms such as Elo, rankings across multiple LLMs can be derived. However, applying ranking algorithms in LLM evaluation presents several challenges. Traditional systems like Elo, originally designed for structured competitions such as chess, often produce inconsistent and unstable rankings due to the dynamic and context-dependent nature of LLM performance. Despite the increasing reliance on these methods, a systematic study of ranking algorithms for LLM evaluation is still lacking. This paper examines the effectiveness of various ranking systems for head-to-head LLM comparisons. We define key principles for robust ranking, conduct extensive evaluations of different ranking algorithms, and analyze their stability, accuracy, and sensitivity to real-world conditions. Our findings offer insights into the limitations of existing approaches and provide guidelines for selecting the most appropriate ranking method based on evaluation objectives and resource constraints.
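To make the aggregation step concrete, the following is a minimal sketch of the standard sequential Elo update applied to pairwise LLM comparisons. It is not the paper's implementation: the function name `elo_ratings`, the K-factor of 32, the initial rating of 1000, and the model names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def elo_ratings(matches, k=32.0, initial=1000.0):
    """Aggregate pairwise outcomes into Elo ratings.

    matches: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if model_a was preferred, 0.0 if model_b was, and 0.5 for a tie.
    k and initial are illustrative defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in matches:
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        # Zero-sum update: whatever model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical comparisons forming a preference cycle.
matches = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 1.0),
    ("model_z", "model_x", 1.0),
]

forward = elo_ratings(matches)
backward = elo_ratings(list(reversed(matches)))

# Rank models from highest to lowest rating under each processing order.
rank_forward = sorted(forward, key=forward.get, reverse=True)
rank_backward = sorted(backward, key=backward.get, reverse=True)
print(rank_forward, rank_backward)
```

Because each update depends on the ratings at the time the comparison is processed, replaying the same three comparisons in reverse order yields a different ranking. This order sensitivity is one simple illustration of the kind of instability the abstract attributes to Elo.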