ICLR2025

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Lianghui Zhu, Xinggang Wang, Xinlong Wang

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging due to existing benchmarks and metrics can not measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement 1 . JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat. Under review as a conference paper at ICLR 2024 In this paper, we propose to evaluate LLMs through fine-tuned open-source LLMs, which serve as scalable judges (JudgeLM) achieving satisfactory agreement with the teacher judge. Our methodology incorporates scalable judges as evaluators in open-ended tasks, coupled with a high-quality dataset conducive to both training and evaluating the judge models. Within our framework, we adapt open-source LLMs to serve as judges and analyze their scaling ability in relation to model size (ranging from 7B to 33B) and volume of training data (extending from 3.5K to 100K). Our curated dataset comprises 105K seed questions, LLM answer pairs, and judgments from the teacher judge, GPT-4, as shown in Fig. 1a . Note that we generated two judgments for each seed task with and without reference answers. This dataset is partitioned, with 100K seed questions allocated for training (×2 larger than PandaLM) and the remainder for validation (×29 larger than PandaLM). mal performance only under specific prompt formats) as shown in Fig. 8 , 10, 12, 13. We propose methods to address them. Moreover, our JudgeLM system presents extended capabilities as shown in Fig. 1b , including grading single answers, judging multiple answers, judging multimodal models, and multi-turn chat. In contrast to arena-format methodologies, our approach is rapid and has a low cost. For instance, a model like JudgeLM-7B requires only 8 A100 GPUs and can evaluate 5000 response pairs in just 3 minutes. In comparison to closed-source LLM judges, JudgeLM ensures reproducibility and protects user privacy. When compared to concurrent open-source LLM judges, our system explores both the scaling ability and biases in LLM fine-tuning. Furthermore, the dataset we introduce stands as the most diverse and high-quality one, significantly benefitting subsequent research in judge model investigations. Our main contributions can be summarized as follows: • We introduce a high-quality, large-scale dataset for judge models, enriched with diverse seed tasks, LLMs-generated answers, and detailed judgments from GPT-4, laying the foundation for future LLMs evaluating research.