ICML 2025

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason E. Weston, Tianlu Wang

Abstract

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

Example Input and Output of EvalPlanner

User Instruction:
[Text]There is a French pub called The Phoenix that has a low customer rating but is child friendly and is more than 30 dollars and is located in the riverside area near Crowne Plaza Hotel.
[Question]According to the above [Text], extract the attributes (name, eatType) of the restaurant. Do not output your analysis, only output the result in the format of "attribute1[value1], attribute2[value2], ..."
Response A: Name[The Phoenix], eatType[pub], location[Riverside area near Crowne Plaza Hotel], priceRange[More than 30 dollars]

Response B: Name[The Phoenix], eatType[pub]

Output:

[Start of Evaluation Plan]
To evaluate the response quality, we will follow these steps:
1. Check if the response provides the correct attributes: Verify that the response includes the attributes "name" and "eatType" as requested in the question.
2. Verify the accuracy of the attribute values: Check if the values provided for the attributes match the information given in the original text.
3. Check for any additional or missing information: Ensure that the response does not include any extra attributes or information not requested in the question.

Step-by-step reference answer:
1. The response should include the attributes "name" and "eatType".
2. The values for the attributes should be:
   * name: The Phoenix
   * eatType: French pub
3. The response should not include any additional attributes or information.
[End of Evaluation Plan]

[Start of Plan Execution]
Step 1: Check if the response provides the correct attributes
Assistant A's response includes the attributes "Name", "eatType", "location", and "priceRange". However, the question only requests the attributes "name" and "eatType". Assistant B's response includes the attributes "name" and "eatType", which matches the request.

Step 2: Verify the accuracy of the attribute values
...

Step 3: Check for any additional or missing information
Assistant A's response includes additional attributes "location" and "priceRange" that are not requested in the question. Assistant B's response does not include any additional attributes or information.
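
As an illustration only, the attribute checks described in Steps 1 and 3 of the example amount to a set comparison between the requested and the provided attribute names. The sketch below reproduces that comparison for the two responses. It is not part of EvalPlanner, which produces these checks as free-form text; the helper names `parse_attributes` and `check_attributes` are hypothetical, and the case-insensitive matching of "Name" against "name" is an assumption inferred from the example.

```python
from __future__ import annotations

import re

# Hypothetical helpers for illustration; not part of EvalPlanner.
# Matches items of the form attribute[value], as in the example's output format.
ATTR_PATTERN = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_attributes(response: str) -> dict[str, str]:
    """Map each lower-cased attribute name to its value."""
    return {name.lower(): value for name, value in ATTR_PATTERN.findall(response)}


def check_attributes(response: str, requested: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, extra) attribute names relative to the requested set.

    Assumption: attribute names are compared case-insensitively.
    """
    provided = set(parse_attributes(response))
    wanted = {name.lower() for name in requested}
    return wanted - provided, provided - wanted


response_a = (
    "Name[The Phoenix], eatType[pub], "
    "location[Riverside area near Crowne Plaza Hotel], "
    "priceRange[More than 30 dollars]"
)
response_b = "Name[The Phoenix], eatType[pub]"
requested = {"name", "eatType"}

# Step 1 (missing attributes) and Step 3 (extra attributes) for each response.
missing_a, extra_a = check_attributes(response_a, requested)
missing_b, extra_b = check_attributes(response_b, requested)

print("A: missing", sorted(missing_a), "extra", sorted(extra_a))
print("B: missing", sorted(missing_b), "extra", sorted(extra_b))
```

Under these assumptions, Response A is flagged for the two unrequested attributes and Response B for none, which agrees with the plan execution shown above. Step 2, verifying the attribute values against the text, is left out because it requires reading comprehension rather than string matching.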