EMNLP 2025

The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe

Abstract

Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

[Figure: overview of RevUtil in three panels (Aspect Definitions, Aspect Annotation, Training & Inference), with an example of review utility evaluation. Text recovered from the figure follows.]

Aspect Definitions. RevUtil aspects are defined based on the aspect literature and reviewer guidelines.
Actionability: degree to which a comment explicitly states a concrete action to perform to improve the contribution.
Grounding & Specificity: explicit link to a part of the paper and specific details on what to improve in this part.
Verifiability: extent to which the comment provides evidence and rationales to support its claims.
Helpfulness: overall judgment of how helpful the review comment is for an author to improve their work.

Aspect Annotation. Extract individual review comments from peer reviews; annotate each comment on every aspect on a 1-5 scale; aggregate human annotations into full, major and low agreement (example with full agreement, 3/3: "The points raised in Section 5 would benefit from more in-depth analysis"); scale annotations with GPT-4o, aligning with human data and generating score rationales.

Training & Inference. Train small-scale, practical models to predict aspect scores and generate rationales; evaluate models on human and synthetic data.

Example (Review Utility Evaluation). Review comment: "The number of baselines is a bit small, which degrades its universality and generality." Aspect: Actionability. Score: 2/5. Rationale: "The review [...] does not provide specific guidance [...] such as suggesting additional baselines to include or explaining how to enhance the universality and generality. The action is implicit, as the authors need to infer that they should add more baselines, and it is vague because it lacks concrete steps for improvement. [...]"
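
The annotation step mentions aggregating human annotations on a 1-5 scale into full, major and low agreement, with "Full (3/3)" given as one label. The source does not state the exact rule, so the sketch below is only an illustration under assumptions: three annotators per comment and aspect, "full" when all three scores match, "major" when exactly two match, and "low" when all three differ. The names `AggregatedLabel` and `aggregate` are hypothetical and not taken from the RevUtil code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AggregatedLabel:
    score: Optional[int]  # majority score; None when no two annotators agree
    agreement: str        # "full", "major", or "low"


def aggregate(scores: Sequence[int]) -> AggregatedLabel:
    """Aggregate three 1-5 scores for one aspect of one review comment.

    ASSUMED rule (not stated in the source):
    all three equal -> "full", exactly two equal -> "major",
    all different -> "low" (no majority score is returned).
    """
    if len(scores) != 3 or any(not 1 <= s <= 5 for s in scores):
        raise ValueError("expected exactly three scores on a 1-5 scale")
    # most_common(1) yields the most frequent score and how often it occurs.
    value, count = Counter(scores).most_common(1)[0]
    if count == 3:
        return AggregatedLabel(value, "full")
    if count == 2:
        return AggregatedLabel(value, "major")
    return AggregatedLabel(None, "low")
```

Under these assumptions, `aggregate([2, 2, 4])` yields a "major" label with score 2, while three distinct scores yield a "low" label without a majority score. The authors may resolve low-agreement cases differently, for example by averaging or by discarding the item.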