ACL 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev

Cited by 50

Abstract

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale, and an in-depth analysis of human evaluation is lacking. Therefore, we address the shortcomings of existing summarization evaluation along the following axes: (1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units and allows for a high inter-annotator agreement. (2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems on three datasets. (3) We conduct a comparative study of four human evaluation protocols, underscoring potential confounding factors in evaluation setups. (4) We evaluate 50 automatic metrics and their variants using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. The metrics we benchmarked include recent methods based on large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings have important implications for evaluating LLMs, as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.

Summary of key findings

Statistical Power (§4.1)
- High statistical power is difficult to reach for human evaluation of similar-performing systems.
- Increasing the sample size of human evaluation effectively raises statistical power.

Summary Length (§4.2)
- Summaries from different summarization systems show a large difference in average length.
- Difference in summary length is not well-reflected by automatic evaluation metrics.

Evaluation Protocol Comparison (§5.2)
- Reference-free and reference-based human evaluation results have a near-zero correlation.
- Reference-free human evaluation strongly correlates with input-agnostic annotator preference.
- Annotator's input-agnostic preference has a strong positive correlation with summary lengths.
- Annotator's input-agnostic preference does not favor reference summaries.
- Compared to smaller, fine-tuned models, zero-shot large language models (e.g., GPT-3) perform better under reference-free evaluation, but worse under reference-based evaluation.

Evaluating Automatic Metrics (§6.1 & §6.2)
- A higher-powered human evaluation dataset can lead to a more robust automatic metric evaluation, as shown by a tighter confidence interval and higher statistical power of metric evaluation.
- Automatic metric performance differs greatly under different human evaluation protocols.
- Automatic metrics show relatively strong system-level correlation and moderate summary-level correlation with our robust human evaluation protocol.
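
The distinction between system-level and summary-level correlation in the findings above can be illustrated with a small sketch. This is not the paper's implementation: it assumes one common convention (system-level correlates per-system mean scores; summary-level correlates scores across systems for each document and then averages), uses Kendall's tau-a as the correlation, and runs on made-up scores for hypothetical systems `sys_a`, `sys_b`, and `sys_c`.

```python
from itertools import combinations
from statistics import mean


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    total = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / len(pairs)


def system_level(human, metric):
    """Correlate per-system mean scores. Inputs: {system: [score per document]}."""
    systems = sorted(human)
    return kendall_tau(
        [mean(human[s]) for s in systems],
        [mean(metric[s]) for s in systems],
    )


def summary_level(human, metric):
    """Correlate scores across systems for each document, then average."""
    systems = sorted(human)
    n_docs = len(human[systems[0]])
    taus = [
        kendall_tau(
            [human[s][d] for s in systems],
            [metric[s][d] for s in systems],
        )
        for d in range(n_docs)
    ]
    return mean(taus)


# Toy, made-up scores (assumption): three hypothetical systems, three documents.
human = {"sys_a": [0.9, 0.2, 0.6], "sys_b": [0.5, 0.6, 0.1], "sys_c": [0.1, 0.4, 0.2]}
metric = {"sys_a": [0.8, 0.5, 0.4], "sys_b": [0.6, 0.4, 0.5], "sys_c": [0.2, 0.3, 0.3]}

print("system-level tau:", system_level(human, metric))
print("summary-level tau:", summary_level(human, metric))
```

On this toy data the metric ranks the three systems in the same order as the human scores, yet it disagrees with them on two of the three individual documents, so the summary-level value comes out much lower than the system-level one. Real meta-evaluation would typically use a tie-aware correlation and far more systems and documents.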