ACL2020

Facet-Aware Evaluation for Extractive Summarization

Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren, Jiawei Han

Cited by 19

Abstract

Commonly adopted metrics for extractive summarization focus on lexical overlap at the token level. In this paper, we present a facet-aware evaluation setup for better assessing the information coverage of extracted summaries. Specifically, we treat each sentence in the reference summary as a facet, identify the sentences in the document that express the semantics of each facet as the support sentences of that facet, and automatically evaluate extractive summarization methods by comparing the indices of the extracted sentences with those of the support sentences of all facets in the reference summary. To facilitate this new evaluation setup, we construct an extractive version of the CNN/Daily Mail dataset and perform a thorough quantitative investigation. Through it we demonstrate that facet-aware evaluation correlates better with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights into state-of-the-art summarization methods.

Examples (reference sentence vs. candidate extracts):

Reference: Three people in Kansas have died from a listeria outbreak.
Lexical Overlap: But they did not appear identical to listeria samples taken from patients infected in the Kansas outbreak. (ROUGE-1 F1 = 37.0; multiple token matches but totally different semantics)
Manual Extract: Five people were infected and three died in the past year in Kansas from listeria that might be linked to blue bell creameries products, according to the CDC. (ROUGE-1 F1 = 36.9; semantics covered, but lower ROUGE due to the presence of other details)

Reference: Chelsea boss Jose Mourinho and United manager Louis van Gaal are pals.
Lexical Overlap: Gary Neville believes Louis van Gaal's greatest achievement as a football manager is the making of Jose Mourinho.
Manual Extract: The duo have been friends since they first worked together at Barcelona in 1997 where they enjoyed a successful relationship at the Camp Nou. (ROUGE recall/F1 = 0; no lexical overlap at all)
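The contrast between token overlap and index-based facet matching can be illustrated with a small sketch. It rests on two assumptions. First, `rouge1_f1` is a simplified unigram F1 (lowercased alphanumeric tokens, no stemming or stopword handling), not the official ROUGE package. Second, each facet is represented as a list of alternative "support groups", each a set of document sentence indices, and a facet counts as covered when every sentence of at least one group was extracted. The names `facet_covered` and `facet_coverage`, this data layout, and the coverage rule are hypothetical illustrations, not necessarily the paper's exact scoring.

```python
import re
from collections import Counter
from typing import Iterable, Sequence, Set


# Simplified ROUGE-1 F1 (assumption: lowercase alphanumeric tokens, no stemming).
# This is NOT the official ROUGE toolkit; it only illustrates unigram overlap.
def rouge1_f1(reference: str, candidate: str) -> float:
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical layout: a facet (one reference sentence) is a list of alternative
# support groups; each group is a set of document sentence indices.
Facet = Sequence[Set[int]]


def facet_covered(support_groups: Facet, extracted: Set[int]) -> bool:
    # Assumed rule: covered if all sentences of at least one group were extracted.
    return any(group <= extracted for group in support_groups)


def facet_coverage(facets: Sequence[Facet], extracted: Iterable[int]) -> float:
    # Fraction of reference facets covered by the extracted sentence indices.
    extracted_set = set(extracted)
    if not facets:
        return 0.0
    covered = sum(facet_covered(groups, extracted_set) for groups in facets)
    return covered / len(facets)


# Chelsea example: no shared tokens, so unigram F1 is zero.
reference = "Chelsea boss Jose Mourinho and United manager Louis van Gaal are pals."
manual_extract = (
    "The duo have been friends since they first worked together at Barcelona "
    "in 1997 where they enjoyed a successful relationship at the Camp Nou."
)
print(rouge1_f1(reference, manual_extract))  # 0.0

# Illustrative facets for a hypothetical document (indices are made up).
facets = [
    [{3}],          # facet 0: sentence 3 alone expresses it
    [{0, 5}, {7}],  # facet 1: sentences 0 and 5 together, or sentence 7
    [{2}],          # facet 2: sentence 2
]
print(facet_coverage(facets, [3, 5, 9]))  # 0.333... (only facet 0)
print(facet_coverage(facets, [0, 3, 5]))  # 0.666... (facets 0 and 1)
```

Because the facet score compares only sentence indices, a paraphrasing extract like the Chelsea one can be credited once it is annotated as a support sentence, even though its token-level F1 is zero.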