EMNLP 2021

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, Mitesh M. Khapra

32 citations

Abstract

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality gets affected only along this specific criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed output. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics, exposing their limitations, and will facilitate better design, analysis and evaluation of such metrics.

Task and criteria

Machine Translation
    Adequacy: The generated translation should adequately represent all the information present in the reference.

Question Generation
    Relevance: Is the question related to the source material it is based upon?
    Answerability: Is the generated question answerable given the context?

Abstractive Summarization
    Informativeness: The summary should convey the key points of the text.
    Non-redundancy: The summary should not repeat any points, and ideally have maximal information coverage within the limited text length.
    Referential clarity: Any intra-sentence or cross-sentence references in the summary should be unambiguous and within the scope of the summary.
    Focus: The summary needs to have a focus and all the sentences need to contain information related to this focal point.
    Structure and Coherence: The summary should be a well-organized and coherent body of information.

Dialogue Generation
    Making sense: Does the bot say things that don't make sense?
    Engagingness: Is the dialogue agent enjoyable to talk to?
    Interestingness: Did you find the bot interesting to talk to?
    Inquisitiveness: Does the bot ask a good amount of questions?
    Listening: Does the bot pay attention to what you say?
    Avoiding Repetition: Does the bot repeat itself (either within or across utterances)?
    Humanness: Is the conversation with a person or a bot?

Image Captioning
    Relevance: The caption should be specific and related to the image.
    Thoroughness: The caption should adequately describe the image.

Structured Data to Text Generation
    Data Coverage: Does the text include descriptions of all predicates presented in the data?
    Relevance: Does the text describe only such predicates which are found in the data?
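
The perturbation idea described in the abstract (change an output so that only one criterion, such as coverage, is affected, then see whether a metric's score moves accordingly) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `unigram_f1` is a toy stand-in and not one of the 25 metrics studied, and `drop_phrase` and `reacts_to_perturbation` are hypothetical helper names, not the authors' templates or code.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy stand-in metric (an assumption): unigram-overlap F1 on lowercased tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def drop_phrase(text: str, phrase: str) -> str:
    """Hypothetical coverage template: delete one content phrase from the output."""
    return " ".join(text.replace(phrase, " ").split())


def reacts_to_perturbation(metric, reference: str, output: str, perturbed: str) -> bool:
    """True if the metric scores the perturbed output strictly lower than the original."""
    return metric(perturbed, reference) < metric(output, reference)


reference = "John Smith was born in London and works as a teacher"
output = "John Smith is a teacher who was born in London"

# Remove the birthplace so that coverage drops while the sentence stays fluent.
perturbed = drop_phrase(output, "who was born in London")

original_score = unigram_f1(output, reference)
perturbed_score = unigram_f1(perturbed, reference)
metric_reacts = reacts_to_perturbation(unigram_f1, reference, output, perturbed)

print(f"original:  {original_score:.3f}")
print(f"perturbed: {perturbed_score:.3f}")
print(f"metric reacts to coverage drop: {metric_reacts}")
```

In this toy case the overlap metric does penalise the dropped phrase. The check becomes informative when the same comparison is run over many outputs and several criterion-specific templates, and the direction and size of the metric's change are compared with human scores on the perturbed outputs.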