ACL 2023

MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types

Keerthiram Murugesan, Sarathkrishna Swaminathan, Soham Dan, Subhajit Chaudhury, R. Chulaka Gunasekara, Maxwell Crouse, Diwakar Mahajan, Ibrahim Abdelaziz, Achille Fokoue, Pavan Kapanipathi, Salim Roukos, Alexander Gray

Abstract

With the growing interest in large language models, evaluating the quality of machine-generated text against reference (typically human-generated) text has become a focus of attention. Most recent work either proposes task-specific evaluation metrics or studies the properties of machine-generated text captured by existing metrics. In this work, we propose a new evaluation scheme that models human judgments in 7 NLP tasks based on the fine-grained mismatches between a pair of texts. Inspired by recent efforts toward fine-grained evaluation in several NLP tasks, we introduce a set of 13 mismatch error types, such as spatial/geographic errors and entity errors, to guide the model toward better prediction of human judgments. We propose a neural framework for evaluating machine texts that uses these mismatch error types as auxiliary tasks and re-purposes existing single-number evaluation metrics as additional scalar features, alongside textual features extracted from the machine and reference texts. Our experiments reveal key insights about the existing metrics via the mismatch errors. We show that the mismatch errors between sentence pairs on held-out datasets from 7 NLP tasks align well with human evaluation.

| Error Type | Abbr | Definition | Example Sentence |
|---|---|---|---|
| Grammatical/Usage Error | GramErr | Faulty or incorrect use of grammar and syntax. | ref: Two paintings are on the wall. gen: Two painting is on the wall. |
| Predicate Error | PredErr | Error in the predicate or its usage with respect to the reference text. | ref: John entered the kitchen. gen: John found the kitchen. |
| Entity Error | EntErr | Mismatch in the primary arguments of the predicate. | ref: A dog chased a cat. gen: A dog chased a rat. |
| Predicate Ordering Error | PredOrdErr | Error in the causal or temporal ordering of the predicates/events. | ref: The police arrested the suspect then he was taken to prison. gen: The suspect was taken to the prison then the police arrested him. |
| Hyponyms/Hypernyms Errors | HypErr | Violations in hypernym/hyponym usage. | ref: Jim studied mechanical engineering. gen: Jim studied architectural science. |
| Numerical Error | NumErr | Error in numerals, quantifiers, or other number-related expressions (ordinals, cardinals, etc.). | ref: Martha ate four apples. gen: Martha ate six apples. |
| Spatial/Temporal Error | STErr | Error in spatial or geographic information (location, time, etc.). | ref: Dave lives in south Chicago. gen: Dave lives in south Chile. |
| Attribute/Modifier Error | AttrErr | Mistakes in additional information concerning the predicates and entities (not covered by numerical, spatial, geographic). | ref: Greg has two small dogs. gen: Greg has two big dogs. |
| Question Error | QuestErr | Error or change in the nature of the question's intention. | ref: Did you take the dog to the vet? gen: When did you take the dog to the vet? |
| Negation | NegErr | Negated compared to the reference text. | ref: Susan took the gift. gen: Susan did not take the gift. |
| Missing Information | MissInfo | Missing key details from the reference text. | ref: Bob drove to the hospital and saw a doctor. gen: Bob saw a doctor. |
| Out of Reference | OutofRef | Contains additional details not present in the reference text. | ref: Jack and Jane are friends. gen: Jack and Jane are friends. Jack plays football. |
| Redundant/Repetition | RepErr | Same/similar information repeated more than once. | |
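To make the role of the error types concrete, the following minimal sketch shows one way the 13 error types listed above could be encoded as auxiliary-task targets next to scalar metric features. It is an illustration under stated assumptions, not the framework's actual implementation: treating the error types as a multi-hot (multi-label) vector is an assumption, the helper names `multi_hot` and `build_example` are hypothetical, and the metric names and scores in the usage lines are made-up placeholders rather than reported values. Only the error-type names, abbreviations, and the example sentence pair come from the table.

```python
from enum import Enum


class MismatchError(Enum):
    """The 13 mismatch error types, named by the abbreviations in the table."""
    GramErr = "Grammatical/Usage Error"
    PredErr = "Predicate Error"
    EntErr = "Entity Error"
    PredOrdErr = "Predicate Ordering Error"
    HypErr = "Hyponyms/Hypernyms Errors"
    NumErr = "Numerical Error"
    STErr = "Spatial/Temporal Error"
    AttrErr = "Attribute/Modifier Error"
    QuestErr = "Question Error"
    NegErr = "Negation"
    MissInfo = "Missing Information"
    OutofRef = "Out of Reference"
    RepErr = "Redundant/Repetition"


# Fixed ordering (enum definition order) so label vectors are comparable.
ERROR_ORDER = list(MismatchError)


def multi_hot(errors):
    """Encode the errors present in a (ref, gen) pair as a 0/1 vector.

    Assumption: several error types may co-occur in one pair.
    """
    present = set(errors)
    return [1 if e in present else 0 for e in ERROR_ORDER]


def build_example(ref, gen, errors, metric_scores):
    """Bundle one training example (hypothetical layout).

    metric_scores maps a metric name to its single-number score; the
    scores are emitted in sorted-name order as scalar features.
    """
    return {
        "ref": ref,
        "gen": gen,
        "aux_labels": multi_hot(errors),
        "scalar_features": [metric_scores[k] for k in sorted(metric_scores)],
    }


# Usage with the NumErr row of the table; the scores are placeholders.
example = build_example(
    ref="Martha ate four apples.",
    gen="Martha ate six apples.",
    errors=[MismatchError.NumErr],
    metric_scores={"metric_a": 0.5, "metric_b": 0.8},
)
```

Keeping the label order fixed by the enum definition means every example yields a vector of the same length and layout, which is what a set of per-error-type auxiliary prediction heads would need as targets.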