ACL 2025

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation

Yang Zhong, Diane J. Litman


Abstract

Ensuring factual consistency in summarization remains a challenge, especially for long-document evaluation. While automated, reference-free evaluation models are essential given the impracticality of large-scale human assessment for lengthy texts, challenges persist in assessing how well different evaluation systems handle different summary granularities and summaries from evolving model generations. In this work, we conduct a systematic study of diverse factual-consistency evaluation systems across four long-document datasets, encompassing summaries generated by models ranging from non-LLMs to proprietary LLMs. Our analysis reveals that fine-grained continuous scores can provide more reliable assessments of different evaluation systems' capabilities than binary classification. We also examine the relationship between sentence-level and summary-level model performance, highlighting its dependency on dataset characteristics. Moreover, our study reveals that advanced systems can achieve higher recall in error detection for older summaries, yet struggle with false positives and fine-grained error detection. Our analysis and case studies provide further insights into designing robust factuality evaluation systems, which are increasingly in demand as generative models advance rapidly.

Ex 1: They were all men who paid for their ship with their lives.
Label: SentE; Generator: Phi-2; all systems failed to identify the error.
Author comment: The crew did not pay for their ship with their lives; rather, they paid for their greed when encountering the cursed derelict ship. This factual error is more nuanced and highlights the challenge of deriving the correct relations when using specific entities.

Ex 2: The medicos are aware of the dangers of contagious diseases from the beginning rather than being explicitly warned later.
Label: EntE; Generator: Phi-2; all systems failed to identify the error.
Author comment: The medicos are not aware of the dangers of contagious diseases from the beginning. The error is embedded in the conceptual understanding of the context, not simply at the surface level.

Ex 3: Map is a chronic debilitating disease in ruminants.
Label: SentE; Generator: T5 model; GPT4o and Gemini detected the error, while all others failed.
Author comment: In the original document, the passage "the general characteristics of Johne's disease with respect to the pathogenesis and immune response to MAP, as well as recent advances in development of vaccines were briefly examined" suggests that MAP itself is not a disease but the causative agent of Johne's disease.

Ex 4: The story explores themes of isolation, adaptation, and the risks and ethical dilemmas of colonization and medical experimentation.
Label: OutE; Generator: GPT-4; only linguistic-based models detected the error, and all other models failed to identify it.
Author comment: The sentence introduces an interpretative element that is not explicitly stated in the transcript. This suggests that interpretative elements are becoming harder to detect, even for LLMs that generate the summary itself.

Ex 5: Starrett Blade, a space pirate, is trapped by the feared Devil Garrett and fights for his life.
Label: EntE; Generator: GPT3.5 model; only GPT4o and StructS (BS) succeeded in detecting the error.
Author comment: The source text begins with, "Trapped by the most feared space pirates, Devil Garrett, Starrett Blade was fighting for his life." Later passages in the main document show that Starrett is, in fact, a hunter of space pirates. Here, the model confused the entity attribution.
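
The abstract's contrast between continuous scores and binary classification, and between sentence-level and summary-level judgments, can be illustrated with a small sketch. It is not the paper's evaluation protocol: the toy scores, the 0.5 threshold, the choice of balanced accuracy and ROC AUC, and the min-aggregation rule are all assumptions made for illustration, and the names `system_a`, `system_b`, and `summary_score` are hypothetical.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.5) -> float:
    """Binary view: threshold each score, then average recall over both classes."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    recall_pos = sum(pos) / len(pos)
    recall_neg = sum(1 - p for p in neg) / len(neg)
    return (recall_pos + recall_neg) / 2


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Continuous view: chance that a consistent item outscores an
    inconsistent one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summary_score(sentence_scores: Sequence[float]) -> float:
    """Assumed aggregation: a summary is only as consistent as its weakest sentence."""
    return min(sentence_scores)


# Toy data (made up): 1 = factually consistent, 0 = inconsistent.
labels = [1, 1, 1, 0, 0, 0]
system_a = [0.90, 0.80, 0.60, 0.40, 0.30, 0.10]  # ranks well, scores well separated
system_b = [0.90, 0.80, 0.60, 0.55, 0.52, 0.51]  # ranks equally well, poorly calibrated

for name, scores in [("A", system_a), ("B", system_b)]:
    print(name, balanced_accuracy(labels, scores), roc_auc(labels, scores))
# A 1.0 1.0
# B 0.5 1.0
```

In this toy setup both systems order consistent items above inconsistent ones perfectly, yet the fixed-threshold binary metric makes system B look no better than chance. Ranking-based evaluation over continuous scores avoids that dependence on a single cutoff, which is one possible reading of why continuous scores may give a steadier picture of an evaluator's capability.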