ACL 2024

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Gal Yona, Roee Aharoni, Mor Geva

Abstract

Factual questions can typically be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not take this into account explicitly and instead compare a predicted answer against reference answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset. We evaluate models using a range of decoding methods on GRANOLA-EQ, including a new algorithm called Decoding with Response Aggregation (DRAG), which is geared towards aligning the answer granularity with the model's uncertainty. Our experiments show that large language models with standard decoding methods tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20-point increase in accuracy on average, which further increases for rare entities, revealing that standard evaluation and decoding schemes may underestimate the knowledge encapsulated in language models.
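
To make the evaluation setting concrete, the following is a minimal illustrative sketch of scoring one prediction against an ordered set of multi-granularity answers. It is not the paper's implementation: the function names (`normalize`, `granola_scores`), the exact-match comparison after normalization, the intermediate answer "August 1961", and the geometric `decay` weighting used for informativeness are all assumptions chosen for illustration.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (a simple stand-in matcher)."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def granola_scores(prediction: str, answers: list[str], decay: float = 0.5) -> tuple[float, float]:
    """Score a prediction against answers ordered from finest to coarsest.

    Returns (accuracy, informativeness). Accuracy is 1.0 if the prediction
    matches an answer at any granularity level. Informativeness is
    decay ** level, so coarser matches earn less credit. The decay scheme
    is a hypothetical choice, not taken from the source.
    """
    pred = normalize(prediction)
    for level, answer in enumerate(answers):
        if normalize(answer) == pred:
            return 1.0, decay ** level
    return 0.0, 0.0


# Hypothetical multi-granularity answers for "When was Barack Obama born?"
answers = ["August 4, 1961", "August 1961", "1961"]

print(granola_scores("August 4, 1961", answers))  # finest level matches
print(granola_scores("1961", answers))            # correct but coarser
print(granola_scores("1962", answers))            # matches no level
```

Under this sketch, a coarse but correct answer such as "1961" counts as accurate while receiving less informativeness credit than the fully specific date, which is the trade-off the evaluation setting is meant to expose.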