ACL2024
SPAGHETTI: Open-Domain Question Answering from Heterogeneous Data Sources with Retrieval and Semantic Parsing
Heidi C. Zhang, Sina J. Semnani, Farhad Ghassemi, Jialiang Xu, Shicheng Liu, Monica S. Lam
Abstract
We introduce SPAGHETTI: Semantic Parsing Augmented Generation for Hybrid English information from Text Tables and Infoboxes, a hybrid question-answering (QA) pipeline that utilizes information from heterogeneous knowledge sources, including a knowledge base, text, tables, and infoboxes. Our LLM-augmented approach achieves state-of-the-art performance on the COMPMIX dataset, the most comprehensive heterogeneous open-domain QA dataset, with a 56.5% exact match (EM) rate. More importantly, manual analysis on a sample of the dataset suggests that SPAGHETTI is more than 90% accurate, indicating that EM is no longer suitable for assessing the capabilities of QA systems today.

Figure 1: Given an input query, SPAGHETTI gathers factual information from four sources to generate a prediction. In parallel, we parse the query to logical form to query Wikidata, run retrieval to find information from Wikipedia text, tables, and infoboxes, and generate a response using an LLM, only keeping a claim if it is verified.

main contribution is a hybrid LLM-based system (Fig. 1), SPAGHETTI, that combines information retrieval with semantic parsing for question answering and achieves a state-of-the-art 56.5% exact match rate on COMPMIX, the most comprehensive open-domain QA dataset on heterogeneous sources.

Second, we show that using evaluation methods closer to human judgment suggests that SPAGHETTI is more than 90% accurate on COMPMIX, indicating there is little room for improvement. This suggests that measuring the accuracy of LLM-based QA systems with the exact-match metric against hand-annotated answers is obsolete.

2 Related Work

TextQA, TableQA, and KBQA have all been individually studied extensively (Zhao et al., 2023a;

Retrieval-augmented generation is a common approach for grounding LLMs in textual knowledge sources like Wikipedia. To avoid LLM hallucination, Semnani et al.
(2023) propose the WikiChat pipeline that combines retrieval with verification of the LLM-generated response, achieving significantly higher factual accuracy than GPT-4. We adopt a similar approach when handling text.

We first extract Wikipedia text using Wikiextractor. ColBERT (Santhanam et al., 2022) is used to retrieve Wikipedia passages that may answer a given query, and each of the top-k retrieved passages goes through a few-shot LLM summarizer.

This work focuses specifically on open-domain QA with heterogeneous knowledge sources, and we only report results on the COMPMIX dataset due to the limited availability of high-quality datasets in this domain. A natural direction for future work is to develop more diverse and advanced datasets that further push the need to utilize each knowledge source.

We evaluate on single-turn QA and do not work with conversations in this paper; SPAGHETTI can be extended to handle fact-based conversational questions or even chitchat that involves facts.

We have a relatively small sample size for human evaluation, because the expert manually checks the correctness of each example with Internet searches, which is labor-intensive. However, we acknowledge that a larger sample size would increase the statistical confidence of our evaluation.

Finally, we note that a number of Wikipedia tables are not well-formatted after preprocessing and linearization. Since Wikipedia tables are embedded as HTML elements that allow for idiosyncrasies like a table with one cell spanning multiple columns or color-highlighted cells, some are hard to parse correctly. Engineering solutions for such edge cases would further improve TableQA.
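
To make the table-parsing edge case concrete, the following is a minimal sketch of linearizing an HTML table while expanding a cell that spans multiple columns. It is not the preprocessing used in SPAGHETTI: the class `TableLinearizer`, the function `linearize`, and the "header: value" output format are all hypothetical choices made for illustration, and the sketch assumes the first row is the header and ignores `rowspan`, nested tables, and styling such as color highlights.

```python
from html.parser import HTMLParser


class TableLinearizer(HTMLParser):
    """Collects table rows, repeating a cell's text across its colspan.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None
        self._span = 1     # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._span = int(span) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace, then repeat the text once per spanned column
            # so every row keeps the same number of columns as the header.
            text = " ".join("".join(self._cell).split())
            self._row.extend([text] * max(self._span, 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def linearize(html: str) -> str:
    """Turn each body row into 'header: value; ...' text, one row per line."""
    parser = TableLinearizer()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows  # assumption: the first row is the header
    return "\n".join(
        "; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in body
    )


if __name__ == "__main__":
    example = (
        "<table><tr><th>Year</th><th>Team</th><th>Goals</th></tr>"
        "<tr><td>2019</td><td colspan='2'>Did not play</td></tr></table>"
    )
    print(linearize(example))
    # Year: 2019; Team: Did not play; Goals: Did not play
```

Repeating the spanned text under each header it covers is only one possible design. It keeps the columns aligned for a downstream reader, at the cost of duplicating text, and a real pipeline would need to handle the remaining irregularities separately.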