ACL2024

Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction

Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogala, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz

Abstract

Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), makes it possible to handle extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in their use of human labor, not utilizing modern assisting tools like Large Language Models (LLM) to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), specifically tailored for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, along with novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and an evaluation of baseline models.

1 Introduction

Question answering (QA) systems serve as a sophisticated interface between humans and computers. To further enhance their utility, we need QA systems that are capable of answering questions based on extensive knowledge (Petroni et al., 2021). The knowledge base question answering (KBQA) task addresses this need by using structured knowledge graphs (KG) to provide accurate and relevant answers (Lan et al., 2021). KBQA systems leverage these graphs, which are rich with interconnected entities and relationships, to decode complex queries and deliver precise answers.
Importantly, systems that reason over KGs are more resistant to the phenomenon of hallucinations, common in large lan- [...] -notation. MRC is essential for AI to understand and analyze texts like a human reader (Rajpurkar