NeurIPS 2021

EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation

Anthony M. Colas, Ali Sadeghian, Yue Wang, Daisy Zhe Wang

26 citations

Abstract

We introduce EventNarrative, a knowledge graph-to-text dataset built from publicly available open-world knowledge graphs (KGs). Given the recent advances in event-driven Information Extraction (IE), and given that prior research on graph-to-text has focused only on entity-driven KGs, this paper focuses on event-centric data. However, our data generation system can still be adapted to other types of KG data. Existing large-scale datasets in the graph-to-text area are non-parallel, meaning there is a large disconnect between the KGs and the text. The datasets that do pair KGs with text are small-scale and either manually generated or generated without a rich ontology, which makes the corresponding graphs sparse. Furthermore, these datasets contain many unlinked entities between their KG and text pairs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text, making it six times larger than the current largest parallel dataset. It makes use of a rich ontology, all of the KG entities are linked to the text, and our manual annotations confirm high data quality. Our aim is two-fold: to help break new ground in event-centric research, where data is lacking, and to give researchers a well-defined, large-scale dataset with which to better evaluate existing and future knowledge graph-to-text models. We also evaluate two types of baselines on EventNarrative: a graph-to-text specific model and two state-of-the-art language models, which previous work has shown to be adaptable to the knowledge graph-to-text domain.

Introduction

Natural language generation (NLG) is a rapidly developing area of natural language processing (NLP). With the advent of transformer-based language models, such as BERT and ERNIE [45], NLG has seen recent advances in abstractive summarization, dialogue response generation, and generative question answering.
These NLG tasks have all been accompanied by previously curated large-scale parallel datasets, where parallel denotes a tightly coupled input/output pair, allowing for generalized fine-tuning. Examples include the CNN/DM dataset [13, 28] and Gigaword [40] for abstractive summarization, Persona-Chat [54] and DSTC7 [7] for dialogue response generation, and CoQA [38] for generative question answering. While the aforementioned NLG tasks have had a history of curated large-scale datasets to fine-tune on, the task of knowledge graph-to-text has been missing parallel data at this scale.

Knowledge graph-to-text generation is the process of taking structured data in the form of a knowledge graph (KG), which is a collection of subject-predicate-object (s, p, o) triples, and describing the graph in one or more natural language sentences. KGs describe real-world entities and their properties and tend to be incomplete. There are over 1,300 publicly available KGs with over 100B triples [14, 26], containing structured knowledge about biomedicine, geography, socioeconomics, life sciences, chemistry, publications, etc., as well as many cross-domain KGs. Some popular KGs include Wikidata [47], YAGO [37], ICEWS [31], and DBpedia [20]. These KGs are both user curated and extracted from

35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks.
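
To make the definition above concrete, the following is a minimal Python sketch of a KG as a collection of (s, p, o) triples paired with a text, together with a naive check for entities that are not linked to that text. It is an illustration only and not the EventNarrative pipeline: the `Triple` type, the `linearize` and `unlinked_entities` helpers, the `<S>`/`<P>`/`<O>` markers, and the toy event and entity names are all hypothetical, and plain substring matching is an assumption standing in for real entity linking.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    """One (s, p, o) edge of a knowledge graph."""
    subject: str
    predicate: str
    obj: str


# Hypothetical toy event-centric graph and paired text (not from the dataset).
graph: List[Triple] = [
    Triple("Example Cup 2020", "country", "Freedonia"),
    Triple("Example Cup 2020", "winner", "Team A"),
]
text = "The Example Cup 2020 was held in Freedonia."


def linearize(triples: List[Triple]) -> str:
    """Flatten a graph into one string, one way to feed it to a
    sequence-to-sequence language model (marker tokens are assumed)."""
    return " ".join(
        f"<S> {t.subject} <P> {t.predicate} <O> {t.obj}" for t in triples
    )


def entities(triples: List[Triple]) -> Set[str]:
    """Collect every subject and object appearing in the graph."""
    return {t.subject for t in triples} | {t.obj for t in triples}


def unlinked_entities(triples: List[Triple], paired_text: str) -> Set[str]:
    """Return graph entities that never appear verbatim in the paired text.
    Case-insensitive substring matching is a simplifying assumption."""
    lowered = paired_text.lower()
    return {e for e in entities(triples) if e.lower() not in lowered}


if __name__ == "__main__":
    print(linearize(graph))
    print(unlinked_entities(graph, text))
```

In this toy pair, "Team A" is in the graph but is never mentioned in the text, which is the kind of KG/text mismatch the abstract refers to as an unlinked entity. A parallel dataset in the sense used above would aim for every graph entity to be grounded in its paired text.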