EMNLP 2021

Graphine: A Dataset for Graph-aware Terminology Definition Generation

Zequn Liu, Shukai Wang, Yiyang Gu, Ruiyi Zhang, Ming Zhang, Sheng Wang

7 citations

Abstract

Precisely defining terminology is the first step in scientific communication. Developing neural text generation models for definition generation can circumvent labor-intensive curation, further accelerating scientific discovery. Unfortunately, the lack of a large-scale terminology definition dataset hinders progress toward definition generation. In this paper, we present Graphine, a large-scale terminology definition dataset covering 2,010,648 terminology-definition pairs and spanning 227 biomedical subdisciplines. The terminologies in each subdiscipline further form a directed acyclic graph, opening up new avenues for developing graph-aware text generation models. We then propose Graphex, a novel graph-aware definition generation model that integrates a transformer with a graph neural network. Our model outperforms existing text generation models by exploiting the graph structure of terminologies. We further demonstrate how Graphine can be used to evaluate pretrained language models, compare graph representation learning methods, and predict sentence granularity. We envision Graphine as a unique resource for definition generation and many other NLP tasks in biomedicine. [1]

* Corresponding author

[1] Our dataset is available at https://zenodo.org/record/5320310#.YSxHgI77Q2w and our code is available at https://github.com/zequnl/Graphex

Terminology: Enhancer of variegation
Definition: Genotype g1 is an enhancer of variegation if, and only if, some genotype g2 has a variegated phenotype and the degree of variegation caused by g1, g2 is greater than that caused by g2 alone.

Terminology: Modifier of variegation
Definition: Phenotype that is reduction in or loss of a stereotypical behavioral response to touch.
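
To make the idea of terminology-definition pairs arranged in a directed acyclic graph more concrete, here is a minimal Python sketch of one way such data could be held in memory. It is an illustration only: the `TermGraph` class, its method names, the placeholder definition, and the edge between the two example terms are all assumptions made for this sketch, and none of it reflects the actual Graphine file format or the Graphex code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TermGraph:
    """Hypothetical container for term-definition pairs plus parent/child edges."""
    definitions: dict[str, str] = field(default_factory=dict)
    children: dict[str, set[str]] = field(default_factory=dict)
    parents: dict[str, set[str]] = field(default_factory=dict)

    def add_term(self, term: str, definition: str) -> None:
        # Register a term with its definition and empty adjacency sets.
        self.definitions[term] = definition
        self.children.setdefault(term, set())
        self.parents.setdefault(term, set())

    def add_edge(self, parent: str, child: str) -> None:
        # Directed edge from a broader term to a more specific one.
        for term in (parent, child):
            if term not in self.definitions:
                raise KeyError(f"unknown term: {term}")
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def neighbors(self, term: str) -> list[str]:
        # Terms whose definitions a graph-aware model could use as context.
        return sorted(self.parents[term] | self.children[term])

    def is_acyclic(self) -> bool:
        # Kahn's algorithm: the graph is a DAG iff every node can be removed
        # in topological order.
        indegree = {t: len(p) for t, p in self.parents.items()}
        queue = deque(t for t, d in indegree.items() if d == 0)
        visited = 0
        while queue:
            node = queue.popleft()
            visited += 1
            for child in self.children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
        return visited == len(self.definitions)


graph = TermGraph()
graph.add_term("Modifier of variegation", "<definition text>")  # placeholder text
graph.add_term(
    "Enhancer of variegation",
    "Genotype g1 is an enhancer of variegation if, and only if, some genotype "
    "g2 has a variegated phenotype and the degree of variegation caused by "
    "g1, g2 is greater than that caused by g2 alone.",
)
# Assumed edge direction (broader term -> narrower term), for illustration only.
graph.add_edge("Modifier of variegation", "Enhancer of variegation")

print(graph.is_acyclic())
print(graph.neighbors("Enhancer of variegation"))
```

Keeping both `parents` and `children` makes it cheap to collect the definitions of neighboring terms in either direction, which is the kind of context a graph-aware generation model would draw on.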