NeurIPS 2022

A new dataset for multilingual keyphrase generation

Frédéric Piedboeuf, Philippe Langlais

Cited 5 times

Abstract

Keyphrases are key components in efficiently dealing with the ever-increasing amount of information present on the internet. While there are many recent papers on English keyphrase generation, keyphrase generation for other languages remains vastly understudied, mostly due to the absence of datasets. To address this, we present a novel dataset called Papyrus, composed of 16427 pairs of abstracts and keyphrases. We release four versions of this dataset, corresponding to different subtasks. Papyrus-e considers only English keyphrases, Papyrus-f considers French keyphrases, Papyrus-m considers keyphrase generation in any language (mostly French and English), and Papyrus-a considers keyphrase generation in several languages. We train a state-of-the-art model on all four tasks and show that they lead to better results for non-English languages, with an average improvement of 14.2% on keyphrase extraction and 2.0% on generation. We also show an improvement of 0.4% on extraction and 0.7% on generation over English state-of-the-art results by concatenating Papyrus-e with the Kp20K training set.

Other efforts at keyphrase extraction in languages other than English include the DEFT2012 and DEFT2016 challenges, where keyphrase extraction and indexation were tested on French articles from the humanities [38; 2; 14]. From what we can infer, this dataset is no longer maintained, and we were unable to obtain it. Many other techniques have been […]

• Papyrus-a: From the multiple abstracts of a document, generate keyphrases in the same languages as the abstracts (many to many languages).

We show a toy example in Table 1. For Papyrus-a, we simply concatenate the abstracts in the order in which they appear in Papyrus, and we do the same for the keyphrases.
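The Papyrus-a construction described above (concatenating a document's abstracts, and likewise its keyphrases, in their original order) can be sketched as follows. This is a minimal illustration, not the authors' code. The record layout (`abstracts`, `keyphrases`), the function name `build_papyrus_a_example`, and both separators are assumptions made for the example, since the text does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout (an assumption, not the released Papyrus format):
#   "abstracts":  one abstract per language, in the order stored in the dataset
#   "keyphrases": one list of keyphrases per abstract, in the same order
Record = Dict[str, list]


def build_papyrus_a_example(
    record: Record,
    abstract_sep: str = " ",      # assumed separator between abstracts
    keyphrase_sep: str = " ; ",   # assumed separator between keyphrases
) -> Tuple[str, str]:
    """Concatenate abstracts and keyphrases in their original order."""
    abstracts: List[str] = record["abstracts"]
    keyphrases: List[List[str]] = record["keyphrases"]

    # Source side: all abstracts joined in dataset order.
    source = abstract_sep.join(a.strip() for a in abstracts)

    # Target side: keyphrases flattened language by language, same order.
    target = keyphrase_sep.join(
        kp.strip() for per_language in keyphrases for kp in per_language
    )
    return source, target


# Toy record with a French and an English abstract (invented for illustration).
toy = {
    "abstracts": [
        "Nous étudions la génération de mots-clés.",
        "We study keyphrase generation.",
    ],
    "keyphrases": [
        ["génération de mots-clés", "résumé"],
        ["keyphrase generation", "abstract"],
    ],
}

source, target = build_papyrus_a_example(toy)
print(source)
print(target)
```

Keeping the abstracts and the keyphrases in the same language order means the target sequence mirrors the source, which matches the many-to-many framing of Papyrus-a.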