EMNLP 2021

Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?

Kenneth Church, Yuchen Bian

7 citations

Abstract

This survey/position paper discusses ways to improve the coverage of resources such as WordNet. Rapp estimated correlations, ρ, between corpus statistics and psycholinguistic norms. ρ improves with quantity (corpus size) and quality (balance). 1M words are enough for simple estimates (unigram frequencies), but at least 100M are required for pairs of words (word associations, edges). Knowledge Graph Completion (KGC) attempts to learn missing links in WN18. Unfortunately, WN18 is flawed: information leaks from the training set to the test set. More seriously, WN18 is based on SemCor (just 200k words) and is dated (collected in the 1960s). KGC cannot learn anything that happened since the 1960s, or associations that require 100M words.
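The train-to-test leakage mentioned in the abstract can be illustrated with a small check. The abstract does not say which form the leak takes; the sketch below assumes the commonly reported case in which a test triple links the same two entities as a training triple, often through an inverse relation. The function name `leaked_test_triples` and the toy triples are hypothetical and are not taken from the paper or from the WN18 files.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A triple is (head, relation, tail), as in typical KGC benchmarks.
Triple = Tuple[str, str, str]


def leaked_test_triples(train: Iterable[Triple], test: Iterable[Triple]) -> List[Triple]:
    """Return test triples whose entity pair already occurs in train.

    The pair is compared without regard to direction, so a test triple
    (b, r2, a) counts as leaked when train contains (a, r1, b).
    This is an assumed, simplified notion of leakage.
    """
    seen_pairs: Set[FrozenSet[str]] = set()
    for head, _relation, tail in train:
        seen_pairs.add(frozenset((head, tail)))
    return [t for t in test if frozenset((t[0], t[2])) in seen_pairs]


# Toy data (hypothetical), only to show the shape of the check.
train = [
    ("dog", "_hypernym", "canine"),
    ("canine", "_hypernym", "carnivore"),
]
test = [
    ("canine", "_hyponym", "dog"),   # same pair as a training triple, reversed
    ("cat", "_hypernym", "feline"),  # pair not seen in training
]

leaked = leaked_test_triples(train, test)
leak_rate = len(leaked) / len(test)
print(leaked, leak_rate)
```

On a real split, a high `leak_rate` would suggest that a model can score well by memorizing entity pairs rather than by learning anything new about coverage.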