EMNLP 2023

To Split or Not to Split: Composing Compounds in Contextual Vector Spaces

Christopher Jenkins, Filip Miletic, Sabine Schulte im Walde

Cited by 1

Abstract

We investigate the effect of sub-word tokenization on representations of German noun compounds: single orthographic words which are composed of two or more constituents but often tokenized into units that are not morphologically motivated or meaningful. Using variants of BERT models and tokenization strategies on domain-specific restricted diachronic data, we introduce a suite of evaluations relying on the masked language modelling task and compositionality prediction. We obtain the most consistent improvements by pre-splitting compounds into constituents.
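As a rough illustration of what pre-splitting compounds into constituents could look like before sub-word tokenization, here is a minimal sketch. It is not the authors' pipeline: the lookup table `COMPOUND_SPLITS`, the function `pre_split`, and the example compounds are all hypothetical, and a real setup would presumably use a compound splitter or an annotated resource rather than a hand-written dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical compound -> constituents table (illustrative entries only).
COMPOUND_SPLITS: Dict[str, Tuple[str, ...]] = {
    "Apfelbaum": ("Apfel", "Baum"),
    "Haustür": ("Haus", "Tür"),
}


def pre_split(sentence: str,
              splits: Dict[str, Tuple[str, ...]] = COMPOUND_SPLITS) -> str:
    """Replace each known compound with its space-separated constituents.

    Words are matched on whitespace boundaries only, so a compound with
    attached punctuation (e.g. "Apfelbaum.") is left unchanged in this sketch.
    """
    pieces: List[str] = []
    for word in sentence.split():
        # Unknown words pass through untouched.
        pieces.extend(splits.get(word, (word,)))
    return " ".join(pieces)


print(pre_split("Der Apfelbaum steht vor der Haustür"))
```

The pre-split text would then go to the model's usual sub-word tokenizer, so that token boundaries can fall at the constituent boundary instead of at an arbitrary position inside the compound.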