EMNLP 2021

Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation

Nattapong Tiyajamorn, Tomoyuki Kajiwara, Yuki Arase, Makoto Onizuka

Cited by 16

Abstract

We propose a method to distil language-agnostic meaning embeddings using a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the meaning of the sentence. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows for efficient cross-lingual sentence similarity estimation using a simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms strong baselines that use the original multilingual embeddings. The method also consistently improves the performance of any pre-trained multilingual sentence encoder, even for low-resource language pairs, where only tens of thousands of parallel sentence pairs are available.
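
The similarity step mentioned in the abstract can be illustrated with a minimal sketch. It covers only the cosine similarity calculation between two meaning embeddings, not the distillation method itself, which the abstract does not specify. The function name `cosine_similarity`, the variable names, and the toy vectors are all assumptions made for illustration; they are not taken from the paper or its code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical meaning embeddings (toy values, not produced by any real encoder).
src_meaning = [0.2, 0.7, -0.1, 0.4]        # e.g. a source-language sentence
tgt_meaning = [0.25, 0.65, -0.05, 0.35]    # e.g. its translation
unrelated_meaning = [-0.6, 0.1, 0.8, -0.2]  # e.g. an unrelated sentence

print(cosine_similarity(src_meaning, tgt_meaning))
print(cosine_similarity(src_meaning, unrelated_meaning))
```

In this toy setup the translation pair scores higher than the unrelated pair, which is the behaviour one would hope for from language-agnostic embeddings; real scores depend entirely on the encoder and the distillation step.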