EMNLP 2022

English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings

Yau-Shian Wang, Ashley Wu, Graham Neubig

Cited by 18

Abstract

Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE (Gao et al., 2021) to multilingual settings, and reveal that contrastive learning on English data can, surprisingly, learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves over previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to that of fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available.
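
For readers unfamiliar with the contrastive objective that SimCSE-style methods build on, the following is a minimal sketch of an InfoNCE loss with in-batch negatives. It is not the authors' implementation. In unsupervised SimCSE the two views of a sentence come from encoding it twice with different dropout masks; here the embeddings are made-up toy vectors, and the function names (`cosine`, `info_nce_loss`) and the temperature value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (hypothetical helper).

    anchors[i] and positives[i] are two embeddings of the same sentence;
    every positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy data (assumed, for illustration only): two views of three sentences.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

aligned = info_nce_loss(view_a, view_b)
# Rotating the positives pairs each anchor with the wrong sentence.
shuffled = info_nce_loss(view_a, view_b[1:] + view_b[:1])
```

In this toy setup the loss is small when each anchor is paired with its own second view and much larger when the pairs are mismatched, which is the signal that pulls matching sentences together and pushes the rest apart.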