ACL2022

Combining Static and Contextualised Multilingual Embeddings

Katharina Hämmerl, Jindrich Libovický, Alexander Fraser

Abstract

Static and contextual multilingual embeddings have complementary strengths. Static embeddings, while less expressive than contextual language models, can be more straightforwardly aligned across multiple languages. Contextual language models are more powerful. We combine the strengths of static and contextual models to improve multilingual representations. We extract static embeddings for 40 languages from XLM-R, validate those embeddings with cross-lingual word retrieval, and then align them using VecMap. This results in high-quality, highly multilingual static embeddings. Then we apply a novel continued pre-training approach to XLM-R, leveraging the high-quality alignment of our static embeddings to better align the representation space of XLM-R. We show positive results for multiple complex semantic tasks. We will release the static embeddings and the continued pre-training code.

1 Introduction

Multilingual contextual encoders like XLM-R (Conneau et al., 2020a) and mBERT (Devlin et al., 2019), despite being trained without parallel data, exhibit "surprising" cross-linguality (Wu and Dredze, 2019; Conneau et al., 2020b) and have demonstrated strong performance on multilingual and cross-lingual tasks (e.g., Hu et al., 2020; Lauscher et al., 2020; Kurfalı and Östling, 2021; Turc et al., 2021). However, their language-neutrality, meaning how well languages are aligned with each other, has clear limits (Libovický et al., 2020; Cao et al., 2020, inter alia). In particular, more typologically distant language pairs tend to be less well-aligned than more similar ones, affecting transfer performance. By contrast, cross-lingual alignment is well-studied for static embeddings (e.g., Mikolov et al., 2013; Vulić et al., 2020), and they can be aligned using simple transformation matrices, resulting in high-quality multilingual embeddings.
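To make "aligned using simple transformation matrices" concrete, the following is a minimal sketch of supervised orthogonal Procrustes alignment followed by a nearest-neighbour word retrieval check. It is not VecMap and not the pipeline used in this paper. The toy data, the function names `procrustes_align` and `retrieve`, and the 100-pair seed dictionary are all assumptions made purely for illustration.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    src and tgt hold one embedding per row; row i of src is assumed to
    translate to row i of tgt (a hypothetical seed dictionary).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def retrieve(queries, candidates):
    """For each query row, return the index of the most cosine-similar candidate."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


# Toy data (assumption): the "source language" space is a rotated, slightly
# noisy copy of the "target language" space.
rng = np.random.default_rng(0)
n_words, dim = 200, 16
tgt = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ rotation.T + 0.01 * rng.normal(size=(n_words, dim))

# Learn the map on the first 100 word pairs, evaluate retrieval on the rest.
W = procrustes_align(src[:100], tgt[:100])
gold = np.arange(n_words - 100)
acc_before = float(np.mean(retrieve(src[100:], tgt[100:]) == gold))
acc_after = float(np.mean(retrieve(src[100:] @ W, tgt[100:]) == gold))
print(f"retrieval accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
```

Constraining `W` to be orthogonal preserves distances within the source space, so the monolingual quality of the embeddings is unaffected by the mapping. Real embedding spaces are only approximately related by a rotation, so accuracy on real data would be lower than on this toy example.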
However, static embeddings are considerably less expressive than contextual models and have in many applications been superseded by them.

This paper aims to combine the strengths of static and contextual models, and explore how they may benefit from each other. Our method requires no parallel corpus. Monolingual static embeddings have been extracted from BERT by Gupta and Jaggi (2021). We show that their approach can be applied to multilingual embeddings. To our knowledge, we are the first to explore the extraction of static embeddings from a multilingual contextual model. We distill static embeddings for 40 languages from XLM-R, showing that the resulting embeddings are already somewhat cross-lingually aligned, but that their alignment can be improved using established tools (Section 3). These vectors are of high monolingual and cross-lingual quality despite being distilled using only 1M sentences per language. Second, we present a novel continued pre-training approach for the contextual model, combining masked language modelling (MLM) with an alignment loss that leverages the well-aligned static embeddings (Section 4). This results in improved multilingual contextualised embeddings which work well for complex semantic tasks.

2 Contextual and Static Embeddings

XLM-R (Conneau et al., 2020a) and mBERT (Devlin et al., 2019) have been successful in multi- and cross-lingual transfer despite being trained only on monolingual corpora. However, the 100 languages in XLM-R (or 104 in mBERT) are not represented equally well (cf. Wu and Dredze, 2020), either in terms of data size or downstream performance. Both Singh et al. (2019) and Libovický et al. (2020) found that mBERT clusters its representations of languages in a way that mirrors typological language family trees.
However, representations being well-aligned across languages is [...]

[...] how best to extract multilingual embeddings from the model: First, using X2Static (Gupta and Jaggi, 2021) worked better than aggregation (Bommasani et al., 2020) even with a small amount of data. One important difference with Gupta and Jaggi's work is that for our task the sentence-level variant of X2Static yielded better results than the paragraph-level version. Crucially, we also found that embeddings extracted from layer 6 of XLM-R performed noticeably better than embeddings extracted from the output layer. The latter fits with findings for mBERT by Muller et al. (2021) that the middle [...]

[...] Notable exceptions are Korean, Thai, Tagalog, and Vietnamese, where the embeddings showed some success before alignment but were not useful afterwards. It may be that the induced dictionaries did not work well for these languages or that the static embedding spaces were too different (cf. Vulić et al., 2020). In these cases, we use the "unaligned" embeddings for further experiments. Tables 1 and 2 show that after alignment a