EMNLP2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post
Abstract
<p>Pretrained multilingual translation models that use either pixel or subword (BPE) representations, trained on the many-to-one parallel TED-59 dataset. They accompany the EMNLP'23 paper "<a href="https://arxiv.org/abs/2305.14280">Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer</a>." <br><br>The models can be used from the command line or through a script, like other fairseq models, but require our code extension for rendered text with pixel representations. Each model zip file contains the fairseq model checkpoint, vocab files, a language list file, and the relevant sentencepiece model(s).</p><p>We additionally package the TED-59 data here in raw extracted format for ease of comparison (original dataset release and paper by <a href="https://github.com/neulab/word-embeddings-for-nmt">Qi et al., 2018</a>).</p><p>For more information, see our:</p><ul><li>Paper describing the method and training data [<a href="https://arxiv.org/abs/2305.14280">arXiv</a>]</li><li>Code repository with scripts [<a href="https://github.com/esalesky/visrep/tree/multi">github</a>]</li></ul>