EMNLP 2023

Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer

Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post

Abstract

Pretrained multilingual translation models using either pixel or subword (BPE) representations, trained on the many-to-one parallel TED-59 dataset, accompanying the EMNLP'23 paper "<a href="https://arxiv.org/abs/2305.14280">Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer</a>." The models can be used from the command line or through a script, like other fairseq models, but they require our code extension for rendered text with pixel representations. Each model zip file contains the fairseq model checkpoint, vocab files, language list file, and relevant sentencepiece model(s). We additionally package the TED-59 data here in raw extracted format for ease of comparison (original dataset release and paper by <a href="https://github.com/neulab/word-embeddings-for-nmt">Qi et al. 2018</a>). For more information, see our:<ul><li>Paper describing the method and training data [<a href="https://arxiv.org/abs/2305.14280">arXiv</a>]</li><li>Code repository with scripts [<a href="https://github.com/esalesky/visrep/tree/multi">GitHub</a>]</li></ul>
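
As a quick sanity check after downloading, the contents of a model zip file can be listed and grouped by role before pointing fairseq at them. The following is a minimal sketch using only the Python standard library. The suffix-to-role mapping and the archive name `pixel_model.zip` in the usage note are assumptions for illustration, not file names confirmed by this release, and the helper names are hypothetical.

```python
import zipfile
from collections import defaultdict

# ASSUMPTION: these suffixes are a guess at how the packaged files are named;
# the actual names inside the released archives may differ.
ROLE_BY_SUFFIX = {
    ".pt": "fairseq checkpoint",
    ".model": "sentencepiece model",
    ".txt": "vocab or language list",
}


def group_archive_members(names):
    """Group archive member names by their assumed role (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        if name.endswith("/"):
            # Skip directory entries; only files are of interest.
            continue
        role = "other"
        for suffix, label in ROLE_BY_SUFFIX.items():
            if name.endswith(suffix):
                role = label
                break
        groups[role].append(name)
    return dict(groups)


def inspect_model_zip(path):
    """Open a zip archive (path or file-like object) and group its members."""
    with zipfile.ZipFile(path) as archive:
        return group_archive_members(archive.namelist())
```

For example, `inspect_model_zip("pixel_model.zip")` (hypothetical file name) returns a dictionary from role to member paths. Anything that lands under "other" is worth checking by hand against the code repository's scripts.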