ACL 2023

Transformed Protoform Reconstruction

Young Min Kim, Kalvin Chang, Chenxuan Cui, David R. Mortensen

Abstract

Protoform reconstruction is the task of inferring how morphemes or words sounded in the ancestral language of a set of daughter languages. Meloni et al. (2021) achieved the state of the art on Latin protoform reconstruction using an RNN-based encoder-decoder model with attention. We update their model with the state-of-the-art seq2seq model, the Transformer. Our model outperforms theirs on a suite of metrics on two datasets: Meloni et al.'s Romance data of 8,000+ cognates (spanning 5 languages) and a Chinese dataset (Hóu, 2004) of 800+ cognates (spanning 39 varieties). We also probe our model for potential phylogenetic signal. Our code is publicly available.¹

* Equal contribution

¹ https://github.com/cmu-llab/acl-2023

² In fact, the proto-language from which Romance languages like Spanish and Italian are descended is not identical to Classical Latin but is, rather, a closely related and sparsely attested language sometimes called Proto-Romance or Vulgar Latin.
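To make the task setup concrete, the following is a minimal sketch of how a cognate set might be flattened into a single input sequence for a seq2seq model, and how a predicted protoform could be scored against the gold form. The language-tag format, the character-level tokenization, the function names, and the orthographic toy example are all illustrative assumptions rather than the authors' encoding or data. The abstract does not name its metrics, so edit distance is used here only as one common choice.

```python
from typing import Dict, List


def serialize_cognate_set(cognates: Dict[str, str]) -> List[str]:
    """Flatten a cognate set into one token sequence for a seq2seq encoder.

    Each daughter form is preceded by a language tag. The tag format is a
    hypothetical choice for illustration, not the paper's encoding.
    """
    tokens: List[str] = []
    for language, form in cognates.items():
        tokens.append(f"<{language}>")
        tokens.extend(form)  # one token per character
    return tokens


def edit_distance(predicted: str, gold: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    previous = list(range(len(gold) + 1))
    for i, p_char in enumerate(predicted, start=1):
        current = [i]
        for j, g_char in enumerate(gold, start=1):
            substitution = previous[j - 1] + (p_char != g_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Toy example in orthography (assumed for illustration; not taken from the dataset).
cognates = {"fr": "nuit", "it": "notte", "es": "noche", "pt": "noite", "ro": "noapte"}
source_tokens = serialize_cognate_set(cognates)

gold_protoform = "noctem"
predicted_protoform = "nocte"  # hypothetical model output
error = edit_distance(predicted_protoform, gold_protoform)
```

In this sketch a model would consume `source_tokens` and emit the protoform one character at a time; dividing `error` by the length of the gold form gives a length-normalized variant of the same score.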