EMNLP 2024

Verba volant, scripta volant? Don't worry! There are computational solutions for protoword reconstruction

Liviu P. Dinu, Ana Sabina Uban, Alina Maria Cristea, Ioan-Bogdan Iordache, Teodor-George Marchitan, Simona Georgescu, Laurentiu Zoicas

Abstract

We introduce a new database of cognate words and etymons for the five main Romance languages (Romanian, Italian, Spanish, Portuguese, French), the most comprehensive to date, with over 19,000 entries. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages by applying a series of machine learning models and features to these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.
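To make the evaluation setting concrete, the following minimal sketch shows one way accuracy over cognate sets could be computed. It assumes that accuracy means an exact string match between the predicted and the gold protoword, which the abstract does not spell out. The names `CognateSet`, `accuracy`, and `copy_italian` are hypothetical, and the two toy entries are illustrative textbook examples, not records taken from the database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class CognateSet:
    """One hypothetical entry: cognate forms keyed by language code, plus the gold etymon."""
    forms: Dict[str, str]  # e.g. {"ro": ..., "it": ..., "es": ..., "pt": ..., "fr": ...}
    etymon: str            # gold protoword (Latin etymon)


def accuracy(entries: Sequence[CognateSet],
             predict: Callable[[Dict[str, str]], str]) -> float:
    """Fraction of cognate sets whose predicted protoword equals the gold etymon exactly.

    Assumption: exact-match scoring; returns 0.0 for an empty input.
    """
    if not entries:
        return 0.0
    hits = sum(1 for entry in entries if predict(entry.forms) == entry.etymon)
    return hits / len(entries)


# Illustrative toy data only (not drawn from the paper's database).
toy = [
    CognateSet({"ro": "noapte", "it": "notte", "es": "noche",
                "pt": "noite", "fr": "nuit"}, "noctem"),
    CognateSet({"ro": "opt", "it": "otto", "es": "ocho",
                "pt": "oito", "fr": "huit"}, "octo"),
]


def copy_italian(forms: Dict[str, str]) -> str:
    """Placeholder baseline: return the Italian form unchanged (empty string if missing)."""
    return forms.get("it", "")


baseline_score = accuracy(toy, copy_italian)
```

Any model that maps a dictionary of cognate forms to a single string can be passed as `predict`, so the same scoring function would apply to each of the models compared in a benchmark of this kind.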