NeurIPS 2023

A Theory of Unsupervised Translation Motivated by Understanding Animal Communication

Shafi Goldwasser, David F. Gruber, Adam Tauman Kalai, Orr Paradise

Abstract

Neural networks are capable of translating between languages, in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.

Recent interest in translating animal communication [2, 3, 9] has been motivated by breakthrough performance of Language Models (LMs). Empirical work has succeeded in unsupervised translation between human-language pairs such as English-French [23, 5] and programming languages such as Python-Java [33]. Key to this feasibility seems to be the fact that language statistics, captured by an LM (a probability distribution over text), encapsulate more than just grammar. For example, even though both are grammatically correct, "The calf nursed from its mother" is more than 1,000 times more likely than "The calf nursed from its father".² Given this remarkable progress, it is natural to ask whether it is possible to collect and analyze animal communication data, aiming towards translating animal communication to a human language description.
This is particularly interesting when the source language may be that of highly social and intelligent animals, such as whales, and the target language is a human language, such as English.

Challenges. The first and most basic challenge is understanding the goal, a question with a rich history of philosophical debate [38]. To define the goal, we consider a hypothetical ground-truth translator. As a thought experiment, consider a "mermaid" fluent in English and the source language

* Authors listed alphabetically.
² Probabilities computed using the GPT-3 API https://openai.com/api/ text-davinci-02 model.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).