ACL 2025

Programming by Example meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Atharva Naik, Darsh Agrawal, Hong Sng, Clayton Marr, Kexun Zhang, Nathaniel Romney Robinson, Kalvin Chang, Rebecca Byrnes, Aravind Mysore, Carolyn P. Rosé, David R. Mortensen

Cited by 1

Abstract

Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called SOUND LAWS). However, writing these programs is error-prone, motivating the development of automated SOUND LAW INDUCTION (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs). While LLMs have been effective for code generation, recent work has shown that PBE with LLMs is challenging but can be improved by fine-tuning, especially when the training data is drawn from the same distribution as the evaluation data. In this paper, we create a conceptual framework for what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods, with varying amounts of inductive bias, to investigate what leads to the best performance. Based on the results, we create a state-of-the-art (SOTA) open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the next-best open-source LLM).
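To make the "sound laws as programs" framing concrete, here is a minimal Python sketch that treats a sound law as a regex rewrite and a program as an ordered list of such rewrites, checked against input/output examples in the PBE sense. The rule encoding, the function names (`apply_sound_laws`, `passes`), and the toy word pairs are all invented for illustration; they are assumptions, not the paper's representation, data, or evaluation code.

```python
import re
from typing import List, Tuple

# Assumption: a sound law is modeled as a (regex pattern, replacement) pair.
# This encoding is hypothetical and chosen only for illustration.
Rule = Tuple[str, str]


def apply_sound_laws(word: str, laws: List[Rule]) -> str:
    """Apply each rewrite rule in order; later rules see earlier rules' output."""
    for pattern, replacement in laws:
        word = re.sub(pattern, replacement, word)
    return word


def passes(examples: List[Tuple[str, str]], laws: List[Rule]) -> bool:
    """PBE-style check: the program must map every ancestor form to its descendant."""
    return all(apply_sound_laws(src, laws) == tgt for src, tgt in examples)


# Toy, invented rules (not from any real language family).
TOY_LAWS: List[Rule] = [
    (r"p(?=[aeiou])", "f"),              # p > f before a vowel
    (r"(?<=[aeiou])t(?=[aeiou])", "d"),  # t > d between vowels
    (r"a$", ""),                         # word-final a is lost
]

# Toy, invented (ancestor, descendant) pairs.
TOY_EXAMPLES = [("pata", "fad"), ("tapa", "taf"), ("kita", "kid")]

if __name__ == "__main__":
    print(passes(TOY_EXAMPLES, TOY_LAWS))                        # True
    print(apply_sound_laws("pata", list(reversed(TOY_LAWS))))    # fat
```

Reversing the rule order turns "pata" into "fat" rather than "fad", because the final vowel is deleted before intervocalic voicing can apply. This is why the ordering of the rewrite functions is part of the program that has to be induced.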