ACL 2025
LexGen: Domain-aware Multilingual Lexicon Generation
Ayush Maheshwari, Atul Kumar Singh, N. J. Karthika, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan
Abstract
Lexicon (or dictionary) generation across domains can have societal impact: it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpus-based approaches. However, these approaches do not address domain-specific lexicon generation, where the target vocabulary consists of domain-specific terminology. This task is particularly important in specialized domains such as medicine and engineering, because domain-specific terms are used very infrequently and data containing them is scarce, especially for low- and mid-resource languages. In this paper, we propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. Additionally, we perform a post-hoc human evaluation on unseen languages. The source code and dataset are available at https://github.com/Atulkmrsingh/lexgen.

* Authors contributing equally.
† Work done while pursuing PhD at IIT Bombay.
1 Throughout the paper, we use the terms 'lexicon' and 'dictionary' interchangeably.
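
The abstract describes domain-specific and domain-generic layers combined through a learnable routing technique but gives no architectural details. The following is a minimal illustrative sketch of one way such routing could work, not the authors' implementation. The class name `RoutedLayer`, the soft two-way gate, the plain linear layers, and the random initialization are all assumptions made for illustration; in a trained model the router weights would be learned jointly with the layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class RoutedLayer:
    """Hypothetical layer mixing a generic and a domain-specific transform."""

    def __init__(self, dim, domains, seed=0):
        rng = random.Random(seed)
        # One transform shared by every domain (assumption).
        self.generic = random_matrix(dim, dim, rng)
        # One transform per known domain (assumption).
        self.specific = {d: random_matrix(dim, dim, rng) for d in domains}
        # Router producing two scores: [generic, specific].
        self.router = random_matrix(2, dim, rng)

    def route(self, x):
        """Return mixing weights (generic, specific) that sum to 1."""
        return softmax(linear(self.router, x))

    def forward(self, x, domain):
        """Blend generic and domain-specific outputs using the router."""
        generic_out = linear(self.generic, x)
        specific_out = linear(self.specific[domain], x)
        w_generic, w_specific = self.route(x)
        return [w_generic * g + w_specific * s
                for g, s in zip(generic_out, specific_out)]


layer = RoutedLayer(dim=4, domains=["medical", "engineering"])
hidden = [0.5, -1.0, 0.25, 2.0]
weights = layer.route(hidden)
output = layer.forward(hidden, "medical")
```

In this sketch the router depends only on the hidden state, so the same gate could in principle be applied to an input from a domain without its own layer by falling back to the generic transform; whether the paper's model handles unseen domains this way is not stated in the abstract.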