EMNLP 2025

NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed

Abstract

Figure 1: Our proposed framework enhances text data augmentation for low-resource local communities through a multi-stage pipeline. First, it (a) generates educational data using machine translation. Next, it (b) creates diverse, culturally aware texts, such as stories and conversations, by simulating scenarios with local personas through controlled synthetic data generation. Finally, it (c) enriches the model with local knowledge by retrieving and parsing culturally specific web content. This entire process enables controlled text generation and retrieval-augmented pre-training, ensuring the cultural and value alignment of large language models for Arabic dialects.
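The three stages in the Figure 1 caption can be pictured as a simple data flow. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Persona` fields, the prompt wording, and every function name are hypothetical, and the translation model, text generator, web search, and page parser are passed in as plain callables because the caption does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; the fields are illustrative assumptions."""
    name: str
    city: str
    occupation: str


def translate_educational(docs: Iterable[str],
                          translate: Callable[[str], str]) -> List[str]:
    """Stage (a): machine-translate educational texts into the target dialect."""
    return [translate(doc) for doc in docs]


def persona_prompt(persona: Persona, concept: str, genre: str) -> str:
    """Build a controlled prompt from a persona, a cultural concept, and a genre.

    The wording is an assumption made for illustration only.
    """
    return (
        f"Write a {genre} in the local dialect. "
        f"Main character: {persona.name}, a {persona.occupation} "
        f"from {persona.city}. "
        f"Cultural concept to include: {concept}."
    )


def generate_cultural_texts(personas: Iterable[Persona],
                            concepts: Iterable[str],
                            genres: Iterable[str],
                            generate: Callable[[str], str]) -> List[str]:
    """Stage (b): one generated text per (persona, concept, genre) combination."""
    concepts, genres = list(concepts), list(genres)
    return [
        generate(persona_prompt(p, c, g))
        for p in personas for c in concepts for g in genres
    ]


def retrieve_local_knowledge(queries: Iterable[str],
                             search: Callable[[str], Iterable[str]],
                             parse: Callable[[str], str]) -> List[str]:
    """Stage (c): retrieve pages per query, parse them, and drop empty results."""
    texts = []
    for query in queries:
        for raw_page in search(query):
            text = parse(raw_page).strip()
            if text:
                texts.append(text)
    return texts


def build_corpus(translated: List[str],
                 generated: List[str],
                 retrieved: List[str]) -> List[str]:
    """Concatenate the three stage outputs into one pre-training corpus."""
    return [*translated, *generated, *retrieved]
```

Passing the models and the retriever in as callables keeps the sketch independent of any particular translation system, LLM, or search backend. A caller would supply its own `translate`, `generate`, `search`, and `parse` functions and join the three result lists with `build_corpus`.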