EMNLP 2025

Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Oscar Sainz, Naiara Pérez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa

Abstract

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and on human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as backbone outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.

User: What's the difference between primary vs secondary market research?

A: Primary market research involves collecting original data directly from the source, such as through surveys, focus groups, or experiments. This type of research is typically more expensive and time-consuming but provides unique and specific insights. Secondary market research, on the other hand, involves analyzing existing data that has already been collected by someone else, such as academic studies, industry reports, or government statistics. This type of research is often less expensive and faster to conduct but may not provide the same level of specificity or relevance as primary research.
Here are some key differences between primary and secondary market research:

Primary Market Research
* Collects original data directly from the source
* Can be tailored to specific research questions or objectives
* Provides unique and specific insights
* Typically more expensive and time-consuming
* Examples: surveys, focus groups, experiments, customer interviews

Secondary Market Research
* Analyzes existing data collected by someone else
* Often less expensive and faster to conduct
* May not provide the same level of specificity or relevance
* Examples: academic studies, industry reports, government statistics, online reviews, social media analytics

When deciding between primary and secondary market research, consider the following factors:

User: Zein da merkatu ikerketa primarioaren eta bigarren mailakoaren arteko aldea? [English: What is the difference between primary and secondary market research?]