ICLR 2026
Fluent Alignment with Disfluent Judges: Post-training for lower-resource languages
David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov
Abstract
We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and instruction-tuned language models capable of generating fluent synthetic data. To address this, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common alternatives: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial: our approach outperforms the alternatives without relying on any hard-to-obtain data.
Recent advances in reinforcement learning from AI feedback (RLAIF; Bai et al., 2022) offer a potential solution to this challenge. In on-policy reinforcement learning, the model learns from its own generated responses rather than from fixed datasets, so we can potentially avoid exposing it to disfluent text altogether. The key insight is that a model that has learned fluent generation through extensive pretraining on native texts can maintain this fluency as long as it is never trained on unnatural examples during the alignment phase.
1 Fluency refers to the linguistic quality of text that makes it natural, grammatical, and easy to read, as though written by a native speaker. It is independent of other qualities such as factual accuracy.