EMNLP 2025

Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes

Mingyang Wang, Lukas Lange, Heike Adel, Yunpu Ma, Jannik Strötgen, Hinrich Schütze

Abstract

Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model's internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.

Question: How many of the first one hundred positive integers are divisible by 3, 4, and 5?

Thinking: Hmm, let me think about the question. Well, if a number is divisible by multiple numbers, it's called a common multiple. So maybe I should start by finding the least common multiple (LCM) of 3, 4, and 5. ... So, the LCM of 3, 4, and 5 is 60. [mixed-language span; non-Latin characters lost in extraction] 1 ... 100 ... 60, 60 × 2 = 120 ... 100 ... So yes, only 60. Therefore, the answer is 1.

Answer: There is only one number, 60, that is divisible by 3, 4, and 5 in the first 100 positive integers.

[Figure 6 plots: accuracy (y-axis) for DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, and DeepSeek-R1-Distill-Qwen-14B; top row: Arabic, Hindi, Japanese; bottom row: English, French, Chinese; legend: No control, Input script, Latin script, Han script, Latin+Han scripts.]

Figure 6: Accuracy on the K&K dataset under script-controlled reasoning. Latin or Han script control boosts performance for Arabic, Hindi, and Japanese, while native scripts yield the best results for English, French, and Chinese, highlighting the impact of script choice on reasoning efficacy.
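To make the notions of script composition and script-controlled decoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a character's script can be approximated from its Unicode character name, and that script control amounts to masking the logits of tokens containing letters outside an allowed set of scripts. The helper names (`char_script`, `script_composition`, `allowed_token_mask`, `mask_logits`) and the toy vocabulary are hypothetical.

```python
import unicodedata
from collections import Counter


def char_script(ch: str) -> str:
    """Coarse script label derived from the Unicode character name (heuristic)."""
    if not ch.isalpha():
        return "Other"  # digits, punctuation, whitespace, combining marks
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    # First word of the name, e.g. "LATIN", "ARABIC", "DEVANAGARI", "HIRAGANA".
    return name.split(" ")[0].capitalize() if name else "Unknown"


def script_composition(text: str) -> dict:
    """Fraction of letters per script in a reasoning trace ("Other" is ignored)."""
    counts = Counter(char_script(c) for c in text)
    counts.pop("Other", None)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}


def allowed_token_mask(vocab: list, allowed: set) -> list:
    """True for tokens whose letters all belong to the allowed scripts."""
    ok = allowed | {"Other"}
    return [all(char_script(c) in ok for c in tok) for tok in vocab]


def mask_logits(logits: list, mask: list) -> list:
    """Set disallowed tokens to -inf so they can never be sampled."""
    return [x if keep else float("-inf") for x, keep in zip(logits, mask)]


# Toy example (hypothetical vocabulary, assumed to hold decoded token strings).
vocab = ["the", "数", "عدد", " 60", "né"]
mask = allowed_token_mask(vocab, {"Latin", "Han"})
print(script_composition("LCM 是 60"))
print(mask)
print(mask_logits([0.1, 0.2, 0.3, 0.4, 0.5], mask))
```

In a real setting, the mask would be computed once over the tokenizer vocabulary and applied at every decoding step. Byte-level tokenizers would first need their tokens decoded to text, which this sketch does not handle.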