ACL 2025

What Language Do Non-English-Centric Large Language Models Think in?

Chengzhi Zhong, Qianying Liu, Fei Cheng, Junfeng Jiang, Zhen Wan, Chenhui Chu, Yugo Murawaki, Sadao Kurohashi

Abstract

In this study, we investigate whether non-English-centric large language models "think" in their specialized language. Specifically, we analyze how intermediate-layer representations, when projected into the vocabulary space, favor certain languages during generation; we term these latent languages. We categorize non-English-centric models into two groups: CPMs, which are English-centric models with continued pretraining on their specialized language, and BLMs, which are pretrained from scratch on a balanced mix of multiple languages. Our findings reveal that while English-centric models rely exclusively on English as their latent language, non-English-centric models activate multiple latent languages, dynamically selecting the most similar one based on both the source and target languages. This also influences responses to questions about cultural differences, reducing English-centric biases in non-English-centric models. This study deepens our understanding of language representation in non-English-centric LLMs, shedding light on the intricate dynamics of multilingual processing at the representational level. Our code is publicly available.
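As a rough illustration of what projecting intermediate-layer representations into the vocabulary space can look like, the sketch below multiplies per-layer hidden states by an unembedding matrix, applies a softmax, and sums the probability mass over the tokens of each language. It is a toy example under stated assumptions, not the authors' released code: the function name `latent_language_probs`, the random hidden states and unembedding matrix, and the per-token language tags are all hypothetical, and the final normalization layer that a real model would apply before unembedding is omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def latent_language_probs(hidden_states, unembed, token_lang):
    """Per-layer probability mass assigned to each language.

    hidden_states: array of shape (num_layers, d_model), one vector per layer
                   for the position being decoded (hypothetical input).
    unembed:       array of shape (d_model, vocab_size), the output projection.
    token_lang:    list of length vocab_size giving a language tag per token
                   (an assumption; real vocabularies need a labeling step).
    """
    # Project every layer's hidden state into the vocabulary space.
    probs = softmax(hidden_states @ unembed)  # (num_layers, vocab_size)
    tags = np.array(token_lang)
    # Sum the probability mass over the tokens belonging to each language.
    return {lang: probs[:, tags == lang].sum(axis=1) for lang in sorted(set(token_lang))}


# Toy data: 4 layers, 8-dimensional hidden states, 6-token vocabulary.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
W_U = rng.normal(size=(8, 6))
token_lang = ["en", "en", "ja", "ja", "zh", "zh"]

mass = latent_language_probs(hidden, W_U, token_lang)
for lang, per_layer in mass.items():
    print(lang, np.round(per_layer, 3))
```

In this sketch, the language with the largest mass at a given layer would play the role of that layer's latent language; with a real model, the hidden states would come from the model's intermediate layers rather than a random generator.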