ACL 2025
Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts
Youcheng Huang, Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv
Abstract
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior work has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). This paper takes a novel approach by exploring the relationships between representations of concepts across different LLMs, drawing a parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) The representations of the same concept in different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across multiple concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLMs, whereby SVs extracted from smaller LLMs can effectively control behaviors of larger LLMs.¹

* Corresponding author.
¹ We will release our code at https://github.com/HamLaertes/Cross_Model_Trans.
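
To make the idea of aligning representations with a linear transformation concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes that the map is fit by ordinary least squares on paired hidden states and that a steering vector is a difference of class means; the abstract specifies neither choice. All names (`fit_linear_map`, `difference_of_means_sv`, the dimensions, the random "hidden states") are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a smaller "source" model and a larger "target" model.
d_src, d_tgt, n_samples = 16, 32, 200

# Synthetic stand-ins for hidden states of the same prompts in two models.
# Half of the prompts are shifted along a made-up "concept" direction.
labels = np.arange(n_samples) % 2 == 0
concept_dir = rng.normal(size=d_src)
H_src = rng.normal(size=(n_samples, d_src))
H_src[labels] += concept_dir

# Assumption for this toy setup only: the target states are a noisy
# linear image of the source states.
true_map = rng.normal(size=(d_src, d_tgt))
H_tgt = H_src @ true_map + 0.01 * rng.normal(size=(n_samples, d_tgt))


def fit_linear_map(src, tgt):
    """Least-squares W such that src @ W approximates tgt."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def difference_of_means_sv(pos, neg):
    """Steering vector as mean(concept states) - mean(other states)."""
    return pos.mean(axis=0) - neg.mean(axis=0)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


W = fit_linear_map(H_src, H_tgt)

# Extract an SV in the source space, map it into the target space, and
# compare it with the SV extracted directly from the target states.
sv_src = difference_of_means_sv(H_src[labels], H_src[~labels])
sv_tgt = difference_of_means_sv(H_tgt[labels], H_tgt[~labels])
sv_transferred = sv_src @ W

print("cosine(transferred SV, target SV):", cosine(sv_transferred, sv_tgt))
```

In this toy setting the transferred vector nearly coincides with the target model's own steering vector by construction, since the target states were generated linearly from the source states. Whether real LLM representations behave this way is the empirical question the paper investigates.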