ICLR2026

REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering

Li-Ming Zhan, Bo LIU, Yujie Feng, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu

被引用 1 次

摘要

Inference-time steering aims to alter an LLM’s responses without changing its parameters. A key challenge lies in selecting internal modules that most strongly govern the target behavior; existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. In this work, we introduce , a novel framework for identifying behavior-relevant modules (heads or layers) in Transformers. For each module, we train a vector-quantized autoencoder (VQ-AE) on its hidden activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces via a shared, learnable codebook. We quantify each module’s behavioral relevance by evaluating how effectively the VQ-AE encodings distinguish between behavior-aligned and behavior-violating responses using a binary classification metric. This relevance score informs both module selection and steering strength. We evaluate across eight LLMs from two model families (Llama and Qwen) and nine datasets spanning truthfulness enhancement, open-domain question answering under knowledge conflicts, and general alignment tasks. enables more effective inference-time interventions, yielding significant improvements on these steering tasks. Notably, it achieves an average relative improvement of 20% (up to 81.5%) over the seminal ITI method on truthfulness steering. Moreover, the modules selected by our method exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios. We provide the source code to reproduce all experimental results at https://github.com/liam0949/REAL_ICLR.