ICML 2025

Steer LLM Latents for Hallucination Detection

Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li

Abstract

Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.

How can we shape the latent space of an LLM for hallucination detection? Instead of fine-tuning the LLM, which is computationally expensive and alters the model's parameters (Gekhman et al., 2024), we propose learning a lightweight vector, called the Truthfulness Separator Vector (TSV). As illustrated in Figure 1b, this learnable vector is introduced during inference