ICLR 2025

ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, Han Li

Abstract

Attention modules are more 'truthful' than other modules in LLMs (e.g., FFN modules).
➢ Knowledge is mainly stored in the FFN modules of the transformer layers in pre-trained language models [1].
➢ Even if the self-attention module correctly focuses on the relevant token, the FFN module may still produce factuality hallucinations due to insufficient pre-training [2].
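To make the attention/FFN distinction concrete, the following is a minimal sketch of a single transformer layer in pure Python. It is an illustration under simplifying assumptions (one attention head, no layer normalization, no output projection, random weights), not the paper's implementation, and all function and parameter names are hypothetical. It shows that the attention update for a token is a weighted mix of vectors derived from the tokens in context, while the FFN update is computed per token from fixed learned weights, and that both are added to the same residual stream.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_update(X, Wq, Wk, Wv):
    """Single-head causal self-attention: one update vector per token,
    built only from value vectors of tokens at or before that position."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    updates = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        updates.append([sum(w * V[j][c] for j, w in enumerate(weights))
                        for c in range(d)])
    return updates

def ffn_update(H, W1, W2):
    """Position-wise FFN: each token is transformed independently
    by the same fixed weights (ReLU between two linear maps)."""
    out = []
    for h in H:
        hidden = [max(0.0, z) for z in matvec(W1, h)]
        out.append(matvec(W2, hidden))
    return out

def add(A, B):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def transformer_layer(X, params):
    """Return the layer output plus the separate attention and FFN updates."""
    attn = attention_update(X, params["Wq"], params["Wk"], params["Wv"])
    H = add(X, attn)                      # residual stream after attention
    ffn = ffn_update(H, params["W1"], params["W2"])
    return add(H, ffn), attn, ffn         # residual stream after FFN

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden, n_tokens = 4, 8, 3
params = {
    "Wq": random_matrix(d_model, d_model, rng),
    "Wk": random_matrix(d_model, d_model, rng),
    "Wv": random_matrix(d_model, d_model, rng),
    "W1": random_matrix(d_hidden, d_model, rng),
    "W2": random_matrix(d_model, d_hidden, rng),
}
X = random_matrix(n_tokens, d_model, rng)
out, attn, ffn = transformer_layer(X, params)
```

Because the two updates are returned separately, each module's contribution to the final representation can be inspected on its own, which is the kind of per-module view the claims above rely on.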