ACL 2024

Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf

Abstract

Interpretability research aims to bridge the gap between the empirical success of large language models (LLMs) and our scientific understanding of their inner workings. However, most existing research in this area has focused on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms rather than on individual ones, and traces how one of them becomes dominant in the final prediction. We uncover how and where the competition of mechanisms happens within LLMs using two interpretability methods, logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components, and reveal attention positions that effectively control the strength of certain mechanisms.

Figure 1: Top: An example showing that LLMs can fail to recognize the correct mechanism when multiple possible mechanisms exist. Bottom: Our mechanistic inspection of where and how the competition of mechanisms takes place within the LLMs.

Geva et al., 2023). However, rather than discovering what mechanisms exist in LLMs, we pose a more fundamental question: how do different mechanisms interact in the decision-making of LLMs? We show a motivating example in Figure 1, where the model fails to recognize the correct mechanism when it needs to judge between two possible mechanisms: whether to recall the factual knowledge on who developed the iPhone (i.e., Mechanism 1) or to follow its counterfactual redefinition in the new given context (i.e., Mechanism 2).

We propose a novel formulation of competition of mechanisms, which focuses on tracing each mechanism in the model and understanding how one of them becomes dominant in the final prediction by winning the "competition".
Specifically, we build our work on two single mechanisms that are well studied separately in the literature: (1) the factual knowledge recall mechanism, which can be located