ICLR2025

Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix Jedidja Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans

5 citations

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states-such as subjective feelings or desires-and this could inform us about the moral status of these states. Importantly, such selfreports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P , would your output favor the short-or long-term option?" If a model M 1 can introspect, it should outperform a different model M 2 in predicting M 1's behavior-even if M 2 is trained on M 1's ground-truth behavior. The idea is that M 1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M 2 (even if M 2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M 1 outperforms M 2 in predicting itself, providing evidence for introspection. Notably, M 1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization. * denotes equal contribution.