ICLR 2026
LLMs Struggle to Balance Reasoning and World Knowledge in Causal Narrative Understanding
Khurram Yamin, Shantanu Gupta, Gaurav Rohit Ghosal, Zachary Chase Lipton, Bryan Wilder
Abstract
The ability to robustly identify causal relationships is essential for autonomous decision-making and adaptation to novel scenarios. However, accurately inferring causal structure requires integrating both world knowledge and abstract logical reasoning. In this work, we investigate the interaction between these two capabilities through the representative task of causal reasoning over narratives. Through controlled synthetic, semi-synthetic, and real-world experiments, we find that state-of-the-art large language models (LLMs) often rely on superficial heuristics, for example, inferring causality from event order or recalling memorized world knowledge without attending to context. Furthermore, we show that simple reformulations of the task can elicit more robust reasoning behavior. Our evaluation spans a range of causal structures, from linear chains to complex graphs involving colliders and forks. These findings uncover systematic patterns in how LLMs perform causal reasoning and lay the groundwork for developing methods that better align LLM behavior with principled causal inference.
RELATED WORKS
Causal Reasoning in Large Language Models
Jin et al. (2023) develop a benchmark for testing causal reasoning in LLMs given causal graphs, finding that language models can struggle with the task. However, the queries examined in Jin et al. (2023) require probability calculations, potentially conflating failures of causal reasoning with failures of arithmetic. Tan et al. (2022) show that a neural network trained on news data can label causal structures in individual sentences. Joshi et al. (2024b)