ICLR 2026

Ice Cream Doesn’t Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding

Cited by 2

Abstract

Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for statistical analysis. Additionally, we validate the effectiveness of the judge used for grading by comparing its scores with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems. Our code is publicly available at CausalPitfalls.

Introduction

Causal inference (Pearl, 2009; Imbens & Rubin, 2015) is fundamental to decision-making across diverse fields. For instance, accurately determining the effectiveness and safety of a vaccine is pivotal in public health decisions (Voysey et al., 2021). However, identifying causal relationships with both reliability and interpretability remains challenging. In practice, individuals without formal statistical training frequently fall into subtle pitfalls, leading to plausible yet incorrect conclusions. A classic illustration is the erroneous conclusion that ice cream sales cause drowning incidents, which overlooks the hidden confounder: hot weather drives both (Pearl, 2009; Greenland & Robins, 1986; Rosenbaum, 1987). Given these complexities, automated tools like large language models (LLMs) present promising avenues, demonstrated by their effectiveness in scientific problem-solving (Lewkowycz et al., 2022;