ICML 2025

Compositional Causal Reasoning Evaluation in Language Models

Jacqueline R. M. A. Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V. Nori, Javier González

Abstract

This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios mapped to the three rungs of Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning. We focused on common lab tests such as Hemoglobin A1c (HbA1c), creatinine, and vitamin D, and paired them with clinically relevant causal factors, including age, gender, obesity, and smoking. Two LLMs, GPT-o1 and Llama-3.2-8b-instruct, were tested, with responses rated by four medically trained human experts. GPT-o1 demonstrated superior discriminative performance (overall AUROC = 0.80 ± 0.12) compared to Llama-3.2-8b-instruct (0.73 ± 0.15), with higher scores on association (0.75 vs. 0.72), intervention (0.84 vs. 0.70), and counterfactual (0.84 vs. 0.69) questions. Sensitivity (0.90 vs. 0.84) and specificity (0.93 vs. 0.80) were also greater for GPT-o1. Reasoning ratings followed similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly "altered outcome" scenarios. The findings suggest that GPT-o1 offers more consistent causal reasoning, but further refinement is needed before high-stakes clinical deployment.
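For readers less familiar with the reported metrics, the following minimal sketch shows how AUROC, sensitivity, and specificity can be computed from binary ground-truth labels and model confidence scores. It is an illustration only: the abstract does not describe the scoring pipeline, so the function names (`auroc`, `sensitivity_specificity`), the 0.5 decision threshold, and the toy data are all assumptions and not the authors' method or results.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example,
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, preds):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data (NOT from the study): 1 = causal link present.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 for turning scores into yes/no answers.
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

sens, spec = sensitivity_specificity(toy_labels, toy_preds)
print(f"AUROC={auroc(toy_labels, toy_scores):.3f} "
      f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

AUROC is threshold-free, whereas sensitivity and specificity depend on the chosen cutoff, which is why the two kinds of figures are reported separately.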