KDD 2025

Causal Discovery through Synergizing Large Language Model and Data-Driven Reasoning

Huaming Du, Yujia Zheng, Baoyu Jing, Yu Zhao, Gang Kou, Guisong Liu, Tao Gu, Weimin Li, Carl Yang


Abstract

Revealing the underlying causal mechanisms in the real world is critical for scientific and technical progress. Despite advances over the past decades, two obstacles have long limited the broader application of causal discovery: the lack of high-quality data and the inability of traditional causal discovery algorithms (TCDA) to comprehend the semantics of variables. To address these issues, this paper proposes LLM-CD, a novel causal modeling framework that integrates the metadata-based reasoning capabilities of large language models (LLMs) with the data-driven modeling abilities of TCDA. LLM-CD deeply couples the reasoning abilities of LLMs with the various stages of TCDA and enhances causal discovery through an iterative process. Because LLMs are prone to overconfidence and hallucination, LLM-CD quantifies and analyzes its uncertainty by combining evidence-based deep learning theory with the assumptions of TCDA. We evaluate LLM-CD comprehensively on a large-scale de-identified real patient dataset provided by a hospital, a new dataset extracted from MIMIC-IV for the same disease (lung cancer), and two benchmark datasets. Extensive experimental results confirm the effectiveness and reliability of LLM-CD, with improvements of up to 403.93% in Recall and 25.77% in the Ratio metric across the four datasets.