ACL 2025

PIPER: Benchmarking and Prompting Event Reasoning Boundary of LLMs via Debiasing-Distillation Enhanced Tuning

Zhicong Lu, Changyuan Tian, Peiguang Li, Li Jin, Sirui Wang, Wei Jia, Ying Shen, Guangluan Xu

Abstract

While Large Language Models (LLMs) excel in diverse domains, their validity in event reasoning remains underexplored. Most existing works stop at assessing LLMs' event reasoning with a single event relation type or reasoning format, so they neither evaluate the capability completely nor offer a practical way to enhance it. In this paper, we propose PIPER, the first comprehensive benchmark for Probing Into the Performance boundary of LLMs in Event Reasoning. Motivated by our evaluation observations and error-pattern analysis, we craft 10K diverse instruction-tuning demonstrations to alleviate the scarcity of event reasoning-oriented data. Additionally, we present a novel Debiasing and Distillation-Enhanced Supervised Fine-Tuning (D²E-SFT) strategy, which encourages the model to adhere to the context and to focus on significant contextual event information, thereby elevating its event reasoning capability. Specifically, D²E-SFT removes the context of a given sample to construct an imagined sample and subtracts that sample's logits, which mitigates the bias of neglecting context and improves contextual faithfulness. To guide the model in emphasizing significant contextual event information, D²E-SFT employs a context-refined sample to achieve self-distillation through the alignment of logits. Extensive experimental results demonstrate the effectiveness of our data and strategy in expanding the performance boundary of event reasoning.

* Equal contribution. † Corresponding author.

[Figure, panels (a) and (b): example event reasoning questions posed to an LLM. Causal: "Did Messi's leadership in the 2022 World Cup final lead to Argentina's victory over France?" Temporal: "Where did Messi begin his football career before making his first-team debut in 2004?" (answer shown: FC Barcelona's youth academy, La Masia). Counterfactual: "How would Messi's legacy be perceived if he hadn't won the 2022 World Cup? A. Unchanged B. Diminished C. Enhanced D. None." Other labels in the figure: CRI, CEI, QA (Yes / No), NLI, SCQ (A, B, C, D).]
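
The two ingredients the abstract attributes to the fine-tuning strategy, subtracting the logits of a context-free sample and aligning logits with a context-refined sample, can be illustrated with a toy objective. The following is only a minimal sketch under stated assumptions, not the authors' formulation: the abstract gives no equations, so the weights `alpha` and `beta`, the use of cross-entropy on the debiased logits, the KL divergence as the alignment measure, and every function name are hypothetical choices made for illustration. It operates on a single token position with plain Python lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def debiased_logits(full, no_context, alpha=0.5):
    """Subtract the logits of the context-free ("imagined") sample.

    alpha is an assumed weighting; the abstract only says the logits
    are subtracted.
    """
    return [f - alpha * n for f, n in zip(full, no_context)]


def cross_entropy(logits, target):
    """Negative log-likelihood of the gold token index."""
    return -log_softmax(logits)[target]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student), used here as an assumed alignment measure."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def d2e_sft_loss_sketch(full, no_context, refined, target, alpha=0.5, beta=1.0):
    """Hypothetical combined objective for one token position.

    full       : logits given question + full context
    no_context : logits given the question alone (imagined sample)
    refined    : logits given question + context-refined sample (teacher)
    """
    debias_term = cross_entropy(debiased_logits(full, no_context, alpha), target)
    distill_term = kl_divergence(refined, full)
    return debias_term + beta * distill_term


# Toy logits over a 3-token vocabulary; the values are made up.
full = [1.2, 0.3, -0.5]
no_context = [0.9, 0.8, -0.2]   # the prior favours token 0 even without context
refined = [2.0, -0.1, -0.9]     # the refined context sharpens the prediction
print(d2e_sft_loss_sketch(full, no_context, refined, target=0))
```

In a real training loop the teacher logits from the context-refined sample would typically be treated as constants (no gradient), and the loss would be averaged over all answer tokens; both are further assumptions not stated in the abstract.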