ACL 2023

Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models

Qingyu Tan, Hwee Tou Ng, Lidong Bing

Cited by 24

Abstract

Reasoning about time is of fundamental importance. Many facts are time-dependent. For example, athletes change teams from time to time, and different government officials are elected periodically. Previous time-dependent question answering (QA) datasets tend to be biased in either their coverage of time spans or question types. In this paper, we introduce a comprehensive probing dataset, TEMPREASON, to evaluate the temporal reasoning capability of large language models. Our dataset includes questions of three temporal reasoning levels. We also propose a novel learning framework to improve the temporal reasoning capability of large language models, based on temporal span extraction and time-sensitive reinforcement learning. We conducted experiments in closed book QA, open book QA, and reasoning QA settings and demonstrated the effectiveness of our approach. Our code and data are released at https://github.com/DAMO-NLP-SG/TempReason.

Qingyu Tan is under the Joint PhD Program between Alibaba and NUS.