ACL 2024

TRAM: Benchmarking Temporal Reasoning for Large Language Models

Yuqing Wang, Yun Zhao

Abstract

Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope and lacks standardized benchmarks that would allow consistent evaluation across studies. In this paper, we introduce TRAM, a temporal reasoning (TeR) benchmark composed of ten datasets that cover various temporal aspects of events, such as order, arithmetic, frequency, and duration, and that is designed to facilitate a comprehensive evaluation of the TeR capabilities of large language models (LLMs). We evaluate popular LLMs such as GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish baselines with BERT-based and domain-specific models. Our findings indicate that the best-performing model lags significantly behind human performance. We hope that TRAM will spur further progress in enhancing the TeR capabilities of LLMs. Our data and code are available at https://github.com/EternityYW/TRAM-Benchmark.

Frequency (Commonsense)
Q: It is also a love story, between Ace and Tobio, a trans woman. How often do they break up?
A. Once B. Always C. Once per week

Ambiguity Resolution (Interpretation)
Q: A historic event is documented to have happened 'before you know it'. When did it take place?
A. The next day B. Without hesitation C. Before long

Temporal Causality (Cause)
Q: She noticed that all the wall clocks in the store were set to ten past ten. What's the more plausible CAUSE?
A. It is a common display setting for clocks and watches. B. It was ten minutes past ten at that moment.

Temporal Storytelling
Q: I woke up so late this morning. I was panicked when I saw what time it was. I had to be at work on time. I threw myself together quickly. Which of the two endings is the most plausible correct ending to the story?
A. I was able to get a job at a local restaurant. B. I was still thirty minutes late.

Arithmetic