ICLR 2025

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

Asmar Nadeem, Faegheh Sardari, Robert Dawes, Syed Sameed Husain, Adrian Hilton, Armin Mustafa

Abstract

Existing video captioning benchmarks and models lack causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark dataset generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.
Figure 1: Comparison of Original captions vs. Causal-Temporal Narrative (CTN) caption to illustrate the inclusion of causal-temporal narrative. Panel labels: Input Video, LLM Prompt, LLM Evaluation, Original Captions, CTN Caption.
Original Captions: 'a car crashes and guys play beer pong', 'a car driving through an open field kicking up dirt', 'a car flipping over', 'a car get wracked', 'a car is being flipped over', 'a dirt vehicle riding and rolling', 'a dune buggy flipping over', 'a four wheeler wrecking', 'a monster truck flips on its side then several young men shout while playing beer pong', 'a person drives an offroad car around a field', 'a person flipping a go kart while a crowd cheers', 'a race truck is crashing', 'a truck rolls over itself and boys cheer on a friend', 'a truck tumbles over on itself', 'a tumbler crashes on a dirt road and then a group of guys play beer pong', 'a vehicle flips over', 'a type of monster truck crashes and men are shown celebrating', 'an off road vehicle crashing', 'crashing of a car while driving', 'footage from a monster truck style event followed by a frat party'
CTN Caption: Cause: 'a car drove recklessly through an open field flipping over' Effect: 'the car was severely damaged and a group of guys started playing beer pong'

sequence of events that is crucial for understanding the causal-temporal narrative. Figure 1 highlights the limitations of the original captions in existing benchmark datasets, such as MSR-VTT Xu et al. (2016). The original captions focus on isolated events or actions, such as "a car flipping over" or "a monster truck flips on its side," lacking contextual narrative and causal-temporal relationships between events. Consequently, models trained on these captions suffer from the same limitations. To bridge this gap, we introduce NarrativeBridge, a novel framework encompassing a new benchmark dataset and architecture tailored for causal-temporal narrative learning in video captioning.
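The contrast drawn in Figure 1 can be made concrete with a small sketch. The record type `CTNCaption` and its `as_text` method are hypothetical names introduced here for illustration only; they are not the released dataset schema. The caption strings are copied from Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CTNCaption:
    """Hypothetical record for one CTN caption (illustrative, not the dataset schema)."""
    cause: str
    effect: str

    def as_text(self) -> str:
        # Render in the "Cause: ... Effect: ..." layout shown in Figure 1.
        return f"Cause: {self.cause} Effect: {self.effect}"


# Three of the original captions from Figure 1: independent descriptions
# of the same clip, with no explicit link between the events.
original_captions = [
    "a car flipping over",
    "a monster truck flips on its side then several young men shout while playing beer pong",
    "a vehicle flips over",
]

# The CTN caption from Figure 1: one cause statement tied to one effect statement.
ctn = CTNCaption(
    cause="a car drove recklessly through an open field flipping over",
    effect="the car was severely damaged and a group of guys started playing beer pong",
)

print(ctn.as_text())
```

Keeping cause and effect as separate fields, rather than one free-form string, mirrors the way the abstract describes CEN as using separate encoders for cause and effect dynamics; how the actual dataset stores the two parts is not specified in this section.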
Our Causal-Temporal Narrative (CTN) captions benchmark dataset leverages a large language model (LLM) and few-shot prompting to generate enhanced video descriptions that explicitly encode causal and temporal sequences, as shown in Figure 1. The CTN caption in that example establishes a clear connection between the cause (reckless driving) and the effect (the damaged car and the subsequent behavior of the group). Our CTN captions benchmark dataset enables models to better understand and articulate the causality, sequence, and significance of events within the broader video context Wilkens et al. (2003). To ensure the quality and relevance of the generated captions, we employ an automatic evaluation framework that compares the CTN captions with the video content, keeping or discarding them based on a score threshold. Additionally, we conduct a human evaluation study that further validates the high quality of our CTN captions, demonstrating their accuracy, temporal coherence, and relevance. This CTN captions benchmark dataset addresses the limitations of existing benchmark datasets and emphasizes the importance of incorporating causal-temporal narrative understanding into video captioning models to generate accurate, informative, and contextually relevant descriptions.
Existing state-of-the-art (SOTA) video captioning methods that use different architectures, such as LSTM Nadeem et al. (2023), GNN Hendria et al. (2023), and Transformer Wang et al. (2022), struggle to effectively learn causal-temporal narrative from the CTN captions. These architectures are designed to capture the overall semantics in videos but lack dedicated mechanisms to explicitly model the cause-effect r