CVPR2025

Event-Equalized Dense Video Captioning

Kangyi Wu, Pengna Li, Jingwen Fu, Yizhe Li, Yang Wu, Yuhan Liu, Jinjun Wang, Sanping Zhou

摘要

Dense video captioning aims to localize and caption all events in arbitrary untrimmed videos. Although previous methods have achieved appealing results, they still face the issue of temporal bias, i.e, models tend to focus more on events with certain temporal characteristics. Specifically, 1) the temporal distribution of events in training datasets is uneven. Models trained on these datasets will pay less attention to out-of-distribution events. 2) long-duration events have more frame features than short ones and will attract more attention. To address this, we argue that events, with varying temporal characteristics, should be treated equally when it comes to dense video captioning. Intuitively, different events tend to have distinct visual differences due to varied camera views, backgrounds, or subjects. Inspired by that, we intend to utilize visual features to have an approximate perception of possible events and pay equal attention to them. In this paper, we introduce a simple but effective framework, called Event-Equalized Dense Video Captioning (E 2 DVC) to overcome the temporal bias and treat all possible events equally. Experimental results on ActivityNet Captions and YouCook2 dataset validate the effectiveness of the proposed methods and show State-of-the-art (SOTA) performance on dense video captioning.