CVPR 2024

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun

Cited by 8

Abstract

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several meticulously designed objectives that consider diversity, event-centricity, temporal ordering, and coherence. We further introduce a novel online boundary refinement strategy that iteratively improves the quality of pseudo boundaries during training. Comprehensive experiments examine the effectiveness of the proposed technical components. By leveraging a substantial amount of unlabeled video data, such as HowTo100M [16], we achieve a remarkable advancement on standard DVC datasets like YouCook2 [31] and ActivityNet [13]. We outperform the previous state-of-the-art Vid2Seq [27] across a majority of metrics, achieving this with just 0.4% of the unlabeled video data used for pretraining by Vid2Seq.