AAAI2025

HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

2 citations

Abstract

Translating videos into natural language sentences has drawn much attention recently. The framework of combining visual attention with Long Short-Term Memory (LSTM) based text decoder has achieved much progress. However, the visionlanguage translation still remains unsolved due to the semantic gap and misalignment between video content and described semantic concept. In this paper, we propose a Hierarchical Memory Model (HMM) -a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way. These memories can guide attention for efficient video representation extraction and semantic attribute selection in addition to modelling the long-term dependency for video sequence and sentences, respectively. Compared with traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To prove the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets: MSVD and MSR-VTT. Experiments show that our model not only can discover appropriate video representation and semantic attributes but also can achieve comparable or superior performances than state-ofthe-art methods on these datasets.