CVPR 2025
V2Dial: Unification of Video and Visual Dialog via Multimodal Experts
Adnen Abdessaied, Anna Rohrbach, Marcus Rohrbach, Andreas Bulling
Abstract
In addition to the proposed spatial-temporal contrastive learning (STC) and spatial-temporal matching (STM), we trained our model with the following established vision-language objectives.