ICLR 2025
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding
Abstract
Text-Video Retrieval (TVR). The task of matching videos to a given text query, or texts to a given video query. Recent studies focus on full fine-tuning of CLIP for TVR.
Limitations. Existing methods introduce cumbersome modules to extract video features, and their slow inference speed severely limits real-world applications. Training is also costly: CLIP4Clip with CLIP-ViT-B/16 requires 70.1 GB of GPU memory and takes 6.5 hours to train.
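To make the retrieval task concrete, here is a minimal sketch of text-to-video ranking. It is not TempMe's method: it assumes each frame and each query has already been embedded into a shared space (for example by a CLIP-style encoder), uses mean pooling over frames as the simplest possible video representation, and runs on made-up three-dimensional toy vectors. The names `mean_pool`, `cosine`, and `rank_videos` are hypothetical helpers introduced only for this illustration.

```python
import math


def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(text_emb, videos):
    """Return video ids sorted by descending similarity to the text query."""
    scores = {
        video_id: cosine(text_emb, mean_pool(frames))
        for video_id, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumption): two videos, two frames each, 3-dim embeddings.
videos = {
    "cooking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "soccer": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
query = [1.0, 0.0, 0.0]  # stands in for an encoded text query

ranking = rank_videos(query, videos)
print(ranking)
```

Video-to-text retrieval is the same computation with the roles swapped: one video embedding is scored against every candidate text embedding. In this toy setup the per-video cost grows with the number of frames that must be encoded and pooled, which is one reason the inference speed noted above matters in practice.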