ICLR 2025

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding

Abstract

Text-Video Retrieval (TVR) is the task of matching videos to a given query text, or vice versa. Recent studies focus on full fine-tuning of CLIP for TVR. Limitations: these methods introduce cumbersome modules to extract video features, and their slow inference speed severely limits real-world applications. Training is costly as well: the training process of CLIP4Clip with CLIP-ViT-B/16 requires 70.1 GB of GPU memory and takes 6.5 hours.
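To make the retrieval task concrete, here is a minimal sketch of matching in both directions (text to video and video to text). It is not the TempMe method. It assumes each text and each video frame has already been mapped to an embedding by some CLIP-like encoder, averages the frame embeddings into one video vector (one simple aggregation choice, assumed here for illustration), and ranks by cosine similarity. The function names and the toy 3-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def similarity_matrix(text_embs, video_embs):
    """Cosine similarity: rows index texts, columns index videos."""
    texts = [l2_normalize(t) for t in text_embs]
    videos = [l2_normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def retrieve(sim, direction="text_to_video"):
    """Return the index of the best match for every query."""
    if direction == "video_to_text":
        # Transpose so that each row corresponds to one video query.
        sim = [list(col) for col in zip(*sim)]
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy data (hypothetical): 2 videos with 2 frames each, 3-dim embeddings.
video_frames = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # video 0
    [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],  # video 1
]
text_embs = [
    [0.0, 1.0, 0.1],   # text 0, closest to video 1
    [1.0, 0.05, 0.0],  # text 1, closest to video 0
]

video_embs = [mean_pool(frames) for frames in video_frames]
sim = similarity_matrix(text_embs, video_embs)
t2v = retrieve(sim)                   # best video for each text
v2t = retrieve(sim, "video_to_text")  # best text for each video
print(t2v, v2t)
```

In this sketch the cost of scoring is negligible; the expense noted above comes from producing the embeddings, since every sampled frame contributes its own set of tokens to the video encoder.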