WWW2026

LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Hao Liang, Qifeng Cai, Zhaoyang Han, Hejun Dong, Meiyi Qiang, Ruichuan An, Quanqing Xu, Bin Cui, Wentao Zhang

11 citations

Abstract

Long videos contain a vast amount of information, making videotext retrieval an essential and challenging task in multimodal learning and web-scale search. On today's Web, where users increasingly expect to locate not only relevant pages but also specific long videos or fine-grained clips, existing benchmarks fall short due to limited video duration, low-quality captions, and coarse annotation granularity. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machinegenerated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent fullvideo captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code link at https://lovrbench.github.io/ * Equal contribution. †Corresponding author.