CVPR 2025

Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment

Huakai Lai, Guoxin Xiong, Huayu Mai, Xiang Liu, Tianzhu Zhang

Abstract

More Quantitative Results

Results at higher noise ratios. We conduct experiments with noise rates above 50%, as shown in Tab. 1. Our method achieves an R@1 of 38.7 under 75% noise, which is higher than that of RVTR [4] under 50% noise, further demonstrating the robustness of our method.

Results under different batch sizes. The batch size can affect agent construction. We vary the batch size as shown in Tab. 2 and find that a small batch size degrades agent selection, while an adequate batch size ensures its reliability.

Potential of more noisy data. We conduct experiments from two aspects to demonstrate the potential of our method in leveraging more noisy data, as shown in Tab. 3. "Clean only" refers to training on only the clean 50% of the training set. The second row indicates that the remaining 50% of noisy data is added to the 50% clean data for training. The third row denotes further training with an additional 200K noisy data pairs from the WebVid dataset [1]. Two observations follow. First, compared with using only clean data, our method shows a more significant improvement when the 50% noisy pairs are added. Second, after incorporating the 200K noisy WebVid pairs scraped from the web, the performance gain of our method becomes even more pronounced. These results demonstrate the potential of our method to utilize noisy data.
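
For readers unfamiliar with the R@1 numbers quoted above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It is not taken from the paper: the function name `recall_at_k`, the toy similarity matrix, and the convention that the ground-truth video for query i sits at index i are all assumptions made for illustration, and the authors' evaluation code may differ (for example in how ties are handled).

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 1) -> float:
    """Text-to-video Recall@K, in percent.

    sim[i][j] is the similarity between text query i and video j.
    Assumption (illustrative): the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the ground truth = number of videos scoring strictly higher.
        # Ties are therefore resolved in favour of the ground truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy 3x3 similarity matrix (hypothetical values, not results from the paper).
toy_sim = [
    [0.9, 0.1, 0.2],  # query 0: ground truth ranked first
    [0.3, 0.8, 0.1],  # query 1: ground truth ranked first
    [0.7, 0.2, 0.4],  # query 2: video 0 outranks the ground truth
]

r1 = recall_at_k(toy_sim, k=1)
r2 = recall_at_k(toy_sim, k=2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

In this toy case two of the three queries retrieve their own video first, so R@1 is about 66.7, while every ground-truth video falls within the top two, so R@2 is 100.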