NeurIPS 2025

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Yiming Wang, Elisa Ricci, Paolo Rota

Abstract

advancing research in language-driven video understanding. Code and dataset are available at https://github.com/benedettaliberatori/convisbench.

Introduction

Humans can effortlessly compare pairs of videos, rapidly identifying similarities and differences by attending to a range of semantic aspects, such as the depicted activity, the people involved, or the environment. This intuitive comparative ability relies on a rich understanding of events that unfold across space and time. Cognitive science research confirms this: humans naturally perceive, encode, and retrieve events along key semantic concepts, selectively attending to particular attributes of experience [38, 43]. As a result, perceived similarity between two videos is not a fixed quantity, but it depends on which concepts are being prioritized. For example, two videos may appear highly similar in terms of the activity being performed, but diverge substantially in the setting or the agents involved (see Fig. 1, top). This observation motivates a shift from holistic similarity to a more structured, concept-aware notion of video comparison.

Computational approaches to video similarity have traditionally focused on global similarity metrics, typically learned by comparing spatio-temporal embeddings [18, 20, 21]. The emergence of Large Multimodal Models (LMMs) [3, 24, 25, 49, 52] with video understanding capabilities has opened new possibilities for using natural language to describe and reason about differences between videos. Prior work has explored this by generating natural language descriptions of video differences, either through domain-specific cooking concepts [32] or fine-grained, action-specific skill differences [4]. However, these approaches remain limited to narrow domains and are purely descriptive, lacking structured, quantitative assessments of similarity across semantic concepts.
As a result, comparative video understanding via language remains in its early stages, with existing benchmarks failing to capture the broad semantic diversity present in real-world scenarios.

To address this gap, we introduce a new task, Concept-based Video Similarity estimation (ConViS). Inspired by human cognition and grounded in semantic structure, ConViS aims to quantify how similar two videos are on specific concepts, e.g., the activity, the location, or the order of actions (see Fig. 1, bottom). ConViS enables concept-specific video understanding, supporting applications like targeted video retrieval (e.g., same activity with different subjects), anomaly detection based on particular factors (e.g., unusual object presence or action sequence), and fine-grained model evaluation by isolating the conceptual sources of failure (e.g., confusing similar-looking scenes with different actions).

Building on the definition of ConViS, we introduce a novel benchmark, ConViS-Bench, to support model evaluation and foster further research. ConViS-Bench consists of video pairs spanning a broad range of domains, each annotated by multiple human evaluators with similarity scores conditioned on multiple semantic concepts and accompanied by textual descriptions.

Alongside this dataset and task, we extensively benchmark several recent LMMs to assess their ability to predict concept-based video similarities. Our analysis of their agreement with human judgment reveals significant performance differences across LMMs on ConViS and shows that some concepts are harder than others for models to judge. For instance, while some models can reliably identify visual similarities, they consistently struggle with more abstract notions such as the temporal structure of events, an issue also noted in prior work [2, 23].
Lastly, we demonstrate the utility of concept-aware similarity in downstream tasks such as concept-conditioned video-to-video retrieval, showing how ConViS can enable nuanced and interpretable video analysis.

Overall, our contributions are threefold:

• We introduce the ConViS task, a new formulation of video similarity that moves beyond traditional global scoring and computes interpretable similarity scores across semantic concepts.
• We release ConViS-Bench, a new benchmark dataset with human-annotated similarity judgments across multiple semantic concepts and diverse video domains.
• We conduct an extensive evaluation of state-of-the-art (video- and image-based) models on ConViS-Bench, analyzing their current strengths and limitations in concept-aware video comparison.

Related Work

Our work is related to previous research on comparing pairs of images and videos using natural language. We also discuss previous studies aimed at assessing the capabilities of LMMs in several video understanding tasks.