ICLR 2026

Video Scene Segmentation with Genre and Duration Signals

Jungu Cho, Seong Jong Ha, Hae-Gon Jeon

Abstract

Video scene segmentation aims to detect semantically coherent boundaries in long-form videos, bridging the gap between low-level visual signals and high-level narrative understanding. However, existing methods rely primarily on visual similarity between adjacent shots, which makes it difficult to identify scene boundaries accurately, especially when semantic transitions do not align with visual changes. In this paper, we propose a novel approach that incorporates production-level metadata, specifically genre conventions and shot duration patterns, into video scene segmentation. Our main contributions are three-fold: (1) we leverage textual genre definitions as semantic priors to guide shot-level representation learning during self-supervised pretraining, enabling better capture of narrative coherence; (2) we introduce a duration-aware anchor selection strategy that prioritizes shorter shots based on empirical duration statistics, improving the quality of pseudo-boundary generation; (3) we propose a test-time shot splitting strategy that subdivides long shots into segments for improved temporal modeling. Experimental results demonstrate state-of-the-art performance on the MovieNet-SSeg and BBC datasets. We also introduce MovieChat-SSeg, which extends MovieChat-1K with manually annotated scene boundaries across 1,000 videos spanning movies, TV series, and documentaries.

* Corresponding author

Recent datasets for long-form video understanding, such as MovieChat-1K (Song et al., 2024) and TVQA (Lei et al., 2018), incorporate textual annotations including subtitles and dialogue to support multimodal reasoning tasks. However, many of these datasets lack explicit scene boundary annota-