ICLR 2026

AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching

Xize Cheng, Chenyuhao Wen, Slytherin Wang, Yongqi Wang, Zehan Wang, Rongjie Huang, Tao Jin, Zhou Zhao

Abstract

Video Query Sound Separation (VQSS) aims to isolate target sounds conditioned on visual queries while suppressing off-screen interference, a task central to audiovisual understanding. However, existing methods often fail under homogeneous interference and overlapping soundtracks, due to limited temporal modeling and weak audiovisual alignment. We propose AlignSep, the first generative VQSS model based on flow matching, designed to address common issues such as spectral holes and incomplete separation. To better capture cross-modal correspondence, we introduce a series of temporal consistency mechanisms that guide the vector field estimator toward learning robust audiovisual alignment, enabling accurate and resilient separation in complex scenes. As a multi-conditioned generation task, VQSS presents unique challenges that differ fundamentally from traditional flow matching setups. We provide an in-depth analysis of these differences and their implications for generative modeling. To systematically evaluate performance under realistic and difficult conditions, we further construct VGGSound-Hard, a challenging benchmark composed entirely of separation cases with homogeneous interference and strong reliance on temporal visual cues. Extensive experiments across multiple benchmarks demonstrate that AlignSep achieves state-of-the-art performance both quantitatively and perceptually, validating its practical value for real-world applications. More results and audio examples are available at: https://AlignSep.github.io.

* Equal Contribution

• We revisit the task of video-queried sound separation (VQSS) and provide a detailed analysis of its unique challenges, including homogeneous interference, overlapping soundtracks, and the need for precise audio-visual temporal alignment.
• We propose AlignSep, a novel generative, temporally aligned VQSS framework based on conditional flow matching, designed to robustly model multi-conditioned generation by leveraging temporal visual cues and preserving cross-modal consistency.
• We introduce VGGSound-Hard, a new benchmark specifically curated to evaluate temporal alignment under real-world homogeneous interference, consisting of co-occurring on-/off-screen same-category sound sources.
• Extensive experiments on three benchmarks (MUSIC-Clean, VGGSound-Clean, and VGGSound-Hard) demonstrate that AlignSep achieves state-of-the-art performance in both quantitative metrics and human perceptual scores (e.g., MOS), validating its effectiveness in real-world audiovisual separation tasks.
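
For readers unfamiliar with conditional flow matching, the following is a minimal generic sketch of the idea, not AlignSep's actual architecture or training recipe. It assumes a simple linear interpolation path between a noise sample and the target, and all names (cfm_training_pair, cfm_loss, euler_sample, the cond dictionary) are hypothetical. In a VQSS setting, cond would stand in for the mixture audio and the video features; here an oracle vector field replaces the learned estimator.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Build one flow-matching training pair on a linear path (an assumption).

    x0: noise samples, x1: target samples (e.g. target spectrograms),
    t:  per-example times in [0, 1], shape (batch,).
    Returns the interpolated point x_t and the regression target velocity.
    """
    # Reshape t so it broadcasts over all non-batch dimensions.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # velocity of the linear path, constant in t
    return x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(vector_field, x0, cond, n_steps=10):
    """Integrate dx/dt = vector_field(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * vector_field(x, t, cond)
    return x


# Toy usage: an oracle field that already knows the target velocity.
rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 4, 8))   # stand-in for x0
target = rng.standard_normal((2, 4, 8))  # stand-in for the separated audio
cond = {"start": noise, "target": target}  # hypothetical conditioning bundle


def oracle_field(x, t, cond):
    return cond["target"] - cond["start"]


x_t, v_target = cfm_training_pair(noise, target, np.array([0.25, 0.75]))
loss = cfm_loss(oracle_field(x_t, None, cond), v_target)
sample = euler_sample(oracle_field, noise, cond, n_steps=10)
```

In this toy setup the oracle field matches the regression target exactly, so the loss is zero and Euler integration carries the noise sample onto the target. A learned estimator would instead receive x_t, t, and the conditioning signals, and be trained to minimize the same loss.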