CVPR 2025
Supervising Sound Localization by In-the-wild Egomotion
Anna Min, Ziyang Chen, Hang Zhao, Andrew Owens
Abstract
We present a method for learning binaural sound localization using egomotion as a supervisory signal. Over the course of a video, the direction from the camera to a sound source changes as the camera moves. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using traditional methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this method, we propose a dataset of real-world audio-visual videos with egomotion. We show that our model can successfully learn from real-world data and that it performs well on sound localization tasks.
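The consistency idea in the abstract can be illustrated with a minimal sketch. This is not the authors' loss, which the abstract does not specify. It assumes a static, distant sound source, camera motion that is a pure rotation about the vertical axis, and a sign convention in which turning the camera by a yaw angle shifts the source azimuth in the camera frame by the negative of that angle. The function names `wrap_angle` and `egomotion_consistency_loss` are hypothetical, and the binaural-cue term mentioned in the abstract is omitted.

```python
import numpy as np


def wrap_angle(a):
    """Wrap angles in radians to the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi


def egomotion_consistency_loss(az_t1, az_t2, cam_yaw):
    """Mean squared wrapped error between the predicted azimuth change
    and the change implied by the camera rotation.

    az_t1, az_t2: azimuths (radians) predicted from audio at two times.
    cam_yaw: camera yaw (radians) between the two times, estimated
             visually. Assumed convention: a static source appears to
             move by -cam_yaw in the camera frame.
    """
    az_t1, az_t2, cam_yaw = (np.asarray(x, dtype=float)
                             for x in (az_t1, az_t2, cam_yaw))
    expected_change = -cam_yaw  # assumption: static source, rotation only
    residual = wrap_angle((az_t2 - az_t1) - expected_change)
    return float(np.mean(residual ** 2))


# Predictions that agree with the camera rotation give a loss near zero.
consistent = egomotion_consistency_loss([0.5], [0.2], [0.3])
# Predictions that ignore the rotation are penalized.
inconsistent = egomotion_consistency_loss([0.0], [0.0], [0.5])
```

Wrapping the residual keeps the penalty small when the two azimuths straddle the ±π boundary, where a plain difference would report an error of nearly a full turn.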