NeurIPS 2023

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Cited 68 times

Abstract

Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer this question by studying a new task, real-world audio-visual scene synthesis, and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset. We recommend that readers visit our project page for convincing comparisons: https://liangsusan-git.github.io/project/avnerf/.

Introduction

We study a new task, real-world audio-visual scene synthesis, to generate target video and audio along novel camera trajectories from source audio-visual recordings of known trajectories. By learning from real-world source videos with binaural audio, we aim to generate target video frames and spatial audio that are visually and acoustically consistent with the given camera trajectory. This consistency ensures perceptual realism and immersion, enriching the overall user experience. As far as we know, attempts in the audio-visual learning literature [1-11] have yet to succeed in solving this challenging task.
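The coordinate transformation mentioned in the abstract, which expresses a view direction relative to the sound source, can be illustrated with a minimal sketch. This is not the authors' implementation: the 2D simplification, the function name `source_relative_pose`, and the angle convention are all assumptions made for illustration only.

```python
import math


def source_relative_pose(listener_xy, head_angle, source_xy):
    """Hypothetical 2D sketch: re-express a listener pose relative to a source.

    listener_xy, source_xy: (x, y) positions in world coordinates.
    head_angle: listener's head direction in radians, measured from the +x axis.
    Returns (distance, relative_angle), where relative_angle lies in [-pi, pi)
    and 0 means the listener is facing the source.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)

    # World-frame direction from the listener toward the source.
    angle_to_source = math.atan2(dy, dx)

    # Head direction measured relative to that direction, wrapped to [-pi, pi).
    relative_angle = (head_angle - angle_to_source + math.pi) % (2 * math.pi) - math.pi
    return distance, relative_angle


if __name__ == "__main__":
    # Listener at (2, 3) facing straight up (+y); source directly above at (2, 5).
    print(source_relative_pose((2.0, 3.0), math.pi / 2, (2.0, 5.0)))
```

Under these assumptions, two listener poses that differ only by a rigid motion of the whole scene map to the same source-relative pair, which is one intuition for why a source-centric representation may be easier to learn than raw world coordinates.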
Although there are similar works [12-15], these methods have constraints that limit their ability to solve this new task. Luo et al. [12] propose neural acoustic fields to model sound propagation in a room. Su et al. [13] propose representing audio scenes by disentangling the scene's geometry features. These methods are tailored to estimating room impulse response signals in a simulation environment; such signals are difficult to obtain in a real-world scene. Concurrent to our work, ViGAS, proposed by Chen et al. [15], learns to synthesize new sounds by inferring the audio-visual cues. However, ViGAS is limited to a few viewpoints for audio generation.

We introduce AV-NeRF, a novel NeRF-based method for synthesizing real-world audio-visual scenes. AV-NeRF generates videos and spatial audio along arbitrary camera trajectories, using source videos and camera poses as references. AV-NeRF consists of two branches: A-NeRF, which learns the acoustic fields of an environment, and V-NeRF, which models color and density fields. We represent a static audio field as a continuous function using A-NeRF, which takes the listener's position and head direction as input. A-NeRF effectively models the energy decay of sound as the sound travels from the source to the listener by correlating the listener's position with the

37th Conference on Neural Information Processing Systems (NeurIPS 2023).