ICCV 2023

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, Jinkyu Kim

41 citations

Abstract

[Figure 1 image: Frames 1 to 5 generated from the prompt "A photo of beautiful beach with blue sky", with panels labeled Single Sound, Magnitude Change ("Increase in Magnitude"), and Semantic Change ("Change in Semantics").]

Figure 1: The Power of Sound (TPoS) is a novel framework that generates audio-reactive video sequences. Built upon the Stable Diffusion model, our model first generates an initial frame from a user-provided text prompt (e.g., "a photo of a beautiful beach with a blue sky"), then reactively manipulates the style of the generated images according to the sound input (e.g., an audio sequence of a fireplace). Our model generates frames conditioned on the semantic information of the sound (see the 1st and 2nd rows, where images are manipulated by sound inputs such as fireplace or wave sounds), while realistically handling temporal visual changes driven by changes in the sound, e.g., increasing sound magnitude (see the second row) or a transition from wave → fireplace (see the last row). TPoS creates visually compelling and contextually relevant video sequences in an open domain.
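The caption's idea of frames that change more as the sound grows louder can be illustrated with a toy sketch. This is not the TPoS method, which the text above does not specify; it only assumes a magnitude-weighted linear interpolation between two conditioning vectors. The function names (`segment_rms`, `blend_weights`, `condition_per_frame`) and the two-dimensional "embeddings" are hypothetical and stand in for real text and audio encoders.

```python
import math

def segment_rms(samples, num_frames):
    """Split a mono waveform into num_frames equal chunks and return each chunk's RMS."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    chunk = len(samples) // num_frames
    if chunk == 0:
        raise ValueError("not enough samples for the requested number of frames")
    rms = []
    for i in range(num_frames):
        part = samples[i * chunk:(i + 1) * chunk]
        rms.append(math.sqrt(sum(x * x for x in part) / len(part)))
    return rms

def blend_weights(rms, max_weight=1.0):
    """Normalise RMS values to [0, max_weight]; louder segments get larger weights."""
    peak = max(rms)
    if peak == 0:
        return [0.0 for _ in rms]
    return [max_weight * r / peak for r in rms]

def condition_per_frame(text_emb, audio_emb, weights):
    """Assumed blending rule: interpolate from the text vector toward the audio vector."""
    return [
        [(1.0 - w) * t + w * a for t, a in zip(text_emb, audio_emb)]
        for w in weights
    ]

# Toy input: a 440 Hz tone whose amplitude ramps up over one second at 8 kHz.
sample_rate = 8000
samples = [
    (n / sample_rate) * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate)
]

weights = blend_weights(segment_rms(samples, num_frames=5))
# Placeholder 2-D vectors stand in for real text and audio embeddings.
frames = condition_per_frame([1.0, 0.0], [0.0, 1.0], weights)
```

In this sketch each of the five frames receives a conditioning vector that moves further from the text vector as its audio segment gets louder, which mirrors the "increase in magnitude" setting in the figure only in spirit.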