ICLR 2025

Controlling Space and Time with Diffusion Models

Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet

Abstract

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS) that supports generation with arbitrary camera trajectories and timestamps in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose + time), and video (time but no pose) data. This greatly improves generalization to unseen images and camera pose trajectories over prior work, which generally operates in limited domains (e.g., object-centric). 4DiM is the first NVS method with intuitive metric-scale camera pose control, enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models in both image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks, including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. See https://4d-diffusion.github.io for video samples.

* Equal contribution.

Unlike the typical setting of Neural Radiance Fields (NeRF; Mildenhall et al., 2021), where tens to hundreds of images are used as input for 3D reconstruction, pose-conditional diffusion models for NVS aim to extrapolate plausible, diverse, 3D-consistent samples from as few as a single input image. Conditioning diffusion models on an image and relative camera pose was introduced by Watson et al.

* 4DiM does not require sequential temporal ordering, as video models do, because the architecture is permutation-equivariant over frames. All N images (conditioning and generated) are processed by the diffusion model.

* For PNVS, we follow Yu et al. (2023a) and use a Markovian sliding window for sampling, as they find it is stronger than stochastic conditioning (Watson et al., 2022).
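The control flow of Markovian sliding-window sampling can be illustrated with a minimal Python sketch. It is not the 4DiM implementation. The function name `sliding_window_sample`, the parameters `n_total` and `n_cond`, and the `sample_fn` callback (a stand-in for one run of a pose-conditioned diffusion sampler) are all hypothetical, and the rule that the window keeps only the most recent `n_cond` frames is an assumption made for illustration.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image tensor (assumption for this sketch)
Pose = int   # stand-in for a camera pose and/or timestamp (assumption)


def sliding_window_sample(
    cond_frames: Sequence[Frame],
    target_poses: Sequence[Pose],
    sample_fn: Callable[[List[Frame], List[Pose]], List[Frame]],
    n_total: int = 8,
    n_cond: int = 2,
) -> List[Frame]:
    """Generate one frame per target pose, n_total - n_cond frames per step.

    Each step is conditioned only on the most recent n_cond frames,
    which is what makes the procedure Markovian.
    """
    if not 0 < n_cond < n_total:
        raise ValueError("need 0 < n_cond < n_total")
    if not cond_frames:
        raise ValueError("need at least one conditioning frame")

    n_gen = n_total - n_cond                 # frames generated per step
    window = list(cond_frames)[-n_cond:]     # current conditioning window
    outputs: List[Frame] = []

    for start in range(0, len(target_poses), n_gen):
        chunk = list(target_poses[start:start + n_gen])
        new_frames = sample_fn(window, chunk)  # hypothetical sampler call
        outputs.extend(new_frames)
        # Slide the window: drop older frames, keep the newest n_cond.
        window = (window + new_frames)[-n_cond:]

    return outputs


# Dummy sampler that records what each step was conditioned on.
calls = []


def dummy_sampler(cond: List[Frame], poses: List[Pose]) -> List[Frame]:
    calls.append((list(cond), list(poses)))
    return [f"gen{p}" for p in poses]


frames = sliding_window_sample(
    ["input"], list(range(5)), dummy_sampler, n_total=4, n_cond=2
)
```

With the dummy sampler, `calls` shows that the first step sees only the input image and every later step sees only the two most recently generated frames. Stochastic conditioning would instead draw conditioning frames at random from all available frames during sampling.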