CVPR 2025

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Aleksei Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai

Abstract

Figure 1. SceneFactor factors the complex task of text-guided 3D scene generation into two stages: forming a coarse semantic structure, followed by refined geometric synthesis. Rather than requiring a learned model to decide the location, type, size, and local geometry of scene elements directly, we first generate a coarse semantic box layout, which reduces the remaining learning problem to the simpler task of layout-guided geometric synthesis. To achieve this factorized generation, we train semantic and geometric latent diffusion models. Crucially, the proxy semantic map enables user-friendly localized editing of generated scenes: users edit the semantic map with simple box operations (by clicking two box corners), without requiring re-synthesis of the full scene. Note that input text is colored by semantic category for visualization purposes only.
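The box-based editing described in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the semantic map is a dense voxel grid of integer labels, and the names `make_grid`, `box_bounds`, `add_box`, and the label ids are hypothetical. It shows how two clicked corners define a box that is written into the semantic map, and how the resulting mask marks the only region whose geometry would need to be re-synthesized.

```python
from typing import List, Tuple

Grid = List[List[List[int]]]
Corner = Tuple[int, int, int]

# Hypothetical label ids, chosen only for this illustration.
EMPTY, BED, CHAIR = 0, 1, 2


def make_grid(nx: int, ny: int, nz: int, fill: int = EMPTY) -> Grid:
    """Create an nx * ny * nz voxel grid filled with one value."""
    return [[[fill for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]


def box_bounds(a: Corner, b: Corner, shape: Corner):
    """Turn two corners (clicked in any order) into per-axis bounds.

    Each bound is (lo, hi) with lo inclusive and hi exclusive,
    clipped to the grid extent.
    """
    bounds = []
    for pair, size in zip(zip(a, b), shape):
        lo = max(0, min(pair))
        hi = min(size, max(pair) + 1)
        bounds.append((lo, hi))
    return bounds


def add_box(sem: Grid, a: Corner, b: Corner, label: int):
    """Return an edited copy of the semantic map and a 0/1 edit mask.

    The mask marks the voxels covered by the box; in a factored pipeline
    only this region would be passed to geometric re-synthesis
    (an assumption of this sketch).
    """
    shape = (len(sem), len(sem[0]), len(sem[0][0]))
    (x0, x1), (y0, y1), (z0, z1) = box_bounds(a, b, shape)
    edited = [[row[:] for row in plane] for plane in sem]  # copy, keep input intact
    mask = make_grid(*shape, fill=0)
    for x in range(x0, x1):
        for y in range(y0, y1):
            for z in range(z0, z1):
                edited[x][y][z] = label
                mask[x][y][z] = 1
    return edited, mask
```

For example, `add_box(make_grid(4, 4, 2), (2, 3, 0), (1, 1, 1), CHAIR)` labels a 2 x 3 x 2 block of voxels as a chair and leaves every voxel outside the mask unchanged, which mirrors the idea that the rest of the scene does not need to be regenerated.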