NeurIPS 2025

MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai

Abstract

2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework's robustness and wide generalization.

Introduction

Object compositing in 3D scenes remains a formidable challenge due to the interplay of color harmonization, shadow synthesis, light transport simulation, and multi-view consistency, all of which must be addressed to achieve photorealistic integration. This capability is fundamental to AR, robotics, and interactive media, where realism directly impacts user immersion and perception.
Early object compositing research focuses primarily on isolated subtasks such as scene relighting [51, 17], shadow generation [24, 25], and color harmonization [6, 12], yielding promising yet fragmented solutions. However, the transition toward unified frameworks reveals intricate couplings between these components, necessitating adherence to the physical principles governing light transport and occlusion. Diffusion-based pipelines such as ObjectStitch [36] and ControlCom [50] attempt single-image object insertion by synthesizing harmonious lighting and shadows within a background bounding box, but their reliance on stochastic sampling and the lack of large-scale, high-quality compositing datasets limit their robustness and generalization in real-world scenarios.
In this work, we tackle the problem of seamlessly inserting novel objects into static 3D scenes captured from multiple viewpoints. Our goal is to relight each object so that its appearance, including ambient illumination, surface reflections, and cast shadows, matches the lighting of the scene, while also modeling the reciprocal effects of the object on its surroundings (e.g., secondary shadows and interreflections).
We introduce MV-CoLight, a unified framework that preserves both geometric fidelity and photorealism across views by learning and enforcing lighting-consistent priors at both the image and scene levels. MV-CoLight adopts a two-stage training pipeline. In the 2D object compositing stage, we train a feed-forward model to capture scene-specific lighting characteristics, including background shadows and indirect illumination, from individual images. In the 3D object compositing stage, we transform these learned features into a 3D Gaussian representation using 3D Gaussian splatting [19], ordering them via a Hilbert curve to ensure spatial coherence and enforce multi-view consistency. Leveraging recent advances in video-level instance segmentation and 3D-aware object insertion, our framework effectively eliminates common 2D mask artifacts while achieving efficient inference (0.07s per frame) without compromising stability or visual quality.
To support training and evaluation, we introduce a large-scale synthetic dataset of over 480k composite scenes rendered in Blender. Each scene features a table from the Digital Twin Catalog [8], augmented with Poly Haven HDR environment maps and materials [29], and additional light sources for varied illumination. We render 16 uniformly sampled RGB views per scene, along with depth maps and segmentation masks. To simulate realistic compositing challenges, we mix foreground and background layers under different lighting conditions, creating deliberate lighting inconsistencies for training and evaluation. Further implementation details are provided in the supplementary material.
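To make the Hilbert-curve ordering of the 3D compositing stage concrete, the following is a minimal sketch of how a set of 3D Gaussian centers could be serialized so that neighbors in the sequence are also neighbors in space. It is an illustration under stated assumptions, not the implementation used in this work: the function names `hilbert_index_3d` and `hilbert_order`, the min-max quantization of centers onto a 2^bits grid, and the choice of Skilling's transpose algorithm for the curve are all assumptions made here for the example.

```python
def hilbert_index_3d(x, y, z, bits):
    """Hilbert-curve index of an integer cell (x, y, z) on a grid with
    2**bits cells per axis, using Skilling's transpose algorithm.
    (Hypothetical helper; the curve variant is an assumption.)"""
    X = [x, y, z]
    n = 3
    M = 1 << (bits - 1)

    # Step 1: undo the per-level rotations/reflections of the curve.
    Q = M
    while Q > 1:
        P = Q - 1
        for i in range(n):
            if X[i] & Q:
                X[0] ^= P                      # invert the low bits of X[0]
            else:
                t = (X[0] ^ X[i]) & P          # swap low bits of X[0] and X[i]
                X[0] ^= t
                X[i] ^= t
        Q >>= 1

    # Step 2: Gray-encode across the axes.
    for i in range(1, n):
        X[i] ^= X[i - 1]
    t = 0
    Q = M
    while Q > 1:
        if X[n - 1] & Q:
            t ^= Q - 1
        Q >>= 1
    for i in range(n):
        X[i] ^= t

    # Step 3: interleave the bits (most significant first) into one integer.
    h = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            h = (h << 1) | ((X[i] >> b) & 1)
    return h


def hilbert_order(centers, bits=10):
    """Return indices that sort 3D points along a Hilbert curve.
    `centers` is a sequence of (x, y, z) floats; points are quantized
    onto the grid by min-max normalization (an assumption of this sketch)."""
    lo = [min(c[a] for c in centers) for a in range(3)]
    hi = [max(c[a] for c in centers) for a in range(3)]
    scale = (1 << bits) - 1

    def key(c):
        cell = []
        for a in range(3):
            span = max(hi[a] - lo[a], 1e-12)   # guard against a flat axis
            cell.append(min(int(round((c[a] - lo[a]) / span * scale)), scale))
        return hilbert_index_3d(cell[0], cell[1], cell[2], bits)

    return sorted(range(len(centers)), key=lambda i: key(centers[i]))
```

Calling `hilbert_order(centers)` yields a permutation of the Gaussian indices; applying it to per-Gaussian features gives a 1D sequence in which consecutive entries tend to be spatially close, which is the property that makes such an ordering attractive for sequence-style processing of an unordered point set.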
Our main contributions are as follows: 1) a feed-forward architecture for multi-view object compositing that, unlike diffusion-based alternatives, offers improved computational efficiency and robustness with high visual quality; 2) a two-stage training framework that connects 2D object compositing with 3D Gaussian color fields via a Hilbert curve ordering mechanism, thereby enforcing geometrically consistent illumination priors and coherent multi-view shadows; and 3) a large-scale benchmark of over 480k annotated multi-view scenes under varying lighting conditions, together with experiments demonstrating that our method achieves state-of-the-art performance across several public datasets.

Related Works

Object compositing, the seamless integration of foreground objects into background scenes, is a fundamental task in both image editing and 3D graphics. In the following, we briefly discuss three principal paradigms that have guided existing solutions.