ICLR2026

GOLDILOCS: GENERAL OBJECT-LEVEL DETECTION AND LABELING OF CHANGES IN SCENES

Almog Friedlander, Ariel Shamir, Ohad Fried

Abstract

We propose GOLDILOCS: a novel zero-shot, pose-agnostic method for object-level semantic change detection in the wild. While supervised Scene Change Detection (SCD) methods achieve impressive results on curated datasets, these models do not generalize well, and their performance drops on out-of-domain data. Recent zero-shot SCD methods introduced a more robust approach with foundational models as the backbone, yet they neglect the 3D aspect of the task and remain constrained to the image-pair setting. Conversely, 3D-centric SCD methods based on 3D Gaussian Splatting (3DGS) or NeRFs require multi-view inputs and cannot operate on an image pair. Our key insight is that SCD can be reformulated as a 3D reconstruction problem over time, where geometric inconsistencies naturally indicate change. Although previous work considered viewpoint difference a challenge, we recognize the additional geometric information as an advantage. GOLDILOCS uses dense stereo reconstruction to estimate camera parameters and generate a pointmap of the commonalities between input images by filtering geometric inconsistencies. Rendering the canonical scene representation from multiple viewpoints yields reference images that exclude changed or occluded content. Rigid object changes are then detected through mask tracking, while non-rigid transformations are identified using SSIM heatmaps. We evaluate our method on a variety of datasets, covering both pairwise and multi-view cases in binary and multi-class settings, and demonstrate superior performance over prior work, including supervised methods.

Published as a conference paper at ICLR 2026

We introduce GOLDILOCS (General Object-Level Detection and Labeling Of Changes in Scenes), a zero-shot framework for object-centric SCD. Our proposed framework identifies changed objects and labels the semantic type of change, inspired by the experience of the three bears in the titular fairytale.
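The idea that geometric inconsistencies indicate change can be illustrated with a minimal sketch. It assumes two depth maps that have already been brought into the same camera view (for example, by reprojecting one pointmap with the estimated camera parameters), and it keeps only pixels whose depths agree within a relative tolerance. The function name `consistent_mask`, the `rel_tol` threshold, and the nested-list depth format are illustrative assumptions, not the filtering rule used by GOLDILOCS.

```python
from typing import List, Optional

# Hypothetical depth-map format: rows of depths, None where no depth is available.
DepthMap = List[List[Optional[float]]]


def consistent_mask(depth_a: DepthMap, depth_b: DepthMap,
                    rel_tol: float = 0.05) -> List[List[bool]]:
    """Return a grid that is True where the two depth maps agree.

    Both maps are assumed to be expressed in the same camera view.
    A pixel counts as consistent when the absolute depth difference
    is at most rel_tol times the larger of the two depths.
    """
    mask = []
    for row_a, row_b in zip(depth_a, depth_b):
        mask_row = []
        for da, db in zip(row_a, row_b):
            if da is None or db is None or da <= 0 or db <= 0:
                # Missing or invalid depth: treat as not shared by both times.
                mask_row.append(False)
            else:
                mask_row.append(abs(da - db) <= rel_tol * max(da, db))
        mask.append(mask_row)
    return mask


# Toy example: one pixel changes depth (1.0 vs 2.0), one has no depth at T1.
depth_t0 = [[2.0, 2.0, 5.0],
            [2.0, 1.0, 5.0]]
depth_t1 = [[2.02, 2.0, 5.1],
            [2.0, 2.0, None]]
static = consistent_mask(depth_t0, depth_t1)
```

In this toy case only the pixel whose depth changes from 1.0 to 2.0 and the pixel with missing depth are rejected; the remaining pixels would contribute to a shared, static reconstruction. A relative tolerance is used here so that the allowed deviation grows with distance, which is one common design choice among several.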
Moreover, GOLDILOCS can detect changes even with a single image at T_0 and a single image at T_1, given sufficient 3D overlap. Our method leverages a dense stereo 3D reconstruction model (Leroy et al., 2024) to estimate camera intrinsics, extrinsics, and per-pixel 3D structure. It then performs depth filtering to remove temporally inconsistent geometry, yielding a canonical static reconstruction of the scene. By comparing each input image to the clean rendering and propagating masks with SAM2 (Ravi et al., 2024), GOLDILOCS can identify and categorize scene changes as object-level additions, removals, movements, or non-rigid transformations. The latter are detected via SSIM maps (Wang et al., 2004), which highlight local structural distortions across time. Unlike prior work, GOLDILOCS is training-free, calibration-free, and generalizes to unconstrained, real-world imagery. We demonstrate state-of-the-art binary and multi-class change detection under zero-shot conditions on both image pairs and image sets.

RELATED WORK

This section reviews prior work, from early image differencing approaches to emerging zero-shot and 3D representation-based methods. The discussion highlights how existing approaches balance label dependence, generalization, and geometric reasoning, motivating our proposed solution.

Pre-Neural Network Methods. Earlier approaches to SCD included simple image differencing, likelihood ratio tests, and probabilistic mixture models, and extended to shading models and background modeling; see Radke et al. (2005) for a survey. While these early methods established systematic frameworks for change detection, their performance was often undermined by their reliance on pre-processing steps to address illumination variations and geometric misalignment.

Supervised Methods. Supervised SCD methods rely on labeled image pairs and change maps. Early work (e.g., Sakurada & Okatani (2015)) combined CNN features with superpixels.
More advanced designs include ChangeNet (Varghese et al., 2018), a Siamese-inspired CNN (Mueller & Thyagarajan, 2016; Rao et al., 2017), and transformer-based models (Wang et al., 2021; Chen et al., 2021) exploiting attention for feature differencing. SimSaC (Park et al., 2022) adds an optical-flow warping module, while C-3PO (Wang et al., 2023) uses a backbone network and trains separate branches for change types. CYWS-3D (Sachdeva & Zisserman, 2023b) extends supervision into 3D with a frozen transformer backbone. These methods perform well in-domain but remain tailored to dataset-specific styles and image-pair relationships, limiting generalization.

Semi-, Weakly- and Self-Supervised Methods. To mitigate reliance on labeled data, semi-supervised (Lee & Kim, 2024) and weakly-supervised (Sakurada et al., 2020) methods have been proposed. Additionally, SACD-Net (Furukawa et al., 2020a) learns in a self-supervised manner from unlabeled data by jointly optimizing viewpoint alignment and change detection. While these approaches reduce label cost, they often overlook the difficulty of collecting image pairs. Data augmentation