WWW2026

WPIS: From In-the-Wild Web Images to Physics-Aware 3D Scene Graphs for Physical Reasoning

Ke Ma, Cong Fu, Jianing Wang, Yifei Wang, Wenyuan Li, Xinggang Wang, Meng Wang, Tian Xia

Abstract

Intelligent agents operating in biomedical laboratories need physics-aware understanding beyond simple geometry or semantics. In-the-wild web images capture authentic lab interactions, yet methods designed for instrumented 3D capture and reconstruction struggle to turn this resource into functional knowledge. Consequently, current autonomous laboratory systems lack a queryable representation of affordances for precise object and liquid handling tasks. We pose a web-native question: how can single-view, uncalibrated web images be converted into a structured, physics-aware scene representation? We introduce WPIS (Web- and Physics-Informed Scene-understanding), a pipeline that compiles Physics-aware 3D Scene Graphs (P-3DSGs) from web imagery. WPIS fuses open-vocabulary instance/mask cues with relative geometry, augments nodes with real-valued liquid states and fine-grained hand–object interaction (HOI) subgraphs, and attaches concise natural-language functional relations, all without camera intrinsics, multi-view input, or CAD priors. We release WebLab-3DSG, a 1,000-scene knowledge base pairing each RGB image with its P-3DSG JSON, relative depth, and a single-image point-cloud proxy. In an expert study, grounding an LLM in a P-3DSG improves answer quality by 40% over a strong VLM that reasons directly from RGB, with the largest gains on feasibility, HOI alignment, and safety constraints. WPIS offers a reproducible path to physics-aware reasoning from in-the-wild web imagery and a practical substrate for decision-making in autonomous lab settings.
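To make the P-3DSG idea concrete, the following is a minimal illustrative sketch of how such a graph might be held in memory and serialized to JSON. It is not the released WebLab-3DSG schema: every class, field, and value here (`Node`, `Edge`, `SceneGraph`, `rel_depth`, `fill_fraction`, `hoi`, the file name, and the example scene) is a hypothetical name chosen for illustration, assuming only what the abstract states, namely that nodes carry relative geometry and real-valued liquid states, and that edges carry short natural-language functional relations, some of which form an HOI subgraph.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Node:
    # All field names are hypothetical, not the released schema.
    node_id: str
    label: str                             # open-vocabulary category name
    rel_depth: float                       # relative (non-metric) depth
    fill_fraction: Optional[float] = None  # liquid state; None if not a container


@dataclass
class Edge:
    source: str
    target: str
    relation: str      # short natural-language functional relation
    hoi: bool = False  # True if the edge belongs to the HOI subgraph


@dataclass
class SceneGraph:
    image: str
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def hoi_edges(self) -> List[Edge]:
        """Return only the hand-object interaction edges."""
        return [e for e in self.edges if e.hoi]

    def filled_containers(self, min_fill: float) -> List[str]:
        """Return ids of nodes whose liquid level is at least min_fill."""
        return [
            n.node_id
            for n in self.nodes
            if n.fill_fraction is not None and n.fill_fraction >= min_fill
        ]

    def to_json(self) -> str:
        """Serialize the whole graph, e.g. for use as LLM context."""
        return json.dumps(asdict(self), indent=2)


# A made-up example scene: a hand holding a pipette above a tube.
graph = SceneGraph(
    image="scene_0001.jpg",  # hypothetical file name
    nodes=[
        Node("hand_1", "gloved hand", rel_depth=0.30),
        Node("pipette_1", "micropipette", rel_depth=0.35),
        Node("tube_1", "centrifuge tube", rel_depth=0.40, fill_fraction=0.6),
    ],
    edges=[
        Edge("hand_1", "pipette_1", "grasps", hoi=True),
        Edge("pipette_1", "tube_1", "can dispense into"),
    ],
)

print(graph.to_json())
```

A structure like this is what makes the representation queryable: a downstream reasoner could ask for `graph.hoi_edges()` or `graph.filled_containers(0.5)` instead of inferring the same facts from pixels, and the JSON string could be placed directly in an LLM prompt.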