ICLR 2026
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Abstract
Figure: Puffin (Ours) on (a) camera-controllable generation and (b) camera understanding.

(a) Camera-Controllable Generation

Prompt: Given a scene description and corresponding camera parameters, merge them into a coherent prompt and generate an accurate visualization that highlights visual cues. Caption: A sleek, modern interior with a large, minimalistic window framed in light beige, offering a view of a textured blue stone wall reflecting glimpses of a nearby building and lush greenery. The camera parameters (roll, pitch, and FoV) are: -0.2761, -0.1348, 0.9415.

Response: <think> The slight tilt of the window frame relative to the vertical lines of the wall suggests a small counterclockwise Dutch angle in the horizontal orientation. The visible portion of the window and the wall indicates a medium shot, capturing a portion of the interior and exterior without showing the entire scene. The subtle downward angle of the camera, evident from the alignment of the windowsill and the wall, suggests a small tilt-down in the vertical orientation. </think>

Panel labels: Generated Image, Error Map, Latitude Camera Map. Reported errors: 7.5° and 2.7°.

(b) Camera Understanding

Horizon lines shown for: Ground Truth, UVP, GeoCalib, Puffin (Ours).

Prompt: Reason the spatial distribution of this image in a thinking mode, and then estimate its camera parameters (roll, pitch, and FoV, in radians).

Response: <think> (Roll) The image exhibits a large clockwise Dutch angle, as evidenced by the tilted perspective where the tree and roofline are not aligned with the natural horizontal plane, suggesting a significant rotation of the camera. (Pitch) The vertical orientation indicates a large tilt-up, as the sky occupies a substantial portion of the frame, with the tree and roofline positioned lower, emphasizing an upward view. (FoV) The image exhibits a medium shot, capturing a portion of the tree and roof while maintaining a balanced composition that includes both the sky and the elements in the foreground. </think> <answer>0.5107, 0.5550, 0.7558</answer>
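The camera-understanding response above ends with an `<answer>` tag holding three comma-separated values, which the prompt identifies as roll, pitch, and FoV. As a minimal sketch of how such a response could be post-processed, the snippet below extracts the three values and converts them to degrees, assuming they are expressed in radians. The helper names `parse_camera_answer` and `to_degrees` are hypothetical and are not taken from the authors' code; the `<think>` body is abbreviated in the sample string.

```python
import math
import re

# Hypothetical helpers for illustration; not part of the authors' released code.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_camera_answer(response: str) -> dict:
    """Extract (roll, pitch, fov) from a '<answer>r, p, f</answer>' block."""
    match = ANSWER_RE.search(response)
    if match is None:
        raise ValueError("no <answer>...</answer> block found")
    values = [float(v) for v in match.group(1).split(",")]
    if len(values) != 3:
        raise ValueError(f"expected 3 values, got {len(values)}")
    roll, pitch, fov = values
    return {"roll": roll, "pitch": pitch, "fov": fov}


def to_degrees(params: dict) -> dict:
    """Convert every parameter from radians to degrees (assumes radian input)."""
    return {name: math.degrees(value) for name, value in params.items()}


# Sample response with the reasoning text abbreviated; the answer values
# are the ones shown in the figure.
response = "<think>(Roll) ... (Pitch) ... (FoV) ...</think> <answer>0.5107, 0.5550, 0.7558</answer>"
params = parse_camera_answer(response)
degrees = to_degrees(params)

for name in ("roll", "pitch", "fov"):
    print(f"{name}: {params[name]:.4f} rad = {degrees[name]:.1f} deg")
```

Read this way, the example answer corresponds to a roll of roughly 29°, a pitch of roughly 32°, and a field of view of roughly 43°, which is consistent with the "large clockwise Dutch angle" and "large tilt-up" described in the reasoning text.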