ICLR 2026
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Cited by 12
Abstract
[Figure: examples of (a) Camera Controllable Generation and (b) Camera Understanding. Horizon lines are compared for Ground Truth, UVP, GeoCalib, and Puffin (Ours). Panel labels: Error Map (Error = 7.5°, Error = 2.7°), Generated Image, Latitude, Camera Map.]

(a) Camera Controllable Generation

Prompt: Given a scene description and corresponding camera parameters, merge them into a coherent prompt and generate an accurate visualization that highlights visual cues.

Caption: A sleek, modern interior with a large, minimalistic window framed in light beige, offering a view of a textured blue stone wall reflecting glimpses of a nearby building and lush greenery. The camera parameters (roll, pitch, and FoV) are: -0.2761, -0.1348, 0.9415.

<think> The slight tilt of the window frame relative to the vertical lines of the wall suggests a small counterclockwise Dutch angle in the horizontal orientation. The visible portion of the window and the wall indicates a medium shot, capturing a portion of the interior and exterior without showing the entire scene. The subtle downward angle of the camera, evident from the alignment of the windowsill and the wall, suggests a small tilt-down in the vertical orientation. </think>

(b) Camera Understanding

Prompt: Reason the spatial distribution of this image in a thinking mode, and then estimate its camera parameters (roll, pitch, and FoV, in radians).

<think> (Roll) The image exhibits a large clockwise Dutch angle, as evidenced by the tilted perspective where the tree and roofline are not aligned with the natural horizontal plane, suggesting a significant rotation of the camera. (Pitch) The vertical orientation indicates a large tilt-up, as the sky occupies a substantial portion of the frame, with the tree and roofline positioned lower, emphasizing an upward view. (FoV) The image exhibits a medium shot, capturing a portion of the tree and roof while maintaining a balanced composition that includes both the sky and the elements in the foreground. </think> <answer>0.5107, 0.5550, 0.7558</answer>
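The camera-understanding example returns roll, pitch, and FoV as three comma-separated values inside an <answer> span. Given the prompt and the magnitudes shown, these are presumably radians. A minimal sketch of reading such a response and converting it to degrees could look like the following. The helper names parse_camera_answer and to_degrees are hypothetical, and the response format is assumed from the single example shown in the figure, not from any released code.

```python
import math
import re


def parse_camera_answer(text):
    """Extract (roll, pitch, fov) from an <answer>...</answer> span.

    Assumes the three values are comma-separated and given in radians,
    as in the figure's example; this format is an assumption.
    """
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <answer> span found")
    values = [float(v) for v in match.group(1).split(",")]
    if len(values) != 3:
        raise ValueError("expected three values: roll, pitch, FoV")
    return tuple(values)


def to_degrees(params):
    """Convert a tuple of angles from radians to degrees."""
    return tuple(math.degrees(v) for v in params)


# Example response taken from the figure text (reasoning elided here).
response = "<think> ... </think> <answer>0.5107, 0.5550, 0.7558</answer>"
roll, pitch, fov = to_degrees(parse_camera_answer(response))
print(f"roll={roll:.1f} pitch={pitch:.1f} fov={fov:.1f}")
# prints: roll=29.3 pitch=31.8 fov=43.3
```

Read in degrees, the values are roughly 29° of roll, 32° of pitch, and a 43° field of view, which fits the "large clockwise Dutch angle" and "large tilt-up" described in the reasoning text.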