CVPR 2025

Scaling Properties of Diffusion Models For Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

Abstract

We fine-tune a pre-trained diffusion model (DM) for visual perception tasks. The model takes as input an RGB image, a conditioning image (e.g., the next video frame or an occlusion mask), and a noised version of the ground-truth prediction. Conditioned on a task embedding, it generates predictions for depth estimation, optical flow prediction, and amodal segmentation. We train a single generalist model that handles all three tasks with exceptional performance.
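
To make the input layout concrete, here is a minimal NumPy sketch of how one training example for optical flow might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the channel-wise concatenation, the integer task ids in `TASK_IDS`, the helper names `noise_target` and `build_model_input`, and the use of the standard DDPM forward process are all assumed, since the abstract does not specify these details.

```python
import numpy as np

# Hypothetical task ids: the abstract mentions a task embedding but not its form.
TASK_IDS = {"depth": 0, "optical_flow": 1, "amodal_segmentation": 2}


def noise_target(target, alpha_bar_t, rng):
    """Standard DDPM forward process: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(target.shape)
    noised = np.sqrt(alpha_bar_t) * target + np.sqrt(1.0 - alpha_bar_t) * eps
    return noised, eps


def build_model_input(rgb, cond, target, task, alpha_bar_t, rng):
    """Assemble one example. Assumed layout: channel-wise concatenation."""
    noised, eps = noise_target(target, alpha_bar_t, rng)
    # Resulting shape: (C_rgb + C_cond + C_target, H, W)
    x = np.concatenate([rgb, cond, noised], axis=0)
    return x, TASK_IDS[task], eps


rng = np.random.default_rng(0)
rgb = rng.random((3, 4, 4))         # current frame
next_frame = rng.random((3, 4, 4))  # conditioning image for optical flow
flow = rng.random((2, 4, 4))        # ground-truth flow (dx, dy)

x, task_id, eps = build_model_input(
    rgb, next_frame, flow, "optical_flow", alpha_bar_t=0.5, rng=rng
)
print(x.shape, task_id)  # (8, 4, 4) 1
```

In this sketch the denoiser would receive `x` together with `task_id` and be trained to recover `eps` (or the clean target); which of the two the actual model predicts is not stated in the abstract.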