CVPR 2024

Seeing the Unseen: Visual Common Sense for Semantic Placement

Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

Cited 3 times

Abstract

Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding ‘what is not present’. Specifically, given an image (e.g. of a living room) and the name of an object (“cushion”), a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house), AR devices (automatically rendering an object in the user's space), and visually grounded chatbots with common sense. Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images (e.g. via image search with object names) and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context, which is easy to find online, and then remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a dataset comprising pairs of images with/without the object. With this proposed data generation pipeline, we collect a novel dataset containing 1.3M images across 9 object categories. We then train an SP prediction model, called CLIP-UNet, on our dataset. CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors, generalizes well to real-world and simulated images, and exhibits semantics-aware reasoning for object placement.
In our user studies, we find that the SP masks predicted by CLIP-UNet are favored 43.7% and 31.3% of the time over the 4 SP baselines on real and simulated images, respectively. In addition, leveraging SP mask predictions from CLIP-UNet enables downstream applications such as building tidying robots for indoor environments.
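
The "opposite direction" data generation idea described in the abstract can be illustrated with a minimal sketch. It assumes the object's location in the source image is already available as a binary mask. The abstract does not say how that mask is obtained or which inpainting model is used, so every name here (`PlacementExample`, `make_placement_example`, `mean_fill_inpaint`) is hypothetical. The mean-color fill is only a toy stand-in for a real inpainting model. Images are nested lists of RGB tuples purely to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels
Mask = List[List[bool]]    # True where the object is visible


@dataclass
class PlacementExample:
    image_without_object: Image  # model input: the scene after object removal
    object_name: str             # text query, e.g. "cushion"
    target_mask: Mask            # supervision: where the object used to be


def mean_fill_inpaint(image: Image, mask: Mask) -> Image:
    """Toy stand-in for an inpainting model (an assumption, not the paper's
    method): masked pixels take the mean color of the unmasked pixels."""
    kept = [px for row, mrow in zip(image, mask)
            for px, m in zip(row, mrow) if not m]
    if not kept:  # nothing to average over: return an unchanged copy
        return [list(row) for row in image]
    fill = tuple(sum(px[c] for px in kept) // len(kept) for c in range(3))
    return [[fill if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def make_placement_example(
    image: Image,
    object_mask: Mask,
    object_name: str,
    inpaint: Callable[[Image, Mask], Image] = mean_fill_inpaint,
) -> PlacementExample:
    """Turn one image of an object in context into one training example:
    remove the object and keep its former location as the target region."""
    same_height = len(image) == len(object_mask)
    same_width = all(len(r) == len(m) for r, m in zip(image, object_mask))
    if not (same_height and same_width):
        raise ValueError("image and mask must have the same height and width")
    return PlacementExample(
        image_without_object=inpaint(image, object_mask),
        object_name=object_name,
        target_mask=[list(row) for row in object_mask],
    )


# Tiny 2x3 "living room": a red cushion pixel on a brown sofa.
sofa, cushion = (90, 60, 30), (200, 0, 0)
image = [[sofa, sofa, sofa], [sofa, cushion, sofa]]
mask = [[False, False, False], [False, True, False]]
example = make_placement_example(image, mask, "cushion")
```

The key design point is that the supervision comes for free: the region where the object was removed is, by construction, a region where a human placed that object. A real pipeline would swap `mean_fill_inpaint` for an actual inpainting model through the `inpaint` argument.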