CVPR 2025

F-LMM: Grounding Frozen Large Multimodal Models

Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy

Abstract

Figure 1. An example of a user-AI conversation about an image. Left: The current state-of-the-art grounding model GLaMM [60] is effective for grounded conversation when prompted with "answer with interleaved masks", but it fails to follow the user's instruction to answer with a single word (yes or no) and misinterprets the question as a referring segmentation prompt. Right: Our F-LMM preserves instruction-following ability while still being able to perform visual grounding.