CVPR 2025

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee

Abstract

Figure 1. (a) A model structure that generates proposal masks using a mask generation model. (b) A model structure that refines the correlation between image and text. (c) The structure of the proposed ESC-Net. ESC-Net efficiently models the relationship between images and text by combining a pre-trained SAM block with pseudo prompts, rather than relying on an inefficient mask generation model. This approach enables much denser mask prediction than conventional correlation-based methods.
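To make the idea in panels (b) and (c) concrete, the following is a minimal sketch of how an image-text correlation map might be computed and turned into point-style pseudo prompts. It is an illustration under stated assumptions, not the ESC-Net implementation: the caption does not specify how pseudo prompts are derived, so the use of cosine similarity, the top-k peak selection, and the function names `cosine_correlation` and `pseudo_point_prompts` are all hypothetical. The toy features stand in for encoder outputs, and no SAM block is invoked.

```python
import math


def cosine_correlation(image_feats, text_embeds):
    """Cosine similarity between every pixel feature and every class embedding.

    image_feats: H x W x D nested lists; text_embeds: C x D nested lists.
    Returns a C x H x W nested list of similarities.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    text_n = [normalize(t) for t in text_embeds]
    img_n = [[normalize(px) for px in row] for row in image_feats]
    return [
        [[sum(a * b for a, b in zip(px, t)) for px in row] for row in img_n]
        for t in text_n
    ]


def pseudo_point_prompts(corr_map, k=1):
    """Assumed heuristic: take the k highest-scoring (row, col) locations
    of one class's correlation map as point prompts."""
    scored = [
        (score, r, c)
        for r, row in enumerate(corr_map)
        for c, score in enumerate(row)
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]


# Toy 2 x 2 feature map with 2-D features and two class embeddings (made-up values).
image_feats = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.1, 0.9], [0.0, 1.0]],
]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

corr = cosine_correlation(image_feats, text_embeds)
points = [pseudo_point_prompts(class_map, k=1) for class_map in corr]
```

In a full pipeline, prompts of this kind would be passed to a pre-trained SAM block in place of user-provided clicks or boxes; that step is omitted here because the caption gives no details of the interface.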