CVPR2024

RegionGPT: Towards Region Understanding Vision Language Model

Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

Abstract

Figure 1 . We introduce RegionGPT that enables complex region-level captioning, reasoning, classification, and expression comprehension capabilities for the multimodal large language model. Users can input regions of interest of any shape, utilizing ⟨region⟩ as a placeholder within the instruction at any position. Such placeholders are subsequently replaced with semantic region-level embeddings that are fed into the language decoder. Best viewed in color.