ICLR 2025
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li
Abstract
Figure 1: Overview of the Draw-and-Understand framework. (a) An illustration of the visual prompting understanding task. (b) The architecture of the Visual Prompting MLLM (VP-MLLM), which consists of an image encoder, a visual prompt encoder, and an LLM. (c) The training data generation process, which has two components: reconstruction of open-source data and data generation assisted by GPT-4V.