ICLR 2025

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

Abstract

Figure 1: Overview of the Draw-and-Understand framework. (a) Illustration of the visual prompting understanding task. (b) The architecture of the Visual Prompting MLLM (VP-MLLM), which consists of an image encoder, a visual prompt encoder, and an LLM. (c) The data generation process for training, which has two components: reconstruction of open-source data and data generation assisted by GPT-4V.