ACL 2024

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang

Abstract

Despite the remarkable capabilities of vision-language models (VLMs) as versatile visual assistants, two substantial challenges persist in existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we propose VISION-FLAN, the most diverse publicly available visual instruction tuning dataset to date, comprising 196 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. Complementing the proposed dataset, we further introduce a two-stage instruction tuning framework, in which VLMs are first tuned on VISION-FLAN and then further tuned on GPT-4 synthesized data. Our experimental results demonstrate that, by leveraging the two-stage tuning framework, VLMs trained on VISION-FLAN achieve state-of-the-art performance across a wide range of multi-modal evaluation benchmarks.