CVPR 2025

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu

Abstract

Figure 1. Performance overview of VLsI on vision-language benchmarks. (a) Accuracy on MM-Vet [100] across model sizes, showing that VLsI (2B and 7B) achieves performance competitive with closed-source VLMs. (b) Comparative evaluation on multiple challenging benchmarks, where VLsI (green and blue) outperforms leading closed-source VLMs, including GPT-4V [79], Claude-3.5-Sonnet [1], and Gemini-1.5-Pro [87], highlighting its efficiency and effectiveness across diverse tasks.