ICLR 2025

LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

Abstract

Figure 1: Performance comparison in three interleaved scenarios: multi-image, multi-frame (video), and multi-view (3D). Our LLaVA-Interleave model achieves state-of-the-art performance across a variety of evaluation benchmarks.