ICLR 2026

A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Sizhuo Zhou, Yifan Chang, Shenglin Zhang, Yu Dai, Kaipeng Zhang

Cited by 3

Abstract

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse, well-designed question templates based on human preferences and covering a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs in interactive image-text generation. To evaluate these capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). These scores are complementary, covering content, quality, and cross-modal interaction, thereby forming a comprehensive evaluation framework. Experimental results on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K/200K brings further gains in TCC, ICC, and especially ITS, highlighting InterSyn's (1) scalability, as performance consistently improves with more data; and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

INTRODUCTION

Multimodal understanding and generation are critical capabilities on the path toward artificial general intelligence.
In the past two years, multimodal large language models (MLLMs) (Liu et al., 2023; Chen et al., 2024c; Wang et al., 2024a) have shown remarkable performance in multimodal understanding, even surpassing humans in some areas, and there have also been many impressive advances in high-quality image generation (Esser et al., 2024b; Betker et al., 2023). However, these models are often limited to generating either text or image outputs in isolation, while real-world scenarios typically require tightly interleaved multimodal outputs. Recently, pioneering unified LMMs, such as Janus-Pro (Chen et al., 2025b), have shown great potential. However, they struggle to generate instruction-following interleaved image-text outputs, exhibiting issues such as semantic drift, low image-text synergy, and poor image quality. The main challenges lie in the limited scale, quality, and instructional richness of existing datasets. Even with existing datasets (Zhu et al., 2023; Laurençon et al., 2023; Chen et al., 2024a;b; Xu et al., 2024), these challenges remain due to their critical limitations: (1) Limited scale: they focus on narrow tasks and typically contain no more than tens of thousands of samples, limiting their applicability to broader real-world scenarios; (2) Unstable quality: Built on web-crawled sources (Yang et al.,