ICLR 2025

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, et al.

Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning (SFT). Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

• To further enhance model performance, especially for text-rich image understanding, we ablate the data choices for continual pre-training (Section 2.3). This includes 45M rich OCR data and 7M high-quality captions, either from public sources or generated by an MM1-based image captioner.
• We also provide a detailed ablation of dynamic image splitting, also known as AnyRes (Liu et al., 2024a) (Section 2.4), for high-resolution image comprehension. Finally, to enhance performance on knowledge-heavy benchmarks like MMMU (Yue et al., 2023a), we further study the impact of pre-training data (Appendix A.2).

EMPIRICAL SETUP FOR ABLATIONS

Unless otherwise noted, we follow the default settings below in our ablation studies.

Model architecture and data preprocessing. We follow MM1 (McKinzie et al., 2024) and use the same architecture, focusing on the 3B dense model for all the ablation studies in this section.
Specifically,
• Static image splitting (Lin et al., 2023b) is enabled with 4 sub-image splits (plus an overview image), and each sub-image is resized to 672×672 resolution via position embedding interpolation. Note that we did not use dynamic image splitting during ablation, for faster iteration of experiments.
• For the encoding of multi-image data, we enable image splitting only when the current training sample contains fewer than three images, to avoid excessively long sequence lengths.
• Similar to the capabilities introduced in Ferret (You et al., 2023), MM1.5 directly supports referring and grounding. When requested, MM1.5 can produce bounding boxes in its textual output to ground its responses. Additionally, the model can interpret references to points and regions in the input image in the form of referring coordinates and bounding boxes.
• The CLIP image encoder and the LLM backbone are based on in-house models, with C-Abstractor (Cha et al., 2024) serving as the vision-language connector.

Model optimization. For both continual pre-training and SFT, we set the batch size to 256. We use the AdaFactor optimizer with a peak learning rate of 1e-5 and a cosine decay of 0. For continual pre-training, we train for a maximum of 30k steps. During SFT, all models are optimized for one epoch.

Continual pre-training. Models are initialized with the MM1 pre-trained checkpoint. By default, we conduct continual pre-training on 45M high-resolution OCR data (including PDFA, IDL, Renderedtext (Laurençon et al., 2024a), and DocStruct-4M (Hu et al., 2024a)) at this stage. In each training batch, data is sampled equally from these four datasets. Similar to the SFT stage, we use static image splitting, dividing each image into five sub-images (four splits plus an overview image), with each sub-image resized to 672×672 resolution. We find that this high-resolution setup is essential for continual pre-training.

SFT data categorization.
Grouping datasets into categories can help with data balancing and simplify the analysis (Laurençon et al., 2024a; Tong et al., 2024a). At a high level, we cluster datasets into single-image, multi-image, and text-only categories based on the number of images in each example. For the single-image group, we further classify each dataset into the following sub-categories: general, text-rich, refer&ground, science, math, and code. See Table 5 in Appendix A.3 for the details of each category used in the ablation study, and Figure 8 for an overview of the group categories.

Evaluation benchmarks. We group our benchmarks into categories based on the capability each benchmark primarily measures. Our benchmark groups are general, text-rich, refer&ground, knowledge, and multi-image. See Table 6 in Appendix A.4 for more details. We propose the Category Average Score, the average of all benchmark scores within each sub-category, to represent the average performance on that capability. We focus on the general, text-rich, and knowledge categories, as these capabilities are widely considered essential for MLLMs. To ev
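
The Category Average Score described above can be illustrated with a minimal Python sketch. It assumes that each benchmark belongs to exactly one category and that all scores are on a comparable scale. The benchmark names, the `BENCHMARK_CATEGORY` mapping, and the numbers are hypothetical placeholders, not the grouping in Table 6 or any reported results.

```python
from statistics import mean

# Hypothetical benchmark-to-category mapping (placeholder names only;
# the actual grouping is the one given in Table 6 of the paper).
BENCHMARK_CATEGORY = {
    "general_bench_a": "general",
    "general_bench_b": "general",
    "textrich_bench_a": "text-rich",
    "textrich_bench_b": "text-rich",
    "knowledge_bench_a": "knowledge",
}


def category_average_scores(scores, benchmark_category):
    """Return {category: mean of the scores of all benchmarks in it}."""
    grouped = {}
    for benchmark, score in scores.items():
        category = benchmark_category[benchmark]
        grouped.setdefault(category, []).append(score)
    return {category: mean(values) for category, values in grouped.items()}


# Made-up scores for illustration only (not results from the paper).
example_scores = {
    "general_bench_a": 60.0,
    "general_bench_b": 70.0,
    "textrich_bench_a": 50.0,
    "textrich_bench_b": 80.0,
    "knowledge_bench_a": 40.0,
}

category_scores = category_average_scores(example_scores, BENCHMARK_CATEGORY)
```

Comparing two data mixtures then reduces to comparing the per-category averages, rather than inspecting every benchmark individually.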