ICLR 2025

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

Abstract

[Figure: panels labeled MM-IFEval-C (300 questions) and MIA-Bench (about 1k constraints); 32 categories of constraints; 400 high-quality samples; 5.1 average constraints; 3 evaluation metrics combined. Example instruction: "Imagine you are the musician in this image. Write about your thoughts and feelings while performing." Example constraints: "1. Answer as if you are facing the audience. 2. Use no more than 60 words……"]

* Equal contribution. Corresponding authors.

[…] responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). We have fully open-sourced the datasets (both SFT and DPO), evaluation code, and training scripts at https://github.com/SYuan03/MM-IFEngine.
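
To make the idea of pairing rule-based assessment with a judge model concrete, here is a minimal sketch built around the example constraint "no more than 60 words". It is an illustration under stated assumptions, not the released evaluation code: the function names `within_word_limit` and `score_response`, the stub `judge` callable, and the equal-weight averaging of the two scores are all hypothetical.

```python
from typing import Callable


def within_word_limit(response: str, max_words: int = 60) -> bool:
    """Rule-based check: True if the response has at most max_words
    whitespace-separated tokens."""
    return len(response.split()) <= max_words


def score_response(
    response: str,
    judge: Callable[[str], float],
    max_words: int = 60,
) -> float:
    """Combine a rule-based check with a judge score.

    ASSUMPTION: the judge returns a value in [0, 1], and the two scores are
    averaged with equal weight. The source does not specify how the
    rule-based and judge-based results are combined.
    """
    rule_score = 1.0 if within_word_limit(response, max_words) else 0.0
    # Clamp the judge output so a misbehaving judge cannot leave [0, 1].
    judge_score = min(1.0, max(0.0, judge(response)))
    return (rule_score + judge_score) / 2


if __name__ == "__main__":
    # Stand-in judge for illustration only; a real pipeline would query a model.
    def constant_judge(_response: str) -> float:
        return 0.5

    print(score_response("The lights are warm and the crowd is quiet.", constant_judge))
```

The rule-based path covers constraints that can be verified mechanically, such as a word limit, while a constraint like "answer as if you are facing the audience" would need the judge path.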