CVPR 2025

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Qirui Jiao, Daoyuan Chen, Yilun Huang, Bolin Ding, Yaliang Li, Ying Shen

Abstract

We provide more details and experiments of this work in the supplementary material and organize them as follows:

• Section 9. Comparison with Existing Image Difference Datasets: We compare our IMG-DIFF dataset with existing image difference datasets in terms of characteristics and performance, highlighting the advantages of our dataset.
• Section 10. Prioritizing Quality Over Quantity: We clarify that our choice to use 13K samples for testing is motivated by the typical size of task-specific datasets used for MLLM fine-tuning. Furthermore, by expanding the dataset to four times its original size, we confirm that the relationship between data size and performance gains is not linear.
• Section 11. Expanding Diversity with Lexicons: We use a lexicon to generate object replacement data and test the new dataset. The results validate the effectiveness of this lexicon-based strategy in enhancing data diversity.
• Section 12. Performance Based on Contrastive Chain-of-Thought: We evaluate our dataset using the Contrastive Chain-of-Thought method. The results confirm that our dataset enables the fine-tuned model to more accurately describe image differences, thereby enhancing the model's VQA capability.
• Section 13. Testing on MLLMs at Different Scales: We test the performance of our IMG-DIFF dataset across MLLMs of different scales. The results indicate that the performance gains brought by our dataset are not limited by scale.
• Section 14. Top-Performing MLLMs in Image Difference Detection: We evaluate the difference detection capabilities of top-performing MLLMs, revealing significant room for improvement among SOTA models.
• Section 15. Unnatural Images in the Dataset: We remove unnatural images from the generated data, fine-tune the models and evaluate their performance, revealing that unnatural images do not necessarily degrade model efficacy.
• Section 16. Impact of our Dataset on Spatial Reasoning Performance: We evaluate whether our generated data enhances spatial reasoning capabilities in models, confirming its effectiveness.
• Section 17. Ablation Studies: We explore the impact of varying filter intensities on the performance of the final dataset. As a result, we identify an optimal threshold that balances data quality and quantity.
• Section 18. Additional Details of Experiments: We