ICML 2025

Be Confident: Uncovering Overfitting in MLLM Multi-Task Tuning

Wenke Huang, Jian Liang, Guancheng Wan, Didi Zhu, He Li, Jiawei Shao, Mang Ye, Bo Du, Dacheng Tao

Abstract

Fine-tuning Multimodal Large Language Models (MLLMs) in multi-task learning scenarios has emerged as an effective strategy for achieving cross-domain specialization. However, multi-task fine-tuning exhibits performance degradation on open-response datasets. We posit that free-form answer generation primarily depends on language priors, and that strengthening the integration of visual behavioral cues is critical for enhancing prediction robustness. In this work, we propose Noise Resilient Confidence Alignment to address the open-response overfitting challenge during multi-task fine-tuning. Our approach prioritizes maintaining consistent prediction patterns in MLLMs across varying visual qualities. To achieve this, we synthesize distorted visual inputs and enforce alignment of token prediction confidence towards the normal visual branch. By explicitly linking confidence calibration to visual robustness, this method reduces over-reliance on language priors. We conduct extensive empirical evaluations across diverse multi-task downstream settings with popular MLLM architectures. Comprehensive experiments demonstrate the effectiveness of our method, showcasing its ability to alleviate open-response overfitting while maintaining satisfactory multi-task performance.
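To make the idea of aligning token confidence across a normal and a distorted visual branch concrete, the following is a minimal illustrative sketch and not the paper's implementation. The abstract does not specify the distortion type or the loss form, so the additive Gaussian noise, the squared-gap penalty, and all function names (`add_gaussian_noise`, `token_confidence`, `confidence_alignment_loss`) are assumptions made for illustration. The logits stand in for the outputs an MLLM would produce for each branch.

```python
import numpy as np

def add_gaussian_noise(image, sigma=0.1, seed=0):
    """Return a distorted copy of `image` (values in [0, 1]).

    Assumption: additive Gaussian noise is used as the distortion here;
    the abstract only says that distorted visual inputs are synthesized.
    """
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def token_confidence(logits, target_ids):
    """Softmax probability assigned to each target token.

    logits: array of shape (seq_len, vocab_size)
    target_ids: integer array of shape (seq_len,)
    """
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[np.arange(len(target_ids)), target_ids]

def confidence_alignment_loss(clean_logits, noisy_logits, target_ids):
    """Mean squared gap between noisy-branch and clean-branch confidence.

    Assumption: a squared gap is one simple way to pull the distorted
    branch towards the normal branch. In a training framework the clean
    confidence would typically be treated as a fixed target (no gradient).
    """
    conf_clean = token_confidence(clean_logits, target_ids)
    conf_noisy = token_confidence(noisy_logits, target_ids)
    return float(np.mean((conf_noisy - conf_clean) ** 2))
```

In a fine-tuning loop, a term like this would presumably be added to the standard next-token loss with a weighting coefficient; that coefficient and the way the two terms are combined are likewise not stated in the abstract.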