EMNLP 2025

Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Baobao Chang, Minjia Zhang

Abstract

Large vision-language models (LVLMs) have achieved impressive results in vision-language tasks. However, LVLMs suffer from hallucinations caused by language bias, in which the model over-relies on text while neglecting the image. We identify two causes of this bias: (1) the difference in training scale between the LLM pretraining stage and the LVLM alignment stage, and (2) an inference bias learned from the short-term dependency of text data. We therefore propose LACING, which addresses this bias with a MuLtimodal DuAl-attention MeChanIsm (MDA) aNd Soft-Image Guidance (SIG). Specifically, MDA adopts a parallel dual-attention mechanism that constructs separate attention for visual and text inputs to enhance the integration of visual inputs across the model. SIG uses a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs during inference. Experiments across different model architectures and scales demonstrate that LACING effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without additional resources.
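
To make the idea of separate attention for visual and text inputs concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single query, toy two-dimensional tokens, and a simple sum to combine the two attention outputs; the abstract does not specify how the two branches are merged. All function names (`attend`, `joint_attention`, `dual_attention`) are hypothetical. The sketch contrasts one softmax shared across all tokens, where many strongly matching text tokens can crowd out the visual tokens, with one softmax per modality, where the visual tokens always keep a full unit of attention mass.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def joint_attention(query, vis_kv, txt_kv):
    """Baseline: one softmax shared by visual and text tokens."""
    keys = vis_kv[0] + txt_kv[0]
    values = vis_kv[1] + txt_kv[1]
    out, weights = attend(query, keys, values)
    n_vis = len(vis_kv[0])
    return out, sum(weights[:n_vis])  # attention mass placed on visual tokens


def dual_attention(query, vis_kv, txt_kv):
    """Assumed dual form: one softmax per modality, outputs summed."""
    vis_out, vis_w = attend(query, vis_kv[0], vis_kv[1])
    txt_out, _ = attend(query, txt_kv[0], txt_kv[1])
    out = [a + b for a, b in zip(vis_out, txt_out)]  # sum is an assumption
    return out, sum(vis_w)


# Toy inputs: two visual tokens that match the query weakly,
# four text tokens that match it strongly.
query = [1.0, 0.0]
vis_kv = ([[0.0, 1.0], [0.0, -1.0]], [[1.0, 0.0], [1.0, 0.0]])
txt_kv = ([[2.0, 0.0]] * 4, [[0.0, 1.0]] * 4)

_, joint_vis_mass = joint_attention(query, vis_kv, txt_kv)
_, dual_vis_mass = dual_attention(query, vis_kv, txt_kv)
print(f"visual attention mass, joint softmax: {joint_vis_mass:.3f}")
print(f"visual attention mass, dual softmax:  {dual_vis_mass:.3f}")
```

In this toy setting the joint softmax leaves the visual tokens with only a small share of the attention mass, while the per-modality softmax normalizes visual weights independently of how many text tokens are present. Whether LACING combines the two branches by summation, gating, or another rule is not stated in the abstract.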