ICLR 2025

DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models

Kaishen Wang, Hengrui Gu, Meijun Gao, Kaixiong Zhou

Abstract

Large Vision-Language Models (LVLMs) exhibit significant potential in multimodal tasks but often struggle with hallucinations: responses that are plausible yet visually ungrounded. In this work, we investigate the layer-wise prediction tendencies of LVLMs and conduct an in-depth analysis of their decoding mechanism. We observe that LVLMs tend to "overthink" during the final stages of decoding, making significant prediction shifts in the last few layers that often favor incorrect results, which leads to a surge in hallucinated outputs. Leveraging this localized pattern, we propose a novel decoding strategy inspired by the momentum analogy used in gradient descent-based optimizers. Our method adaptively enforces decoding consistency across layers during forward passes, an approach that is under-explored in existing work. This strategy significantly improves the reliability and performance of LVLMs in various multimodal tasks while introducing only negligible efficiency overhead. The code is available at https://github.com/tunantu/DAMO.
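To make the momentum analogy concrete, the sketch below applies a plain exponential moving average to the per-layer changes of a toy activation vector. It is an illustration only: the abstract does not state the actual update rule, so the function name `momentum_smooth`, the fixed coefficient `beta`, and the list-of-lists representation are all assumptions, and the adaptive weighting the abstract mentions is not modeled here. For the real method, see the linked repository.

```python
def momentum_smooth(layer_states, beta=0.9):
    """Smooth a sequence of per-layer activation vectors (hypothetical helper).

    layer_states: list of equal-length lists of floats, one per layer.
    beta: assumed fixed momentum coefficient in [0, 1); 0 disables smoothing.
    """
    if not layer_states:
        return []
    # The first layer is kept unchanged; momentum starts at zero.
    smoothed = [list(layer_states[0])]
    momentum = [0.0] * len(layer_states[0])
    for prev, cur in zip(layer_states, layer_states[1:]):
        # Raw change contributed by this layer.
        delta = [c - p for c, p in zip(cur, prev)]
        # Accumulate an exponential moving average of the per-layer changes.
        momentum = [beta * m + (1.0 - beta) * d for m, d in zip(momentum, delta)]
        # Apply the accumulated momentum instead of the raw change.
        smoothed.append([s + m for s, m in zip(smoothed[-1], momentum)])
    return smoothed


# Toy example: the last layer makes an abrupt jump (2.0 -> 10.0).
states = [[0.0], [1.0], [2.0], [10.0]]
result = momentum_smooth(states, beta=0.5)
print(result[-1])
```

In this toy run the abrupt shift in the final layer is damped, because the accumulated momentum still reflects the smaller changes of earlier layers. Setting `beta=0.0` reproduces the original states exactly.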