EMNLP 2025

Image Difference Captioning via Adversarial Preference Optimization

Zihan Huang, Junda Wu, Rohan Surana, Tong Yu, David Arbour, Ritwik Sinha, Julian J. McAuley

Cited by 3

Abstract

Image Difference Captioning (IDC) aims to generate natural language descriptions that highlight subtle differences between two visually similar images. While recent advances leverage pre-trained vision-language models to align fine-grained visual differences with textual semantics, existing supervised approaches often focus too heavily on dataset-specific language patterns and fail to capture fine-grained, context-aware preferences in IDC, owing to limited annotation diversity and a lack of semantically informative negative examples during training. To address these limitations, we propose an adversarial direct preference optimization (ADPO) framework for IDC, which formulates IDC as a preference optimization problem under the Bradley-Terry-Luce model, directly aligning the captioning policy with pairwise difference preferences via Direct Preference Optimization (DPO). To model more accurate and diverse IDC preferences, we introduce an adversarially trained hard negative retriever that selects counterfactual captions. This results in a minimax optimization problem, which we solve via policy-gradient reinforcement learning, enabling the policy and retriever to improve jointly. By dynamically generating semantically challenging negatives, our method reduces reliance on dataset-specific patterns. Experiments on benchmark IDC datasets show that our approach outperforms existing baselines, especially in generating fine-grained and accurate difference descriptions.

* These authors contributed equally.

Figure: An example IDC query ("Please describe what the difference is between the target image and the reference image") with a reference image and a target image. The ground-truth IDC caption is "the person is folding a green paper in right image". SFT can focus too heavily on dataset-specific language patterns, e.g., "the person", "right image". DPO with trivial comparisons (chosen: "the person is folding a green paper in right image"; rejected: "the blue truck is now in the picture on the right") fails to learn subtle differences. Adversarial DPO with adversarially learned negative retrieval (chosen: "the person is folding a green paper in right image"; rejected: "the person is folding a red paper in right image") helps the model learn harder IDC cases with nuanced differences.
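The contrast between trivial and hard negatives can be made concrete with a minimal sketch. It uses the standard DPO loss under a Bradley-Terry preference model, plus a toy softmax retriever over candidate negatives. The function names, the log-probability values, and the choice of the DPO loss as the retriever's reward are illustrative assumptions, not details taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) caption pair.

    Inputs are sequence log-probabilities under the trainable policy (pi_*)
    and a frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def softmax(scores):
    """Softmax over a list of retriever scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retriever_score_grads(scores, losses):
    """Gradient of the expected loss E_{j ~ softmax(scores)}[losses[j]]
    with respect to each retriever score (the expectation form of the
    policy gradient). A retriever that ascends this gradient puts more
    probability on negatives the captioner currently finds hard.
    """
    probs = softmax(scores)
    baseline = sum(p * l for p, l in zip(probs, losses))
    return [p * (l - baseline) for p, l in zip(probs, losses)]


# Made-up log-probabilities for the chosen caption and two candidate negatives.
chosen = {"pi": -12.0, "ref": -12.5}
negatives = {
    "trivial (blue truck)": {"pi": -40.0, "ref": -30.0},
    "hard (red paper)": {"pi": -13.0, "ref": -13.5},
}

losses = [
    dpo_loss(chosen["pi"], neg["pi"], chosen["ref"], neg["ref"])
    for neg in negatives.values()
]
grads = retriever_score_grads([0.0, 0.0], losses)

for name, loss, grad in zip(negatives, losses, grads):
    print(f"{name}: loss={loss:.3f}, retriever score gradient={grad:+.3f}")
```

In this toy setting the hard negative produces the larger loss, so gradient ascent on the retriever scores shifts probability toward it while the captioning policy descends on the same loss. That is one simple reading of the minimax setup described in the abstract.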