ICLR 2026

Seeing What’s Wrong: A Trajectory-Guided Approach to Caption Error Detection

Gabriel Afriat, Ryan Lucas, Xiang Meng, Yufang Hou, Yada Zhu, Rahul Mazumder

Abstract

Error detection is critical for enhancing multimodal dataset reliability and downstream model performance. Existing error filters, while increasingly powerful, typically rely on a single similarity score per image-caption pair. This is limiting: captions with subtle errors (e.g., mislabeled objects, incorrect colors, or negations) can still score highly, while correct but imprecisely worded captions may score poorly. To address this, we introduce the notion of a caption trajectory: an ordered sequence of captions produced by iteratively editing a caption to maximize an image-text relevance score. This trajectory carries rich signals for error detection. Correct captions typically stabilize after minor edits, while erroneous captions undergo substantial improvements. Building on these insights, we introduce TRACED, a cost-efficient and model-agnostic framework that leverages trajectory statistics for more accurate caption error detection. Beyond detection, TRACED also serves as an interpretable tool for identifying the origins of errors. We further demonstrate that, in the case of error correction, this interpretable token-level error information can be provided to VLMs to enhance the alignment score of the generated captions. On MS COCO and Flickr30k, TRACED achieves up to 2.8% improvement in accuracy for error detection across three noise types. Our code is available at https://github.com/mazumder-lab/TRACED.

Published as a conference paper at ICLR 2026

[...] of this interpretable token-level error information on caption correction. We show that this information can be used to improve the alignment of the generated captions, and observe an improvement of up to 14.5% in the BLIP-alignment score for the corrected captions using TRACED compared to unguided caption correction.

RELATED WORK