EMNLP 2024

RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets

Bhathiya Hemanthage, Hakan Bilen, Phil Bartie, Christian Dondrup, Oliver Lemon

Cited by 2

Abstract

The Generalized Referring Expression Comprehension (GREC) task extends classic REC by generating image bounding boxes for objects referred to in natural language expressions, which may indicate zero, one, or multiple targets. This generalization makes REC models more practical for diverse real-world applications. However, the varying number of targets per sample makes GREC a more complex task, both in terms of training supervision and in the strategy for selecting final predictions. To address these challenges, we introduce RECANTFormer, a one-stage method for GREC that combines a decoder-free (encoder-only) transformer architecture with DETR-like Hungarian matching. Our approach consistently outperforms baselines by significant margins on three GREC datasets.
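To illustrate the matching step mentioned in the abstract, here is a minimal sketch of how a fixed set of predicted boxes might be paired with a varying number of ground-truth boxes. It is not the authors' implementation. The function names (`l1_cost`, `match_predictions`), the box format, and the L1-only cost are assumptions made for illustration, and exhaustive search over assignments stands in for the Hungarian algorithm (it finds the same minimum-cost matching, but is only practical for very small inputs).

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """L1 distance between two boxes given as equal-length tuples (assumed cost)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Find the minimum-cost one-to-one matching of targets to predictions.

    Returns (pairs, unmatched): pairs is a sorted list of
    (prediction_index, target_index); unmatched lists the prediction
    indices left without a target, which a DETR-style loss would
    typically supervise as "no object".
    Brute-force search is used here in place of the Hungarian algorithm.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as targets")

    best_cost, best_assign = float("inf"), ()
    # assign[j] is the prediction index chosen for target j.
    # With zero targets this loop runs once with an empty assignment.
    for assign in permutations(range(n_pred), n_gt):
        cost = sum(
            l1_cost(pred_boxes[i], gt_boxes[j]) for j, i in enumerate(assign)
        )
        if cost < best_cost:
            best_cost, best_assign = cost, assign

    pairs = sorted((i, j) for j, i in enumerate(best_assign))
    matched = {i for i, _ in pairs}
    unmatched = [i for i in range(n_pred) if i not in matched]
    return pairs, unmatched


if __name__ == "__main__":
    # Hypothetical boxes in (cx, cy, w, h) format, normalized to [0, 1].
    preds = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.8, 0.8, 0.1, 0.1)]
    two_targets = [(0.52, 0.5, 0.3, 0.3), (0.1, 0.12, 0.2, 0.2)]
    print(match_predictions(preds, two_targets))
    print(match_predictions(preds, []))  # zero-target expression
```

The zero-target case needs no special handling in this sketch: the matching is empty and every prediction is returned as unmatched. A practical implementation would more likely use an off-the-shelf assignment solver and a richer cost (for example, one that also includes a classification term), but those details are not given in the abstract.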