WWW 2026

Pedestrian-Centric Discriminative and Fine-grained Semantic Mining for Text-based Person Retrieval

Yuheng Liang, Haipeng Chen, Yu Liu, Yingda Lyu, Xue Wang

Abstract

Text-based Person Retrieval (TPR) aims to retrieve images of a specific pedestrian from a gallery given a textual description, serving as a fine-grained instance of cross-modal retrieval on the Web. Current mainstream approaches primarily leverage pre-trained models and attention mechanisms to enhance multi-modal representations. Despite notable progress, they still struggle with two major challenges: 1) intra-instance semantic asymmetry, which arises because the image and the text of a pair are only partially semantically relevant to each other; and 2) inter-instance semantic ambiguity, which arises from the high similarity between image-text pairs of different identities. These issues result in suboptimal semantic alignment and degraded retrieval accuracy. To address them, we propose a novel Pedestrian-Centric Discriminative and Fine-grained Semantic Mining (DFSM) framework for TPR. DFSM comprises two essential components: 1) Text-aware Visual Refinement (TVR), which mitigates visual redundancy by selecting semantically relevant patches under textual guidance and refining them via adaptive clustering and merging; and 2) Token-level Semantic Alignment (TSA), which formulates the matching between image regions and text words as a conditional transport (CT) problem, effectively mining fine-grained semantic differences and enhancing instance discrimination. Extensive experiments on four benchmarks validate the advantages of DFSM in terms of retrieval accuracy and visual interpretability.
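To make the two components concrete, the following NumPy sketch illustrates the general ideas rather than the DFSM implementation itself. It rests on several assumptions not stated in the abstract: text-guided selection is approximated by keeping the top-k patches by cosine similarity to a global text embedding, the adaptive clustering and merging step is omitted, and the CT cost uses cosine distance with softmax-based conditional plans, a temperature of 0.1, and equal weighting of the two directions. The function names `select_patches` and `ct_cost` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_patches(patches, text_global, k):
    """Keep the k patches most similar to a global text embedding.

    Assumption: top-k cosine similarity stands in for text-guided
    selection; clustering and merging of the kept patches is omitted.
    """
    scores = l2_normalize(patches) @ l2_normalize(text_global)
    keep = np.sort(np.argsort(-scores)[:k])  # preserve original patch order
    return patches[keep], keep


def ct_cost(patches, tokens, temperature=0.1):
    """Bidirectional conditional transport cost between two token sets.

    Assumptions: cost is cosine distance, conditional plans are softmaxes
    of scaled similarity, and both directions are weighted equally.
    """
    sim = l2_normalize(patches) @ l2_normalize(tokens).T  # (n_patches, n_tokens)
    cost = 1.0 - sim
    fwd_plan = softmax(sim / temperature, axis=1)  # each patch spreads mass over tokens
    bwd_plan = softmax(sim / temperature, axis=0)  # each token spreads mass over patches
    forward = (fwd_plan * cost).sum(axis=1).mean()
    backward = (bwd_plan * cost).sum(axis=0).mean()
    return 0.5 * (forward + backward)
```

In this sketch, `select_patches` would be applied to the patch embeddings first, and `ct_cost` would then score the retained patches against the word embeddings; a lower cost indicates a closer token-level match. In a training setting the cost would typically enter a loss that separates matched from mismatched pairs, but that design is not specified in the abstract.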