Song, X., Jin, X., Qi, J. and Liu, J. (2026) Dual alignment: Partial negative and soft-label alignment for text-to-image person retrieval. Information Fusion, 127: 103644. ISSN 1566-2535
TIReID_InformationFusion_1_.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
Text-to-image person retrieval is the task of retrieving the correctly matched images given a textual description of the person of interest. The main challenge lies in the inherent modal difference between texts and images. Most existing works narrow the modality gap by aligning the feature representations of text and image in a latent embedding space. However, these methods usually rely on hard labels and mine insufficient or incorrect hard negatives to achieve cross-modal alignment, producing incorrect hard negative pairs and thus suboptimal performance. To tackle these problems, we propose a dual alignment framework, Partial negative and Soft-label Alignment (PASA), which includes a partial negative alignment (PA) strategy and a soft-label alignment (SA) strategy. Specifically, PA pushes hard negatives farther away in the triplet loss by treating a certain number of negatives within each mini-batch as hard negatives, preventing them from distracting the positive text–image pairs. Building on PA, SA further aligns the similarity distributions over these hard negatives via soft labels, as well as aligning inter-modal and intra-modal distributions. Extensive experiments on three public datasets, CUHK-PEDES, ICFG-PEDES and RSTPReid, demonstrate that our proposed PASA method consistently improves the performance of text-to-image person retrieval and achieves new state-of-the-art results on all three datasets.
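To make the two strategies concrete, the following is a minimal, hypothetical numpy sketch of the idea described in the abstract: a triplet-style loss computed only over the k hardest in-batch negatives (PA), and a symmetric soft-label alignment between the text-to-image and image-to-text similarity distributions (SA). The function name, the hyper-parameters `k`, `margin`, and `tau`, and the use of symmetric KL divergence are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pasa_losses(sim, k=4, margin=0.2, tau=0.1):
    """Illustrative sketch (not the authors' implementation).
    sim: (B, B) text-to-image similarity matrix for a mini-batch,
    where sim[i, i] is the matched (positive) pair."""
    pos = np.diag(sim)                       # positive-pair similarities
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)           # mask out the positives
    # PA: keep only the k hardest (highest-similarity) negatives per query
    hard = np.sort(neg, axis=1)[:, -k:]      # (B, k)
    # Triplet-style hinge pushing hard negatives below the positive by a margin
    pa_loss = np.maximum(0.0, margin + hard - pos[:, None]).mean()

    # SA: soft-label alignment between the two directional distributions
    def softmax(x):
        e = np.exp((x - x.max(axis=1, keepdims=True)) / tau)
        return e / e.sum(axis=1, keepdims=True)

    p_t2i = softmax(np.where(np.isfinite(neg), neg, -1e9))
    p_i2t = softmax(np.where(np.isfinite(neg.T), neg.T, -1e9))
    # Symmetric KL divergence between the two soft similarity distributions
    kl = lambda p, q: (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1)
    sa_loss = 0.5 * (kl(p_t2i, p_i2t) + kl(p_i2t, p_t2i)).mean()
    return pa_loss, sa_loss
```

Restricting the hinge to the top-k negatives is one plausible reading of "considering a certain amount of negatives within each mini-batch as hard negatives"; the paper's actual selection rule and alignment objective may differ.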