Wan, Jun and Gan, Min and Zhang, Lefei and Zhou, Jie and Liu, Jun and Du, Bo and Chen, C. L. Philip (2025) Fine-Grained Image Captioning by Ranking Diffusion Transformer. IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society, 34. pp. 8332-8344. ISSN 1057-7149
TIP-34289-2025.R1_Proof_hi.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
Image captioning models built on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision–language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that effectively mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL uses the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision–language semantic alignment. We show that by combining RVE and RL within the RDT, and by gradually adding and removing noise in the diffusion process, more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that the proposed RDT surpasses existing state-of-the-art image captioning models. The code is publicly available at: https://github.com/junwan2014/RDT