Fine-Grained Image Captioning by Ranking Diffusion Transformer

Wan, Jun and Gan, Min and Zhang, Lefei and Zhou, Jie and Liu, Jun and Du, Bo and Chen, C. L. Philip (2025) Fine-Grained Image Captioning by Ranking Diffusion Transformer. IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society, 34. pp. 8332-8344. ISSN 1057-7149

Text (TIP-34289-2025.R1_Proof_hi)
TIP-34289-2025.R1_Proof_hi.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

Image captioning models built on CLIP visual features have developed rapidly and achieved remarkable results. However, existing models still struggle to produce descriptive and discriminative captions because they insufficiently exploit fine-grained visual cues and fail to model complex vision–language alignment. To address these limitations, we propose a Ranking Diffusion Transformer (RDT), which integrates a Ranking Visual Encoder (RVE) and a Ranking Loss (RL) for fine-grained image captioning. The RVE introduces a novel ranking attention mechanism that mines diverse and discriminative visual information from CLIP features. Meanwhile, the RL uses the ranking of generated caption quality as a global semantic supervisory signal, thereby enhancing the diffusion process and strengthening vision–language semantic alignment. We show that by combining RVE and RL within the RDT, and by gradually adding and removing noise in the diffusion process, more discriminative visual features are learned and precisely aligned with the language features. Experimental results on popular benchmark datasets demonstrate that the proposed RDT surpasses existing state-of-the-art image captioning models. The code is publicly available at: https://github.com/junwan2014/RDT
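
The abstract does not give the form of the Ranking Loss, so the following is only a generic illustration of how a ranking over caption quality can act as a training signal. It is a minimal sketch of a pairwise margin (hinge) ranking loss, not the authors' formulation. The function name `pairwise_ranking_loss`, the `margin` parameter, and the idea of scoring each candidate caption with a single number are all assumptions made for the example.

```python
def pairwise_ranking_loss(scores, qualities, margin=0.1):
    """Mean hinge loss over caption pairs with different quality.

    scores    -- model scores for candidate captions (hypothetical)
    qualities -- reference quality values for the same captions,
                 e.g. from an automatic metric (hypothetical)
    margin    -- minimum gap wanted between a better and a worse caption
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            # Only consider pairs where caption i is rated better than j.
            if qualities[i] > qualities[j]:
                # Penalise the pair unless score i exceeds score j by the margin.
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    qualities = [1.2, 0.8, 0.3]
    print(pairwise_ranking_loss([0.9, 0.5, 0.1], qualities))  # scores agree with the ranking
    print(pairwise_ranking_loss([0.1, 0.5, 0.9], qualities))  # scores contradict the ranking
```

In this sketch the loss is zero when the model's scores order the captions the same way as the quality values with at least the margin between them, and grows as the ordering is violated. The actual loss used by RDT is defined in the paper and the linked repository.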

Item Type:

Journal Article

Journal or Publication Title:

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society

Uncontrolled Keywords:

/dk/atira/pure/subjectarea/asjc/1700/1704

Subjects:

computer graphics and computer-aided design; software

Departments:

Faculty of Science and Technology > School of Computing & Communications

ID Code:

234818

Deposited By:

ep_importer_pure

Deposited On:

15 Jan 2026 11:55

Refereed?:

Yes

Published?:

Published

Last Modified:

26 Jan 2026 00:37

URI:

https://eprints.lancs.ac.uk/id/eprint/234818