ViResGF-Net: Gated Multi-Scale Hybrid Vision Transformer for Robust Fundus Image Multi-Label Classification

Chen, Binghan and Xiang, Haolong and Wan, Jiayi and Bilal, Muhammad and Xu, Xiaolong (2025) ViResGF-Net: Gated Multi-Scale Hybrid Vision Transformer for Robust Fundus Image Multi-Label Classification. IEEE Journal of Biomedical and Health Informatics. ISSN 2168-2194

Text (ViResGF-Net: Gated Multi-Scale Hybrid Vision Transformer for Robust Fundus Image Multi-Label Classification)
ViResGF_Net_Gated_Multi_Scale_Hybrid_Vision_Transformer_for_Robust_Fundus_Image_Multi_Label_Classification.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

With the accelerating aging of the global population, fundus diseases such as cataracts and glaucoma have become major causes of visual impairment. In ophthalmic diagnosis, traditional practice relies mainly on clinicians making pathological judgments by visually inspecting fundus images. However, because evaluation standards differ among clinicians, diagnoses of the same fundus photograph can be inconsistent. In addition, most clinicians specialize in particular fundus diseases, making it difficult to accurately diagnose cases in which multiple diseases coexist. To address these issues, this paper proposes a gated multi-scale hybrid vision Transformer model, designated ViResGF-Net, for the multi-label classification of fundus diseases. The model integrates a dual-branch structure combining a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). While retaining the global modeling capability of the ViT, it extracts local features through the CNN branch and introduces a Feature Pyramid Network (FPN) to further strengthen that branch's local feature extraction. In the feature fusion stage, a Gated Fusion Unit (GFU) fuses the feature vectors of the two branches, and an MLP classifier produces the final predictions from the fused features. In extensive experiments, the model achieved an accuracy of 93.56%, a precision of 92.99%, and an F1 score of 92.36%, outperforming the compared models.
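The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of how a gated fusion of CNN-branch and ViT-branch feature vectors followed by an MLP classifier might look. The sigmoid-gate formulation, module names (GatedFusionUnit, FusionHead), embedding dimension, and label count are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class GatedFusionUnit(nn.Module):
    """Blends two branch embeddings with a learned element-wise gate (assumed formulation)."""
    def __init__(self, dim: int):
        super().__init__()
        # Gate is computed from both branches and decides how much of each to keep.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([cnn_feat, vit_feat], dim=-1))
        return g * cnn_feat + (1 - g) * vit_feat  # element-wise weighted blend

class FusionHead(nn.Module):
    """Fuses the two branch embeddings and predicts multi-label logits via an MLP."""
    def __init__(self, dim: int = 768, num_labels: int = 8):
        super().__init__()
        self.gfu = GatedFusionUnit(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_labels))

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.gfu(cnn_feat, vit_feat))  # raw logits; apply sigmoid for multi-label output

# Usage with dummy branch outputs: a batch of 4 images, 768-dim embeddings per branch.
head = FusionHead(dim=768, num_labels=8)
logits = head(torch.randn(4, 768), torch.randn(4, 768))
probs = torch.sigmoid(logits)  # independent per-label probabilities, as in multi-label classification

In a multi-label setting such as this, each disease label is predicted independently through a sigmoid rather than a softmax, which is consistent with the abstract's motivation of handling coexisting diseases.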

Item Type:
Journal Article
Journal or Publication Title:
IEEE Journal of Biomedical and Health Informatics
Uncontrolled Keywords:
Research Output Funding: yes - externally funded
Subjects:
Biotechnology; Electrical and Electronic Engineering; Computer Science Applications; Health Information Management
ID Code:
234105
Deposited By:
Deposited On:
08 Dec 2025 16:10
Refereed?:
Yes
Published?:
Published
Last Modified:
11 Dec 2025 09:18