Roy, S.K., Jamali, A., Biswas, K., Hong, D. and Ghamisi, P. (2025) ViCxLSTM: An extended Long Short-term Memory vision transformer for complex remote sensing scene classification. International Journal of Applied Earth Observation and Geoinformation, 143: 104801. ISSN 0303-2434
Abstract
Scene classification plays a critical role in remote sensing image analysis, and numerous methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have been developed to improve performance on high-resolution remote sensing (HRRS) imagery. However, existing models struggle with several key challenges, including effectively capturing fine-grained local features and modeling long-range spatial dependencies in complex scenes. These limitations reduce the discriminative power of extracted features, which is critical for HRRS image classification. To overcome these issues, our study aims to design a unified model that jointly leverages local information extraction, global context modeling, and long-range dependency learning. We propose a novel architecture, ViCxLSTM, designed to enhance feature discriminability for HRRS scene classification. ViCxLSTM is a hybrid model that integrates a Local Pattern Unit (comprising convolutional layers and Fourier Transforms), an extended Long Short-Term Memory module (xLSTM), and a Vision Transformer. This integrated architecture enables the model to capture a wide range of spatial patterns, from local textures to long-range dependencies and global contextual relationships. Experimental evaluations show that ViCxLSTM achieves superior classification performance across diverse land use datasets, outperforming several state-of-the-art models, including ResNet-50, ResNet-101, ResNet-152, ViT, LeViT, CrossViT, DeepViT, and CaiT. The code will be made freely available at https://github.com/aj1365/ViCxLSTM.
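To make the described hybrid pipeline concrete, the following is a minimal PyTorch sketch of an architecture of this kind, not the authors' implementation (the linked GitHub repository is authoritative). The class names, layer sizes, the GFNet-style learnable frequency filter inside the Local Pattern Unit, and the plain `nn.LSTM` standing in for a true xLSTM block (which adds exponential gating and matrix memory) are all illustrative assumptions.

```python
# Minimal, hypothetical sketch of the ViCxLSTM-style hybrid described in
# the abstract; names, sizes, and the nn.LSTM stand-in for xLSTM are
# assumptions for illustration only.
import torch
import torch.nn as nn


class LocalPatternUnit(nn.Module):
    """Convolutional branch plus a learnable Fourier-domain filter."""

    def __init__(self, channels: int, hw: tuple):
        super().__init__()
        h, w = hw
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Complex-valued filter applied to the half-spectrum from rfft2.
        self.freq_filter = nn.Parameter(
            torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.conv(x)                      # fine-grained textures
        spec = torch.fft.rfft2(x, norm="ortho")   # to frequency domain
        spec = spec * torch.view_as_complex(self.freq_filter)
        glob = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return local + glob                       # fuse both branches


class ViCxLSTMSketch(nn.Module):
    def __init__(self, in_ch=3, dim=64, img=64, patch=8, num_classes=10):
        super().__init__()
        side = img // patch                       # patch-grid side length
        self.embed = nn.Conv2d(in_ch, dim, patch, stride=patch)
        self.lpu = LocalPatternUnit(dim, (side, side))
        # Stand-in for the xLSTM module: a plain LSTM over patch tokens.
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.lpu(self.embed(x))               # (B, dim, side, side)
        seq = x.flatten(2).transpose(1, 2)        # (B, tokens, dim)
        seq, _ = self.lstm(seq)                   # long-range dependencies
        seq = self.vit(seq)                       # global self-attention
        return self.head(seq.mean(dim=1))         # pooled class logits


if __name__ == "__main__":
    model = ViCxLSTMSketch()
    print(model(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 10])
```

The staging in this sketch mirrors the progression the abstract describes: convolutional and spectral filtering for local texture, a recurrent pass over the patch sequence for long-range dependencies, and transformer self-attention for global context, before pooled classification.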