CPAL: Cross-prompting Adapter with LoRAs for RGB+X Semantic Segmentation

Liu, Ye and Wu, Pengfei and Wang, Miaohui and Liu, Jun (2025) CPAL: Cross-prompting Adapter with LoRAs for RGB+X Semantic Segmentation. IEEE Transactions on Circuits and Systems for Video Technology. ISSN 1051-8215

Text: paper_1_.pdf - Accepted Version (3MB)
Available under License Creative Commons Attribution.

Abstract

As sensor technology evolves, RGB+X systems combine traditional RGB cameras with an auxiliary sensor of another type, enhancing perception and providing richer information for important tasks such as semantic segmentation. However, acquiring massive RGB+X data is difficult because it requires specialized acquisition equipment, so traditional RGB+X segmentation methods often pretrain on relatively abundant RGB data. These methods, however, lack mechanisms to fully exploit the pretrained model, and the scope of the pretraining RGB dataset remains limited. Recent works have employed prompt learning to tap the potential of pretrained foundation models, but they adopt a unidirectional prompting approach, i.e., using the X or RGB+X modality to prompt a pretrained foundation model in the RGB modality, neglecting the potential of non-RGB modalities. In this paper, we are dedicated to developing the potential of pretrained foundation models in both RGB and non-RGB modalities simultaneously, which is non-trivial due to the semantic gap between modalities. Specifically, we present CPAL (Cross-prompting Adapter with LoRAs), a framework featuring a novel bi-directional adapter that simultaneously exploits the complementarity between modalities and bridges the semantic gap between them. Additionally, CPAL introduces low-rank adaptation (LoRA) to fine-tune the foundation model of each modality. With these components, we successfully unleash the potential of RGB foundation models in both RGB and non-RGB modalities simultaneously. Our method achieves state-of-the-art (SOTA) performance on five multi-modal benchmarks, including RGB+Depth, RGB+Thermal, and RGB+Event benchmarks and a multi-modal video object segmentation benchmark, as well as on four multi-modal salient object detection benchmarks. The code and results are available at: https://github.com/abelny56/CPAL.
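The abstract names two mechanisms: a bi-directional adapter through which the RGB and X streams prompt each other, and LoRA layers that fine-tune a frozen pretrained backbone per modality. The PyTorch sketch below is a minimal illustration of those two ideas in the abstract's terms; it is not the authors' implementation (which is in the linked repository), and all module names, shapes, and hyperparameters (LoRALinear, CrossPromptingAdapter, rank, alpha) are hypothetical.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA)."""
    def __init__(self, dim: int, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = nn.Linear(dim, dim)            # stands in for a pretrained weight
        self.base.weight.requires_grad_(False)     # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(dim, rank, bias=False)   # down-projection to rank r
        self.lora_b = nn.Linear(rank, dim, bias=False)   # up-projection back to dim
        nn.init.zeros_(self.lora_b.weight)               # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class CrossPromptingAdapter(nn.Module):
    """Bi-directional prompting: each modality's tokens prompt the other's."""
    def __init__(self, dim: int):
        super().__init__()
        self.rgb_to_x = nn.Linear(dim, dim)   # RGB features become prompts for the X stream
        self.x_to_rgb = nn.Linear(dim, dim)   # X features become prompts for the RGB stream
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_x = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, x: torch.Tensor):
        rgb_prompt = self.x_to_rgb(self.norm_x(x))     # prompt injected into RGB branch
        x_prompt = self.rgb_to_x(self.norm_rgb(rgb))   # prompt injected into X branch
        return rgb + rgb_prompt, x + x_prompt          # residual injection in both directions

if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)     # (batch, tokens, dim), e.g. ViT-style features
    rgb_block = LoRALinear(768)           # one LoRA-tuned backbone block per modality
    x_block = LoRALinear(768)
    adapter = CrossPromptingAdapter(768)
    rgb, x = adapter(rgb_block(tokens), x_block(tokens.clone()))
    print(rgb.shape, x.shape)             # torch.Size([2, 196, 768]) for both streams

Only the LoRA projections and the adapter are trainable here, which mirrors the abstract's premise of exploiting a frozen pretrained RGB foundation model in both modalities at low fine-tuning cost.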

Item Type:
Journal Article
Journal or Publication Title:
IEEE Transactions on Circuits and Systems for Video Technology
Uncontrolled Keywords:
Media Technology
Subjects:
Media Technology; Electrical and Electronic Engineering
ID Code:
227920
Deposited By:
Deposited On:
05 Mar 2025 13:15
Refereed?:
Yes
Published?:
Published
Last Modified:
05 Mar 2025 13:15