Lu, Mingqi and Yang, Siyuan and Lu, Xiaobo and Liu, Jun (2024) Cross-Modal Contrastive Pre-training for Few-Shot Skeleton Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology. ISSN 1051-8215
Cross-Modal_Contrastive.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. This pre-train-then-fine-tune paradigm has been shown to handle few-shot tasks more effectively than more intricate meta-learning methods. However, its success relies on a large-scale training dataset, which is difficult to obtain for skeleton data. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which treats skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. Using a simple regression loss, the framework transfers robust, high-quality vision-language representations to the skeleton encoder, allowing it to gain a comprehensive understanding of action sequences and to benefit from the prior knowledge of a vision-language pre-trained model. This representation transfer strengthens the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our approach achieves state-of-the-art performance for few-shot skeleton action recognition.
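The two-stage recipe described above lends itself to a compact illustration. The PyTorch sketch below shows (i) a BYOL-style regression loss that pulls an online skeleton branch toward features from a frozen vision-language video encoder, and (ii) the cosine classifier fine-tuned on the support set in stage two. All module names, dimensions, stand-in encoders, and the temperature value are assumptions for illustration only; they are not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Two-layer projector/predictor head, as commonly used in BYOL."""
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.BatchNorm1d(dim_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.net(x)

def byol_regression_loss(p, z):
    """Negative cosine similarity between the online prediction p and the
    target feature z; equivalent (up to a constant) to BYOL's normalized MSE."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def cosine_logits(features, weights, scale=10.0):
    """Stage two: a cosine classifier compares L2-normalized features with
    learnable class weights; `scale` is an assumed temperature."""
    return scale * F.normalize(features, dim=-1) @ F.normalize(weights, dim=-1).t()

# Stand-in skeleton backbone; assumed input layout (batch, frames, joints, xyz).
skeleton_encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 25 * 3, 512))
projector = MLP(512, 1024, 256)
predictor = MLP(256, 1024, 256)

@torch.no_grad()
def video_features(video_batch):
    # Stand-in for the frozen vision-language pre-trained video encoder;
    # here just a fixed random projection to 256-d for demonstration.
    return video_batch.flatten(1) @ torch.randn(video_batch[0].numel(), 256)

skeletons = torch.randn(8, 64, 25, 3)   # skeleton sequences
videos = torch.randn(8, 16, 128)        # placeholder video input

# Stage one: cross-modal pre-training with the regression loss.
p = predictor(projector(skeleton_encoder(skeletons)))  # online branch (skeleton)
z = video_features(videos)                             # target branch (frozen)
loss = byol_regression_loss(p, z)
loss.backward()  # gradients flow only into the skeleton branch
```

In this sketch, gradients never reach the video branch, so the frozen vision-language representation acts purely as a target that the skeleton encoder regresses onto; stage two would then freeze the skeleton encoder and fit only the cosine-classifier weights on the few labeled support samples.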