M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

Tang, Hao and Liu, Jun and Yan, Shuanglin and Yan, Rui and Li, Zechao and Tang, Jinhui (2023) M3Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition. In: MM '23: Proceedings of the 31st ACM International Conference on Multimedia. ACM, New York, pp. 1719-1728. ISBN 9798400701085

Full text not available from this repository.

Abstract

Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M3Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M3Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

Item Type:
Contribution in Book/Report/Proceedings
ID Code:
223042
Deposited On:
02 Dec 2024 15:45
Refereed?:
Yes
Published?:
Published
Last Modified:
02 Dec 2024 15:45