LLMs are Good Action Recognizers

Qu, Haoxuan and Cai, Yujun and Liu, Jun (2024) LLMs are Good Action Recognizers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024. IEEE. ISBN 9798350353013

Qu_LLMs_are_Good_Action_Recognizers_CVPR_2024_paper.pdf - Accepted Version
Available under License Creative Commons Attribution.


Abstract

Skeleton-based action recognition has attracted substantial research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability, while other works pre-train their recognizers on external data to enrich the knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we also incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
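The abstract's core idea — projecting a skeleton sequence into an "action sentence" of discrete tokens — can be illustrated with a minimal sketch. This is not the paper's actual method: the codebook, the `<act_k>` token names, and the nearest-prototype quantization are all hypothetical stand-ins assumed here for illustration.

```python
import numpy as np

def project_to_action_sentence(skeleton_seq, codebook):
    """Map each skeleton frame to its nearest codebook entry (a 'word'),
    turning the sequence into an 'action sentence' of discrete tokens.

    skeleton_seq: (T, D) array of flattened per-frame joint coordinates.
    codebook:     (K, D) array of prototype vectors (hypothetical; in
                  practice such codewords would be learned, not fixed).
    """
    # Squared Euclidean distance from every frame to every codeword,
    # via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D) -> (T, K).
    dists = ((skeleton_seq[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    token_ids = dists.argmin(axis=1)  # nearest codeword per frame
    return " ".join(f"<act_{i}>" for i in token_ids)

# Toy usage: 4 frames of 6-dim poses against a 3-entry codebook.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 6))
codebook = rng.normal(size=(3, 6))
print(project_to_action_sentence(frames, codebook))
```

The resulting token string has a "sentence format" that an LLM can consume alongside ordinary text, which is the premise the framework builds on.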

Item Type:
Contribution in Book/Report/Proceedings
ID Code:
227548
Deposited By:
Deposited On:
28 Nov 2025 13:55
Refereed?:
Yes
Published?:
Published
Last Modified:
28 Nov 2025 23:15