Zou, Minghao and Liu, Shangkun and Zeng, Qingtian and Zhang, Xue and Yuan, Guiyuan and Hao, Xiaoshuai and Liu, Jun and Zhou, Wei (2026) Pose-Guided Multi-Cue Explicit Query Construction for Disambiguating Human-Object Interactions. IEEE Transactions on Circuits and Systems for Video Technology. ISSN 1051-8215
paper.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
Human-Object Interaction (HOI) detection remains challenging due to the semantic ambiguity of interaction categories and the limited discriminability of their feature representations. Existing approaches often improve recognition by employing sophisticated models or auxiliary textual annotations. While these solutions yield certain gains, they incur additional computational or annotation costs and struggle to capture intrinsic interaction regularities. To address these issues, we propose Pose-Guided Multi-Cue Explicit Query Construction (PM-EQC), a unified Transformer-based framework built upon collaborative modeling of appearance, spatial, and pose cues for discriminative interaction reasoning. At its core, the Collaborative Multi-Cue Query Constructor (CM-CQC) jointly models dependencies among visual cues to generate explicit query embeddings. CM-CQC further incorporates a hierarchical pose contextualization mechanism: global body configurations adaptively guide attention to local critical joints, yielding fine-grained pose embeddings and more precise interaction disambiguation. Owing to its modular design, PM-EQC integrates seamlessly with diverse backbones and benefits from their advances. Extensive experiments on the PhysLab, HICO-DET, and V-COCO datasets demonstrate that PM-EQC achieves state-of-the-art performance. The code is publicly available at https://github.com/ZMHSDUST/PM-EQC.
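The abstract describes two mechanisms: global body configurations attending to local critical joints to produce a fine-grained pose embedding, and the fusion of appearance, spatial, and pose cues into an explicit query embedding. The sketch below illustrates one plausible reading of that pipeline; it is not the authors' implementation, and all function names, dimensions, and the mean-pooled "global configuration" are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pose_guided_query(appearance, spatial, joints, d_out=64, seed=0):
    """Illustrative sketch of pose-guided multi-cue query construction.

    appearance : (d_a,) appearance-cue embedding
    spatial    : (d_s,) spatial-cue embedding (e.g. box-pair layout)
    joints     : (K, d_j) per-joint pose embeddings for K keypoints
    Returns a (d_out,) explicit query embedding.
    """
    rng = np.random.default_rng(seed)

    # Global body configuration: here simply mean-pooled joint
    # features (an assumption; the paper's mechanism is hierarchical).
    global_pose = joints.mean(axis=0)

    # Global configuration attends to local joints, so critical
    # joints receive higher weights in the pose embedding.
    scores = joints @ global_pose / np.sqrt(joints.shape[-1])
    weights = softmax(scores)                # (K,)
    fine_pose = weights @ joints             # (d_j,)

    # Fuse the three cues into one explicit query embedding via a
    # (randomly initialized, stand-in) linear projection.
    fused = np.concatenate([appearance, spatial, fine_pose])
    W = rng.standard_normal((fused.size, d_out)) / np.sqrt(d_out)
    return fused @ W

# Example: 17 COCO-style keypoints with 16-d joint features.
query = pose_guided_query(np.ones(32), np.ones(8), np.ones((17, 16)))
```

In an actual Transformer-based HOI detector, a query embedding like this would replace the learned, content-free queries of a DETR-style decoder, which is what makes the queries "explicit".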