AAAI 2026

QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

Yuxiao Wang, Wolin Liang, Yu Lei, Weiying Xue, Nan Zhuang, Qi Liu

Cited by 1

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Crossmodal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector and uses it to initialize the object queries. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.

Introduction

Human-Object Interaction (HOI) detection aims to identify and localize human-object pairs in images while recognizing their interactions, generating structured ⟨human, action, object⟩ triplets. Beyond traditional detection tasks, HOI detection requires understanding semantic relationships between entities, demanding robust scene modeling capabilities. 
With applications spanning behavior recognition, image captioning, video analysis, and robotic perception, this task has attracted significant research attention (Liao et al. 2022; Wang et al. 2024c). Early HOI detection methods follow a two-stage pipeline: detecting humans and objects via detectors like Faster R-