WWW2026

BHGap: A Deep Iterative Prompting and Multi-stage Alignment Framework for Dynamic Facial Expression Recognition

Yichi Zhang, Yunqi Han, Jiayue Ding, Liangyu Chen

Abstract

Dynamic Facial Expression Recognition (DFER), a crucial part of affective computing, has broad applications in areas such as human-computer interaction and social media content analysis. Effectively integrating multimodal information, particularly audio-visual signals, remains its core challenge. Existing approaches are generally constrained by two limitations: (1) shallow and static fusion mechanisms, which fail to capture the dynamic co-evolution of audio-visual features during deep interaction; and (2) implicit and coarse alignment strategies, which are insufficient to bridge the modality gap caused by heterogeneous feature distributions. To address these issues, we propose BHGap, a novel framework that integrates deep iterative prompt generation with multi-stage feature alignment and fusion. The key idea is to reformulate audio-visual collaboration from a one-shot fusion event into a continuous, reciprocal generation process that spans every layer of the frozen backbone encoders. Specifically, we design a State Space Model (SSM)-based cross-modal prompt generator that dynamically produces "guidance prompts" for the counterpart modality at each encoding layer, thereby enabling deep and fine-grained feature co-evolution. Beyond encoding, we further introduce a coarse-to-fine multi-stage alignment module: at the macro level, low-rank adversarial alignment establishes spatio-temporal congruity between audio and video while reducing global distributional discrepancies; at the micro level, Maximum Mean Discrepancy (MMD) constraints combined with implicit differentiation optimization ensure fine-grained statistical consistency and semantic alignment. Extensive experiments on the public DFEW and MAFW datasets demonstrate that our method achieves state-of-the-art performance, offering a new paradigm of deep iterative fusion and explicit alignment for multimodal emotion recognition.
Code is available at https://github.com/NDYZD666/-public-BHGap.
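
For readers unfamiliar with the MMD constraint named in the abstract, the following is a minimal sketch of a squared-MMD estimate between two sets of feature vectors. It is an illustration only, not the BHGap implementation: the function names (rbf_kernel, mean_kernel, mmd_squared), the Gaussian kernel, the fixed bandwidth, and the toy 2-D features are all assumptions made here, and the implicit differentiation step is not covered.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(audio: Sequence[Vector], video: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples.

    MMD^2 = E[k(a, a')] + E[k(v, v')] - 2 * E[k(a, v)]
    The value is near zero when the two samples share a distribution
    and grows as the distributions move apart.
    """
    return (mean_kernel(audio, audio, bandwidth)
            + mean_kernel(video, video, bandwidth)
            - 2.0 * mean_kernel(audio, video, bandwidth))


if __name__ == "__main__":
    # Hypothetical toy features standing in for audio and video embeddings.
    audio_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    video_feats = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
    print("MMD^2 (audio vs. video):", mmd_squared(audio_feats, video_feats))
    print("MMD^2 (audio vs. audio):", mmd_squared(audio_feats, audio_feats))
```

Used as a training penalty, a term of this kind would be minimized so that audio and video feature distributions become statistically closer. How BHGap weights or schedules that term is not stated in the abstract.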