ICLR2026
MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang, Guo Zhi Zhi, Weizhan Zhang
Abstract
Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.

[...] supervision (H4S): couples occlusion supervision at high-noise steps with motion-level supervision at low-noise steps, providing 4D motion supervision and mitigating overfitting to human appearance. Overall, the contribution of MotionWeaver lies in introducing novel unified motion representations and formulating a holistic 4D-anchored paradigm where motion extraction, motion-latent fusion, and training supervision are consistently grounded in 4D space.
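The coupling of occlusion supervision at high-noise steps with motion-level supervision at low-noise steps can be illustrated with a minimal sketch. This is not the authors' implementation: the text above does not say how the two terms are scheduled, so the hard threshold on a normalized noise level, the function names `gated_supervision_weights` and `combined_loss`, and the parameter `t_split` are all assumptions made for illustration.

```python
def gated_supervision_weights(t: float, t_split: float = 0.5) -> tuple[float, float]:
    """Return (occlusion_weight, motion_weight) for a normalized noise level t.

    Assumption: t lies in [0, 1], with 1 being the noisiest step, and the two
    auxiliary terms are switched by a hard threshold t_split (hypothetical).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if t >= t_split:
        # High-noise step: only the occlusion term is active.
        return 1.0, 0.0
    # Low-noise step: only the motion-level term is active.
    return 0.0, 1.0


def combined_loss(denoise_loss: float, occlusion_loss: float,
                  motion_loss: float, t: float, t_split: float = 0.5) -> float:
    """Add the timestep-gated auxiliary terms to the base denoising loss."""
    w_occ, w_mot = gated_supervision_weights(t, t_split)
    return denoise_loss + w_occ * occlusion_loss + w_mot * motion_loss
```

With these assumptions, a sample drawn at `t = 0.9` is penalized by the occlusion term only, while a sample at `t = 0.1` is penalized by the motion term only. A smooth schedule could replace the hard switch without changing the overall idea.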
Building on these designs, MotionWeaver constitutes an end-to-end framework that supports multi-humanoid animation and robustly addresses interactions and occlusions. To support training, we further curate MultiHuman46, a dataset containing 46 hours of multi-human videos with diverse interaction patterns and scenes. Additionally, we introduce DualDynamics, a benchmark of 300 videos, each showcasing two humanoid characters engaged in interaction-rich scenarios. These videos have undergone a rigorous filtering process to ensure quality. Quantitative and qualitative experiments demonstrate that MotionWeaver surpasses state-of-the-art methods, showcasing its generalization ability, identity preservation, and motion consistency in multi-humanoid scenarios.

Our main contributions are summarized as follows:

• We propose MotionWeaver, a novel framework built upon the unified motion representations and the 4D-anchored paradigm, designed for multi-humanoid image animation involving diverse humanoid forms, rich interactions, and frequent occlusions.
• We introduce UCC to obtain unified motion representations, and HSI and H4S to construct a shared 4D space for fusing motion representations with video latents.
• We curate the MultiHuman46 dataset, which encompasses 46 hours of multi-human videos, and create DualDynamics, a benchmark comprising 300 videos of multiple humanoid characters in interaction-rich scenarios. Extensive experiments demonstrate that MotionWeaver surpasses state-of-the-art methods in multi-humanoid scenarios.

2 RELATED WORK

2.1 DIFFUSION TRANSFORMERS

Diffusion Transformers (DiTs) replace the traditional U-Net architecture with a Transformer model to denoise latent representations (