ICLR 2025

Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov

Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as the ability to generate multimodal and discontinuous behavior. As models grow larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Continuing with current architectures will therefore present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks from CALVIN and LIBERO. Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.

Figure 1: The proposed MoDE architecture (left) uses a transformer with causal masking, where each block includes noise-conditioned self-attention and a noise-conditioned router that assigns tokens to expert models based on the noise level. This design enables efficient, scalable action generation.
On the right, the figure illustrates how the router activates subsets of simple MLP experts with Swish-GLU activation during denoising.

multiple expert subnetworks and a routing model that sparsely activates experts and interpolates their outputs based on the input. We introduce the Mixture-of-Denoising Experts Policy (MoDE), a scalable and efficient MoE Diffusion Policy. Our work is inspired by prior results showcasing the multitask nature of the denoising process (Hang et al., 2024), where there is little transfer between the different phases of denoising. We present a novel noise-conditioned routing mechanism that distributes tokens to our experts based on the current noise level. MoDE leverages noise-conditioned self-attention combined with a noise input token for enhanced noise injection. Our proposed policy surpasses previous Diffusion Policies with higher efficiency and demonstrates state-of-the-art performance across 134 diverse tasks in challenging goal-conditioned imitation learning benchmarks: CALVIN (Mees et al., 2022b) and LIBERO (Liu et al., 2023). Through comprehensive ablation studies, we investigate the impact of various design decisions, including token routing strategies, noise-injection techniques, expert distribution, and diverse pretraining on a large-scale robot dataset (Collaboration et al., 2023). We summarize our contributions below:

• We introduce MoDE, a novel Mixture-of-Experts Diffusion Policy that achieves state-of-the-art performance while using 90% fewer FLOPs and fewer active parameters than dense transformer baselines, thanks to our noise-based expert caching and sparse MoE design.
• We demonstrate MoDE's effectiveness across 134 tasks in 4 benchmarks, showing an average 57% performance increase over prior Diffusion Policies while maintaining improved computational efficiency.
• We present detailed ablation studies that investigate the importance of routing strategies and noise injection, visualizing expert utilization across denoising steps to identify key components of MoDE.

RELATED WORK