ICLR 2026

Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

Ruofeng Yang, Yongcan Li, Bo Jiang, Cheng Chen, Shuai Li

Cited by 3

Abstract

Recent diffusion models demonstrate remarkable sample efficiency and fast optimization, which appears to contradict standard estimation bounds that suffer from a curse-of-dimensionality rate of $n^{-1/D}$ in the data dimension $D$. Since images usually lie on a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with a Gaussian latent and achieve a $1/\sqrt{n}$ bound. Although this modeling reflects the multi-manifold property, a Gaussian latent cannot capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, where the $k$-th subspace admits a mixture-of-Gaussian latent ($n_k$ modes of dimension $d_k$). Under this modeling, the corresponding score function naturally has a mixture-of-experts (MoE) structure, captures the multi-modal information, and is nonlinear. Empirically, our MoE-latent MoG network significantly outperforms MoLRG Gaussian baselines and matches MoE-latent U-Net performance with $10\times$ fewer parameters, validating its practical suitability. Theoretically, we provide provable convergence guarantees for the optimization process and establish an estimation error bound of $R^4\sqrt{\sum_{k=1}^K n_k}\sqrt{\sum_{k=1}^K n_k d_k}/\sqrt{n}$, which escapes the curse of dimensionality because it depends on the latent quantities $n_k$ and $d_k$ rather than on the ambient dimension $D$. Collectively, with MoLR-MoG modeling, this work explains why diffusion models require only a small training sample and enjoy a fast optimization process. Furthermore, we show the potential of the MoE structure for diffusion models from the manifold perspective.
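To illustrate why a mixture-of-Gaussian latent leads to a nonlinear, expert-like score, the following is a minimal sketch rather than the authors' network. It assumes a single latent subspace, no diffusion noise, and made-up mixture parameters. The function names (`mog_score`, `mog_log_density`, `numerical_grad`) are hypothetical. For $p(x)=\sum_i w_i\,\mathcal{N}(x;\mu_i,\Sigma_i)$, the score $\nabla_x \log p(x)$ equals a softmax-gated sum of linear "experts" $-\Sigma_i^{-1}(x-\mu_i)$, which the sketch checks against a finite-difference gradient.

```python
import numpy as np

def mog_log_density(x, weights, means, covs):
    """Log density of a mixture of Gaussians at a single point x."""
    d = x.shape[0]
    terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        sol = np.linalg.solve(cov, diff)          # Sigma^{-1} (x - mu)
        _, logdet = np.linalg.slogdet(cov)
        terms.append(np.log(w) - 0.5 * diff @ sol
                     - 0.5 * logdet - 0.5 * d * np.log(2.0 * np.pi))
    terms = np.array(terms)
    m = terms.max()                               # stable log-sum-exp
    return m + np.log(np.exp(terms - m).sum())

def mog_score(x, weights, means, covs):
    """Score of the mixture, written as gates times linear experts."""
    logits, experts = [], []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        sol = np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        # Gate logit: log w_i + log N(x; mu_i, Sigma_i), up to a shared constant.
        logits.append(np.log(w) - 0.5 * diff @ sol - 0.5 * logdet)
        # Expert i is linear in x: the score of a single Gaussian component.
        experts.append(-sol)
    logits = np.array(logits)
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                          # softmax gating (posterior over modes)
    return gates @ np.array(experts), gates

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient, used only as a sanity check."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

# Illustrative (assumed) parameters: two modes in a 2-dimensional latent space.
weights = np.array([0.3, 0.7])
means = [np.array([-2.0, 0.0]), np.array([1.5, 1.0])]
covs = [0.5 * np.eye(2), np.array([[1.0, 0.3], [0.3, 0.8]])]
x = np.array([0.2, -0.4])

score, gates = mog_score(x, weights, means, covs)
fd_score = numerical_grad(lambda z: mog_log_density(z, weights, means, covs), x)
print("gates:", gates)
print("closed-form score:", score)
print("finite-difference score:", fd_score)
```

The gates depend on $x$ through a softmax, so the overall score is nonlinear even though each expert is linear; with a single Gaussian latent the gate is constant and the score reduces to one linear map.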