ICLR 2026

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow-Map Models

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon

Cited by 16

Abstract

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs of 1.97 (CIFAR-10), 1.32 (ImageNet 64×64), and 1.84 (ImageNet 512×512), using up to 98% less training data and GPU time than CMs. On ImageNet 256×256, it attains 1-step FID 3.34 with ~50% less training than MF from scratch (FID 3.43). On MSCOCO T2I, CMT reaches the best FID with ~47% less training. This establishes CMT as a principled, efficient, and general framework for training flow map models. Code and models are available at https://github.com/sony/cmt.
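The mid-training objective described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: `teacher_velocity` is a hypothetical stand-in for the pre-trained model's ODE drift, the Euler solver, the time grid, and the squared-error loss are all assumptions chosen for brevity, and every function name is invented for illustration. It only shows the data flow: integrate the teacher from a prior sample, then pair every point on that trajectory with the trajectory's final (clean) sample as the regression target.

```python
import numpy as np

def teacher_velocity(x, t):
    """Hypothetical stand-in for the pre-trained diffusion model's ODE drift.
    Toy linear field dx/dt = x (an assumption, for illustration only)."""
    return x

def solver_trajectory(x_T, times):
    """Euler-integrate the teacher ODE from times[0] (prior) to times[-1] (clean)."""
    xs = [x_T]
    for t_cur, t_next in zip(times[:-1], times[1:]):
        xs.append(xs[-1] + (t_next - t_cur) * teacher_velocity(xs[-1], t_cur))
    return np.stack(xs)  # shape: (len(times), *x_T.shape)

def cmt_pairs(x_T, times):
    """Pair each trajectory point with the solver-generated clean sample."""
    traj = solver_trajectory(x_T, times)
    targets = np.broadcast_to(traj[-1], traj.shape)  # same endpoint for every point
    return traj, times, targets

def regression_loss(model, inputs, input_times, targets):
    """Mean squared error between model(x_t, t) and the clean endpoint.
    The squared-error metric is an assumption, not taken from the paper."""
    preds = np.stack([model(x, t) for x, t in zip(inputs, input_times)])
    return float(np.mean((preds - targets) ** 2))

# Usage: one prior sample, a 5-point time grid from t=1 (noise) to t=0 (data).
rng = np.random.default_rng(0)
x_T = rng.standard_normal(4)
times = np.linspace(1.0, 0.0, 5)
inputs, input_times, targets = cmt_pairs(x_T, times)
loss_identity = regression_loss(lambda x, t: x, inputs, input_times, targets)
```

In this sketch a model that minimizes the loss maps any point on the teacher's trajectory to the same endpoint, which is one way to read the "trajectory-consistent" initialization the abstract refers to. A real setup would replace the toy drift with the pre-trained network and the lambda with a trainable model.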