[2509.24526] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
Summary
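The paper introduces Consistency Mid-Training (CMT), a lightweight stage inserted between diffusion pre-training and the final flow map training (post-training). CMT gives the flow map model a stable, trajectory-consistent initialization and reaches state-of-the-art results with substantially lower resource requirements.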
Why It Matters
CMT targets the main obstacles in training flow map models: instability, hyperparameter sensitivity, and high cost. A cheaper and more stable training recipe makes few-step generative models practical for a wider range of computer vision work.
Key Takeaways
- CMT inserts a mid-training stage between diffusion pre-training and flow map post-training, which stabilizes the training of flow map models.
- The method significantly reduces the amount of training data and GPU time needed.
- CMT achieves state-of-the-art FID scores on popular datasets like CIFAR-10 and ImageNet.
- The approach simplifies the learning process for flow map models, enhancing convergence speed.
- CMT is positioned as a general framework applicable to various flow map training scenarios.
arXiv:2509.24526 (cs), Computer Vision and Pattern Recognition
[Submitted on 29 Sep 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Title: CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
Authors: Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon
Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines a...
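The mid-training objective described in the abstract (regress every point on a solver trajectory onto that trajectory's solver-generated clean sample) can be illustrated with a toy 1-D sketch. This is not the paper's implementation. It assumes an analytic stand-in for the pre-trained model (`teacher_velocity`), a plain Euler solver, and one linear regressor per time step in place of a neural network. All names and constants (`MU`, `S`, `T_MAX`, `N_STEPS`, `fit_mid_training`, `one_step_generate`) are hypothetical.

```python
import numpy as np

MU, S = 2.0, 0.5          # hypothetical toy data distribution N(MU, S^2)
T_MAX, N_STEPS = 3.0, 64  # hypothetical noise range and solver step count


def teacher_velocity(x, t):
    """Stand-in for a pre-trained diffusion model: the probability-flow ODE
    velocity dx/dt for Gaussian data with noise level sigma(t) = t."""
    return t * (x - MU) / (S**2 + t**2)


def solver_trajectory(x_T, times):
    """Euler-integrate from the prior sample down to t = 0, keeping every state."""
    states = [x_T]
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x = states[-1]
        states.append(x + (t_next - t_cur) * teacher_velocity(x, t_cur))
    return np.stack(states)  # shape: (len(times), batch)


def fit_mid_training(traj, x0_solver):
    """For each time index, least-squares fit x0_solver ~ a * x_t + b.
    The same target (the solver's clean sample) is used at every time step."""
    coeffs = []
    for x_t in traj:
        features = np.stack([x_t, np.ones_like(x_t)], axis=1)
        sol, *_ = np.linalg.lstsq(features, x0_solver, rcond=None)
        coeffs.append(sol)
    return np.array(coeffs)  # shape: (len(times), 2)


rng = np.random.default_rng(0)
times = np.linspace(T_MAX, 0.0, N_STEPS + 1)
# Toy prior: the exact marginal at t = T_MAX (an assumption of this sketch).
x_T = MU + np.sqrt(S**2 + T_MAX**2) * rng.standard_normal(4096)

traj = solver_trajectory(x_T, times)
x0_solver = traj[-1]  # solver-generated clean samples
coeffs = fit_mid_training(traj, x0_solver)


def one_step_generate(x_t, i):
    """Jump from the state at time index i straight to the clean sample."""
    a, b = coeffs[i]
    return a * x_t + b
```

Calling `one_step_generate(traj[0], 0)` maps prior samples to clean samples in a single jump. In this linear toy problem the fit is exact, so it says nothing about how hard the real regression is; it only shows the shape of the training pairs.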