[2511.19797] Terminal Velocity Matching
Summary
The paper introduces Terminal Velocity Matching (TVM), a novel approach to generative modeling that enhances performance in one- and few-step scenarios by modeling transitions between diffusion timesteps.
Why It Matters
TVM addresses a key limitation of current generative models: maintaining sample fidelity when only one or a few sampling steps are used. This matters for machine learning and computer vision applications, where fast yet accurate generative models are increasingly in demand.
Key Takeaways
- TVM generalizes flow matching for improved generative modeling.
- It models transitions between diffusion timesteps, enhancing fidelity.
- The method achieves state-of-the-art performance on ImageNet datasets.
- Minimal architectural changes enable stable, single-stage training with Diffusion Transformers.
- TVM provides an upper bound on the 2-Wasserstein distance between the data and model distributions when the model is Lipschitz continuous.
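As background for the first takeaway, here is a minimal NumPy sketch of the standard flow-matching objective that TVM generalizes. The linear interpolation path and constant-velocity target are the usual flow-matching choices; the "model" is a hypothetical stand-in linear map, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data" and "noise" batches in 2-D.
x1 = rng.normal(size=(64, 2))   # data samples x_1
x0 = rng.normal(size=(64, 2))   # noise samples x_0
t = rng.uniform(size=(64, 1))   # per-sample timestep in [0, 1]

# Linear interpolation path used by standard flow matching.
x_t = (1.0 - t) * x0 + t * x1

# The regression target is the constant velocity along the path.
v_target = x1 - x0

# Stand-in "model": a fixed linear map applied to x_t (illustrative only).
W = 0.1 * rng.normal(size=(2, 2))
v_pred = x_t @ W

# Flow-matching loss: mean squared error against the target velocity.
loss = float(np.mean((v_pred - v_target) ** 2))
print(loss > 0.0)
```

TVM's departure from this setup, per the abstract, is to model transitions between arbitrary pairs of timesteps and to regularize behavior at the terminal time rather than the initial one.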
Paper Details
Computer Science > Machine Learning, arXiv:2511.19797 (cs)
Submitted on 24 Nov 2025 (v1); last revised 16 Feb 2026 (this version, v3)
Title: Terminal Velocity Matching
Authors: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Abstract
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
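The abstract's emphasis on Jacobian-vector products can be made concrete with a small sketch: a forward-mode JVP pushes a tangent vector through the network alongside the primal computation, costing roughly one extra forward pass, which is why it scales well with large transformers. This uses a tiny two-layer NumPy MLP with hand-derived tangent propagation; all names here are illustrative, not the paper's fused kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer MLP: f(x) = W2 @ tanh(W1 @ x)  (illustrative stand-in).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def f(x):
    return W2 @ np.tanh(W1 @ x)

def jvp(x, v):
    """Forward-mode Jacobian-vector product J_f(x) @ v.

    The tangent v is propagated through each layer alongside the
    primal values, so the cost is about one extra forward pass.
    """
    a = W1 @ x
    h = np.tanh(a)
    dh = (1.0 - h ** 2) * (W1 @ v)   # tanh'(a) * (W1 @ v)
    return W2 @ dh

x = rng.normal(size=4)
v = rng.normal(size=4)

# Sanity check against a central finite-difference approximation.
eps = 1e-6
fd = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
print(np.allclose(jvp(x, v), fd, atol=1e-5))
```

The paper's contribution on this front is a fused attention kernel supporting backward passes through such JVPs; the sketch above only shows the forward-mode primitive itself.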