[2602.20497] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

arXiv - AI · 4 min read

Summary

The paper introduces LESA, a framework for accelerating diffusion models using learnable stage-aware predictors, achieving significant speedups while maintaining high-quality outputs.

Why It Matters

As diffusion models gain traction in image and video generation, optimizing their computational efficiency is crucial for practical applications. The LESA framework addresses this challenge by enhancing performance without compromising quality, making it relevant for researchers and practitioners in AI and computer vision.

Key Takeaways

  • LESA utilizes a two-stage training approach to optimize diffusion model performance.
  • The framework achieves up to 6.25x acceleration with improved quality metrics over existing methods.
  • Specialized predictors are employed for different noise levels, enhancing feature forecasting accuracy.
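
The multi-stage, multi-expert idea in the takeaways above can be sketched as routing each denoising timestep to a stage-specific predictor. This is an illustrative sketch only: the names (`StagePredictor`, `predict_features`) and the equal-width stage split are assumptions, not details from the paper, and the predictor body is a placeholder where a learned model (e.g., a small KAN) would go.

```python
def stage_index(t: int, num_steps: int, num_stages: int) -> int:
    """Map a timestep to one of num_stages equal-width noise-level stages (assumed split)."""
    return min(t * num_stages // num_steps, num_stages - 1)

class StagePredictor:
    """Placeholder for one learned per-stage predictor (e.g., a small KAN)."""
    def __init__(self, stage: int):
        self.stage = stage

    def predict(self, cached_feature):
        # A real predictor would forecast the next-step feature from the cache;
        # returning the cached value keeps this sketch runnable.
        return cached_feature

def predict_features(t, num_steps, experts, cached_feature):
    """Dispatch the forecast to the expert responsible for this noise-level stage."""
    return experts[stage_index(t, num_steps, len(experts))].predict(cached_feature)

experts = [StagePredictor(s) for s in range(3)]
```

The routing itself is cheap; the point of the design is that early, middle, and late denoising stages exhibit different feature dynamics, so each expert only has to fit one regime.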

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.20497 (cs) [Submitted on 24 Feb 2026]

Title: LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Authors: Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reuse or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-…
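
The feature-caching strategy the abstract describes can be illustrated with a minimal sampling loop that runs the expensive model only every few steps and forecasts features in between. This is a sketch under assumptions: `full_forward` stands in for a DiT block, `forecast` for the learned (e.g., KAN-based) predictor, and the fixed caching interval and toy update rule are illustrative, not the paper's method.

```python
def sample_with_cache(num_steps, cache_interval, full_forward, forecast, x0):
    """Toy denoising loop: full model every cache_interval steps, forecast otherwise."""
    x = x0
    cache = None
    for t in range(num_steps):
        if cache is None or t % cache_interval == 0:
            feat = full_forward(x, t)   # expensive: run the full network
            cache = feat                # store the feature for later forecasting
        else:
            feat = forecast(cache, t)   # cheap: predict the feature from the cache
        x = x - 0.1 * feat              # placeholder update step, not a real sampler
    return x
```

With a well-trained forecaster, most iterations skip the full network, which is where the reported speedup comes from; the learned predictor is what keeps the skipped steps consistent with the standard denoising trajectory.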

Related Articles

[2603.18940] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
LLMs

Abstract page for arXiv paper 2603.18940: Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty ...

arXiv - Machine Learning · 3 min
[2512.20620] Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness Using Neural Network Interpretability Maps
Machine Learning

Abstract page for arXiv paper 2512.20620: Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness ...

arXiv - Machine Learning · 4 min
[2512.13607] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Machine Learning

Abstract page for arXiv paper 2512.13607: Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

arXiv - Machine Learning · 4 min
[2512.02650] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Machine Learning

Abstract page for arXiv paper 2512.02650: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

arXiv - Machine Learning · 3 min