[2604.08558] WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Computer Science > Computation and Language
arXiv:2604.08558 (cs) [Submitted on 17 Mar 2026]

Title: WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Authors: Hanna Lee, Tan Dat Nguyen, Jaehoon Kang, Kyuhong Shim

Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND (Windowed Attention and Knowledge Distillation), a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two components: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further use knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.

Subjects: Computation and Language (cs.CL)
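To make the attention pattern described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a mask that combines persistent global attention over a conditioning prefix with causal sliding-window attention over generated tokens. Names such as `num_cond_tokens` and `window_size` are illustrative assumptions.

```python
import torch

def wand_style_mask(seq_len: int, num_cond_tokens: int, window_size: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, seq_len); True means 'may attend'.

    Every position may attend to the first `num_cond_tokens` positions
    (the conditioning tokens) and to a causal window of at most
    `window_size` preceding generated tokens, including itself.
    """
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)

    causal = k <= q                           # autoregressive constraint
    global_keys = k < num_cond_tokens         # persistent conditioning prefix
    in_window = (q - k) < window_size         # local sliding window

    return causal & (global_keys | in_window)

# Example: a 6-token conditioning prefix and a 4-token generation window.
mask = wand_style_mask(seq_len=16, num_cond_tokens=6, window_size=4)
```

Under this pattern, the per-step KV cache only needs the conditioning prefix plus the last `window_size` generated tokens, which is consistent with the length-invariant memory and near-constant per-step latency claimed in the abstract.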