[2509.25275] VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
Summary
VoiceBridge introduces a one-step latent bridge model for general speech restoration, efficiently reconstructing 48 kHz fullband speech from diverse distortions within a single generative process.
Why It Matters
This research addresses the limitations of existing, mostly single-task speech enhancement models by providing a scalable approach that improves restoration quality across multiple tasks with a single model. As speech technology becomes increasingly integral to AI applications, advances like VoiceBridge can meaningfully improve user experiences in communication and media.
Key Takeaways
- VoiceBridge utilizes a one-step latent bridge model for efficient speech restoration.
- The model enhances waveform-latent space alignment through an energy-preserving variational autoencoder.
- It successfully tackles diverse speech restoration tasks without the need for distillation.
- Extensive validation shows superior performance in both in-domain and out-of-domain tasks.
- The approach combines denoising and generative capabilities for improved audio quality.
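The energy-preserving idea behind the variational autoencoder can be illustrated with a minimal sketch: normalize the waveform by its RMS energy before encoding so latents occupy a consistent scale regardless of input loudness, then reinject the stored energy at decode time. This is a toy stand-in, not the paper's learned architecture; the function names and the identity "encoder" are hypothetical.

```python
import numpy as np

def encode(wave, eps=1e-8):
    # Normalize by RMS so latents occupy a consistent scale
    # across energy levels (hypothetical sketch of the
    # "energy-preserving" idea; not the paper's learned VAE).
    rms = np.sqrt(np.mean(wave**2) + eps)
    latent = wave / rms          # stand-in for a learned encoder
    return latent, rms

def decode(latent, rms):
    # Reinject the stored energy at reconstruction time.
    return latent * rms

wave = 0.3 * np.sin(np.linspace(0, 8 * np.pi, 480))
latent, rms = encode(wave)
recon = decode(latent, rms)
```

The point of the normalization is alignment: quiet and loud versions of the same utterance map to near-identical latents, which simplifies the latent-to-latent restoration task.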
Subjects: Computer Science > Sound
arXiv:2509.25275 (cs)
Submitted on 28 Sep 2025 (v1), last revised 15 Feb 2026 (this version, v3)
Authors: Chi Zhang, Zehua Chen, Kaiwen Zheng, Jun Zhu
Abstract: Bridge models have been investigated in speech enhancement but are mostly single-task, with constrained general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one-step latent bridge model (LBM) for GSR, capable of efficiently reconstructing 48 kHz fullband speech from diverse distortions. To inherit the advantages of data-domain bridge models, we design an energy-preserving variational autoencoder, enhancing the waveform-latent space alignment over varying energy levels. By compressing waveform into continuous latent representations, VoiceBridge models various GSR tasks with a single latent-to-latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing the high-quality target from distinctively different low-quality priors, we propose a joint neural prior for GSR, uniformly reducing the burden of the LBM in diverse tasks. Building upon these designs, we further investigate the bridge training objective by jointly tuning LBM, decoder and discriminator together...
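The one-step latent-to-latent process described in the abstract can be sketched in miniature: a bridge model transports the latent of degraded speech (the prior) toward the clean-speech latent, and one-step inference replaces iterative diffusion/bridge sampling with a single network evaluation. Everything below is a toy linear stand-in under stated assumptions, not the paper's trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latents: z_prior stands in for the encoded degraded speech
# (analogous to the paper's neural prior), z_clean for the target.
dim = 16
A = rng.standard_normal((dim, dim)) * 0.1 + np.eye(dim)  # toy latent-space "degradation"
z_clean = rng.standard_normal(dim)
z_prior = A @ z_clean

def one_step_bridge(z, W):
    # One-step inference: a single evaluation maps the prior
    # latent directly to the restored latent, instead of
    # integrating a bridge SDE/ODE over many sampling steps.
    return W @ z

W = np.linalg.inv(A)  # stands in for the trained one-step model
z_restored = one_step_bridge(z_prior, W)
```

In the actual system the map is a learned transformer trained with a bridge objective, and the same single model handles priors produced by many different distortion types.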