[2602.11298] Voxtral Realtime

arXiv - AI

Summary

Voxtral Realtime is a streaming automatic speech recognition (ASR) model that reaches offline transcription quality at sub-second latency. It is trained end-to-end to align audio with text.

Why It Matters

Real-time speech recognition has usually meant trading accuracy for speed. This model operates with sub-second delay while keeping offline-level accuracy and supporting multiple languages, which makes it relevant to customer service, accessibility, and real-time communication.

Key Takeaways

  • Voxtral Realtime achieves offline transcription quality with sub-second latency.
  • The model is trained end-to-end, which improves the alignment between audio and text.
  • It uses a new causal audio encoder and Ada RMS-Norm.
  • The model supports 13 languages.
  • The model weights are released under the Apache 2.0 license, which permits open reuse and modification.
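To make the term "causal" in the takeaways concrete, here is a minimal sketch of causal self-attention over audio frames. It is not the paper's encoder: the function name `causal_attention`, the toy two-dimensional frames, and the single-head design are all assumptions made for illustration. It shows only the general property that matters for streaming, namely that the output for frame t depends on frames 0 through t and never on later audio.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head attention in which frame t reads only frames 0..t.

    Hypothetical illustration of causality; not the Voxtral Realtime encoder.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the current and past frames only; future frames are never read.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights)) / total
            for d in range(len(values[0]))
        ])
    return outputs

# Toy "audio frames" (assumed values, two features each).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = causal_attention(frames, frames, frames)
# Same computation on the first two frames only, as if the third had not arrived yet.
prefix = causal_attention(frames[:2], frames[:2], frames[:2])
```

In this sketch `full[:2]` and `prefix` are identical, so results already emitted for earlier frames never need to be revised when new audio arrives. That is the property a streaming transcriber relies on to keep latency low.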

Computer Science > Artificial Intelligence

arXiv:2602.11298 (cs)

[Submitted on 11 Feb 2026 (v1), last revised 21 Feb 2026 (this version, v2)]

Title: Voxtral Realtime

Authors: Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Avi Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, ...
