[2603.23346] RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Computer Science > Artificial Intelligence
arXiv:2603.23346 (cs)
[Submitted on 24 Mar 2026]
Title: RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Authors: Long Mai

Abstract: Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path -- a duplex S2S model -- speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path -- a cascaded ASR -> LLM pipeline -- generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path mode...
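The control flow described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: every name below (`draft_prefix`, `transcribe`, `continue_with_llm`, `respond`, and the `accept` callback standing in for the learned verifier) is hypothetical, the models are replaced by stubs that sleep and return fixed strings, and the live-audio monitoring of the fast path is omitted. It only shows the two paths starting together, the verifier gating the handoff, and the slow path conditioning on whatever prefix was committed.

```python
import asyncio
from typing import Callable


async def draft_prefix(user_audio: str) -> str:
    # Hypothetical stand-in for the duplex S2S fast path: returns quickly
    # with a short speculative response prefix.
    await asyncio.sleep(0.01)
    return "Sure, "


async def transcribe(user_audio: str) -> str:
    # Hypothetical stand-in for the ASR stage of the slow path.
    await asyncio.sleep(0.03)
    return user_audio


async def continue_with_llm(transcript: str, committed_prefix: str) -> str:
    # Hypothetical stand-in for the LLM stage. It is conditioned on the
    # committed prefix so the final utterance reads as one sentence.
    await asyncio.sleep(0.03)
    if committed_prefix:
        return f"I can help with {transcript}."
    return f"Happy to help with {transcript}."


async def respond(user_audio: str, accept: Callable[[str, str], bool]) -> str:
    # Start the slow path's ASR immediately so it overlaps with the fast path.
    asr_task = asyncio.create_task(transcribe(user_audio))

    # The fast path drafts a prefix; the verifier (here a plain callback,
    # an assumption of this sketch) decides whether to commit it.
    prefix = await draft_prefix(user_audio)
    committed = prefix if accept(user_audio, prefix) else ""
    # A real system would stream `committed` to TTS at this point.

    # The slow path finishes and continues from the committed prefix,
    # or produces the whole response alone if the draft was rejected.
    transcript = await asr_task
    continuation = await continue_with_llm(transcript, committed)
    return committed + continuation


if __name__ == "__main__":
    print(asyncio.run(respond("billing", lambda audio, prefix: True)))
    print(asyncio.run(respond("billing", lambda audio, prefix: False)))
```

In this sketch, rejecting the draft costs nothing beyond the verifier call, since the ASR task has been running since the turn was detected; that is the sense in which the fallback uses the slow path alone.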