[2602.22871] Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Summary
The paper presents a novel framework called Stitching Noisy Diffusion Thoughts, which enhances reasoning in large language models by selecting the highest-scoring intermediate steps from many low-cost diffusion-sampled trajectories and stitching them into a coherent rationale, improving accuracy and reducing latency on problem-solving tasks.
Why It Matters
This research addresses the limitations of existing aggregation strategies in large language models, particularly in reasoning tasks. By improving the way intermediate steps are utilized, it enhances the performance of AI systems in complex problem-solving, making it relevant for advancements in AI applications across various fields.
Key Takeaways
- Introduces a self-consistency framework for reasoning in large language models.
- Improves accuracy by up to 23.8% across math and coding tasks.
- Reduces latency by up to 1.8x compared to traditional models.
- Utilizes a modular approach separating exploration from evaluation.
- Demonstrates effectiveness particularly on harder reasoning problems.
Computer Science > Computation and Language
arXiv:2602.22871 (cs)
[Submitted on 26 Feb 2026]
Title: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Authors: Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
Abstract: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across m...
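The three-stage pipeline in the abstract (sample trajectories, score each step with a PRM, stitch the best steps into a composite rationale) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `Trajectory` container, the `stitch` helper, and the toy `score_step` heuristic standing in for a real process reward model are all assumptions introduced here.

```python
# Hypothetical sketch of reward-guided stitching. The toy score_step
# function stands in for an off-the-shelf process reward model (PRM);
# a real system would call a learned scorer, and the stitched rationale
# would then condition an AR solver that recomputes only the final answer.
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list  # intermediate reasoning steps, as strings


def score_step(step: str) -> float:
    """Toy stand-in for a PRM: favors longer, equation-bearing steps."""
    return len(step) + (10.0 if "=" in step else 0.0)


def stitch(trajectories: list, num_steps: int) -> list:
    """At each step position, keep the highest-scoring candidate
    across all sampled trajectories (the 'stitching' operation)."""
    composite = []
    for i in range(num_steps):
        candidates = [t.steps[i] for t in trajectories if i < len(t.steps)]
        composite.append(max(candidates, key=score_step))
    return composite


# Two cheap diffusion-sampled trajectories (toy data for illustration).
trajs = [
    Trajectory(["let x be the unknown", "x + 2 = 5", "x = 3"]),
    Trajectory(["define x", "then 2x = 6 so x = 3", "answer: 3"]),
]
rationale = stitch(trajs, num_steps=3)
print(rationale)
```

Note that stitching operates at the step level, so the composite rationale can mix steps from different trajectories; this is what lets the method recover useful work from "nearly correct" attempts that trajectory-level voting would discard.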