[2510.24803] MASPRM: Multi-Agent System Process Reward Model
Summary
The MASPRM paper introduces the Multi-Agent System Process Reward Model, which scores partial inter-agent transcripts to guide inference-time search and allocate compute efficiently in multi-agent systems.
Why It Matters
Practical multi-agent systems must perform well at test time without unbounded compute. By scoring partial transcripts during inference, MASPRM lets a system concentrate search on the branches most likely to pay off, which is directly relevant to developers and researchers deploying multi-agent AI pipelines.
Key Takeaways
- MASPRM assigns values to partial inter-agent transcripts, improving decision-making during inference.
- The model is trained using Monte Carlo Tree Search rollouts without requiring human annotations.
- Averaged across benchmarks, MASPRM improves Hit@1 by up to +13.4 points over policy-likelihood ranking and narrows the Hit@1 to Hit@5 gap by up to 10.3 points.
- The approach focuses computation on promising branches while pruning unpromising ones.
- Benchmarks include GSM8K, MATH, MMLU, and LogiQA, demonstrating versatility across tasks.
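The training recipe above, deriving step-level value targets from rollouts labeled only with terminal outcome rewards, can be sketched with a simple Monte Carlo estimate: the target value of a partial transcript is the mean terminal reward of rollouts that share that prefix. This is an illustrative sketch, not the paper's exact procedure; the `step_targets` function and the rollout representation are assumptions for illustration.

```python
# Hedged sketch: turn terminal-only outcome rewards from rollouts into
# step-level value targets by averaging returns over shared prefixes.
# The rollout format (list of (steps, terminal_reward)) is illustrative.
from collections import defaultdict

def step_targets(rollouts):
    """rollouts: list of (steps, terminal_reward), where `steps` is the
    sequence of agent actions forming one inter-agent transcript.

    Returns a dict mapping each partial-transcript prefix to the mean
    terminal reward of rollouts sharing it -- a Monte Carlo value
    estimate usable as a PRM regression target."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, reward in rollouts:
        for t in range(1, len(steps) + 1):
            prefix = tuple(steps[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}
```

With two rollouts that share a first step but diverge afterward, the shared prefix receives the average of the two terminal rewards, while each full transcript keeps its own reward as the target.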
Computer Science > Multiagent Systems
arXiv:2510.24803 (cs)
[Submitted on 28 Oct 2025 (v1), last revised 12 Feb 2026 (this version, v2)]
Title: MASPRM: Multi-Agent System Process Reward Model
Authors: Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong
Abstract: Practical deployment of multi-agent systems (MAS) demands strong performance at test time, motivating methods that guide search during inference and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns values to partial inter-agent transcripts for each action and each agent, and acts as a controller during inference. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts labeled only with terminal outcome rewards, without requiring human step-level annotations, by propagating returns to local targets. During inference, MASPRM guides step-level beam search (SBS) and MCTS, focusing computation on promising branches and pruning unpromising ones. We train and test MASPRM across different tasks and domains, using GSM8K, MATH, MMLU, and LogiQA as benchmarks. Averaged across these benchmarks, MASPRM improves Hit@1 over policy likelihood by up to +13.4 points and improves ranking quality, reducing Hit@1 to Hit@5 gaps by up to 10.3 points. MASPRM complements inference-time search by sc...
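The abstract's step-level beam search (SBS) can be illustrated as a generic search loop in which a PRM scores each partial transcript, the top-scoring candidates survive, and the rest are pruned. This is a minimal sketch under stated assumptions, not the paper's implementation: `expand`, `prm_score`, and `is_final` are hypothetical callables standing in for the MAS policy, the trained MASPRM, and a termination check.

```python
def step_beam_search(initial, expand, prm_score, is_final,
                     beam_width=4, branch=3):
    """PRM-guided step-level beam search (illustrative sketch).

    expand(transcript)    -> list of candidate next-step transcripts
    prm_score(transcript) -> scalar value of a partial transcript
    is_final(transcript)  -> True when the transcript is complete

    Each round, every unfinished transcript is expanded into at most
    `branch` continuations; only the `beam_width` highest-PRM-scoring
    candidates are kept, pruning unpromising branches."""
    beam = [initial]
    while not all(is_final(t) for t in beam):
        candidates = []
        for t in beam:
            if is_final(t):
                candidates.append(t)  # finished transcripts carry over
            else:
                candidates.extend(expand(t)[:branch])
        beam = sorted(candidates, key=prm_score, reverse=True)[:beam_width]
    return max(beam, key=prm_score)
```

On a toy problem where each step appends a digit and the score is the digit sum, the search keeps only high-sum prefixes and recovers the best full sequence without enumerating all branches.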