[2510.00602] Multi-Agent Stage-wise Conservative Linear Bandits
Summary
This paper presents MA-SCLUCB, an algorithm for multi-agent linear bandit problems, focusing on balancing exploration and exploitation while ensuring safety in collaborative settings.
Why It Matters
The research addresses the challenges of multi-agent systems in real-world applications, such as recommendation systems, where safety and performance must be balanced. By optimizing collaborative learning under conservative constraints, this work contributes to safer AI implementations in critical domains.
Key Takeaways
- MA-SCLUCB algorithm effectively balances exploration and exploitation in multi-agent settings.
- The proposed method ensures safety by maintaining expected rewards above a baseline.
- Collaboration among agents leads to improved performance despite local communication limitations.
- Communication overhead is manageable in well-connected networks, growing logarithmically.
- Stage-wise safety constraints introduce only lower-order regret, maintaining near-optimal performance.
Computer Science > Machine Learning arXiv:2510.00602 (cs) [Submitted on 1 Oct 2025 (v1), last revised 13 Feb 2026 (this version, v3)] Title:Multi-Agent Stage-wise Conservative Linear Bandits Authors:Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall View a PDF of the paper titled Multi-Agent Stage-wise Conservative Linear Bandits, by Amirhossein Afsharrad and 2 other authors View PDF HTML (experimental) Abstract:In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ w...