[2602.03175] Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback
Summary
This paper studies multi-objective bandits under a probe-then-commit (PtC) interaction, in which an agent probes up to $q$ of $K$ candidate arms per round but executes only one, and analyzes the theoretical benefits of this limited multi-arm feedback for resource selection.
Why It Matters
The setting models online resource selection in systems such as mobile edge computing offloading and multi-radio access selection, where each candidate link or server is judged on several metrics at once (e.g., throughput, latency, energy, reliability). Existing multi-objective learning theory largely covers two extremes, classical bandit feedback ($q=1$) and full-information experts ($q=K$); this work addresses the regime in between.
Key Takeaways
- Introduces PtC-P-UCB, an optimistic probe-then-commit (PtC) algorithm for multi-objective bandits.
- Shows theoretically that probing up to $q$ arms per round accelerates learning relative to single-arm bandit feedback.
- Quantifies error and regret bounds, enhancing understanding of multi-arm feedback.
- Extends findings to multi-modal probing, integrating various data modalities.
- Addresses a gap in multi-objective learning theory: the regime between bandit feedback ($q=1$) and full information ($q=K$).
Computer Science > Machine Learning
arXiv:2602.03175 (cs)
[Submitted on 3 Feb 2026 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback
Authors: Ming Shi
Abstract: We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms) whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is probe-then-commit (PtC): the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these extremes. We develop PtC-P-UCB, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode, e.g., it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the...
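To make the probe-then-commit interaction from the abstract concrete, the following is a minimal illustrative sketch, not the paper's PtC-P-UCB algorithm. It assumes $d=2$ objectives that are both maximized, a reference point at the origin, and plain greedy selection as a stand-in for "approximately maximizing" a coverage potential. The names `hypervolume_2d`, `greedy_probe_set`, `commit_arm`, and the inputs `optimistic`, `observed`, and `frontier` are hypothetical and do not come from the source.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` (both objectives maximized, d = 2)."""
    # Shift by the reference point and sweep from the largest first coordinate down.
    pts = sorted(((x - ref[0], y - ref[1]) for x, y in points), reverse=True)
    area, best_y = 0.0, 0.0
    for x, y in pts:
        if x > 0 and y > best_y:
            area += x * (y - best_y)  # new strip not covered by earlier points
            best_y = y
    return area


def greedy_probe_set(optimistic, q):
    """Greedily pick up to q arm indices whose optimistic vectors cover the most hypervolume.

    `optimistic` is a list of per-arm optimistic estimates (assumed here to be
    upper-confidence vectors); greedy selection is an assumption of this sketch.
    """
    chosen = []
    for _ in range(min(q, len(optimistic))):
        base = hypervolume_2d([optimistic[i] for i in chosen])
        remaining = [i for i in range(len(optimistic)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: hypervolume_2d([optimistic[j] for j in chosen + [i]]) - base,
        )
        chosen.append(best)
    return chosen


def commit_arm(observed, frontier):
    """Among probed arms, commit to the one with the largest marginal hypervolume gain.

    `observed` maps arm index -> probed outcome vector; `frontier` is a list of
    previously achieved outcome vectors (hypothetical bookkeeping for this sketch).
    """
    base = hypervolume_2d(frontier)
    return max(observed, key=lambda i: hypervolume_2d(frontier + [observed[i]]) - base)


# One round with K = 4 arms and q = 2 probes (all numbers are made up).
optimistic = [(0.9, 0.2), (0.1, 0.9), (0.5, 0.5), (0.2, 0.2)]
probes = greedy_probe_set(optimistic, q=2)
observed = {2: (0.4, 0.45), 0: (0.8, 0.1)}  # pretend probe outcomes for the chosen arms
committed = commit_arm(observed, frontier=[(0.3, 0.3)])
```

Setting `q=1` in this sketch reduces the round to ordinary bandit feedback, and `q=len(optimistic)` to observing every arm, which mirrors the interpolation the abstract describes. Exact hypervolume computation becomes more expensive as $d$ grows, so a general-$d$ version would need a different routine than the two-dimensional sweep used here.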