[2410.11855] Online GPU Energy Optimization with Switching-Aware Bandits
Summary
This paper presents EnergyUCB, a novel online GPU energy optimization method using a multi-armed bandit approach to balance performance and energy savings in high-performance computing systems.
Why It Matters
As GPUs increasingly dominate power consumption in high-performance computing, optimizing their energy efficiency is crucial. This research addresses the limitations of existing techniques by proposing a real-time solution that balances energy savings with performance, which is vital for sustainable computing practices.
Key Takeaways
- EnergyUCB optimizes GPU energy consumption online, in real time, using a multi-armed bandit framework.
- The method balances exploration and exploitation to enhance learning efficiency.
- It incorporates a switching-aware index to limit the overhead and QoS degradation caused by frequent frequency switching.
- Experimental results demonstrate significant energy savings with acceptable performance trade-offs.
- The QoS-constrained variant ensures compliance with user-defined performance budgets.
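To make the bandit framing in the takeaways above concrete, here is a minimal sketch of a UCB-style controller with a switching penalty. It is not the paper's EnergyUCB index, which this page does not spell out. It assumes that each arm is one GPU frequency level, that the reward is a normalized score where higher is better, and that switching can be discouraged by subtracting a fixed `switch_penalty` from every arm other than the current one. The function name, parameters, and toy reward values are all hypothetical.

```python
import math
import random


def switching_aware_ucb(reward_fn, n_arms, horizon, explore=1.0, switch_penalty=0.1):
    """Illustrative only: UCB1 index minus a fixed penalty for leaving the current arm."""
    counts = [0] * n_arms      # number of pulls per arm (frequency level)
    means = [0.0] * n_arms     # running mean reward per arm
    current = None             # arm currently applied
    switches = 0               # how many times the applied arm changed
    history = []               # arm chosen at each step

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # initial round: try every arm once
        else:
            def index(a):
                # Exploration bonus shrinks as an arm is pulled more often.
                bonus = explore * math.sqrt(2.0 * math.log(t) / counts[a])
                # Assumed switching cost: applies only when leaving the current arm.
                penalty = switch_penalty if a != current else 0.0
                return means[a] + bonus - penalty
            arm = max(range(n_arms), key=index)

        if current is not None and arm != current:
            switches += 1
        current = arm

        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        history.append(arm)

    return counts, means, switches, history


# Toy environment: made-up mean rewards per frequency level, plus Gaussian noise.
TOY_MEANS = [0.2, 0.5, 0.9, 0.4]
rng = random.Random(0)


def toy_reward(arm):
    return TOY_MEANS[arm] + rng.gauss(0.0, 0.05)


counts, means, switches, history = switching_aware_ucb(
    toy_reward, len(TOY_MEANS), horizon=2000
)
print("pulls per arm:", counts, "switches:", switches)
```

Raising `switch_penalty` makes the controller stay on its current frequency unless another arm looks clearly better, trading some exploration for fewer switches. A real controller would replace `toy_reward` with measured energy and performance counters.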
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2410.11855 (cs)
[Submitted on 3 Oct 2024 (v1), last revised 17 Feb 2026 (this version, v2)]
Title: Online GPU Energy Optimization with Switching-Aware Bandits
Authors: Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, Kai Shu
Abstract: Energy consumption has become a bottleneck for future computing architectures, from wearable devices to leadership-class supercomputers. Existing energy management techniques largely target CPUs, even though GPUs now dominate power draw in heterogeneous high-performance computing (HPC) systems. Moreover, many prior methods rely on either purely offline or hybrid offline and online training, which is impractical and results in energy inefficiencies during data collection. In this paper, we introduce a practical online GPU energy optimization problem in HPC scenarios. The problem is challenging because (1) GPU frequency scaling exhibits performance-energy trade-offs, (2) online control must balance exploration and exploitation, and (3) frequent frequency switching incurs non-trivial overhead and degrades quality of service (QoS). To address these challenges, we formulate online GPU energy optimization as a multi-armed bandit problem and propose EnergyUCB, a lightweight UCB-based controller that dynamically adjusts GPU core ...