[2410.11855] Online GPU Energy Optimization with Switching-Aware Bandits

Summary

This paper presents EnergyUCB, an online GPU energy optimization method that treats GPU frequency selection as a multi-armed bandit problem, balancing performance against energy savings in high-performance computing (HPC) systems.

Why It Matters

GPUs now dominate power draw in heterogeneous high-performance computing systems, yet existing energy management techniques largely target CPUs and often depend on offline training. This research proposes a fully online controller that trades energy savings against performance in real time, which matters for sustainable computing.

Key Takeaways

  • EnergyUCB optimizes GPU energy consumption in real-time using a multi-armed bandit framework.
  • The method balances exploration and exploitation to enhance learning efficiency.
  • It incorporates a switching-aware index to minimize performance degradation during frequency adjustments.
  • Experimental results demonstrate significant energy savings with acceptable performance trade-offs.
  • The QoS-constrained variant ensures compliance with user-defined performance budgets.
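
The switching-aware idea in the takeaways above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's EnergyUCB: it uses the standard UCB1 index and subtracts a fixed penalty from every arm other than the one currently in use. The class name `SwitchingAwareUCB`, the `switch_penalty` and `explore` parameters, the four frequency levels, and their reward values are all assumptions made for this example.

```python
import math
import random


class SwitchingAwareUCB:
    """Illustrative UCB1 controller that penalizes changing arms.

    Each arm stands for one GPU frequency level (hypothetical setup).
    """

    def __init__(self, n_arms, switch_penalty=0.05, explore=2.0):
        self.n_arms = n_arms
        self.switch_penalty = switch_penalty  # assumed cost of one frequency change
        self.explore = explore                # exploration constant (2.0 gives UCB1)
        self.counts = [0] * n_arms            # number of pulls per arm
        self.means = [0.0] * n_arms           # running mean reward per arm
        self.current = None                   # arm chosen in the previous step
        self.t = 0                            # total number of decisions so far

    def select(self):
        self.t += 1
        # Try every arm once before trusting the index.
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                self.current = arm
                return arm

        def index(arm):
            bonus = math.sqrt(self.explore * math.log(self.t) / self.counts[arm])
            penalty = 0.0 if arm == self.current else self.switch_penalty
            return self.means[arm] + bonus - penalty

        self.current = max(range(self.n_arms), key=index)
        return self.current

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(steps=2000, seed=0):
    """Run the controller against made-up noisy rewards and count switches."""
    rng = random.Random(seed)
    true_rewards = [0.3, 0.5, 0.8, 0.6]  # invented mean reward per frequency level
    ctrl = SwitchingAwareUCB(len(true_rewards))
    switches, prev = 0, None
    for _ in range(steps):
        arm = ctrl.select()
        if prev is not None and arm != prev:
            switches += 1
        prev = arm
        ctrl.update(arm, true_rewards[arm] + rng.gauss(0, 0.1))
    return ctrl, switches


if __name__ == "__main__":
    ctrl, switches = simulate()
    print("pulls per arm:", ctrl.counts, "switches:", switches)
```

In a real deployment the reward would come from measured energy and progress counters rather than a fixed table, and the penalty would reflect the measured overhead of a frequency change. Both are left as placeholders here.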

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2410.11855 (cs)

[Submitted on 3 Oct 2024 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: Online GPU Energy Optimization with Switching-Aware Bandits

Authors: Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, Kai Shu

Abstract: Energy consumption has become a bottleneck for future computing architectures, from wearable devices to leadership-class supercomputers. Existing energy management techniques largely target CPUs, even though GPUs now dominate power draw in heterogeneous high-performance computing (HPC) systems. Moreover, many prior methods rely on either purely offline or hybrid offline and online training, which is impractical and results in energy inefficiencies during data collection. In this paper, we introduce a practical online GPU energy optimization problem in HPC scenarios. The problem is challenging because (1) GPU frequency scaling exhibits performance-energy trade-offs, (2) online control must balance exploration and exploitation, and (3) frequent frequency switching incurs non-trivial overhead and degrades quality of service (QoS). To address these challenges, we formulate online GPU energy optimization as a multi-armed bandit problem and propose EnergyUCB, a lightweight UCB-based controller that dynamically adjusts GPU core ...
