[2501.08219] Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling
Summary
The paper presents a measurement-driven characterization of energy-performance tradeoffs in LLM inference across workloads and GPU frequency scaling, showing that lightweight semantic workload features and phase-aware frequency control can yield large energy savings at little latency cost.
Why It Matters
As large language models (LLMs) become more prevalent, understanding their energy consumption and performance is crucial for developing sustainable AI technologies. This research provides actionable insights for optimizing LLM inference, which can lead to reduced energy costs and improved performance across diverse applications.
Key Takeaways
- Inference configurations for LLMs are often applied uniformly despite substantial workload variability.
- Lightweight semantic features can better predict inference difficulty than input length.
- Reducing GPU frequency can lead to significant energy savings with minimal latency increase.
- The decode phase of LLM inference dominates execution time and is largely insensitive to GPU frequency.
- Combining workload-aware model selection with phase-aware DVFS can enhance energy efficiency.
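The last takeaway combines two levers: routing each query to the smallest adequate model, and lowering the GPU clock during the frequency-insensitive decode phase. A minimal sketch of that decision logic is below; the difficulty heuristic, model tiers, and thresholds are illustrative assumptions, not the authors' actual feature set or policy — only the clock values (2842 MHz and 180 MHz) come from the paper.

```python
def semantic_difficulty(query: str) -> float:
    """Toy proxy for a lightweight semantic feature: the fraction of
    long words in the query. NOT the paper's actual feature set."""
    tokens = query.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) > 7) / len(tokens)

def select_model(query: str) -> str:
    """Workload-aware model selection: easy queries go to a small model.
    The tiers and thresholds here are hypothetical."""
    d = semantic_difficulty(query)
    if d < 0.15:
        return "1B"
    if d < 0.35:
        return "8B"
    return "32B"

def gpu_clock_mhz(phase: str) -> int:
    """Phase-aware DVFS: decode is largely frequency-insensitive, so it
    can run at the paper's low clock; prefill keeps the high clock."""
    return 2842 if phase == "prefill" else 180

print(select_model("What is 2+2?"))   # short, simple words -> smallest model
print(gpu_clock_mhz("decode"))        # decode phase -> low clock
```

In a real serving stack the clock change would go through a driver interface such as NVML rather than a Python return value; the point of the sketch is only the two-stage decision structure.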
Computer Science > Machine Learning
arXiv:2501.08219 (cs)
[Submitted on 14 Jan 2025 (v1), last revised 24 Feb 2026 (this version, v4)]
Title: Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling
Authors: Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic
Abstract: LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with an upper-bound analysis of the potential benefits of combining workload-aware model selection with phas...
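The headline result — big energy savings despite a longer runtime — follows from energy being average power times time: if capping the clock cuts power far more than it stretches latency, total energy drops. A back-of-envelope check, with assumed power draws chosen only to illustrate the arithmetic (the paper reports the 42% savings and the 1-6% latency range, not these wattages):

```python
# Energy = average power x total time.
# Power figures below are illustrative assumptions, NOT measurements
# from the paper; the +5% latency penalty sits inside the paper's
# reported 1-6% range.
baseline_power_w = 300.0                    # assumed avg power at 2842 MHz
capped_power_w = 165.0                      # assumed avg power at 180 MHz
baseline_time_s = 10.0                      # assumed baseline runtime
capped_time_s = baseline_time_s * 1.05      # +5% latency when capped

e_base = baseline_power_w * baseline_time_s  # 3000.0 J
e_cap = capped_power_w * capped_time_s       # 1732.5 J
savings = 1 - e_cap / e_base
print(f"energy savings: {savings:.1%}")      # ~42% under these assumptions
```

The decode phase's frequency insensitivity is what keeps the latency penalty small: if 77-91% of runtime barely slows down when the clock drops, the end-to-end stretch stays in the single digits.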