[2602.22261] Sustainable LLM Inference using Context-Aware Model Switching
Summary
The paper presents a context-aware model switching approach for large language models (LLMs) to enhance energy efficiency during inference, achieving significant reductions in energy consumption while maintaining response quality.
Why It Matters
As AI applications proliferate, their energy consumption poses sustainability challenges. This research introduces a method to optimize LLM inference, potentially reducing environmental impact while improving efficiency, which is crucial for the future of AI deployment.
Key Takeaways
- Context-aware model switching can reduce energy consumption by up to 67.5%.
- The approach maintains response quality at 93.6% while improving response time for simple queries by approximately 68%.
- Combines caching, complexity scoring, and machine learning for efficient model selection.
- Demonstrates a scalable solution for sustainable AI without compromising performance.
- Highlights the importance of adaptive systems in AI for energy efficiency.
Computer Science > Machine Learning
arXiv:2602.22261 (cs)
[Submitted on 25 Feb 2026]
Title: Sustainable LLM Inference using Context-Aware Model Switching
Authors: Yuvarani, Akashdeep Singh, Zahra Fathanah, Salsabila Harlen, Syeikha Syafura Al-Zahra binti Zahari, Hema Subramaniam
Abstract: Large language models have become central to many AI applications, but their growing energy consumption raises serious sustainability concerns. A key limitation of current AI deployments is the reliance on a one-size-fits-all inference strategy: most systems route every request to the same large model regardless of task complexity, leading to substantial and unnecessary energy waste. To address this, we propose a context-aware model switching approach that dynamically selects an appropriate language model based on query complexity. The system combines caching for repeated queries, rule-based complexity scoring for fast and explainable decisions, machine learning classification to capture semantic intent, and a user-adaptive component that learns from interaction patterns over time. The architecture was evaluated on real conversation workloads and three open-source language models (Gemma3 1B, Gemma3 4B, and Qwen3 4B) with different computational costs, m...
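The user-adaptive component that "learns from interaction patterns over time" could, under one plausible reading, maintain a per-user escalation threshold that drifts with feedback. Everything below (the class name, step sizes, and feedback signals) is a hypothetical sketch, not the paper's actual mechanism:

```python
from collections import defaultdict

class AdaptiveRouter:
    """Per-user escalation threshold nudged by interaction outcomes.

    Hypothetical sketch: the paper does not specify this mechanism.
    Queries whose complexity score exceeds threshold(user) would be
    escalated to a larger model.
    """

    def __init__(self, base_threshold: float = 0.5):
        self.base = base_threshold
        self.offset: defaultdict[str, float] = defaultdict(float)

    def threshold(self, user: str) -> float:
        return self.base + self.offset[user]

    def record(self, user: str, used_large: bool, satisfied: bool) -> None:
        if not used_large and not satisfied:
            self.offset[user] -= 0.05   # small model failed: escalate sooner
        elif not used_large and satisfied:
            self.offset[user] += 0.02   # small model sufficed: save energy
```

The asymmetric step sizes (penalize failures more than rewarding successes) are one common design choice for keeping quality high while slowly reclaiming the energy savings of smaller models.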