[2603.21237] ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Computer Science > Artificial Intelligence
arXiv:2603.21237 (cs)
[Submitted on 22 Mar 2026]
Title: ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Authors: Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu

Abstract: Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representat...
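
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical linear head that maps a query representation (standing in for a prefill hidden state) to a predicted consistency score in (0, 1), and a hypothetical pair of thresholds that choose a tier. The names `predict_consistency`, `route`, `device_threshold`, and `edge_threshold`, the single-score design, and all numeric values are assumptions made for illustration only.

```python
import math
from typing import Sequence


def predict_consistency(hidden: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Hypothetical linear head: query representation -> score in (0, 1).

    A high score is read as "the small model's answer is expected to be
    semantically consistent with a larger model's answer".
    """
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Numerically stable sigmoid (avoids overflow for large |logit|).
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def route(score: float,
          device_threshold: float = 0.8,
          edge_threshold: float = 0.5) -> str:
    """Pick a tier from a predicted consistency score.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if score >= device_threshold:
        return "device"
    if score >= edge_threshold:
        return "edge"
    return "cloud"


if __name__ == "__main__":
    # Toy 3-dimensional "hidden state" and head parameters (made up).
    hidden_state = [0.2, -0.1, 0.4]
    head_weights = [1.5, 0.5, 2.0]
    score = predict_consistency(hidden_state, head_weights, bias=0.0)
    print(f"predicted consistency: {score:.3f}, tier: {route(score)}")
```

In a setup like this, the reranker-derived consistency scores mentioned in the abstract would serve as soft training targets for the head, and the thresholds would trade response quality against latency and cost. How ConsRoute actually trains its router and sets its decision rule is not recoverable from the truncated abstract.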