[2510.15620] On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Computer Science > Machine Learning
arXiv:2510.15620 (cs)
[Submitted on 17 Oct 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen

Abstract: Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings progressively stabilize in intermediate layers, enabling early pruning before full inference completes. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, PRISM. By maintaining a global view of all candidates, PRISM reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via overlapped layer streaming and chunked execution. We evaluate PRISM against state-of-the-art baselines on rerankers from 0.6 B to 8 B p...
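The core observation — that only relative rankings matter, and that they stabilize in intermediate layers — can be illustrated with a small sketch. This is not the paper's implementation: `progressive_top_k`, the per-layer score dictionaries, and the stabilization-then-prune heuristic are all hypothetical stand-ins for PRISM's progressive cluster pruning, assuming we can read out a relevance score for each surviving candidate after every transformer layer.

```python
# Hypothetical sketch of progressive pruning for top-K selection.
# layer_scores simulates per-candidate relevance scores observed after
# each intermediate layer of a cross-encoder reranker.

def progressive_top_k(layer_scores, k, prune_fraction=0.5):
    """Return the top-k candidate ids, pruning early when rankings stabilize.

    layer_scores: list of dicts {candidate_id: score}, one dict per layer.
    Heuristic: if the relative ranking of the surviving candidates is
    unchanged from the previous layer, the ordering is treated as stable
    and the lowest-scoring fraction is dropped (never going below k).
    """
    survivors = set(layer_scores[0])
    prev_ranking = None
    ranking = sorted(survivors, key=lambda c: layer_scores[0][c], reverse=True)
    for scores in layer_scores:
        ranking = sorted(survivors, key=lambda c: scores[c], reverse=True)
        if ranking == prev_ranking and len(survivors) > k:
            keep = max(k, int(len(ranking) * (1 - prune_fraction)))
            ranking = ranking[:keep]          # prune the tail of the ranking
            survivors = set(ranking)          # later layers skip pruned candidates
        prev_ranking = ranking
    return ranking[:k]

# Six candidates whose relative order is already stable from the first layer,
# so pruning kicks in early and later layers process fewer candidates.
layers = [
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
]
print(progressive_top_k(layers, k=2))  # → ['f', 'e']
```

The latency benefit in this toy model comes from shrinking the candidate set mid-inference: once the ordering stops changing, most candidates never reach the final layers, so the remaining forward computation is spent only on plausible top-K members.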