[2510.15620] On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
NLP


arXiv - Machine Learning 4 min read


Computer Science > Machine Learning, arXiv:2510.15620 (cs)
[Submitted on 17 Oct 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen

Abstract: Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings progressively stabilize in intermediate layers, enabling early pruning prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, PRISM. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via overlapped layer streaming and chunked execution. We evaluate PRISM against state-of-the-art baselines on rerankers from 0.6 B to 8 B p...
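The core observation, that relative rankings stabilize in intermediate layers so candidates clearly outside the top-K can be dropped before the full forward pass completes, can be sketched in a few lines of Python. This is an illustrative simplification, not PRISM's actual implementation: `score_layer`, the shrinking `keep_factor` budget, and the final re-sort are all assumptions made for the sketch.

```python
def progressive_topk(candidates, score_layer, num_layers, k, keep_factor=2.0):
    """Illustrative sketch of layer-wise progressive pruning for top-K selection.

    `score_layer(cand, layer)` stands in for the partial relevance score a
    reranker would expose after running `layer + 1` layers over `cand`.
    After each layer, candidates ranked well outside the top-K are pruned,
    so later (more expensive) layers run over fewer candidates.
    """
    survivors = list(candidates)
    for layer in range(num_layers):
        # Rank survivors by their intermediate scores at this depth.
        ranked = sorted(survivors, key=lambda c: score_layer(c, layer), reverse=True)
        # Keep a safety margin above K that shrinks as rankings stabilize.
        budget = max(k, int(len(ranked) * keep_factor / (layer + 1)))
        survivors = ranked[:budget]
        if len(survivors) == k:
            break  # Ranking has converged; no need to score further layers.
    # Final ordering uses the deepest available scores.
    final = sorted(survivors, key=lambda c: score_layer(c, num_layers - 1), reverse=True)
    return final[:k]
```

With 100 candidates and a monotone score, the pool shrinks across layers while the returned top-K matches an exhaustive sort; the real system additionally prunes whole clusters of candidates and overlaps weight I/O with this computation.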

Originally published on March 25, 2026. Curated by AI News.

Related Articles

Machine Learning

[R] First open-source implementation of Hebbian fast-weight write-back for the BDH architecture

The BDH (Dragon Hatchling) paper (arXiv:2509.26507) describes a Hebbian synaptic plasticity mechanism where model weights update during i...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] Could really use some guidance . I'm a 2nd year Data Science UG Student

I'm currently finishing up my second year of a three year Bachelor of Data Science degree. I've got the basics down quite well, linear re...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] Create datasets from TikTok videos

For ML experiments and RAG projects: Tikkocampus converts creator timelines into timestamped, searchable segments and then use it to perf...

Reddit - Machine Learning · 1 min ·
Memory chip giant SK hynix could help end 'RAMmageddon' with blockbuster US IPO | TechCrunch
NLP


SK hynix’s potential U.S. listing could raise $10-$14 billion to help it build more capacity, encourage others to follow, and end the 'RA...

TechCrunch - AI · 6 min ·
