[2510.15620] On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Computer Science > Machine Learning
arXiv:2510.15620 (cs)
[Submitted on 17 Oct 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding
Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen

Abstract: Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings progressively stabilize in intermediate layers, enabling early pruning before full inference completes. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, PRISM. By maintaining a global view of all candidates, PRISM reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via overlapped layer streaming and chunked execution. We evaluate PRISM against state-of-the-art baselines on rerankers from 0.6 B to 8 B p...
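The core observation — that only relative rankings matter, and that they stabilize in intermediate layers — can be illustrated with a small sketch. This is not the paper's implementation: `progressive_top_k`, the per-layer score dictionaries, and the stabilization-then-prune heuristic are all hypothetical stand-ins for PRISM's progressive cluster pruning, assuming we can read out a relevance score for each surviving candidate after every transformer layer.

```python
# Hypothetical sketch of progressive pruning for top-K selection.
# layer_scores simulates per-candidate relevance scores observed after
# each intermediate layer of a cross-encoder reranker.

def progressive_top_k(layer_scores, k, prune_fraction=0.5):
    """Return the top-k candidate ids, pruning early when rankings stabilize.

    layer_scores: list of dicts {candidate_id: score}, one dict per layer.
    Heuristic: if the relative ranking of the surviving candidates is
    unchanged from the previous layer, the ordering is treated as stable
    and the lowest-scoring fraction is dropped (never going below k).
    """
    survivors = set(layer_scores[0])
    prev_ranking = None
    ranking = sorted(survivors, key=lambda c: layer_scores[0][c], reverse=True)
    for scores in layer_scores:
        ranking = sorted(survivors, key=lambda c: scores[c], reverse=True)
        if ranking == prev_ranking and len(survivors) > k:
            keep = max(k, int(len(ranking) * (1 - prune_fraction)))
            ranking = ranking[:keep]          # prune the tail of the ranking
            survivors = set(ranking)          # later layers skip pruned candidates
        prev_ranking = ranking
    return ranking[:k]

# Six candidates whose relative order is already stable from the first layer,
# so pruning kicks in early and later layers process fewer candidates.
layers = [
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6},
]
print(progressive_top_k(layers, k=2))  # → ['f', 'e']
```

The latency benefit in this toy model comes from shrinking the candidate set mid-inference: once the ordering stops changing, most candidates never reach the final layers, so the remaining forward computation is spent only on plausible top-K members.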