[2509.12610] ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Computer Science > Databases
arXiv:2509.12610 (cs)
[Submitted on 16 Sep 2025 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Authors: Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

Abstract: Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Given very large document collections and ad-hoc queries, Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy mo...
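The online filtering phase described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the logistic-regression proxy, the two thresholds `low` and `high`, the toy 2-D embeddings, and the `llm_predicate` stub are all assumptions made for illustration, and the contrastive-learning framework the abstract mentions is not reproduced here. The sketch also assumes that a handful of documents have already been labeled (for example by the LLM) to train the proxy, which the visible part of the abstract does not specify.

```python
import math


def train_proxy(embeddings, labels, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression proxy with plain SGD (illustrative only)."""
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def score(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    return score


def cascade_filter(embeddings, score, llm_predicate, low=0.2, high=0.8):
    """Decide confident cases with the proxy; forward ambiguous ones to the LLM.

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    results, llm_calls = [], 0
    for doc_id, x in enumerate(embeddings):
        s = score(x)
        if s >= high:
            results.append(True)       # confidently satisfies the predicate
        elif s <= low:
            results.append(False)      # confidently does not
        else:
            results.append(llm_predicate(doc_id))  # ambiguous: ask the LLM
            llm_calls += 1
    return results, llm_calls


# Toy precomputed "offline" embeddings (hypothetical 2-D vectors).
embeddings = [(1.0, 1.0), (0.9, 1.1), (-1.0, -1.0), (-0.9, -1.1), (0.1, -0.1)]
truth = [True, True, False, False, True]

# Assumed: the first four documents were labeled in advance to train the proxy.
score = train_proxy(embeddings[:4], [1, 1, 0, 0])

# Stub standing in for an expensive LLM call on the raw document.
results, llm_calls = cascade_filter(embeddings, score, lambda doc_id: truth[doc_id])
```

In this toy setup only the document whose embedding sits near the decision boundary is forwarded to the stub, which is the cost-saving behavior the abstract attributes to the proxy: the narrower the ambiguous band, the fewer LLM calls, at the risk of more proxy mistakes.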

Originally published on March 04, 2026. Curated by AI News.
