[2509.12610] ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Computer Science > Databases
arXiv:2509.12610 (cs)
[Submitted on 16 Sep 2025 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Authors: Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

Abstract: Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Although Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead when applied to enormous document collections and ad-hoc queries. We therefore introduce \textsc{ScaleDoc}, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, \textsc{ScaleDoc} proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy mo...
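The offline/online cascade described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic representations, the logistic-regression proxy, and the confidence thresholds `LO`/`HI` are all assumptions standing in for \textsc{ScaleDoc}'s actual components.

```python
# Hypothetical sketch of ScaleDoc-style cascaded filtering: a lightweight
# proxy model scores precomputed document representations, and only
# low-confidence ("ambiguous") documents are forwarded to the LLM.
# All names, thresholds, and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Offline phase (stand-in): pretend each document already has a
# d-dimensional semantic representation produced by an LLM.
d, n = 8, 200
reps = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
labels = (reps @ true_w > 0).astype(float)  # ground-truth predicate answers

# Online phase: fit a tiny logistic-regression proxy for the current
# query on a small set of LLM-labelled examples (here, the first 50).
def train_proxy(X, y, lr=0.5, epochs=300):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)     # gradient step on log-loss
    return w

w = train_proxy(reps[:50], labels[:50])
scores = 1.0 / (1.0 + np.exp(-(reps @ w)))

# Cascade: confident scores are decided by the proxy alone; documents
# scoring between the two thresholds would be sent to the LLM.
LO, HI = 0.2, 0.8  # assumed confidence thresholds
accepted = scores >= HI
rejected = scores <= LO
ambiguous = ~(accepted | rejected)
print(f"proxy decided {int((~ambiguous).sum())}/{n}, "
      f"forwarding {int(ambiguous.sum())} to the LLM")
```

In a real deployment the ambiguous subset would be batched through the LLM predicate, so the proxy's job is purely to shrink that subset while keeping the filter's decisions faithful.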