[2509.12610] ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Computer Science > Databases
arXiv:2509.12610 (cs)
[Submitted on 16 Sep 2025 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Authors: Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

Abstract: Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Although Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost leads to unacceptable overhead when applied to enormous document collections and ad-hoc queries. We therefore introduce \textsc{ScaleDoc}, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, \textsc{ScaleDoc} leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, \textsc{ScaleDoc} proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy mo...
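The offline/online cascade described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic representations, the logistic-regression proxy, and the confidence thresholds `LO`/`HI` are all assumptions standing in for \textsc{ScaleDoc}'s actual components.

```python
# Hypothetical sketch of ScaleDoc-style cascaded filtering: a lightweight
# proxy model scores precomputed document representations, and only
# low-confidence ("ambiguous") documents are forwarded to the LLM.
# All names, thresholds, and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Offline phase (stand-in): pretend each document already has a
# d-dimensional semantic representation produced by an LLM.
d, n = 8, 200
reps = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
labels = (reps @ true_w > 0).astype(float)  # ground-truth predicate answers

# Online phase: fit a tiny logistic-regression proxy for the current
# query on a small set of LLM-labelled examples (here, the first 50).
def train_proxy(X, y, lr=0.5, epochs=300):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)     # gradient step on log-loss
    return w

w = train_proxy(reps[:50], labels[:50])
scores = 1.0 / (1.0 + np.exp(-(reps @ w)))

# Cascade: confident scores are decided by the proxy alone; documents
# scoring between the two thresholds would be sent to the LLM.
LO, HI = 0.2, 0.8  # assumed confidence thresholds
accepted = scores >= HI
rejected = scores <= LO
ambiguous = ~(accepted | rejected)
print(f"proxy decided {int((~ambiguous).sum())}/{n}, "
      f"forwarding {int(ambiguous.sum())} to the LLM")
```

In a real deployment the ambiguous subset would be batched through the LLM predicate, so the proxy's job is purely to shrink that subset while keeping the filter's decisions faithful.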