[2509.22211] LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Summary
LogiPart introduces a scalable framework for data exploration using local large language models, enhancing the efficiency of taxonomic discovery in text corpora.
Why It Matters
This research addresses the limitations of traditional topic models by providing a method that allows for efficient, hypothesis-driven exploration of large datasets. By leveraging local LLMs, LogiPart makes advanced data analysis accessible on consumer-grade hardware, which is crucial for researchers and practitioners in AI and data science.
Key Takeaways
- LogiPart decouples hierarchy growth from expensive LLM conditioning, improving efficiency.
- It achieves constant O(1) generative token complexity per hierarchy node, independent of corpus size, making it scalable to large datasets.
- The framework demonstrates high accuracy in taxonomic bisections, reaching up to 96% routing accuracy.
- LogiPart enables exploratory analysis on consumer-grade hardware, broadening access for researchers.
- Qualitative audits confirm the framework's ability to uncover meaningful functional axes in data.
Computer Science > Computation and Language
arXiv:2509.22211 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 17 Feb 2026 (this version, v3)]
Title: LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Authors: Tiago Fernandes Tavares
Abstract: The discovery of deep, steerable taxonomies in large text corpora is currently restricted by a trade-off between the surface-level efficiency of topic models and the prohibitive, non-scalable assignment costs of LLM-integrated frameworks. We introduce \textbf{LogiPart}, a scalable, hypothesis-first framework for building interpretable hierarchical partitions that decouples hierarchy growth from expensive full-corpus LLM conditioning. LogiPart utilizes locally hosted LLMs on compact, embedding-aware samples to generate concise natural-language taxonomic predicates. These predicates are then evaluated efficiently across the entire corpus using zero-shot Natural Language Inference (NLI) combined with fast graph-based label propagation, achieving constant $O(1)$ generative token complexity per node relative to corpus size. We evaluate LogiPart across four diverse text corpora (totaling $\approx$140,000 documents). Using structured manifolds for \textbf{calibration}, we identify an empirical reasoning threshold at the 14...
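The abstract's assignment step (sparse zero-shot NLI decisions spread over the full corpus via graph-based label propagation) can be illustrated in miniature. The paper's exact propagation scheme is not specified in this summary, so the sketch below is only a generic diffusion-style label propagation over a document-similarity graph; the function name `propagate_labels` and all parameters are assumptions, not the authors' API. It presumes an NLI model has already routed a few seed documents to side 0 or side 1 of a taxonomic predicate.

```python
import numpy as np

def propagate_labels(adj, seed_labels, alpha=0.9, n_iters=50):
    """Spread sparse NLI seed decisions over a document-similarity graph.

    adj         : (n, n) symmetric affinity matrix over documents
    seed_labels : (n,) int array; 0 or 1 for NLI-labeled seeds, -1 if unlabeled
    alpha       : diffusion weight (1 - alpha re-anchors the seeds each step)
    Returns a hard 0/1 assignment per document for one taxonomic predicate.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0            # guard isolated nodes against divide-by-zero
    W = adj / deg                  # row-normalized transition matrix
    Y = np.zeros((n, 2))           # one-hot seed evidence for the two sides
    seeded = seed_labels >= 0
    Y[seeded, seed_labels[seeded]] = 1.0
    F = Y.copy()
    for _ in range(n_iters):       # diffuse evidence while keeping seeds fixed
        F = alpha * (W @ F) + (1.0 - alpha) * Y
    return F.argmax(axis=1)

# Two well-separated document clusters; NLI labels one seed in each.
adj = np.array([[0, 1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 0, 0, 0],
                [0, 0, 0, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
seeds = np.array([0, -1, -1, 1, -1, -1])
print(propagate_labels(adj, seeds).tolist())  # → [0, 0, 0, 1, 1, 1]
```

Because only the seed routing consumes NLI inference and the generative LLM is called once per node on a small sample, the per-node generative token cost stays constant as the corpus grows, matching the O(1) claim in the abstract.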