[2602.23370] Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
Computer Science > Computation and Language
arXiv:2602.23370 (cs)
[Submitted on 23 Dec 2025]

Title: Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
Authors: Kaifeng Wu, Junyan Wu, Qiang Liu, Jiarui Zhang, Wen Xu

Abstract: Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive a vector fusion method with scalar correction, which compresses the representation of ultra-long segments into a single vector without semantic loss. Experiment...
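
The overlapping sliding-window strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `window_spans` and `merge_boundary_scores`, the `score_window` callback, and the choice to average scores in overlapping regions are all assumptions made for illustration. The abstract does not say how the model combines predictions across windows, and its cross-window context fusion layer is not modeled here.

```python
from typing import Callable, List, Tuple


def window_spans(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token spans covering [0, n_tokens).

    Consecutive spans share `overlap` tokens. Hypothetical helper,
    not taken from the paper.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    if n_tokens <= 0:
        return []
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def merge_boundary_scores(
    n_tokens: int,
    window: int,
    overlap: int,
    score_window: Callable[[int, int], List[float]],
) -> List[float]:
    """Average per-token boundary scores over all windows covering each token.

    `score_window(start, end)` stands in for a boundary classifier and must
    return one score per token in the span. Averaging is an assumed merge
    rule, chosen only because it is the simplest option.
    """
    totals = [0.0] * n_tokens
    counts = [0] * n_tokens
    for start, end in window_spans(n_tokens, window, overlap):
        scores = score_window(start, end)
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    # Every token is covered by at least one span, so counts are nonzero.
    return [total / count for total, count in zip(totals, counts)]
```

A larger overlap gives each token more surrounding context in at least one window, at the cost of scoring more windows per document.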