[2602.15958] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Summary
The paper introduces DocSplit, a benchmark dataset and evaluation framework for document packet recognition and splitting, the task of separating a multi-page packet of stitched-together documents into its individual documents.
Why It Matters
Document packet splitting has remained largely unaddressed despite recent advances in visual document understanding. DocSplit provides a structured way to evaluate how well large language models handle complex document packets, which matters in fields such as legal services and healthcare where accurate document management is crucial.
Key Takeaways
- DocSplit is presented as the first comprehensive benchmark dataset for document packet splitting.
- It includes five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings.
- Novel evaluation metrics are proposed to assess how well large language models split document packets.
- The benchmark reveals significant performance gaps in current models.
- The datasets are released to facilitate further research in document processing.
Computer Science > Computation and Language
arXiv:2602.15958 (cs)
[Submitted on 17 Feb 2026]
Title: DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Authors: Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop III, Niharika Jain, Spencer Romo, Bob Strahan, Boyi Xie, Diego A. Socolinsky
Abstract: Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges,...
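The three requirements named in the abstract (identify document boundaries, classify document types, maintain page ordering) can be made concrete with a small sketch. The following is an illustration only: the `PageLabel` structure, the per-page labeling scheme, and the boundary F1 and page accuracy scores are assumptions introduced here, not the data format or the evaluation metrics defined by the DocSplit paper.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class PageLabel:
    """Hypothetical per-page label for one page of a packet."""
    doc_id: int     # which document in the packet the page belongs to
    doc_type: str   # document type label, e.g. "invoice"
    order: int      # 1-based position of the page within its document


def boundaries(labels: List[PageLabel]) -> Set[int]:
    """Packet positions where a new document starts (position 0 always counts)."""
    return {i for i, lab in enumerate(labels)
            if i == 0 or lab.doc_id != labels[i - 1].doc_id}


def boundary_f1(gold: List[PageLabel], pred: List[PageLabel]) -> float:
    """F1 over document start positions (an assumed, illustrative score)."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def page_accuracy(gold: List[PageLabel], pred: List[PageLabel], field: str) -> float:
    """Fraction of pages whose `field` ("doc_type" or "order") matches the gold label."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    hits = sum(getattr(g, field) == getattr(p, field) for g, p in zip(gold, pred))
    return hits / len(gold)


# A five-page packet: a two-page invoice followed by a three-page contract.
gold = [
    PageLabel(0, "invoice", 1), PageLabel(0, "invoice", 2),
    PageLabel(1, "contract", 1), PageLabel(1, "contract", 2),
    PageLabel(1, "contract", 3),
]
# A prediction that wrongly starts a third document at packet position 3.
pred = [
    PageLabel(0, "invoice", 1), PageLabel(0, "invoice", 2),
    PageLabel(1, "contract", 1), PageLabel(2, "contract", 1),
    PageLabel(2, "contract", 2),
]

print("boundary F1:", round(boundary_f1(gold, pred), 3))
print("type accuracy:", page_accuracy(gold, pred, "doc_type"))
print("order accuracy:", page_accuracy(gold, pred, "order"))
```

In this toy packet the spurious split leaves every type label correct but lowers both the boundary score and the ordering score, which shows why the three sub-tasks need to be scored separately.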