[2505.14381] SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
Summary
The paper presents SCAN (SemantiC Document Layout ANalysis), an approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents.
Why It Matters
As Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly adopted for document processing, rich documents remain hard to handle because a single page contains a large amount of information. SCAN addresses this by identifying document components at a semantic granularity that balances context preservation with processing efficiency, which matters for RAG and visual RAG applications.
Key Takeaways
- SCAN improves both textual and visual RAG performance significantly.
- SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components.
- Experimental results show performance gains of up to 10.4 points over conventional methods.
- The SCAN model is trained by fine-tuning on annotated datasets.
- The approach is beneficial for applications involving rich document content.
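To make the idea of coarse-grained regions as retrieval units concrete, here is a minimal Python sketch. It is not the authors' implementation: SCAN predicts regions with a fine-tuned model, whereas this sketch assumes the regions are already available. The `Region` structure, the `regions_to_chunks` and `retrieve` helpers, and the word-overlap scoring are all hypothetical stand-ins chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Region:
    """A coarse semantic region on a page (hypothetical structure, not from the paper)."""
    page: int
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    text: str  # text read from the region, e.g. by OCR or a VLM


def regions_to_chunks(regions: list[Region]) -> list[dict]:
    """Turn each region into one retrieval chunk, in a simplified reading order."""
    # Assumption: sorting by page, then top edge, then left edge approximates reading order.
    ordered = sorted(regions, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))
    return [
        {"id": i, "page": r.page, "bbox": r.bbox, "text": r.text.strip()}
        for i, r in enumerate(ordered)
    ]


def retrieve(chunks: list[dict], query: str, k: int = 1) -> list[dict]:
    """Rank chunks by word overlap with the query (toy stand-in for an embedding retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


regions = [
    Region(page=1, bbox=(0, 400, 600, 800), text="Revenue grew in 2024 across all segments."),
    Region(page=1, bbox=(0, 0, 600, 380), text="Company overview and mission statement."),
]
chunks = regions_to_chunks(regions)
top = retrieve(chunks, "revenue 2024")
```

In a textual RAG pipeline the `text` of the top chunk would be passed to the LLM. In a visual RAG pipeline the `page` and `bbox` fields could be used to crop the region image for a VLM instead of sending the whole page.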
Computer Science > Artificial Intelligence
arXiv:2505.14381 (cs)
[Submitted on 20 May 2025 (v1), last revised 13 Feb 2026 (this version, v3)]
Title: SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
Authors: Nobuhiro Ueda, Yuyang Dong, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada
Abstract: With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning ob...