[2602.17687] IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
Summary
The paper introduces IRPAPERS, a benchmark for evaluating visual document retrieval and question answering that compares image-based and text-based systems on a dataset of scientific papers.
Why It Matters
As AI systems increasingly handle multimodal data, understanding the effectiveness of visual document processing is crucial. IRPAPERS provides a structured approach to evaluate and improve retrieval methods, which can enhance scientific research efficiency and accuracy.
Key Takeaways
- IRPAPERS benchmark includes 3,230 pages from 166 scientific papers for testing retrieval systems.
- Image-based retrieval shows comparable performance to text-based methods, highlighting the potential of multimodal approaches.
- Hybrid systems combining text and image retrieval outperform unimodal systems, achieving higher recall rates.
- The dataset and code are publicly available, promoting further research in visual document processing.
- Different question types favor either text or image modalities, indicating the need for tailored retrieval strategies.
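The takeaways above note that hybrid systems combining text and image retrieval achieve higher recall than either modality alone. This excerpt does not spell out the fusion rule the authors use, so the sketch below uses reciprocal rank fusion (RRF), a common way to merge two ranked lists; the function name, page IDs, and the `k=60` constant are illustrative assumptions, not details from the paper.

```python
def rrf_fuse(text_ranking, image_ranking, k=60):
    """Merge two ranked lists of page IDs with reciprocal rank fusion.

    Each page scores 1 / (k + rank) per list it appears in; pages ranked
    highly by either retriever rise toward the top of the fused list.
    """
    scores = {}
    for ranking in (text_ranking, image_ranking):
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # Sort page IDs by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: the text and image retrievers disagree on the top page,
# but the page ranked well by both ("p1") wins after fusion.
fused = rrf_fuse(["p3", "p1", "p7"], ["p1", "p9", "p3"])
# fused[0] == "p1"
```

Score-level fusion (e.g., a weighted sum of normalized similarity scores) is an alternative when both retrievers expose comparable scores; RRF needs only the rankings, which makes it easy to combine BM25 with embedding-based retrieval.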
Computer Science > Information Retrieval, arXiv:2602.17687 (cs)
Submitted on 5 Feb 2026
Authors: Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt
Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, ena…
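The abstract reports retrieval quality as Recall@1, Recall@5, and Recall@20: the fraction of questions for which the gold page appears among the top-k retrieved pages. A minimal sketch of that metric (variable names and the toy page IDs are assumptions for illustration, not data from the benchmark):

```python
def recall_at_k(retrieved, gold_pages, k):
    """Fraction of questions whose gold page appears in the top-k results.

    `retrieved` is one ranked list of page IDs per question;
    `gold_pages` holds the single correct page ID for each question.
    """
    hits = sum(1 for ranked, gold in zip(retrieved, gold_pages)
               if gold in ranked[:k])
    return hits / len(gold_pages)

# Toy example with three questions: the gold page is found in the
# top-2 results for the first two questions but missed for the third.
retrieved = [["p1", "p2"], ["p4", "p3"], ["p9", "p8"]]
gold = ["p1", "p3", "p7"]
score = recall_at_k(retrieved, gold, 2)  # 2 of 3 questions hit
```

Because each question here targets a single "needle" page, Recall@k at k=1 coincides with top-1 accuracy, and the metric is monotonically non-decreasing in k, which is why the paper's Recall@20 figures exceed its Recall@5 figures.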