[2602.17687] IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
Summary
The paper introduces IRPAPERS, a benchmark for evaluating visual document retrieval and question answering that compares image-based and text-based systems on a dataset of scientific papers.
Why It Matters
As AI systems increasingly handle multimodal data, understanding the effectiveness of visual document processing is crucial. IRPAPERS provides a structured approach to evaluate and improve retrieval methods, which can enhance scientific research efficiency and accuracy.
Key Takeaways
- IRPAPERS benchmark includes 3,230 pages from 166 scientific papers for testing retrieval systems.
- Image-based retrieval shows comparable performance to text-based methods, highlighting the potential of multimodal approaches.
- Hybrid systems combining text and image retrieval outperform unimodal systems, achieving higher recall rates.
- The dataset and code are publicly available, promoting further research in visual document processing.
- Different question types favor either text or image modalities, indicating the need for tailored retrieval strategies.
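The takeaways above note that hybrid systems combining text and image retrieval achieve higher recall than either modality alone. This excerpt does not spell out the fusion rule the authors use, so the sketch below uses reciprocal rank fusion (RRF), a common way to merge two ranked lists; the function name, page IDs, and the `k=60` constant are illustrative assumptions, not details from the paper.

```python
def rrf_fuse(text_ranking, image_ranking, k=60):
    """Merge two ranked lists of page IDs with reciprocal rank fusion.

    Each page scores 1 / (k + rank) per list it appears in; pages ranked
    highly by either retriever rise toward the top of the fused list.
    """
    scores = {}
    for ranking in (text_ranking, image_ranking):
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # Sort page IDs by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: the text and image retrievers disagree on the top page,
# but the page ranked well by both ("p1") wins after fusion.
fused = rrf_fuse(["p3", "p1", "p7"], ["p1", "p9", "p3"])
# fused[0] == "p1"
```

Score-level fusion (e.g., a weighted sum of normalized similarity scores) is an alternative when both retrievers expose comparable scores; RRF needs only the rankings, which makes it easy to combine BM25 with embedding-based retrieval.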
Computer Science > Information Retrieval, arXiv:2602.17687 (cs)
Submitted on 5 Feb 2026
Authors: Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt
Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, ena…
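The abstract reports retrieval quality as Recall@1, Recall@5, and Recall@20: the fraction of questions for which the gold page appears among the top-k retrieved pages. A minimal sketch of that metric (variable names and the toy page IDs are assumptions for illustration, not data from the benchmark):

```python
def recall_at_k(retrieved, gold_pages, k):
    """Fraction of questions whose gold page appears in the top-k results.

    `retrieved` is one ranked list of page IDs per question;
    `gold_pages` holds the single correct page ID for each question.
    """
    hits = sum(1 for ranked, gold in zip(retrieved, gold_pages)
               if gold in ranked[:k])
    return hits / len(gold_pages)

# Toy example with three questions: the gold page is found in the
# top-2 results for the first two questions but missed for the third.
retrieved = [["p1", "p2"], ["p4", "p3"], ["p9", "p8"]]
gold = ["p1", "p3", "p7"]
score = recall_at_k(retrieved, gold, 2)  # 2 of 3 questions hit
```

Because each question here targets a single "needle" page, Recall@k at k=1 coincides with top-1 accuracy, and the metric is monotonically non-decreasing in k, which is why the paper's Recall@20 figures exceed its Recall@5 figures.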