[2602.22426] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Summary
The paper introduces SimpleOCR, a training strategy that renders text questions directly onto images so that Multimodal Large Language Models (MLLMs) must genuinely read visual text rather than rely on textual shortcuts in the prompt.
Why It Matters
As MLLMs evolve, understanding their ability to read text in images is crucial. This research highlights a performance gap in existing models and proposes a novel training strategy that could significantly improve their visual text extraction capabilities, impacting various applications in AI and computer vision.
Key Takeaways
- Introduces Visualized-Question (VQ) setting to evaluate MLLMs' reading capabilities.
- Identifies a performance degradation of up to 12.7% in MLLMs when using VQ.
- Proposes SimpleOCR, a training strategy that enhances visual text extraction without architectural changes.
- Demonstrates significant performance improvements on OOD benchmarks with fewer training samples.
- Highlights the compatibility of SimpleOCR with advanced reinforcement learning strategies.
Paper Details
Computer Science > Computer Vision and Pattern Recognition, arXiv:2602.22426 (cs). Submitted on 25 Feb 2026.
Title: SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Authors: Yibo Peng, Peng Xia, Ding Zhong, Kaide Zeng, Siwei Han, Yiyang Zhou, Jiaqi Liu, Ruiyi Zhang, Huaxiu Yao
Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimiz...
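The abstract describes transforming training samples into the VQ format by rendering the question onto the image with randomized styles. The paper does not publish its rendering code here, so the following is only a minimal sketch of what such a transformation could look like, using Pillow; the function name, banner layout, and style choices (random placement and colors) are illustrative assumptions, not the authors' implementation.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_visualized_question(image, question, seed=None):
    """Sketch of a VQ-style transform: draw the text question onto the
    image in a banner with a randomized style, so a model cannot rely
    on the text prompt or memorize one fixed layout. (Illustrative only;
    not the paper's actual rendering pipeline.)"""
    rng = random.Random(seed)
    img = image.convert("RGB").copy()
    w, h = img.size
    banner_h = max(24, h // 8)
    # Randomized style: banner goes on top or bottom, with random
    # light background and dark text colors.
    on_top = rng.random() < 0.5
    bg = tuple(rng.randint(200, 255) for _ in range(3))
    fg = tuple(rng.randint(0, 60) for _ in range(3))
    canvas = Image.new("RGB", (w, h + banner_h), bg)
    canvas.paste(img, (0, banner_h if on_top else 0))
    draw = ImageDraw.Draw(canvas)
    y = 4 if on_top else h + 4
    draw.text((4, y), question, fill=fg, font=ImageFont.load_default())
    return canvas

# Example: convert one (image, question) training sample to VQ format.
# The accompanying text prompt would then be question-free, e.g.
# "Answer the question shown in the image."
sample = {"image": Image.new("RGB", (128, 96), "white"),
          "question": "What is written on the sign?"}
vq_image = render_visualized_question(sample["image"], sample["question"], seed=0)
```

Because the question is only recoverable from pixels, any correct answer structurally requires the model to read the rendered text, which is the constraint SimpleOCR exploits during training.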