[2602.15257] How to Train Your Long-Context Visual Document Model
Summary
This paper presents a comprehensive study of training long-context visual document models, achieving state-of-the-art performance on long-document visual question answering with measured transfer to long-context text tasks.
Why It Matters
As the demand for advanced AI models that can handle extensive contextual information grows, this research provides critical insights into effective training methodologies, enhancing the capabilities of visual document processing and question answering systems. The findings can influence future developments in AI, particularly in applications requiring deep understanding of both text and visual data.
Key Takeaways
- Training on context lengths matching evaluation contexts yields better performance.
- Using page indices during training significantly enhances long-document performance.
- Synthetic data pipelines facilitate self-improvement through continued pretraining and supervised finetuning.
- The study demonstrates that visual long context training can improve long-context text performance.
- A corrected benchmark dataset (MMLBD-C) is released to improve evaluation quality.
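As a rough illustration of the page-index finding above: the paper reports that simply exposing page indices during training and evaluation boosts long-document performance. The sketch below shows one plausible way to interleave page-index tags with per-page image placeholders when building a prompt. The tag format (`<page i>`, `<image>`) and the `build_prompt` helper are assumptions for illustration, not the paper's actual prompt template.

```python
def build_prompt(num_pages: int, question: str) -> str:
    """Interleave hypothetical page-index tags with image placeholders.

    NOTE: the exact tags used in the paper are not specified; `<page i>`
    and `<image>` here are illustrative stand-ins for whatever markers
    and visual-token placeholders a given VLM's processor expects.
    """
    parts = []
    for i in range(1, num_pages + 1):
        parts.append(f"<page {i}>")   # explicit page index seen by the model
        parts.append("<image>")       # placeholder for this page's visual tokens
    parts.append(f"Question: {question}")
    return "\n".join(parts)

# Example: a two-page document with one question appended at the end.
print(build_prompt(2, "What is the total revenue?"))
```

The idea is that an explicit index next to each page gives the model a cheap positional anchor for "which page says what", which matters when answers must be localized inside hundreds of pages.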
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15257 (cs) [Submitted on 16 Feb 2026]
Title: How to Train Your Long-Context Visual Document Model
Author: Austin Veselka
Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to lo...