[2602.15257] How to Train Your Long-Context Visual Document Model
Summary
This paper presents a comprehensive study of training long-context visual document models, achieving state-of-the-art performance on long-document visual question answering with measured transfer to long-context text tasks.
Why It Matters
As the demand for advanced AI models that can handle extensive contextual information grows, this research provides critical insights into effective training methodologies, enhancing the capabilities of visual document processing and question answering systems. The findings can influence future developments in AI, particularly in applications requiring deep understanding of both text and visual data.
Key Takeaways
- Training on context lengths matching evaluation contexts yields better performance.
- Using page indices during training significantly enhances long-document performance.
- Synthetic data pipelines facilitate self-improvement through continued pretraining and supervised finetuning.
- The study demonstrates that visual long context training can improve long-context text performance.
- A corrected benchmark dataset (MMLBD-C) is released to improve evaluation quality.
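As a rough illustration of the page-index finding above: the paper reports that simply exposing page indices during training and evaluation boosts long-document performance. The sketch below shows one plausible way to interleave page-index tags with per-page image placeholders when building a prompt. The tag format (`<page i>`, `<image>`) and the `build_prompt` helper are assumptions for illustration, not the paper's actual prompt template.

```python
def build_prompt(num_pages: int, question: str) -> str:
    """Interleave hypothetical page-index tags with image placeholders.

    NOTE: the exact tags used in the paper are not specified; `<page i>`
    and `<image>` here are illustrative stand-ins for whatever markers
    and visual-token placeholders a given VLM's processor expects.
    """
    parts = []
    for i in range(1, num_pages + 1):
        parts.append(f"<page {i}>")   # explicit page index seen by the model
        parts.append("<image>")       # placeholder for this page's visual tokens
    parts.append(f"Question: {question}")
    return "\n".join(parts)

# Example: a two-page document with one question appended at the end.
print(build_prompt(2, "What is the total revenue?"))
```

The idea is that an explicit index next to each page gives the model a cheap positional anchor for "which page says what", which matters when answers must be localized inside hundreds of pages.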
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15257 (cs) [Submitted on 16 Feb 2026]
Title: How to Train Your Long-Context Visual Document Model
Author: Austin Veselka
Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to lo...