[2602.15257] How to Train Your Long-Context Visual Document Model

arXiv - AI · 3 min read

Summary

This article presents a comprehensive study on training long-context visual document models, achieving state-of-the-art performance in visual question answering and long-context text tasks.

Why It Matters

As the demand for advanced AI models that can handle extensive contextual information grows, this research provides critical insights into effective training methodologies, enhancing the capabilities of visual document processing and question answering systems. The findings can influence future developments in AI, particularly in applications requiring deep understanding of both text and visual data.

Key Takeaways

  • Training on context lengths matching evaluation contexts yields better performance.
  • Using page indices during training significantly enhances long-document performance.
  • Synthetic data pipelines facilitate self-improvement through continued pretraining and supervised finetuning.
  • The study demonstrates that visual long context training can improve long-context text performance.
  • A corrected benchmark dataset (MMLBD-C) is released to improve evaluation quality.
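The page-index finding above can be illustrated with a small sketch. This is an illustrative assumption about how such a prompt might be assembled, not the paper's actual data pipeline; the function name, the `<image:...>` placeholder, and the message layout are all hypothetical.

```python
# Hypothetical sketch of takeaway (ii): interleave explicit "Page k:" markers
# with page images when building a long-document VQA prompt, so the model can
# ground answers to specific pages. All names and formats are illustrative.

def build_prompt(page_images, question):
    """Interleave page-index markers with page-image placeholders."""
    parts = []
    for idx, img in enumerate(page_images, start=1):
        parts.append(f"Page {idx}:")
        parts.append(f"<image:{img}>")  # stand-in for a real image token
    parts.append(f"Question: {question}")
    return "\n".join(parts)

prompt = build_prompt(["p1.png", "p2.png"], "What is the total on page 2?")
print(prompt)
```

The point of the sketch is only that the index is part of the training and evaluation context, so the model learns to associate content with page positions.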

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.15257 (cs) · Submitted on 16 Feb 2026

Title: How to Train Your Long-Context Visual Document Model
Authors: Austin Veselka

Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long-context transfer to the reverse, showing that visual long-context training transfers to long-context text performance.
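Finding (i) above can be sketched in code: rather than packing training samples toward the model's 344K maximum, pack them toward the context length used at evaluation. This is a minimal greedy-packing sketch under that assumption; the function and the concrete token counts are illustrative, not the authors' pipeline.

```python
# Hedged sketch of finding (i): bucket training samples so packed sequence
# length approaches the *evaluation* context length, not the model's maximum
# supported context. Sample token counts below are made up for illustration.

def pack_to_context(sample_lengths, target_ctx):
    """Greedily group samples (given as token counts) into packs near target_ctx."""
    packs, current, current_len = [], [], 0
    for n in sample_lengths:
        if current_len + n > target_ctx and current:
            packs.append(current)          # close the pack before it overflows
            current, current_len = [], 0
        current.append(n)
        current_len += n
    if current:
        packs.append(current)
    return packs

# If evaluation uses ~32K context, pack training sequences toward 32K,
# not toward a 344K maximum.
packs = pack_to_context([10_000, 15_000, 12_000, 9_000], 32_000)
print(packs)  # → [[10000, 15000], [12000, 9000]]
```

The design choice the paper's finding motivates is the `target_ctx` value: matching it to the evaluation context outperformed training on longer contexts.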
