[2603.28554] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.28554 (cs)
[Submitted on 30 Mar 2026]

Title: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Authors: Athos Georgiou

Abstract: Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate sc...
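The abstract's central mechanism, a single LoRA adapter toggled at inference so that the disabled path recovers the base model exactly, can be sketched with a toy linear layer. This is an illustrative numpy sketch, not the paper's implementation; the class name and shapes are assumptions, but it shows why disabling the adapter yields byte-identical base outputs: the low-rank term is simply never added.

```python
import numpy as np

class ToggleLoRALinear:
    """Toy linear layer with a toggleable LoRA adapter (illustrative only).

    Enabled  -> output includes the low-rank update (retrieval head).
    Disabled -> output is exactly the frozen base projection, so the base
    model's behavior is recovered bit-for-bit, as the paper reports.
    """

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight; never modified by adapter training.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # LoRA factors: down-projection A and up-projection B (B starts at 0,
        # the standard LoRA initialization, so the adapter is a no-op at init).
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.enabled = False

    def __call__(self, x):
        y = x @ self.W.T  # base path, identical whether or not LoRA exists
        if self.enabled:
            # Low-rank residual is added only when the adapter is toggled on.
            y = y + (x @ self.A.T) @ self.B.T
        return y
```

Because the disabled branch executes the same floating-point operations as a plain base layer, its outputs match the base model exactly rather than merely approximately.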
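The "ColBERT-style late-interaction retrieval" the abstract refers to scores a query against a document by letting each query token embedding pick its best-matching document token embedding (MaxSim), then summing. A minimal sketch, assuming L2-normalized multi-vector embeddings (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(q, d):
    """ColBERT-style late-interaction score.

    q: (n_query_tokens, dim) query token embeddings, L2-normalized.
    d: (n_doc_tokens, dim) document token embeddings, L2-normalized.
    Each query token contributes its maximum cosine similarity over all
    document tokens; the score is the sum over query tokens.
    """
    sims = q @ d.T               # (n_query_tokens, n_doc_tokens) cosine sims
    return sims.max(axis=1).sum()
```

Keeping per-token ("multi-vector") embeddings, rather than pooling to one vector per page, is what makes this "late" interaction: query and document only interact at scoring time, so document embeddings can be precomputed and indexed.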