[2604.00007] Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
Computer Science > Computation and Language
arXiv:2604.00007 (cs)
[Submitted on 9 Mar 2026]

Title: Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
Authors: Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do

Abstract: We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source uni...
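The abstract describes generation as masked diffusion over a shared discrete token space with iterative refinement under bidirectional context. The sketch below is a minimal, generic illustration of that style of decoding loop (confidence-based iterative unmasking), not the paper's actual implementation; the `model` interface, function names, and the unmasking schedule are assumptions for illustration only.

```python
import torch

def masked_diffusion_decode(model, prompt_ids, gen_len, mask_id, num_steps=8):
    """Iteratively refine a fully masked continuation under bidirectional context.

    Assumes `model` maps a (1, seq_len) token tensor to (1, seq_len, vocab) logits
    over a shared discrete vocabulary; these names are hypothetical, not from the paper.
    """
    device = prompt_ids.device
    seq = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, device=device)])
    gen_start = prompt_ids.numel()

    for step in range(num_steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)     # (seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1) # per-position confidence

        still_masked = seq == mask_id
        still_masked[:gen_start] = False                # never modify the prompt

        # Unmask the most confident positions this step; leave the rest masked
        # so later steps can refine them with richer bidirectional context.
        num_to_unmask = max(1, int(still_masked.sum().item() / (num_steps - step)))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(num_to_unmask).indices
        seq[idx] = pred[idx]

    return seq[gen_start:]
```

In this kind of scheme, the same loop can in principle produce text, image, or speech tokens, since all modalities share one discrete token space; the per-step unmasking budget here is a simple linear schedule chosen for readability.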