[2508.08275] MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis
Summary
The paper presents MLLM-CTBench, a benchmark for continual instruction tuning of multimodal large language models, addressing the lack of rigorous, protocol-consistent evaluation when adapting these models to evolving real-world demands.
Why It Matters
As multimodal large language models evolve, effective continual instruction tuning is essential for their adaptability. MLLM-CTBench provides a structured evaluation framework that enhances understanding of model performance and resilience against knowledge degradation, which is crucial for advancing AI applications.
Key Takeaways
- MLLM-CTBench introduces a multidimensional evaluation framework for continual instruction tuning.
- Process-level reasoning quality is more resilient to forgetting than final-answer accuracy.
- Stronger baseline models show greater resistance to catastrophic forgetting.
- On-policy reinforcement fine-tuning (GRPO) offers stable cross-task knowledge retention.
- The study expands the scope of continual learning methods beyond supervised fine-tuning.
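The second takeaway, that process-level reasoning quality forgets more slowly than final-answer accuracy, can be made concrete with the standard continual-learning forgetting metric (best past score on a task minus its final score). The sketch below is illustrative only: the score matrices are invented toy numbers, not results from the paper.

```python
# Hedged sketch: average forgetting over a task sequence.
# A[t][j] = score on task j measured after training on task t,
# for tasks trained in order 0..T-1 (toy values, not paper data).

def average_forgetting(A):
    """Mean over earlier tasks of (best past score - final score).

    Higher values indicate more catastrophic forgetting.
    """
    T = len(A)
    drops = []
    for j in range(T - 1):  # the last-trained task cannot have been forgotten yet
        best_past = max(A[t][j] for t in range(j, T - 1))
        drops.append(best_past - A[T - 1][j])
    return sum(drops) / len(drops)

# Toy matrices: final-answer accuracy drops sharply on earlier tasks,
# while a process-level reasoning-quality score degrades more gently.
answer_acc = [
    [0.80, 0.10, 0.05],
    [0.55, 0.78, 0.12],
    [0.40, 0.60, 0.75],
]
reasoning_q = [
    [0.70, 0.30, 0.25],
    [0.62, 0.72, 0.33],
    [0.58, 0.66, 0.71],
]

print(round(average_forgetting(answer_acc), 3))   # → 0.29
print(round(average_forgetting(reasoning_q), 3))  # → 0.09
```

Under these hypothetical numbers, answer accuracy forgets roughly three times as much as reasoning quality, which is the qualitative pattern the benchmark's process-level diagnosis is designed to expose.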
Computer Science > Computation and Language
arXiv:2508.08275 (cs)
[Submitted on 31 Jul 2025 (v1), last revised 13 Feb 2026 (this version, v3)]
Title: MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis
Authors: Haiyun Guo, Zhiyan Hou, Yandu Sun, Jinghan He, Yu Chen, Yuzhe Zhou, Yuheng Jia, Jinqiao Wang, Tat-Seng Chua
Abstract: Continual instruction tuning (CIT) during the post-training phase is crucial for adapting multimodal large language models (MLLMs) to evolving real-world demands. However, progress is hampered by the lack of benchmarks with rigorous, protocol-consistent evaluation. To bridge this gap, we introduce MLLM-CTBench, a comprehensive benchmark for CIT of MLLMs, covering seven challenging tasks across six diverse domains. MLLM-CTBench makes three key contributions. First, we establish a multidimensional evaluation framework that jointly assesses final-answer accuracy and process-level reasoning quality, where Chain-of-Thought (CoT) traces serve as an observable signal to diagnose catastrophic forgetting beyond answer-only evaluation. Second, we conduct a large-scale evaluation of continual learning methods by systematically assessing eight representative algorithms from four major families under a unified protocol across task orders, p...
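One takeaway singles out on-policy reinforcement fine-tuning with GRPO (Group Relative Policy Optimization) for its stable cross-task retention. The core idea in GRPO is to normalize rewards within a group of sampled responses to the same prompt, replacing a learned value baseline. The snippet below sketches only that group-relative advantage step, with toy reward values; it does not reproduce the paper's training setup.

```python
# Hedged sketch of GRPO's group-relative advantage: each response's
# reward is standardized against its own group's mean and std, so no
# separate value function is needed. Toy values for illustration only.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage_i = (reward_i - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a reward model (toy values).
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 2) for a in adv])  # above-mean responses get positive advantage
```

Because advantages are centered within each group, they sum to zero: responses are pushed up or down only relative to their peers, which is one intuition for why on-policy updates of this form can perturb prior-task behavior less than aggressive supervised fine-tuning.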