[2602.21225] Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
Summary
This paper investigates architecture-agnostic curriculum learning for document understanding, demonstrating consistent training-time reductions across architecturally distinct models (text-only BERT and multimodal LayoutLMv3).
Why It Matters
The findings provide insights into how curriculum learning can optimize training processes for document understanding models, potentially leading to more efficient AI systems. This is particularly relevant as the demand for effective document processing continues to grow in various applications.
Key Takeaways
- Progressive data scheduling can reduce training time by approximately 33%.
- Curriculum learning yields a statistically significant F1 gain for capacity-constrained models like BERT (ΔF1 = +0.023, p = 0.022 on FUNSD).
- No performance gains were observed for LayoutLMv3, indicating model capacity influences curriculum effectiveness.
- The study highlights the importance of task complexity in determining curriculum benefits.
- Findings suggest that curriculum learning can be a reliable strategy for compute reduction across different model families.
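The progressive data schedule in the first takeaway can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name and the assumption of three equal-length phases are mine; only the 33%→67%→100% exposure fractions come from the paper.

```python
# Hypothetical sketch of progressive data scheduling (33% -> 67% -> 100%).
# The training loop sees a growing prefix of the dataset in three phases.
def progressive_schedule(dataset, total_epochs, fractions=(0.33, 0.67, 1.0)):
    """Yield (epoch, subset) pairs, exposing more data in each phase."""
    phase_len = total_epochs // len(fractions)  # assume equal-length phases
    for epoch in range(total_epochs):
        phase = min(epoch // phase_len, len(fractions) - 1)
        n = int(len(dataset) * fractions[phase])
        yield epoch, dataset[:n]

# Toy usage: 9 epochs over a 12-item "dataset".
exposures = [len(subset) for _, subset in progressive_schedule(list(range(12)), 9)]
# exposures -> [3, 3, 3, 8, 8, 8, 12, 12, 12]
```

In a real pipeline the subset would typically be re-shuffled each epoch and drawn by a difficulty ordering rather than a fixed prefix; the excerpt does not specify the paper's ordering criterion.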
Computer Science > Computation and Language
arXiv:2602.21225 (cs) [Submitted on 2 Feb 2026]
Title: Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
Authors: Mohammed Hamdan, Vincenzo Dentamaro, Giuseppe Pirlo, Mohamed Cheriet
Abstract: We investigate whether progressive data scheduling -- a curriculum learning strategy that incrementally increases training data exposure (33% → 67% → 100%) -- yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33%, commensurate with the reduction from 10.0 to 6.67 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT (ΔF1 = +0.023, p = 0.022, d_z = 3.83), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast,...