[2602.14078] Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning
Summary
This paper presents adaptive entropy annealing applied to an Expected Policy Gradient method (aEPG), a training strategy that improves continual fine-tuning of large pretrained vision models by directly minimizing misclassification error through a reinforcement learning formulation.
Why It Matters
The research addresses catastrophic forgetting, the tendency of models to lose previously learned abilities when adapted to new tasks. By proposing a training strategy that outperforms standard cross-entropy fine-tuning across class-incremental benchmarks, this work contributes to building more robust AI systems capable of continual learning.
Key Takeaways
- Introduces aEPG, a method that transitions from exploratory to exploitative learning.
- Demonstrates that shifting toward low-entropy (exploitative) updates improves model adaptation.
- Outperforms traditional cross-entropy loss methods across diverse benchmarks.
- Formulates classification as a one-step Markov Decision Process, allowing the true 0-1 loss to be optimized directly with a low-variance policy gradient.
- Highlights the importance of prioritizing high-confidence samples in training.
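The CE-as-weighted-EPG relationship in the takeaways can be made concrete. In a one-step MDP with 0-1 reward, the expected reward is just the probability p_y assigned to the true class, so the EPG loss is -p_y while CE uses -log p_y; differentiating shows the EPG logit gradient equals the CE gradient scaled by p_y, which is exactly why EPG up-weights high-confidence samples. A minimal NumPy sketch of this relationship (function names are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(logits, y):
    """Cross-entropy surrogate: -log p_y."""
    return -np.log(softmax(logits)[y])

def epg_loss(logits, y):
    """Expected Policy Gradient objective for a one-step MDP with
    0-1 reward: expected reward is p_y, so the loss is -p_y."""
    return -softmax(logits)[y]

def logit_grads(logits, y):
    """Analytic gradients of both losses w.r.t. the logits."""
    p = softmax(logits)
    onehot = np.eye(len(p))[y]
    g_ce = p - onehot      # gradient of -log p_y
    g_epg = p[y] * g_ce    # gradient of -p_y: the CE gradient scaled by p_y
    return g_ce, g_epg

logits = np.array([2.0, 0.5, -1.0])
g_ce, g_epg = logit_grads(logits, y=0)
# EPG's per-sample weight is the model's confidence p_y on the true class,
# so confident samples dominate the update; CE's implicit 1/p_y weighting
# instead emphasizes low-confidence samples.
```

The scaling factor p_y is the whole story: relative to CE, EPG multiplies each sample's gradient by the model's confidence, trading exploration for exploitation.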
Computer Science > Machine Learning
arXiv:2602.14078 (cs)
[Submitted on 15 Feb 2026]

Title: Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning
Authors: Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet

Abstract: Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimation. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods out...
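The exploratory-to-exploitative transition the abstract describes can be sketched as an interpolation between the CE and EPG losses. The abstract does not specify the actual annealing rule, so the entropy-driven weight below is a hypothetical stand-in: it moves toward the exploitative EPG term as the prediction entropy falls.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_weight(logits):
    """Hypothetical adaptive weight in [0, 1]: 0 at maximum entropy
    (uniform predictions, CE-like regime), approaching 1 as entropy
    falls (confident predictions, EPG-like regime)."""
    p = softmax(logits)
    h = -(p * np.log(p)).sum()
    return 1.0 - h / np.log(len(p))

def aepg_loss(logits, y):
    """Blend of the CE surrogate (-log p_y) and the EPG loss (-p_y),
    annealed by the entropy-driven weight above."""
    p = softmax(logits)
    lam = entropy_weight(logits)
    return (1.0 - lam) * (-np.log(p[y])) + lam * (-p[y])
```

With uniform logits the weight is 0 and the loss reduces to plain cross-entropy; as the model grows confident the EPG term dominates, matching the exploratory-to-exploitative schedule the abstract sketches.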