[2508.21421] Rethinking Layer-wise Model Merging through Chain of Merges
Summary
This paper presents Chain of Merges (CoM), a method for merging fine-tuned models into a single model without retraining. Existing techniques merge each layer in isolation. CoM instead accounts for inter-layer dependencies by merging layers sequentially and updating activation statistics as it goes.
Why It Matters
As the number of specialized fine-tuned models grows, merging them into one model without retraining becomes increasingly important. The paper argues that merging layers independently causes distributional mismatches between layers, and it proposes a procedure that mitigates those mismatches to improve the performance of the merged model.
Key Takeaways
- Current merging techniques operate on individual layers and overlook inter-layer dependencies, so changes in early layers are not propagated to downstream layers.
- The resulting distributional mismatches are most pronounced in methods that rely on intermediate activations.
- The authors identify these mismatches as a form of internal covariate shift, comparable to what occurs in the initial phases of neural network training.
- The proposed Chain of Merges (CoM) method merges weights layer by layer while sequentially updating activation statistics.
- According to the summary, CoM achieves state-of-the-art performance on standard benchmarks.
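The sequential procedure named in the takeaways above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy network of bias-free linear layers with ReLU activations, and it uses a RegMean-style least-squares merge as a stand-in for an activation-based merging rule; the paper's actual rule may differ. The function names `regmean_merge` and `chain_of_merges`, the `eps` regularizer, and the calibration batches are all hypothetical.

```python
import numpy as np


def regmean_merge(weights, grams, eps=1e-6):
    """RegMean-style merge (assumed stand-in for the paper's merging rule).

    Solves argmin_W sum_i ||X_i W - X_i W_i||^2, whose closed form is
    W = (sum_i G_i)^-1 sum_i G_i W_i with G_i = X_i^T X_i.
    """
    d_in = weights[0].shape[0]
    lhs = sum(grams) + eps * np.eye(d_in)  # small ridge term for stability
    rhs = sum(g @ w for g, w in zip(grams, weights))
    return np.linalg.solve(lhs, rhs)


def chain_of_merges(task_weights, task_inputs):
    """Merge layers one at a time, refreshing activation statistics each step.

    task_weights[i][l]: weight of shape (d_in, d_out) for layer l of model i.
    task_inputs[i]:     calibration batch of shape (n, d_in) for task i.
    """
    num_layers = len(task_weights[0])
    acts = [x.copy() for x in task_inputs]
    merged = []
    for layer in range(num_layers):
        # Statistics come from activations produced by the layers merged so
        # far, not from each fine-tuned model's own activations.
        grams = [a.T @ a for a in acts]
        w = regmean_merge([tw[layer] for tw in task_weights], grams)
        merged.append(w)
        # Propagate every task's batch through the newly merged layer.
        acts = [np.maximum(a @ w, 0.0) for a in acts]
    return merged


# Hypothetical usage: two 2-layer models fine-tuned on different tasks.
demo_rng = np.random.default_rng(42)
model_a = [demo_rng.normal(size=(4, 5)), demo_rng.normal(size=(5, 3))]
model_b = [demo_rng.normal(size=(4, 5)), demo_rng.normal(size=(5, 3))]
batches = [demo_rng.normal(size=(32, 4)), demo_rng.normal(size=(32, 4))]
demo_merged = chain_of_merges([model_a, model_b], batches)
print([w.shape for w in demo_merged])
```

The design point is in the loop: a purely layer-wise method would compute every Gram matrix once from the fine-tuned models' own activations, whereas here each layer's statistics are recomputed after the preceding layers have been merged.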
Computer Science > Machine Learning
arXiv:2508.21421 (cs)
Submitted on 29 Aug 2025 (v1), last revised 25 Feb 2026 (this version, v3)
Title: Rethinking Layer-wise Model Merging through Chain of Merges
Authors: Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Abstract: Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural networks training. To address this, we propose Chain of Merges (CoM), a layer-wise merging procedure that sequentially merges weights across layers while sequentially updating activation statistics. By explicitl...