[2602.21233] AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
Summary
AngelSlim introduces a versatile toolkit for large model compression, integrating advanced algorithms for efficient deployment and improved performance in AI applications.
Why It Matters
As AI models grow in size and complexity, efficient model compression becomes crucial for practical deployment. AngelSlim addresses this need by providing a comprehensive toolkit that enhances performance while maintaining output accuracy, making it relevant for researchers and developers in machine learning and AI.
Key Takeaways
- AngelSlim consolidates various model compression techniques into a unified toolkit.
- Its speculative decoding framework achieves 1.8x to 2.0x throughput gains without sacrificing output correctness.
- The toolkit supports multimodal architectures and modern inference engines.
- Innovative pruning strategies optimize performance for vision and audio tokens.
- AngelSlim is designed for both algorithm-focused research and practical deployment.
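To make the quantization takeaway concrete, here is a minimal sketch of symmetric per-tensor INT8 post-training quantization, the simplest form of the PTQ family the toolkit covers. This is illustrative only: AngelSlim's actual algorithms involve calibration data, per-channel scales, and FP8/INT2 regimes, and the function names below are hypothetical.

```python
import numpy as np

def int8_ptq(weights: np.ndarray):
    """Symmetric per-tensor INT8 PTQ (illustrative sketch, not AngelSlim's API).

    The scale maps the largest absolute weight onto the INT8 range
    [-127, 127]; each weight is then rounded to the nearest integer step.
    """
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original FP32 weights."""
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix; reconstruction error is bounded
# by half a quantization step.
w = np.random.randn(4, 4).astype(np.float32)
q, s = int8_ptq(w)
w_hat = dequantize(q, s)
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

The error bound in the final assertion is what "maintaining output accuracy" relies on at INT8: each weight moves by at most half a quantization step, so activations change only slightly.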
Computer Science > Machine Learning — arXiv:2602.21233 (cs)
[Submitted on 7 Feb 2026]
Authors: Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu
Abstract: This technical report introduces AngelSlim, a comprehensive and versatile toolkit for large model compression developed by the Tencent Hunyuan team. By consolidating cutting-edge algorithms, including quantization, speculative decoding, token pruning, and distillation, AngelSlim provides a unified pipeline that streamlines the transition from model compression to industrial-scale deployment. To enable efficient acceleration, we integrate state-of-the-art FP8 and INT8 Post-Training Quantization (PTQ) algorithms alongside pioneering research in ultra-low-bit regimes, featuring HY-1.8B-int2 as the first industrially viable 2-bit large model. Beyond quantization, we propose a training-aligned speculative decoding framework compatible with multimodal architectures and modern inference engines, achieving 1.8x to 2.0x throughput gains without compromising output correctness. Furthermore, we develop a training-free sparse attention framework that ...
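The abstract's speculative decoding claim — faster generation "without compromising output correctness" — comes from a draft-then-verify control flow that can be sketched in a few lines. The sketch below is a toy greedy version with stand-in callables for the models; AngelSlim's training-aligned framework and real inference engines use probabilistic acceptance and batched verification, so treat everything here as an assumption-laden illustration.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding (illustrative, not AngelSlim's API).

    `target` and `draft` are stand-in callables mapping a token list to the
    next token. The cheap draft model proposes k tokens; the large target
    model verifies them and keeps only the prefix it agrees with, so the
    final output is identical to decoding with the target model alone.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model proposes k tokens cheaply.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model checks each proposal; accept the longest
        #    matching prefix.
        accepted = 0
        for i, t in enumerate(proposal):
            if target(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        # 3. On the first mismatch, emit the target's own token instead.
        if accepted < k:
            tokens.append(target(tokens))
    return tokens[: len(prompt) + max_new]

# Toy deterministic "models": next token is (last token + 1) mod 10.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10
print(speculative_decode(target, draft, [0], k=4, max_new=5))  # [0, 1, 2, 3, 4, 5]
```

Because every accepted token was verified by the target model, the output matches plain target-only decoding; the speedup comes from verifying a whole k-token draft in one target pass rather than generating token by token, which is where the reported 1.8x to 2.0x throughput gains would originate.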