[2602.15878] IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation
Summary
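The paper presents IT-OSE, an information-theoretic method for estimating the optimal sample size for data augmentation in industrial settings. Compared with empirical estimation, it raises classification accuracy by an average of 4.38% and lowers regression MAPE by an average of 18.80% across baseline models.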
Why It Matters
Data augmentation is not unidirectionally beneficial, so the number of augmented samples affects both model accuracy and cost in industrial applications. According to the abstract, there was previously no theoretical research or established estimation for the optimal sample size, and no established metric for judging an estimate against the ground truth. This research addresses that gap with a theoretical framework, an estimation method, and an evaluation score.
Key Takeaways
- IT-OSE improves accuracy in classification tasks by an average of 4.38%.
- Reduces mean absolute percentage error (MAPE) in regression tasks by an average of 18.80%.
- Achieves optimal sample size estimation while significantly lowering computational and data costs.
- Introduces an interval coverage and deviation (ICD) score for intuitively evaluating the estimated optimal sample size (OSS).
- Demonstrates generality across various sensor-based industrial scenarios.
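The MAPE figure in the takeaways can be made concrete with a minimal sketch of the standard mean absolute percentage error. The function names `mape` and `relative_reduction` are illustrative, not from the paper, and the source does not say whether the reported 18.80% is a relative reduction or a difference in percentage points; the helper below assumes the relative reading.

```python
def mape(y_true, y_pred):
    """Standard mean absolute percentage error, returned in percent."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def relative_reduction(baseline, improved):
    """Percent reduction of `improved` relative to `baseline`.

    Assumption: a *relative* reduction; the source does not specify
    how its reported reduction is computed.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not results from the paper.
    before = mape([100.0, 200.0], [110.0, 180.0])
    after = mape([100.0, 200.0], [105.0, 190.0])
    print(before, after, relative_reduction(before, after))
```

Because MAPE divides by the true value, it is undefined for targets equal to zero, which is why the sketch rejects such inputs rather than silently skipping them.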
Computer Science > Machine Learning
arXiv:2602.15878 (cs)
[Submitted on 3 Feb 2026]
Title: IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation
Authors: Mingchun Sun, Rongqiang Zhao, Zhennan Huang, Songyu Ding, Jie Liu
Abstract: In industrial scenarios, data augmentation is an effective approach to improve model performance. However, its effect is not unidirectionally beneficial. There is no theoretical research or established estimation for the optimal sample size (OSS) in augmentation, nor is there an established metric to evaluate the accuracy of an estimated OSS or its deviation from the ground truth. To address these issues, we propose an information-theoretic optimal sample size estimation (IT-OSE) to provide reliable OSS estimation for industrial data augmentation. An interval coverage and deviation (ICD) score is proposed to evaluate the estimated OSS intuitively. The relationship between OSS and dominant factors is theoretically analyzed and formulated, thereby enhancing interpretability. Experiments show that, compared to empirical estimation, IT-OSE increases accuracy in classification tasks across baseline models by an average of 4.38% and reduces MAPE in regression tasks across baseline models by an average of 18.80%. The improvements in downstream model performance are also more stable. ICDdev in the ICD ...