[2602.13104] Random Forests as Statistical Procedures: Design, Variance, and Dependence

arXiv - Machine Learning · 3 min read

Summary

This paper develops a design-based, finite-sample perspective on random forests, deriving an exact variance identity for the forest predictor and showing how dependence between trees shapes predictive variability.

Why It Matters

Understanding random forests as statistical designs rather than mere algorithms gives deeper insight into their predictive behavior and its limits. This perspective matters for researchers and practitioners in machine learning and statistics because it clarifies how design choices such as resampling, feature-level randomization, and split selection influence model variance.

Key Takeaways

  • Random forests can be viewed as finite-sample statistical designs acting on a fixed dataset.
  • The paper introduces an exact variance identity that separates finite-aggregation variability from a structural dependence term (see the sketch after this list).
  • Increasing the number of trees alone does not eliminate predictive variability, because the design mechanisms induce a strict covariance floor.
  • The key mechanisms are reuse of training observations and alignment of data-adaptive partitions.
  • Understanding these mechanisms can improve how random forests are designed and applied.
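
A rough sketch of the identity behind the second and third bullets (the notation here is ours, not necessarily the paper's): if the B trees are exchangeable at a query point x, with common per-tree variance sigma^2(x) and common pairwise covariance gamma(x), the classical decomposition of the variance of an average gives

```latex
% Sketch under an exchangeability assumption; \sigma^2(x) and \gamma(x) are our notation.
\operatorname{Var}\!\big(\hat f_B(x)\big)
  = \operatorname{Var}\!\Big(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\Big)
  = \underbrace{\frac{\sigma^2(x) - \gamma(x)}{B}}_{\text{finite-aggregation variability}}
  \;+\; \underbrace{\gamma(x)}_{\text{structural dependence}}
% As B \to \infty only \gamma(x) remains: a covariance floor that extra trees cannot remove.
```

As B grows the first term vanishes, but gamma(x) persists, which is why the takeaways above say that adding trees alone cannot eliminate predictive variability.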

Statistics > Machine Learning · arXiv:2602.13104 (stat) · Submitted on 13 Feb 2026

Title: Random Forests as Statistical Procedures: Design, Variance, and Dependence
Authors: Nathaniel S. O'Connell

Abstract: Random forests are widely used prediction procedures, yet they are typically described algorithmically rather than as statistical designs acting on a fixed dataset. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs.
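
The abstract's central claim, that forest variance plateaus at an inter-tree covariance floor rather than decaying to zero as trees are added, can be illustrated empirically. The following is a minimal sketch, not the paper's code: it holds a training set fixed, re-randomizes a scikit-learn forest many times, and reports the variance of the prediction at one query point as the number of trees B grows. The data-generating process and all settings below are our own assumptions.

```python
# Hedged sketch (not the paper's code): illustrating the covariance floor.
# With the training data held fixed, only the forest's internal randomization
# (bootstrap resampling and feature subsampling) varies across refits, so the
# variance of the prediction over refits reflects the design mechanisms alone.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Fixed training set: the design-based view conditions on these data.
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
x_test = np.zeros((1, p))  # a single query point

for B in [1, 10, 100, 500]:
    preds = []
    for rep in range(50):  # re-randomize the forest; data stay fixed
        rf = RandomForestRegressor(n_estimators=B, random_state=rep)
        rf.fit(X, y)
        preds.append(rf.predict(x_test)[0])
    print(f"B={B:4d}  Var over re-randomizations: {np.var(preds):.4f}")

# Expected pattern: the B=1 variance approximates single-tree dispersion,
# while for large B the variance stops shrinking like 1/B and settles near
# the inter-tree covariance, i.e. the structural dependence floor.
```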
