[2602.13288] Benchmarking Anomaly Detection Across Heterogeneous Cloud Telemetry Datasets
Summary
This paper benchmarks four deep learning models and a classical baseline for anomaly detection across four heterogeneous cloud telemetry datasets, highlighting how calibration stability and feature-space geometry shape cross-dataset performance.
Why It Matters
Anomaly detection is crucial for keeping cloud systems reliable. Because most models are evaluated on a single dataset, it is hard to know how they generalize; this study benchmarks models across diverse telemetry datasets, yielding guidance for model selection and real-world deployment.
Key Takeaways
- Evaluates four deep learning models (GRU, TCN, Transformer, TSMixer) against Isolation Forest as a classical baseline.
- Highlights the impact of calibration stability and feature-space geometry on model performance.
- Introduces a unified training and evaluation pipeline for consistent analysis across datasets.
- Demonstrates the necessity of testing models on heterogeneous datasets for real-world applicability.
- Provides preprocessing pipelines and evaluation artifacts to support reproducibility.
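The unified pipeline and classical baseline above can be illustrated with a minimal sketch. This is not the authors' code: the window length, synthetic series, and injected anomaly are illustrative assumptions, using scikit-learn's `IsolationForest` only because the paper names that baseline.

```python
# Hedged sketch: scoring sliding windows of a telemetry series with the
# Isolation Forest baseline. All data and parameters here are invented
# for illustration, not taken from the paper's configuration.
import numpy as np
from sklearn.ensemble import IsolationForest

def make_windows(series, window=16):
    """Slice a 1-D series into overlapping windows (one row per window)."""
    return np.stack([series[i:i + window]
                     for i in range(len(series) - window + 1)])

rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, 500)
series[300:310] += 8.0  # injected anomaly burst

X = make_windows(series)
clf = IsolationForest(random_state=0).fit(X)
scores = -clf.score_samples(X)  # higher score = more anomalous
top = int(np.argmax(scores))    # window index overlapping the burst
print(top)
```

A real pipeline would fit on a training split and score held-out data; fitting and scoring on the same windows here just keeps the sketch short.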
Computer Science > Networking and Internet Architecture
arXiv:2602.13288 (cs)
[Submitted on 7 Feb 2026]
Title: Benchmarking Anomaly Detection Across Heterogeneous Cloud Telemetry Datasets
Authors: Mohammad Saiful Islam, Andriy Miranskyy
Abstract: Anomaly detection is important for keeping cloud systems reliable and stable. Deep learning has improved time-series anomaly detection, but most models are evaluated on one dataset at a time. This raises questions about whether these models can handle different types of telemetry, especially in large-scale and high-dimensional environments. In this study, we evaluate four deep learning models: GRU, TCN, Transformer, and TSMixer. We also include Isolation Forest as a classical baseline. The models are tested across four telemetry datasets: the Numenta Anomaly Benchmark, the Microsoft Cloud Monitoring dataset, the Exathlon dataset, and the IBM Console dataset. These datasets differ in structure, dimensionality, and labelling strategy. They include univariate time series, synthetic multivariate workloads, and real-world production telemetry with over 100,000 features. We use a unified training and evaluation pipeline across all datasets. The evaluation includes NAB-style metrics to capture early detection behaviour for datasets where anomalies persist over contiguous ...
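The NAB-style metrics mentioned in the abstract reward detections that arrive early within a labelled anomaly window. The toy function below sketches that idea only: the sigmoid shape loosely follows NAB's scoring curve, but the exact formula and window handling here are simplifying assumptions, not the official NAB implementation.

```python
# Toy sketch of the *idea* behind NAB-style scoring: a detection near the
# start of a labelled anomaly window earns nearly full credit, a detection
# at the window's end earns none, and one outside the window earns zero.
import math

def early_detection_credit(detect_t, win_start, win_end):
    """Return credit in [0, 1): highest at win_start, decaying to 0 at win_end."""
    if not (win_start <= detect_t <= win_end):
        return 0.0  # detection outside the window earns nothing
    # y in [-1, 0]: -1 at the window start (earliest), 0 at the window end
    y = (detect_t - win_end) / (win_end - win_start)
    return 2.0 / (1.0 + math.exp(5.0 * y)) - 1.0

early = early_detection_credit(100, 100, 150)   # at window start
mid = early_detection_credit(125, 100, 150)     # mid-window
late = early_detection_credit(150, 100, 150)    # at window end
print(early, mid, late)
```

This makes the evaluation's intent concrete: two detectors with identical precision and recall can differ sharply under such a metric if one consistently fires later in each anomaly window.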