[2602.15889] Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
Summary
This article investigates whether GPT-4o's average performance varies over time under fixed conditions (identical model snapshot, hyperparameters, and prompt). It reports notable daily and weekly periodic patterns, which challenge the assumption that model performance is time-invariant.
Why It Matters
Understanding the periodic variability in LLM performance is crucial for researchers relying on these models for consistent results. This study highlights potential biases in research findings and emphasizes the need for careful consideration of temporal factors in AI applications.
Key Takeaways
- GPT-4o's average performance shows notable daily and weekly periodic variability, even with a fixed model snapshot, hyperparameters, and prompt.
- Approximately 20% of the total variance in average performance is attributed to these periodic patterns.
- The findings challenge the assumption of time-invariant performance in LLMs.
- Systematic changes in output quality over time threaten the reliability, validity, and reproducibility of findings that rely on LLMs.
- Researchers should account for temporal factors when using LLMs.
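To make the variance figure in the takeaways concrete, here is a minimal sketch of how a periodic share of variance can be read off a Fourier spectrum. It is not the authors' code or analysis pipeline. The function name `periodic_variance_share`, the synthetic scores, and the chosen amplitudes and noise level are all assumptions for illustration; only the three-hour sampling interval and the roughly three-month duration follow the abstract.

```python
import numpy as np


def periodic_variance_share(scores, samples_per_day, periods_days=(1.0, 7.0)):
    """Share of total variance in the Fourier bin nearest each period (in days)."""
    x = np.asarray(scores, dtype=float)
    n = x.size
    x = x - x.mean()  # remove the mean so the zero-frequency bin carries no power

    power = np.abs(np.fft.rfft(x)) ** 2

    # Parseval for real input: interior bins appear twice in the full spectrum,
    # the zero-frequency bin and (for even n) the Nyquist bin appear once.
    weights = np.full(power.size, 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0
    variance_per_bin = weights * power / n**2  # sums to the population variance

    freqs = np.fft.rfftfreq(n, d=1.0 / samples_per_day)  # cycles per day
    total = variance_per_bin.sum()
    shares = {}
    for period in periods_days:
        k = int(np.argmin(np.abs(freqs - 1.0 / period)))
        shares[period] = variance_per_bin[k] / total
    return shares


# Synthetic data only (hypothetical values, not results from the paper):
# one mean score every three hours (8 per day) for 84 days.
rng = np.random.default_rng(0)
t_days = np.arange(84 * 8) / 8.0
synthetic_scores = (
    0.6
    + 0.05 * np.sin(2 * np.pi * t_days)        # assumed daily cycle
    + 0.03 * np.sin(2 * np.pi * t_days / 7.0)  # assumed weekly cycle
    + rng.normal(0.0, 0.08, t_days.size)       # assumed noise level
)
shares = periodic_variance_share(synthetic_scores, samples_per_day=8)
print({period: round(float(share), 3) for period, share in shares.items()})
```

Choosing a record length that is a whole number of weeks places the daily and weekly frequencies exactly on Fourier bins. For other lengths, power leaks into neighbouring bins and a single-bin estimate would understate the periodic share.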
arXiv:2602.15889 (stat), Statistics > Applications
Submitted on 6 Feb 2026
Title: Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
Authors: Paul Tschisgale, Peter Wulff
Abstract: Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o's average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interactio...