[2602.16037] Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection
Summary
This paper examines optimization instability in autonomous agentic workflows for clinical symptom detection. It characterizes a failure mode that standard evaluation metrics obscure and evaluates two interventions aimed at stabilizing performance.
Why It Matters
Understanding optimization instability in AI systems is crucial for improving clinical applications. This research highlights how certain interventions can stabilize performance in low-prevalence tasks, which is vital for accurate symptom detection in healthcare settings.
Key Takeaways
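- Continued autonomous optimization can degrade classifier performance: validation sensitivity oscillated between 1.0 and 0.0 across iterations.
- The instability grew more severe as class prevalence fell (shortness of breath at 23%, chest pain at 12%, Long COVID brain fog at 3%).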
- Retrospective selection of iterations outperforms active intervention for stabilization.
- The study demonstrates significant improvements in symptom detection accuracy with proper oversight.
- Understanding these dynamics is essential for developing reliable AI in healthcare.
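The retrospective-selection takeaway can be illustrated with a small sketch. The source does not state which criterion the selector agent uses, so the balanced-accuracy score below is an assumption, and the function name `select_best_iteration` and the per-iteration numbers are hypothetical values chosen only to mimic an oscillating sensitivity.

```python
from typing import Dict, List


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity; less dominated by the majority class."""
    return (sensitivity + specificity) / 2.0


def select_best_iteration(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick, after the run has finished, the iteration with the highest score.

    Assumption: balanced accuracy is the selection score. The paper's actual
    criterion is not given in the text above.
    """
    return max(
        history,
        key=lambda it: balanced_accuracy(it["sensitivity"], it["specificity"]),
    )


# Hypothetical validation history with sensitivity swinging between 1.0 and 0.0.
history = [
    {"iteration": 1, "sensitivity": 1.0, "specificity": 0.40},
    {"iteration": 2, "sensitivity": 0.0, "specificity": 0.98},
    {"iteration": 3, "sensitivity": 0.80, "specificity": 0.85},
    {"iteration": 4, "sensitivity": 0.0, "specificity": 1.00},
]

best = select_best_iteration(history)
print(best["iteration"])
```

Selecting after the fact leaves the optimization loop untouched, so a collapsed late iteration cannot overwrite a better earlier one.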
Computer Science > Artificial Intelligence
arXiv:2602.16037 (cs)
[Submitted on 17 Feb 2026]
Title: Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection
Authors: Cameron Cagan, Pedram Fard, Jiazi Tian, Jingya Cheng, Shawn N. Murphy, Hossein Estiri
Abstract: Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated prompt optimization. Evaluating three clinical symptoms with varying prevalence (shortness of breath at 23%, chest pain at 12%, and Long COVID brain fog at 3%), we observed that validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, the system achieved 95% accuracy while detecting zero positive cases, a failure mode obscured by standard evaluation metrics. We evaluated two interventions: a guiding agent that actively redirected optimization, amplifying overfitting rather than correcting it, and a selector agent that retrospectively identified the best-performing iteration s...
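To see how 95% accuracy can coexist with zero detected positives at 3% prevalence, consider the arithmetic on a confusion matrix. The counts below are hypothetical, chosen only to reproduce those two figures on 1,000 validation samples; they are not taken from the paper, and `confusion_metrics` is an illustrative helper name.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute basic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    positives = tp + fn
    negatives = tn + fp
    return {
        "prevalence": positives / total,
        "accuracy": (tp + tn) / total,
        # Sensitivity (recall): share of true positive cases that were detected.
        "sensitivity": tp / positives if positives else 0.0,
        # Specificity: share of true negative cases correctly rejected.
        "specificity": tn / negatives if negatives else 0.0,
    }


# Hypothetical counts: 30 of 1,000 cases are positive (3% prevalence),
# none of them is detected, and 20 negatives are flagged by mistake.
m = confusion_metrics(tp=0, fp=20, tn=950, fn=30)
print(m)
```

Because accuracy is dominated by the 970 negative cases, it stays high even though every positive case is missed; reporting sensitivity alongside accuracy exposes the failure.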