[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Summary
This study explores automated detection of jailbreak attempts in clinical training large language models (LLMs) using linguistic feature extraction, enhancing scalability and accuracy over previous manual methods.
Why It Matters
As LLMs are increasingly used in clinical settings, ensuring their safety and reliability is crucial. This research offers a scalable solution for identifying unsafe user behavior, which is vital for maintaining the integrity of clinical dialogue systems and improving patient safety.
Key Takeaways
- Automated detection of jailbreak attempts enhances scalability and accuracy.
- The study utilizes expert annotations of key linguistic features for analysis.
- Multiple predictive models were evaluated, demonstrating strong performance.
- Error analysis reveals limitations in current annotation methods.
- Future improvements could include richer annotation schemes and evolving risk assessments.
Paper Details
Computer Science > Artificial Intelligence, arXiv:2602.13321 (cs). Submitted on 10 Feb 2026.
Authors: Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
Abstract
Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from ...
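The two-stage architecture described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the BERT-based regressors are replaced by simulated feature scores, and the data, labels, and classifier choices are all hypothetical stand-ins for the paper's evaluation setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stage 1 (simulated): in the paper, four BERT-based regressors each map
# raw dialogue text to a score for one linguistic dimension:
# Professionalism, Medical Relevance, Ethical Behavior, Contextual Distraction.
# Here we simulate their outputs as a 4-column feature matrix.
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))

# Hypothetical labeling rule for illustration only: low professionalism
# combined with high contextual distraction marks a jailbreak attempt.
y = ((X[:, 0] < 0.4) & (X[:, 3] > 0.6)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 2: second-layer classifiers over the extracted features, mirroring
# the paper's comparison of linear, tree-based, and ensemble methods.
classifiers = {
    "logistic": LogisticRegression(),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy = {acc:.2f}")
```

In the actual system, each column of `X` would come from a fine-tuned feature regressor applied to the user's message, and model selection would compare many more classifier families than the two shown here.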