[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction
Summary
This study explores automated detection of jailbreak attempts in clinical training large language models (LLMs) using linguistic feature extraction, enhancing scalability and accuracy over previous manual methods.
Why It Matters
As LLMs are increasingly used in clinical settings, ensuring their safety and reliability is crucial. This research offers a scalable solution for identifying unsafe user behavior, which is vital for maintaining the integrity of clinical dialogue systems and improving patient safety.
Key Takeaways
- Automated detection of jailbreak attempts enhances scalability and accuracy.
- The study utilizes expert annotations of key linguistic features for analysis.
- Multiple predictive models were evaluated, demonstrating strong performance.
- Error analysis reveals limitations in current annotation methods.
- Future improvements could include richer annotation schemes and evolving risk assessments.
Paper Details
Computer Science > Artificial Intelligence, arXiv:2602.13321 (cs). Submitted on 10 Feb 2026.
Authors: Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen
Abstract
Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from ...
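The two-stage architecture described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the BERT-based regressors are replaced by simulated feature scores, and the data, labels, and classifier choices are all hypothetical stand-ins for the paper's evaluation setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stage 1 (simulated): in the paper, four BERT-based regressors each map
# raw dialogue text to a score for one linguistic dimension:
# Professionalism, Medical Relevance, Ethical Behavior, Contextual Distraction.
# Here we simulate their outputs as a 4-column feature matrix.
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))

# Hypothetical labeling rule for illustration only: low professionalism
# combined with high contextual distraction marks a jailbreak attempt.
y = ((X[:, 0] < 0.4) & (X[:, 3] > 0.6)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 2: second-layer classifiers over the extracted features, mirroring
# the paper's comparison of linear, tree-based, and ensemble methods.
classifiers = {
    "logistic": LogisticRegression(),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy = {acc:.2f}")
```

In the actual system, each column of `X` would come from a fine-tuned feature regressor applied to the user's message, and model selection would compare many more classifier families than the two shown here.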