[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

[2602.13321] Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

arXiv - Machine Learning 4 min read Article

Summary

This study explores automated detection of jailbreak attempts in clinical training large language models (LLMs) using linguistic feature extraction, enhancing scalability and accuracy over previous manual methods.

Why It Matters

As LLMs are increasingly used in clinical settings, ensuring their safety and reliability is crucial. This research offers a scalable solution for identifying unsafe user behavior, which is vital for maintaining the integrity of clinical dialogue systems and improving patient safety.

Key Takeaways

  • Automated detection of jailbreak attempts enhances scalability and accuracy.
  • The study utilizes expert annotations of key linguistic features for analysis.
  • Multiple predictive models were evaluated, demonstrating strong performance.
  • Error analysis reveals limitations in current annotation methods.
  • Future improvements could include richer annotation schemes and evolving risk assessments.

Computer Science > Artificial Intelligence arXiv:2602.13321 (cs) [Submitted on 10 Feb 2026] Title:Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction Authors:Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen View a PDF of the paper titled Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction, by Tri Nguyen and 4 other authors View PDF HTML (experimental) Abstract:Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based LLM models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor in a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from ...

Related Articles

Llms

[D] How's MLX and jax/ pytorch on MacBooks these days?

​ So I'm looking at buying a new 14 inch MacBook pro with m5 pro and 64 gb of memory vs m4 max with same specs. My priorities are pro sof...

Reddit - Machine Learning · 1 min ·
Llms

[R] 94.42% on BANKING77 Official Test Split with Lightweight Embedding + Example Reranking (strict full-train protocol)

BANKING77 (77 fine-grained banking intents) is a well-established but increasingly saturated intent classification benchmark. did this wh...

Reddit - Machine Learning · 1 min ·
The “Agony” or ChatGPT: Would You Let AI Write Your Wedding Speech?
Llms

The “Agony” or ChatGPT: Would You Let AI Write Your Wedding Speech?

As more Americans use AI chatbots like ChatGPT to compose their wedding vows, one expert asks: “Is the speech sacred to you?”

AI Tools & Products · 12 min ·
I tested Gemini on Android Auto and now I can't stop talking to it: 5 tasks it nails
Llms

I tested Gemini on Android Auto and now I can't stop talking to it: 5 tasks it nails

I didn't see much benefit for Google's AI - until now. Here are my favorite ways to use the new Gemini integration in my car.

AI Tools & Products · 7 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime