[2602.15001] Boundary Point Jailbreaking of Black-Box LLMs
Summary
The paper introduces Boundary Point Jailbreaking (BPJ), a fully automated attack method that circumvents classifier-based safeguards in black-box large language models (LLMs), demonstrating its effectiveness against the strongest industry-deployed defenses.
Why It Matters
As LLMs become integral to various applications, understanding vulnerabilities like those exposed by BPJ is crucial for enhancing AI safety and robustness. This research highlights the ongoing arms race between attack and defense strategies in AI systems, emphasizing the need for improved security measures.
Key Takeaways
- BPJ is a fully automated, fully black-box attack: it needs only a single bit of information per query (whether or not the classifier flags the interaction) to bypass LLM defenses.
- The method converts a target harmful string into a curriculum of intermediate attack targets, making small improvements in attack strength detectable.
- BPJ successfully attacks GPT-5's input classifier without human intervention.
- Defending against BPJ likely requires batch-level monitoring, since the attack triggers many classifier flags during optimization.
- The research underscores the necessity for continuous improvements in AI defense mechanisms.
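The curriculum-driven, one-bit optimisation described in the takeaways can be sketched as a small hill-climbing skeleton. Everything below is an illustrative assumption, not the authors' implementation: the numeric "attack strength", the toy flag oracle, and the mutation operator are hypothetical stand-ins for prompts, the black-box classifier, and prompt edits.

```python
import random

def bpj_curriculum(curriculum, flagged, mutate, attack, steps=50, rng=None):
    # Hedged sketch of a curriculum-driven, one-bit attack loop.
    # `flagged(attack, target)` returns the single bit the attacker
    # observes per query: was the interaction flagged?
    rng = rng if rng is not None else random.Random(0)
    for target in curriculum:             # easiest intermediate target first
        for _ in range(steps):
            if not flagged(attack, target):
                break                     # this stage's target is already evaded
            candidate = mutate(attack, rng)
            if not flagged(candidate, target):
                attack = candidate        # keep a candidate only if it evades
    return attack

# Toy stand-ins: attack "strength" is a number, the safeguard flags anything
# weaker than the current target, and mutation jitters the strength.
final = bpj_curriculum(
    curriculum=[1.0, 2.0, 3.0],
    flagged=lambda a, t: a < t,
    mutate=lambda a, r: a + r.uniform(-0.5, 1.5),
    attack=0.0,
    rng=random.Random(7),
)
```

The curriculum matters because a direct jump to the hardest target would almost never produce an accepted candidate; intermediate targets keep the acceptance signal alive at every stage.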
Computer Science > Machine Learning
arXiv:2602.15001 (cs) [Submitted on 16 Feb 2026]
Title: Boundary Point Jailbreaking of Black-Box LLMs
Authors: Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal
Abstract: Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the fi...
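The abstract's "boundary points" can be made concrete with a toy model. Everything here is an assumption for illustration (the sigmoid flag model, the `difficulty` parameter, the function names), not the paper's method; it only shows why evaluation points with a roughly 50% flag rate are the most informative when each query yields a single bit.

```python
import math
import random

def flag(strength, difficulty, rng):
    # Simulated one-bit safeguard response (a toy stand-in for the real
    # black-box classifier): flag probability follows a sigmoid in
    # (difficulty - strength), so harder evaluation points flag more often
    # and stronger attacks flag less often.
    p = 1.0 / (1.0 + math.exp(strength - difficulty))
    return rng.random() < p

def flag_rate(strength, difficulty, rng, n=200):
    # Empirical flag frequency estimated from n one-bit queries.
    return sum(flag(strength, difficulty, rng) for _ in range(n)) / n

def pick_boundary_point(strength, difficulties, rng):
    # A boundary point is the evaluation point whose flag rate sits closest
    # to 0.5: the sigmoid is steepest there, so a small change in attack
    # strength shifts the observed flag frequency the most, making tiny
    # improvements detectable from binary feedback alone.
    return min(difficulties,
               key=lambda d: abs(flag_rate(strength, d, rng) - 0.5))
```

In this toy, an attack of strength 5 evaluated against points of difficulty 0, 2, 5, and 10 settles on difficulty 5, where about half of queries are flagged; points that are almost never or almost always flagged carry essentially no gradient signal.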