[2510.11834] Don't Walk the Line: Boundary Guidance for Filtered Generation
Summary
The paper presents Boundary Guidance, a reinforcement learning fine-tuning method that improves the safety and utility of generative models by steering outputs away from the safety classifier's decision boundary, thereby reducing both false positives and false negatives in filtered generation.
Why It Matters
As generative models become more prevalent, ensuring their outputs are safe and useful is critical. This research addresses the limitations of current filtering methods, providing a novel approach that enhances both safety and performance, which is essential for responsible AI deployment.
Key Takeaways
- Boundary Guidance improves generative model outputs by avoiding classifier margins.
- The method reduces false positives and negatives in filtered generation.
- Robustness is demonstrated across various model scales and reward designs.
- The approach enhances both safety and utility of generative outputs.
- Evaluations using LLM-as-a-Judge provide strong support for the method's effectiveness.
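The margin-avoidance idea can be sketched as a reward-shaping term. The sketch below is illustrative only (the function name, the specific penalty shape, and the `margin_weight` parameter are assumptions, not the paper's actual reward design): a standard filtered-generation reward prefers outputs the classifier scores as safe, while the added penalty discourages samples whose harmfulness probability sits near the classifier's decision boundary at 0.5.

```python
def boundary_guided_reward(p_harmful: float, margin_weight: float = 1.0) -> float:
    """Illustrative reward combining a safety term with a boundary penalty.

    p_harmful: classifier's estimated probability that the output is harmful.
    margin_weight: hypothetical knob trading off safety vs. margin avoidance.
    """
    safety = 1.0 - p_harmful                 # plain "avoid being filtered" reward
    margin = abs(p_harmful - 0.5)            # distance from the decision boundary
    boundary_penalty = 1.0 - 2.0 * margin    # 1.0 at the boundary, 0.0 at p in {0, 1}
    return safety - margin_weight * boundary_penalty
```

Under this shaping, a clearly safe output (`p_harmful` near 0) scores highest, while an output sitting right on the boundary (`p_harmful` near 0.5) is penalized even relative to some riskier but unambiguous outputs, which is the behavior the paper's method aims for.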
Paper Details
Computer Science > Machine Learning
arXiv:2510.11834 [cs.LG] (v1 submitted 13 Oct 2025; this version, v2, last revised 13 Feb 2026)
Title: Don't Walk the Line: Boundary Guidance for Filtered Generation
Authors: Sarah Ball, Andreas Haupt
Abstract: Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak, ambiguous, and long-context prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
DOI: https://doi.org/10.48550/arXiv.2510.11834
Submitted by: Andreas Haupt