[2509.26238] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Summary
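This paper introduces Truncated Polynomial Classifiers (TPCs), an extension of linear probes that monitors a large language model's activations to detect harmful requests before they lead to unsafe outputs. Because the polynomial is trained and evaluated term by term, the monitor's compute cost can be adjusted at test time.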
Why It Matters
As language models become more integrated into applications, ensuring their safety is crucial. Fixed-cost monitors force a trade-off: expensive ones waste compute on easy inputs, while cheap ones risk missing subtle cases. This research introduces a flexible monitoring approach whose cost rises only when an input is difficult to assess or when more compute is available.
Key Takeaways
- TPCs allow for dynamic monitoring of language models, adjusting resource use based on input difficulty.
- The method provides more interpretable safety assessments than traditional black-box models.
- Stopping evaluation early, after only a few polynomial terms, gives lightweight monitoring at lower cost; evaluating more terms gives stronger guardrails when needed.
- TPCs can serve as both a safety dial and an adaptive cascade for efficient monitoring.
- Performance on large-scale datasets shows TPCs can outperform existing monitoring methods.
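The term-by-term evaluation behind the takeaways above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the low-rank form of the higher-degree terms, the function names `tpc_partial_logits` and `cascade_predict`, the `margin` stopping rule, and the random parameters are all assumptions made for illustration. The first function shows the "safety dial" idea (each extra degree refines the same running logit), and the second shows an early-exit cascade that only pays for higher-degree terms when the current prediction is not yet confident.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tpc_partial_logits(z, bias, w, factors, coeffs):
    """Return the running logit after each degree: [degree 1, degree 2, ...].

    Assumed (hypothetical) parameterization: the degree-k term is
    sum_r coeffs[k][r] * (factors[k][r] . z) ** k, a low-rank polynomial term.
    The degree-1 entry is an ordinary linear probe.
    """
    logit = bias + w @ z
    logits = [logit]
    for k, (U, lam) in enumerate(zip(factors, coeffs), start=2):
        logit = logit + lam @ (U @ z) ** k  # add only the new degree-k term
        logits.append(logit)
    return np.array(logits)


def cascade_predict(z, bias, w, factors, coeffs, margin=0.4):
    """Evaluate terms one at a time and stop once the prediction is confident.

    `margin` is a hypothetical threshold: evaluation stops as soon as the
    predicted probability is at least `margin` away from 0.5.
    Returns (probability, highest degree evaluated).
    """
    logit = bias + w @ z
    degree = 1
    for k, (U, lam) in enumerate(zip(factors, coeffs), start=2):
        if abs(sigmoid(logit) - 0.5) >= margin:
            break  # confident enough: skip the remaining, costlier terms
        logit = logit + lam @ (U @ z) ** k
        degree = k
    return sigmoid(logit), degree


# Toy usage with random (untrained) parameters, for shape checking only.
rng = np.random.default_rng(0)
d, rank, max_degree = 8, 4, 3
z = rng.normal(size=d)  # stands in for an LLM activation vector
bias = 0.1
w = 0.1 * rng.normal(size=d)
factors = [0.1 * rng.normal(size=(rank, d)) for _ in range(2, max_degree + 1)]
coeffs = [0.1 * rng.normal(size=rank) for _ in range(2, max_degree + 1)]

print(tpc_partial_logits(z, bias, w, factors, coeffs))
print(cascade_predict(z, bias, w, factors, coeffs, margin=0.4))
```

In this sketch, truncating `tpc_partial_logits` at an earlier index corresponds to a cheaper monitor from the same set of parameters, while `cascade_predict` chooses the truncation point per input.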
Computer Science > Machine Learning
arXiv:2509.26238 (cs)
[Submitted on 30 Sep 2025 (v1), last revised 26 Feb 2026 (this version, v3)]
Title: Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Authors: James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
Abstract: Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive casca...