[2602.14143] ROAST: Rollout-based On-distribution Activation Steering Technique
Summary
ROAST improves inference-time control of large language models by estimating activation-steering directions from the model's own on-distribution rollouts, yielding consistent performance gains across tasks and model sizes.
Why It Matters
This research addresses the limitations of existing activation steering methods that rely on off-distribution supervision, which can lead to inconsistent results. By introducing ROAST, the authors provide a more robust framework for optimizing model performance, which is crucial for advancing applications in machine learning and AI.
Key Takeaways
- ROAST improves activation steering by using on-distribution rollouts.
- The technique avoids brittle interventions associated with discrete masking.
- Grouped normalization balances contributions across samples for better performance.
- Empirical results show significant performance improvements on various tasks.
- Continuous Soft Scaling (CSS) replaces hard sparsification, helping preserve activation energy and improving robustness.
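The contrast between discrete masking and a continuous soft alternative can be sketched as follows. The summary does not give the exact CSS formula, so the sigmoid gate below is an illustrative assumption, not the paper's definition:

```python
import math

def hard_mask(v, k):
    """Discrete masking: keep only the k largest-magnitude components,
    zeroing the rest (the brittle intervention ROAST avoids)."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]))[-k:])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def continuous_soft_scale(v, tau=1.0):
    """Illustrative soft alternative: gate each component by a sigmoid of
    its magnitude relative to the mean, so no component is zeroed outright
    and most activation energy is preserved (hypothetical form of CSS)."""
    mean_mag = sum(abs(x) for x in v) / len(v)
    return [x / (1.0 + math.exp(-(abs(x) - mean_mag) / tau)) for x in v]

v = [3.0, -0.2, 0.1, -2.5, 0.05]
print(hard_mask(v, 2))            # [3.0, 0.0, 0.0, -2.5, 0.0]
print(continuous_soft_scale(v))   # every entry survives, smoothly rescaled
```

With hard masking, small components are discarded entirely; with the soft gate, every component contributes at a smoothly reduced magnitude, which is the property the takeaway attributes to CSS.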
Computer Science > Machine Learning
arXiv:2602.14143 (cs) [Submitted on 15 Feb 2026]
Title: ROAST: Rollout-based On-distribution Activation Steering Technique
Authors: Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang
Abstract: Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on ...
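The abstract's point about grouped normalization, that high-magnitude samples would otherwise dominate the consensus steering direction, can be illustrated with a minimal sketch. The exact ROAST procedure is not specified here, so the group size and the unit-norm averaging below are assumptions:

```python
import math

def grouped_mean_direction(samples, group_size=4):
    """Illustrative grouped normalization: split per-rollout activation
    deltas into groups, normalize each group's mean to unit norm, then
    average the group directions. A single high-magnitude group then
    contributes no more than any other (hypothetical sketch, not the
    paper's exact algorithm)."""
    dim = len(samples[0])
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    direction = [0.0] * dim
    for g in groups:
        mean = [sum(s[d] for s in g) / len(g) for d in range(dim)]
        norm = math.sqrt(sum(m * m for m in mean)) or 1.0
        direction = [direction[d] + mean[d] / norm for d in range(dim)]
    return [d / len(groups) for d in direction]

# One group has 10x the magnitude of the other, yet both contribute
# equally to the consensus direction after per-group normalization.
samples = [[10.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
print(grouped_mean_direction(samples, group_size=4))  # [0.5, 0.5]
```

Without the per-group normalization, a naive mean of these samples would point almost entirely along the first axis; balancing contributions this way is what the abstract credits with making the consensus direction estimate more robust.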