[2602.19519] Ada-RS: Adaptive Rejection Sampling for Selective Thinking
Summary
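The paper introduces Ada-RS, an adaptive rejection sampling framework for teaching tool-using large language models (LLMs) to think selectively, i.e. to spend chain-of-thought tokens only on requests that need them. The goal is a better accuracy-efficiency trade-off in cost- and latency-sensitive applications.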
Why It Matters
As LLMs are increasingly deployed where cost and latency matter, spending reasoning tokens on simple requests becomes wasteful. Ada-RS addresses this at training time: by filtering which sampled completions are used as training signal, it teaches the model to skip unnecessary reasoning, substantially cutting output tokens while maintaining accuracy. This makes it relevant to developers and researchers building efficient LLM systems.
Key Takeaways
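- Ada-RS reduces average output tokens by up to 80% (Qwen3-8B with LoRA on a synthetic tool-call-oriented e-commerce benchmark).
- For each context, it scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to keep only high-reward candidates (or preference pairs) for training.
- The framework is algorithm-agnostic: it plugs into preference-pair methods such as DPO and grouped policy optimization methods such as DAPO.
- It reduces the thinking rate by up to 95% while maintaining or improving accuracy.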
- This approach highlights the importance of training-signal selection for efficient reasoning.
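The scoring and filtering step described in the takeaways above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the summary does not give the reward formula or the acceptance rule, so the `Completion` type, the success-rate-scaled penalty weight (`alpha`), and the `exp((r - r_max) / temperature)` acceptance probability are all hypothetical choices.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Completion:
    text: str
    correct: bool    # hypothetical: did the final tool call match the reference?
    num_tokens: int  # output length, including any thinking tokens


def adaptive_rewards(group, alpha=1.0):
    """Hypothetical adaptive length-penalized reward for one context's samples.

    Assumption: the length penalty weight scales with the group's success rate,
    so requests the model already solves reliably are pushed harder toward
    short (non-thinking) answers than requests it often gets wrong.
    """
    success_rate = sum(c.correct for c in group) / len(group)
    max_len = max(c.num_tokens for c in group) or 1  # guard against all-zero lengths
    lam = alpha * success_rate  # adaptive penalty weight (assumed form)
    return [float(c.correct) - lam * c.num_tokens / max_len for c in group]


def stochastic_reject(group, rewards, temperature=0.1, rng=None):
    """Keep each sample with probability exp((r - r_max) / temperature).

    The highest-reward sample is always kept (acceptance probability 1);
    lower-reward samples survive with exponentially smaller probability.
    This acceptance rule is an assumption, not taken from the paper.
    """
    rng = rng or random.Random(0)
    r_max = max(rewards)
    kept = []
    for c, r in zip(group, rewards):
        if rng.random() < math.exp((r - r_max) / temperature):
            kept.append((c, r))
    return kept


# Toy context: one long thinking answer, one short answer, one wrong answer.
group = [
    Completion("<think>...</think> search(query)", True, 400),
    Completion("search(query)", True, 20),
    Completion("add_to_cart(query)", False, 15),
]
rewards = adaptive_rewards(group)
kept = stochastic_reject(group, rewards)
```

With these toy numbers the short correct answer scores highest and is always kept, while the long thinking answer survives only with probability of roughly 0.002. One plausible reason to prefer a stochastic rule over a hard top-1 cut is that near-best samples occasionally still reach training.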
arXiv:2602.19519 (cs, Artificial Intelligence), submitted on 23 Feb 2026
Authors: Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, Prakhar Mehrotra
Abstract: Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference-pair (e.g. DPO) and grouped policy optimization strategies (e.g. DAPO). Using Qwen3-8B with LoRA on a synthetic tool-call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms by reducing average output tokens by up to 80% and reducing thinking rate by up to 95% while maintaining or improvi...
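The abstract says the retained samples feed either preference-pair optimization (DPO) or grouped policy optimization (DAPO). The sketch below shows, under assumptions, what those two downstream formats could look like. `build_dpo_pairs` and its `min_margin` threshold are hypothetical; `group_advantages` uses the group-normalized advantage from GRPO, which DAPO builds on. Neither function is the paper's implementation.

```python
from statistics import mean, pstdev


def build_dpo_pairs(samples, rewards, min_margin=0.2):
    """Hypothetical: form (chosen, rejected) pairs for DPO from one context.

    Samples are ranked by reward; a pair is kept only if the chosen sample
    beats the rejected one by at least `min_margin` (assumed threshold).
    """
    ranked = sorted(zip(samples, rewards), key=lambda sr: sr[1], reverse=True)
    pairs = []
    for i, (hi_sample, hi_reward) in enumerate(ranked):
        for lo_sample, lo_reward in ranked[i + 1:]:
            if hi_reward - lo_reward >= min_margin:
                pairs.append((hi_sample, lo_sample))
    return pairs


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (r - mean) / std, as in GRPO-style objectives."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for a long-thinking, a short, and a wrong completion.
samples = ["long", "short", "wrong"]
rewards = [0.33, 0.97, -0.03]
pairs = build_dpo_pairs(samples, rewards, min_margin=0.5)
advantages = group_advantages(rewards)
```

Every pair produced this way has a clear reward gap, so DPO receives an unambiguous preference signal. In the grouped case, a context whose samples all receive the same reward yields zero advantages and contributes no gradient, which is one reason the choice of which samples to keep matters.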