[2603.14602] PA3: Policy-Aware Agent Alignment through Chain-of-Thought
Computer Science > Computation and Language
arXiv:2603.14602 (cs)
[Submitted on 15 Mar 2026 (v1), last revised 21 Mar 2026 (this version, v2)]

Title: PA3: Policy-Aware Agent Alignment through Chain-of-Thought
Authors: Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya

Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle to adhere to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts produce long contexts that harm overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. We also introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.

Subjects: Computation and Language (cs.CL); A...
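The abstract describes a PolicyRecall reward based on the Jaccard score plus a Hallucination Penalty for GRPO training, but does not give the exact formulation. The sketch below is a minimal illustration of one plausible shape: the reward is the Jaccard overlap between the set of policies the model recalls in its chain of thought and the gold set, minus a penalty on recalled policies that do not exist in the gold set. The function names, the penalty weight, and the normalization are assumptions, not the paper's definition.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def policy_recall_reward(recalled, gold, penalty_weight=0.5):
    """Illustrative PolicyRecall-style reward (assumed form, not the paper's).

    recalled: policy IDs the model cited in its chain of thought.
    gold:     policy IDs actually relevant to the query.
    The Jaccard term rewards recalling the right policies; the penalty
    term (an assumed hallucination penalty) subtracts credit for cited
    policies that are not in the gold set.
    """
    recall_term = jaccard(recalled, gold)
    hallucinated = set(recalled) - set(gold)
    penalty = penalty_weight * len(hallucinated) / max(len(set(recalled)), 1)
    return recall_term - penalty
```

For example, recalling exactly the gold policies yields a reward of 1.0, while recalling one correct and one hallucinated policy out of two gold policies yields a lower, penalized score. In a GRPO setup, such a scalar would be one component of the per-rollout reward.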