[2511.04694] Reasoning Up the Instruction Ladder for Controllable Language Models
Summary
This paper explores the importance of instruction hierarchy in large language models (LLMs) for enhancing their controllability and reliability in decision-making tasks.
Why It Matters
As LLMs are increasingly used in critical applications, ensuring they can prioritize instructions effectively is vital for their safe deployment. This research addresses potential conflicts between user and system instructions, proposing a structured approach to improve model behavior and robustness against adversarial attacks.
Key Takeaways
- Instruction hierarchy (IH) is essential for LLMs to manage conflicting instructions.
- The study introduces VerIH, a dataset for training models on instruction prioritization.
- Lightweight reinforcement learning can enhance reasoning capabilities in LLMs.
- The proposed method shows a 20% improvement in instruction-following tasks.
- The model demonstrates increased robustness against prompt injection attacks.
Computer Science > Computation and Language arXiv:2511.04694 (cs) [Submitted on 30 Oct 2025 (v1), last revised 18 Feb 2026 (this version, v4)] Title:Reasoning Up the Instruction Ladder for Controllable Language Models Authors:Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar View a PDF of the paper titled Reasoning Up the Instruction Ladder for Controllable Language Models, by Zishuo Zheng and 4 other authors View PDF HTML (experimental) Abstract:As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models t...