[2510.03253] Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents
Computer Science > Machine Learning
arXiv:2510.03253 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents
Authors: Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen

Abstract: Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates ...
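The abstract describes combining DPO-style preference losses at three granularities (trajectory, step, and group). The following is a minimal sketch of that combination, not the authors' implementation: the weighting scheme, the hyperparameter names (beta, lambda_traj, lambda_step, lambda_group), and the assumption that log-probabilities are pre-aggregated per granularity are all illustrative; the actual HPL objective, including the dual-layer curriculum over action groups, is defined in the paper.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective: -log sigmoid(beta * ((policy margin) - (reference margin))).
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def hierarchical_preference_loss(traj, step, group,
                                 lambda_traj=1.0, lambda_step=1.0, lambda_group=1.0):
    # Each argument is a tuple (logp_chosen, logp_rejected, ref_chosen, ref_rejected)
    # of policy/reference log-probabilities aggregated at that granularity.
    # The three DPO losses are combined with illustrative fixed weights.
    return (lambda_traj * dpo_loss(*traj)
            + lambda_step * dpo_loss(*step)
            + lambda_group * dpo_loss(*group))

# Toy usage with random tensors standing in for model log-probabilities:
# e.g. 8 trajectory pairs, 64 step pairs, 16 group pairs.
mk = lambda n: tuple(torch.randn(n) for _ in range(4))
loss = hierarchical_preference_loss(mk(8), mk(64), mk(16))
loss.backward() if loss.requires_grad else None  # scalar loss for training

In the paper's framing, the trajectory- and step-level terms would stabilize the policy globally and locally, while the group-level term (omitted here beyond a plain DPO surrogate) carries the curriculum-guided signal over semantically coherent action groups.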