[2406.01914] HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.01914 (cs)

[Submitted on 4 Jun 2024 (v1), last revised 22 Mar 2026 (this version, v3)]

Title: HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Authors: Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu

Abstract: Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that directly applying LoRA fine-tuning to this VLM for the HPE task fails to achieve desirable accuracy, while some model merging methods can improve accuracy but frequently produce blended, invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach...
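The abstract is truncated before it specifies the merging rule. For orientation only, the snippet below is a minimal PyTorch sketch of one generic way to merge LoRA adapters layer by layer without blending them: for each layer, a single adapter's dense update is kept (here chosen by largest delta norm, a placeholder winner-takes-all heuristic). The names (lora_update, merge_layerwise) and the selection rule are illustrative assumptions, not the paper's actual method.

```python
import torch

def lora_update(A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # LoRA parameterizes a frozen weight W as W + scale * (B @ A),
    # with A of shape (r, in_features) and B of shape (out_features, r).
    return scale * (B @ A)

def merge_layerwise(base, adapters, scale: float = 1.0):
    # For each base weight, gather the dense deltas from every adapter that
    # touches the layer, then keep only the delta with the largest Frobenius
    # norm (winner-takes-all) instead of averaging, so the tasks' updates
    # are never blended within a single layer.
    merged = {}
    for name, weight in base.items():
        deltas = [lora_update(*ad[name], scale) for ad in adapters if name in ad]
        if deltas:
            winner = max(deltas, key=lambda d: d.norm().item())
            merged[name] = weight + winner
        else:
            merged[name] = weight.clone()
    return merged

# Toy usage: one 8x8 linear weight, two rank-2 adapters (e.g. one tuned
# for object detection, one for head pose estimation).
if __name__ == "__main__":
    torch.manual_seed(0)
    base = {"layer0.weight": torch.randn(8, 8)}
    adapter_det = {"layer0.weight": (torch.randn(2, 8), torch.randn(8, 2))}
    adapter_hpe = {"layer0.weight": (torch.randn(2, 8), torch.randn(8, 2))}
    merged = merge_layerwise(base, [adapter_det, adapter_hpe])
    print(merged["layer0.weight"].shape)  # torch.Size([8, 8])
```

A per-layer selection scheme like this avoids the failure mode the abstract describes for naive merging, where averaging the two tasks' parameters produces blended, invalid response formats.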