[2406.01914] HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.01914 (cs)
[Submitted on 4 Jun 2024 (v1), last revised 22 Mar 2026 (this version, v3)]

Title: HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Authors: Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu

Abstract: Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve the HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that direct LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging appro...
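
For readers unfamiliar with the LoRA terminology in the abstract, the sketch below shows how a standard LoRA update is folded into a single base weight matrix, W' = W + (alpha / r) * B A. It is a generic illustration only and is not the paper's layer-based merging method, which the truncated abstract does not describe. The names `matmul`, `merge_lora`, `alpha`, and `r`, along with the toy matrices, are assumptions made for this example.

```python
# Minimal sketch of folding a standard LoRA update into a base weight.
# NOT the paper's merging method; all names and values are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(base, lora_b, lora_a, alpha, r):
    """Return base + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # low-rank update, same shape as base
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

# Toy 2x2 base weight with a rank-1 adapter (r = 1).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # shape (2, r)
lora_a = [[0.5, -0.5]]     # shape (r, 2)

merged = merge_lora(base, lora_b, lora_a, alpha=2.0, r=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

Because the adapter factors are kept separately for each layer, a merging scheme can in principle decide layer by layer which adapter's update to apply, which is the general setting a "layer-based" merge operates in.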

Originally published on March 24, 2026. Curated by AI News.

