[2406.01914] HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.01914 (cs)

[Submitted on 4 Jun 2024 (v1), last revised 22 Mar 2026 (this version, v3)]

Title: HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Authors: Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu

Abstract: Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that directly applying LoRA fine-tuning to this VLM for the HPE task fails to achieve desirable accuracy, while some model merging methods can improve accuracy but frequently produce blended, invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach...
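The abstract is truncated before it specifies the merging rule. For orientation only, the snippet below is a minimal PyTorch sketch of one generic way to merge LoRA adapters layer by layer without blending them: for each layer, a single adapter's dense update is kept (here chosen by largest delta norm, a placeholder winner-takes-all heuristic). The names (lora_update, merge_layerwise) and the selection rule are illustrative assumptions, not the paper's actual method.

```python
import torch

def lora_update(A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # LoRA parameterizes a frozen weight W as W + scale * (B @ A),
    # with A of shape (r, in_features) and B of shape (out_features, r).
    return scale * (B @ A)

def merge_layerwise(base, adapters, scale: float = 1.0):
    # For each base weight, gather the dense deltas from every adapter that
    # touches the layer, then keep only the delta with the largest Frobenius
    # norm (winner-takes-all) instead of averaging, so the tasks' updates
    # are never blended within a single layer.
    merged = {}
    for name, weight in base.items():
        deltas = [lora_update(*ad[name], scale) for ad in adapters if name in ad]
        if deltas:
            winner = max(deltas, key=lambda d: d.norm().item())
            merged[name] = weight + winner
        else:
            merged[name] = weight.clone()
    return merged

# Toy usage: one 8x8 linear weight, two rank-2 adapters (e.g. one tuned
# for object detection, one for head pose estimation).
if __name__ == "__main__":
    torch.manual_seed(0)
    base = {"layer0.weight": torch.randn(8, 8)}
    adapter_det = {"layer0.weight": (torch.randn(2, 8), torch.randn(8, 2))}
    adapter_hpe = {"layer0.weight": (torch.randn(2, 8), torch.randn(8, 2))}
    merged = merge_layerwise(base, [adapter_det, adapter_hpe])
    print(merged["layer0.weight"].shape)  # torch.Size([8, 8])
```

A per-layer selection scheme like this avoids the failure mode the abstract describes for naive merging, where averaging the two tasks' parameters produces blended, invalid response formats.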