[2602.18763] TAG: Thinking with Action Unit Grounding for Facial Expression Recognition
Summary
The paper introduces TAG, a vision-language framework for Facial Expression Recognition (FER) that enhances reasoning by grounding predictions in facial Action Units, improving robustness and reducing hallucinations in outputs.
Why It Matters
This research addresses the limitations of current vision-language models in FER by ensuring that predictions are supported by verifiable visual evidence. By grounding reasoning in facial Action Units, TAG enhances the reliability of FER systems, which is crucial for applications in emotion analysis, human-computer interaction, and AI safety.
Key Takeaways
- TAG improves Facial Expression Recognition by grounding predictions in facial Action Units.
- The model reduces hallucinations and enhances visual faithfulness in outputs.
- It outperforms existing vision-language model baselines on multiple datasets.
- Intermediate reasoning steps are crucial for trustworthy multimodal reasoning.
- The approach demonstrates the value of structured, verifiable intermediate representations for multimodal reasoning.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18763 (cs) [Submitted on 21 Feb 2026]
Title: TAG: Thinking with Action Unit Grounding for Facial Expression Recognition
Authors: Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang
Abstract: Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consiste...
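The abstract describes an RL stage whose reward aligns the model's predicted AU regions with regions from an external AU detector. The paper's exact reward is not given here, so the following is a minimal sketch of one plausible form: a weighted mix of expression-label correctness and the mean best-IoU overlap between predicted and detector-provided AU boxes. All names (`au_aware_reward`, `alpha`, the box format) are hypothetical, not taken from the paper.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection rectangle (clamped to zero if boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def au_aware_reward(pred_label, true_label, pred_regions, detector_regions, alpha=0.5):
    """Hypothetical AU-aware reward: label correctness blended with grounding overlap.

    pred_regions / detector_regions are lists of (x1, y1, x2, y2) boxes; each
    predicted region is scored against its best-matching detector region.
    """
    correctness = 1.0 if pred_label == true_label else 0.0
    if pred_regions and detector_regions:
        overlaps = [max(iou(p, d) for d in detector_regions) for p in pred_regions]
        grounding = sum(overlaps) / len(overlaps)
    else:
        grounding = 0.0
    return alpha * correctness + (1.0 - alpha) * grounding
```

With `alpha=0.5`, a correct label whose cited regions exactly match the detector scores 1.0, while a wrong label with no overlap scores 0.0, so the policy is pushed toward rationales that are both correct and visually grounded.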