[2510.21356] Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.21356 (cs)
[Submitted on 24 Oct 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
Authors: Anupam Pani, Yanchao Yang

Abstract: Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 for future event prediction and around 7 for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided tr...
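
The abstract does not spell out the exact form of the gaze-regularized attention term. As a rough illustration of the idea it describes (gaze supervises the model's attention during training only, and is not an input at inference), here is a minimal sketch, assuming a KL-divergence alignment between an attention map over image patches and a gaze heatmap resampled to the same grid; the tensor shapes, field names (`image_attention`, `gaze_heatmap`), and the weight `lambda_gaze` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gaze_regularization_loss(attn_map: torch.Tensor,
                             gaze_heatmap: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between the model's attention over image patches and a
    human gaze heatmap on the same patch grid. Both tensors: shape [B, N]."""
    # Normalize both maps into probability distributions over patches.
    attn = attn_map / (attn_map.sum(dim=-1, keepdim=True) + eps)
    gaze = gaze_heatmap / (gaze_heatmap.sum(dim=-1, keepdim=True) + eps)
    # kl_div expects log-probabilities as the first argument.
    return F.kl_div(attn.clamp_min(eps).log(), gaze, reduction="batchmean")

def training_step(model, batch, lambda_gaze: float = 0.1) -> torch.Tensor:
    """One training step: task loss plus the gaze-alignment penalty.
    `model` is a hypothetical VLM whose forward pass exposes a language-modeling
    loss and an attention map over image patches."""
    outputs = model(batch["image"], batch["text"])
    task_loss = outputs.loss
    reg_loss = gaze_regularization_loss(outputs.image_attention,
                                        batch["gaze_heatmap"])
    # Gaze is used only here, at training time; inference runs on images alone.
    return task_loss + lambda_gaze * reg_loss
```

Because the regularizer only touches attention maps, the same term can in principle be attached to any attention-based VLM, which matches the modularity claim in the abstract.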