[2604.06934] Multi-modal user interface control detection using cross-attention
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.06934 (cs) [Submitted on 8 Apr 2026]

Title: Multi-modal user interface control detection using cross-attention
Authors: Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari

Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion), demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous...
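The abstract names the key mechanism (cross-attention from visual features onto text embeddings) and the three fusion strategies compared. The following is a minimal NumPy sketch of that idea, not the paper's actual architecture: the function names, shapes, and the single-head, projection-free attention are assumptions for illustration, and the "convolutional fusion" is approximated by a 1x1-convolution-style linear map over concatenated channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    # visual: (n_patches, d) queries from the detector's feature map;
    # text:   (n_tokens, d) keys/values from GPT-generated description embeddings.
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)      # (n_patches, n_tokens)
    attn = softmax(scores, axis=-1)
    return attn @ text                         # (n_patches, d) text-aware features

def fuse(visual, attended, mode="add", alpha=0.5, w=None):
    # The three fusion strategies compared in the paper, sketched naively.
    if mode == "add":                          # element-wise addition
        return visual + attended
    if mode == "weighted":                     # weighted sum (alpha is a hypothetical scalar)
        return alpha * visual + (1 - alpha) * attended
    if mode == "conv":                         # 1x1-conv-style fusion: w maps (2d -> d)
        concat = np.concatenate([visual, attended], axis=-1)
        return concat @ w
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy usage with random features (d = 8).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
text = rng.normal(size=(3, 8))
attended = cross_attention(visual, text)
fused = fuse(visual, attended, mode="conv", w=rng.normal(size=(16, 8)))
```

In the paper the fused features would feed back into the YOLOv5 detection head; here the sketch only shows how text embeddings can condition per-patch visual features before fusion.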