[2604.06934] Multi-modal user interface control detection using cross-attention
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.06934 (cs) [Submitted on 8 Apr 2026]

Title: Multi-modal user interface control detection using cross-attention
Authors: Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari

Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies (element-wise addition, weighted sum, and convolutional fusion), demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous...
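The abstract names the key mechanism (cross-attention from visual features onto text embeddings) and the three fusion strategies compared. The following is a minimal NumPy sketch of that idea, not the paper's actual architecture: the function names, shapes, and the single-head, projection-free attention are assumptions for illustration, and the "convolutional fusion" is approximated by a 1x1-convolution-style linear map over concatenated channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    # visual: (n_patches, d) queries from the detector's feature map;
    # text:   (n_tokens, d) keys/values from GPT-generated description embeddings.
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)      # (n_patches, n_tokens)
    attn = softmax(scores, axis=-1)
    return attn @ text                         # (n_patches, d) text-aware features

def fuse(visual, attended, mode="add", alpha=0.5, w=None):
    # The three fusion strategies compared in the paper, sketched naively.
    if mode == "add":                          # element-wise addition
        return visual + attended
    if mode == "weighted":                     # weighted sum (alpha is a hypothetical scalar)
        return alpha * visual + (1 - alpha) * attended
    if mode == "conv":                         # 1x1-conv-style fusion: w maps (2d -> d)
        concat = np.concatenate([visual, attended], axis=-1)
        return concat @ w
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy usage with random features (d = 8).
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
text = rng.normal(size=(3, 8))
attended = cross_attention(visual, text)
fused = fuse(visual, attended, mode="conv", w=rng.normal(size=(16, 8)))
```

In the paper the fused features would feed back into the YOLOv5 detection head; here the sketch only shows how text embeddings can condition per-patch visual features before fusion.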