[2504.08714] Generating Fine Details of Entity Interactions
Computer Science > Computer Vision and Pattern Recognition
arXiv:2504.08714 (cs)
[Submitted on 11 Apr 2025 (v1), last revised 3 Mar 2026 (this version, v2)]

Title: Generating Fine Details of Entity Interactions
Authors: Xinyi Gu, Jiayuan Mao

Abstract: Recent text-to-image models excel at generating high-quality object-centric images from instructions. However, images should also encapsulate rich interactions between objects, where existing models often fall short, likely due to limited training data and benchmarks for rare interactions. This paper explores a novel application of Multimodal Large Language Models (MLLMs) to benchmark and enhance the generation of interaction-rich images. We introduce \data, an interaction-focused dataset with 1000 LLM-generated fine-grained prompts for image generation covering (1) functional and action-based interactions, (2) multi-subject interactions, and (3) compositional spatial relationships. To address interaction-rich generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, \model, leverages LLMs to decompose interactions into finer-grained concepts, uses an MLLM to critique generated images, and applies targeted refinements with a partial diffusion denoising process. Automatic and human evaluations show significantly improved image quality, demonstrating the p...
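The refinement procedure sketched in the abstract is an iterative loop: decompose the prompt into finer-grained concepts with an LLM, critique the generated image against those concepts with an MLLM, and repair only the flawed parts via partial diffusion denoising. The following minimal sketch illustrates the control flow of such a loop; every function and data structure here is a hypothetical stand-in (an "image" is modeled as the set of concepts it realizes), not the authors' implementation or actual model calls.

```python
def decompose(prompt):
    # Stand-in for the LLM decomposition step: split the prompt
    # into finer-grained interaction concepts.
    return [c.strip() for c in prompt.split(" and ")]

def critique(image, concepts):
    # Stand-in for the MLLM critic: report which concepts the
    # current image fails to realize.
    return [c for c in concepts if c not in image]

def partial_denoise(image, missing):
    # Stand-in for partial diffusion denoising: regenerate only
    # the flawed content (here, add one missing concept per round)
    # while leaving the rest of the image untouched.
    return image | {missing[0]}

def refine(prompt, image, max_rounds=5):
    # Decompose once, then critique-and-repair until the critic
    # finds no missing concepts or the round budget is exhausted.
    concepts = decompose(prompt)
    for _ in range(max_rounds):
        missing = critique(image, concepts)
        if not missing:
            break
        image = partial_denoise(image, missing)
    return image

prompt = "a cat holding a cup and a dog watching"
initial = {"a cat holding a cup"}  # first generation misses one interaction
final = refine(prompt, initial)
print(sorted(final))
# → ['a cat holding a cup', 'a dog watching']
```

The key design point the abstract emphasizes is that refinement is targeted: rather than regenerating the whole image from scratch, the partial denoising step revisits only the regions the critic flags, which preserves the parts of the image that already satisfy the prompt.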