[2602.22624] Instruction-based Image Editing with Planning, Reasoning, and Generation
Summary
This paper presents a novel approach to instruction-based image editing by integrating planning, reasoning, and generation through a multi-modality model, enhancing editing capabilities for complex images.
Why It Matters
As image editing becomes increasingly reliant on AI, this research addresses the limitations of existing models by introducing a multi-modal approach that improves understanding and generation. This could lead to more sophisticated tools for creators and developers in various fields, including design and content creation.
Key Takeaways
- Introduces a multi-modality model for instruction-based image editing.
- Enhances editing quality by bridging understanding and generation.
- Utilizes Chain-of-Thought planning for better instruction interpretation.
- Implements a hint-guided editing network for improved image generation.
- Demonstrates competitive performance on complex real-world images.
Paper Details
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.22624 (cs) [Submitted on 26 Feb 2026]
Title: Instruction-based Image Editing with Planning, Reasoning, and Generation
Authors: Liya Ji, Chenyang Qi, Qifeng Chen
Abstract: Editing images via instructions provides a natural way to generate interactive content, but it is challenging due to the high demands it places on scene understanding and generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality, restricting editing quality. We aim to bridge understanding and generation with a new multi-modality model that brings intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we decompose the instruction-editing task into multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons out appropriate sub-prompts given the provided instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language ...
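The three-stage decomposition described in the abstract (CoT planning, editing region reasoning, then editing) can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: all function names are hypothetical stand-ins, with the planner simulated by naive string splitting and the reasoning/editing models replaced by placeholders.

```python
# Hypothetical sketch of the paper's three-stage pipeline.
# Real systems would call an LLM planner, a region-reasoning network,
# and a hint-guided editing network; here each stage is a placeholder.

def cot_plan(instruction: str) -> list[str]:
    """Stand-in for CoT planning: break an instruction into sub-prompts.
    A naive split on ' and ' simulates the LLM's decomposition."""
    return [p.strip() for p in instruction.split(" and ") if p.strip()]

def reason_region(image: list, sub_prompt: str) -> dict:
    """Stand-in for editing region reasoning: would return a spatial mask
    for the sub-prompt; here it returns a labeled placeholder."""
    return {"prompt": sub_prompt, "mask": "region-mask"}

def apply_edit(image: list, region: dict) -> list:
    """Stand-in for the hint-guided editing network: would generate pixels
    inside the mask; here it records the applied sub-prompt."""
    return image + [region["prompt"]]

def edit_pipeline(image: list, instruction: str) -> list:
    """Chain the stages: plan, then reason and edit per sub-prompt."""
    edited = image
    for sub in cot_plan(instruction):
        region = reason_region(edited, sub)
        edited = apply_edit(edited, region)
    return edited

print(edit_pipeline([], "make the sky blue and add a red car"))
```

The point of the decomposition is that each stage can fail or succeed independently: the planner never touches pixels, and the editing network only ever sees one sub-prompt with its reasoned region at a time.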