[2601.16296] Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.16296 (cs)

[Submitted on 22 Jan 2026 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing

Authors: Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong

Abstract: Video-to-video diffusion models achieve impressive single-turn editing performance, but practical editing workflows are inherently iterative. When edits are applied sequentially, existing models treat each turn independently, often causing previously generated regions to drift or be overwritten. We identify this failure mode as the problem of cross-turn consistency in multi-turn video editing. We introduce Memory-V2V, a memory-augmented framework that treats prior edits as structured constraints for subsequent generations. Memory-V2V maintains an external memory of previous outputs, retrieves task-relevant edits, and integrates them through relevance-aware tokenization and adaptive compression. These technical ingredients enable scalable conditioning without linear growth in computation. We demonstrate Memory-V2V on iterative video novel view synthesis and text-guided long video editing. Memory-V2V substantially enhances cross-turn consistency w...
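The abstract describes the memory mechanism only at a high level: an external memory of prior outputs, relevance-based retrieval, and adaptive compression that keeps conditioning cost bounded across turns. As a rough illustration of that idea, and not the paper's actual method, the PyTorch sketch below shows one way such a memory could work. The class name `EditMemory`, the mean-pooled retrieval keys, the cosine-similarity top-k retrieval, and the adaptive average-pooling compression are all assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F


class EditMemory:
    """Hypothetical external memory of prior edit outputs.

    Each entry stores a pooled key embedding (for retrieval) and the
    token sequence of a previously generated edit (for conditioning).
    """

    def __init__(self, top_k: int = 2, tokens_per_entry: int = 64):
        self.top_k = top_k
        self.tokens_per_entry = tokens_per_entry  # compression budget per retrieved edit
        self.keys: list[torch.Tensor] = []        # one (d,) key per stored edit
        self.values: list[torch.Tensor] = []      # one (n_i, d) token sequence per edit

    def write(self, edit_tokens: torch.Tensor) -> None:
        # Mean-pool the edit's tokens into a retrieval key (assumption:
        # the paper does not specify how keys are formed).
        self.keys.append(edit_tokens.mean(dim=0))
        self.values.append(edit_tokens)

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Retrieve the top-k most relevant prior edits and compress each
        to a fixed token budget, so the conditioning sequence stays
        bounded no matter how many turns have been stored."""
        if not self.keys:
            return torch.empty(0, query.shape[-1])
        keys = torch.stack(self.keys)                         # (N, d)
        scores = F.cosine_similarity(keys, query[None], dim=-1)
        k = min(self.top_k, len(self.keys))
        idx = scores.topk(k).indices
        compressed = [self._compress(self.values[i]) for i in idx]
        return torch.cat(compressed, dim=0)                   # (<= k * budget, d)

    def _compress(self, tokens: torch.Tensor) -> torch.Tensor:
        # Adaptive average pooling down to the per-entry budget, a crude
        # stand-in for the paper's learned adaptive compression.
        n, _ = tokens.shape
        if n <= self.tokens_per_entry:
            return tokens
        pooled = F.adaptive_avg_pool1d(
            tokens.t().unsqueeze(0), self.tokens_per_entry    # (1, d, budget)
        )
        return pooled.squeeze(0).t()                          # (budget, d)


# Usage sketch: after each editing turn, write the new output's tokens;
# before the next turn, read memory and append the retrieved tokens to
# the diffusion model's conditioning sequence.
memory = EditMemory(top_k=2, tokens_per_entry=64)
for turn in range(3):
    edit_tokens = torch.randn(256, 128)   # placeholder for a turn's latent tokens
    query = torch.randn(128)              # placeholder for the next prompt embedding
    memory_tokens = memory.read(query)
    print(f"turn {turn}: retrieved {memory_tokens.shape[0]} memory tokens")
    memory.write(edit_tokens)
```

Because each retrieved edit is squeezed to a fixed budget and only the top-k entries are read, the conditioning length here is capped at `top_k * tokens_per_entry` tokens per turn, which is one concrete way to realize the "scalable conditioning without linear growth in computation" claim made in the abstract.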