[2602.14941] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Summary
AnchorWeave is a video generation framework that maintains spatial consistency over long horizons by conditioning on multiple clean local geometric memories rather than a single globally reconstructed 3D scene.
Why It Matters
Maintaining spatial consistency is a central challenge in video generation: without it, generated scenes drift and contradict themselves over time. By improving the long-horizon coherence and visual quality of generated videos, this work has implications for applications such as virtual reality and autonomous systems.
Key Takeaways
- AnchorWeave replaces misaligned global memories with multiple clean local geometric memories.
- The framework employs a coverage-driven local memory retrieval aligned with the target trajectory.
- Integration of selected local memories is managed through a multi-anchor weaving controller.
- Extensive experiments show significant improvements in long-term scene consistency and visual quality.
- Ablation studies validate the effectiveness of local geometric conditioning and multi-anchor control.
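The coverage-driven retrieval step above can be pictured as a greedy set-cover over the target camera trajectory: local memories are selected one at a time, each chosen to cover as many not-yet-covered target viewpoints as possible. The paper does not publish its exact algorithm, so the sketch below is purely illustrative; the function name, the distance-based notion of "coverage", and the `radius`/`max_memories` parameters are all assumptions, not the authors' method.

```python
import numpy as np

def retrieve_local_memories(target_poses, memory_poses_list,
                            radius=2.0, max_memories=4):
    """Greedy coverage-driven selection (illustrative sketch, not the
    paper's algorithm): pick local memories whose stored camera positions
    lie near the target trajectory until every target pose is covered or
    the memory budget is exhausted.

    target_poses: (T, 3) array of target camera positions.
    memory_poses_list: list of (N_i, 3) arrays, one per local memory.
    """
    uncovered = np.ones(len(target_poses), dtype=bool)
    selected = []

    def covers(mem_poses):
        # A target pose counts as covered if any pose stored in this
        # local memory lies within `radius` of it.
        d = np.linalg.norm(
            target_poses[:, None, :] - mem_poses[None, :, :], axis=-1)
        return d.min(axis=1) < radius

    for _ in range(max_memories):
        best_idx, best_gain = None, 0
        for i, mem_poses in enumerate(memory_poses_list):
            if i in selected:
                continue
            # Marginal gain: newly covered target poses.
            gain = int(np.count_nonzero(covers(mem_poses) & uncovered))
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:  # no remaining memory adds coverage
            break
        selected.append(best_idx)
        uncovered &= ~covers(memory_poses_list[best_idx])
        if not uncovered.any():
            break
    return selected
```

For example, with a trajectory spanning two distant regions and one local memory anchored near each region, the greedy loop selects both memories and ignores a third memory far from the trajectory. The selected memories would then be fused by the multi-anchor weaving controller described above, which learns to reconcile their residual cross-view inconsistencies.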
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14941 (cs) [Submitted on 16 Feb 2026]
Title: AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal
Abstract: Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selecte...