[2501.06336] MEt3R: Measuring Multi-View Consistency in Generated Images
Summary
The paper presents MEt3R, a novel metric for assessing multi-view consistency in generated images, addressing limitations of traditional reconstruction metrics in generative modeling.
Why It Matters
As generative models for multi-view image generation advance, reliable metrics for evaluating the consistency of their outputs become crucial. MEt3R measures this consistency independently of the specific scene and of the sampling procedure, strengthening the evaluation of generative models in computer vision.
Key Takeaways
- MEt3R is designed to measure multi-view consistency in generated images.
- Traditional reconstruction metrics are not suited to evaluating generative model outputs, because a generated sample need not match a single ground-truth scene.
- The approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, warps content from one view into the other, and compares feature maps to obtain a similarity score (see the sketch after this list).
- MEt3R can evaluate various methods for novel view and video generation.
- The metric helps raise evaluation standards for multi-view generation in computer vision research.
arXiv:2501.06336 (cs.CV)
[Submitted on 10 Jan 2025 (v1), last revised 21 Feb 2026 (this version, v2)]
Title: MEt3R: Measuring Multi-View Consistency in Generated Images
Authors: Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen
Abstract: We introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, includi...