[2602.19348] MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose
Summary
The paper presents MultiDiffSense, a unified diffusion model that generates visuo-tactile images conditioned on object shape and contact pose, reducing the cost of dataset collection for robotic applications.
Why It Matters
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. By enabling scalable, controllable synthetic dataset generation, MultiDiffSense lowers this barrier for robotics and tactile sensing research.
Key Takeaways
- MultiDiffSense synthesizes images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single unified diffusion architecture.
- The model outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip) on both seen and novel objects at unseen poses.
- Mixing 50% synthetic with 50% real data halves the real data required for downstream 3-DoF pose estimation while maintaining competitive performance (see the sketch after this list).
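A minimal sketch of the 50/50 data-mixing idea above, assuming the samples are just lists of file paths; `build_mixed_dataset` and the file names are hypothetical stand-ins, not the paper's pipeline:

```python
import random

def build_mixed_dataset(real_samples, synthetic_samples,
                        real_fraction=0.5, total=None, seed=0):
    """Draw a training set mixing real captures with generated images."""
    rng = random.Random(seed)
    # keep the original training budget by default
    total = total if total is not None else len(real_samples)
    n_real = int(total * real_fraction)
    mixed = (rng.sample(real_samples, n_real)
             + rng.sample(synthetic_samples, total - n_real))
    rng.shuffle(mixed)
    return mixed

# 50/50 mix: a 100-sample budget now needs only 50 real captures.
real = [f"real_{i}.png" for i in range(100)]
synth = [f"synth_{i}.png" for i in range(500)]
train_set = build_mixed_dataset(real, synth, real_fraction=0.5, total=100)
```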
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.19348 (cs) [Submitted on 22 Feb 2026]
Title: MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose
Authors: Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang
Abstract: Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alle...
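The abstract's dual-conditioning scheme (pose-aligned depth maps plus structured prompts encoding sensor type and 4-DoF contact pose) could be wired up roughly as in the PyTorch sketch below. Everything here is an assumption for illustration: `DualConditionedDenoiser`, the layer sizes, and the 5-dimensional prompt features (sensor id plus 4-DoF pose) are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DualConditionedDenoiser(nn.Module):
    """Toy denoiser with two conditioning paths: a depth map stacked as an
    extra input channel (spatial) and a prompt vector added as a per-channel
    bias (global). Layer sizes are illustrative placeholders."""

    def __init__(self, img_channels=3, cond_dim=64):
        super().__init__()
        # image + depth map stacked along the channel axis
        self.conv_in = nn.Conv2d(img_channels + 1, 32, 3, padding=1)
        # hypothetical encoder for the structured prompt,
        # e.g. (sensor id, x, y, z, theta) -> conditioning vector
        self.prompt_mlp = nn.Sequential(
            nn.Linear(5, cond_dim), nn.SiLU(), nn.Linear(cond_dim, 32)
        )
        self.conv_out = nn.Conv2d(32, img_channels, 3, padding=1)

    def forward(self, noisy_img, depth_map, prompt_feats):
        x = torch.cat([noisy_img, depth_map], dim=1)  # spatial conditioning
        h = torch.relu(self.conv_in(x))
        c = self.prompt_mlp(prompt_feats)             # global conditioning
        h = h + c[:, :, None, None]                   # broadcast over H, W
        return self.conv_out(h)                       # predicted noise

# one denoising step on a toy batch of 64x64 tactile images:
model = DualConditionedDenoiser()
noise_pred = model(
    torch.randn(2, 3, 64, 64),   # noisy tactile image
    torch.randn(2, 1, 64, 64),   # CAD-derived, pose-aligned depth map
    torch.tensor([[0., 0.1, -0.2, 0.5, 3.],   # sensor id + 4-DoF pose
                  [1., 0., 0., 0., 0.]]),
)
```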
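The reported SSIM gains are relative improvements over the baseline. A small sketch of how such a comparison could be computed, assuming scikit-image's `structural_similarity`; `mean_ssim` and the random toy arrays are hypothetical, not the paper's evaluation code:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(generated, reference):
    """Average SSIM over paired generated/real tactile images (H, W, 3 uint8)."""
    scores = [
        ssim(g, r, channel_axis=-1, data_range=255)
        for g, r in zip(generated, reference)
    ]
    return float(np.mean(scores))

def relative_improvement(ours, baseline):
    """Percentage SSIM gain over the baseline, e.g. +36.3% for ViTac."""
    return 100.0 * (ours - baseline) / baseline

# toy stand-ins for generated and real tactile images (uint8 RGB):
rng = np.random.default_rng(0)
fake = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
real = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
print(relative_improvement(mean_ssim(fake, real), 0.5))
```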