[2602.23359] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Summary
The paper introduces SeeThrough3D, a model for occlusion-aware 3D layout control in text-to-image generation that synthesizes overlapping objects with depth-consistent geometry and scale.
Why It Matters
Occlusion reasoning is crucial for accurately generating 3D scenes, especially when objects overlap. This research addresses a significant gap in existing models, improving the fidelity of generated images and expanding the capabilities of text-to-image synthesis.
Key Takeaways
- SeeThrough3D models occlusions to enhance 3D scene generation.
- The occlusion-aware 3D scene representation (OSCR) depicts objects as translucent 3D boxes, improving depth and scale consistency.
- Masked self-attention helps bind object descriptions to their corresponding visuals accurately.
- The model generalizes well to unseen object categories.
- A synthetic dataset was created to train the model on diverse multi-object scenes.
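The takeaway about masked self-attention can be illustrated with a small sketch: restrict attention so that each object's text tokens and visual tokens attend only to tokens of the same object, while global tokens (e.g. the overall prompt) attend everywhere. The token layout and the `build_binding_mask` helper below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def build_binding_mask(text_owner, vis_owner):
    """Boolean attention mask: True means the token pair may attend.

    text_owner[i] / vis_owner[j] hold the object id that each text or
    visual token belongs to (-1 marks global tokens). Tokens of
    different objects are blocked from attending to each other, which
    is one simple way to bind each description to its own object.
    """
    owner = np.concatenate([text_owner, vis_owner])
    same = owner[:, None] == owner[None, :]
    is_global = (owner[:, None] < 0) | (owner[None, :] < 0)
    return same | is_global

def masked_self_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs -> -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Two objects, each with one text and one visual token, plus one
# global visual token: cross-object pairs are masked out.
mask = build_binding_mask(np.array([0, 1]), np.array([0, 1, -1]))
```

Every token attends to itself (the diagonal is always allowed), so each softmax row is well defined even under the mask.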
Abstract
Computer Science > Computer Vision and Pattern Recognition. arXiv:2602.23359 (cs). Submitted on 26 Feb 2026.
Title: SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation: it is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed in a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model on a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object's description to its corresponding visual tokens.
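The intuition behind the translucent-box representation can be shown with a toy renderer: composite boxes back-to-front with a fixed translucency, so occluded regions remain partially visible instead of being overwritten. This sketch uses 2D rectangles with an orthographic projection and a painter's-algorithm loop; the paper's actual 3D renderer and box parameterization are not reproduced here.

```python
import numpy as np

def render_translucent_boxes(boxes, H=64, W=64, alpha=0.5):
    """Composite translucent boxes back-to-front onto a white canvas.

    Each box is (x0, y0, x1, y1, depth, rgb); larger depth means
    farther from the camera. Because each box is blended with alpha
    rather than painted opaquely, a hidden region still contributes
    color, encoding what lies behind the occluder.
    """
    img = np.ones((H, W, 3))  # white background
    for x0, y0, x1, y1, depth, rgb in sorted(boxes, key=lambda b: -b[4]):
        region = img[y0:y1, x0:x1]
        img[y0:y1, x0:x1] = (1 - alpha) * region + alpha * np.asarray(rgb)
    return img

# A far red box partially occluded by a near blue box: in the overlap,
# both colors remain visible after blending.
boxes = [(0, 0, 32, 32, 5.0, (1.0, 0.0, 0.0)),    # far red box
         (16, 16, 48, 48, 2.0, (0.0, 0.0, 1.0))]  # near blue box
img = render_translucent_boxes(boxes)
```

Sorting by negative depth renders far boxes first, so nearer boxes are blended on top, mirroring how occlusion order is encoded in the rendered conditioning image.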