[2602.23359] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation


Summary

The paper introduces SeeThrough3D, a model for occlusion-aware 3D control in text-to-image generation, enhancing the realism of synthesized scenes with depth-consistent geometry.

Why It Matters

Occlusion reasoning is crucial for accurately generating 3D scenes, especially when objects overlap. This research addresses a gap in existing layout-conditioned models, which often fail to capture precise inter-object occlusions, improving the fidelity of generated images and expanding the capabilities of text-to-image synthesis.

Key Takeaways

  • SeeThrough3D models occlusions to enhance 3D scene generation.
  • The occlusion-aware 3D scene representation (OSCR) allows for better depth and scale consistency.
  • Masked self-attention helps bind object descriptions to their corresponding visuals accurately.
  • The model generalizes well to unseen object categories.
  • A synthetic dataset was created to train the model on diverse multi-object scenes.

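The masked self-attention takeaway above can be illustrated with a small sketch. This is not the authors' implementation; it is a generic masked attention in NumPy, where a hypothetical binary mask restricts each image token to the text tokens of its assigned object, which is the binding mechanism the summary describes:

```python
import numpy as np

def masked_self_attention(q, k, v, mask):
    """Scaled dot-product attention with a binary mask.

    mask[i, j] = 1 allows query token i to attend to key token j;
    disallowed positions are set to -inf before the softmax, so each
    image token can only bind to its own object's text tokens.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (n_q, n_k) attention logits
    scores = np.where(mask.astype(bool), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 image tokens, 2 objects with 2 text tokens each.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
# Image tokens 0-1 belong to object A (text tokens 0-1),
# image tokens 2-3 to object B (text tokens 2-3).
mask = np.zeros((4, 4))
mask[:2, :2] = 1
mask[2:, 2:] = 1
out = masked_self_attention(q, k, v, mask)
print(out.shape)  # (4, 8)
```

With this block structure, the first two output tokens depend only on object A's keys and values, so a mismatched object description cannot leak into another object's visual tokens.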
Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.23359 (cs) · Submitted on 26 Feb 2026

Title: SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu

Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), in which objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accu...
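The abstract's idea of translucent boxes that keep occluded regions visible can be sketched in simplified form. The paper renders 3D boxes from a camera viewpoint; the toy code below is only a 2D analogue with per-box depth values and front-to-back alpha compositing, and all box coordinates and the alpha value are illustrative assumptions:

```python
import numpy as np

def render_translucent_boxes(boxes, hw=(64, 64), alpha=0.5):
    """Composite translucent 2D boxes front-to-back, a rough analogue
    of rendering an occlusion-aware layout from a fixed camera.

    boxes: list of (x0, y0, x1, y1, depth, color), where color is an
    RGB triple in [0, 1] and smaller depth means closer to the camera.
    """
    h, w = hw
    img = np.zeros((h, w, 3))
    transmittance = np.ones((h, w, 1))  # fraction of light still passing through
    for x0, y0, x1, y1, depth, color in sorted(boxes, key=lambda b: b[4]):
        region = np.zeros((h, w, 1))
        region[y0:y1, x0:x1] = 1.0
        # Translucency leaves occluded regions partially visible,
        # which is what lets the model "see through" front objects.
        img += transmittance * region * alpha * np.asarray(color)
        transmittance *= 1.0 - region * alpha
    return img

# Two overlapping boxes: a near red box partially occluding a far blue box.
img = render_translucent_boxes([
    (10, 10, 40, 40, 1.0, (1.0, 0.0, 0.0)),  # near
    (25, 25, 55, 55, 2.0, (0.0, 0.0, 1.0)),  # far
])
```

In the overlap region the far blue box still contributes a dimmed color instead of being fully hidden, so the rendering encodes both the depth ordering and the extent of the hidden region.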

