[2602.18882] SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Summary

SceneTok is a tokenizer that compresses view sets of a 3D scene into a small, unordered set of diffusable tokens. The authors report compression 1-3 orders of magnitude stronger than other representations while still reaching state-of-the-art reconstruction quality.

Why It Matters

Existing approaches to 3D scene representation and generation commonly rely on 3D data structures or view-aligned fields. SceneTok instead encodes a scene into a small token set that is disentangled from the spatial grid, which makes the representation far more compact. Because the tokens can be rendered from novel trajectories, the approach adds flexibility for applications of 3D models in fields such as gaming and virtual reality.

Key Takeaways

  • SceneTok encodes 3D scenes into a compressed set of permutation-invariant tokens.
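  • The authors report compression 1-3 orders of magnitude (roughly 10x to 1000x) stronger than other representations, while still reaching state-of-the-art reconstruction quality.
  • Scenes can be rendered from novel trajectories, including ones that deviate from the input trajectory.
  • A lightweight rectified flow decoder renders the tokens into novel views and handles uncertainty gracefully.
  • The highly compressed token set enables simple and efficient scene generation in 5 seconds.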
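
To make "permutation-invariant tokens" concrete, here is a minimal Python sketch of one common way to read from an unordered token set: dot-product attention from a single query. It is an illustration under stated assumptions, not SceneTok's architecture. The function name `attend`, the token count (16), and the token width (8) are all hypothetical, and the tokens double as keys and values for simplicity.

```python
import math
import random

def attend(query, tokens):
    """Single-head dot-product attention of one query over a set of tokens.

    Hypothetical simplification: each token is used as both key and value.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(tokens[0])
    # Weighted average of the tokens; a sum over the set ignores ordering.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
            for d in range(dim)]

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
query = [rng.gauss(0, 1) for _ in range(8)]

shuffled = tokens[:]
rng.shuffle(shuffled)

out_original = attend(query, tokens)
out_shuffled = attend(query, shuffled)

# The two readouts agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(out_original, out_shuffled))
print(max_diff)
```

Because the readout is a weighted sum over the set, reordering the tokens changes only the order of additions, so any difference between the two outputs comes from floating-point rounding.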

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.18882 (cs)

[Submitted on 21 Feb 2026]

Title: SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Authors: Mohammad Asim, Christopher Wewer, Jan Eric Lenssen

Abstract: We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous...
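
The abstract mentions a rectified flow decoder without giving details. As a rough illustration of the general technique, the Python sketch below shows rectified flow sampling: Euler integration of a velocity field from t = 0 (noise) to t = 1 (sample) along a straight path. It makes no claim about the paper's decoder. The `velocity` function is a hand-written stand-in for a learned network, the fixed `target` vector is a hypothetical placeholder for the output, and conditioning on scene tokens or a camera pose is not modeled.

```python
def velocity(x, t, target):
    """Stand-in for a learned velocity network (hypothetical).

    On the straight path x_t = (1 - t) * noise + t * target, the velocity
    that points at the target is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(noise, target, steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in velocity is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.5, -1.25, 2.0]    # arbitrary starting point standing in for noise
target = [1.0, 0.0, -1.0]    # hypothetical "clean" output
sample = euler_sample(noise, target)

# Distance between the integrated sample and the target.
error = max(abs(s - g) for s, g in zip(sample, target))
print(error)
```

In this toy setting the velocity is constant along the path, so even a few Euler steps land on the target. A learned velocity field is only approximately straight, so the number of steps becomes a quality-speed knob.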
