[2602.10551] C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning
Summary
The paper presents C^2RoPE, a positional encoding method for 3D Large Multimodal Models that addresses limitations of the inherited Rotary Position Embedding (RoPE) when it is applied to visual tokens.
Why It Matters
C^2RoPE enhances the integration of visual features with language models by preserving spatial continuity and modeling causal relationships between visual tokens. This matters for 3D scene reasoning and visual question answering, where a model must relate tokens across both space and time rather than along a single temporal axis.
Key Takeaways
- C^2RoPE improves on standard Rotary Position Embedding by restoring the spatial locality that 1D positional indices lose when image patches are flattened into a token sequence.
- The method integrates temporal and spatial positional information into a continuous encoding for visual processing.
- Chebyshev Causal Masking is introduced to better model causal dependencies in 2D space.
- Evaluations across multiple benchmarks demonstrate C^2RoPE's effectiveness, indicating its potential in real-world applications.
- The authors state that the code for C^2RoPE will be made available for further research and development.
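The spatial-locality point above can be made concrete: flattening an H×W patch grid into a 1D token sequence via a raster scan (as is common in ViT-style tokenization) assigns vertically adjacent patches indices that differ by the grid width, so a distance-based positional scheme treats them as far apart. A minimal sketch with hypothetical grid sizes (not from the paper):

```python
# Hypothetical illustration (not the paper's code): flattening an H x W
# patch grid into a 1D token sequence assigns raster-scan indices.
H, W = 4, 6  # example patch grid: 4 rows, 6 columns

def raster_index(row, col, width):
    """1D position a flattened ViT-style token stream would receive."""
    return row * width + col

# Horizontal neighbours stay adjacent in 1D...
horiz = raster_index(1, 3, W) - raster_index(1, 2, W)
# ...but vertical neighbours end up W positions apart, which is the
# "spatial locality loss" the takeaways refer to.
vert = raster_index(2, 2, W) - raster_index(1, 2, W)
print(horiz, vert)  # 1 6
```

Under RoPE's distance-decaying attention prior, the vertical neighbour at offset 6 is attended to as weakly as a token six steps away in text, even though it is physically adjacent in the image.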
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.10551 (cs)
Submitted on 11 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v2)
Authors: Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Ka-Veng Yuen
Abstract: Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional e...
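For context, the baseline the paper modifies is standard 1D RoPE, whose defining property is that attention logits depend only on the relative offset between token positions. The sketch below is a generic RoPE implementation using the common theta = 10000^(-2i/d) frequency convention; it is an assumption-laden illustration of the baseline, not the paper's C^2RoPE (whose code is not yet released):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply standard 1D rotary position embedding to a vector at position `pos`.

    Pairs dimension i of the first half with dimension i of the second half
    and rotates each pair by angle pos * base**(-i / (d/2)).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# The query-key logit depends only on the relative offset (here 3),
# not on the absolute positions:
s1 = rope(q, 10) @ rope(k, 7)      # positions 10 and 7
s2 = rope(q, 103) @ rope(k, 100)   # same offset, shifted by 93
print(np.allclose(s1, s2))  # True
```

This relative-offset property is what drives the long-term decay the abstract describes: as the 1D offset between an image token and the current token grows, the rotated inner product is attenuated, so earlier visual tokens receive progressively less attention regardless of their spatial relevance.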