[2602.18639] Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

arXiv - Machine Learning

Summary

This paper addresses a key weakness of latent predictive world models: their sensitivity to task-irrelevant visual variations, which degrades robustness when planning from visual observations.

Why It Matters

The research addresses a critical limitation in current predictive architectures, enhancing their reliability in real-world applications where visual distractions are common. By improving robustness, this work could lead to more effective AI systems in navigation and decision-making tasks.

Key Takeaways

  • Introduces a bisimulation encoder to enhance predictive model robustness.
  • Demonstrates improved performance in navigation tasks under varying visual conditions.
  • Achieves a reduction in latent space size, making models more efficient.
  • Maintains compatibility with various pretrained visual encoders.
  • Mitigates sensitivity to "slow features" — task-irrelevant visual variations such as background changes and distractors.

Computer Science > Machine Learning — arXiv:2602.18639 (cs) [Submitted on 20 Feb 2026]

Title: Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

Authors: Leonardo F. Toso, Davit Shadunts, Yunyang Lu, Nihal Sharma, Donglin Zhan, Nam H. Nguyen, James Anderson

Abstract: World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), display degraded test-time robustness due to their sensitivity to "slow features": visual variations, such as background changes and distractors, that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space…
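The paper's exact bisimulation objective is not given in this summary, so as a rough illustration of the general idea, the sketch below follows the standard bisimulation-metric formulation (as in Deep Bisimulation for Control): the distance between two latent states is trained to match the difference in a task-relevance signal plus the discounted distance between their predicted next latents. All names (`bisimulation_loss`, the use of per-state rewards as the relevance signal) are assumptions for illustration, not the authors' method.

```python
import random

def bisimulation_loss(z, rewards, z_next, gamma=0.99):
    """Hedged sketch of a DBC-style bisimulation objective.

    Trains the L1 distance between two latent states to match the
    difference in their task-relevance signal (here, a reward proxy;
    an assumption, not necessarily what the paper uses) plus the
    discounted L1 distance between their predicted next latents.
    States with similar transition dynamics thus land near each other
    in latent space, while task-irrelevant "slow features" contribute
    nothing to the target and are suppressed.
    """
    n = len(z)
    # pair each sample i with a shuffled partner j from the same batch
    partners = list(range(n))
    random.shuffle(partners)
    total = 0.0
    for i, j in enumerate(partners):
        dist_z = sum(abs(a - b) for a, b in zip(z[i], z[j]))
        r_diff = abs(rewards[i] - rewards[j])
        dist_next = sum(abs(a - b) for a, b in zip(z_next[i], z_next[j]))
        target = r_diff + gamma * dist_next  # bisimulation target distance
        total += (dist_z - target) ** 2
    return total / n
```

In practice `z` and `z_next` would come from the bisimulation encoder and the latent dynamics model respectively, and this loss would be added to the JEPA predictive objective rather than replacing it.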
