[2602.18639] Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

arXiv - Machine Learning

Summary

This paper addresses a key weakness of latent predictive world models: their sensitivity to task-irrelevant visual variations, which degrades robustness when planning from visual observations.

Why It Matters

The research addresses a critical limitation in current predictive architectures, enhancing their reliability in real-world applications where visual distractions are common. By improving robustness, this work could lead to more effective AI systems in navigation and decision-making tasks.

Key Takeaways

  • Introduces a bisimulation encoder to enhance predictive model robustness.
  • Demonstrates improved performance in navigation tasks under varying visual conditions.
  • Achieves a reduction in latent space size, making models more efficient.
  • Maintains compatibility with various pretrained visual encoders.
  • Mitigates sensitivity to "slow features" — task-irrelevant visual variations such as background changes and distractors.

Computer Science > Machine Learning — arXiv:2602.18639 (cs) [Submitted on 20 Feb 2026]

Title: Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

Authors: Leonardo F. Toso, Davit Shadunts, Yunyang Lu, Nihal Sharma, Donglin Zhan, Nam H. Nguyen, James Anderson

Abstract: World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), display degraded test-time robustness due to their sensitivity to "slow features": visual variations, such as background changes and distractors, that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space…
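The paper's exact bisimulation objective is not given in this summary, so as a rough illustration of the general idea, the sketch below follows the standard bisimulation-metric formulation (as in Deep Bisimulation for Control): the distance between two latent states is trained to match the difference in a task-relevance signal plus the discounted distance between their predicted next latents. All names (`bisimulation_loss`, the use of per-state rewards as the relevance signal) are assumptions for illustration, not the authors' method.

```python
import random

def bisimulation_loss(z, rewards, z_next, gamma=0.99):
    """Hedged sketch of a DBC-style bisimulation objective.

    Trains the L1 distance between two latent states to match the
    difference in their task-relevance signal (here, a reward proxy;
    an assumption, not necessarily what the paper uses) plus the
    discounted L1 distance between their predicted next latents.
    States with similar transition dynamics thus land near each other
    in latent space, while task-irrelevant "slow features" contribute
    nothing to the target and are suppressed.
    """
    n = len(z)
    # pair each sample i with a shuffled partner j from the same batch
    partners = list(range(n))
    random.shuffle(partners)
    total = 0.0
    for i, j in enumerate(partners):
        dist_z = sum(abs(a - b) for a, b in zip(z[i], z[j]))
        r_diff = abs(rewards[i] - rewards[j])
        dist_next = sum(abs(a - b) for a, b in zip(z_next[i], z_next[j]))
        target = r_diff + gamma * dist_next  # bisimulation target distance
        total += (dist_z - target) ** 2
    return total / n
```

In practice `z` and `z_next` would come from the bisimulation encoder and the latent dynamics model respectively, and this loss would be added to the JEPA predictive objective rather than replacing it.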
