[2602.06130] Self-Improving World Modelling with Latent Actions

arXiv - AI · 4 min read

Summary

The paper presents SWIRL, a framework for self-improving world modelling that treats actions as latent variables, improving predictive accuracy without costly action-labelled trajectories.

Why It Matters

This research addresses the challenge of learning effective world models in AI systems, particularly for large language models (LLMs) and vision-language models (VLMs). By treating actions as latent variables, the framework reduces dependence on action-labelled data, potentially accelerating progress in AI reasoning and planning.

Key Takeaways

  • SWIRL treats actions as latent variables, learning world models from state-only sequences.
  • The framework alternates between Forward World Modelling and Inverse Dynamics Modelling.
  • It achieves significant performance improvements across multiple benchmarks.
  • The approach removes reliance on costly action-labelled trajectories.
  • Theoretical learnability guarantees strengthen the framework's credibility.
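The two phase names in the takeaways correspond to standard variational lower bounds. The exact objectives in the paper may differ, but "ELBO Maximisation" and "Variational Information Maximisation" conventionally refer to the evidence lower bound on the state-only likelihood and the Barber-Agakov bound on mutual information, respectively:

$$ \log P_\theta(Y \mid X) \;\ge\; \mathbb{E}_{Q_\phi(Z \mid X,Y)}\big[\log P_\theta(Y \mid X,Z)\big] \;-\; \mathrm{KL}\big(Q_\phi(Z \mid X,Y)\,\|\,P(Z \mid X)\big) $$

$$ I(Y; Z \mid X) \;\ge\; H(Z \mid X) \;+\; \mathbb{E}\big[\log Q_\phi(Z \mid X,Y)\big] $$

The first bound explains why fitting the IDM $Q_\phi$ to observed transitions tightens the likelihood of state-only data; the second explains why rewarding the FWM with the IDM's log-probability increases the mutual information between latent actions and generated next states.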

Computer Science > Machine Learning

arXiv:2602.06130 (cs) [Submitted on 5 Feb 2026 (v1), last revised 15 Feb 2026 (this version, v2)]

Title: Self-Improving World Modelling with Latent Actions

Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

Abstract: Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability ...
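The alternating scheme in the abstract can be caricatured in a few lines of NumPy. This is a toy sketch under strong assumptions, not the authors' implementation: states and latent actions are small discrete sets, the two "models" are logit tables rather than LLMs, and GRPO is reduced to a REINFORCE update with a group-mean baseline. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3  # illustrative toy sizes

theta = np.zeros((n_states, n_actions, n_states))  # FWM P_theta(Y | X, Z)
phi = np.zeros((n_states, n_states, n_actions))    # IDM Q_phi(Z | X, Y)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def grpo_step(logits, reward_fn, k=8, lr=0.2):
    """Sample a group of k candidates and apply a policy-gradient update
    with the group's mean reward as baseline (a GRPO-style stand-in)."""
    p = softmax(logits)
    samples = rng.choice(len(p), size=k, p=p)
    rewards = np.array([reward_fn(s) for s in samples])
    adv = rewards - rewards.mean()       # group-relative advantage
    for s, a in zip(samples, adv):
        grad = -p.copy()
        grad[s] += 1.0                   # d log p(s) / d logits
        logits = logits + lr * a * grad
    return logits

# State-only data: observed (x, y) pairs from a hidden dynamics
# y = (x + z) mod n_states; the true action z is never revealed.
observed = [(x, (x + z) % n_states)
            for x in range(n_states) for z in range(n_actions)]

for _ in range(100):
    # Phase (2), ELBO maximisation: the IDM learns to explain observed
    # transitions; the frozen FWM's log-probability is the reward.
    for x, y in observed:
        phi[x, y] = grpo_step(
            phi[x, y], lambda z: np.log(softmax(theta[x, z])[y] + 1e-9))
    # Phase (1), information maximisation: the FWM generates next states
    # from which the frozen IDM can decode the latent action; the frozen
    # IDM's log-probability is the reward.
    for x in range(n_states):
        for z in range(n_actions):
            theta[x, z] = grpo_step(
                theta[x, z], lambda y: np.log(softmax(phi[x, y])[z] + 1e-9))
```

Each phase freezes the opposite model and uses only its log-probability as the scalar reward, which is the coordinate-ascent structure the abstract describes; everything else here (table parameterisation, update rule, toy dynamics) is an assumption for illustration.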

Related Articles

Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED

The AI lab's Project Glasswing will bring together Apple, Google, and more than 45 other organizations. They'll use the new Claude Mythos...

Wired - AI · 7 min · Llms

The public needs to control AI-run infrastructure, labor, education, and governance— NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min · Llms

Agents that write their own code at runtime and vote on capabilities, no human in the loop

hollowOS just hit v4.4 and I added something that I haven’t seen anyone else do. Previous versions gave you an OS for agents: structured ...

Reddit - Artificial Intelligence · 1 min ·

Google Maps can now write captions for your photos using AI | TechCrunch

Gemini can now create captions when users are looking to share a photo or video.

TechCrunch - AI · 4 min · Llms
