[2602.16833] VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

arXiv - AI · 4 min read

Summary

The paper presents Verbalized Action Masking (VAM), a method for controlling exploration during reinforcement learning (RL) post-training of language models, demonstrated on chess. By verbalizing the set of allowed actions in the prompt and iteratively pruning that set during sampling, VAM improves both learning efficiency and final performance.

Why It Matters

Exploration is a central challenge in reinforcement learning, particularly when action spaces are large and feedback is sparse, since models can collapse prematurely into repetitive behaviors. VAM offers a practical way to steer sampling toward under-explored actions, which could benefit RL post-training of language models in games and other structured decision-making domains.

Key Takeaways

  • VAM verbalizes action masks to improve exploration in RL.
  • Iterative action-space pruning enhances learning efficiency.
  • The method shows improved performance in chess-related tasks.
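The core interface described above can be sketched in a few lines: the action mask (here, a list of legal chess moves) is written directly into the prompt, and the model's reply is accepted only if it falls inside the masked set. This is a minimal illustration, not the paper's implementation; the prompt wording and function names are hypothetical.

```python
def verbalize_action_mask(position_fen, legal_moves):
    """Build a prompt that spells out the current action mask so the
    model must pick a move from the allowed set (hypothetical format)."""
    moves = ", ".join(legal_moves)
    return (
        f"Position (FEN): {position_fen}\n"
        f"Choose exactly one move from this list: {moves}\n"
        "Answer with the move only."
    )

def parse_action(response, legal_moves):
    """Return the sampled move if it lies inside the mask, else None."""
    move = response.strip()
    return move if move in legal_moves else None
```

In practice the mask would come from a legal-move generator for the current position, and a `None` result would trigger a resample.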

Computer Science > Machine Learning
arXiv:2602.16833 (cs) · Submitted on 18 Feb 2026

Title: VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study
Authors: Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

Abstract: Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance ove...
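The iterative action-space pruning procedure from the abstract can be sketched as a simple loop: sample under the current mask, and if the sampled move is valid but not the target, drop it from the mask and resample, stopping once the target appears or the budget is exhausted. Here `sample_fn` stands in for the LLM call; the interface is an assumption for illustration, not the paper's code.

```python
def iterative_pruning_sample(sample_fn, legal_moves, target_move, budget=4):
    """Iterative action-space pruning (sketch): shrink the verbalized
    mask after each miss and resample, until the target move is drawn
    or the sampling budget runs out."""
    mask = list(legal_moves)
    for _ in range(budget):
        move = sample_fn(mask)  # one LLM sample under the current mask
        if move == target_move:
            return move, mask
        if move in mask and len(mask) > 1:
            mask.remove(move)  # prune the valid-but-non-target action
    return None, mask  # budget exhausted without sampling the target
```

Because each miss strictly shrinks the candidate set, the probability mass concentrates on unsampled actions, which is what makes the exploration controllable.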

