[2602.16833] VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

arXiv - AI · 4 min read

Summary

The paper presents Verbalized Action Masking (VAM), a method for controlling exploration during reinforcement learning (RL) post-training of language models, demonstrated on chess. By verbalizing the set of allowed actions in the prompt and iteratively pruning that set during sampling, VAM improves both learning efficiency and final performance.

Why It Matters

Exploration is a central challenge in reinforcement learning, particularly when action spaces are large and feedback is sparse, since models can collapse prematurely into repetitive behaviors. VAM offers a practical way to steer sampling toward under-explored actions, which could benefit RL post-training of language models in games and other structured decision-making domains.

Key Takeaways

  • VAM verbalizes action masks to improve exploration in RL.
  • Iterative action-space pruning enhances learning efficiency.
  • The method shows improved performance in chess-related tasks.
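The core interface described above can be sketched in a few lines: the action mask (here, a list of legal chess moves) is written directly into the prompt, and the model's reply is accepted only if it falls inside the masked set. This is a minimal illustration, not the paper's implementation; the prompt wording and function names are hypothetical.

```python
def verbalize_action_mask(position_fen, legal_moves):
    """Build a prompt that spells out the current action mask so the
    model must pick a move from the allowed set (hypothetical format)."""
    moves = ", ".join(legal_moves)
    return (
        f"Position (FEN): {position_fen}\n"
        f"Choose exactly one move from this list: {moves}\n"
        "Answer with the move only."
    )

def parse_action(response, legal_moves):
    """Return the sampled move if it lies inside the mask, else None."""
    move = response.strip()
    return move if move in legal_moves else None
```

In practice the mask would come from a legal-move generator for the current position, and a `None` result would trigger a resample.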

Computer Science > Machine Learning
arXiv:2602.16833 (cs) · Submitted on 18 Feb 2026

Title: VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study
Authors: Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

Abstract: Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance ove...
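The iterative action-space pruning procedure from the abstract can be sketched as a simple loop: sample under the current mask, and if the sampled move is valid but not the target, drop it from the mask and resample, stopping once the target appears or the budget is exhausted. Here `sample_fn` stands in for the LLM call; the interface is an assumption for illustration, not the paper's code.

```python
def iterative_pruning_sample(sample_fn, legal_moves, target_move, budget=4):
    """Iterative action-space pruning (sketch): shrink the verbalized
    mask after each miss and resample, until the target move is drawn
    or the sampling budget runs out."""
    mask = list(legal_moves)
    for _ in range(budget):
        move = sample_fn(mask)  # one LLM sample under the current mask
        if move == target_move:
            return move, mask
        if move in mask and len(mask) > 1:
            mask.remove(move)  # prune the valid-but-non-target action
    return None, mask  # budget exhausted without sampling the target
```

Because each miss strictly shrinks the candidate set, the probability mass concentrates on unsampled actions, which is what makes the exploration controllable.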

