[2407.03888] Continuous-time q-Learning for Jump-Diffusion Models under Tsallis Entropy
Summary
This paper develops continuous-time q-learning (the continuous-time counterpart of Q-learning) for jump-diffusion models under Tsallis entropy regularization, characterizing the optimal policy via a Lagrange multiplier and deriving two q-learning algorithms.
Why It Matters
The study advances reinforcement learning by replacing the usual Shannon entropy regularizer with Tsallis entropy, under which the optimal policy is no longer necessarily a Gibbs measure. This could enhance decision-making in complex systems, particularly in finance and control theory, where jump-diffusion models are prevalent.
Key Takeaways
- Introduces continuous-time q-Learning under Tsallis entropy.
- Establishes a martingale characterization of the q-function.
- Develops two q-learning algorithms based on Lagrange multipliers (a toy illustration of the multiplier mechanism appears after this list).
- Demonstrates explicit policy characterization in numerical examples, including an optimal liquidation problem in dark pools.
- Highlights the potential for improved decision-making in jump-diffusion settings.
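The role of the Lagrange multiplier is easiest to see in a discrete toy version of the problem. The sketch below is illustrative only (the paper works with action densities in continuous time): it fixes the entropy index at q = 2, for which the KKT conditions reduce to a sparsemax-type projection where low-reward actions receive exactly zero probability, so the optimizer is not a Gibbs measure. The names `tsallis_entropy`, `optimal_policy_q2`, and the temperature parameter `temp` are our own, not the paper's.

```python
import numpy as np

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1); Shannon as q -> 1."""
    p = np.asarray(p, dtype=float)
    if np.isclose(q, 1.0):
        pos = p[p > 0]
        return float(-np.sum(pos * np.log(pos)))  # Shannon limit
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def optimal_policy_q2(rewards, temp=1.0):
    """Maximize <p, r> + temp * S_2(p) over the probability simplex.

    For entropy index q = 2, the KKT conditions give
        p_i = max(0, (r_i - c) / (2 * temp)),
    where the Lagrange multiplier c enforces sum_i p_i = 1 -- a
    sparsemax-type projection with compactly supported output.
    """
    z = np.asarray(rewards, dtype=float) / (2.0 * temp)
    z_sort = np.sort(z)[::-1]                     # rewards in descending order
    csum = np.cumsum(z_sort)
    ks = np.arange(1, z.size + 1)
    k = ks[1.0 + ks * z_sort > csum][-1]          # support size from the KKT test
    tau = (csum[k - 1] - 1.0) / k                 # normalizing multiplier
    return np.maximum(z - tau, 0.0)

p = optimal_policy_q2([1.0, 0.5, -0.2], temp=0.5)
print(p, p.sum(), tsallis_entropy(p))             # [0.75 0.25 0.  ] 1.0 0.375
```

Note how the lowest-reward action is assigned probability zero: unlike the Shannon-entropy (softmax) case, the multiplier can push part of the action space entirely out of the support.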
Mathematics > Optimization and Control
arXiv:2407.03888 (math)
[Submitted on 4 Jul 2024 (v1), last revised 13 Feb 2026 (this version, v4)]
Title: Continuous-time q-Learning for Jump-Diffusion Models under Tsallis Entropy
Authors: Lijun Bo, Yijie Huang, Xiang Yu, Tingting Zhang
Abstract: This paper studies continuous-time reinforcement learning in jump-diffusion models by featuring q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization. Contrary to Shannon entropy, the general form of Tsallis entropy renders the optimal policy not necessarily a Gibbs measure, so a Lagrange multiplier and the KKT conditions are needed to ensure that the learned policy is a probability density function. As a consequence, the characterization of the optimal policy using the q-function also involves a Lagrange multiplier. In response, we establish the martingale characterization of the q-function and devise two q-learning algorithms depending on whether the Lagrange multiplier can be derived explicitly or not. In the latter case, we consider different parameterizations of the optimal q-function and the optimal policy, and update them alternately in an Actor-Critic manner. We also study two numerical examples, namely, an optimal liquidation problem in dark pools and a non-LQ co...
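To make the martingale characterization concrete, here is a minimal discretized sketch in the generic spirit of continuous-time q-learning: along a sample trajectory, the process e^{-βt} V(t, X_t) + ∫_0^t e^{-βs} (r_s − q(s, X_s, a_s)) ds should be a martingale at the optimum, so its Euler increments behave like zero-mean TD residuals that a critic can drive toward zero. This is a sketch under our own discretization assumptions, not the paper's exact scheme (which also handles jumps and the Lagrange multiplier); `V`, `q_fn`, `beta`, and the toy parameterizations are placeholders we introduce for illustration.

```python
import numpy as np

def martingale_residuals(ts, xs, acts, rews, V, q_fn, beta):
    """Euler increments of the putative q-learning martingale along one path.

    At the optimum, M_t = exp(-beta*t) * V(t, X_t)
        + integral_0^t exp(-beta*s) * (r_s - q(s, X_s, a_s)) ds
    is a martingale, so each discretized increment below has zero
    conditional mean; a critic can minimize the sum of their squares.
    """
    ts, xs, acts, rews = map(np.asarray, (ts, xs, acts, rews))
    disc = np.exp(-beta * ts)                                  # discount factors
    dV = disc[1:] * V(ts[1:], xs[1:]) - disc[:-1] * V(ts[:-1], xs[:-1])
    flow = disc[:-1] * (rews[:-1] - q_fn(ts[:-1], xs[:-1], acts[:-1])) * np.diff(ts)
    return dV + flow

# Toy usage with placeholder linear-quadratic parameterizations (illustrative only).
theta, psi = 0.8, 0.1
V = lambda t, x: theta * x**2
q_fn = lambda t, x, a: -psi * (a - x)**2
ts = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(0)
xs = np.cumsum(rng.normal(0.0, 0.1, 11))          # dummy state path
acts, rews = -xs, -xs**2                          # dummy actions and rewards
loss = np.sum(martingale_residuals(ts, xs, acts, rews, V, q_fn, beta=0.5) ** 2)
```

In the alternating Actor-Critic variant the abstract describes, a loss of this kind would update the critic parameters, while the policy parameterization (and, when it is not explicit, the Lagrange multiplier) is updated in a separate step.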