TY - GEN
T1 - Reward-Weighted Regression Converges to a Global Optimum
AU - Štrupl, Miroslav
AU - Faccio, Francesco
AU - Ashley, Dylan R.
AU - Srivastava, Rupesh Kumar
AU - Schmidhuber, Juergen
N1 - KAUST Repository Item: Exported on 2022-12-21
Acknowledgements: We would like to thank Sjoerd van Steenkiste and František Žák for their insightful comments. This work was supported by the European Research Council (ERC, Advanced Grant Number 742870), the Swiss National Supercomputing Centre (CSCS, Project s1090), and by the Swiss National Science Foundation (Grant Number 200021_192356, Project NEUSYM). We also thank both the NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and IBM for donating a Minsky machine.
PY - 2022/2/23
Y1 - 2022/2/23
N2 - Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
AB - Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
UR - http://hdl.handle.net/10754/686566
UR - https://arxiv.org/pdf/2107.09088.pdf
U2 - 10.1609/aaai.v36i8.20811
DO - 10.1609/aaai.v36i8.20811
M3 - Conference contribution
SP - 8361
EP - 8369
BT - Proceedings of the AAAI Conference on Artificial Intelligence
PB - AAAI Press
ER -