REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs


Author(s): Bartlett, Peter L.; Tewari, Ambuj
Date(s)

2009

Abstract

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
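
The selection rule described in the abstract (choose, among plausible models, the one whose average reward penalized by the span of its bias vector is largest) can be sketched in a few lines. The Python fragment below is a minimal illustration under stated assumptions, not the paper's algorithm: the helper `relative_value_iteration`, the finite list `models` standing in for a confidence region of MDPs, and the penalty constant `C` are all illustrative choices.

```python
import numpy as np

def span(h):
    # Span semi-norm of a bias vector: max(h) - min(h).
    return float(h.max() - h.min())

def relative_value_iteration(P, R, iters=5000, tol=1e-9):
    """Average-reward planning via relative value iteration (illustrative).

    P: (A, S, S) transition probabilities, R: (S, A) rewards.
    Returns (gain, bias, greedy_policy). Assumes the MDP is weakly
    communicating, so the optimal gain is the same in every state.
    """
    A, S, _ = P.shape
    h = np.zeros(S)
    gain = 0.0
    for _ in range(iters):
        Q = R.T + P @ h           # Q[a, s] = r(s, a) + E[h(s') | s, a]
        v = Q.max(axis=0)         # greedy one-step backup
        gain = v[0]               # gain estimate from a reference state
        h_new = v - gain          # re-center so the iterates stay bounded
        if np.abs(h_new - h).max() < tol:
            h = h_new
            break
        h = h_new
    policy = Q.argmax(axis=0)     # greedy policy w.r.t. the final bias
    return gain, h, policy

def regal_style_select(models, C):
    """Pick, among plausible (P, R) models, the policy of the model
    maximizing average reward minus C times the span of its bias."""
    best_score, best_policy = -np.inf, None
    for P, R in models:
        gain, h, policy = relative_value_iteration(P, R)
        score = gain - C * span(h)
        if score > best_score:
            best_score, best_policy = score, policy
    return best_policy

# Toy usage: five random 2-state, 2-action models as a stand-in
# for the confidence set built from observed data.
rng = np.random.default_rng(0)
def random_model(S=2, A=2):
    P = rng.dirichlet(np.ones(S), size=(A, S))  # rows sum to 1
    R = rng.random((S, A))
    return P, R

policy = regal_style_select([random_model() for _ in range(5)], C=0.1)
print("chosen policy:", policy)
```

Note that enumerating a finite model list is a simplification; the paper's algorithm optimizes over a continuous confidence region of MDPs consistent with the observations so far.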

Identifier

http://eprints.qut.edu.au/45708/

Relation

http://www.cs.mcgill.ca/~uai2009/index.html

Bartlett, Peter L. & Tewari, Ambuj (2009) REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), McGill University, Montreal.

Rights

Copyright 2009 [please consult the authors]

Source

Faculty of Science and Technology; Mathematical Sciences

Keywords

#080100 ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING #algorithm, optimal regret rate, Markov Decision Process (MDP)

Type

Conference Paper