Design of Heuristic Reward Functions in Reinforcement Learning Algorithms and Analysis of Their Convergence (强化学习算法中启发式回报函数的设计及其收敛性分析)
Data(s) | 2005
Resumo | The design of the reward function has a strong effect on the performance of a learning system, since the reward both evaluates actions and guides the reinforcement learning (RL) process. According to the distribution of reward values over the state-action space, reward functions take two basic forms, dense and sparse, whose characteristics affect RL performance differently; it is harder to learn a value function from sparse rewards than from dense ones. This paper proposes a basic approach to designing heuristic reward functions: beyond the rewards supplied by the underlying Markov Decision Process (MDP), the learning system receives an additional reward F for each state transition, expressed as the difference of a conservative potential function evaluated at the two states. This additional reward supplies more heuristic information and guides the learning system to progress rapidly, and the gradient inherent in a heuristic reward function gives more leverage when learning the value function. The paper also proves that the optimal policy remains unchanged and that Q-value iteration converges under a more general MDP model. Heuristic reward functions thereby make it feasible to apply reinforcement learning to real-time control and scheduling in large, complex practical systems. (A minimal illustrative sketch of this potential-based shaping idea is given after the record below.) Funding: Innovation Fund of the Advanced Manufacturing Base, Chinese Academy of Sciences (F010120); National 973 Program (2002CB312200)
Identificador |
Idioma(s) | Chinese
Palavras-Chave | #Reinforcement learning #Reward function #Markov decision process #Policy #Convergence
Tipo | Journal article
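
The additional reward described in the abstract follows the general potential-based shaping form F(s, s') = γΦ(s') − Φ(s), which is known to leave the optimal policy of the underlying MDP unchanged. The sketch below is a minimal illustration of that idea inside tabular Q-learning, not the paper's own implementation: the action set, the potential function `phi`, and all parameter values are hypothetical placeholders.

```python
import random
from collections import defaultdict

# Minimal sketch of potential-based reward shaping in tabular Q-learning.
# The shaping term F(s, s') = GAMMA * phi(s') - phi(s) is added to the
# environment reward; this difference-of-potentials form preserves the
# optimal policy of the underlying MDP.

GAMMA = 0.95      # discount factor (illustrative value)
ALPHA = 0.1       # learning rate (illustrative value)
EPSILON = 0.1     # exploration rate (illustrative value)
ACTIONS = [0, 1]  # hypothetical discrete action set

def phi(state):
    """Hypothetical potential function over states. Any real-valued function
    works; a rough estimate of the optimal value gives the most guidance.
    Here: negative distance to an assumed goal state 10."""
    return -abs(state - 10)

def shaped_reward(reward, state, next_state):
    """Environment reward plus the conservative-potential difference."""
    return reward + GAMMA * phi(next_state) - phi(state)

def q_learning_step(Q, state, action, reward, next_state):
    """One Q-learning backup using the shaped reward."""
    r = shaped_reward(reward, state, next_state)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])

def epsilon_greedy(Q, state):
    """Behaviour policy: explore with probability EPSILON, else act greedily."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

# Usage with transitions sampled from some environment loop (not shown):
#   Q = defaultdict(float)
#   a = epsilon_greedy(Q, s)
#   q_learning_step(Q, s, a, r, s_next)
```

Because the potential terms telescope along any trajectory, the shaped return differs from the original return only by Φ of the start state, so the ranking of policies, and hence the optimal policy, is preserved; this is the policy-invariance property the abstract refers to.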