Paper Title

Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Paper Authors

Samuele Tosatto, João Carvalho, Jan Peters

Paper Abstract

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high bias or high variance, delivering often unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
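
Below is a minimal sketch, not the authors' implementation, of the idea the abstract describes: from a fixed batch of transitions, build a kernel-based nonparametric Bellman equation, solve it in closed form, and differentiate the resulting return estimate with respect to the policy parameters via automatic differentiation. The RBF kernels, bandwidths, and the linear-Gaussian policy are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a nonparametric off-policy policy gradient from a batch of data.
# Assumed ingredients (not from the paper): RBF kernels with fixed bandwidths
# and a linear-Gaussian policy whose mean is s @ theta.
import jax
import jax.numpy as jnp


def rbf(x, y, bandwidth=0.5):
    """Gaussian (RBF) kernel matrix between two batches of vectors."""
    d = jnp.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-d / (2.0 * bandwidth ** 2))


def policy_mean(theta, s):
    """Assumed linear-Gaussian policy: mean action is s @ theta."""
    return s @ theta


def batch_return(theta, s, a, r, s_next, gamma=0.99, sigma=0.2):
    """Closed-form value estimate from the batch, differentiable in theta."""
    n = s.shape[0]
    # Kernel weights of each successor state against the sampled (s, a) pairs,
    # with sampled actions weighted by their likelihood under the current policy.
    k_s = rbf(s_next, s)                                    # (n, n) state similarity
    mu = policy_mean(theta, s_next)                         # policy actions at successor states
    k_a = jnp.exp(-jnp.sum((mu[:, None, :] - a[None, :, :]) ** 2, axis=-1)
                  / (2.0 * sigma ** 2))                     # (n, n) action similarity
    P = k_s * k_a
    P = P / (P.sum(axis=1, keepdims=True) + 1e-8)           # row-normalized empirical transition operator
    # Nonparametric Bellman equation V = r + gamma * P V, solved in closed form.
    V = jnp.linalg.solve(jnp.eye(n) - gamma * P, r)
    return V.mean()


# The off-policy policy gradient is simply the gradient of the closed-form
# solution, so no importance weights are needed.
policy_gradient = jax.grad(batch_return)
```

Because the gradient flows through the exact solution of the nonparametric Bellman equation, this sketch avoids both importance-sampling ratios and semi-gradient truncation, which is the trade-off the abstract highlights.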
