Paper Title
Temporal Difference Uncertainties as a Signal for Exploration
Paper Authors
Paper Abstract
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging a problem as exploration itself. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including the Deep Sea and Atari 2600 environments, and find that our proposed form of exploration facilitates both diverse and deep exploration.
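As a reading aid, here is a minimal sketch (not the authors' implementation) of one way the abstract's idea could be instantiated: an ensemble of Q-value heads induces a distribution over temporal difference errors for a given transition, and the spread of that distribution serves as an intrinsic reward for a separate exploration policy. The ensemble parameterization and the names `td_errors` and `intrinsic_reward` are illustrative assumptions.

```python
# Hypothetical sketch: value uncertainty from a distribution over TD errors,
# approximated with an ensemble of Q-heads, used as an intrinsic reward.
import numpy as np

def td_errors(q_ensemble, q_target_ensemble, s, a, r, s_next, gamma=0.99):
    """One TD error per ensemble head for a single (s, a, r, s') transition."""
    errors = []
    for q, q_target in zip(q_ensemble, q_target_ensemble):
        target = r + gamma * np.max(q_target(s_next))  # bootstrap from this head's target network
        errors.append(target - q(s)[a])                # TD error under this head
    return np.array(errors)

def intrinsic_reward(q_ensemble, q_target_ensemble, s, a, r, s_next):
    """Uncertainty conditioned on the transition: spread of TD errors across heads."""
    deltas = td_errors(q_ensemble, q_target_ensemble, s, a, r, s_next)
    # Shrinks as the heads agree, i.e. in the limit of perfect value estimates.
    return deltas.std()
```

In this sketch, the exploration policy would be trained on `intrinsic_reward` while the task policy is trained on the extrinsic reward; as value estimates improve, the TD errors across heads converge and the intrinsic signal decays, matching the vanishing curriculum described in the abstract.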