Paper Title
Topological Experience Replay
Paper Authors
Paper Abstract
State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from the experience replay buffer. This strategy often samples data uniformly at random or prioritizes sampling based on measures such as the temporal difference (TD) error. Such sampling strategies can be inefficient at learning the Q-function because a state's Q-value depends on the Q-values of its successor states. If the data sampling strategy ignores the precision of the Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values. To mitigate this issue, we organize the agent's experience into a graph that explicitly tracks the dependency between the Q-values of states. Each edge in the graph represents a transition between two states by executing a single action. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states and successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks. Notably, the proposed method also outperforms baselines that consume more batches of training experience and operates from high-dimensional observational data such as images.
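To make the backup ordering concrete, below is a minimal tabular sketch of the idea described in the abstract: transitions are indexed by their successor state, and Q-value backups are swept breadth-first from terminal states backward, so each update bootstraps from a successor value that has already been refreshed. The tabular Q-table, the constants GAMMA and ALPHA, the helper names build_graph and topological_backup, and the toy chain at the end are illustrative assumptions, not the paper's implementation, which applies this ordering within deep Q-learning.

from collections import defaultdict, deque

GAMMA = 0.99  # discount factor (illustrative choice)
ALPHA = 1.0   # step size; 1.0 makes each tabular backup exact (illustrative choice)

def build_graph(transitions):
    """Index transitions by successor state so value backups can walk backward."""
    predecessors = defaultdict(list)  # next_state -> [(state, action, reward, done), ...]
    terminals = set()
    for s, a, r, s_next, done in transitions:
        predecessors[s_next].append((s, a, r, done))
        if done:
            terminals.add(s_next)
    return predecessors, terminals

def topological_backup(transitions):
    """Sweep Q-value backups in breadth-first order from terminal states backward."""
    predecessors, terminals = build_graph(transitions)
    q = defaultdict(float)         # Q[(state, action)], initialized to zero
    actions_at = defaultdict(set)  # actions observed at each state
    for s, a, _, _, _ in transitions:
        actions_at[s].add(a)

    def value(state):
        # Greedy value max_a Q(state, a) over the actions seen at this state.
        acts = actions_at.get(state)
        return max(q[(state, a)] for a in acts) if acts else 0.0

    frontier = deque(terminals)
    visited = set(terminals)
    while frontier:
        s_next = frontier.popleft()
        # Every transition arriving at s_next bootstraps from the value of s_next,
        # which was already refreshed earlier in the sweep.
        for s, a, r, done in predecessors.get(s_next, []):
            target = r if done else r + GAMMA * value(s_next)
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            if s not in visited:
                visited.add(s)
                frontier.append(s)
    return q

# Toy usage: a two-step chain s0 -> s1 -> s2 (terminal) with reward 1 on the final step.
demo = [("s0", "right", 0.0, "s1", False),
        ("s1", "right", 1.0, "s2", True)]
print(dict(topological_backup(demo)))  # Q(s1, right) = 1.0, Q(s0, right) = GAMMA * 1.0

Processing states in this reverse breadth-first order means the bootstrap target for each transition already reflects the latest estimate of its successor, which is the dependency the abstract argues uniform or TD-error-based sampling ignores.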