Paper Title

Hindsight Experience Replay with Kronecker Product Approximate Curvature

Paper Authors

Dhuruva Priyan G M, Abhik Singla, Shalabh Bhatnagar

Paper Abstract

Hindsight Experience Replay (HER) is one of the most effective algorithms for solving reinforcement learning tasks in sparse-reward environments. However, due to its reduced sample efficiency and slower convergence, HER fails to perform effectively. Natural gradient methods address these challenges by converging the model parameters better: they avoid taking bad actions that collapse the training performance. However, computing natural gradient updates for neural network parameters is expensive and thus increases training time. Our proposed method addresses the above challenges with better sample efficiency, faster convergence, and an increased success rate. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking because it exploits the errors in the Q-function. We address this issue by incorporating Twin Delayed Deep Deterministic Policy Gradients (TD3) into HER. TD3 learns two Q-functions instead of one and adds noise to the target action, making it harder for the policy to exploit Q-function errors. The experiments are conducted on OpenAI's MuJoCo environments. Results on these environments show that our algorithm (TDHER+KFAC) performs better in most of the scenarios.
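To make the two ingredients named in the abstract concrete: a natural gradient step preconditions the ordinary gradient with the inverse Fisher information matrix (theta <- theta - eta * F^{-1} * grad L), and K-FAC keeps this affordable by approximating each layer's Fisher block as a Kronecker product of two small factors whose inverses are cheap to compute. The sketch below illustrates only the TD3 target referred to in the abstract, i.e. clipped double-Q learning with target-policy smoothing. It is a minimal illustrative sketch, not the authors' implementation; the callables q1_targ, q2_targ, pi_targ and all hyperparameter values are placeholder assumptions.

import numpy as np

def td3_target(r, s_next, done, q1_targ, q2_targ, pi_targ,
               gamma=0.98, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """Illustrative TD3 Bellman target (not the authors' code).

    q1_targ, q2_targ: target critics, callables (state, action) -> value
    pi_targ: target actor, callable state -> action
    Hyperparameter values are placeholder assumptions."""
    a_raw = pi_targ(s_next)
    # Target-policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(sigma * np.random.randn(*np.shape(a_raw)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_raw + noise, -act_limit, act_limit)
    # Clipped double-Q: taking the minimum of the two target critics limits
    # the overestimation that lets the policy exploit Q-function errors.
    q_min = np.minimum(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    # Bootstrapped target; (1 - done) masks out terminal transitions.
    return r + gamma * (1.0 - done) * q_min

In the combined method, such targets would be computed on transitions whose goals have been relabelled in hindsight (HER), with the resulting critic and actor updates preconditioned by the K-FAC natural gradient.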
