Paper Title

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Authors

Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, Jun Zhu

Abstract

Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to unexpected perturbations. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency.
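The abstract does not give the exact form of the intrinsic reward. A common information-theoretic construction in diverse-strategy methods is a lower bound on the mutual information between a latent strategy variable z and the visited states, estimated with a learned discriminator. The sketch below is a minimal illustration of that construction only; the class and function names (StrategyDiscriminator, diversity_intrinsic_reward) and the reward form log q(z|s) - log p(z) are assumptions, not DGPO's definitive implementation.

```python
import torch
import torch.nn as nn

# Hypothetical discriminator q_phi(z | s): predicts which latent strategy z
# generated a given state s. A uniform prior p(z) over strategies is assumed.
class StrategyDiscriminator(nn.Module):
    def __init__(self, state_dim: int, num_strategies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_strategies),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Unnormalized logits over the latent strategies.
        return self.net(state)


def diversity_intrinsic_reward(disc: StrategyDiscriminator,
                               state: torch.Tensor,
                               z: torch.Tensor,
                               num_strategies: int) -> torch.Tensor:
    """log q(z|s) - log p(z): a variational lower bound on I(S; Z)."""
    log_q = torch.log_softmax(disc(state), dim=-1)            # log q(. | s)
    log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)   # log q(z | s)
    log_p_z = -torch.log(torch.tensor(float(num_strategies))) # uniform prior
    return log_q_z - log_p_z
```

In such schemes, this intrinsic signal is added to (or constrained alongside) the extrinsic task reward during policy optimization, which is consistent with the constrained objective the abstract describes.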
