Paper Title

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Authors

Wentse Chen, Shiyu Huang, Yuan Chiang, Tim Pearce, Wei-Wei Tu, Ting Chen, Jun Zhu

Abstract

Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to unexpected perturbations. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternately constrains the diversity of the strategies and the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency.
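The abstract does not give the exact form of the intrinsic reward. A common information-theoretic construction in diverse-strategy methods is a lower bound on the mutual information between a latent strategy variable z and the visited states, estimated with a learned discriminator. The sketch below is a minimal illustration of that construction only; the class and function names (StrategyDiscriminator, diversity_intrinsic_reward) and the reward form log q(z|s) - log p(z) are assumptions, not DGPO's definitive implementation.

```python
import torch
import torch.nn as nn

# Hypothetical discriminator q_phi(z | s): predicts which latent strategy z
# generated a given state s. A uniform prior p(z) over strategies is assumed.
class StrategyDiscriminator(nn.Module):
    def __init__(self, state_dim: int, num_strategies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_strategies),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Unnormalized logits over the latent strategies.
        return self.net(state)


def diversity_intrinsic_reward(disc: StrategyDiscriminator,
                               state: torch.Tensor,
                               z: torch.Tensor,
                               num_strategies: int) -> torch.Tensor:
    """log q(z|s) - log p(z): a variational lower bound on I(S; Z)."""
    log_q = torch.log_softmax(disc(state), dim=-1)            # log q(. | s)
    log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)   # log q(z | s)
    log_p_z = -torch.log(torch.tensor(float(num_strategies))) # uniform prior
    return log_q_z - log_p_z
```

In such schemes, this intrinsic signal is added to (or constrained alongside) the extrinsic task reward during policy optimization, which is consistent with the constrained objective the abstract describes.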
