Paper Title
Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence
Paper Authors
Paper Abstract
Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.
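To make the setting concrete, below is a minimal, illustrative sketch (not the paper's proposed algorithm) of entropy-regularized, NPG-style updates with softmax policies on a small two-player zero-sum matrix game, written directly in parameter space. The game matrix, step size, regularization weight, and initialization are arbitrary assumptions chosen only to illustrate the kind of dynamics the abstract refers to.

```python
# Illustrative sketch only (assumed setup, not the paper's algorithm):
# entropy-regularized NPG-style updates with softmax policies on a 2x2
# zero-sum matrix game, written directly in parameter space. One commonly
# used parameter-space form is theta <- (1 - eta*tau)*theta -/+ eta*(payoff
# against the current opponent policy), up to policy-invariant shifts; the
# paper's concern is precisely how parameter-space dynamics can behave even
# when the induced policies converge.
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Player 1 minimizes x^T A y, player 2 maximizes it; the unique equilibrium
# of this 2x2 game is uniform mixing for both players.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
eta, tau, T = 0.05, 0.2, 2000        # step size, entropy weight, iterations

theta_x = np.array([1.0, -1.0])      # min player's softmax parameters
theta_y = np.array([0.5, -0.5])      # max player's softmax parameters

for _ in range(T):
    x, y = softmax(theta_x), softmax(theta_y)
    theta_x = (1 - eta * tau) * theta_x - eta * (A @ y)
    theta_y = (1 - eta * tau) * theta_y + eta * (A.T @ x)

print("final policies:", softmax(theta_x), softmax(theta_y))   # ~uniform
print("parameter norms:", np.linalg.norm(theta_x), np.linalg.norm(theta_y))
```

Tracking both the induced policies and the raw parameter vectors along such runs is one way to observe the distinction the abstract draws between convergence in policy space and convergence of the parameters themselves.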