Paper Title

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Paper Authors

Jialian Li, Tongzheng Ren, Dong Yan, Hang Su, Jun Zhu

Paper Abstract

In high-stakes scenarios like medical treatment and auto-piloting, it is risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from inherent mismatches between the simulator and the real environment. It is therefore imperative to utilize the simulator to learn a policy that is robust for real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDPs), where the agent seeks a policy that is robust to unexpected perturbations of the environment. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties, as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To address this issue, we propose a generic method that formalizes the perturbation as an opponent, yielding a two-player zero-sum game, and we further show that the Nash equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with high probability. Thanks to the game-theoretic formulation, our method can handle general perturbations under mild assumptions and can also be extended to more complex problems such as robust partially observable Markov decision processes.
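To make the game-theoretic formulation in the abstract concrete, below is a minimal, illustrative sketch of tabular robust value iteration: the inner minimization plays the adversarial opponent, the outer maximization plays the agent. This is not the paper's algorithm; the L1-ball uncertainty set around an empirical model, the greedy mass-shifting adversary, and all function and parameter names are assumptions made purely for illustration.

```python
import numpy as np


def worst_case_expectation(p, V, radius):
    """Approximate worst-case E_{p'}[V] over {p' : ||p' - p||_1 <= radius} (assumed L1 set)."""
    p = p.copy()
    order = np.argsort(V)                  # states sorted from lowest to highest value
    budget = radius / 2.0                  # at most radius/2 of probability mass is relocated
    for i in order[::-1]:                  # adversary removes mass from the best next states
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
        if budget <= 1e-12:
            break
    p[order[0]] += radius / 2.0 - budget   # and gives it to the worst next state
    return float(p @ V)


def robust_value_iteration(P_hat, R, gamma=0.95, radius=0.1, n_iters=500, tol=1e-6):
    """Tabular robust value iteration: agent maximizes, adversarial perturbation minimizes.

    P_hat  : (S, A, S) empirical transition model estimated from generative-model samples.
    R      : (S, A) reward matrix.
    radius : L1 budget of the adversary's perturbation (illustrative uncertainty set).
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # Inner player (adversary) picks the worst transition kernel in the set.
                Q[s, a] = R[s, a] + gamma * worst_case_expectation(P_hat[s, a], V, radius)
        V_new = Q.max(axis=1)              # outer player (agent) best-responds
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)             # robust value function and greedy robust policy
```

The max-min structure of the update is the tabular analogue of the two-player zero-sum view described above: a fixed point of this update gives a policy that is optimal against the worst perturbation within the assumed constraint set.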
