Efficient Offline Policy Optimization with a Learned Model

Paper title

Efficient Offline Policy Optimization with a Learned Model

Authors

Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng Yan, Zhongwen Xu

Abstract

MuZero Unplugged is a promising approach to offline policy learning from logged data. It runs Monte-Carlo Tree Search (MCTS) with a learned model and uses the Reanalyze algorithm to learn purely from offline data. To perform well, MCTS requires an accurate learned model and a large number of simulations, which makes it computationally expensive. This paper investigates several settings in which we hypothesize that MuZero Unplugged may not work well in offline RL: 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; and 4) a low compute budget. We propose a regularized one-step look-ahead approach to tackle these issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimate based on a one-step rollout. The policy is improved in the direction that maximizes the estimated advantage, subject to regularization based on the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that the proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it reaches a 150% IQM normalized score in only 5.6% of the wall-clock time (1 hour versus 17.8 hours for MuZero Unplugged) with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo.
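The one-step look-ahead idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (that is available at the URL above); it is a minimal pure-Python example under stated assumptions. The function names, the exponential reweighting of the prior policy, the cross-entropy form of the loss, the behavior-cloning regularizer, and the values of `discount` and `reg_weight` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_step_advantages(rewards, next_values, state_value, discount=0.99):
    """adv(s, a) = r(s, a) + discount * v(s') - v(s), from a one-step model rollout."""
    return [r + discount * v - state_value for r, v in zip(rewards, next_values)]


def improved_policy(prior_probs, advantages):
    """Assumed improvement rule: reweight the prior by exp(advantage), then renormalize."""
    weights = [p * math.exp(a) for p, a in zip(prior_probs, advantages)]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_loss(policy_probs, target_probs, dataset_action, reg_weight=0.2):
    """Cross-entropy to the improved target plus an assumed behavior-cloning term."""
    improvement = -sum(t * math.log(p) for t, p in zip(target_probs, policy_probs))
    behavior_cloning = -math.log(policy_probs[dataset_action])
    return improvement + reg_weight * behavior_cloning


# Hypothetical learned-model outputs for one state with three actions.
prior = softmax([0.0, 0.0, 0.0])
adv = one_step_advantages(
    rewards=[0.0, 1.0, 0.0], next_values=[0.5, 0.5, 0.5], state_value=0.5
)
target = improved_policy(prior, adv)
loss = regularized_loss(prior, target, dataset_action=1)
```

In this toy state the second action has the highest predicted reward, so the target policy shifts probability mass toward it. No tree search is involved: each action costs a single query of the learned model, which is the source of the compute savings the abstract describes.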

Paper title