Paper Title

Safe Evaluation For Offline Learning: Are We Ready To Deploy?

Authors

Hager Radi, Josiah P. Hanna, Peter Stone, Matthew E. Taylor

Abstract

The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. RL agents can learn offline from such data, but deploying them while learning might be dangerous in domains where safety is critical. Therefore, it is essential to find a way to estimate how a newly-learned agent will perform if deployed in the target environment, before actually deploying it and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, used to learn an offline policy, and a test-set, used to estimate a lower bound on the offline policy's performance using off-policy evaluation with bootstrapping. A lower-bound estimate tells us how well a newly-learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
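The abstract describes estimating a lower bound on a target policy's performance from a held-out test set using off-policy evaluation with bootstrapping. The sketch below is a minimal illustration of that idea, not the paper's implementation: per-trajectory importance-sampled returns are computed against the behavior policy, then a percentile-bootstrap lower bound on their mean is taken. The `policy.prob(state, action)` interface, the trajectory layout, the discount factor, and the 95% confidence level are all assumptions made for illustration.

```python
import numpy as np

def per_trajectory_is_return(trajectory, target_policy, behavior_policy, gamma=0.99):
    """Importance-weighted return for one trajectory.

    `trajectory` is assumed to be a list of (state, action, reward) tuples;
    both policies are assumed to expose a .prob(state, action) method.
    """
    weight, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        # Ordinary (trajectory-wise) importance weight of target vs. behavior policy.
        weight *= target_policy.prob(s, a) / behavior_policy.prob(s, a)
        ret += (gamma ** t) * r
    return weight * ret

def bootstrap_lower_bound(is_returns, confidence=0.95, n_boot=2000, rng=None):
    """Approximate high-confidence lower bound on the target policy's value.

    Resamples the per-trajectory importance-weighted returns with replacement
    and returns the (1 - confidence) quantile of the bootstrap means.
    """
    rng = np.random.default_rng(rng)
    is_returns = np.asarray(is_returns, dtype=float)
    boot_means = np.array([
        rng.choice(is_returns, size=len(is_returns), replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(boot_means, 1.0 - confidence)
```

A deployment rule in the spirit of the abstract would then be to deploy the newly-learned policy only once this lower bound, computed on the test-set trajectories, exceeds a required performance threshold (for example, the behavior policy's estimated value).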
