Paper Title

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

Authors

Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave

Abstract

It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.
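As a concrete illustration of the ambiguity the abstract describes, a classic example (not specific to this paper) is potential-based reward shaping: two reward functions that differ by a shaping term `R'(s,a) = R(s,a) + γ·φ(s') − φ(s)` induce the same optimal policy, so expert demonstrations alone cannot distinguish them. The sketch below constructs a small hypothetical tabular MDP and checks this invariance; all names and the MDP itself are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration: two rewards differing by potential-based shaping
# yield the same optimal policy, so demonstration data cannot tell them apart.

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Deterministic chain MDP: action 0 moves right (capped), action 1 moves left.
T = np.zeros((n_states, n_actions), dtype=int)
for s in range(n_states):
    T[s, 0] = min(s + 1, n_states - 1)
    T[s, 1] = max(s - 1, 0)

R = rng.normal(size=(n_states, n_actions))    # base reward R(s, a)
phi = rng.normal(size=n_states)               # arbitrary potential function
# Shaped reward: R'(s, a) = R(s, a) + gamma * phi(s') - phi(s)
R_shaped = R + gamma * phi[T] - phi[:, None]

def optimal_policy(reward):
    """Value iteration on the tabular MDP; returns the greedy policy."""
    V = np.zeros(n_states)
    for _ in range(1000):
        Q = reward + gamma * V[T]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# The two distinct reward functions produce identical optimal policies.
print(optimal_policy(R))
print(optimal_policy(R_shaped))
```

Since the shaped optimal Q-function satisfies `Q'(s,a) = Q(s,a) − φ(s)`, the argmax over actions is unchanged in every state, which is exactly the kind of invariance the paper formalises.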
