Paper Title

Teachable Reinforcement Learning via Advice Distillation

Paper Authors

Olivia Watkins, Trevor Darrell, Pieter Abbeel, Jacob Andreas, Abhishek Gupta

Paper Abstract

Training automated agents to complete complex tasks in interactive environments is challenging: reinforcement learning requires careful hand-engineering of reward functions, imitation learning requires specialized infrastructure and access to a human expert, and learning from intermediate forms of supervision (like binary preferences) is time-consuming and extracts little information from each human intervention. Can we overcome these challenges by building agents that learn from rich, interactive feedback instead? We propose a new supervision paradigm for interactive learning based on "teachable" decision-making systems that learn from structured advice provided by an external teacher. We begin by formalizing a class of human-in-the-loop decision making problems in which multiple forms of teacher-provided advice are available to a learner. We then describe a simple learning algorithm for these problems that first learns to interpret advice, then learns from advice to complete tasks even in the absence of human supervision. In puzzle-solving, navigation, and locomotion domains, we show that agents that learn from advice can acquire new skills with significantly less human supervision than standard reinforcement learning algorithms and often less than imitation learning.
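
To make the two-phase recipe described in the abstract concrete, below is a minimal sketch, not the authors' implementation: a grounding phase where the agent learns to interpret teacher advice, followed by a distillation phase where the advice-conditioned behavior is compressed into an advice-free policy. The network sizes, the supervised losses standing in for the paper's RL/IL training, and all names (AdvicePolicy, grounding_step, distillation_step, and so on) are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of advice distillation:
# Phase 1 grounds advice by training pi(a | obs, advice); Phase 2 distills
# it into an advice-free student pi(a | obs) that needs no human in the loop.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdvicePolicy(nn.Module):
    """pi(a | obs, advice): acts on observations plus encoded teacher advice."""
    def __init__(self, obs_dim: int, advice_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + advice_dim, 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor, advice: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, advice], dim=-1))  # action logits

class AdviceFreePolicy(nn.Module):
    """pi(a | obs): the distilled student that acts without teacher advice."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def grounding_step(policy, batch, opt):
    """Phase 1: learn to interpret advice. Here the update is plain behavior
    cloning on (obs, advice, action) triples gathered while a teacher coaches
    the agent; the paper trains this phase interactively."""
    obs, advice, action = batch
    loss = F.cross_entropy(policy(obs, advice), action)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def distillation_step(advice_policy, student, batch, opt):
    """Phase 2: distill the advice-conditioned policy into the student by
    matching its action distribution on states visited under advice."""
    obs, advice = batch
    with torch.no_grad():
        target = F.softmax(advice_policy(obs, advice), dim=-1)
    loss = F.kl_div(F.log_softmax(student(obs), dim=-1), target,
                    reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random stand-in data (obs_dim=8, advice_dim=4, 3 actions).
if __name__ == "__main__":
    teacher = AdvicePolicy(8, 4, 3)
    student = AdviceFreePolicy(8, 3)
    t_opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
    s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    obs, adv = torch.randn(32, 8), torch.randn(32, 4)
    act = torch.randint(0, 3, (32,))
    grounding_step(teacher, (obs, adv, act), t_opt)
    distillation_step(teacher, student, (obs, adv), s_opt)
```

The point of the split is the one the abstract makes: human effort is spent only where it is informative (providing structured advice during grounding), after which distillation removes the human from the loop entirely.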
