Title

Skill-Based Reinforcement Learning with Intrinsic Reward Matching

Authors

Ademi Adeniji, Amber Xie, Pieter Abbeel

Abstract

While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills for complex, long-horizon tasks and demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on both the Fetch tabletop and Franka Kitchen robot manipulation benchmarks.
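The abstract's core idea is to select the pretrained skill whose discriminator-derived intrinsic reward best matches the downstream task reward, using only sampled states rather than environment rollouts. The sketch below is a hypothetical, minimal illustration of that selection step: the toy intrinsic_reward, task_reward, and candidate_skills are stand-ins for a learned skill discriminator, a downstream reward function, and a discretized skill space, and a plain mean-squared discrepancy is used in place of the reward-matching metric described in the paper.

```python
import numpy as np

# Toy 2-D states standing in for samples from the environment's state distribution.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(256, 2))

# Hypothetical discriminator-based intrinsic reward: each skill z is a unit
# direction and the reward is the state's projection onto it, a stand-in for
# a term like log q(z | s) from a learned skill discriminator.
def intrinsic_reward(state, z):
    return state @ z

# Hypothetical downstream task reward: progress along the +x direction.
def task_reward(state):
    return state @ np.array([1.0, 0.0])

# Candidate skills: unit vectors sampled on the circle (a discretized skill space).
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
candidate_skills = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def reward_discrepancy(z, states):
    """Mean-squared error between intrinsic and task rewards on sampled states.

    This MSE is only an illustrative discrepancy; the paper compares reward
    functions with a more principled matching objective. The key point is that
    no environment rollouts are needed, only reward evaluations on states.
    """
    r_int = np.array([intrinsic_reward(s, z) for s in states])
    r_task = np.array([task_reward(s) for s in states])
    return float(np.mean((r_int - r_task) ** 2))

# Select the skill whose intrinsic reward best matches the downstream task reward.
best_skill = min(candidate_skills, key=lambda z: reward_discrepancy(z, states))
print("selected skill direction:", best_skill)
```

In this toy setup the selected skill is the direction closest to the task's goal direction; the finetuning phase would then start from that skill's policy rather than searching over skills with costly rollouts.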
