Paper Title

Safe Reinforcement Learning with Natural Language Constraints

Paper Authors

Tsung-Yen Yang, Michael Hu, Yinlam Chow, Peter J. Ramadge, Karthik Narasimhan

Abstract

While safe reinforcement learning (RL) holds great promise for many practical applications like robotics or autonomous cars, current approaches require specifying constraints in mathematical form. Such specifications demand domain expertise, limiting the adoption of safe RL. In this paper, we propose learning to interpret natural language constraints for safe RL. To this end, we first introduce HazardWorld, a new multi-task benchmark that requires an agent to optimize reward while not violating constraints specified in free-form text. We then develop an agent with a modular architecture that can interpret and adhere to such textual constraints while learning new tasks. Our model consists of (1) a constraint interpreter that encodes textual constraints into spatial and temporal representations of forbidden states, and (2) a policy network that uses these representations to produce a policy achieving minimal constraint violations during training. Across different domains in HazardWorld, we show that our method achieves higher rewards (up to 11x) and fewer constraint violations (by 1.8x) compared to existing approaches. However, in terms of absolute performance, HazardWorld still poses significant challenges for agents to learn efficiently, motivating the need for future work.
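To make the modular architecture described in the abstract more concrete, the following is a minimal, hypothetical sketch of how a constraint interpreter and a constraint-conditioned policy network could be wired together. The class names, encoder choices (an embedding layer plus a GRU), grid size, and output heads are illustrative assumptions under a grid-world setting, not the paper's released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module names, dimensions, and encoder choices
# below are assumptions, not the authors' actual implementation.

class ConstraintInterpreter(nn.Module):
    """Encodes a free-form textual constraint into (i) a spatial mask over
    grid cells the agent should avoid and (ii) a temporal budget scalar."""
    def __init__(self, vocab_size, embed_dim=64, grid_size=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.spatial_head = nn.Linear(embed_dim, grid_size * grid_size)
        self.temporal_head = nn.Linear(embed_dim, 1)
        self.grid_size = grid_size

    def forward(self, constraint_tokens):
        # constraint_tokens: (batch, seq_len) token ids of the text constraint
        emb = self.embed(constraint_tokens)
        _, h = self.text_encoder(emb)                   # (1, batch, embed_dim)
        h = h.squeeze(0)
        spatial = torch.sigmoid(self.spatial_head(h))   # forbidden-cell probabilities
        spatial = spatial.view(-1, self.grid_size, self.grid_size)
        budget = torch.relu(self.temporal_head(h))      # e.g. an allowed-usage budget
        return spatial, budget


class ConstrainedPolicy(nn.Module):
    """Policy network conditioned on the observation plus the interpreter's
    spatial mask and budget, producing a distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, grid_size=7, hidden=128):
        super().__init__()
        in_dim = obs_dim + grid_size * grid_size + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, spatial, budget):
        x = torch.cat([obs, spatial.flatten(1), budget], dim=-1)
        return torch.distributions.Categorical(logits=self.net(x))
```

In this sketch the interpreter's outputs are simply concatenated with the observation before the policy head; how the spatial and temporal representations are actually fused with the policy, and how constraint violations are penalized during training, would follow the paper's own method.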
