Paper Title
Safe Reinforcement Learning for Legged Locomotion
Paper Authors
Paper Abstract
Designing control policies for legged locomotion is complex because the robot dynamics are under-actuated and discontinuous. Model-free reinforcement learning provides promising tools to tackle this challenge. However, a major bottleneck in applying model-free reinforcement learning in the real world is safety. In this paper, we propose a safe reinforcement learning framework that switches between a safe recovery policy, which prevents the robot from entering unsafe states, and a learner policy, which is optimized to complete the task. The safe recovery policy takes over control when the learner policy violates safety constraints, and hands control back when there are no future safety violations. We design the safe recovery policy so that it ensures the safety of legged locomotion while minimally intervening in the learning process. Furthermore, we theoretically analyze the proposed framework and provide an upper bound on the task performance. We verify the proposed framework on four locomotion tasks with a simulated and a real quadrupedal robot: efficient gait, catwalk, two-leg balance, and pacing. On average, our method achieves 48.6% fewer falls and comparable or better rewards than the baseline methods in simulation. When deployed on a real-world quadruped robot, our training pipeline yields a 34% improvement in energy efficiency for the efficient gait, 40.9% narrower foot placement in the catwalk, and twice the jumping duration in the two-leg balance. Our method incurs fewer than five falls over 115 minutes of hardware time.
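The switching idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy one-dimensional dynamics in `predict_next`, the threshold check in `is_safe`, the one-step lookahead used as the switching criterion, and the function names themselves.

```python
from typing import Callable, List, Tuple

Policy = Callable[[float], float]


def predict_next(state: float, action: float) -> float:
    """Hypothetical one-step model: a toy integrator standing in for robot dynamics."""
    return state + action


def is_safe(state: float, limit: float = 1.0) -> bool:
    """Hypothetical safety constraint: the state must stay within +/- limit."""
    return abs(state) <= limit


def switched_action(state: float, learner: Policy, recovery: Policy,
                    limit: float = 1.0) -> Tuple[float, bool]:
    """Use the learner's action unless its predicted next state is unsafe.

    Returns the chosen action and a flag telling whether recovery took over.
    """
    proposed = learner(state)
    if is_safe(predict_next(state, proposed), limit):
        return proposed, False
    return recovery(state), True


def rollout(learner: Policy, recovery: Policy, start: float = 0.0,
            steps: int = 10, limit: float = 1.0) -> Tuple[List[float], int]:
    """Run the switched controller and count recovery interventions."""
    state = start
    trajectory = [state]
    interventions = 0
    for _ in range(steps):
        action, used_recovery = switched_action(state, learner, recovery, limit)
        interventions += int(used_recovery)
        state = predict_next(state, action)
        trajectory.append(state)
    return trajectory, interventions


if __name__ == "__main__":
    # Toy learner that always pushes toward the limit; toy recovery that pulls back.
    traj, n = rollout(lambda s: 0.4, lambda s: -0.5 * s)
    print(traj, n)
```

In this toy setting the learner keeps control whenever its proposed action stays inside the safe set, so the recovery policy intervenes only at the boundary. The framework described in the abstract hands control back based on the absence of future safety violations, which would call for a longer prediction horizon than the single step assumed here.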