Paper Title
An Opponent-Aware Reinforcement Learning Method for Team-to-Team Multi-Vehicle Pursuit via Maximizing Mutual Information Indicator
Paper Authors
Paper Abstract
The pursuit-evasion game in the Smart City has a profound impact on the Multi-Vehicle Pursuit (MVP) problem, in which police cars cooperatively pursue suspected vehicles. Existing studies on the MVP problem tend to set evading vehicles to move randomly or along a fixed, prescribed route. Opponent modeling methods have shown considerable promise in tackling the non-stationarity caused by adversary agents. However, most of them focus on two-player competitive games and simple scenarios free of environmental interference. This paper considers a Team-to-Team Multi-Vehicle Pursuit (T2TMVP) problem in a complicated urban traffic scene, where the evading vehicles adopt pre-trained dynamic strategies to make decisions intelligently. To solve this problem, we propose an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency in complicated environments. First, a sequential encoding-based opponents' joint strategy modeling (SEOJSM) mechanism is proposed to generate the evading vehicles' joint strategy model, which assists the deep Q-network (DQN)-based multi-agent decision-making process. Then, we design a mutual information-united loss that simultaneously considers the reward fed back from the environment and the effectiveness of the opponents' joint strategy model, in order to update the pursuing vehicles' decision-making process. Extensive experiments based on SUMO demonstrate that our method outperforms other baselines by 21.48% on average in reducing pursuit time. The code is available at \url{https://github.com/ANT-ITS/OARLM2I2}.
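To make the two components named in the abstract concrete, below is a minimal PyTorch sketch, not the paper's implementation. It assumes a GRU as the sequential encoder for SEOJSM and an InfoNCE-style lower bound as the mutual-information term in the united loss; all names and shapes (`SEOJSMEncoder`, `opp_feat_dim`, `z_dim`, `lam`, the batch layout) are hypothetical, since the abstract does not specify the architecture or the exact loss.

```python
# Hedged sketch of SEOJSM + a mutual information-united loss.
# Assumptions (not from the paper): GRU encoder, InfoNCE MI surrogate,
# invented names and tensor shapes throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEOJSMEncoder(nn.Module):
    """Sequential encoder over observed opponent trajectories (assumed GRU).

    Consumes a window of the evaders' past features and emits a latent
    joint-strategy embedding z that conditions the pursuers' DQN.
    """
    def __init__(self, opp_feat_dim: int, hidden_dim: int = 64, z_dim: int = 32):
        super().__init__()
        self.gru = nn.GRU(opp_feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, z_dim)

    def forward(self, opp_seq: torch.Tensor) -> torch.Tensor:
        # opp_seq: (batch, time, opp_feat_dim)
        _, h_n = self.gru(opp_seq)         # h_n: (1, batch, hidden_dim)
        return self.head(h_n.squeeze(0))   # z:   (batch, z_dim)

class OpponentAwareDQN(nn.Module):
    """DQN head conditioned on the opponent joint-strategy embedding."""
    def __init__(self, obs_dim: int, z_dim: int, n_actions: int):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + z_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.q(torch.cat([obs, z], dim=-1))

def mi_united_loss(q_net, target_q_net, encoder, batch, gamma=0.99, lam=0.1):
    """TD loss plus an InfoNCE-style mutual-information surrogate.

    The MI term pushes z to be predictive of the evaders' behavior; it is a
    stand-in for the paper's mutual information-united loss, whose exact
    form the abstract does not give.
    """
    # opp_next_act_emb: (batch, z_dim) embedding of the opponents' next
    # joint action (hypothetical supervision signal for the MI term).
    obs, act, rew, next_obs, done, opp_seq, opp_next_act_emb = batch

    z = encoder(opp_seq)
    q = q_net(obs, z).gather(1, act.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        z_next = encoder(opp_seq)  # same window; a real impl would shift it
        target = rew + gamma * (1 - done) * target_q_net(next_obs, z_next).max(1).values
    td_loss = F.mse_loss(q, target)

    # InfoNCE lower bound on I(z; opponent behavior): matching pairs on the
    # diagonal of the similarity matrix, in-batch negatives elsewhere.
    logits = z @ opp_next_act_emb.t()                 # (batch, batch)
    labels = torch.arange(z.size(0), device=z.device)
    mi_lower_bound = -F.cross_entropy(logits, labels)

    # Minimizing this maximizes the MI surrogate while reducing TD error.
    return td_loss - lam * mi_lower_bound
```

InfoNCE is chosen here only because it is a standard, tractable lower bound on mutual information that trains with in-batch negatives; the paper's actual indicator may differ.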