Paper Title

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Authors

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, Jifeng Dai

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with the spatial and temporal domains through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Our approach achieves a new state-of-the-art 56.9\% NDS on the nuScenes \texttt{test} set, 9.0 points higher than the previous best and on par with the performance of LiDAR-based baselines. We further show that BEVFormer markedly improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at \url{https://github.com/zhiqi-li/BEVFormer}.
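To make the flow described in the abstract concrete, below is a minimal PyTorch sketch of one BEVFormer-style encoder layer. It is an illustrative assumption, not the paper's implementation: the class name and tensor shapes are hypothetical, and plain multi-head attention stands in for the deformable attention (with 3D-to-2D projected reference points) that BEVFormer actually uses. The structure follows the abstract: grid-shaped BEV queries first fuse the previous frame's BEV via temporal self-attention, then gather multi-camera image features via spatial cross-attention.

```python
import torch
import torch.nn as nn


class BEVFormerLayerSketch(nn.Module):
    """Hypothetical sketch of one BEVFormer encoder layer.

    The real model uses deformable attention driven by projected 3D
    reference points; standard multi-head attention is substituted
    here purely for brevity.
    """

    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim * 4, embed_dim),
        )
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, bev_queries, prev_bev, cam_features):
        # Temporal self-attention: recurrently fuse history BEV.
        # Keys/values stack the current queries with the previous BEV,
        # so each query also attends to its own current state.
        kv = torch.cat([bev_queries, prev_bev], dim=1)
        q = self.norm1(bev_queries + self.temporal_attn(bev_queries, kv, kv)[0])

        # Spatial cross-attention: each BEV query aggregates image
        # features; camera views are flattened into one token sequence.
        q = self.norm2(q + self.spatial_attn(q, cam_features, cam_features)[0])
        return self.norm3(q + self.ffn(q))


# Toy shapes (assumed): a 50x50 BEV grid, 6 cameras x 300 feature tokens.
bev_h = bev_w = 50
dim = 256
queries = torch.randn(1, bev_h * bev_w, dim)   # grid-shaped BEV queries
prev_bev = torch.randn(1, bev_h * bev_w, dim)  # BEV from the previous timestep
cams = torch.randn(1, 6 * 300, dim)            # flattened multi-camera features
out = BEVFormerLayerSketch(dim)(queries, prev_bev, cams)
print(out.shape)  # torch.Size([1, 2500, 256])
```

Passing each frame's output BEV back in as `prev_bev` for the next frame mirrors the recurrent fusion described in the abstract: temporal cues propagate forward without storing a long history of past frames.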
