Paper Title

Objects as Spatio-Temporal 2.5D points

Authors

Paridhi Singh, Gaurav Singh, Arun Kumar

Abstract

Determining accurate bird's eye view (BEV) positions of objects and tracks in a scene is vital for various perception tasks, including object interaction mapping and scenario extraction; however, the level of supervision required to accomplish this is extremely challenging to procure. We propose a lightweight, weakly supervised method to estimate the 3D positions of objects by jointly learning to regress 2D object detections and the scene's depth in a single feed-forward pass of a network. Our method extends a center-point based single-shot object detector and introduces a novel object representation in which each object is modeled spatio-temporally as a BEV point, without the need for any 3D or BEV annotations for training or LiDAR data at query time. The approach leverages readily available 2D object supervision along with LiDAR point clouds (used only during training) to jointly train a single network that learns to predict 2D object detections alongside the whole scene's depth, spatio-temporally modeling object tracks as points in BEV. The proposed method is over $\sim$10x more computationally efficient than recent SOTA approaches while achieving comparable accuracy on the KITTI tracking benchmark.
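The core idea of lifting a 2D detection to a BEV point can be illustrated with a short sketch: given a detected object's 2D center and a predicted depth, a pinhole back-projection through the camera intrinsics yields a ground-plane position. The code below is a minimal, hypothetical example; the function name, intrinsics values, and the use of a simple pinhole model are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def center_depth_to_bev(u, v, depth, fx, fy, cx, cy):
    """Back-project a 2D detection center (u, v) with a predicted depth
    into a BEV point (x, z) in the camera frame, assuming a pinhole model.

    Illustrative sketch only (not the paper's code): x points right,
    z points forward along the optical axis; the BEV plane is (x, z).
    """
    x = (u - cx) * depth / fx   # lateral offset from the optical axis
    y = (v - cy) * depth / fy   # height above/below the axis (unused in BEV)
    z = depth                   # forward range
    return np.array([x, z])

if __name__ == "__main__":
    # Hypothetical KITTI-like intrinsics and a single detection center.
    fx, fy, cx, cy = 721.5, 721.5, 609.6, 172.9
    bev_point = center_depth_to_bev(u=650.0, v=185.0, depth=12.3,
                                    fx=fx, fy=fy, cx=cx, cy=cy)
    print(bev_point)  # approximate (x, z) location of the object in BEV
```

Repeating this per frame for each tracked detection gives the spatio-temporal sequence of BEV points that the method models as an object track.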
