Paper Title

Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding

Authors

Thierry Deruyttere, Guillem Collell, Marie-Francine Moens

Abstract

We propose a new spatial memory module and a spatial reasoner for the Visual Grounding (VG) task. The goal of this task is to find a certain object in an image based on a given textual query. Our work focuses on integrating the regions of a Region Proposal Network (RPN) into a new multi-step reasoning model, which we have named the Multimodal Spatial Region Reasoner (MSRR). The introduced model uses the object regions from an RPN to initialize a 2D spatial memory and then implements a multi-step reasoning process that scores each region according to the query, which is why we call it a multimodal reasoner. We evaluate this new model on challenging datasets, and our experiments show that our model, which jointly reasons over the object regions of the image and the words of the query, largely improves accuracy compared to current state-of-the-art models.
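The core loop described in the abstract (RPN regions initialize a spatial memory, which is then refined over several steps against the query words) can be sketched roughly as follows. This is a minimal illustrative toy, not the paper's actual MSRR architecture: the function name, the softmax word attention, and the simple averaging update rule are all assumptions made for the sketch.

```python
import numpy as np

def msrr_scores(region_feats, query_embs, steps=3):
    """Toy multi-step region scorer, loosely inspired by the MSRR idea.

    region_feats: (R, D) array of R object-region features from an RPN.
    query_embs:   (T, D) array of embeddings for the T query words.
    Returns a (R,) array with one relevance score per region.
    """
    # Spatial memory initialized with the RPN object regions.
    memory = region_feats.copy()
    for _ in range(steps):
        # Attend each region to the query words (softmax over words).
        attn = memory @ query_embs.T                        # (R, T) affinities
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        context = attn @ query_embs                         # (R, D) per-region query context
        # Illustrative memory update: blend old memory with query context.
        memory = 0.5 * memory + 0.5 * context
    # Score each region by agreement between refined memory and its features.
    return (memory * region_feats).sum(axis=1)
```

In use, the region with the highest score would be returned as the grounded object, e.g. `int(np.argmax(msrr_scores(regions, words)))`.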
