Paper Title

SIRI: Spatial Relation Induced Network For Spatial Description Resolution

Paper Authors

Peiyao Wang, Weixin Luo, Yanyu Xu, Haojie Li, Shugong Xu, Jianyu Yang, Shenghua Gao

Paper Abstract

Spatial Description Resolution, as a language-guided localization task, is proposed for target location in a panoramic street view, given corresponding language descriptions. Explicitly characterizing object-level relationships while distilling spatial relationships is currently absent but crucial to this task. Mimicking humans, who sequentially traverse spatial relationship words and objects with a first-person view to locate their target, we propose a novel Spatial Relation Induced (SIRI) network. Specifically, visual features are first correlated at an implicit object level in a projected latent space; they are then distilled by each spatial relationship word, resulting in differently activated features, each representing one spatial relationship. Further, we introduce global position priors to fix the absence of positional information, which may otherwise cause ambiguities in global positional reasoning. The linguistic and visual features are concatenated to finalize the target localization. Experimental results on the Touchdown dataset show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius. Our method also generalizes well to our proposed extended dataset, collected using the same settings as Touchdown.
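The abstract outlines the pipeline only at a high level. The following is a minimal PyTorch sketch of that flow, not the authors' released code: the module names, feature dimensions (512-dim visual features, 300-dim word embeddings), the shared relation block, and the coordinate-channel form of the position prior are all illustrative assumptions.

```python
# Hedged sketch of the pipeline described in the abstract; shapes and dims are assumptions.
import torch
import torch.nn as nn

class SpatialRelationBlock(nn.Module):
    """Correlates visual features at an implicit object level in a projected latent
    space, then distills them with one spatial-relation-word embedding."""
    def __init__(self, vis_dim=512, lang_dim=300, latent_dim=256):
        super().__init__()
        self.query = nn.Conv2d(vis_dim, latent_dim, 1)  # projection into the latent space
        self.key = nn.Conv2d(vis_dim, latent_dim, 1)
        self.value = nn.Conv2d(vis_dim, vis_dim, 1)
        self.word_gate = nn.Linear(lang_dim, vis_dim)   # per-relation-word channel gate

    def forward(self, vis_feat, relation_word_emb):
        # vis_feat: (B, C, H, W); relation_word_emb: (B, lang_dim)
        b, c, h, w = vis_feat.shape
        q = self.query(vis_feat).flatten(2)             # (B, D, HW)
        k = self.key(vis_feat).flatten(2)               # (B, D, HW)
        v = self.value(vis_feat).flatten(2)             # (B, C, HW)
        # Implicit object-level correlation: pairwise affinity between spatial positions.
        affinity = torch.softmax(q.transpose(1, 2) @ k / q.shape[1] ** 0.5, dim=-1)  # (B, HW, HW)
        correlated = (v @ affinity.transpose(1, 2)).view(b, c, h, w)
        # Distillation by the spatial-relation word: each word activates the map differently.
        gate = torch.sigmoid(self.word_gate(relation_word_emb)).view(b, c, 1, 1)
        return correlated * gate

class SIRISketch(nn.Module):
    """Relation-word-distilled visual features + global position prior + linguistic
    features, concatenated and decoded into a localization heatmap."""
    def __init__(self, vis_dim=512, lang_dim=300):
        super().__init__()
        self.relation_block = SpatialRelationBlock(vis_dim, lang_dim)
        # +2 channels for normalized (x, y) coordinates; +lang_dim for the sentence feature.
        self.head = nn.Sequential(
            nn.Conv2d(vis_dim + 2 + lang_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1),                       # per-pixel localization logit
        )

    def forward(self, vis_feat, relation_word_embs, sentence_emb):
        # relation_word_embs: list of (B, lang_dim) embeddings, traversed sequentially.
        x = vis_feat
        for w_emb in relation_word_embs:
            x = self.relation_block(x, w_emb)
        b, _, h, w = x.shape
        # Global position prior: normalized coordinate channels supply positional information.
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        sent = sentence_emb.view(b, -1, 1, 1).expand(b, sentence_emb.shape[1], h, w)
        fused = torch.cat([x, ys, xs, sent], dim=1)     # concatenate vision, prior, language
        return self.head(fused)                         # (B, 1, H, W) target heatmap
```

As a quick sanity check of the sketch, `SIRISketch()(torch.randn(2, 512, 32, 64), [torch.randn(2, 300), torch.randn(2, 300)], torch.randn(2, 300))` returns a `(2, 1, 32, 64)` heatmap, whose peak would be read out as the predicted target location.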
