Paper Title
Towards Unifying Reference Expression Generation and Comprehension
Paper Authors
Paper Abstract
Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously to exploit the relation between them is a promising way to improve both. However, their distinct inputs, as well as the difficulty of building connections between them in a single model, bring challenges to the design and training of a joint model. To address these problems, we propose a unified model for REG and REC, named UniRef. It unifies the two tasks with a carefully designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via image cross-attention and region cross-attention. Additionally, IRTF can generate pseudo input regions for the REC task, enabling a uniform way to share an identical representation space across REC and REG. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train the UniRef model on multi-granular corpora. VMLM and TRP are directly related to REG and REC, respectively, but can help each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.
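
Illustrative only: below is a minimal PyTorch sketch of what a fusion layer combining text self-attention with separate image and region cross-attention, as described in the abstract, might look like. The class name IRTFLayer, the sub-layer ordering, the pre-norm residual structure, and all dimensions are assumptions made for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn

class IRTFLayer(nn.Module):
    # Hypothetical sketch of an Image-Region-Text Fusion (IRTF) layer:
    # text tokens first attend to themselves, then to image features
    # (image cross-attention), then to region features (region cross-attention).
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.region_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, text, image, region):
        # text:   (B, T, d) text token states
        # image:  (B, I, d) image grid/patch features
        # region: (B, R, d) region features; for REC, a learned pseudo-region
        #         embedding could be fed here instead (an assumption).
        q = self.norms[0](text)
        text = text + self.self_attn(q, q, q, need_weights=False)[0]
        text = text + self.image_xattn(self.norms[1](text), image, image, need_weights=False)[0]
        text = text + self.region_xattn(self.norms[2](text), region, region, need_weights=False)[0]
        text = text + self.ffn(self.norms[3](text))
        return text

# Example usage with random tensors (batch of 2, 768-d features):
layer = IRTFLayer()
out = layer(torch.randn(2, 20, 768),   # 20 text tokens
            torch.randn(2, 196, 768),  # 14x14 image grid
            torch.randn(2, 1, 768))    # one region feature
print(out.shape)  # torch.Size([2, 20, 768])

In this sketch, sharing the same layer for REG (region given, text generated) and REC (text given, region predicted) corresponds to feeding either real region features or a pseudo-region placeholder through the same region cross-attention path, which is one way to read the "identical representation space" claim above.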