Paper Title
End-to-End Transformer Based Model for Image Captioning

Authors

Yiyu Wang, Jungang Xu, Yingfei Sun

Abstract

CNN-LSTM based architectures have played an important role in image captioning, but, limited by training efficiency and expressive ability, researchers began to explore CNN-Transformer based models and achieved great success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone encoder to extract region-level features from given images. However, Faster R-CNN requires pre-training on an additional dataset, which divides the image captioning task into two stages and limits its potential applications. In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and realizes end-to-end training. First, we adopt SwinTransformer in place of Faster R-CNN as the backbone encoder to extract grid-level features from given images. Then, following the Transformer architecture, we build a refining encoder and a decoder: the refining encoder refines the grid features by capturing the intra-relationships between them, and the decoder decodes the refined features into captions word by word. Furthermore, to increase the interaction between multi-modal (vision and language) features and enhance modeling capability, we compute the mean pooling of the grid features as a global feature, introduce it into the refining encoder to be refined together with the grid features, and add a pre-fusion process between the refined global feature and the generated words in the decoder. To validate the effectiveness of our proposed model, we conduct experiments on the MSCOCO dataset. Compared with existing published works, the experimental results demonstrate that our model achieves new state-of-the-art CIDEr scores of 138.2% (single model) and 141.0% (ensemble of 4 models) on the 'Karpathy' offline test split, and 136.0% (c5) and 138.3% (c40) on the official online test server. Trained models and source code will be released.
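The global-feature idea in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: the class name, feature dimension (512), and grid count (12×12 = 144) are assumptions chosen for clarity. It shows the mean pooling of grid features into a global feature, which is then prepended as an extra token so the refining encoder refines it together with the grid features.

```python
import torch
import torch.nn as nn

class GridWithGlobalRefiner(nn.Module):
    """Illustrative refining encoder: refines grid features jointly with a
    global feature obtained by mean-pooling the grids (names/dims assumed)."""

    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, grid_feats):
        # grid_feats: (batch, num_grids, d_model), e.g. 144 grid features per image
        global_feat = grid_feats.mean(dim=1, keepdim=True)   # (batch, 1, d_model)
        x = torch.cat([global_feat, grid_feats], dim=1)      # prepend global token
        x = self.encoder(x)                                  # joint refinement
        refined_global, refined_grids = x[:, :1], x[:, 1:]
        return refined_global, refined_grids

# Dummy backbone output: batch of 2 images, 12x12 grids, 512-d features
feats = torch.randn(2, 144, 512)
g, r = GridWithGlobalRefiner()(feats)
# g: (2, 1, 512) refined global feature; r: (2, 144, 512) refined grid features
```

In this sketch the refined global feature `g` is what would feed the decoder's pre-fusion step with the generated words, while `r` serves as the cross-attention memory.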
