Paper Title
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Paper Authors
Paper Abstract
Generating images from textual descriptions has recently attracted a lot of interest. While current models can generate photo-realistic images of individual objects such as birds and human faces, synthesizing images with multiple objects is still very difficult. In this paper, we propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment of generated images by leveraging the VQA 2.0 dataset. We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal. We encourage images generated from QA pairs to look realistic and additionally minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline, which indicates that T2I synthesis can successfully be improved using a standard VQA model.
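The abstract names two concrete mechanisms: concatenating VQA 2.0 question-answer pairs into extra training captions, and using a pre-trained VQA model as an auxiliary loss on images generated from those pairs. The sketch below illustrates both under assumed interfaces; `qa_to_caption`, `generator`, `text_encoder`, `vqa_model`, and `lambda_vqa` are hypothetical names for illustration, not the authors' actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def qa_to_caption(question: str, answer: str) -> str:
    # Build a pseudo-caption from a VQA 2.0 QA pair by simple concatenation,
    # e.g. ("what color is the bus?", "red") -> "what color is the bus? red".
    return f"{question} {answer}"

def vqa_auxiliary_loss(
    generator: nn.Module,      # assumed conditional generator: G(z, text_embedding)
    text_encoder: nn.Module,   # assumed text encoder for captions / pseudo-captions
    vqa_model: nn.Module,      # frozen pre-trained VQA model: (image, question) -> answer logits
    questions: list,           # batch of question strings
    answers: list,             # batch of ground-truth answer strings
    answer_ids: torch.Tensor,  # ground-truth answer class indices, shape (B,)
    z: torch.Tensor,           # noise batch, shape (B, z_dim)
) -> torch.Tensor:
    captions = [qa_to_caption(q, a) for q, a in zip(questions, answers)]
    cond = text_encoder(captions)     # condition the generator on the concatenated QA pair
    fake_images = generator(z, cond)  # these images should depict the QA content

    # The VQA model's weights stay frozen (requires_grad_(False), set once at setup),
    # but gradients still flow through the generated images back into the generator.
    logits = vqa_model(fake_images, questions)
    return F.cross_entropy(logits, answer_ids)

# Sketch of the generator update:
#   g_loss = adversarial_loss + lambda_vqa * vqa_auxiliary_loss(...)
# where lambda_vqa weights the external VQA signal against the usual GAN loss,
# so QA-generated images are pushed to both look realistic and answer the question.
```

Cross-entropy against the ground-truth answer index is one plausible reading of "minimize an external VQA loss"; the paper itself does not specify the exact loss form in the abstract.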