Paper Title

ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention

Authors

Jose Manuel Gomez-Perez, Raul Ortega

Abstract


Textbook Question Answering is a complex task in the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails. Rather than training a language-visual transformer from scratch we rely on pre-trained transformers, fine-tuning and ensembling. We add bottom-up and top-down attention to identify regions of interest corresponding to diagram constituents and their relationships, improving the selection of relevant visual information for each question and answer options. Our system ISAAQ reports unprecedented success in all TQA question types, with accuracies of 81.36%, 71.11% and 55.12% on true/false, text-only and diagram multiple choice questions. ISAAQ also demonstrates its broad applicability, obtaining state-of-the-art results in other demanding datasets.
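The top-down step described above can be pictured as weighting detected diagram regions by their relevance to each question and answer option. The following is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the dot-product scorer, and the softmax pooling are all assumptions made for the example.

```python
import numpy as np

def top_down_attention(region_feats: np.ndarray, query_vec: np.ndarray) -> np.ndarray:
    """Pool region-of-interest features into one visual context vector,
    weighted by relevance to a question/answer embedding.

    Illustrative sketch only: the dot-product relevance score and softmax
    pooling are assumptions, not ISAAQ's exact formulation.
    """
    scores = region_feats @ query_vec          # one relevance score per region
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ region_feats              # attention-weighted summary

# Hypothetical shapes: 36 detected regions, 128-dim features.
rng = np.random.default_rng(0)
regions = rng.random((36, 128))    # bottom-up: features of detected regions
question = rng.random(128)         # embedding of a question + answer option
context = top_down_attention(regions, question)
```

The resulting `context` vector summarizes the diagram from the perspective of one question/answer pair and could then be combined with the transformer's text representation for answer scoring.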
