Paper Title

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Paper Authors

Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu

Paper Abstract

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pretraining mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representation from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependency while reducing computational cost in Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performances. Specifically, our model achieves 16.1% relative improvement on ActivityNet paragraph-to-video retrieval task and 2.4% on How2QA task, respectively. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
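
To make the MTC idea above concrete, here is a minimal sketch of a fine-grained temporal contrastive (InfoNCE-style) loss between clip-level video embeddings and sentence-level text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions; the exact LF-VILA formulation is given in the paper and the released code.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_emb, sent_emb, temperature=0.07):
    """InfoNCE-style loss over temporally aligned clip/sentence pairs (sketch).

    clip_emb: (B, T, D) embeddings of T clips per long-form video
    sent_emb: (B, T, D) embeddings of the T aligned sentences per paragraph
    Aligned (video i, step t) pairs are positives; every other clip/sentence
    pair in the batch is treated as a negative.
    """
    B, T, D = clip_emb.shape
    v = F.normalize(clip_emb.reshape(B * T, D), dim=-1)
    s = F.normalize(sent_emb.reshape(B * T, D), dim=-1)

    logits = v @ s.t() / temperature                  # (B*T, B*T) similarities
    targets = torch.arange(B * T, device=logits.device)

    # Symmetric objective: video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 2 videos, 4 clips/sentences each, 256-d embeddings.
loss = temporal_contrastive_loss(torch.randn(2, 4, 256), torch.randn(2, 4, 256))
```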
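
Similarly, a hedged sketch of temporal window attention: self-attention is computed only within fixed-size temporal windows, and stacking layers with larger windows at higher stages is one way to approximate the hierarchical scheme. The class name, window sizes, and use of nn.MultiheadAttention are assumptions for illustration, not the exact HTWA implementation.

```python
import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping temporal windows (sketch).

    Stacking such layers with growing window sizes (small windows in early
    stages, larger ones later) reduces attention cost while still propagating
    long-range information; treat this as a sketch, not the exact HTWA design.
    """
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, D), T % window_size == 0
        B, T, D = x.shape
        w = self.window_size
        x = x.reshape(B * T // w, w, D)          # split the time axis into windows
        out, _ = self.attn(x, x, x)              # attend only within each window
        return out.reshape(B, T, D)

# Example: 32 frame tokens with a window of 8, so each token attends to 8 others.
y = TemporalWindowAttention(dim=64, num_heads=4, window_size=8)(torch.randn(2, 32, 64))
```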
