Paper Title

Controllable Augmentations for Video Representation Learning

Paper Authors

Rui Qian, Weiyao Lin, John See, Dian Li

Paper Abstract

This paper focuses on self-supervised video representation learning. Most existing approaches follow the contrastive learning pipeline, constructing positive and negative pairs by sampling different clips. However, this formulation tends to be biased toward static background and has difficulty establishing global temporal structures. The major reason is that the positive pairs, i.e., different clips sampled from the same video, have a limited temporal receptive field and usually share a similar background but differ in motions. To address these problems, we propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations. Based on a set of controllable augmentations, we achieve accurate appearance and motion pattern alignment through soft spatio-temporal region contrast. Our formulation avoids low-level redundancy shortcuts through mutual information minimization, improving generalization. We also introduce local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling. Extensive experiments demonstrate that our framework is superior on three video benchmarks for action recognition and video retrieval, capturing more accurate temporal dynamics.
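As a point of reference, the clip-based contrastive pipeline the abstract critiques can be summarized in a minimal sketch: two clips sampled from the same video form a positive pair, clips from other videos serve as negatives, and an InfoNCE loss pulls positives together. This is a generic baseline sketch, not the paper's method; the encoder name, temperature, and sampling details are illustrative assumptions.

import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z1, z2: (B, D) embeddings of two clips per video; row i of z1 and row i of z2
    # come from the same video (positive pair), all other rows act as negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Usage sketch: encode two differently augmented clips from each video and minimize
# the loss. "VideoEncoder" is a hypothetical 3D backbone, not from the paper.
# clip_a, clip_b: (B, C, T, H, W) tensors sampled at different times of the same videos.
# encoder = VideoEncoder()
# loss = info_nce(encoder(clip_a), encoder(clip_b))

Because the two clips typically share a background while differing in motion, minimizing this loss alone is prone to the static-background bias the abstract describes, which motivates the region-level contrast and local-global temporal order modeling proposed in the paper.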
