Paper Title

PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

Paper Authors

Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson

Paper Abstract

Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets. Project page: \url{https://ofir1080.github.io/PromptonomyViT}
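The abstract describes prepending a small set of learnable "task prompts" to the token sequence of a shared transformer backbone, with each prompt feeding a task-specific head that predicts that task's annotations. The following is a minimal, hypothetical sketch of that idea in plain Python; the toy `shared_backbone`, the task names, and the per-task heads are illustrative stand-ins, not the paper's actual architecture.

```python
# Hypothetical sketch of multi-task prompt learning ("Promptonomy"-style):
# one prompt token per task joins the input sequence, a shared backbone
# processes everything, and each task head reads its own prompt's output.

def shared_backbone(tokens):
    # Stand-in for a shared video transformer: a toy elementwise transform
    # so the example stays self-contained and runnable.
    return [[2.0 * v for v in tok] for tok in tokens]

def forward(video_tokens, task_prompts, heads):
    """Prepend one prompt token per task, run the shared backbone,
    then let each task head predict from its prompt's output slot."""
    tokens = task_prompts + video_tokens          # prompts join the sequence
    out = shared_backbone(tokens)
    # Task i's head reads the i-th output position (its own prompt).
    return {name: heads[name](out[i]) for i, name in enumerate(heads)}

# Toy usage: two assumed synthetic-scene tasks plus two video patch tokens.
video_tokens = [[0.1, 0.2], [0.3, 0.4]]
task_prompts = [[1.0, 0.0], [0.0, 1.0]]           # one learnable prompt per task
heads = {
    "depth":   lambda tok: sum(tok),              # toy per-task prediction heads
    "objects": lambda tok: max(tok),
}
preds = forward(video_tokens, task_prompts, heads)
```

Because the prompts pass through every layer of the shared backbone alongside the video tokens, information can flow both among the synthetic-scene tasks and between those tasks and the real downstream task, which is the sharing mechanism the abstract emphasizes.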
