Paper Title

Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Authors

Kevin Qinghong Lin, Alex Jinpeng Wang, Rui Yan, Eric Zhongcong Xu, Rongcheng Tu, Yanru Zhu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Wei Liu, Mike Zheng Shou

Abstract

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP along three axes: the pretraining dataset, the pretraining objective, and the development set. Building on these three designs, we develop a pretrained video-language model that transfers its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to fine-tune the model effectively, and adopt the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set, with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.
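To make the fine-tuning objective concrete, below is a minimal PyTorch sketch of an adaptive multi-instance max-margin loss in the spirit described by the abstract; it is not the authors' exact formulation (see the paper and the EgoVLP repository for that). The function names and arguments (`sim`, a batch video-to-text similarity matrix, and `relevance`, a soft relevance matrix such as the MIR verb/noun overlap score) are assumptions for illustration. The "adaptive" part is that the hinge margin between a more-relevant and a less-relevant caption grows with their relevance gap, and every caption with nonzero relevance is treated as a (soft) positive rather than only the diagonal pair.

```python
import torch


def adaptive_mi_max_margin_loss(sim: torch.Tensor,
                                relevance: torch.Tensor,
                                base_margin: float = 0.2) -> torch.Tensor:
    """Sketch of an adaptive multi-instance max-margin loss (video-to-text).

    sim:       [B, B] similarity scores between batch videos and captions.
    relevance: [B, B] soft relevance in [0, 1], with 1.0 on the diagonal.
    """
    pos = sim.unsqueeze(2)  # [B, B, 1]: score of candidate positive j for video i
    neg = sim.unsqueeze(1)  # [B, 1, B]: score of candidate negative k for video i
    # Relevance gap between positive j and negative k; only j more relevant
    # than k forms a valid triplet, and the margin scales with the gap.
    gap = (relevance.unsqueeze(2) - relevance.unsqueeze(1)).clamp(min=0)  # [B, B, B]
    hinge = (base_margin * gap + neg - pos).clamp(min=0)
    valid = gap > 0
    return hinge[valid].mean() if valid.any() else sim.new_zeros(())
```

A symmetric text-to-video term (the same computation on `sim.t()` and `relevance.t()`) would normally be added, as is standard for retrieval losses.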
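The dual-softmax technique mentioned for inference re-weights the raw similarity matrix with a softmax prior over the gallery axis before ranking, which suppresses captions that match many videos equally well. A minimal sketch follows, assuming a [num_videos, num_texts] similarity matrix; the temperature value is an assumption, not taken from the report.

```python
import torch
import torch.nn.functional as F


def dual_softmax(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """Revise a video-to-text similarity matrix with the dual-softmax trick.

    sim: [num_videos, num_texts] raw (e.g. cosine) similarities.
    """
    prior = F.softmax(sim * temperature, dim=0)  # softmax over the video axis
    return prior * sim                           # element-wise re-weighting
```

At test time, texts for each video (and, transposed, videos for each text) are ranked by the revised scores instead of the raw ones.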
