Paper Title
Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Paper Authors
Paper Abstract
In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. In particular, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP in terms of pretraining dataset, pretraining objective, and development set. Based on these three designs, we develop a pretrained video-language model that is able to transfer its egocentric video-text representation to the MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to effectively fine-tune the model and adopt the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.
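The dual-softmax idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; it is a common formulation of dual-softmax re-scoring for retrieval, where the raw video-text similarity matrix is normalized with a softmax along each retrieval direction and the two are combined, so a pair scores highly only when each side also ranks the other highly. The function names and the `temperature` parameter here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax_scores(sim, temperature=0.05):
    """Re-score a similarity matrix before ranking (illustrative sketch).

    sim: (n_videos, n_texts) matrix of raw (e.g. cosine) similarities.
    Returns revised scores: the elementwise product of the
    video-to-text softmax (over text candidates) and the
    text-to-video softmax (over video candidates).
    """
    v2t = softmax(sim / temperature, axis=1)  # normalize over texts
    t2v = softmax(sim / temperature, axis=0)  # normalize over videos
    return v2t * t2v
```

Ranking by the revised scores instead of the raw similarities tends to suppress "hub" candidates that are close to many queries at once, since such candidates receive low probability in the reverse-direction softmax.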