Paper Title
Knowledge Integration Networks for Action Recognition
Paper Authors
Paper Abstract
In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features which are of great importance for identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode human and scene knowledge for action recognition. We explore two pre-trained models as teacher networks to distill human and scene knowledge for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism which contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework where the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale action recognition benchmark Kinetics-400, with a top-1 accuracy of 77.8%. We further demonstrate the strong transfer capability of KINet by transferring the Kinetics-trained model to UCF-101, where it obtains a top-1 accuracy of 97.8%.
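To make the CBI idea concrete, here is a minimal PyTorch sketch of injecting an auxiliary branch's mid-level feature map into the main action branch. The abstract only states that auxiliary knowledge is encoded into medium-level convolutional features; the 1x1-convolution projection with residual addition used below, as well as the class name CrossBranchIntegration and its arguments, are illustrative assumptions rather than the published KINet design.

```python
import torch
import torch.nn as nn

class CrossBranchIntegration(nn.Module):
    """Hypothetical CBI-style fusion: projects an auxiliary branch's
    mid-level feature map to the main branch's channel width and adds
    it as a residual. The exact KINet module may differ."""

    def __init__(self, main_channels: int, aux_channels: int):
        super().__init__()
        # Project auxiliary features to the main branch's channel width.
        self.project = nn.Conv2d(aux_channels, main_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(main_channels)

    def forward(self, main_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        # Residual fusion: main features plus projected auxiliary knowledge.
        return main_feat + self.norm(self.project(aux_feat))


# Usage: fuse a scene-branch feature map into the action branch.
main_feat = torch.randn(2, 256, 14, 14)  # action branch, N x C x H x W
aux_feat = torch.randn(2, 128, 14, 14)   # auxiliary (e.g. scene) branch
cbi = CrossBranchIntegration(main_channels=256, aux_channels=128)
fused = cbi(main_feat, aux_feat)         # shape (2, 256, 14, 14)
```

The residual form keeps the main branch's features intact while letting the auxiliary signal modulate them, which is one common way to fuse features across branches without destabilizing the primary task.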
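Similarly, a sketch of AKG-style high-level fusion, under the assumption that the global features of the three branches (action, human, scene) act as graph nodes mixed by a learned, softmax-normalized adjacency. The attention-based graph construction and the name ActionKnowledgeGraph are assumptions for illustration; the paper's actual graph design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionKnowledgeGraph(nn.Module):
    """Hypothetical AKG-style fusion: one round of attention-weighted
    message passing over per-branch global feature vectors, followed by
    a residual update."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, num_nodes, dim) -- one feature vector per branch.
        q, k = self.query(nodes), self.key(nodes)
        # Learned adjacency over the branch nodes.
        adj = F.softmax(q @ k.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        # Propagate context between nodes, then apply a residual update.
        return nodes + F.relu(self.update(adj @ nodes))


# Usage: fuse action / human / scene global features.
feats = torch.randn(2, 3, 512)  # N x 3 nodes x dim
akg = ActionKnowledgeGraph(dim=512)
fused = akg(feats)              # (2, 3, 512); e.g. action node at index 0
```

After propagation, the action node's feature vector carries context from the human and scene nodes and could feed the final classifier, which matches the abstract's description of fusing high-level context information.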