Paper Title
Decoupling Features in Hierarchical Propagation for Video Object Segmentation
Paper Authors
Paper Abstract
This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently-developed Associating Objects with Transformers (AOT) approach introduces hierarchical propagation into VOS and has shown promising results. The hierarchical propagation can gradually propagate information from past frames to the current frame and transfer the current frame feature from object-agnostic to object-specific. However, the increase of object-specific information will inevitably lead to the loss of object-agnostic visual information in deep propagation layers. To solve such a problem and further facilitate the learning of visual embeddings, this paper proposes a Decoupling Features in Hierarchical Propagation (DeAOT) approach. Firstly, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Secondly, to compensate for the additional computation from dual-branch propagation, we propose an efficient module for constructing hierarchical propagation, i.e., Gated Propagation Module, which is carefully designed with single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT can achieve 86.0% at 22.4fps and 82.0% at 53.4fps. Without test-time augmentations, we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622). Project page: https://github.com/z-x-yang/AOT.
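The sketch below is a minimal, illustrative rendering of the idea summarized in the abstract, not the authors' implementation (see the project page for that): a single-head attention block modulated by a learned gate, used in two decoupled branches, one propagating object-agnostic visual features and one propagating object-specific ID embeddings. All class names, parameter names, and the exact way the two branches share information here are assumptions made for illustration.

```python
# Hypothetical sketch of dual-branch gated propagation (not the official DeAOT code).
import torch
import torch.nn as nn


class GatedPropagation(nn.Module):
    """Single-head attention whose output is modulated by a sigmoid gate."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)   # gating branch, sigmoid-activated
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, key, value):
        # query: (B, Nq, C) current-frame tokens; key/value: (B, Nk, C) memory tokens
        attn = (self.to_q(query) @ self.to_k(key).transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ self.to_v(value)
        out = out * torch.sigmoid(self.gate(query))  # gate suppresses irrelevant propagation
        return self.proj(out)


class DualBranchLayer(nn.Module):
    """One propagation layer with decoupled visual and ID branches.

    Both branches match the current frame against memory using visual
    (object-agnostic) features; the visual branch reads back visual features,
    while the ID branch reads back object-specific ID embeddings, so the two
    kinds of information are propagated without mixing in one set of tokens.
    """

    def __init__(self, dim):
        super().__init__()
        self.visual_prop = GatedPropagation(dim)
        self.id_prop = GatedPropagation(dim)

    def forward(self, vis_feat, id_feat, mem_vis, mem_id):
        vis_out = vis_feat + self.visual_prop(vis_feat, mem_vis, mem_vis)
        id_out = id_feat + self.id_prop(vis_feat, mem_vis, mem_id)  # match on vision, read ID
        return vis_out, id_out


if __name__ == "__main__":
    layer = DualBranchLayer(dim=256)
    cur_vis = torch.randn(1, 1024, 256)   # current-frame visual tokens
    cur_id = torch.zeros(1, 1024, 256)    # ID embedding, initially empty
    mem_vis = torch.randn(1, 2048, 256)   # memory-frame visual tokens
    mem_id = torch.randn(1, 2048, 256)    # memory-frame ID embeddings
    new_vis, new_id = layer(cur_vis, cur_id, mem_vis, mem_id)
    print(new_vis.shape, new_id.shape)    # both torch.Size([1, 1024, 256])
```

Stacking several such layers would correspond to the hierarchical propagation described above, with the dual branches preventing the growing object-specific signal from overwriting the object-agnostic visual embedding, and the single-head gated design keeping the extra branch affordable.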