Paper Title
Deformable VisTR: Spatio temporal deformable attention for video instance segmentation
Paper Authors
Paper Abstract
Video instance segmentation (VIS) requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR was proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points around a reference point. This enables Deformable VisTR to achieve computation linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR with 10$\times$ fewer GPU training hours. We validate the effectiveness of our method on the YouTube-VIS benchmark. Code is available at https://github.com/skrya/DefVIS.
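To make the abstract's key idea concrete, below is a minimal, single-scale PyTorch sketch of a spatio-temporal deformable attention module: each query predicts a few 2D offsets around a shared reference point for every head and frame, samples the value map only at those locations, and mixes them with learned weights. The class name, layer names, and hyperparameters (`n_frames`, `n_points`, etc.) are illustrative assumptions, not the paper's actual API; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalDeformableAttention(nn.Module):
    """Minimal single-scale sketch (illustrative, not the paper's code).

    Each query attends to only `n_points` sampled locations per head and
    per frame, predicted as offsets from a 2D reference point, instead of
    attending densely over all T*H*W positions.
    """

    def __init__(self, d_model=256, n_heads=8, n_frames=4, n_points=4):
        super().__init__()
        self.n_heads, self.n_frames, self.n_points = n_heads, n_frames, n_points
        self.head_dim = d_model // n_heads
        # Per query: (x, y) offsets and a weight for every head/frame/point.
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_frames * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_frames * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, reference_points, value, spatial_shape):
        """query: (B, Lq, C); reference_points: (B, Lq, 2), (x, y) in [0, 1];
        value: (B, T, H, W, C); spatial_shape: (H, W)."""
        B, Lq, C = query.shape
        T, (H, W) = self.n_frames, spatial_shape
        v = self.value_proj(value).view(B, T, H, W, self.n_heads, self.head_dim)
        # -> (B*T*heads, head_dim, H, W) so grid_sample can fetch per-head values.
        v = v.permute(0, 1, 4, 5, 2, 3).reshape(B * T * self.n_heads, self.head_dim, H, W)

        offsets = self.sampling_offsets(query).view(B, Lq, self.n_heads, T, self.n_points, 2)
        weights = self.attention_weights(query).view(B, Lq, self.n_heads, T * self.n_points)
        # Softmax jointly over frames and points, as in Deformable DETR's
        # joint softmax over levels and points.
        weights = weights.softmax(-1).view(B, Lq, self.n_heads, T, self.n_points)

        # Sampling locations = shared reference point + predicted offsets,
        # rescaled from [0, 1] image coordinates to grid_sample's [-1, 1].
        ref = reference_points[:, :, None, None, None, :]            # (B,Lq,1,1,1,2)
        norm = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = 2 * (ref + offsets / norm) - 1                         # (B,Lq,h,T,p,2)

        loc = loc.permute(0, 3, 2, 1, 4, 5).reshape(B * T * self.n_heads, Lq, self.n_points, 2)
        sampled = F.grid_sample(v, loc, mode='bilinear', align_corners=False)
        # sampled: (B*T*heads, head_dim, Lq, n_points)
        sampled = sampled.view(B, T, self.n_heads, self.head_dim, Lq, self.n_points)

        w = weights.permute(0, 3, 2, 1, 4)[:, :, :, None]            # (B,T,h,1,Lq,p)
        out = (sampled * w).sum(dim=(1, 5))                          # (B,h,head_dim,Lq)
        out = out.permute(0, 3, 1, 2).reshape(B, Lq, C)
        return self.output_proj(out)
```

Note the complexity: per query, the attention cost depends only on `n_heads * n_frames * n_points`, not on H$\times$W, so the overall cost grows linearly with the size of the spatio-temporal feature map, which is the source of the training-time savings the abstract claims.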