Paper Title


Learning Imbalanced Data with Vision Transformers

Authors

Zhengzhuo Xu, Ruikang Liu, Shuo Yang, Zenghao Chai, Chun Yuan

Abstract


Real-world data tends to be heavily imbalanced and to severely skew data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massively challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, and the off-the-shelf pretrained weights of ViTs always lead to unfair comparisons. In this paper, we systematically investigate the performance of ViTs in LTR and propose LiVT to train ViTs from scratch only with LT data. Based on the observation that ViTs suffer more severely from LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised pretraining. In addition, Binary Cross-Entropy (BCE) loss, which shows conspicuous performance with ViTs, encounters predicaments in LTR. We further propose balanced BCE to ameliorate it, with strong theoretical grounding. Specifically, we derive the unbiased extension of the sigmoid and compensate with extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs without any additional data and significantly outperforms comparable state-of-the-art methods, e.g., our ViT-B achieves 81.0% Top-1 accuracy on iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
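The logit-margin compensation mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general idea of adding class-prior margins to the logits before a sigmoid BCE loss, not the paper's exact Bal-BCE derivation; the function name `balanced_bce_logits`, the `class_counts` input, and the plain log-prior margin are assumptions for illustration.

```python
import numpy as np

def balanced_bce_logits(logits, targets, class_counts):
    """Sigmoid BCE with per-class logit margins (illustrative sketch).

    Each class logit is shifted by the log-prior of that class before
    the sigmoid, so that frequent (head) classes must clear a larger
    margin than rare (tail) classes. The exact compensation term in
    the paper's Bal-BCE may differ from this simple log-prior shift.
    """
    logits = np.asarray(logits, dtype=float)    # shape (batch, classes)
    targets = np.asarray(targets, dtype=float)  # multi-hot labels
    counts = np.asarray(class_counts, dtype=float)

    # Log-prior of each class over the training set.
    log_prior = np.log(counts / counts.sum())

    # Add the margin to every logit (broadcast over the batch axis).
    z = logits + log_prior

    # Numerically stable sigmoid BCE:
    # loss = max(z, 0) - z * t + log(1 + exp(-|z|))
    loss = np.maximum(z, 0) - z * targets + np.log1p(np.exp(-np.abs(z)))
    return loss.mean()
```

With a long-tailed count vector such as `[900, 90, 10]`, the head class receives a margin of roughly `log(0.9)` while the tail class receives `log(0.01)`, pushing the model to produce larger raw logits for tail classes.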
