Paper Title
Large-batch Optimization for Dense Visual Predictions
Paper Authors
Paper Abstract
Training a large-scale deep neural network in a large-scale dataset is challenging and time-consuming. The recent breakthrough of large-batch optimization is a promising way to tackle this challenge. However, although the current advanced algorithms such as LARS and LAMB succeed in classification models, the complicated pipelines of dense visual predictions such as object detection and segmentation still suffer from the heavy performance drop in the large-batch training regime. To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch size, enabling several benefits more appealing than prior arts. Firstly, AGVM can align the gradient variances between different modules in the dense visual predictors, such as backbone, feature pyramid network (FPN), detection, and segmentation heads. We show that training with a large batch size can fail with the gradient variances misaligned among them, which is a phenomenon primarily overlooked in previous work. Secondly, AGVM is a plug-and-play module that generalizes well to many different architectures (e.g., CNNs and Transformers) and different tasks (e.g., object detection, instance segmentation, semantic segmentation, and panoptic segmentation). It is also compatible with different optimizers (e.g., SGD and AdamW). Thirdly, a theoretical analysis of AGVM is provided. Extensive experiments on the COCO and ADE20K datasets demonstrate the superiority of AGVM. For example, it can train Faster R-CNN+ResNet50 in 4 minutes without losing performance. AGVM enables training an object detector with one billion parameters in just 3.5 hours, reducing the training time by 20.9x, whilst achieving 62.2 mAP on COCO. The deliverables are released at https://github.com/Sense-X/AGVM.
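The core idea described above can be sketched in a few lines: measure the gradient variance of each module group (backbone, FPN, heads) and rescale gradients so the variances agree. This is a minimal illustrative sketch of that idea only; the function names, the choice of the backbone as reference, and the simple square-root rescaling rule are assumptions for exposition, not the paper's actual algorithm (see the linked repository for the real implementation).

```python
import math

def variance(xs):
    """Population variance of a flat list of gradient values."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def align_gradient_variances(module_grads, ref_key="backbone", eps=1e-12):
    """Toy variance alignment: rescale each module's gradients so their
    variance matches a reference module's (here, the backbone's).

    module_grads: dict mapping module name -> flat list of gradient values.
    Returns a dict with the same keys and rescaled gradients.
    """
    ref_var = variance(module_grads[ref_key])
    out = {}
    for name, grads in module_grads.items():
        v = variance(grads)
        # Multiplying gradients by s scales their variance by s**2,
        # so sqrt(ref_var / v) brings this module's variance to ref_var.
        scale = math.sqrt(ref_var / (v + eps))
        out[name] = [g * scale for g in grads]
    return out

# Hypothetical per-module gradients: the "fpn" gradients are 10x larger,
# i.e. their variance is 100x the backbone's (a misalignment that, per the
# abstract, can make large-batch training fail).
grads = {
    "backbone": [0.1, -0.2, 0.05, 0.15],
    "fpn":      [1.0, -2.0, 0.50, 1.50],
}
aligned = align_gradient_variances(grads)
```

After the call, `variance(aligned["fpn"])` matches `variance(aligned["backbone"])`, mimicking the cross-module variance alignment AGVM performs during training.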