Paper Title
Improve SGD Training via Aligning Mini-batches
Paper Authors
Paper Abstract
Deep neural networks (DNNs) for supervised learning can be viewed as a pipeline of a feature extractor (i.e., the last hidden layer) and a linear classifier (i.e., the output layer) that are trained jointly with stochastic gradient descent (SGD). In each iteration of SGD, a mini-batch is sampled from the training data, and the true gradient of the loss function is estimated by the noisy gradient computed on this mini-batch. From the feature-learning perspective, the feature extractor should be updated to learn features that are meaningful with respect to the entire data, rather than accommodating the noise in any single mini-batch. With this motivation, we propose In-Training Distribution Matching (ITDM) to improve DNN training and reduce overfitting. Specifically, along with the loss function, ITDM regularizes the feature extractor by matching the moments of the distributions of different mini-batches in each iteration of SGD, which is achieved by minimizing the maximum mean discrepancy (MMD). As such, ITDM does not assume any explicit parametric form for the data distribution in the latent feature space. Extensive experiments demonstrate the effectiveness of the proposed strategy.
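To make the regularization idea concrete, below is a minimal PyTorch sketch of one training iteration in the spirit of ITDM: the task loss on one mini-batch is combined with an MMD term that matches the feature distributions of two different mini-batches. The Gaussian kernel, the split into `feature_extractor` / `classifier`, and the trade-off weight `lam` are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of the ITDM idea described above (assumptions, not the
# paper's released code): Gaussian-kernel MMD, a fixed trade-off weight `lam`,
# and a model split into `feature_extractor` / `classifier`.
import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between two feature batches x and y."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then a Gaussian (RBF) kernel.
        d2 = torch.cdist(a, b, p=2).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def itdm_step(feature_extractor, classifier, optimizer, batch_a, batch_b, lam=0.1):
    """One SGD iteration: task loss on batch_a plus MMD between the feature
    distributions of two different mini-batches (batch_a and batch_b)."""
    xa, ya = batch_a
    xb, _ = batch_b
    za = feature_extractor(xa)          # features of the current mini-batch
    zb = feature_extractor(xb)          # features of another mini-batch
    loss_task = F.cross_entropy(classifier(za), ya)
    loss_mmd = gaussian_mmd(za, zb)     # match the two mini-batch feature distributions
    loss = loss_task + lam * loss_mmd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the MMD term is kernel-based, this sketch matches distribution moments implicitly and does not require any parametric assumption about the latent feature distribution, which mirrors the property highlighted in the abstract.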