Paper Title


Near-Optimal Sparse Allreduce for Distributed Deep Learning

Paper Authors

Shigang Li, Torsten Hoefler

Paper Abstract


Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
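To illustrate the threshold-based top-$k$ selection mentioned in the abstract, below is a minimal NumPy sketch of selecting (approximately) the top-$k$ gradient values via an estimated magnitude threshold. This is not the authors' O$k$-Top$k$ implementation: the function names (`estimate_threshold`, `sparsify`), the sampling-based quantile estimate, and all parameters are illustrative assumptions.

```python
# Illustrative sketch (not the paper's algorithm): estimate a magnitude
# threshold from a random sample of the gradient so that roughly k entries
# are kept, avoiding a full sort of the gradient on every iteration.
import numpy as np

def estimate_threshold(grad, k, sample_size=10_000, rng=None):
    """Estimate a magnitude threshold that keeps roughly k entries of grad."""
    rng = rng if rng is not None else np.random.default_rng(0)
    flat = np.abs(grad.ravel())
    n = flat.size
    sample = rng.choice(flat, size=min(sample_size, n), replace=False)
    # The (1 - k/n) quantile of the sampled magnitudes approximates
    # the true top-k cutoff.
    keep_frac = min(k / n, 1.0)
    return np.quantile(sample, 1.0 - keep_frac)

def sparsify(grad, k):
    """Return (indices, values) of gradient entries whose magnitude exceeds
    the estimated threshold, i.e. approximately the top-k entries."""
    flat = grad.ravel()
    thr = estimate_threshold(flat, k)
    idx = np.nonzero(np.abs(flat) > thr)[0]
    return idx, flat[idx]

if __name__ == "__main__":
    g = np.random.default_rng(1).standard_normal(1_000_000).astype(np.float32)
    idx, vals = sparsify(g, k=1000)
    print(f"kept {idx.size} of {g.size} gradient entries")
```

The sketch only shows the local selection step; the actual scheme additionally exchanges the selected gradients across workers via its sparse allreduce (less than 6$k$ communication volume), which is omitted here.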
