Paper Title

A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to Balance Communication Overhead, Computational Complexity, and Convergence Rate

Authors

Naeimeh Omidvar, Mohammad Ali Maddah-Ali, Hamed Mahdavi

Abstract

In this paper, we propose a distributed stochastic gradient descent (SGD) method with low communication load and computational complexity that still converges fast. To reduce the communication load, at each iteration of the algorithm the worker nodes compute and communicate a few scalars, which are the directional derivatives of the sample functions along some \emph{pre-shared directions}. To maintain accuracy, after every fixed number of iterations they communicate the full stochastic gradient vectors. To reduce the computational complexity of each iteration, the worker nodes approximate the directional derivatives with zeroth-order stochastic gradient estimates, performing just two function evaluations rather than computing a first-order gradient vector. The proposed method substantially improves the convergence rate of zeroth-order methods, guaranteeing order-wise faster convergence. Moreover, compared to the well-known communication-efficient methods based on model averaging (which perform local model updates and periodic communication of the gradients to synchronize the local models), we prove that for the general class of non-convex stochastic problems and with a reasonable choice of parameters, the proposed method guarantees the same orders of communication load and convergence rate, while having order-wise lower computational complexity. Experimental results on various learning problems in neural network applications demonstrate the effectiveness of the proposed approach compared to various state-of-the-art distributed SGD methods.
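The abstract describes a two-mode update: in most iterations each worker sends only a scalar zeroth-order estimate of the directional derivative along a pre-shared direction (two function evaluations), and every fixed number of iterations it sends a full stochastic gradient vector. The Python sketch below illustrates this hybrid scheme on a toy quadratic objective; the names and parameters (`K`, `mu`, `zeroth_order_directional_derivative`, the loss, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sample_loss(w, xi):
    """Toy sample loss f(w; xi): a quadratic whose minimizer depends on the sample xi."""
    return 0.5 * np.sum((w - xi) ** 2)

def sample_grad(w, xi):
    """Stochastic (first-order) gradient of the toy sample loss."""
    return w - xi

def zeroth_order_directional_derivative(w, xi, u, mu=1e-4):
    """Two-function-evaluation estimate of the directional derivative along u:
    (f(w + mu*u; xi) - f(w; xi)) / mu  ~  <grad f(w; xi), u>."""
    return (sample_loss(w + mu * u, xi) - sample_loss(w, xi)) / mu

def hybrid_order_sgd(dim=10, num_workers=4, T=200, K=10, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)                            # model, kept synchronized at the server
    data = rng.normal(size=(num_workers, dim))   # one "sample" per worker, for simplicity

    for t in range(T):
        if t % K == 0:
            # First-order round: each worker communicates its full stochastic gradient vector.
            grads = [sample_grad(w, data[m]) for m in range(num_workers)]
            update = np.mean(grads, axis=0)
        else:
            # Zeroth-order round: a pre-shared random direction (e.g., generated from a
            # common seed on all nodes); each worker communicates only one scalar.
            u = rng.standard_normal(dim)
            u /= np.linalg.norm(u)
            scalars = [zeroth_order_directional_derivative(w, data[m], u)
                       for m in range(num_workers)]
            update = np.mean(scalars) * u        # gradient estimate along the shared direction
        w -= lr * update
    return w

if __name__ == "__main__":
    w_final = hybrid_order_sgd()
    print("final model (close to the mean of the per-worker samples):", w_final[:3])
```

In this sketch, the per-iteration communication is one scalar per worker in zeroth-order rounds and a length-`dim` vector only every `K` iterations, which is the communication/computation trade-off the abstract refers to.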
