Paper Title
Non-Convergence and Limit Cycles in the Adam optimizer
Paper Authors
Paper Abstract
One of the most popular training algorithms for deep neural networks is the Adaptive Moment Estimation (Adam) method introduced by Kingma and Ba. Despite its success in many applications, there is no satisfactory convergence analysis: only local convergence can be shown for batch mode under some restrictions on the hyperparameters, while counterexamples exist for incremental mode. Recent results show that, for simple quadratic objective functions, limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters and only for the algorithm without bias correction. We extend the convergence analysis of Adam in batch mode with bias correction and show that limit cycles of period 2 exist for all choices of the hyperparameters, even for quadratic objective functions, the simplest case of convex functions. We analyze the stability of these limit cycles and relate our analysis to other results in which approximate convergence was shown, but under the additional assumption of bounded gradients, which does not hold for quadratic functions. Due to the complexity of the equations, the investigation relies heavily on the use of computer algebra.
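To make the setting concrete, the following is a minimal sketch (not taken from the paper) of batch-mode Adam with bias correction, as defined by Kingma and Ba, applied to the one-dimensional quadratic objective f(x) = x^2 / 2. The function name, hyperparameter values, and starting point are illustrative assumptions; whether the iterates approach the minimizer or settle into the kind of period-2 limit cycle described in the abstract depends on these choices.

```python
# Minimal sketch: standard Adam update with bias correction on f(x) = x^2 / 2.
# Hyperparameter values and the starting point x0 are illustrative assumptions.

def adam_on_quadratic(x0=1.0, alpha=0.1, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=10000):
    x, m, v = x0, 0.0, 0.0
    trajectory = []
    for t in range(1, steps + 1):
        g = x                              # gradient of f(x) = x^2 / 2
        m = beta1 * m + (1 - beta1) * g    # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
        x = x - alpha * m_hat / (v_hat ** 0.5 + eps)
        trajectory.append(x)
    return trajectory

if __name__ == "__main__":
    traj = adam_on_quadratic()
    # If the iterates approach a limit cycle of period 2, the last values
    # alternate between two points instead of converging to the minimizer 0.
    print(traj[-6:])
```

Inspecting the tail of the trajectory for different hyperparameter settings is one simple way to observe the oscillatory behavior the abstract refers to, although the paper's actual analysis is analytic and carried out with computer algebra.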