Paper Title
Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?
Paper Authors
Paper Abstract
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e., matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking the smallest-magnitude weights, rewinding back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed? We develop answers in terms of the geometry of the error landscape. First, we find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune. Overall, these results make progress toward demystifying the existence of winning tickets by revealing the fundamental role of error landscape geometry.
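The IMP loop described in the abstract (train, prune the smallest-magnitude surviving weights, rewind, repeat) can be sketched on a toy linear-regression problem. This is a minimal illustration, not the paper's implementation: the model, training routine, and the `prune_frac`/`rounds` parameters are assumptions chosen for clarity, and rewinding is to the initialization rather than to an early-training checkpoint.

```python
import numpy as np

def train(w, mask, X, y, lr=0.1, steps=200):
    """Gradient descent on mean squared error, keeping pruned weights at zero."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ (w * mask) - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def imp(w_init, X, y, prune_frac=0.3, rounds=4):
    """Toy iterative magnitude pruning with rewinding to w_init."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # 1. Train the masked network to the end of training.
        w_final = train(w_init.copy(), mask, X, y)
        # 2. Prune a fraction of the smallest-magnitude surviving weights.
        alive = np.flatnonzero(mask)
        k = max(1, int(prune_frac * alive.size))
        drop = alive[np.argsort(np.abs(w_final[alive]))[:k]]
        mask[drop] = 0.0
        # 3. Rewind: the next round retrains from w_init with the new mask.
    return mask, train(w_init.copy(), mask, X, y)

# Usage: a noiseless problem whose true weights are already sparse,
# so a matching sparse subnetwork exists by construction.
rng = np.random.default_rng(0)
d = 10
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(200, d))
y = X @ w_true
w0 = rng.normal(scale=0.1, size=d)
mask, w = imp(w0, X, y)
```

On this toy problem, training recovers the sparse ground truth, so the magnitude criterion prunes exactly the coordinates that are zero in `w_true`, and the final ticket matches the dense solution.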