Paper Title
Vertical Layering of Quantized Neural Networks for Heterogeneous Inference
Paper Authors
Paper Abstract
Although considerable progress has been made in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices, as one dedicated model needs to be trained, transmitted, and stored for each specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism that allows us to obtain multiple quantized networks from one full-precision source model by progressively mapping the higher-precision weights to their adjacent lower-precision counterparts. Then, with networks of different bit-widths derived from one source model, multi-objective optimization is employed to train the shared source model weights so that they are updated simultaneously, taking the performance of all networks into account. In this way, the shared weights are optimized to balance the performance of the different quantized models, making the weights transferable across bit-widths. Experiments show that the proposed vertical-layered representation and the developed once QAT scheme effectively embody multiple quantized networks in a single model, allow one-time training, and deliver performance comparable to that of quantized models tailored to any specific bit-width. Code will be available.
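To illustrate the cascade downsampling idea described above, the following is a minimal sketch (not the authors' code): a full-precision weight tensor is quantized once at the highest bit-width, and each lower-precision version is derived from its adjacent higher-precision one by dropping the least significant bits, so all quantized networks are embodied in one set of shared weights. The function names, bit-width list, and the symmetric uniform quantizer are illustrative assumptions.

```python
# Sketch of cascade downsampling for a vertical-layered weight representation.
# Assumptions: symmetric uniform quantization and LSB-dropping between adjacent
# bit-widths; the paper's exact quantizer and training code may differ.
import numpy as np

def quantize_uniform(w, bits):
    """Quantize w to signed integers with `bits` bits using a symmetric uniform quantizer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def cascade_downsample(w_fp, bit_widths=(8, 6, 4, 2)):
    """Return {bits: (int_weights, scale)}, where each lower-precision tensor
    is obtained from its adjacent higher-precision one by an arithmetic right shift."""
    bit_widths = sorted(bit_widths, reverse=True)
    q, scale = quantize_uniform(w_fp, bit_widths[0])
    out = {bit_widths[0]: (q, scale)}
    for prev_b, b in zip(bit_widths[:-1], bit_widths[1:]):
        shift = prev_b - b
        q = q >> shift                # drop the least significant bits
        scale = scale * (2 ** shift)  # keep the represented value range consistent
        out[b] = (q, scale)
    return out

# Usage: one full-precision source tensor yields all quantized versions at once.
w = np.random.randn(64, 64).astype(np.float32)
for bits, (q, s) in cascade_downsample(w).items():
    print(bits, "bits, max abs error:", np.max(np.abs(w - q * s)))
```

During the once QAT stage described in the abstract, one would evaluate the task loss of the network at each bit-width and combine these objectives (e.g., by summing them) to update the shared full-precision weights, so that a single training run balances all precisions.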