Paper Title
Biologically Plausible Variational Policy Gradient with Spiking Recurrent Winner-Take-All Networks
Paper Authors
Paper Abstract
One stream of reinforcement learning research explores biologically plausible models and algorithms to simulate biological intelligence and to fit neuromorphic hardware. Among these, reward-modulated spike-timing-dependent plasticity (R-STDP) is a recent branch with good potential for energy efficiency. However, current R-STDP methods rely on heuristic designs of local learning rules and thus require task-specific expert knowledge. In this paper, we consider a spiking recurrent winner-take-all network and propose a new R-STDP method, spiking variational policy gradient (SVPG), whose local learning rules are derived from the global policy gradient, eliminating the need for heuristic designs. In experiments on MNIST classification and the Gym InvertedPendulum task, our SVPG achieves good training performance and exhibits better robustness to various kinds of noise than conventional methods.
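To make the R-STDP idea concrete, the following is a minimal illustrative sketch of a *generic* reward-modulated STDP weight update: an eligibility trace is accumulated from pre/post spike-time differences, then scaled by a reward signal. This is not the paper's SVPG derivation; all function names, constants, and the pairwise-trace formulation are assumptions for illustration only.

```python
import math

# Illustrative generic R-STDP sketch (not the paper's SVPG rule).
# Time constants and amplitudes below are arbitrary example values.

def stdp_trace(pre_spikes, post_spikes, a_plus=1.0, a_minus=1.0,
               tau_plus=20.0, tau_minus=20.0):
    """Eligibility trace from pairwise pre/post spike-time differences (ms)."""
    trace = 0.0
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            dt = t_post - t_pre
            if dt > 0:      # pre fires before post -> potentiation term
                trace += a_plus * math.exp(-dt / tau_plus)
            elif dt < 0:    # post fires before pre -> depression term
                trace -= a_minus * math.exp(dt / tau_minus)
    return trace

def r_stdp_update(w, pre_spikes, post_spikes, reward, baseline=0.0, lr=0.01):
    """Weight change = lr * (reward - baseline) * eligibility trace."""
    return w + lr * (reward - baseline) * stdp_trace(pre_spikes, post_spikes)

# Example: a causal pre->post pairing plus a positive reward strengthens w.
w = 0.5
w_new = r_stdp_update(w, pre_spikes=[10.0], post_spikes=[15.0], reward=1.0)
```

The key contrast drawn in the abstract is that in conventional R-STDP the shape of the trace and the modulation rule are hand-designed, whereas SVPG derives the local update from a global policy-gradient objective.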