Paper Title
Supervised Learning: No Loss No Cry
Paper Authors
Paper Abstract
Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these theories offer a panoply of different choices. In practice, this choice is typically made in an \emph{ad hoc} manner. In hopes of making this procedure more principled, the problem of \emph{learning the loss function} for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature. In this paper, we revisit the {\sc SLIsotron} algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast {\sc SLIsotron} as learning a loss from a family of composite square losses. By interpreting this through the lens of \emph{proper losses}, we derive a generalisation of {\sc SLIsotron} based on Bregman divergences. The resulting {\sc BregmanTron} algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the {\sc BregmanTron} substantially outperforms the {\sc SLIsotron}, and that the loss it learns can be minimised by other algorithms for different tasks, thereby opening the interesting problem of \textit{loss transfer} between domains.
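To make the passage from composite square losses to Bregman divergences concrete, the display below recalls the standard definition of a Bregman divergence and how the square loss arises as its simplest instance; the generator symbol $\phi$ and this presentation are illustrative and not the paper's own notation.

% Bregman divergence generated by a differentiable convex function \phi
% (standard definition; symbols are illustrative, not the paper's notation)
\[
  D_{\phi}(y \,\|\, \hat{y}) \;=\; \phi(y) - \phi(\hat{y}) - (y - \hat{y})\,\phi'(\hat{y}).
\]
% Fixing the generator to \phi(t) = t^2 recovers the square loss,
% the case underlying the composite square losses that {\sc SLIsotron} is cast as learning from:
\[
  \phi(t) = t^{2} \quad\Longrightarrow\quad D_{\phi}(y \,\|\, \hat{y}) = (y - \hat{y})^{2}.
\]

In this reading, allowing the generator itself to be learned, rather than fixed to $t \mapsto t^{2}$, is what turns the fixed square loss into a learned loss, which is the sense in which the abstract describes the {\sc BregmanTron} as jointly learning the loss along with the classifier.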