Automatic Machine Learning Derived from Scholarly Big Data

Paper Title

Automatic Machine Learning Derived from Scholarly Big Data

Authors

Asnat Greenstein-Messica, Roman Vainshtein, Gilad Katz, Bracha Shapira, Lior Rokach

Abstract

One of the challenging aspects of applying machine learning is the need to identify the algorithms that will perform best for a given dataset. This process can be difficult and time-consuming, and it often requires a great deal of domain knowledge. We present Sommelier, an expert system for recommending the machine learning algorithms that should be applied to a previously unseen dataset. Sommelier is based on word embedding representations of the domain knowledge extracted from a large corpus of academic publications. When presented with a new dataset and its problem description, Sommelier leverages a recommendation model trained on the word embedding representation to provide a ranked list of the most relevant algorithms to be used on the dataset. We demonstrate Sommelier's effectiveness by conducting an extensive evaluation on 121 publicly available datasets and 53 classification algorithms. The top algorithms recommended for each dataset by Sommelier were able to achieve, on average, 97.7% of the optimal accuracy of all surveyed algorithms.
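
To make the closing figure concrete, the following is a minimal sketch of how a "fraction of optimal accuracy" score could be computed. It assumes the score is the ratio between the accuracy of the single top-recommended algorithm and the best accuracy among all surveyed algorithms, taken per dataset and then averaged over datasets. The abstract does not spell out the exact formula, so this reading is an assumption, and the function name, dataset names, algorithm names, and numbers below are hypothetical and not taken from the paper.

```python
from statistics import mean


def mean_relative_accuracy(accuracies, recommended):
    """Average, over datasets, of (recommended accuracy / best accuracy).

    accuracies:  {dataset: {algorithm: accuracy}}  (hypothetical layout)
    recommended: {dataset: algorithm}              (top recommendation per dataset)
    """
    ratios = []
    for dataset, per_algorithm in accuracies.items():
        best = max(per_algorithm.values())           # best accuracy any algorithm reached
        chosen = per_algorithm[recommended[dataset]]  # accuracy of the recommended one
        ratios.append(chosen / best)
    return mean(ratios)


# Toy numbers, invented purely for illustration.
toy_accuracies = {
    "dataset_a": {"random_forest": 0.90, "svm": 0.80},
    "dataset_b": {"random_forest": 0.50, "svm": 0.75},
}
toy_recommended = {"dataset_a": "random_forest", "dataset_b": "random_forest"}

score = mean_relative_accuracy(toy_accuracies, toy_recommended)
print(f"mean relative accuracy: {score:.3f}")
```

Under this reading, a score of 1.0 would mean the recommended algorithm matched the best surveyed algorithm on every dataset, and lower values reflect how much accuracy is given up by following the recommendation.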