论文标题
表征分类和回归问题中实例硬度
Characterizing instance hardness in classification and regression problems
论文作者
论文摘要
机器学习(ML)文献中最近的一些作品证明了评估哪些观察结果很难准确预测其标签的有用性。通过确定此类情况,可以检查他们是否有任何质量问题应解决。还可以设计基于观察的难度水平的学习策略。本文提出了一组元功能,旨在表征哪些数据集的实例最难准确地预测其标签,以及为什么它们是这样的,也就是实例硬度措施。都考虑了分类和回归问题。建立和分析具有不同级别复杂性的合成数据集。还提供了包含所有实现的Python软件包。
Some recent pieces of work in the Machine Learning (ML) literature have demonstrated the usefulness of assessing which observations are hardest to have their label predicted accurately. By identifying such instances, one may inspect whether they have any quality issues that should be addressed. Learning strategies based on the difficulty level of the observations can also be devised. This paper presents a set of meta-features that aim at characterizing which instances of a dataset are hardest to have their label predicted accurately and why they are so, aka instance hardness measures. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed. A Python package containing all implementations is also provided.