Dataset Cleaning -- A Cross Validation Methodology for Large Facial Datasets using Face Recognition

Paper Title

Dataset Cleaning -- A Cross Validation Methodology for Large Facial Datasets using Face Recognition

Authors

Viktor Varkarakis, Peter Corcoran

Abstract

In recent years, large "in the wild" face datasets have been released to facilitate progress in face detection, face recognition, and related tasks. Most of these datasets are acquired from webpages through automatic procedures, and as a consequence they often contain noisy data. Furthermore, the identity annotations in these large face datasets are important because they are used to train face recognition algorithms. However, because the datasets are gathered automatically and are very large, many identity folders contain mislabeled samples, which degrades the quality of the datasets. This work presents a semi-automatic method for cleaning noisy large face datasets using face recognition. The methodology is applied to the CelebA dataset to demonstrate its effectiveness. Furthermore, a list of the mislabeled samples in the CelebA dataset is made available.
