Paper Title
Gender Prediction Based on Vietnamese Names with Machine Learning Techniques
Paper Authors
Abstract
Because biological gender is one of the attributes used to describe an individual, much work has been done on gender classification based on people's names. Many approaches have been proposed for English and Chinese, but few for Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. It comprises over 26,000 full names annotated with gender and is available on our website for research purposes. In addition, this paper describes six machine learning algorithms (Support Vector Machine, Multinomial Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest, and Logistic Regression) and a deep learning model (LSTM) with fastText word embeddings for gender prediction on Vietnamese names. Using this dataset, we investigate the impact of each name component on gender detection. The best F1-score we achieve is up to 96%, obtained with the LSTM model, and we provide a web API built on the trained model.
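To make the idea of per-component features concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common Vietnamese name order (family name first, given name last, everything in between treated as the middle name) and trains a small from-scratch Multinomial Naive Bayes classifier on four toy names. The helper names `split_name`, `features`, and `TinyNaiveBayes`, the `use` parameter, and the toy training data are all hypothetical and are not taken from the paper or its dataset.

```python
import math
from collections import Counter, defaultdict


def split_name(full_name):
    """Split a full name into (family, middle, given) components.

    Assumption: family name comes first, given name last, and all
    tokens in between form the middle name.
    """
    tokens = full_name.lower().split()
    if not tokens:
        return "", "", ""
    if len(tokens) == 1:
        return "", "", tokens[0]
    return tokens[0], " ".join(tokens[1:-1]), tokens[-1]


def features(full_name, use=("family", "middle", "given")):
    """Turn the selected name components into string features."""
    family, middle, given = split_name(full_name)
    parts = {"family": family, "middle": middle, "given": given}
    return [f"{key}={parts[key]}" for key in use if parts[key]]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, use=("family", "middle", "given")):
        self.use = use  # which name components to use as features

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for name, label in zip(names, labels):
            self.feat_counts[label].update(features(name, self.use))
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        # +1 reserves one smoothing slot for features never seen in training.
        vocab_size = len(self.vocab) + 1
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n / total)  # log prior
            for feat in features(name, self.use):
                score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written examples for illustration only (not from the dataset).
train_names = ["Nguyễn Văn An", "Trần Văn Bình", "Lê Thị Hoa", "Phạm Thị Lan"]
train_labels = ["M", "M", "F", "F"]

model = TinyNaiveBayes().fit(train_names, train_labels)
print(model.predict("Hoàng Thị Mai"))
print(model.predict("Đỗ Văn Nam"))
```

Passing a different `use` tuple, for example `use=("given",)`, restricts the classifier to a single component, which is one simple way to compare how much each part of the name contributes to the prediction.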