Paper Title
Predicting Winners of the Reality TV Dating Show $\textit{The Bachelor}$ Using Machine Learning Algorithms
Authors
Abstract
$\textit{The Bachelor}$ is a reality TV dating show in which a single bachelor selects his wife from a pool of approximately 30 female contestants over eight weeks of filming (American Broadcasting Company 2002). We collected the following data on all 422 contestants who participated in seasons 11 through 25: their age, hometown, career, race, the week they got their first 1-on-1 date, whether they received the First Impression Rose, and the "place" in which they finished. We then trained three machine learning models to predict the ideal characteristics of a successful contestant on $\textit{The Bachelor}$. The three algorithms we tested were random forest classification, neural networks, and linear regression. We found consistency across all three models, although the neural network performed best overall. Our models found that a woman has the highest probability of progressing far on $\textit{The Bachelor}$ if she is 26 years old, is white, is from the Northwest, works as a dancer, received a 1-on-1 date in week 6, and did not receive the First Impression Rose. Our methodology is broadly applicable to all romantic reality television, and our results will inform future $\textit{The Bachelor}$ production and contestant strategies. While our models were relatively successful, we still encountered high misclassification rates. This may be because (1) our training dataset had fewer than 400 points or (2) our models were too simple to parameterize the complex romantic connections contestants forge over the course of a season.
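The features listed in the abstract mix numeric values (age, week of first 1-on-1 date) with categorical ones (hometown, career, race), so any of the three model families would need them converted to numbers first. The following is a minimal sketch of one common way to do that, one-hot encoding, and is not the authors' actual preprocessing. The `Contestant` record layout, the field names, the convention that week 0 means "no 1-on-1 date", and the two toy records are all assumptions made for illustration; none of them come from the paper's dataset.

```python
from dataclasses import dataclass


# Hypothetical record layout: field names mirror the features named in the
# abstract, but the real dataset schema is not given in the source.
@dataclass
class Contestant:
    age: int
    hometown_region: str
    career: str
    race: str
    first_one_on_one_week: int  # 0 = never had a 1-on-1 (assumed convention)
    first_impression_rose: bool
    place: int  # final placement, 1 = winner


def build_vocab(values):
    """Sorted unique values, so the column order is deterministic."""
    return sorted(set(values))


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1.0 at the position of `value`."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode(contestants):
    """Turn records into (numeric feature rows, placement targets)."""
    regions = build_vocab(c.hometown_region for c in contestants)
    careers = build_vocab(c.career for c in contestants)
    races = build_vocab(c.race for c in contestants)
    rows, targets = [], []
    for c in contestants:
        # Numeric and boolean features first, then the one-hot groups.
        row = [
            float(c.age),
            float(c.first_one_on_one_week),
            1.0 if c.first_impression_rose else 0.0,
        ]
        row += one_hot(c.hometown_region, regions)
        row += one_hot(c.career, careers)
        row += one_hot(c.race, races)
        rows.append(row)
        targets.append(c.place)
    return rows, targets


# Made-up illustrative records, NOT taken from the paper's dataset.
toy = [
    Contestant(26, "Northwest", "dancer", "white", 6, False, 2),
    Contestant(31, "South", "nurse", "asian", 0, True, 15),
]
X, y = encode(toy)
```

The resulting `X` rows and `y` targets are plain Python lists that could be passed to a random forest, a neural network, or a linear regression in whichever library is used. With only a few hundred contestants, one-hot encoding a high-cardinality field such as career produces many sparse columns, which may be one practical contributor to the small-data limitation the abstract mentions.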