InForecaster: Forecasting Influenza Hemagglutinin Mutations Through the Lens of Anomaly Detection

Paper Title

InForecaster: Forecasting Influenza Hemagglutinin Mutations Through the Lens of Anomaly Detection

Authors

Garjani, Ali; Chegini, Atoosa Malemir; Salehi, Mohammadreza; Tabibzadeh, Alireza; Yousefi, Parastoo; Razizadeh, Mohammad Hossein; Esghaei, Moein; Esghaei, Maryam; Rohban, Mohammad Hossein

Abstract

Influenza virus hemagglutinin plays an important role in the attachment of the virus to host cells. The hemagglutinin proteins are one of the genetic regions of the virus with a high potential for mutation. Because predicting mutations is important for producing effective, low-cost vaccines, approaches to this problem have recently gained significant attention. Such approaches train predictive models on historical records of mutations. However, the imbalance between mutated and preserved proteins is a major challenge that the development of such models must address. Here, we propose to tackle this challenge through anomaly detection (AD). AD is a well-established field in machine learning (ML) that tries to distinguish unseen anomalies from normal patterns using only normal training samples. By treating mutations as anomalous behavior, we can benefit from the rich set of solutions that have recently emerged in this field. Such methods also fit the problem setup, in which the number of unmutated training samples far exceeds the number of mutated ones. Motivated by this formulation, our method tries to find a compact representation for unmutated samples while forcing anomalies to be separated from the normal ones. This helps the model learn, as far as possible, a representation shared among normal training samples, which makes mutated samples easier to discern and detect among unmutated ones at test time. We conduct a large number of experiments on four publicly available datasets, consisting of three different hemagglutinin protein datasets and one SARS-CoV-2 dataset, and show the effectiveness of our method using different standard criteria.
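
The idea of a compact representation for unmutated samples with anomalies pushed away can be illustrated with a generic center-based objective. The sketch below is only an illustration under stated assumptions, not the authors' model: the abstract does not specify the loss, the encoder, or the features, so the function names, the hinge-style separation term, the `margin` parameter, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def compact_separate_loss(normal, anomalous, center, margin=1.0):
    """Hypothetical objective: pull normal embeddings toward `center`
    and penalize anomalous embeddings that lie closer than `margin`."""
    pull = sum(distance(z, center) ** 2 for z in normal) / len(normal)
    push = 0.0
    if anomalous:
        push = sum(
            max(0.0, margin - distance(z, center)) ** 2 for z in anomalous
        ) / len(anomalous)
    return pull + push


def anomaly_score(z, center):
    """Distance to the center of normal samples; larger means more anomalous."""
    return distance(z, center)


# Toy embeddings (assumed for illustration; real inputs would be
# learned representations of protein sequences).
normal = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [0.0, -0.1]]
anomalous = [[2.0, 2.0]]

center = centroid(normal)
loss = compact_separate_loss(normal, anomalous, center, margin=1.0)
scores = [anomaly_score(z, center) for z in normal + anomalous]
```

In this toy setup the anomalous point already lies outside the margin, so only the compactness term contributes to the loss, and the anomalous point receives the highest score. At test time, a sample would be flagged as a likely mutation when its score exceeds a threshold chosen on held-out data.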
