Paper title
Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions
Paper authors
Paper abstract
Computer-aided drug discovery is an essential component of modern drug development. Therein, deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. In this work we argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance, its ability to compromise over different decision thresholds, certain freedom to influence the relative weights in this compromise, fidelity to typical benchmarking measures, and equivalence to positive/unlabeled learning. We also propose new training schemes (coherent mini-batch arrangement, and usage of out-of-batch samples) for cost functions based on the ROC, as well as a cost function based on the logAUC metric that facilitates early enrichment (i.e. improves performance at high decision thresholds, as often desired when synthesizing predicted hit compounds). We demonstrate that these approaches outperform standard deep learning approaches on a series of PubChem high-throughput screening datasets that represent realistic and diverse drug discovery campaigns on major drug target families.
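To make the idea of an ROC-based cost function concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `exact_auc` and `soft_auc_loss`, the sigmoid relaxation, and the `temperature` parameter are assumptions chosen for illustration. The sketch relies on the standard identity that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive receives the higher score, and replaces the non-differentiable comparison with a sigmoid so that the quantity could serve as a training loss.

```python
import numpy as np


def exact_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Pairwise score differences, shape (n_pos, n_neg).
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def soft_auc_loss(scores, labels, temperature=1.0):
    """Hypothetical smooth surrogate: 1 - mean(sigmoid((s_pos - s_neg) / T)).

    As the temperature T shrinks, this approaches 1 - AUC (for untied scores).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = (pos[:, None] - neg[None, :]) / temperature
    return float(1.0 - np.mean(1.0 / (1.0 + np.exp(-diff))))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 0, 1, 0]
    print("AUC:", exact_auc(scores, labels))
    # Duplicating every negative changes the class ratio but not the AUC.
    print("AUC, negatives doubled:",
          exact_auc(scores + [0.8, 0.1], labels + [0, 0]))
    print("soft loss:", soft_auc_loss(scores, labels, temperature=0.1))
```

Because the pairwise average is normalized by the number of positive-negative pairs, duplicating the negatives leaves the value unchanged, which is one way to read the abstract's claim of robustness to class imbalance. The logAUC-based cost function and the mini-batch schemes mentioned in the abstract are not reproduced here, since the abstract does not define them.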