Uncertainty quantification for predictions of atomistic neural networks

Paper title

Uncertainty quantification for predictions of atomistic neural networks

Authors

Vazquez-Salazar, Luis Itza; Boittier, Eric D.; Meuwly, M.

Abstract

This work quantitatively explores the value of uncertainty quantification for the predictions of neural networks (NNs) trained on quantum chemical reference data. For this, the architecture of the PhysNet NN was suitably modified, and the resulting model was evaluated with different metrics to quantify calibration, the quality of predictions, and whether the prediction error and the predicted uncertainty can be correlated. Results from training on the QM9 database and evaluating test data from within and outside the training distribution indicate that error and uncertainty are not linearly related. The results also clarify that noise and redundancy complicate property prediction for molecules even when the changes, e.g. double bond migration in two otherwise identical molecules, are small. The model was then applied to a real database of tautomerization reactions. Analysis of the distance between members in feature space, combined with other parameters, shows that redundant information in the training dataset can lead to large variances and small errors, whereas the presence of similar but unspecific information returns large errors but small variances. This was observed, for example, for nitro-containing aliphatic chains, for which predictions were difficult although the training set contained several examples of nitro groups bound to aromatic molecules. This underlines the importance of the composition of the training data and provides chemical insight into how it affects the prediction capabilities of an ML model. Finally, the approach put forward can be used for information-based improvement of chemical databases for target applications through active learning optimization.
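The kind of check the abstract describes, asking whether prediction error and predicted uncertainty are linearly related and whether the uncertainties are calibrated, can be illustrated with a minimal sketch. This is not the authors' code or their metrics: the helper names `pearson_r` and `coverage` are hypothetical, the data are synthetic draws from a Gaussian whose width equals the "predicted" standard deviation, and only the Python standard library is used.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def coverage(errors, sigmas, k=1.0):
    """Fraction of samples whose absolute error lies within k predicted sigmas."""
    inside = sum(1 for e, s in zip(errors, sigmas) if abs(e) <= k * s)
    return inside / len(errors)


# Synthetic, illustrative data only (assumption: NOT taken from the paper).
# Each "predicted" sigma is used to draw the corresponding error, so the
# uncertainties are calibrated by construction.
rng = random.Random(0)
sigmas = [rng.uniform(0.1, 1.0) for _ in range(5000)]
errors = [rng.gauss(0.0, s) for s in sigmas]
abs_errors = [abs(e) for e in errors]

r = pearson_r(abs_errors, sigmas)
cov1 = coverage(errors, sigmas, k=1.0)
print(f"Pearson r(|error|, sigma) = {r:.2f}")
print(f"1-sigma coverage          = {cov1:.2f}")
```

In this toy setting the one-sigma coverage sits near the Gaussian expectation of about 0.68, yet the per-sample correlation between absolute error and predicted sigma stays well below 1. A weak linear relation between error and uncertainty therefore does not by itself indicate poor calibration, which is one reason to report several metrics side by side.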
