Paper Title

Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters

Author

Stringham, Thomas

Abstract

Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 US Census for which expert-labelled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modelling of comparison data, allowing disagreement probability parameters conditional on non-match status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
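
The abstract's last claim, that true positive rate and positive predictive value can be estimated without known match status, can be illustrated with a generic plug-in estimator. The sketch below is not the paper's procedure, which the abstract does not specify. It assumes that a fitted model yields one calibrated posterior match probability per candidate link and that links at or above a threshold are declared matches. The function name `estimate_tpr_ppv`, the threshold, and the example probabilities are all hypothetical.

```python
from typing import Sequence, Tuple


def estimate_tpr_ppv(
    match_probs: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Plug-in estimates of TPR and PPV from posterior match probabilities.

    Assumption (not from the paper): match_probs holds one calibrated
    posterior probability per candidate link, covering every link with
    non-negligible probability of being a true match.
    """
    # Links declared as matches under the chosen decision rule.
    declared = [p for p in match_probs if p >= threshold]

    # Expected number of true matches among all candidate links.
    expected_true_matches = sum(match_probs)
    # Expected number of true matches among the declared links.
    expected_true_positives = sum(declared)

    nan = float("nan")
    ppv = expected_true_positives / len(declared) if declared else nan
    tpr = (
        expected_true_positives / expected_true_matches
        if expected_true_matches > 0
        else nan
    )
    return tpr, ppv


# Hypothetical posterior probabilities for four candidate links.
probs = [0.9, 0.8, 0.3, 0.1]
tpr, ppv = estimate_tpr_ppv(probs, threshold=0.5)
print(f"estimated TPR = {tpr:.3f}, estimated PPV = {ppv:.3f}")
```

Both estimates are only as reliable as the calibration of the posterior probabilities, so they are better read as model-based expectations than as replacements for expert-labelled validation data.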
