RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media

Paper Title

RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media

Authors

Somin Wadhwa, Vivek Khetan, Silvio Amir, Byron Wallace

Abstract

We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions. Annotations include demarcations of spans corresponding to medical claims, personal experiences, and questions. We collect additional granular annotations on identified claims. Specifically, we mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within these. Using this corpus, we introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We propose a new method to automatically derive (noisy) supervision for this task, which we use to train a dense retrieval model; this outperforms baseline models. Manual evaluation of retrieval results performed by medical doctors indicates that while our system performance is promising, there is considerable room for improvement. Collected annotations (and scripts to assemble the dataset) are available at https://github.com/sominw/redhot.
